We ran 14 of the world's newest AI models on 12 real university lectures (hard STEM and dense humanities) and scored every set of notes against hand-built answer keys. From free models to 15¢-a-lecture flagships, every single one passed our hardest accuracy test. So we picked based on what actually sets them apart. Here's the whole story.
Notes matter most when the material is brand new, and that's exactly when a wrong fact is hardest to catch. Nobody can spot an error in something they're learning for the first time; that's just what learning is. A confidently-wrong note gets studied and trusted, which makes it worse than a note that's thin. So our number-one rule, before anything else, is: never make something up.
That's why our testing isn't about which AI sounds smartest. It's about which one we can trust not to invent a number, misattribute a quote, or botch a calculation; then, among the trustworthy ones, which keeps the most useful detail.
All three produce the same kind of notes; they differ in setup, privacy, and ceiling.
The option LectureSync ships with. Runs on your Mac using Apple's built-in models. Zero setup.
Run a custom open model on your own Mac with Ollama, LM Studio, or oMLX, and point LectureSync at it.
Use a hosted model through a provider like OpenRouter. The highest quality ceiling, for pennies per lecture.
We ran the exact note-taking instructions LectureSync uses on 12 real university lectures, a deliberately broad mix: MIT linear algebra, MIT computer science, MIT economics, plus Yale philosophy, biology, and history. Hard STEM and dense humanities.
Then we scored every set of notes automatically against a hand-built answer key, not by asking another AI for its opinion, but by checking whether specific facts, named figures, quoted lines, and worked calculations from each lecture actually made it into the notes.
The toughest test, our "fabrication gate": one linear-algebra lecture works out the inverse of a matrix. The right answer is a specific grid of numbers. A model that guesses it and gets it wrong has done the one thing we can't allow. Smaller, cheaper AI models are known to fail this. Every cloud model we tested here got it right.
All 14 were accurate. What separated them was detail, speed, and cost. Click a column to sort.
| Faithful | ||||||||
|---|---|---|---|---|---|---|---|---|
| MiniMax M3 app pick | MiniMax | May 31 ’26 | 0.65¢ | 89s | 0 errors | 100% | Value | |
| Claude Opus 4.8 | Anthropic | May 27 ’26 | 13.4¢ | 35s | 0 errors | 100% | SOTA | |
| GPT-5.5 | OpenAI | Apr 24 ’26 | 15.3¢ | 63s | 0 errors | 98% | SOTA | |
| MiMo v2.5 Proapp pick | Xiaomi | Apr 22 ’26 | 0.55¢ | 51s | 0 errors | 93% | Budget | |
| DeepSeek V4 Flashapp pick | DeepSeek | Apr 24 ’26 | 0.2¢ | 53s | 0 errors | 100% | Budget | |
| Claude Haiku 4.5 | Anthropic | Oct 15 ’25 | 2.08¢ | 22s | 0 errors | 100% | Mid | |
| Gemini 3.5 Flash | May 19 ’26 | 5.8¢ | 24s | 0 errors | 93% | Mid | ||
| MiMo v2.5app pick | Xiaomi | Apr 22 ’26 | 0.18¢ | 24s | 0 errors | 95% | Budget | |
| Owl Alpha | (stealth) | Apr 28 ’26 | Free | 45s | 0 errors | 95% | Free | |
| Nemotron 3 Superapp pick | NVIDIA | Mar 11 ’26 | 0.21¢ | 53s | 0 errors | 90% | Budget | |
| Grok 4.3 | xAI | Apr 30 ’26 | 1.45¢ | 9s | 0 errors | 90% | Mid | |
| DeepSeek V4 Pro | DeepSeek | Apr 24 ’26 | 0.96¢ | 77s | 0 errors | 98% | Budget | |
| Gemini 3.1 Flash Lite | May 7 ’26 | 0.34¢ | 5s | 0 errors | 88% | Lite | ||
| gpt-oss-120b | OpenAI | Aug 5 ’25 | Free | 42s | 0 errors | 80% | Free |
✓ Faithful: every model, no exceptions. That column being boring is the headline. Tinted rows marked app pick are the models we recommend inside LectureSync. Detail = % of worked examples & specific numbers that survived into the notes. Coverage = key points captured on unfamiliar (humanities) subjects. Cost = our measured spend per ~1-hour lecture.
Every dot is a model. The best dots sit at the cheap end, and the expensive flagships are no higher up.
x-axis is logarithmic. ⭐ = our value pick. Free models shown in the free band, left.
Two years ago, AI models routinely invented facts. Today, every frontier cloud model we tested (including completely free ones) got our hardest math test right, every time. That's the floor now. It's why we can offer this at all.
Our value pick (MiniMax M3, under a penny a lecture) tied the 15¢ flagships on detail and accuracy. A free model (Owl Alpha) beat several paid ones. Paying 20× more bought polish, not trustworthiness.
Since they're all accurate, we chose on depth: does the note keep the worked example, the exact number, the named theorist? The spread was real. The best kept ~93% of the specifics, the thinnest barely 60%. For a student studying for an exam, that detail is the whole point.
With accuracy a given, it came down to three things, in order: detail, cost, and reliability at scale. Here's how that shook out.
It tied the most expensive flagships on the planet for detail and accuracy, while costing under a penny per lecture. Nothing else matched that combination of richness and price.
Almost as detailed, even cheaper, and the cleanest at ignoring class-admin clutter. A great fallback.
A brand-new, no-cost model that out-performed several paid options. Proof that good note-taking no longer requires a big budget.
The flagships (GPT-5.5, Claude Opus 4.8): superb models, and on this task they were matched, on both trust and detail, by options costing a twentieth as much. If you already use one, it'll serve you beautifully; you simply don't need flagship prices for great lecture notes.
The "lite" and smallest free models (Gemini 3.1 Flash Lite, gpt-oss-120b): genuinely fast and cheap, but they dropped too many of the worked examples and specific numbers a student actually needs before an exam.
Style outliers: some models leaned wordy (a 10,000-word "summary"), others too skeletal. We favor the ones that hit the right level of detail without a fight.
Our top value pick, MiniMax M3, was four days old when we tested it. Most of the field shipped within the six weeks before our test (plus two 2025 veterans for reference). We re-run this evaluation as new models land; the leaderboard above reflects June 4, 2026.
Before the cloud study, we ran the same kind of bake-off on 13 local models: 1,755 scored generations on real MIT lectures, running entirely on-device. When Google shipped QAT checkpoints of the Gemma-4 family, we re-ran the accuracy tests with the same method; the picks below come from that re-test.
Q4_K_XL · 2.6 GBQ4_K_XL · 2.6 GBQ4_K_XL · MoE · 14.3 GBQ4_K_XL · MoE · 14.3 GBThe surprise of the original study held up in the re-test: the smallest model is the most trustworthy. E2B's QAT checkpoint invented a false statement just 0.3% of the time at half its old size, and the 26B mixture-of-experts came back with zero. That's why E2B is the pick on every Mac under 24 GB, not the budget fallback.
OpenRouter is one account that gives you access to almost every AI model out there. Instead of signing up with five different companies, you sign up once and pick models like items on a menu. That's why it's the easiest on-ramp (and what we used for this entire test).
Sign up like any website. Add a few dollars of credit; at under a penny a lecture, that lasts a semester.
In OpenRouter's settings, create a key. It's a long string starting with sk-or-…. Treat it like a password and don't share it.
Settings → Connections → add OpenRouter and paste your key. It's stored in your Mac's Keychain; LectureSync never sends it anywhere except OpenRouter itself.
Settings → Defaults: for notes, our recommendation is minimax/minimax-m3. For transcription, openai/whisper-large-v3-turbo is the standard pick (this bake-off covered the notes step; transcription wasn't part of it).
Step 3: your key, saved in Settings → Connections.
Step 4: pick your notes model, right inside the app.
What "Measured" means here: both the local and cloud picks come from our own bake-offs, scored automatically against hand-built answer keys and ranked on faithfulness first. We don't ask another AI for its opinion of the notes.
Privacy trade-off, plainly: with the built-in or local options, your audio and notes never leave your Mac. With any cloud model, your lecture transcript is sent to the provider you picked, under their terms. Details in our privacy policy.
Models change fast. New models drop monthly and we re-test. If a pick on this page changes, it's because we measured something better, not because of a sponsorship. Nobody pays to be recommended here.
METHODOLOGY · Cloud: tested 2026-06-04 on 12 university lectures via OpenRouter, using LectureSync's production note-taking prompt (single pass, low temperature). Notes scored automatically against a curated, hand-built answer key per lecture: faithfulness (fabricated facts), detail (worked examples & specific quantities retained), coverage (key points on unfamiliar subjects), and on-topic discipline. One run per model per lecture: directional, not a statistical study. Costs are real OpenRouter charges. Local: 1,755-run multi-seed study on 13 GGUF models, 9 MIT OCW lectures.
Last updated: June 4, 2026 · Local: 1,755-run bake-off · Cloud: 14-model bake-off (2026-06-04)