Recommended models

Which AI should write your notes? We tested it.

We ran 13 of the world’s newest AI models on 12 real university lectures (hard STEM and dense humanities) and scored every set of notes against hand-built answer keys. From free models to 15¢-a-lecture flagships, every single one passed our hardest accuracy test. So we picked based on what actually sets them apart. Here's what we found.

fabricated facts across all 13 models, on our hardest accuracy test

Just want the answer?

Start with the built-in option. Free, private, zero setup. Upgrade only if you want more.
Using a cloud model? Our recommendation is MiniMax M3: it tied the priciest flagships on detail and accuracy for under a penny a lecture.
Got 16 GB+ of RAM and like tinkering? Run a local model; our tested picks are below.

Never make something up.

Notes matter most when the material is brand new, and that's exactly when a wrong fact is hardest to catch. Nobody can spot an error in something they're learning for the first time; that's just what learning is. A confidently-wrong note gets studied and trusted, which makes it worse than a note that's thin. So our number-one rule, before anything else, is: never make something up.

That's why our testing isn't about which AI sounds smartest. It's about which one we can trust not to invent a number, misattribute a quote, or botch a calculation; then, among the trustworthy ones, which keeps the most useful detail.

Accuracy≫ Depth of detail› Clean structure› Brevity

Three ways to run it

All three produce the same kind of notes; they differ in setup, privacy, and ceiling.

Built-in Default

The option LectureSync ships with. Runs on your Mac using Apple's built-in models. Zero setup.

Free, foreverNothing leaves your MacGood notes for most classes

Local power-up Measured

Run a custom open model on your own Mac with Ollama, LM Studio, or oMLX, and point LectureSync at it.

Free, still 100% on your MacNeeds 8–32 GB RAMBacked by our 1,755-run bake-off

Cloud Measured

Use a hosted model through a provider like OpenRouter. The highest quality ceiling, for pennies per lecture.

Best-quality notesCosts cents, not dollarsBacked by our 14-model bake-off

How we tested

We ran the exact note-taking instructions LectureSync uses on 12 real university lectures, a deliberately broad mix: MIT linear algebra, MIT computer science, MIT economics, plus Yale philosophy, biology, and history. Hard STEM and dense humanities.

Then we scored every set of notes automatically against a hand-built answer key, not by asking another AI for its opinion, but by checking whether specific facts, named figures, quoted lines, and worked calculations from each lecture actually made it into the notes.

The toughest test, our "fabrication gate": one linear-algebra lecture works out the inverse of a matrix. The right answer is a specific grid of numbers. A model that guesses it and gets it wrong has done the one thing we can't allow. Smaller, cheaper AI models are known to fail this. Every cloud model we tested here got it right.

Faithfulnessno invented facts, ever

Detailworked examples kept

Coveragekey points across any subject

On-topicno class-admin clutter

The results

All 13 were accurate. What separated them was detail, speed, and cost. Click a column to sort.


MiniMax M3 app pick	MiniMax	May 31 ’26	0.65¢	89s	93%	100%	Value
Claude Opus 4.8	Anthropic	May 27 ’26	13.4¢	35s	93%	100%	SOTA
GPT-5.5	OpenAI	Apr 24 ’26	15.3¢	63s	90%	98%	SOTA
MiMo v2.5 Proapp pick	Xiaomi	Apr 22 ’26	0.55¢	51s	88%	93%	Budget
DeepSeek V4 Flashapp pick	DeepSeek	Apr 24 ’26	0.2¢	53s	86%	100%	Budget
Claude Haiku 4.5	Anthropic	Oct 15 ’25	2.08¢	22s	85%	100%	Mid
Gemini 3.5 Flash	Google	May 19 ’26	5.8¢	24s	85%	93%	Mid
MiMo v2.5app pick	Xiaomi	Apr 22 ’26	0.18¢	24s	83%	95%	Budget
Nemotron 3 Superapp pick	NVIDIA	Mar 11 ’26	0.21¢	53s	81%	90%	Budget
Grok 4.3	xAI	Apr 30 ’26	1.45¢	9s	79%	90%	Mid
DeepSeek V4 Pro	DeepSeek	Apr 24 ’26	0.96¢	77s	76%	98%	Budget
Gemini 3.1 Flash Lite	Google	May 7 ’26	0.34¢	5s	64%	88%	Lite
gpt-oss-120b	OpenAI	Aug 5 ’25	Free	42s	61%	80%	Free

✓ Faithful: every model, no exceptions. That column being boring is the headline. Tinted rows marked app pick are the models we recommend inside LectureSync. Detail = % of worked examples & specific numbers that survived into the notes. Coverage = key points captured on unfamiliar (humanities) subjects. Cost = our measured spend per ~1-hour lecture.

Price does not predict quality.

Every dot is a model. The best dots sit at the cheap end, and the expensive flagships are no higher up.

x-axis is logarithmic. ⭐ = our value pick. Free models shown in the free band, left.

What we learned

Accuracy is (finally) a solved problem at the top.

Two years ago, AI models routinely invented facts. Today, every frontier cloud model we tested (including completely free ones) got our hardest math test right, every time. That's the floor now. It's why we can offer this at all.

Price does not predict quality.

Our value pick (MiniMax M3, under a penny a lecture) tied the 15¢ flagships on detail and accuracy. Paying 20× more bought polish, not trustworthiness.

The real difference is how much detail survives.

Since they're all accurate, we chose on depth: does the note keep the worked example, the exact number, the named theorist? The spread was real. The best kept ~93% of the specifics, the thinnest barely 60%. For a student studying for an exam, that detail is the whole point.

Why these picks

With accuracy a given, it came down to three things, in order: detail, cost, and reliability at scale. Here's how that shook out.

⭐ Our cloud recommendation

MiniMax M3

minimax/minimax-m3 · 0.65¢/lecture · 93% detail

It tied the most expensive flagships on the planet for detail and accuracy, while costing under a penny per lecture. Nothing else matched that combination of richness and price.

The budget champion

DeepSeek V4 Flash

deepseek/deepseek-v4-flash · 0.20¢/lecture · 86% detail

Almost as detailed, even cheaper, and the cleanest at ignoring class-admin clutter. A great fallback.

And the famous names? Excellent, just more than you need.

The flagships (GPT-5.5, Claude Opus 4.8): superb models, and on this task they were matched, on both trust and detail, by options costing a twentieth as much. If you already use one, it'll serve you beautifully; you simply don't need flagship prices for great lecture notes.

The "lite" and smallest free models (Gemini 3.1 Flash Lite, gpt-oss-120b): genuinely fast and cheap, but they dropped too many of the worked examples and specific numbers a student actually needs before an exam.

Style outliers: some models leaned wordy (a 10,000-word "summary"), others too skeletal. We favor the ones that hit the right level of detail without a fight.

How current is this?

Our top value pick, MiniMax M3, was four days old when we tested it. Most of the field shipped within the six weeks before our test (plus two 2025 veterans for reference). We re-run this evaluation as new models land; the leaderboard above reflects June 4, 2026.

Prefer local? Picks by your Mac's RAM Measured · 1,755 runs

Before the cloud study, we ran the same kind of bake-off on 13 local models: 1,755 scored generations on real MIT lectures, running entirely on-device. When Google shipped QAT checkpoints of the Gemma-4 family, we re-ran the accuracy tests with the same method; the picks below come from that re-test. Want every number? See the full local leaderboard: all 35 models we've tested, every score, and the rejects.

8 GBMac RAM

Gemma-4-E2B QAT Q4_K_XL · 2.6 GB

The QAT checkpoint matches the old 8-bit E2B's trustworthiness (0.3% fabrication) at roughly half the size, which frees about 2 GB on the smallest Macs.

Measured

16 GBMac RAM

Gemma-4-E2B QAT Q4_K_XL · 2.6 GB

Still the pick: nothing bigger earned its keep here, and the extra memory goes to long-lecture context instead of model weights.

Measured

24 GBMac RAM

Gemma-4-26B-A4B QAT Q4_K_XL · MoE · 14.3 GB

QAT shrank the big mixture-of-experts from 16.9 to 14.3 GB, so it now fits this tier with room for long-lecture context. Zero fabrications in our re-test.

Measured

32 GBMac RAM

Gemma-4-26B-A4B QAT Q4_K_XL · MoE · 14.3 GB

Same pick, more headroom: the model runs comfortably alongside everything else you keep open.

Measured

The surprise of the original study held up in the re-test: the smallest model is the most trustworthy. E2B's QAT checkpoint invented a false statement just 0.3% of the time at half its old size, and the 26B mixture-of-experts came back with zero. That's why E2B is the pick on every Mac under 24 GB, not the budget fallback.

New to this? OpenRouter in four steps

OpenRouter is one account that gives you access to almost every AI model out there. Instead of signing up with five different companies, you sign up once and pick models like items on a menu. That's why it's the easiest on-ramp (and what we used for this entire test).

Create an account at openrouter.ai

Make an API key

In OpenRouter's settings, create a key. It's a long string starting with sk-or-…. Treat it like a password and don't share it.

Paste it into LectureSync

Settings → Connections → add OpenRouter and paste your key. It's stored in your Mac's Keychain; LectureSync never sends it anywhere except OpenRouter itself.

Pick your models

Settings → Notes: for notes, our recommendation is minimax/minimax-m3. For transcription, openai/whisper-large-v3-turbo is the standard pick (this bake-off covered the notes step; transcription wasn't part of it).

LectureSync's Connections settings showing on-device options alongside cloud and local-server connections, with a local oMLX server selected and a Test Connection button.

Step 3: your key, saved in Settings → Connections.

LectureSync's OpenRouter model picker showing 'Top picks for notes' with MiniMax M3 marked as Top pick, plus Xiaomi MiMo, DeepSeek, and NVIDIA Nemotron options, searchable across 340 models.

Step 4: pick your notes model, right inside the app.

Before you ask

Does an AI write my notes?

Yes. By default everything runs on your Mac; if you choose a cloud model, your transcript goes to the provider you picked. Either way, we test obsessively to make sure the notes stick to your lecture and never invent anything.

Could it still get something wrong?

Our hardest test caught zero fabricated facts across 13 models. No system is perfect, which is why our notes stay faithful to what was actually said and flag uncertainty rather than guess.

Why not just use the biggest, most famous model?

We tested them (GPT-5.5, Claude Opus 4.8). They're excellent, and they were matched, on accuracy and detail, by a model costing a twentieth as much. We chose value without compromise.

Do you use free models?

We test them. One free model placed mid-pack and is genuinely good; others were too thin. We pick on results, not price tags.

The honest part

What "Measured" means here: both the local and cloud picks come from our own bake-offs, scored automatically against hand-built answer keys and ranked on faithfulness first. We don't ask another AI for its opinion of the notes.

Privacy trade-off, plainly: with the built-in or local options, your audio and notes never leave your Mac. With any cloud model, your lecture transcript is sent to the provider you picked, under their terms. Details in our privacy policy.

Models change fast. New models drop monthly and we re-test. If a pick on this page changes, it's because we measured something better, not because of a sponsorship. Nobody pays to be recommended here.

METHODOLOGY · Cloud: tested 2026-06-04 on 12 university lectures via OpenRouter, using LectureSync's production note-taking prompt (single pass, low temperature). Notes scored automatically against a curated, hand-built answer key per lecture: faithfulness (fabricated facts), detail (worked examples & specific quantities retained), coverage (key points on unfamiliar subjects), and on-topic discipline. One run per model per lecture: directional, not a statistical study. Costs are real OpenRouter charges. Local: multi-seed study on GGUF models, MIT OCW lectures; full scores on the local model leaderboard.

Last updated: June 4, 2026 · Local: 1,755-run bake-off · Cloud: 14-model bake-off (2026-06-04)