Local model leaderboard

We test every local model. Most don't make it.

LectureSync can write your notes with an open model running entirely on your Mac: nothing leaves the room. But which model? We keep a standing leaderboard: every promising open model gets run on real university lectures, scored against hand-built answer keys, and thrown out the moment it makes something up. This page is the whole thing: the picks, the scores, and the rejects.

35 models tested
9 passed the gate
2 ship in the app

The picks, by your Mac's RAM · Measured

Two models cover all four memory tiers. Both are QAT (quantization-aware training) checkpoints: versions of Google's Gemma-4 family that were trained to survive 4-bit compression instead of being squeezed after the fact. That training detail turned out to be the difference between trustworthy and not (more on that below).

8 GB & 16 GB Macs

Gemma-4 E2B QAT

UD-Q4_K_XL · 2.6 GB · MatFormer MoE
91/100 · 0.2% fabrication, 500 math notes

Matches the 8-bit version's trustworthiness at half the file size. The RAM it doesn't use goes to holding the whole lecture in context, which matters more than a bigger model here.

8 GB Mac: 2.6 of 6 GB free
16 GB Mac: 2.6 of 11 GB free

24 GB & 32 GB Macs

Gemma-4 26B-A4B QAT

UD-Q4_K_XL · 14.3 GB · 26B MoE, 4B active
97/100 · 0% fabrication, 500 math notes

Zero fabricated statements in our hardest test, the cleanest result in the field. QAT shrank it from 16.9 GB to 14.3 GB, which is exactly what got it under the 24 GB tier's budget.

24 GB Mac: 14.3 of 17 GB free
32 GB Mac: 14.3 of 22 GB free

How a model earns its score

Every model writes notes for the same real lectures, and every set of notes is scored against a hand-built answer key. Four ingredients make the score; one rule overrides them all.

30% · Faithfulness: nothing invented (no made-up numbers, names, or steps)
30% · Coverage: the key points of the lecture actually show up
30% · Structure: clean headings, working math rendering, no formatting wrecks
10% · Conciseness: notes, not a transcript dressed up as notes
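
As a rough illustration of how the four weights combine, here is a minimal sketch. It assumes each ingredient is scored on a 0 to 100 scale, which this page does not state; the function name and the example sub-scores are hypothetical and are not taken from the leaderboard.

```python
# Weights as listed above. The 0-100 sub-score scale is an assumption.
WEIGHTS = {
    "faithfulness": 0.30,
    "coverage": 0.30,
    "structure": 0.30,
    "conciseness": 0.10,
}


def composite_score(subscores: dict[str, float]) -> float:
    """Weighted sum of the four sub-scores, before any fabrication cap."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Hypothetical sub-scores, for illustration only.
example = {"faithfulness": 100, "coverage": 90, "structure": 95, "conciseness": 80}
print(round(composite_score(example), 1))  # 93.5
```

Because the three big ingredients carry equal weight, a shortfall in one can be offset by another. The fabrication gate is the exception.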

The fabrication gate

A model can lose points for thin coverage and win them back on structure. It cannot win back a made-up fact. Fabricate, and your score is capped no matter how good the rest is:

fabricates in > 0.5% of math notes → capped at 60
in > 2% → capped at 40
in > 10% → capped at 30
gets stuck in repetition loops → floored
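
The caps above can be sketched as a small function. This is an illustration under stated assumptions, not LectureSync's scoring code: the function name is hypothetical, the fabrication rate is passed as a fraction (0.082 for 8.2%), and since the page does not say what value "floored" maps to, the floor is a parameter that defaults to 0.

```python
def apply_fabrication_gate(
    score: float,
    fabrication_rate: float,
    loops: bool = False,
    floor: float = 0.0,  # assumption: the page does not state the floor value
) -> float:
    """Cap a composite score according to the fabrication thresholds above."""
    if loops:
        return floor
    if fabrication_rate > 0.10:
        cap = 30.0
    elif fabrication_rate > 0.02:
        cap = 40.0
    elif fabrication_rate > 0.005:
        cap = 60.0
    else:
        return score  # at or under 0.5%: no cap applies
    return min(score, cap)


# 85 is a hypothetical pre-cap score; 8.2% is a rate reported on this page.
print(apply_fabrication_gate(85, 0.082))  # 40.0
print(apply_fabrication_gate(91, 0.002))  # 91
```

The cap is a ceiling, not a penalty: a model already scoring below the cap keeps its score, and a strong model with a fabrication problem is pulled down to it.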

8 GB Mac → 6 GB for the model
16 GB Mac → 11 GB for the model
24 GB Mac → 17 GB for the model
32 GB Mac → 22 GB for the model
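
The tier budgets above amount to a lookup table. The sketch below shows one way to check which tiers a model file fits; the names are hypothetical, and it assumes a plain "file size is at most the budget" comparison, since the budgets already set aside room for macOS, the app, and context.

```python
# Mac RAM (GB) -> memory budget for the model (GB), as listed above.
TIER_BUDGET_GB = {8: 6.0, 16: 11.0, 24: 17.0, 32: 22.0}


def tiers_that_fit(model_size_gb: float) -> list[int]:
    """Return the RAM tiers whose model budget can hold a file of this size."""
    return [ram for ram, budget in TIER_BUDGET_GB.items() if model_size_gb <= budget]


print(tiers_that_fit(2.6))   # [8, 16, 24, 32]
print(tiers_that_fit(14.3))  # [24, 32]
```

Fitting a tier is necessary but not sufficient for being its pick: the 2.6 GB model fits every tier, yet the larger tiers go to the model that scored higher.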

"Fits" means fits the budget, not the box: macOS, the app, and the lecture's own context need the rest of your RAM. A model that technically loads but starves everything else doesn't count as fitting.

The full leaderboard

All 35, including every reject. Tinted rows are the two that ship in LectureSync. Click a column to sort.

The sortable table needs JavaScript. The short version: Gemma-4 E2B QAT (2.6 GB, 91/100) is the pick for 8 and 16 GB Macs, and Gemma-4 26B-A4B QAT (14.3 GB, 97/100, zero fabrications) is the pick for 24 and 32 GB Macs. 26 of the 35 models we tested were rejected for fabricating or looping.

Ships = the production pick for a tier.
Passed = cleared the fabrication gate, just wasn't the best fit.
Rejected = fabricates or loops.
Fabrication = % of generated math notes containing an invented statement (out of n notes; older entries were tested on smaller n).
Loops = % of runs that degenerated into repetition.
Depth = how much optional detail (worked examples, asides) the notes keep; depth only counts if you can trust it.
Model names link to the exact weights on Hugging Face.

Field notes from the gate

What 35 bake-offs taught us, with names named.

QAT flipped the 4-bit story.

Squeeze Gemma-4 E4B to 4-bit the normal way and it fabricates on 8.3% of math notes overall, and on 25% of them for one lecture. The same model's QAT checkpoint, trained for 4-bit from the start: 0.3%, at half the size of the 8-bit version. That one result is why both shipping picks are QAT.

The smallest model is the most trustworthy.

The 2.6 GB E2B QAT ties the 5 GB 8-bit E2B on faithfulness and holds the pick on both the 8 GB and 16 GB tiers. Nothing bigger earned its keep there; the spare memory does more good holding a 90-minute lecture in context.

The richest notes came from a reject.

Qwen3.6-27B kept 87% of optional detail, the deepest notes in the entire field, and fabricated on 8.2% of math notes. Capped at 40, rejected. Detail you can't trust isn't detail; it's homework for your fact-checker.

Confidently wrong is the failure that matters.

Mellum 2, a code model, writes beautiful prose and correct definitions, then botches the arithmetic inside its own worked example: it multiplies a matrix by [7, 9], drops a term, and presents [7, 18] where [34, 68] belongs. Exactly the error a student seeing the material for the first time can't catch. Rejected.

Not the tinkering type? The built-in option needs zero setup, and our cloud bake-off found great notes for under a penny a lecture.

See the cloud model picks

The honest part

The fabrication numbers are automated, then hand-checked. A detector compares every generated math note against the answer key. It can overcount (it sometimes flags a legitimate intermediate step), so before a model is gated, we validate the flagged notes by hand. Mellum 2's headline 86.7% includes overcounting; its real, hand-verified fabrication was still far past the gate.

Sample sizes vary. Current entries are scored on 300 to 500 generated math notes across multiple sampling seeds. Historical bake-off entries ran on 15 to 45 notes, where a single slip reads as up to 6.7% (1 in 15). The n for each model is in the table; small-n rejects stay rejected until a bigger re-test says otherwise.
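
To see why sample size matters, here is the arithmetic for one flagged note at the sample sizes mentioned above. The helper name is hypothetical; the calculation is just slips divided by n.

```python
def slip_rate_percent(slips: int, n: int) -> float:
    """Fabrication rate in percent when `slips` of `n` notes contain an invented statement."""
    return 100.0 * slips / n


for n in (15, 45, 300, 500):
    print(f"n={n}: one slip reads as {slip_rate_percent(1, n):.1f}%")
# n=15: one slip reads as 6.7%
# n=45: one slip reads as 2.2%
# n=300: one slip reads as 0.3%
# n=500: one slip reads as 0.2%
```

At n = 15 or n = 45, a single flagged note is already past the 2% threshold, while at n = 300 or more it stays under the 0.5% one.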

This measures note-writing, not the models themselves. A model that fails our gate can be excellent at code or chat. We're testing one narrow, unforgiving job: faithfully compressing a real lecture into study notes on consumer hardware.

The leaderboard moves. New checkpoints land monthly and we re-run the same tests. When a pick changes, it's because we measured something better. Nobody pays to be on this page, and every model on it is free to download.

METHODOLOGY · Tests run 2026-05-29 → 2026-06-10 on Apple-silicon Macs via llama.cpp (Metal), using LectureSync's production note-taking prompt on real MIT OCW and Yale lectures. Composite score: 30% faithfulness + 30% coverage + 30% structure + 10% conciseness, scaled to 100, with fabrication caps applied after (>0.5% → ≤60, >2% → ≤40, >10% → ≤30; repetition loops floor the score). Fabrication rate: share of generated math notes containing at least one invented statement, multi-seed where n ≥ 300. Sizes are GGUF file sizes; tier budgets assume macOS + app + context overhead. Last updated 2026-06-11.