LectureSync can write your notes with an open model running entirely on your Mac: nothing leaves the room. But which model? We keep a standing leaderboard: every promising open model gets run on real university lectures, scored against hand-built answer keys, and thrown out the moment it makes something up. This page is the whole thing: the picks, the scores, and the rejects.
Two models cover all four memory tiers. Both are QAT checkpoints, versions of Google's Gemma-4 family that were trained to survive 4-bit compression instead of being squeezed after the fact. That training detail turned out to be the difference between trustworthy and not (more on that below).
Gemma-4 E2B QAT
Matches the 8-bit version's trustworthiness at half the weights. The RAM it doesn't use goes to holding the whole lecture in context, which matters more than a bigger model here.
Zero fabricated statements in our hardest test, the cleanest result in the field. QAT shrank it from 16.9 GB to 14.3 GB, which is exactly what got it under the 24 GB tier's budget.
Every model writes notes for the same real lectures, and every set of notes is scored against a hand-built answer key. Four ingredients make the score; one rule overrides them all.
A model can lose points for thin coverage and win them back on structure. It cannot win back a made-up fact. Fabricate, and your score is capped no matter how good the rest is: above 0.5% fabrication the score tops out at 60, above 2% at 40, and above 10% at 30.
"Fits" means fits the budget, not the box: macOS, the app, and the lecture's own context need the rest of your RAM. A model that technically loads but starves everything else doesn't count as fitting.
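As a rough illustration of the "budget, not the box" rule, the check can be sketched as below. The function name and the 8 GB reserve are hypothetical placeholders, not LectureSync's actual figures; the page does not publish its per-tier budgets. Only the 16.9 GB and 14.3 GB model sizes and the 24 GB tier come from the text above.

```python
def fits_tier(model_gb: float, tier_ram_gb: float, reserved_gb: float) -> bool:
    """Return True if the model fits the tier's budget.

    The budget is what remains after setting aside RAM for macOS, the app,
    and the lecture's context. reserved_gb is an assumed input here.
    """
    budget_gb = tier_ram_gb - reserved_gb
    return model_gb <= budget_gb


# Hypothetical reserve of 8 GB on the 24 GB tier (illustrative only).
RESERVED_GB = 8.0

before_qat = fits_tier(16.9, 24, RESERVED_GB)  # pre-QAT size from the text
after_qat = fits_tier(14.3, 24, RESERVED_GB)   # QAT size from the text
print(before_qat, after_qat)
```

Under that assumed reserve, both files would load on a 24 GB Mac, but only the 14.3 GB one leaves the room the rest of the system needs.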
All 35, including every reject. Tinted rows are the two that ship in LectureSync.
Ships = the production pick for a tier.
Passed = cleared the fabrication gate, just wasn't the best fit.
Rejected = fabricates or loops.
Fabrication = % of generated math notes containing an invented statement (out of n notes; older entries were tested on smaller n).
Loops = % of runs that degenerated into repetition.
Depth = how much optional detail (worked examples, asides) the notes keep; depth only counts if you can trust it.
Model names link to the exact weights on Hugging Face.
What 35 bake-offs taught us, with names named.
Squeeze Gemma-4 E4B to 4-bit the normal way and it fabricates on 8.3% of math notes overall, and on 25% for one lecture. The same model's QAT checkpoint, trained for 4-bit from the start, fabricates on 0.3%, at half the size of the 8-bit version. That one result is why both shipping picks are QAT.
The 2.6 GB E2B QAT ties the 5 GB 8-bit E2B on faithfulness and holds the pick on both the 8 GB and 16 GB tiers. Nothing bigger earned its keep there; the spare memory does more good holding a 90-minute lecture in context.
Qwen3.6-27B kept 87% of optional detail, the deepest notes in the entire field, and fabricated on 8.2% of math notes. Capped at 40, rejected. Detail you can't trust isn't detail; it's homework for your fact-checker.
Mellum 2, a code model, writes beautiful prose and correct definitions, then botches the arithmetic inside its own worked example: it multiplies a matrix by [7, 9], drops a term, and presents [7, 18] where [34, 68] belongs. Exactly the error a student seeing the material for the first time can't catch. Rejected.
Not the tinkering type? The built-in option needs zero setup, and our cloud bake-off found great notes for under a penny a lecture.
See the cloud model picks
The fabrication numbers are automated, then hand-checked. A detector compares every generated math note against the answer key. It can overcount (it sometimes flags a legitimate intermediate step), so before a model is gated, we validate the flagged notes by hand. Mellum 2's headline 86.7% includes overcounting; its real, hand-verified fabrication was still far past the gate.
Sample sizes vary. Current entries are scored on 300 to 500 generated math notes across multiple sampling seeds. Historical bake-off entries ran on 15 to 45, where a single slip reads as 6.7%. The n for each model is in the table; small-n rejects stay rejected until a bigger re-test says otherwise.
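To make the sample-size point concrete, here is a minimal sketch of how much a single fabricated note moves the rate at different n. The function name is illustrative; the sample sizes are the ones quoted above.

```python
def fabrication_rate(flagged: int, n: int) -> float:
    """Percentage of notes that contain at least one invented statement."""
    return 100 * flagged / n


# One slip at each sample size mentioned in the text.
for n in (15, 45, 300, 500):
    print(f"n={n}: {round(fabrication_rate(1, n), 1)}%")
```

At n = 15, one slip is enough to land above the 2% cap threshold, while at n = 300 the same slip stays under 0.5%. That is why small-n results are treated as provisional until a larger re-test.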
This measures note-writing, not the models themselves. A model that fails our gate can be excellent at code or chat. We're testing one narrow, unforgiving job: faithfully compressing a real lecture into study notes on consumer hardware.
The leaderboard moves. New checkpoints land monthly and we re-run the same tests. When a pick changes, it's because we measured something better. Nobody pays to be on this page, and every model on it is free to download.
METHODOLOGY · Tests run 2026-05-29 → 2026-06-10 on Apple-silicon Macs via llama.cpp (Metal), using LectureSync's production note-taking prompt on real MIT OCW and Yale lectures. Composite score: 30% faithfulness + 30% coverage + 30% structure + 10% conciseness, scaled to 100, with fabrication caps applied after (>0.5% → ≤60, >2% → ≤40, >10% → ≤30; repetition loops floor the score). Fabrication rate: share of generated math notes containing at least one invented statement, multi-seed where n ≥ 300. Sizes are GGUF file sizes; tier budgets assume macOS + app + context overhead. Last updated 2026-06-11.
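The composite formula and caps above can be sketched as follows. This is an illustration under stated assumptions, not the production scorer: it assumes each ingredient is a sub-score between 0 and 1, the function and parameter names are hypothetical, and the repetition-loop floor is left out because the text does not give its value.

```python
def composite_score(faithfulness: float, coverage: float, structure: float,
                    conciseness: float, fabrication_pct: float) -> float:
    """Weighted score on a 0-100 scale, with fabrication caps applied after.

    Sub-scores are assumed to lie in [0, 1]; fabrication_pct is a percentage
    (8.2 means 8.2% of math notes contained an invented statement).
    """
    raw = (30 * faithfulness + 30 * coverage
           + 30 * structure + 10 * conciseness)

    # Caps from the methodology: >0.5% -> <=60, >2% -> <=40, >10% -> <=30.
    if fabrication_pct > 10:
        cap = 30
    elif fabrication_pct > 2:
        cap = 40
    elif fabrication_pct > 0.5:
        cap = 60
    else:
        cap = 100
    return min(raw, cap)


# Strong notes with 8.2% fabrication still hit the 40 cap.
print(composite_score(1.0, 1.0, 1.0, 1.0, 8.2))
```

Applying the cap after the weighted sum is what keeps strong coverage or structure from buying back a fabricated fact.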