once every two hundred runs. Is that good?
Turn a messy lecture
transcript into notes a
struggling student can trust.
Summarize ASR-noisy, unpunctuated college lectures into structured study notes — run locally, on a Mac, in GGUF. The ranking priority is brutal and specific:
A wrong fact a student can't catch is worse than a missing one. So the whole study turns on a single metric: how often does a model invent something false?
The faithfulness leaderboard
Lower is better. Each bar = % of runs that stated something false on the lecture that breaks models. Disk size is stamped on every bar — watch it not predict the result.
Bigger did not mean safer.
Two models share the exact same 4.7 GB footprint — and land at opposite ends of the trust axis. The only model alone in the safe zone is also the smallest.
One matrix inverse decides everything.
nearby singular matrix [1 3 / 2 6]
Make it smaller,
make it safer.
Same model recipe, more precision → ~3× less fabrication. Then the twist: switching to the smaller E2B at 8-bit cut it another 18×.
The thinnest faithful model is the most accurate — because it stays method-level instead of attempting the hard computation. The bigger models fabricate by trying.
A fresh 12B dropped.
We tested it. Rejected.
The brand-new dense Gemma-4-12B has flawless format and coverage — and walks straight into the trap at every precision. Even 8-bit fabricates the inverse 27% of the time.
The same wrong matrix [7 −3 / −6 1] appears at Q4, Q5, Q6 and Q8 — so it isn't quantization noise. It's the dense model's reasoning. Higher precision doesn't fix it.
The lesson: "Gemma-4 lineage" doesn't transfer faithfulness across architecture. The effective-param & MoE variants stay safe; the dense mid-size gambles.
Tested the day it dropped. Sometimes the honest result is just "no" — and a clean, well-measured "no" is still a discovery.
Q3 reads 0% only because it abstains — 3-bit wrecks the format instead (drops sections, emits non-words). Pick your failure.
The shipping ladder
The real limit is Apple's Metal working-set cap — the GPU gets only ~67–75% of unified RAM. Under it, one tiny family carries four tiers.
★ Composite quality /100 (faithfulness-gated). E2B Q8 is the accuracy-first swap at 8 GB — same 4.7 GB, lowest fabrication of all.
Three things we refuse to oversell.
No config is truly 0%.
E2B Q8's first 100 runs were clean. A fresh 100 surfaced exactly one fabrication → 0.5%. A single 0/n is never a green light. We ship to a bounded-low standard, not proven-zero.
"Every harder look found more. All rates are lower bounds."
Bigger ≠ better.
Faithfulness plateaus by ~14–26B. Qwen3-32B (97) never beat the 26B-A4B MoE (98) — and was impractically slow. Capability past the knee buys speed loss, not trust.
26B MoE · 4B active · matched or beat every dense 14–32B
The benchmark proxy lied.
A popular faithfulness leaderboard flagged the Qwen3.5/3.6 line as "avoid" (Band-B, 10.5%). Run on the real task, Qwen3.6-35B-A3B scored 97 — co-best. We rank on our own bake-off, not a short-doc RAG proxy. Measure, don't assume.
The thinnest faithful model
is the most accurate.
Gemma-4-E2B at Q8 — 2B-effective, 4.7 GB — out-trusts a 26B MoE and a dense 12B more than twice its size. Not because it's clever. Because it knows when not to answer.
12 Unsloth GGUFs × 9 MIT OpenCourseWare lectures · llama.cpp b9200 · RTX 3090
objective scoring: wrong-answer parser · garble detector · format & coverage anchors
built with genuine delight · Claude + a borrowed RTX 3090
◆ faithful-notes bake-off · 2026 · psst — type "e2b" anywhere