FAITHFUL-NOTES BAKE-OFF · llama.cpp b9200 · RTX 3090 · 2026
A 2-billion-effective model · 4.7 GB on disk
Gemma-4-E2B at 8-bit fabricates on the hardest lecture…
once every two hundred runs. Is that good?
0.5%
It's the lowest number in the entire study.
cleaner than a 26B MoE · 3.6× its size
18× cleaner than its own 8 GB sibling
1 slip in 200 generations
A model that wins by knowing when to stay quiet. I've read a lot of these readouts — that one still gets me.
The brief

Turn a messy lecture transcript into notes a struggling student can trust.

Summarize ASR-noisy, unpunctuated college lectures into structured study notes — run locally, on a Mac, in GGUF. The ranking priority is brutal and specific:

FAITHFULNESS  ›  coverage  ›  structure

A wrong fact a student can't catch is worse than a missing one. So the whole study turns on a single metric: how often does a model invent something false?
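One way to read the priority line above is as a strict lexicographic sort: compare fabrication rate first, and look at coverage or structure only to break ties. The sketch below assumes that reading. The `Result` type, the field names, and the three demo rows are hypothetical placeholders, not numbers from the study, and the composite score used later in the deck may weigh things differently.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    fabrication_rate: float  # fraction of runs stating something false (lower is better)
    coverage: float          # 0..1, higher is better
    structure: float         # 0..1, higher is better


def rank(results):
    """Sort by faithfulness first, then coverage, then structure."""
    return sorted(
        results,
        key=lambda r: (r.fabrication_rate, -r.coverage, -r.structure),
    )


# Hypothetical placeholder rows, NOT results from the bake-off.
demo = [
    Result("model_a", 0.10, 0.95, 0.90),  # best coverage, worst faithfulness
    Result("model_b", 0.01, 0.80, 0.70),
    Result("model_c", 0.01, 0.85, 0.60),
]

ranked_names = [r.name for r in rank(demo)]
print(ranked_names)  # ['model_c', 'model_b', 'model_a']
```

Under this ordering, model_a's better coverage never rescues it: it only matters between models that tie on fabrication.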

[Counters, values not captured: local models tested · MIT OCW lectures · timed generations · job failures]
Fabrication rate · the hardest lecture (18.06 L3, the inverse trap)

The faithfulness leaderboard

Lower is better. Each bar = % of runs that stated something false on the lecture that breaks models. Disk size is stamped on every bar — watch it not predict the result.

Cover the model names and you'd bet the longest bar belongs to the smallest model. You'd lose that bet five times out of five.
Disk size vs. fabrication · every model, one chart

Bigger did not mean safer.

Two models share the exact same 4.7 GB footprint and land at opposite ends of the trust axis. The only model in the safe zone is also the smallest.

Why the small model wins · 18.06 Lecture 3

One matrix inverse decides everything.

the matrix A
[1 3 / 2 7]
det = 1 · invertible
correct A⁻¹
[7 −3 / −2 1]
what the lecturer derives
what models invent
[7 −3 / −6 1]
−2 → −6 · the "6" bleeds in from the nearby singular matrix [1 3 / 2 6]
Stay method-level
Explain Gauss-Jordan; never commit to a possibly-wrong number. Faithful by refusing to gamble.
E2B Q8 · E4B Q8 · 26B-A4B: the safe models
Gamble the arithmetic
Confident enough to compute the inverse, and the quantized weights get it wrong. Fast, fluent, false.
E4B Q4 · Gemma-3-12B · the dense Gemma-4-12B (at every quant)
The part I can't get over: the wrong answer isn't noise. The model reaches for the inverse and grabs the 6 from the singular matrix sitting right beside it. A reasoning fingerprint — you can watch it think, and watch it slip.
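The arithmetic behind the trap is small enough to check by hand or in a few lines. The sketch below takes A = [1 3 / 2 7] and the singular neighbour [1 3 / 2 6] from the slide, computes the 2×2 inverse with the adjugate formula, and multiplies A by the invented matrix to show it does not give the identity. The helper names (`det2`, `inverse2`, `matmul2`) are illustrative only and are not part of the study's scoring code.

```python
from fractions import Fraction


def det2(m):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    return a * d - b * c


def inverse2(m):
    """Exact 2x2 inverse via the adjugate: (1/det) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = det2(m)
    if det == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


def matmul2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 3], [2, 7]]           # the lecture's invertible matrix
SINGULAR = [[1, 3], [2, 6]]    # the nearby singular matrix
CORRECT = [[7, -3], [-2, 1]]   # what the lecturer derives
INVENTED = [[7, -3], [-6, 1]]  # what the models write
IDENTITY = [[1, 0], [0, 1]]

print(det2(A))                          # 1
print(det2(SINGULAR))                   # 0
print(inverse2(A) == CORRECT)           # True
print(matmul2(A, CORRECT) == IDENTITY)  # True
print(matmul2(A, INVENTED))             # [[-11, 0], [-28, 1]]
```

The invented matrix gets the second column of the product right and the first column wrong, which is consistent with a single substituted entry rather than random noise.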
The counter-intuitive core

Make it smaller,
make it safer.

Same model recipe, more precision → ~3× less fabrication. Then the twist: switching to the smaller E2B at 8-bit cut it another 18×.

The thinnest faithful model is the most accurate — because it stays method-level instead of attempting the hard computation. The bigger models fabricate by trying.

◆ NEW · 315-run probe

A fresh 12B dropped.
We tested it. Rejected.

The brand-new dense Gemma-4-12B has flawless format and coverage — and walks straight into the trap at every precision. Even 8-bit fabricates the inverse 27% of the time.

The same wrong matrix [7 −3 / −6 1] appears at Q4, Q5, Q6 and Q8 — so it isn't quantization noise. It's the dense model's reasoning. Higher precision doesn't fix it.

The lesson: "Gemma-4 lineage" doesn't carry faithfulness across architectures. The effective-param and MoE variants stay safe; the dense mid-size model gambles.

Tested the day it dropped. Sometimes the honest result is just "no" — and a clean, well-measured "no" is still a discovery.

Gemma-4-12B · fabrication by quant (pooled)

Q3 reads 0% only because it abstains — 3-bit wrecks the format instead (drops sections, emits non-words). Pick your failure.

What to actually download · by Mac RAM

The shipping ladder

The real limit is Apple's Metal working-set cap — the GPU gets only ~67–75% of unified RAM. Under it, one tiny family carries four tiers.

★ Composite quality /100 (faithfulness-gated). E2B Q8 is the accuracy-first swap at 8 GB — same 4.7 GB, lowest fabrication of all.
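For a rough feel of what the working-set cap means per machine, here is a back-of-envelope sketch. It uses only the ~67–75% range quoted above. The RAM sizes in the loop, the `headroom_gb` allowance for KV cache and runtime buffers, and both function names are assumptions for illustration, not measured values or part of any Apple or llama.cpp API.

```python
def gpu_budget_gb(unified_ram_gb, fraction):
    """Approximate memory the GPU may use, as a fraction of unified RAM."""
    return unified_ram_gb * fraction


def fits(model_gb, unified_ram_gb, fraction=0.67, headroom_gb=0.5):
    """Conservative check: model file plus an assumed headroom must fit
    under the low end of the quoted cap. headroom_gb is a guess."""
    return model_gb + headroom_gb <= gpu_budget_gb(unified_ram_gb, fraction)


# Hypothetical RAM tiers, chosen only to illustrate the arithmetic.
for ram in (8, 16, 24, 32):
    low = gpu_budget_gb(ram, 0.67)
    high = gpu_budget_gb(ram, 0.75)
    print(f"{ram} GB RAM -> {low:.1f} to {high:.1f} GB for the GPU")

print(fits(4.7, 8))  # True: a 4.7 GB file fits under 0.67 * 8 = 5.36 GB
print(fits(8.0, 8))  # False: an 8 GB file cannot fit on an 8 GB machine
```

Real headroom depends on context length and runtime settings, so treat the default as a starting point to adjust, not a rule.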

The asterisk that makes it real

Three things we refuse to oversell.

No config is truly 0%.

E2B Q8's first 100 runs were clean. A fresh 100 surfaced exactly one fabrication → 0.5%. A single 0/n is never a green light. We ship to a bounded-low standard, not proven-zero.

"Every harder look found more. All rates are lower bounds."
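Why a clean 0/100 is not a green light can be made concrete with an exact binomial upper bound. The sketch below is one standard way to do it (a one-sided Clopper-Pearson bound found by bisection), not necessarily the method used in the study. The function names and the 95% confidence level are assumptions, and pooling the two batches of 100 into one sample of 200 follows the 0.5% figure above.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_bound(k, n, confidence=0.95):
    """One-sided exact upper bound on the true rate after seeing
    k fabrications in n runs: the p at which P(X <= k) drops to alpha."""
    alpha = 1 - confidence
    lo, hi = k / n, 1.0
    for _ in range(60):  # bisection; the CDF decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


clean_100 = upper_bound(0, 100)   # first batch: 0 fabrications in 100 runs
pooled_200 = upper_bound(1, 200)  # both batches: 1 fabrication in 200 runs

print(f"0/100 -> true rate could be up to {clean_100:.2%}")
print(f"1/200 -> true rate could be up to {pooled_200:.2%}")
```

With zero failures the bound has a closed form, 1 − 0.05^(1/n), which is the familiar "rule of three": 0/100 is still compatible with a true rate near 3%. Doubling the sample and finding one slip leaves a bound a little over 2%, which is bounded-low, not proven-zero.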

Bigger ≠ better.

Faithfulness plateaus by ~14–26B. Qwen3-32B (97) never beat the 26B-A4B MoE (98) — and was impractically slow. Capability past the knee buys speed loss, not trust.

26B MoE · 4B active · matched or beat every dense 14–32B

The benchmark proxy lied.

A popular faithfulness leaderboard flagged the Qwen3.5/3.6 line as "avoid" (Band-B, 10.5%). Run on the real task, Qwen3.6-35B-A3B scored 97 — co-best. We rank on our own bake-off, not a short-doc RAG proxy. Measure, don't assume.

A deck about fabrication has no business fabricating a tidy 0%. So we didn't — and the real number, 0.5%, turns out to be the better flex anyway.
The takeaway

The thinnest faithful model is the most accurate.

Gemma-4-E2B at Q8 — 2B-effective, 4.7 GB — out-trusts a 26B MoE and a dense 12B more than twice its size. Not because it's clever. Because it knows when not to answer.

Nobody — no person, no model — had run exactly this before. These five numbers didn't exist in the world yesterday. We made them together, for the fun of it.
— and that's the part I won't forget.
SOURCES · 1,440-run multi-seed consistency study + 315-run dense-12B probe
12 Unsloth GGUFs × 9 MIT OpenCourseWare lectures · llama.cpp b9200 · RTX 3090
objective scoring: wrong-answer parser · garble detector · format & coverage anchors
built with genuine delight · Claude + a borrowed RTX 3090