CPUs Aren't Dead. Gemma2B Out Scored GPT-3.5 Turbo on the Test That Made It Famous

The result
It has been reported that Google’s Gemma 2B — a 2-billion-parameter, openly weighted model — scored about 8.0 on MT‑Bench, edging out GPT‑3.5 Turbo’s 7.94. MT‑Bench is the familiar 80-question benchmark that many in the field use as a quick sanity check; when you see an “~8.0” number, you already know the rough performance band. SeqPU says they published the full tape — every prompt, every turn, every score — so anyone can verify the run. Curious? You should be. Your laptop can allegedly reproduce the result; no GPU required.
How they did it
According to the report, the team ran the model with a simple 169‑line Python wrapper — no fine‑tuning, no retrieval, no chain‑of‑thought hacks, just model.generate() and a chat template. They claim to have found seven recurring failure classes (not mere hallucinations but patterned mistakes: arithmetic slips, logic proofs that concluded with the wrong answer, constraint drift, broken personas, ignored qualifiers) and applied six surgical fixes of roughly 60 lines of Python each. With those fixes the score reportedly climbed to ~8.2. The raw model and a “warts and all” bot are allegedly live on Telegram for anyone to poke, prod, and push.
Why it matters
This is a reminder that not every gap is a hardware problem. SeqPU frames it as a software‑engineering win: the field’s fixation on scaling compute and parameter counts may have overshadowed low‑hanging engineering gains that let small, efficient models punch above their weight. Want to try it yourself? It has been reported that the stack is as simple as pip install torch transformers accelerate and a single chat.py, or you can run it globally with Cloudflare Containers for about $5/month. Caveat emptor: the public demo is raw and unguarded — useful for verification, not a production safety blanket. Who knew the little CPU under your keyboard still had tricks up its sleeve?
Sources: seqpu.com, Hacker News
Comments