GLM-5.1: a model designed to keep going where others stall

What’s new
GLM‑5.1 is reportedly z.ai’s next‑generation flagship for “agentic engineering,” with markedly stronger coding capabilities than its predecessor. The company says it sets new state‑of‑the‑art marks on SWE‑Bench Pro and outperforms GLM‑5 by a wide margin on NL2Repo (repo generation) and Terminal‑Bench 2.0 (real‑world terminal tasks). Short version: the model writes better code. Longer version: it is said to keep improving where other models plateau.
Built for the long haul
Previous models, GLM‑5 included, tend to burn their best tricks early and then stall: give them more time and nothing changes. GLM‑5.1 was reportedly engineered to resist that early exhaustion. It is said to break ambiguous problems down, run experiments, read the results, identify blockers and then iterate, over hundreds of rounds and thousands of tool calls. The claim is simple but striking: the longer it runs, the better the result. That’s the emotional pivot here: persistence over brilliance.
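To make the "propose, measure, keep what helps" idea concrete, here is a minimal sketch of such an outer loop. It is a toy, not z.ai's method: `run_experiment` is a hypothetical stand-in for a real benchmark run, and the random perturbation stands in for whatever change an agent might propose. All it demonstrates is that a keep-the-best loop never regresses, so a longer run with the same seed can only match or beat a shorter one.

```python
import random


def run_experiment(params):
    """Hypothetical stand-in for a benchmark run; higher is better, peak at (3, -1)."""
    x, y = params
    return -((x - 3.0) ** 2) - ((y + 1.0) ** 2)


def optimize(rounds, seed=0):
    """Propose a change, measure it, and keep it only if the score improves."""
    rng = random.Random(seed)
    best_params = (0.0, 0.0)
    best_score = run_experiment(best_params)
    history = [best_score]  # best score seen after each round
    for _ in range(rounds):
        # Assumption: a small random tweak plays the role of a proposed change.
        candidate = tuple(p + rng.gauss(0.0, 0.5) for p in best_params)
        score = run_experiment(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
        history.append(best_score)
    return best_params, best_score, history


if __name__ == "__main__":
    for budget in (50, 600):
        _, score, _ = optimize(budget, seed=1)
        print(budget, "rounds, best score:", score)
```

A real agent loop would swap the random tweak for reasoning over experiment results, which is where the reported gains would have to come from.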
VectorDBBench: a dramatic example
One concrete example is VectorDBBench, an open‑source challenge for building high‑performance approximate nearest‑neighbor search in Rust under a strict evaluation regime. The best one‑shot result under a 50‑turn budget was reportedly 3,547 QPS (Claude Opus 4.6). Reframed as an outer optimization loop with no fixed tool‑call cap, GLM‑5.1 reportedly kept iterating and climbed to roughly 21.5k QPS after 600+ iterations and 6,000+ tool calls, about six times the prior best. The trajectory is described as a “staircase”: long stretches of tuning punctuated by structural shifts (e.g., switching from full‑scan to IVF with f16 compression, then adding a two‑stage u8 prescore + f16 rerank), with temporary dips in recall during exploration.
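For readers unfamiliar with the "two‑stage u8 prescore + f16 rerank" idea, the sketch below shows the general shape in plain Python: score every vector cheaply on 8‑bit codes, then rerank only a shortlist at half precision. It is an illustration under stated assumptions, not the code GLM‑5.1 produced (which was reportedly Rust). The function names, the fixed [-1, 1] value range, the squared‑distance metric, and the shortlist size are all hypothetical choices, and the IVF partitioning step is left out.

```python
import random
import struct


def to_f16(x):
    """Round a float to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_u8(x, lo=-1.0, hi=1.0):
    """Map a value to an integer code in 0..255, clipping to [lo, hi] (assumed range)."""
    clipped = min(max(x, lo), hi)
    return round((clipped - lo) / (hi - lo) * 255)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def build_index(vectors):
    """Store two compressed copies of each vector: u8 codes and f16-rounded values."""
    return {
        "u8": [[to_u8(x) for x in v] for v in vectors],
        "f16": [[to_f16(x) for x in v] for v in vectors],
    }


def search(index, query, k=5, shortlist=50):
    """Return the ids of the k nearest vectors using prescore plus rerank."""
    q_codes = [to_u8(x) for x in query]
    # Stage 1: coarse integer-only score over every vector.
    coarse = sorted(range(len(index["u8"])),
                    key=lambda i: sq_dist(q_codes, index["u8"][i]))
    candidates = coarse[:shortlist]
    # Stage 2: rerank only the shortlist with the more precise f16 copy.
    reranked = sorted(candidates, key=lambda i: sq_dist(query, index["f16"][i]))
    return reranked[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(1000)]
    print(search(build_index(data), data[3]))
```

The trade‑off is the one the article hints at: a smaller shortlist is faster but risks dropping true neighbors in stage 1, which is one way recall can dip while throughput climbs.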
Broader tests and takeaways
GLM‑5.1 was also evaluated on kernel optimization and open‑ended app builds. KernelBench asks whether a model can convert a reference PyTorch implementation into faster GPU kernels across increasing scope; the web‑app task supplies essentially no external metric, only the model’s own judgment of progress. GLM‑5.1 reportedly sustains useful, self‑directed improvement across these less‑structured settings. The bigger question now: if models get better the longer you let them run, should benchmarks stop being stopwatches? Benchmark budgets matter. So do patience and tooling. Who knew persistence could be a feature?
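The kernel‑optimization setup boils down to two checks: does the candidate still match the reference, and how much faster is it? The sketch below shows that pattern with a pure‑Python stand‑in. It is not KernelBench's harness and involves no PyTorch or GPU code; the prefix‑sum functions and `evaluate` are hypothetical names chosen for illustration.

```python
import time


def reference_prefix_sums(values):
    """Deliberately slow reference: recompute every prefix sum from scratch."""
    return [sum(values[: i + 1]) for i in range(len(values))]


def candidate_prefix_sums(values):
    """Faster candidate: carry a single running total."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out


def best_time(fn, arg, repeats=5):
    """Smallest wall-clock time over several runs, to damp timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-9)  # guard against a zero reading on coarse clocks


def evaluate(reference, candidate, arg, repeats=5):
    """Return (correct, speedup); the speedup only counts if correct is True."""
    correct = reference(arg) == candidate(arg)
    speedup = best_time(reference, arg, repeats) / best_time(candidate, arg, repeats)
    return correct, speedup


if __name__ == "__main__":
    data = list(range(2000))
    print(evaluate(reference_prefix_sums, candidate_prefix_sums, data))
```

Checking correctness before speed matters here: a loop that optimizes only the timing number will happily "win" with a kernel that returns the wrong answer.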
Sources: z.ai, Hacker News