Lean proved this program was correct; then an AI found a bug

April 13, 2026

The promise — and a proof

A team in the Lean ecosystem produced lean-zip, a machine-checked implementation of zlib, DEFLATE, gzip, ZIP and tar handling. It has been reported that 10 autonomous agents built and proved the implementation end to end, and that Leo de Moura described the result as a Lean-verified guarantee that decompression is the exact inverse of compression for inputs under 1 GB. That’s the emotional moment: math says it’s correct. The theorems are clean, the proof scripts tick green. Proofs like this are the dream of formal methods: ironclad assurances against whole classes of bugs.
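To make the shape of such a guarantee concrete, here is a minimal Lean sketch. The names `compress`, `decompress` and `RoundTripBelow` are hypothetical stand-ins, not lean-zip's actual definitions, and the size bound is left as a parameter rather than guessing how the project encodes "1 GB".

```lean
-- Hypothetical stand-ins for illustration; not lean-zip's real API.
axiom compress : ByteArray → ByteArray
axiom decompress : ByteArray → Option ByteArray

-- The shape of a bounded round-trip property: every input smaller
-- than `bound` bytes survives compress-then-decompress unchanged.
def RoundTripBelow (bound : Nat) : Prop :=
  ∀ data : ByteArray, data.size < bound →
    decompress (compress data) = some data
```

A statement of this shape quantifies over Lean-level values only. It says nothing about how a runtime lays those values out in memory.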

The counterexample — found the hard way

But it has been reported that a researcher pointed a Claude agent at the stripped-down lean-zip binary, armed it with AFL++, AddressSanitizer, Valgrind, UBSan and a pile of crafted inputs, and let it loose. Over roughly 105 million fuzzing executions across six attack surfaces, the run exposed crashes and memory errors. They were not in the verified algorithmic code, but in the Lean runtime. The culprit: lean_alloc_sarray, the runtime allocator for scalar arrays. For a ByteArray of capacity n the runtime computes an allocation size of 24 + n bytes. When n is near SIZE_MAX that unsigned addition wraps around, so only a tiny buffer is allocated while the array still claims capacity n, and writes into it overflow the heap. In short: the proof stood, the ground shifted beneath it.
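The wraparound is easy to reproduce in isolation. The following C sketch is not the Lean runtime's code: the function names are hypothetical, and the 24-byte header is taken from the "24 + n" figure quoted above. It contrasts an unchecked size computation with a checked one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a 24-byte object header, per the "24 + n" figure above.
   All names here are hypothetical, not the Lean runtime's. */
#define HEADER_SIZE ((size_t)24)

/* Unchecked: size_t arithmetic is modulo SIZE_MAX + 1, so a capacity
   near SIZE_MAX silently wraps to a very small result. */
size_t alloc_size_unchecked(size_t capacity) {
    return HEADER_SIZE + capacity;
}

/* Checked: rejects any capacity for which header + capacity cannot be
   represented. Returns true and stores the size in *out on success. */
bool alloc_size_checked(size_t capacity, size_t *out) {
    assert(out != NULL);
    if (capacity > SIZE_MAX - HEADER_SIZE) {
        return false; /* the addition would wrap */
    }
    *out = HEADER_SIZE + capacity;
    return true;
}
```

With a capacity of SIZE_MAX - 7, alloc_size_unchecked returns 16: the caller believes it has room for nearly SIZE_MAX bytes, but only 16 would be requested from the allocator. The checked variant refuses the same request.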

Why this stings — and what to do

This is a classic lesson in assumptions. Formal verification gives you guarantees about the model you proved. But what if that model runs on a runtime that can lie? Then the guarantees stop at the runtime's edge. The striking, slightly comic truth here is that a mathematically ironclad implementation can still be undone by an unverified allocator. Who watches the watchers, indeed. With AI-driven fuzzers collapsing the cost of discovery, the industry needs to treat runtimes, FFI boundaries and toolchains as first-class targets for verification and scrutiny. Verified algorithms on top of unverified infrastructure are a half-measure; expect more audits, verified runtimes, and proofs that cover the whole stack, not just the top.

The bigger picture

There’s another takeaway: automated provers and automated attack tools are converging. It has been reported that Anthropic declined to release a vulnerability-discovery model for being “too dangerous,” and that sounds less like fearmongering and more like an industry-wide canary. If finding bugs becomes cheap and fast, software built in an era that relied on obscurity for safety will be exposed. Formal methods remain a powerful tool, but this incident is a reminder that security is a property of the whole system, end to end.

Sources: kirancodes.me, Lobsters