The value of a performance oracle

April 7, 2026

The surprise twist

Matt Keeter ported his Raven bytecode VM to a tail-calling style and found something that made performance folks sit up: the tail-calling Rust interpreter beat a switch-based one and, on some platforms, even outpaced hand-written assembly. He also reported that the same tail-calling design performed poorly when compiled to WebAssembly: 1.2× slower on Firefox, 3.7× slower on Chrome, and 4.6× slower in Wasmtime. His conclusion was that the JITs aren’t smart enough to lower the pattern to optimal machine code. Ouch. Expectations met reality, and reality was blunt.
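To make the two dispatch styles concrete, here is a minimal sketch in Rust. The opcodes, the `Vm` struct, and every function name are invented for illustration and are not taken from Raven. On stable Rust the calls in tail position are not guaranteed to be turned into jumps; the nightly-only `become` keyword is the way to request that guarantee, so treat the second half as the shape of the technique rather than a faithful reproduction of Keeter’s build.

```rust
// Toy bytecode for illustration only; these opcodes are hypothetical,
// not Raven's instruction set.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Mul,
    Halt,
}

// Switch-style dispatch: one loop, one `match` over the current opcode.
fn run_switch(code: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        match code[pc] {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Mul => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a * b);
            }
            Op::Halt => return stack.pop().unwrap(),
        }
        pc += 1;
    }
}

// Tail-call-style dispatch: each opcode is its own function and finishes
// by calling `dispatch` for the next instruction in tail position.
struct Vm<'a> {
    code: &'a [Op],
    stack: Vec<i64>,
}

fn dispatch(vm: &mut Vm, pc: usize) -> i64 {
    match vm.code[pc] {
        Op::Push(v) => op_push(vm, pc, v),
        Op::Add => op_add(vm, pc),
        Op::Mul => op_mul(vm, pc),
        Op::Halt => op_halt(vm),
    }
}

fn op_push(vm: &mut Vm, pc: usize, v: i64) -> i64 {
    vm.stack.push(v);
    // Assumption: stable Rust may or may not compile this call as a jump.
    dispatch(vm, pc + 1)
}

fn op_add(vm: &mut Vm, pc: usize) -> i64 {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a + b);
    dispatch(vm, pc + 1)
}

fn op_mul(vm: &mut Vm, pc: usize) -> i64 {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a * b);
    dispatch(vm, pc + 1)
}

fn op_halt(vm: &mut Vm) -> i64 {
    vm.stack.pop().unwrap()
}

fn run_tail(code: &[Op]) -> i64 {
    let mut vm = Vm { code, stack: Vec::new() };
    dispatch(&mut vm, 0)
}

fn main() {
    // (2 + 3) * 4
    let prog = [Op::Push(2), Op::Push(3), Op::Add, Op::Push(4), Op::Mul, Op::Halt];
    assert_eq!(run_switch(&prog), run_tail(&prog));
    println!("both interpreters agree: {}", run_switch(&prog));
}
```

Both versions panic on a malformed program (a missing `Halt` or an empty stack), which keeps the sketch short. The design point is that the tail-calling form gives the compiler one small function per opcode to optimize, instead of one large loop.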

Re-runs that change the story

Another developer re-ran Keeter’s experiment on an AMD Ryzen Threadripper PRO 5955WX, using nightly toolchains from 7 April 2026 to compare three targets: native rustc, Wasmtime, and Wastrel. The results confirm Keeter’s native and Wasmtime numbers, but Wastrel tells a different story. Relative to the fastest assembly run, Wasmtime showed 4.3× overhead for the switch VM and 6.5× for the tail-calling version, while Wastrel showed only 2.4× for the switch VM and 2.3× for the tail-calling VM. In short: Wasm still costs time, but the size of the penalty depends on the implementation, and tail-calling isn’t the villain it was made out to be.
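For readers who want to produce this kind of "N× overhead" figure themselves, the arithmetic is just a ratio of wall-clock times against the fastest run. The sketch below is one possible way to do it in Rust; `slowdown` and `time_it` are hypothetical helper names, and the 1000 ms and 2400 ms timings are made-up inputs chosen only to reproduce a 2.4× ratio, not measurements from either experiment.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Ratio of a candidate's run time to the baseline's run time.
// A value of 2.4 means the candidate took 2.4 times as long.
fn slowdown(candidate: Duration, baseline: Duration) -> f64 {
    candidate.as_secs_f64() / baseline.as_secs_f64()
}

// Run `f` `runs` times and keep the fastest wall-clock time, which
// reduces noise from other activity on the machine.
// Note: with `runs == 0` this returns `Duration::MAX`.
fn time_it<F: FnMut()>(mut f: F, runs: u32) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}

fn main() {
    // Made-up timings, for illustration only.
    let baseline = Duration::from_millis(1000);
    let candidate = Duration::from_millis(2400);
    println!("slowdown: {:.1}x", slowdown(candidate, baseline));

    // Timing a placeholder workload; a real comparison would run the
    // interpreter under each toolchain instead.
    let best = time_it(
        || {
            black_box((0..1_000_000u64).sum::<u64>());
        },
        5,
    );
    println!("best of 5: {:?}", best);
}
```

Taking the best of several runs is one common choice; a median or a full distribution would be equally defensible, and the ratios only mean something when every run uses the same workload and machine.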

Where the time goes

Digging into the generated code, the re-run author found the usual patterns in Wastrel but noticed repeated reloads of a struct containing the memory address and size, values that could and should live in registers. About 98% of the time is reportedly spent in the single interpreter function under Wastrel, which feels like low-hanging fruit for an optimizing backend such as Cranelift. Pre-compilation in Wasmtime didn’t help. The tail-calling case is further complicated by calling-convention differences: Keeter’s native build uses the preserve_none calling convention so that LLVM can hand more registers to the opcode handlers, a luxury that Wasm’s stack-machine mapping and current JITs may not grant.

So what now?

The emotional sting here is familiar: a neat low-level trick gets punished by an implementation accident, not by theory. Who’s to blame — the language, the VM, or the JIT? Maybe all of the above, maybe none. The takeaway is practical: performance experiments are oracles. They reveal where compilers and runtimes miss opportunities and where engineers should focus their tuning. Want predictable Wasm performance? Keep running the tests, feed the results back to JIT authors, and don’t assume the first result is gospel.

Sources: wingolog.org, Lobsters