Mark's Magic Multiply

Quick take
A deep-dive post shared on Hacker News, originally hosted at wren.wtf, drills into single-precision floating-point multiplication for 32-bit embedded cores, and the whole thing reportedly started with an “absolutely ridiculous” trick by Mark Owen. The author frames the work around Xh3sfx, a custom RISC‑V extension meant to be a middle ground between a full FPU and pure software emulation: “firm floating point,” if you will. According to the post, Xh3sfx delivers single-precision addition in 14 cycles and multiplication in 16 cycles for a nominal few hundred gates, a level of performance that turns FP from “oh god why is this so slow” into something that just works for general firmware and light DSP.
Why embedded people care
Why tinker here at all? Because many embedded targets lack hardware FP, so compilers call runtime library routines (libgcc, compiler-rt) to emulate IEEE 754 behavior. Replace those slow runtime stubs with a small acceleration library and you get a large practical win without forking toolchains. The author walks through actual implementations, explains how a handful of specialized ALU ops handle the nasty corner cases, and shows why saving a couple of cycles matters when every instruction counts. The post also lays out Hazard3’s three multiplier configurations, in ascending order of area and speed, and explains why the choice of multiply hardware changes which algorithm is optimal.
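To make the runtime-call mechanism concrete, here is a minimal sketch in C. The function name apply_gain is hypothetical and does not come from the post. The point it illustrates is well established: on a soft-float target, GCC and Clang typically lower a single-precision multiply to a call to the runtime routine __mulsf3 (provided by libgcc or compiler-rt) rather than to an FP instruction.

```c
#include <assert.h>

/* Assumption: float is IEEE 754 binary32, as on typical embedded ABIs. */
static_assert(sizeof(float) == 4, "binary32 float assumed");

/* Hypothetical firmware helper, for illustration only. */
float apply_gain(float sample, float gain)
{
    /* With hardware FP this is a single multiply instruction.
       On a soft-float target the compiler instead emits a call to the
       runtime routine __mulsf3, so this line costs whatever that
       library routine costs. */
    return sample * gain;
}
```

Because the call is inserted by the compiler, the source stays the same whichever routine ends up linked in. That is why a faster replacement routine can help without any toolchain changes.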
The trick and the tradeoffs
So what was Mark Owen’s trick? The post dissects it and, by this account, it’s clever enough to make long-time firmware folks smile and then rub their temples. The write-up explains a baseline multiply that assumes mul and mulh are equally fast, and gives a path to shave cycles if you’re willing to relax exact rounding (the author coyly leaves a ~0.5 ulp fast-path “as an exercise for the reader”). There’s real craftsmanship here: small, well-placed hardware primitives plus algebraic shuffling of bits to avoid heavyweight operations. Floating point remains famously treacherous, and the post even riffs on the State of California’s warning about FP causing confusion. But for engineers who’ve banged their heads on slow software FP, these tricks read like relief, not legerdemain. Want speed without an FPU? This is the kind of engineering that makes it possible.
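For readers who have not seen a software multiply built from low-half and high-half integer products, the following is a generic textbook sketch in C. It is not Mark Owen’s trick and not the post’s implementation. The names sketch_fmul, mul_lo and mul_hi are hypothetical; mul_lo and mul_hi stand in for RISC-V mul and the unsigned high multiply mulhu. The sketch assumes both inputs and the result are normal numbers and ignores zeros, subnormals, infinities and NaNs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the low and high halves of a 32x32 unsigned product. */
static uint32_t mul_lo(uint32_t a, uint32_t b) { return a * b; }
static uint32_t mul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Multiply two normal floats with round-to-nearest-even.
   Assumption: inputs and result are normal; no special cases handled. */
float sketch_fmul(float x, float y)
{
    uint32_t a = float_bits(x), b = float_bits(y);
    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t ea = (int32_t)((a >> 23) & 0xff);
    int32_t eb = (int32_t)((b >> 23) & 0xff);
    assert(ea >= 1 && ea <= 254 && eb >= 1 && eb <= 254);

    /* Restore the implicit leading 1 and left-justify to bit 31. */
    uint32_t ma = ((a & 0x7fffffu) | 0x800000u) << 8;
    uint32_t mb = ((b & 0x7fffffu) | 0x800000u) << 8;

    /* 64-bit product as hi:lo; its leading 1 is at bit 63 or bit 62. */
    uint32_t hi = mul_hi(ma, mb);
    uint32_t lo = mul_lo(ma, mb);

    int32_t exp = ea + eb - 127;
    if (hi & 0x80000000u) {
        exp += 1;                      /* significand product was in [2, 4) */
    } else {
        hi = (hi << 1) | (lo >> 31);   /* normalize so bit 63 is set */
        lo <<= 1;
    }
    assert(exp >= 1 && exp <= 254);

    /* Keep the top 24 bits; round to nearest, ties to even. */
    uint32_t mant = hi >> 8;
    uint32_t round = (hi >> 7) & 1u;
    uint32_t sticky = ((hi & 0x7fu) | lo) != 0;
    if (round && (sticky || (mant & 1u)))
        mant += 1;

    /* Subtracting the implicit 1 and adding lets a rounding carry
       bump the exponent field automatically. */
    return bits_float(sign | (((uint32_t)exp << 23) + (mant - 0x800000u)));
}
```

As a usage example, sketch_fmul(1.5f, 1.5f) returns 2.25f. Note that both halves of the product are needed here: the high half supplies the result bits and the low half feeds the sticky bit, which is what makes the rounding exact. That is consistent with the post’s remark that cycles can be saved if exact rounding is relaxed.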
Sources: wren.wtf, Hacker News