Compiler fusion rediscovers carry‑save addition — and the math is messier than you’d think

The setup
Reportedly, during a visit to Cornell, Sam presented a set of hardware optimisations for fusing arithmetic operators, and Nate, coming from a compiler background, asked a simple, provocative question: can these be seen as ordinary compiler transformations? TL;DR: yes, but it gets pretty ugly. The punchline is neat: a classic hardware trick, carry‑save addition, can be recovered by a compiler optimisation known as loop fusion. Surprising? A little. Satisfying? Very.
The trick
Carry‑save addition replaces two sequential wide additions (x+y, then +z) with a single layer of full adders running in parallel: for each bit i, a full adder maps x[i], y[i], z[i] to a sum bit s and a carry bit c with 2*c + s = x[i]+y[i]+z[i]. Do n of those in parallel and you reduce three addends to two in constant depth, independent of the width, instead of propagating carries across the whole width twice; one ordinary addition of the sum word and the shifted carry word is still needed at the end. It’s a staple in hardware; the blog shows how programmatically fusing two arrays of full adders can yield the same local, parallel behaviour.
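To make the full‑adder layer concrete, here is a minimal Python sketch. It is not taken from the blog: the names full_adder and carry_save are hypothetical, and integers stand in for n‑bit wires. Each bit position is handled independently, which is what lets hardware do all n positions at once.

```python
def full_adder(x, y, z):
    """One-bit full adder: returns (s, c) with 2*c + s == x + y + z."""
    s = x ^ y ^ z                      # sum bit
    c = (x & y) | (y & z) | (x & z)    # carry bit (majority of the inputs)
    return s, c


def carry_save(x, y, z, n):
    """Reduce three n-bit integers to two words whose sum is x + y + z.

    Illustrative only: no iteration depends on another, so in hardware
    the n full adders run in parallel.
    """
    s_word = 0
    c_word = 0
    for i in range(n):
        s, c = full_adder((x >> i) & 1, (y >> i) & 1, (z >> i) & 1)
        s_word |= s << i
        c_word |= c << (i + 1)         # a carry counts one position higher
    return s_word, c_word
```

For example, carry_save(5, 3, 6, 4) returns a pair of words whose ordinary sum is 14; that last addition is the only step that still propagates carries.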
Loop fusion — and the proof
Loop fusion is the compiler move that combines two passes over an array into one, often saving memory traffic and enabling further local optimisations. Apply fusion to a pair of ripple‑carry adder loops (the straightforward two‑step x+y, then +z), and you get a fused loop that resembles the carry‑save network. But do the two versions produce the same s bitvector? The author doesn’t appeal to high‑level semantics; instead they dive into bit‑level algebra over F2 (XOR as addition, AND as multiplication), where a full adder’s sum bit is x+y+z and its carry bit is MAJ(x,y,z)=xy+yz+xz, and set up an inductive invariant, in the write‑up’s notation: a[i]+b[i]=c[i]+d[i] for all i. The base case is trivial; the inductive step reduces to showing a relation for the next carry term.
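The shape of the argument can be sketched in Python, under some loud assumptions: bit vectors are little‑endian lists, results are taken modulo 2^n, and the function names are hypothetical rather than the author's. The sketch writes out the two‑pass version, the directly fused loop, and a carry‑save‑shaped loop, then compares them by brute force for a small width. That is a sanity check, not a substitute for the inductive proof.

```python
from itertools import product


def full_adder(x, y, z):
    """Full adder over F2: XOR is addition, AND is multiplication."""
    s = x ^ y ^ z
    c = (x & y) ^ (y & z) ^ (x & z)    # MAJ(x,y,z) = xy + yz + xz
    return s, c


def sequential(x, y, z):
    """Two ripple-carry passes: t = x + y, then s = t + z (mod 2^n)."""
    n = len(x)
    t, carry = [0] * n, 0
    for i in range(n):
        t[i], carry = full_adder(x[i], y[i], carry)
    s, carry = [0] * n, 0
    for i in range(n):
        s[i], carry = full_adder(t[i], z[i], carry)
    return s


def fused(x, y, z):
    """Loop fusion: both passes in one loop, with two live carries."""
    n = len(x)
    s, c1, c2 = [0] * n, 0, 0
    for i in range(n):
        t, c1 = full_adder(x[i], y[i], c1)
        s[i], c2 = full_adder(t, z[i], c2)
    return s


def carry_save_shaped(x, y, z):
    """Carry-save form: a carry-free adder on (x, y, z), then one ripple add."""
    n = len(x)
    s, saved, carry = [0] * n, 0, 0
    for i in range(n):
        p, next_saved = full_adder(x[i], y[i], z[i])  # no loop-carried input
        s[i], carry = full_adder(p, saved, carry)     # ripple add of p and shifted carry
        saved = next_saved
    return s


def all_agree(n):
    """Exhaustively compare the three versions on every n-bit input triple."""
    vectors = [list(v) for v in product((0, 1), repeat=n)]
    return all(
        sequential(x, y, z) == fused(x, y, z) == carry_save_shaped(x, y, z)
        for x in vectors for y in vectors for z in vectors
    )
```

Calling all_agree(3) runs all 512 input triples. The only difference between fused and carry_save_shaped is which three bits feed the first full adder, and that regrouping is exactly what the bitwise induction has to justify.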
Verdict
The conclusion: local, algebraic reasoning — an inductive proof over bit equations — shows the fused program is equivalent to the sequential one. It’s satisfying: a purely compiler‑level transformation reproduces a well‑known hardware speedup. But be warned — the work is fiddly. As the author puts it, the derivation “gets pretty ugly”; you trade neat intuition for a hair‑raising bitwise induction. For anyone interested in hardware/software co‑design or the boundaries of compiler reasoning, the write‑up is worth a read.
Sources: natetyoung.github.io, Hacker News