Optimising a Pipelined RISC-V Core: From Naive Pipeline to Near-Superscalar Performance

What the author did
A new deep-dive on a RISC-V blog walks through taking a plain five-stage RV32I(M) core and squeezing performance out of it step by step, ending with a version that reportedly runs within 2.3% of a superscalar design on the CoreMark benchmark. This isn’t hand-wavy theory. According to the post, the cycle counts come from real simulations of the author’s implementation, so the gains shown are the product of targeted microarchitectural work rather than speculative claims.
The journey is familiar to anyone who’s tuned a pipeline: add smarter forwarding, tighten hazard handling, rethink stalls, and pick the low-hanging fruit in decode and execute timing. The author makes the important point that the M extension (hardware multiply/divide) matters a lot for CoreMark; without it, every multiplication expands into a long software sequence and inflates the instruction count. Small pragmatic choices added up, and they added up to a surprising result.
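To make the M-extension point concrete, here is a minimal sketch of the kind of shift-and-add loop that software falls back on when there is no hardware multiplier. It is an illustration only: the function name `soft_mul32` is hypothetical, and this is neither the author's code nor the exact routine a compiler runtime would emit. The iteration count it returns is a rough proxy for cost, since each pass through the loop stands for several base-ISA instructions (a bit test, a conditional add, two shifts, and a branch), whereas a core with the M extension does the same work with a single `mul`.

```python
def soft_mul32(a: int, b: int) -> tuple[int, int]:
    """Multiply two 32-bit values by shift-and-add.

    Returns (low 32 bits of the product, number of loop iterations).
    Illustrative model of a software multiply on a core without M.
    """
    mask = 0xFFFFFFFF
    a &= mask
    b &= mask
    result = 0
    iterations = 0
    while b:
        if b & 1:
            # Add the shifted multiplicand when the current bit is set.
            result = (result + a) & mask
        a = (a << 1) & mask  # next power-of-two multiple of the multiplicand
        b >>= 1              # move on to the next multiplier bit
        iterations += 1
    return result, iterations


if __name__ == "__main__":
    product, loops = soft_mul32(123456, 654321)
    print(f"product (low 32 bits) = {product}, loop iterations = {loops}")
```

The loop runs once per significant bit of the multiplier, so a multiply by a large value can take up to 32 passes. That is why a benchmark with frequent multiplications is so sensitive to whether the hardware multiplier is present.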
Why it matters
So why care? After all, modern high-end CPUs are wide, out-of-order beasts with speculative tricks that dwarf a tiny RV32 core. The point here is not to dethrone Apple or Intel. The point is to map the headroom inside a single-issue design: how far can you push a simple pipeline before the one-instruction-per-cycle ceiling bites? The post gives a useful, empirical answer and a roadmap for designers working in constrained environments — embedded, low-power, or open-source silicon projects.
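The one-instruction-per-cycle ceiling can be put in simple arithmetic. The sketch below assumes an idealised single-issue pipeline in which every instruction takes one cycle plus whatever stall cycles hazards add, and it ignores pipeline fill and drain. The function name `effective_ipc` and the numbers in the usage example are hypothetical and are not taken from the post.

```python
def effective_ipc(instructions: int, stall_cycles: int) -> float:
    """Instructions per cycle for an idealised single-issue pipeline.

    Assumes one cycle per instruction plus `stall_cycles` lost to hazards;
    pipeline fill/drain is ignored. The result can never exceed 1.0.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    if stall_cycles < 0:
        raise ValueError("stall_cycles must be non-negative")
    total_cycles = instructions + stall_cycles
    return instructions / total_cycles


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    for stalls in (250, 100, 0):
        print(f"stalls={stalls}: IPC={effective_ipc(1000, stalls):.3f}")
```

Under this model, forwarding and better hazard handling can only shrink `stall_cycles` toward zero, which moves IPC toward 1.0 but never past it. Going beyond that requires issuing more than one instruction per cycle, which is what a superscalar design does.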
There’s a little thrill in the result. You don’t always expect a humble pipelined core to come within a hair’s breadth of a superscalar competitor on a real benchmark. It’s a reminder that careful microarchitecture and solid measurement still matter. For engineers building open-source silicon or teaching computer architecture, the write-up is a practical case study: not just what to do, but what actually moved the needle.
Sources: mummanajagadeesh.github.io, Lobsters