New ARM trick promises faster character-matching for JSON parsing

The problem — tiny, boring, embarrassingly parallel
Parsing JSON is mostly busy work: find quotes, skip whitespace, spot the structural characters { } [ ] : and the comma. The trick is doing that for many bytes at once. For example, in a 16‑byte slice you might want two 16‑bit masks, one bit per byte: one marking whitespace and one marking structural characters, e.g. 0b1000000100000001 (33025) and 0b0000000010000010 (130). Simple to state. Hard to do at blazing speed across gigabytes of input.
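To make the two masks concrete, here is a minimal scalar sketch in Python. The function name `classify16`, the sample input, and the convention that bit i corresponds to byte i are assumptions made for illustration; the article does not specify a bit order. This is the slow byte-at-a-time baseline that the vectorized approaches are trying to beat, not anyone's production code.

```python
# JSON whitespace: space, tab, line feed, carriage return.
WHITESPACE = frozenset(b" \t\n\r")
# JSON structural characters.
STRUCTURAL = frozenset(b"{}[]:,")


def classify16(chunk: bytes) -> tuple:
    """Return (whitespace_mask, structural_mask) for a 16-byte chunk.

    Assumed convention: bit i of each mask describes chunk[i].
    """
    if len(chunk) != 16:
        raise ValueError("expected exactly 16 bytes")
    ws_mask = 0
    st_mask = 0
    for i, byte in enumerate(chunk):
        if byte in WHITESPACE:
            ws_mask |= 1 << i
        elif byte in STRUCTURAL:
            st_mask |= 1 << i
    return ws_mask, st_mask


# Hypothetical sample: spaces at bytes 0, 8, 15; '{' at byte 1; ':' at byte 7.
ws, st = classify16(b' {"abc": 123456 ')
print(f"{ws:016b} {ws}")  # 1000000100000001 33025
print(f"{st:016b} {st}")  # 0000000010000010 130
```

The loop takes one branch per byte, which is exactly the cost the SIMD versions avoid.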
Until recently, fast libraries such as simdjson relied on ARM NEON and a clever “split-the-byte-into-nibbles and table-lookup” trick (Langdale & Lemire, 2019). Split each byte into its low and high nibble, look each nibble up in a small 16-entry table with a SIMD table-lookup instruction, AND the two results together, and voilà: branch-free classification. Nice and portable across AArch64 chips, but it’s still tied to NEON’s fixed 128‑bit registers. Want wider vectors or nicer mask ops? Tough luck.
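The nibble trick is easier to see one byte at a time. The sketch below emulates it in plain Python: two 16-entry tables, one indexed by the low nibble and one by the high nibble, whose entries are ANDed to produce a small class code. The table values and the bit assignments here were worked out for this illustration and are not claimed to be the exact constants any library ships; in real SIMD code each table lookup would handle 16 bytes in one instruction.

```python
# Bit assignments (an assumption for this sketch):
#   1 -> comma, 2 -> colon, 4 -> brackets and braces,
#   8 -> tab / line feed / carriage return, 16 -> space.
STRUCTURAL_BITS = 0b00111
WHITESPACE_BITS = 0b11000

# Indexed by the low nibble (byte & 0x0F).
LOW_TABLE = [16, 0, 0, 0, 0, 0, 0, 0, 0, 8, 10, 4, 1, 12, 0, 0]
# Indexed by the high nibble (byte >> 4).
HIGH_TABLE = [8, 0, 17, 2, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0]


def classify_byte(byte: int) -> int:
    """Two table lookups and one AND; no branches on the byte value."""
    return LOW_TABLE[byte & 0x0F] & HIGH_TABLE[byte >> 4]


def classify_chunk(chunk: bytes) -> tuple:
    """Return (whitespace_mask, structural_mask); bit i describes chunk[i]."""
    ws_mask = 0
    st_mask = 0
    for i, byte in enumerate(chunk):
        code = classify_byte(byte)
        if code & WHITESPACE_BITS:
            ws_mask |= 1 << i
        if code & STRUCTURAL_BITS:
            st_mask |= 1 << i
    return ws_mask, st_mask
```

The tables work because a class bit survives the AND only when both nibbles agree. For example, ':' is 0x3A: the high nibble 3 contributes bit 2 and the low nibble 0xA contributes bits 8 and 2, so only the colon bit remains.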
What’s new — SVE/SVE2 brings AVX‑512 style masks to ARM
Enter SVE and SVE2. These extensions change the model: the vector length is scalable (each chip picks its own width, anywhere from 128 to 2048 bits) and, crucially, they provide predicate (mask) registers similar to AVX‑512's. That lets you load and operate only on the active lanes and treat comparison results as masks directly, which simplifies and speeds up per-byte classification. SVE2 is now available on recent server chips (Graviton4, Microsoft Cobalt 100, Google Axion, NVIDIA Grace) and on many mobile SoCs, and it has been reported that this allows a simpler, faster implementation of the same character-matching task than the old NEON nibble-table dance. It’s not magic; it’s fewer instructions and more expressive masks. Faster, cleaner, and less fiddly.
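The predicate model can be emulated in a few lines of Python. This is only a sketch of the programming style, not SVE code: `whilelt` is a hypothetical helper named after the SVE instruction that builds a "lanes still in range" predicate, `vl` stands in for a vector length that is only known at run time, and `structural_positions` is an illustrative name. It does not show the specific SVE2 instructions behind the reported speed-up, which the article does not name.

```python
STRUCTURAL = frozenset(b"{}[]:,")


def whilelt(start: int, end: int, vl: int) -> list:
    """Emulated predicate: lane i is active while start + i < end."""
    return [start + i < end for i in range(vl)]


def structural_positions(data: bytes, vl: int) -> list:
    """Find structural characters using a vector-length-agnostic loop.

    vl is the (emulated) number of byte lanes per vector.
    """
    positions = []
    for base in range(0, len(data), vl):
        pred = whilelt(base, len(data), vl)
        # Predicated load: inactive lanes read as zero and never
        # touch memory past the end of the buffer.
        lanes = [data[base + i] if pred[i] else 0 for i in range(vl)]
        # Predicated compare: the result is itself a predicate.
        hits = [pred[i] and lanes[i] in STRUCTURAL for i in range(vl)]
        positions.extend(base + i for i in range(vl) if hits[i])
    return positions


doc = b'{"a":[1,2]}'
# The same code gives the same answer for any lane count.
for vl in (16, 32, 64):
    print(vl, structural_positions(doc, vl))  # [0, 4, 5, 7, 9, 10]
```

Note that there is no separate scalar loop for the leftover bytes at the end of the input: the predicate simply switches the out-of-range lanes off.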
A caveat: SVE’s “scalable” register width is a feature for silicon makers but a nuisance for coders, because the lane count is generally not known at compile time and code has to be written to work at any width. Also, Apple reportedly has not adopted SVE2 so far, which leaves a big chunk of the ecosystem sitting out this particular speed-up. Still, among cloud and server ARM chips the new instructions are real, and the practical result appears to be measurable improvements for the vectorized classification used in simdjson, DNS parsing, and similar workloads.
Why it matters
Why care? Because character classification is a microkernel inside parsers and network stacks. Slice off a few cycles per byte and you win back latency or gain throughput. For people who tune parsers the way athletes shave milliseconds, SVE2 is the kind of hardware nudge that turns clean ideas into real-world wins. Performance folks will love the result: fewer branches, better masks, and simpler assembly. For everyone else, well, you’ll just notice JSON parsing getting faster, quietly and efficiently.