Zero-Copy GPU Inference from WebAssembly on Apple Silicon

The short version
According to the original write-up, on Apple Silicon a WebAssembly module's linear memory can be shared directly with the GPU: no copies, no serialization, no intermediate buffers. The CPU and GPU read and write the same physical bytes. A Wasm guest fills a matrix in its linear memory, Metal reads it, computes, and writes back, and the guest sees the result through the same pointer. It sounds like witchcraft, but it follows from the hardware: Apple's unified memory gives the CPU and GPU a single shared pool of physical memory.
How the chain is built
The trick is three small links that line up. First: mmap on ARM64 macOS returns regions aligned to the 16 KB page size, which matters because Metal's no-copy buffers require a page-aligned pointer. Second: Metal's MTLDevice.makeBuffer(bytesNoCopy:length:options:deallocator:) can wrap an existing pointer and reportedly returns an MTLBuffer whose contents() pointer equals the original mmap pointer, with no hidden copy. Third: Wasmtime's MemoryCreator trait lets you supply that same mmap region as the runtime's linear memory. Combine the three and both Wasm and the GPU operate on the same bytes.
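The alignment bookkeeping behind the first link can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: the 16 KB host page size is taken from the text above and hard-coded, the 64 KiB figure is the standard WebAssembly page size, and std::alloc is used as a portable stand-in for mmap. The names HOST_PAGE, WASM_PAGE, round_up and is_page_aligned are hypothetical, and the Metal and Wasmtime steps are not shown.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Host page size on ARM64 macOS as stated in the article (assumption: hard-coded).
const HOST_PAGE: usize = 16 * 1024;
/// WebAssembly linear memory is sized in 64 KiB pages.
const WASM_PAGE: usize = 64 * 1024;

/// Round `len` up to the next multiple of `page` (`page` must be a power of two).
fn round_up(len: usize, page: usize) -> usize {
    debug_assert!(page.is_power_of_two());
    (len + page - 1) & !(page - 1)
}

/// True if `ptr` sits exactly on a `page` boundary.
fn is_page_aligned(ptr: *const u8, page: usize) -> bool {
    (ptr as usize) % page == 0
}

fn main() {
    // A 1000x1000 f32 matrix does not fill a whole number of pages,
    // so the region is rounded up to whole Wasm pages first.
    let wanted = 1000 * 1000 * std::mem::size_of::<f32>();
    let len = round_up(wanted, WASM_PAGE);

    // Stand-in for mmap: ask the allocator for a region aligned to the host page.
    let layout = Layout::from_size_align(len, HOST_PAGE).expect("valid layout");
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failed");

    // The two properties a no-copy GPU buffer would depend on:
    // a page-aligned base pointer and a length that is a whole number of pages.
    assert!(is_page_aligned(ptr, HOST_PAGE));
    assert_eq!(len % HOST_PAGE, 0);

    println!("{} bytes requested, {} bytes reserved at {:p}", wanted, len, ptr);

    unsafe { dealloc(ptr, layout) };
}
```

Because a 64 KiB Wasm page is an exact multiple of a 16 KB host page, any linear memory sized in whole Wasm pages also covers a whole number of host pages, which is why a single region can plausibly serve both roles.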
Proof and numbers
The author backs the claim with measurements. The reported change in resident set size (RSS) on the zero-copy path was essentially noise (≈0.03 MB), versus ~16.78 MB on the explicit-copy path, and compute latency was comparable. Pointer identity was verified as well. The payoff is the moment the guest reads back GPU-written results through the very same pointer: no transfer across a bus, no buffer juggling.
Why it matters (and the caveats)
This is the foundation for a project called Driftwood, which aims to exploit the pattern for stateful AI inference; the work is reportedly still early and exploratory. The upshot: on unified-memory machines, Wasm could be the control plane and the GPU the compute plane with near-zero overhead, a compelling model for low-latency, sandboxed AI at the edge. Caveats remain. The approach relies on Apple’s Unified Memory Architecture and on specific runtime and OS behavior, so it isn't a universal recipe and does not carry over to discrete-GPU systems. Still, if it scales and the safety questions get sorted, this could change how we think about deploying accelerated sandboxed code.
Sources: abacusnoir.com, Hacker News