ElevenLabs’ Mati Staniszewski on how voice AI actually works — and where it’s headed

How the audio models work
Mati Staniszewski, co-founder of ElevenLabs, walked John Collison through the nuts and bolts of modern speech generation. These are not mystical black boxes but pipelines that move from text to mel spectrograms to waveforms, and increasingly models that predict audio tokens directly. The lineage runs from Bell Labs’ early signal engineering and phoneme stitching, through the jump to WaveNet- and Tacotron-era neural synthesis, to today’s transformer- and diffusion-style techniques applied to audio. Short answer: the same probabilistic ideas that cracked text are being adapted to the messy, emotional world of speech. But speech is stickier. Tone, timing and breathing, those tiny human things, make conversational audio a tougher nut to crack than static text.
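To make the "mel spectrogram" step more concrete, here is a minimal sketch of the mel scale that such an intermediate representation is built on. It uses the widely cited HTK-style conversion formula; the function names, the band count and the 0 to 8000 Hz range are illustrative assumptions, not details from the interview or from ElevenLabs' models.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_bands spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


# Illustrative values only: 8 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(n_bands=8, f_min=0.0, f_max=8000.0)
for i, c in enumerate(centers):
    print(f"band {i}: {c:.1f} Hz")
```

The bands come out narrow at low frequencies and wide at high ones, roughly mirroring how human hearing resolves pitch. A mel spectrogram sums short-time spectral energy into bands like these, and a separate vocoder stage then turns that compact representation back into a waveform.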
Business model, scale and the product roadmap
Staniszewski described ElevenLabs as moving fast from research to product: APIs, creator tools, licensing and, yes, agentic workflows and music production. It has been reported that the company has reached an $11 billion valuation since its 2022 founding. It is expanding beyond one-off text-to-speech into systems that can act, translate, and hold context over long interactions. The promise is seductive: natural-sounding voices that developers can drop into apps, services and workflows. The challenge? Turning an impressive demo into durable revenue without breaking trust.
The conversational Turing Test and real-world uses
Why has AI “won” text but not conversation? Staniszewski points to latency, context retention, and the unpredictability of live speech. The interview digs into the so-called conversational Turing Test (can a system hold genuinely human-like back-and-forth?) and contrasts cascaded pipelines, which chain speech recognition, a language model and text-to-speech, with end-to-end speech-to-speech approaches. He also foregrounds practical applications: voice agents for farming, healthcare, and customer service, and universal translation that doesn’t sound robotic. It has been reported that Ukraine is using ElevenLabs’ tech for digital government services, a reminder that this technology has immediate, consequential uses.
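One way to see why latency weighs on cascaded designs is a rough sketch of a strictly sequential pipeline, where each stage waits for the previous one and the delays add up. The stage names and every latency figure below are hypothetical placeholders chosen for illustration; they are not measurements of ElevenLabs or of any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    latency_ms: float  # hypothetical placeholder, not a measurement


def cascaded_latency_ms(stages: List[Stage]) -> float:
    """In a strictly sequential cascade, per-stage delays simply add."""
    return sum(stage.latency_ms for stage in stages)


# Assumed three-stage cascade: speech in, text in the middle, speech out.
cascade = [
    Stage("speech_to_text", 200.0),
    Stage("language_model", 400.0),
    Stage("text_to_speech", 250.0),
]

total = cascaded_latency_ms(cascade)
slowest = max(cascade, key=lambda stage: stage.latency_ms)
print(f"total: {total:.0f} ms, slowest stage: {slowest.name}")
```

Real systems usually stream partial results between stages, so the effective delay can be lower than this simple sum. The sketch only shows the structural point: a cascade has several hand-offs where delay can accumulate and where tone and timing can be lost when audio is reduced to text, which is part of the appeal of speech-to-speech models.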
Promise and pitfalls
There’s a human core to all this: people expect voices to feel alive. Staniszewski admitted we still can’t get phones to read a PDF “properly” — a small frustration that belies huge technical gaps. As ElevenLabs and others race forward in the post-LLM era, the question isn’t just who builds the best model, but who builds the most responsible one. Voice can be magic — or a weapon. The design choices companies make now will shape whether the next wave of AI feels like a helpful neighbor, or a convincingly empty echo.
Sources: cheekypint.substack.com