Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

April 6, 2026

What it is

Parlor is an on-device, real-time multimodal AI demo that listens, sees, and speaks, all on your machine. According to the project, it uses Google’s Gemma 4 E2B for multimodal understanding and Kokoro TTS for speech output: you talk to the model while showing it your camera feed, and it speaks its reply back, with everything running locally. It is a research preview, so expect rough edges.

How it works (and how fast)

The browser streams PCM audio and JPEG frames over a WebSocket to a FastAPI server. Gemma 4 E2B runs on the GPU via LiteRT-LM and handles both speech and vision understanding, while Kokoro handles TTS with a platform-specific backend (MLX on macOS, ONNX on Linux). Voice activity detection (Silero VAD) runs in the frontend, so the conversation is hands-free. The demo also supports barge-in, and it streams the reply sentence by sentence so audio can start playing before the full response is finished.

On an Apple M3 Pro, the reported timings are roughly 1.8–2.2 s for speech and vision understanding, about 0.3 s to generate around 25 tokens, and about 0.3–0.7 s for TTS, for a total of about 2.5–3.0 s end-to-end.

Requirements: Python 3.12+, macOS on Apple Silicon or Linux with a supported GPU, and roughly 3 GB of free RAM for the model.
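To make the sentence-level streaming idea concrete, here is a minimal sketch of how generated tokens could be grouped into sentences as they arrive, so each sentence can go to TTS without waiting for the whole reply. This is an illustration, not Parlor's actual code: the function name `stream_sentences` and the punctuation-based splitting rule are assumptions made for this example.

```python
import re
from typing import Iterable, Iterator

# Assumed rule for this sketch: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream.

    Hypothetical helper: it buffers incoming text and emits every sentence
    that is known to be finished, keeping the unfinished tail in the buffer.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is followed by a sentence boundary.
        for sentence in parts[:-1]:
            sentence = sentence.strip()
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the model stops generating.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    demo_tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in stream_sentences(demo_tokens):
        # In a real pipeline, each sentence would be passed to the TTS engine here.
        print(sentence)
```

In this sketch a sentence is only emitted once the text after its closing punctuation arrives (or generation ends), and abbreviations such as "e.g." would be split incorrectly, so a real implementation would likely need a more careful boundary rule.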

Why it matters

The project’s author says they run a free voice AI service that helps people learn English, reportedly with hundreds of monthly users, and that six months ago they needed an RTX 5090 just to reach real-time voice. Now smaller models like Gemma 4 E2B make real-time multimodal inference feasible on an M3 Pro, which puts this kind of assistant within reach of people with a laptop rather than a high-end GPU. Imagine pointing your phone at objects and chatting about them, or practicing a language locally without sending audio to the cloud. Sounds like the kind of future OpenAI teased a while back, doesn’t it?

Caveats and where to try it

This is an early experiment and not production-ready. The models (about 2.6 GB for Gemma 4 E2B, plus TTS assets) reportedly download on first run. The repo, Parlor, is on GitHub under the Apache 2.0 license. If you want to poke at it: clone the project, install uv, run the FastAPI server, open localhost:8000, grant mic and camera access, and start talking. Tinkerers only: bring patience, and maybe a snack.

Sources: github.com/fikrikarim, Hacker News