Voice · November 11, 2025 · 8 min read

Inside the Audio Pipeline: 48kHz Mic to Sub-Second Replies

A round-trip tour of the audio plumbing behind an instant-feeling voice agent: 48kHz mic capture, 16kHz PCM chunks, 24kHz playback, and barge-in.

By Matrix Team

A voice agent lives or dies on one number: the gap between when you stop talking and when it starts. Get it under a second and the thing feels alive. Miss by half a second and every reply lands like a bad phone connection — people start talking over it, the model barges in on itself, and the magic evaporates.

That number isn't set by the model alone. Most of voice AI latency is plumbing: how mic audio gets captured, resampled, chunked, and shipped; how the model's audio gets scheduled back into your speakers; and how fast both sides can stop on a dime when you interrupt. This post traces one round trip through the Matrix browser-direct voice path, end to end, and explains why each knob matters.

If you want the wire-protocol side — the snake_case quirks and the WebSocket auth gotchas — read Barge-In and the Wire-Protocol Gotchas of Real-Time Voice Agents. For the telephony version of all this, see We Put Gemini Live on a Phone Line. Here we stay in the browser, where there's zero server in the audio path and latency is yours to win or lose.

The round trip, at a glance

mic ─▶ AudioContext (~48kHz) ─▶ AudioWorklet ─▶ 16kHz PCM16LE chunks
        ─▶ base64 ─▶ realtime_input.audio ─▶ Gemini Live (WebSocket)
                                                      │
speakers ◀─ AudioBufferSourceNode ◀─ AudioContext(24kHz) ◀─ inlineData (24kHz PCM)

Two AudioContexts, two sample rates, one WebSocket. The browser holds the socket straight to Gemini Live; the backend only mints an ephemeral token to open it. Nothing in the audio path is server-side, so every millisecond you don't waste here is a millisecond off the perceived reply time.

Capture: 48kHz mic to 16kHz PCM

Browsers don't reliably let you pick the mic sample rate. getUserMedia hands you whatever the device runs, usually around 48kHz, though it varies. Gemini Live wants something specific: 16kHz mono, 16-bit PCM, little-endian. So the first job is a resample, and it has to happen off the main thread, or it competes with React for the same event loop and adds jitter.

That's what web/public/worklets/pcm-capture.js is for — an AudioWorklet running on the audio render thread, fed the raw mic stream at the input context's native rate. It does three things per block:

  1. Linear-downsample from the input rate to 16kHz. It walks a fractional read position at ratio = sampleRate / 16000, interpolating between adjacent samples.
  2. Pack to Int16 LE: clamp each float sample to [-1, 1], then scale to the signed 16-bit range (* 0x8000 for negatives, * 0x7fff for positives).
  3. Chunk and post: accumulate into a fixed buffer and postMessage each completed chunk's ArrayBuffer (transferred, not copied) to the main thread, which base64-encodes it and ships it as realtime_input.audio.

// Runs once per render block; `channel` holds that block's mono float samples.
const ratio = sampleRate / this.targetRate; // e.g. 48000 / 16000 = 3
let pos = this.resamplePos; // fractional read position carried from the last block
while (pos < channel.length) {
  const i0 = Math.floor(pos);
  const i1 = Math.min(i0 + 1, channel.length - 1); // don't read past the block
  const frac = pos - i0;
  const s = channel[i0] * (1 - frac) + channel[i1] * frac; // linear interp
  const clamped = Math.max(-1, Math.min(1, s));
  // Asymmetric scale so -1 maps to -32768 and +1 maps to 32767.
  this.outBuf[this.outLen++] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  if (this.outLen >= this.chunkSamples) {
    // Hand the full buffer over and start a fresh one; no intermediate copy.
    const full = this.outBuf;
    this.outBuf = new Int16Array(this.chunkSamples);
    this.outLen = 0;
    this.port.postMessage(full.buffer, [full.buffer]); // transfer, zero-copy
  }
  pos += ratio;
}
this.resamplePos = pos - channel.length; // carry the fraction across blocks

Two details that aren't obvious from the snippet. First, the worklet must be served as a static file (/worklets/pcm-capture.js) and loaded with audioContext.audioWorklet.addModule("/worklets/pcm-capture.js"). If you let a bundler touch it, the output can depend on chunk loaders or runtime helpers that fail inside the worklet's restricted global scope, which has no window or document. Second, resamplePos carries the leftover fraction across render blocks, so the resampler doesn't drift or click at block boundaries.
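To see the carried fraction at work outside a worklet, here is a minimal standalone sketch of the same resample-and-pack step. The name createDownsampler and its closure shape are hypothetical, chosen for illustration; the shipped worklet keeps this state on the processor instance instead.

```javascript
// Hypothetical standalone version of the worklet's resample step.
// `createDownsampler` is an illustrative name, not part of pcm-capture.js.
function createDownsampler(inputRate, targetRate = 16000) {
  const ratio = inputRate / targetRate; // e.g. 48000 / 16000 = 3
  let resamplePos = 0; // fractional read position carried between blocks

  return function downsample(block) {
    const out = [];
    let pos = resamplePos;
    while (pos < block.length) {
      const i0 = Math.floor(pos);
      const i1 = Math.min(i0 + 1, block.length - 1);
      const frac = pos - i0;
      const s = block[i0] * (1 - frac) + block[i1] * frac; // linear interp
      const clamped = Math.max(-1, Math.min(1, s));
      out.push(clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff);
      pos += ratio;
    }
    resamplePos = pos - block.length; // leftover offset into the next block
    return Int16Array.from(out); // fractional values are truncated to int16
  };
}

// Three 128-sample render blocks at 48kHz: 384 samples in, 128 out.
const downsample = createDownsampler(48000);
const block = new Float32Array(128).fill(0.5);
const lengths = [downsample(block), downsample(block), downsample(block)]
  .map((chunk) => chunk.length);
console.log(lengths); // → [ 43, 43, 42 ]
```

A 128-sample block does not divide evenly by 3, so the per-block output count alternates. Without the carried offset every block would restart at position 0 and emit 43 samples, and the output would slowly run ahead of the input.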

Why chunk size is a latency knob

The original design described 100ms (1600-sample) chunks at 16kHz. The shipped worklet went smaller — 480 samples, 30ms — and that's a deliberate latency trade.

Here's the reasoning. Every chunk you hold back is audio the server's voice-activity detector (VAD) hasn't seen yet. Bigger chunks mean fewer messages and less per-frame overhead, but they also mean the server learns later that you've started — or stopped — speaking. Google's own Live API guidance recommends 20–40ms chunks, and 480 samples / 16kHz lands at 30ms, right in that band. Smaller chunks let the server's VAD detect speech start and end faster, shaving real perceived latency at both ends of every turn — faster to notice you talking, faster to notice you've finished and start replying.

Go too small and you pay it back in message overhead and base64 bloat. Thirty milliseconds is the sweet spot the current code settled on: enough frames per second for snappy VAD, few enough to keep the socket calm.
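The trade is easy to put numbers on. The sketch below is plain arithmetic for PCM16 at 16kHz; chunkStats is a hypothetical helper, not something from the codebase.

```javascript
// Illustrative arithmetic only; `chunkStats` is a hypothetical helper.
function chunkStats(chunkSamples, sampleRate = 16000) {
  const rawBytes = chunkSamples * 2; // PCM16: 2 bytes per sample
  const base64Chars = 4 * Math.ceil(rawBytes / 3); // 4 chars per 3 bytes, padded
  return {
    durationMs: (chunkSamples * 1000) / sampleRate,
    messagesPerSecond: sampleRate / chunkSamples,
    rawBytes,
    base64Chars,
  };
}

console.log(chunkStats(480));  // 30 ms, ~33 messages/s, 960 bytes, 1280 base64 chars
console.log(chunkStats(1600)); // 100 ms, 10 messages/s, 3200 bytes, 4268 base64 chars
```

The base64 payload per second comes out about the same either way (roughly 42.7k characters). What grows as chunks shrink is the message count, and with it the per-message framing around each payload.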

Playback: a dedicated 24kHz context, scheduled tail-to-tail

The model talks back at a different rate: 24kHz mono PCM16LE, delivered as serverContent.modelTurn.parts[].inlineData. Resampling that to match the 48kHz capture context would be wasted CPU and a fresh source of clicks. So playback gets its own context, constructed at exactly the rate the model speaks:

const playback = new AudioContext({ sampleRate: 24000 });

No resampling, no rate mismatch, no artifacts. Each inbound chunk becomes an AudioBuffer and is scheduled with AudioBufferSourceNode.start(t), where t is the end time of the previous source. The buffers are stitched tail-to-tail so playback is gapless: there's no setTimeout-driven pacing, no polling, just a running cursor that says "the next sound starts exactly where the last one ends."

This is the cheapest way to keep audio smooth under variable network arrival. Chunks can land bursty or late; as long as the next one arrives before the cursor catches up, the listener hears one continuous voice. Scheduling against the previous source's end time on the audio clock, rather than wall-clock time, is what makes that hold.
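A minimal sketch of that running cursor might look like the following. The names createPlaybackScheduler and enqueue are hypothetical, and the Math.max guard is an assumption about how an underrun is handled: if the queue has run dry, the next chunk starts at the context's current time rather than in the past.

```javascript
// Sketch of a gapless scheduler. `createPlaybackScheduler` and `enqueue` are
// hypothetical names; `ctx` is expected to be an AudioContext created with
// { sampleRate: 24000 }.
function createPlaybackScheduler(ctx, sampleRate = 24000) {
  let nextStartTime = 0; // audio-clock time at which the queued audio ends

  return function enqueue(pcm16) {
    // Convert signed 16-bit samples to the floats Web Audio expects.
    const floats = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) floats[i] = pcm16[i] / 0x8000;

    const buffer = ctx.createBuffer(1, floats.length, sampleRate);
    buffer.getChannelData(0).set(floats);

    const src = ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(ctx.destination);

    // Start where the previous chunk ends, or now if the queue has run dry
    // (assumed underrun handling, not confirmed by the original code).
    const startAt = Math.max(ctx.currentTime, nextStartTime);
    src.start(startAt);
    nextStartTime = startAt + floats.length / sampleRate;
    return src; // returned so the caller can keep track of queued sources
  };
}
```

In the browser this would be wired up as const enqueue = createPlaybackScheduler(new AudioContext({ sampleRate: 24000 })), with each decoded inlineData chunk passed to enqueue as an Int16Array.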

Barge-in: stop the sources, not the counter

The hardest part of feeling instant isn't starting fast — it's stopping fast. When the caller interrupts, the agent has to go quiet immediately, or it talks over the human and the conversation falls apart.

Gemini Live signals this with serverContent.interrupted: true. The server clears its own turn, but that does nothing for audio the browser has already scheduled. Those AudioBufferSourceNodes are committed to the audio clock; they'll play out regardless. The first naive implementation just reset the internal cursor (nextStartTime, the active count) — and the agent appeared to repeat itself two or three times on every interruption, because every queued buffer still fired.

The fix is to keep a Set of every queued source and, on interrupt, hard-stop all of them:

function reset() {
  for (const src of activeSources) {
    src.stop();        // halt playback now
    src.disconnect();  // detach from the graph
  }
  activeSources.clear();
  nextStartTime = 0;
}

That cuts audio within about 50ms. But there's a race left: the server may have already pipelined one more stale chunk before it processed the interrupt. If you play it, the listener hears a fragment of the old turn after they've spoken. If you blindly drop everything, you risk clipping the model's next turn.

The resolution is a ~60ms drop window. For 60ms after interrupted=true, inbound audio chunks are swallowed — long enough to eat the in-flight stale chunk, short enough that the model's next reply still plays immediately. That next reply matters: every persona is instructed to open its post-interrupt turn with a brief verbal acknowledgement — a "haan boliye?", "haan ji bataiye", a short nod — so the user gets audible confirmation they were heard before the substance arrives. The drop window is tuned to let that acknowledgement through while killing the ghost of the previous turn.
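The drop window itself can be a very small piece of state. The sketch below is one way to express it, not the shipped implementation: createInterruptGate is a hypothetical name, and using performance.now() as the clock is an assumption (the real code may measure against the audio clock instead).

```javascript
// Sketch of the post-interrupt drop window. `createInterruptGate` is a
// hypothetical helper; `now` is injectable so the timing can be tested.
function createInterruptGate(dropMs = 60, now = () => performance.now()) {
  let dropUntil = -Infinity; // no window is open until the first interrupt

  return {
    // Call when serverContent.interrupted arrives, alongside reset().
    onInterrupted() {
      dropUntil = now() + dropMs;
    },
    // Call for each inbound audio chunk; false means swallow it.
    shouldPlay() {
      return now() >= dropUntil;
    },
  };
}

// Intended wiring, shown as comments because it depends on the message loop:
//   if (msg.serverContent?.interrupted) { gate.onInterrupted(); reset(); }
//   if (chunk && gate.shouldPlay()) { /* schedule the chunk for playback */ }
```

Keeping the gate separate from reset() means the two concerns stay independent: reset() silences what is already scheduled, and the gate decides what is allowed to be scheduled next.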

Why all three numbers matter together

Perceived latency is the sum of small decisions, and they compound:

  • 30ms capture chunks → the server's VAD sees speech-start and speech-end sooner, so the model starts and stops on tighter boundaries.
  • Native 24kHz playback → no resample stage between socket and speaker, no buffering delay, no clicks to mask.
  • Tail-to-tail scheduling → bursty network arrival never turns into audible gaps, so you never have to over-buffer "just in case."
  • Source-level barge-in + 60ms drop window → interruptions cut clean instead of stacking three turns of stale audio.

Drop any one and the others can't save you. Ship 100ms chunks and you've added perceptible lag at both ends of every turn. Resample 24kHz audio through a 48kHz context and you've bought yourself clicks. Reset a counter instead of stopping the sources and the agent stutters over every interruption. The whole thing has to be tuned as a pipeline, not a stack of independent parts — and most of these settings cost real iterations to find, all catalogued in docs/LEARNINGS.md alongside the rest of the voice journey.

Takeaway

A voice agent feels instant when the plumbing gets out of the model's way. Capture off the main thread, downsample to 16kHz in an AudioWorklet, chunk at ~30ms so the VAD reacts fast, play back at the model's native 24kHz with tail-to-tail scheduling, and make barge-in stop the actual audio sources behind a tiny drop window. None of it is exotic — but every shortcut shows up as latency or stutter, and the listener feels it before they can name it.

In Matrix, the browser-direct voice path puts all of this client-side, with the backend only minting the ephemeral token. There's no server in the audio loop to add a hop, which is exactly why sub-second replies are achievable.

Build a voice agent on it

Want to hear the round trip yourself? Create a workspace, spin up an agent in /orgs/{slug}/admin/agents, and open its /voice page — the same pcm-capture.js worklet and 24kHz playback path described here drives every call. Then read the wire-protocol gotchas before you start tuning, because most of the obvious fixes have already been tried.

