Barge-In on Twilio Phone Calls: Why 8kHz Breaks It

Your agent yields cleanly when you interrupt it in the browser demo. Then it ships to a Twilio phone number and the interruption handling falls apart: it keeps talking over people, or it stops itself for no reason. The reason is that phone audio is a different, harsher signal than the microphone in your laptop, and almost every barge-in bug on real traffic traces back to that gap.

The short version

  • The browser demo lies because WebRTC removes the bot's echo and runs at 48kHz. The phone gives you neither.
  • Telephony is 8000 Hz narrowband, which weakens the exact cues a turn detector uses to notice the caller.
  • The bot's own voice can leak back into the STT stream and trigger false self-interruptions.
  • The agent that keeps talking after the interrupt often decided to yield but never flushed the audio already queued for playback.
  • You can only trust a barge-in number if you measure it on the 8kHz audio your customers actually hear.

Why the browser demo passes and the phone call fails

When you test barge-in with getUserMedia in a browser, the WebRTC audio pipeline quietly does two enormous favors for you. It enables acoustic echo cancellation by default, so the bot's own voice coming out of your speakers is removed from your microphone before it ever reaches speech-to-text. And it captures wideband audio, commonly 48000 Hz, so a voice activity detector sees crisp, full-spectrum speech.

A real phone call gives you neither. Over the public telephone network, audio is carried by G.711 at an 8000 Hz sample rate, mono. The telephony network path itself narrows the signal further, to a passband of roughly 300 to 3400 Hz, a limit carried over from analog transmission and channel filtering in the network rather than something G.711's encoding does on its own. On Twilio Media Streams the inbound frames arrive as 8kHz mu-law, base64 encoded, in 20 millisecond chunks. There is no echo canceller in that path unless you add one. So the clean signal your turn detector was tuned against does not exist in production. Every pitfall below is a consequence of that one fact.
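To make the format concrete, here is a minimal sketch of turning one inbound Media Streams message into 16-bit PCM that a VAD can consume. The message shape (an `event` of `media` with a base64 `payload` under `media`) follows Twilio's Media Streams messages; the function names are illustrative, and the mu-law expansion is the standard G.711 decode written out in pure Python so it has no dependencies.

```python
import base64
import json
import struct
from typing import Optional


def mulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values once; decoding is then one table lookup per byte.
_MULAW_TABLE = [mulaw_byte_to_pcm16(b) for b in range(256)]


def decode_media_message(raw_json: str) -> Optional[bytes]:
    """Return little-endian PCM16 bytes for a media message, else None."""
    msg = json.loads(raw_json)
    if msg.get("event") != "media":
        return None                    # start, stop, mark, and so on
    mulaw = base64.b64decode(msg["media"]["payload"])
    samples = [_MULAW_TABLE[b] for b in mulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

A 20 millisecond chunk at 8000 Hz is 160 mu-law bytes, so each decoded frame is 320 bytes of PCM16.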

Property            | Browser demo               | Real phone call
Sample rate         | 48000 Hz wideband          | 8000 Hz narrowband (G.711)
Usable band         | full speech spectrum       | about 300 to 3400 Hz
Echo cancellation   | on by default (WebRTC)     | none unless you add it
Bot voice in mic    | removed before STT         | can leak into STT
Added latency       | near zero, local           | network round trip plus jitter buffer

Pitfall 1: 8kHz breaks turn detection

Voice activity detectors and turn detectors decide when the caller has started speaking. That decision is the trigger for barge-in. Many of these models were trained or tuned on 16000 Hz wideband speech, and telephony gives them 8000 Hz.

The problem is not just fewer samples; it is which sounds survive. An 8000 Hz sample rate can only represent audio up to 4000 Hz (the Nyquist limit), and the telephony passband cuts off around 3400 Hz. A lot of the energy a detector keys on to catch the onset of speech, the sibilants and fricatives in sounds like s, f, and sh, lives above that line. On a phone those onsets are attenuated or gone, so the detector fires late, fires weakly, or misses the start of the caller entirely. If barge-in is not working at an 8000 Hz sample rate on telephony, this is usually the first place to look.
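A small numeric check of the Nyquist point, as a sketch: sampled at 8000 Hz, a 5000 Hz tone produces exactly the same sample values as an inverted 3000 Hz tone, so the samples themselves cannot carry anything above 4000 Hz. Real telephony filters that content out before sampling rather than letting it alias; the tone frequencies here are arbitrary choices for illustration.

```python
import math

SAMPLE_RATE = 8000  # telephony sample rate in Hz


def sampled_tone(freq_hz: float, n_samples: int = 80) -> list:
    """Sample a unit sine at SAMPLE_RATE for n_samples samples."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


tone_5k = sampled_tone(5000)   # above the 4000 Hz Nyquist limit
tone_3k = sampled_tone(3000)   # its alias inside the representable band

# If the two are mirror images, every pairwise sum is (numerically) zero.
max_diff = max(abs(a + b) for a, b in zip(tone_5k, tone_3k))
print(f"largest |tone_5k + tone_3k| = {max_diff:.2e}")
```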

Naively upsampling 8kHz to 16kHz before the detector does not save you. Interpolation invents samples; it does not recover the high-frequency content that the 8kHz telephony path already discarded. The detector still sees narrowband speech wearing a wideband costume.

The fix: run a VAD that supports 8000 Hz natively and configure it for 8kHz, rather than upsampling into a 16kHz-only model. Silero VAD, for example, exposes an 8000 Hz mode. Then tune your energy and probability thresholds against real telephony recordings, because the noise floor and dynamic range of companded mu-law audio are not the same as a laptop microphone.
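As a starting point for threshold tuning, the sketch below measures per-frame level in dBFS on decoded telephony audio and proposes an energy threshold a fixed margin above the observed noise floor. The percentile and margin are assumptions to adjust against your own recordings, not recommended values, and both function names are hypothetical.

```python
import math
import struct


def frame_dbfs(pcm16: bytes, sample_rate: int = 8000, frame_ms: int = 20) -> list:
    """RMS level of each whole frame in dBFS (0 dBFS = full-scale 16-bit)."""
    samples_per_frame = sample_rate * frame_ms // 1000
    frame_bytes = samples_per_frame * 2          # 16-bit samples
    levels = []
    for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
        samples = struct.unpack(f"<{samples_per_frame}h", pcm16[i:i + frame_bytes])
        rms = math.sqrt(sum(s * s for s in samples) / samples_per_frame)
        # Clamp to 1.0 so digital silence does not produce log10(0).
        levels.append(20 * math.log10(max(rms, 1.0) / 32768.0))
    return levels


def suggest_energy_threshold(levels, floor_percentile=0.2, margin_db=10.0):
    """Noise floor (a low percentile of frame levels) plus a margin, in dBFS."""
    if not levels:
        raise ValueError("no frames to measure")
    ordered = sorted(levels)
    floor = ordered[int(floor_percentile * (len(ordered) - 1))]
    return floor + margin_db
```

Running this over a handful of real calls, rather than one, gives a better picture of how much the noise floor moves between carriers and handsets.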

Pitfall 2: bot audio leaking into the STT stream

This is the pitfall that produces the most confusing symptom: an agent that interrupts itself, stops mid-sentence, or reports a barge-in when nobody spoke. The cause is the bot's own voice getting into the speech-to-text input.

On a phone call the text-to-speech you send out can come back to you two ways. Acoustic echo: the caller is on speakerphone or cheap earbuds, the bot plays out of their speaker, and their microphone picks it back up. Line echo: impedance mismatch at the two-wire to four-wire hybrid conversion in the old copper network reflects part of the signal back down the line. Either way, the bot's voice lands in the inbound leg, speech-to-text transcribes it, and the turn detector reads it as the caller talking. The agent yields to itself. This is why reports of voice agent echo and bot audio leaking into STT are so common.

Remember why the browser never showed you this: WebRTC was cancelling the echo for you. Raw SIP and raw media streams do not.

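How to stop it

Transcribe the caller leg only, add echo cancellation that uses the outbound audio as its reference, and gate the barge-in trigger against your own output.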
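One way to gate the barge-in trigger against your own output is sketched below: while the bot is playing, only count caller frames that are clearly louder than the echo you would expect from the bot's current level, and require several such frames in a row. The class name, the assumed echo return loss, and the margins are all hypothetical values to tune on real calls; this is a heuristic gate, not a substitute for an echo canceller.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInGate:
    min_speech_frames: int = 5         # e.g. 5 x 20 ms = 100 ms sustained
    echo_return_loss_db: float = 20.0  # assumed attenuation of bot audio on return
    echo_margin_db: float = 8.0        # caller must beat expected echo by this much
    _run: int = field(default=0, init=False, repr=False)

    def update(self, caller_is_speech: bool, caller_dbfs: float,
               bot_playing: bool, bot_dbfs: float) -> bool:
        """Feed one frame; return True when a barge-in should fire."""
        if not bot_playing:
            self._run = 0
            return False               # nothing is playing, so nothing to interrupt
        expected_echo = bot_dbfs - self.echo_return_loss_db
        loud_enough = caller_dbfs > expected_echo + self.echo_margin_db
        # Count consecutive qualifying frames; any miss resets the run.
        self._run = self._run + 1 if (caller_is_speech and loud_enough) else 0
        return self._run >= self.min_speech_frames
```

In practice `bot_playing` should stay true for a short tail after the last outbound chunk, since the echo arrives later than the audio was sent.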

Pitfall 3: mobile mic echo defeats the canceller

Even with an echo canceller in place, mobile calls are the hardest case, and mobile echo interrupting the voice agent is a distinct failure from line echo. On speakerphone the loudspeaker and the microphone are inches apart, so the bot's voice is loud in the mic. Worse, an echo canceller can only remove echo whose delay falls inside its tail length. On a VoIP-to-mobile path the round trip plus the jitter buffer can push the echo delay past that tail, and whatever the canceller cannot model comes through as residual echo. That residue is enough for STT to latch onto and for the turn detector to call a barge-in.
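The tail-length constraint is simple arithmetic, sketched here with made-up numbers. The function name and every value in the usage lines are assumptions for illustration; measure the real delays on your own call path.

```python
def residual_echo_risk(network_rtt_ms: float, jitter_buffer_ms: float,
                       device_ms: float, canceller_tail_ms: float):
    """Return (echo_delay_ms, exceeds_tail) for one call path.

    The echo delay seen by a server-side canceller is roughly the network
    round trip plus jitter buffering plus the handset's own processing delay.
    """
    echo_delay_ms = network_rtt_ms + jitter_buffer_ms + device_ms
    return echo_delay_ms, echo_delay_ms > canceller_tail_ms


# Hypothetical wired desk call: short, stable path.
print(residual_echo_risk(40, 20, 10, canceller_tail_ms=128))
# Hypothetical VoIP-to-mobile call: the delay outruns the same tail.
print(residual_echo_risk(180, 60, 30, canceller_tail_ms=128))
```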

This is why a barge-in test on a wired headset at your desk is not representative. The wired path has short, stable echo delay and no speakerphone coupling. Your customers on mobile speakerphone in a noisy car are the population that actually breaks. You have to test on their audio, not yours.

Pitfall 4: half-duplex and jitter delay the yield

Even when detection is correct, the phone path adds delay between the caller speaking and the agent going quiet. Two mechanisms stack up.

Half-duplex behavior. Some carrier paths and echo suppressors attenuate the reverse direction while one party is talking. If your TTS is playing outbound, the caller's inbound barge-in speech can be ducked by the network at exactly the moment it matters, so it reaches your detector quieter and later than it was spoken.

Network jitter. RTP packets arrive unevenly and get held in a jitter buffer before playout. That buffer is latency added in front of your VAD. The caller's first word is already tens of milliseconds old by the time your detector sees it.

Now the part that produces "the agent keeps talking after the user interrupts on Twilio." The yield loop is: caller speaks, audio crosses the network and jitter buffer, VAD fires, your orchestrator decides to stop, it stops generating new TTS. But the audio you already sent is still queued for playback on Twilio's side. If you only stop generating and do not flush that buffer, the caller keeps hearing the bot for the length of whatever was queued, which can be hundreds of milliseconds to a couple of seconds.

The fix: on Twilio Media Streams, send a clear event on the stream when you yield, which drops the buffered outbound audio immediately. Stopping your generator is not enough by itself. Use mark events to know when playback of a chunk has actually finished, and treat time to yield as an end-to-end number measured from the caller's speech onset to the agent's audio actually going silent, not from when your code decided to stop.
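A minimal sketch of the messages involved, assuming a bidirectional Media Stream. The `media`, `mark`, and `clear` message shapes follow Twilio's Media Streams protocol; the helper names, the `ws` object (anything with an async `send`, such as a connection from the `websockets` library), and the `tts_task` handle are assumptions about your own orchestrator.

```python
import base64
import json


def media_message(stream_sid: str, mulaw_chunk: bytes) -> str:
    """Outbound audio: base64-encoded 8kHz mu-law."""
    payload = base64.b64encode(mulaw_chunk).decode("ascii")
    return json.dumps({"event": "media", "streamSid": stream_sid,
                       "media": {"payload": payload}})


def mark_message(stream_sid: str, name: str) -> str:
    """Sent after audio; Twilio echoes it back once that audio has played."""
    return json.dumps({"event": "mark", "streamSid": stream_sid,
                       "mark": {"name": name}})


def clear_message(stream_sid: str) -> str:
    """Drops audio that is buffered on Twilio's side but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})


async def yield_to_caller(ws, stream_sid: str, tts_task) -> None:
    """Stop generating AND flush what is already queued for playback."""
    tts_task.cancel()                           # stop producing new audio
    await ws.send(clear_message(stream_sid))    # flush the queued audio
```

Sending a mark after each utterance and logging when it comes back gives you a playback-side timestamp to compare against the moment your code decided to stop.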

This exact failure, an agent that decided to yield but keeps talking anyway, is common enough across cascaded pipelines that it gets its own guide: why a voice agent won't stop talking when interrupted.

Pitfall 5: the SIP call that plays the intro then goes silent

A frequently reported and easily misdiagnosed symptom on Twilio and other SIP setups: the intro or greeting plays, and then the call goes silent. It looks like a barge-in or turn-detection bug. It usually is not.

Intro plays then silence almost always means one-way audio. The outbound path works, which is why you hear the greeting, but the inbound RTP never arrives. Common roots are SDP or NAT negotiation failures, a media IP mismatch, symmetric RTP not being honored, a codec mismatch, or the media websocket dropping right after the greeting. If the agent is not receiving the caller's audio, barge-in cannot possibly work, because there is literally no caller signal to detect.

Diagnose before you tune: confirm inbound audio is flowing at all. Log the count and timing of inbound media frames. If they stop after the intro, your problem is the media path, not the turn detector, and no threshold change will fix it.
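A sketch of that logging, assuming you see every raw websocket message from the stream. The class name and the one-second stall threshold are hypothetical; the point is only to count inbound media frames and notice when they stop arriving.

```python
import json
import time


class InboundFrameMonitor:
    """Counts inbound media frames and reports when they stop arriving."""

    def __init__(self, stall_after_s: float = 1.0, clock=time.monotonic):
        self._clock = clock
        self._stall_after_s = stall_after_s
        self._started_at = clock()
        self.frames = 0
        self.last_frame_at = None

    def on_message(self, raw_json: str) -> None:
        msg = json.loads(raw_json)
        if msg.get("event") != "media":
            return                      # ignore start, mark, stop, and so on
        if msg.get("media", {}).get("track", "inbound") != "inbound":
            return
        self.frames += 1
        self.last_frame_at = self._clock()

    def stalled(self) -> bool:
        """True if no inbound frame has arrived within the stall window."""
        reference = self.last_frame_at if self.last_frame_at is not None else self._started_at
        return self._clock() - reference > self._stall_after_s
```

If `stalled()` turns true right after the greeting while outbound audio is still flowing, look at the media path before touching any detector setting.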

How to actually measure barge-in on telephony audio

Here is the point the other five sections build toward. You cannot tune what you cannot measure, and you cannot trust a measurement taken on the wrong audio. A barge-in pass rate from a browser session at 48kHz with echo cancellation tells you almost nothing about your phone traffic. Measure on the 8kHz, codec-compressed audio your customers actually hear.

Step 1: record the two legs separately

Use dual-channel call recordings so the caller and the agent are on separate tracks. Twilio supports this directly by requesting dual-channel recordings, which put the inbound and outbound legs on the left and right channels of the file. A mono mixed recording is nearly useless for this: once both voices are summed into one track you cannot reliably say who was talking during an overlap.
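Splitting such a file into its two legs needs only the standard library. This sketch assumes a 2-channel, 16-bit PCM WAV and makes no claim about which channel holds the caller; check that against a call where you know who spoke first. The function name is illustrative.

```python
import array
import wave


def split_stereo_wav(path: str):
    """Return (sample_rate, channel0_pcm16, channel1_pcm16) as raw bytes."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 2-channel, 16-bit PCM WAV")
        sample_rate = wav.getframerate()
        interleaved = wav.readframes(wav.getnframes())

    # Stereo PCM is interleaved sample by sample: ch0, ch1, ch0, ch1, ...
    samples = array.array("h")
    samples.frombytes(interleaved)
    channel0 = samples[0::2].tobytes()
    channel1 = samples[1::2].tobytes()
    return sample_rate, channel0, channel1
```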

Step 2: run an 8kHz VAD on each channel

Detect speech on the caller channel and the agent channel independently, at the file's native 8000 Hz. That gives you two timelines of speech regions. Barge-in events are the moments the caller's timeline starts a new region while the agent's timeline is still active.

A concrete example of that per-frame check using the webrtcvad package, which requires 16-bit mono PCM in exactly 10, 20, or 30 millisecond frames at 8000, 16000, 32000, or 48000 Hz:

# pip install webrtcvad
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 to 3, 2 is a reasonable default

sample_rate = 8000
frame_ms = 30            # must be 10, 20, or 30
frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 16-bit samples = 2 bytes each

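# pcm16 is assumed to hold raw 16-bit mono PCM bytes for one channel.
speech_flags = []        # one boolean per frame, in order
for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
    frame = pcm16[i:i + frame_bytes]
    is_speech = vad.is_speech(frame, sample_rate)
    speech_flags.append(is_speech)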

That per-frame is_speech call is what the vad() function in the scoring pseudocode below stands in for: run it across a channel and collapse the result into the (start, end) speech regions the scorer needs.
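The collapse step can be sketched as follows. The function name, the hangover that bridges short pauses, and the minimum region length are assumptions to tune, not values from any library.

```python
def frames_to_regions(flags, frame_ms=30, min_speech_ms=90, hangover_ms=150):
    """Collapse per-frame speech flags into (start_s, end_s) regions.

    Gaps of up to hangover_ms inside speech are bridged, and regions
    shorter than min_speech_ms are dropped as likely clicks or noise.
    """
    hangover_frames = hangover_ms // frame_ms
    regions = []
    start = None          # frame index where the current region began
    last_speech = None    # most recent frame index flagged as speech
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > hangover_frames:
            regions.append((start, last_speech + 1))
            start = None
    if start is not None:
        regions.append((start, last_speech + 1))   # audio ended mid-speech

    frame_s = frame_ms / 1000
    return [(s * frame_s, e * frame_s) for s, e in regions
            if (e - s) * frame_ms >= min_speech_ms]
```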

Step 3: score three fixed things per event

For every barge-in event, answer exactly three questions, which are objective and readable straight off the two channels:

  1. Did the agent yield? After the caller's onset, does the agent channel fall silent within a defined window?
  2. How many seconds to yield? The gap from the caller's speech onset to the agent channel actually going quiet.
  3. Did it keep talking over the caller? The overlap duration where both channels carry speech at the same time.

The scoring loop, in pseudocode:

# Dual-channel 8 kHz recording. ch0 = caller, ch1 = agent.
caller = vad(ch0, sample_rate=8000)   # list of (start, end) seconds
agent  = vad(ch1, sample_rate=8000)

for onset in speech_onsets(caller):
    # Only a barge-in if the agent was talking when the caller started.
    if agent_active_at(agent, onset):
        yield_at  = first_agent_silence_after(agent, onset)
        yielded   = yield_at is not None
        seconds   = (yield_at - onset) if yielded else None   # time to yield
        talkover  = overlap_seconds(caller, agent, onset)     # kept talking?
        record(onset, yielded, seconds, talkover)

Run that across a representative set of real calls and you get a per-call scorecard and an aggregate you can actually defend: the share of interruptions the agent honored, the median seconds to yield, and how often it talked over the caller. Those three numbers, measured on production-shaped audio, are what tell you whether barge-in works for your customers.
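For reference, here is one runnable reading of the scoring helpers, operating on lists of (start, end) speech regions in seconds. The helper names mirror the scoring loop; the minimum silence that counts as a yield and the window in which it must happen are assumptions you should fix once and keep constant across calls.

```python
def speech_onsets(regions):
    return [start for start, _ in regions]


def agent_active_at(agent, t):
    return any(start <= t < end for start, end in agent)


def first_agent_silence_after(agent, onset, min_silence_s=0.5, window_s=3.0):
    """Time the agent goes quiet for at least min_silence_s, or None."""
    regions = sorted(agent)
    for i, (start, end) in enumerate(regions):
        if end <= onset:
            continue                                   # finished before the caller spoke
        next_start = regions[i + 1][0] if i + 1 < len(regions) else float("inf")
        if next_start - end >= min_silence_s:          # a real stop, not a breath
            return end if end - onset <= window_s else None
    return None


def overlap_seconds(caller, agent, onset):
    """Seconds of agent speech during the caller region starting at onset."""
    caller_end = next(end for start, end in caller if start == onset)
    return sum(max(0.0, min(caller_end, a_end) - max(onset, a_start))
               for a_start, a_end in agent)


def score_barge_ins(caller, agent):
    events = []
    for onset in speech_onsets(caller):
        if not agent_active_at(agent, onset):
            continue                                   # not a barge-in
        yield_at = first_agent_silence_after(agent, onset)
        events.append({
            "onset": onset,
            "yielded": yield_at is not None,
            "seconds_to_yield": (yield_at - onset) if yield_at is not None else None,
            "talkover_seconds": overlap_seconds(caller, agent, onset),
        })
    return events
```

Feeding it the regions from both channels of one call returns one record per barge-in event, which you can then aggregate across calls.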

A diagnostic checklist

Want this done rigorously on your own calls

Send us 10 to 15 real phone recordings and we score the barge-in

We run this exact measurement on the audio your customers actually hear. Every barge-in event scored on the same three fixed criteria: did the agent yield, seconds to yield, and did it keep talking over the caller. You get a per-call scorecard and an aggregate reliability report in about a week. No integration, nothing connects to your stack, works with recordings from Vapi, Retell, Bland, Pipecat, or self-hosted LiveKit. Recordings are deleted after delivery, never used to train any model, and we will sign an NDA on request. $499, one time.


Frequently asked

Why is barge-in not working at an 8000 sample rate on telephony?

Phone calls run at 8000 Hz over G.711, which represents audio only up to 4000 Hz. The telephony network path narrows that further, to roughly 300 to 3400 Hz, a limit from analog transmission and channel filtering in the network, not from G.711's encoding itself. Consonant energy a VAD keys on, like s, f, and sh, lives above that band and is attenuated, so a detector tuned for 16000 Hz wideband speech sees weaker onsets and fires late or misses the caller. Upsampling to 16kHz does not recover the lost content. Run a VAD configured for 8000 Hz natively and tune it on real telephony recordings.

Why does the agent keep talking after the user interrupts on Twilio?

Usually the orchestrator decided to yield but never flushed the audio already queued for playback. On Twilio Media Streams you must send a clear event to drop buffered outbound audio, otherwise the caller keeps hearing earlier TTS for hundreds of milliseconds, sometimes a couple of seconds, after the stop. Network jitter can also delay when the caller audio reaches the detector. Measuring time to yield on the recording separates the two.

Why does bot audio leak into the STT stream and cause self-interruptions?

Browsers run acoustic echo cancellation by default, so the bot's voice is removed from the mic before STT. Raw telephony has none, so the bot's output returns through acoustic echo on the caller's device or line echo on the network, lands in the inbound leg, and gets transcribed as if the caller spoke. Transcribe the caller leg only, add echo cancellation using the outbound audio as the reference, and gate the barge-in trigger against your own output.

Why does a Twilio SIP call play the intro then go silent?

Intro plays then silence almost always means one-way audio: the outbound path works so you hear the greeting, but inbound RTP never arrives because of SDP or NAT negotiation, a media IP mismatch, symmetric RTP issues, a codec mismatch, or a dropped media websocket. If the agent never receives the caller's audio, barge-in cannot work at all. Confirm inbound frames are flowing before touching the turn detector.

Why does mobile echo interrupt the voice agent?

On mobile speakerphone the loudspeaker and microphone are inches apart, so the bot's voice is loud in the mic. Added VoIP latency and jitter push the echo delay past the tail length an echo canceller can model, so residual echo survives and reaches STT, and the detector reads it as a barge-in. Testing on real mobile recordings, not a wired office headset, is what surfaces it.

How do I test barge-in on real phone audio?

Use dual-channel recordings so caller and agent are on separate tracks, for example Twilio dual-channel recordings. Run an 8000 Hz VAD on each channel, find every point where the caller starts speaking while the agent is still talking, and score three fixed things: did the agent yield, how many seconds it took, and whether it kept talking over the caller. Measure on the same 8kHz audio your customers hear, not a browser capture with echo cancellation. That is exactly the report we produce if you would rather send us the calls.