BargeBench / guides / voice agent won't stop talking when interrupted
Engineering guide
Why Your Voice Agent Keeps Talking Over the Caller (Pipecat and LiveKit Debugging Guide)
Barge-in looks like one bug, but it is really a timing problem spread across four stages of a cascaded pipeline. This is a symptom to root cause map: for each way a voice agent doesn't stop speaking when the caller talks over it, the layer to check and how to confirm it from the call recording.
You ship a voice agent. In the demo it feels sharp. Then real callers start interrupting it, and it plows straight through them, finishing a sentence nobody is listening to anymore. The caller talks louder, the agent keeps going, and the call falls apart.
The frustrating part is that "the agent won't stop talking" is not a single failure. In a cascaded stack, the moment a caller barges in has to travel through voice activity detection, a turn or interruption decision, and then all the way back down to whatever is still holding audio: the TTS service and the output transport. A stall or a dropped signal at any one of those stages produces the same audible symptom. So the fix depends entirely on which stage is stalling, and you cannot tell which one by listening alone.
This guide walks the pipeline, shows you how to instrument it so you stop guessing, then maps seven specific symptoms (including the ones people report most: StartInterruptionFrame firing but TTS continuing, and min_interruption_duration looking like it is ignored) onto the exact layer that causes each one.
The pipeline where barge-in actually happens
A cascaded voice agent is a loop. Caller audio comes in, gets turned into text, a turn decision decides whether the caller is really taking the floor, the LLM produces a reply, TTS turns that reply into audio, and the output transport streams it back. Barge-in is what happens when new caller audio arrives while the bottom half of that loop is still emitting the previous reply.
Two clocks decide whether barge-in feels good, and it helps to separate them because they are fixed in different places:
- Detection latency: caller starts talking, and how long until your stack decides that is an interruption. This lives in the VAD window and the turn or interruption logic.
- Cancellation latency: the decision is made, and how long until the last caller-audible sample of the bot actually stops. This lives in the TTS buffer and the output transport, plus any telephony gateway between you and the caller.
"Time to yield" is the sum of both. When people say interrupts are slower, how do I cancel bot speech faster, they usually tune one clock and ignore the other. You have to measure both.
Instrument it before you guess
Every root cause below is confirmed the same way: line up three timelines and see where they disagree. Caller speech onset, the agent's actual audio envelope, and your framework's interruption events. If you only ever listen to the mixed call, you will keep tuning the wrong clock.
1. Get a recording you can measure
Pull a stereo or dual-channel recording if you possibly can, with the caller on one channel and the agent on the other. A mixed mono file still shows you that talk-over happened, but it makes "when did the agent go quiet" much harder to measure because both voices sit in the same waveform. Most stacks can hand you separate tracks: telephony providers expose dual-channel recording, and the agent frameworks can record the input and output legs.
2. Measure time to yield from the envelope
You do not need anything fancy to get a first number. Compute a short-window energy envelope on each channel, find where the caller channel comes alive while the agent channel is still hot, and then find where the agent channel drops and stays down. The gap is your time to yield for that event.
# first-pass time-to-yield from a stereo call recording
# channel 0 = caller, channel 1 = agent (swap if yours is reversed)
import numpy as np, soundfile as sf
audio, sr = sf.read("call_01.wav") # shape (n, 2)
caller, agent = audio[:, 0], audio[:, 1]
def envelope(x, sr, win_ms=20):
w = int(sr * win_ms / 1000)
frames = len(x) // w
e = np.array([np.sqrt(np.mean(x[i*w:(i+1)*w]**2)) for i in range(frames)])
t = np.arange(frames) * win_ms / 1000.0
return t, e
tc, ec = envelope(caller, sr)
ta, ea = envelope(agent, sr)
thr = 0.02 # tune to your noise floor
caller_on = ec > thr
agent_on = ea > thr
for i in range(1, len(tc)):
if caller_on[i] and agent_on[i]: # overlap starts: caller cut in
onset = tc[i]
j = i
while j < len(ea) and ea[j] > thr: # wait for agent to fall silent
j += 1
yield_t = ta[j] if j < len(ta) else ta[-1]
print(f"caller in {onset:.2f}s, agent silent {yield_t:.2f}s, "
f"time to yield {yield_t-onset:.2f}s")
break
This is a starting point, not a scorer. A fixed energy threshold trips on breaths and line noise, so for anything you report on, run a real VAD (for example Silero or WebRTC VAD) per channel and hold the "silent" decision across a few frames so a pause inside a word does not read as a yield. But even the crude version tells you the one thing you need next: is the delay before the agent stops, or after the decision to stop.
3. Turn on the framework's interruption logs
Both stacks emit events at exactly the moments you care about. Log them with timestamps and overlay them on the audio.
- Pipecat emits frames as the turn changes:
UserStartedSpeakingFramewhen the VAD calls a turn start,StartInterruptionFramewhen it decides to interrupt, andBotStoppedSpeakingFramewhen the bot audio is done. The distance fromUserStartedSpeakingFrametoStartInterruptionFrameis detection latency; from there to the audio actually stopping is cancellation latency. - LiveKit exposes the agent state and the current speech handle. Watch when
agent_statemoves offspeakingand when the speech handle for the turn is interrupted, and compare those to the recording.
Symptom to root cause map
Find your symptom, jump to the layer, confirm it against the recording. The detail for each row is below the table.
| Symptom | Layer to check | Root cause |
|---|---|---|
| Agent keeps talking for seconds after the caller cuts in | VAD / turn + toggle | Interruptions off, or a words gate is waiting on the STT |
min_interruption_duration looks ignored |
Turn logic + STT | min_interruption_words is above zero, so duration alone will not fire |
| A syllable of the bot leaks out after it stops | TTS chunk + transport buffer | Audio was already synthesized and buffered downstream |
StartInterruptionFrame fires but TTS keeps going |
Output transport / processors | The system frame did not reach or was not handled where the audio sits |
| Interruptible TTS guard passes, audio still plays | TTS vs transport order | Guard checked after the audio was already handed downstream |
| Bot repeats the same sentence after yielding | Context aggregation | Truncated turn not committed, or committed twice, out of order |
| Agent is silent but status still says speaking | Session state events | Speech handle done, but the state event never flipped |
Symptom 1
The agent keeps talking for whole seconds after the caller cuts in
Layer: the interruption toggle, then the turn / words gate
Start at the cheapest cause. If allow_interruptions is off, nothing downstream can save you, because the pipeline never even tries to yield. In Pipecat that flag lives on the pipeline params; in LiveKit it lives on the AgentSession and can also be set per turn. Confirm it is on first.
If interruptions are on and the agent still runs for seconds, the common cause is a words gate. LiveKit's min_interruption_words and Pipecat's words-based interruption strategy both hold the interruption until the STT has transcribed enough words. That is a deliberate feature so a "mhm" or a cough does not stop the bot, but it means your time to yield now includes the STT's recognition latency, which on a streaming STT can easily be a second or more after the caller started.
Overlay the caller onset and the interruption event. If the event fires only when the STT emits its first words (not when the VAD first hears speech), a words gate is your delay, and this is the same root cause as symptom 2.
Symptom 2
min_interruption_duration vs min_interruption_words: the duration looks like it is being ignored
Layer: the turn decision, gated on the STT
This one confuses people because both knobs sound like they gate the same thing, but they are measured by different components on different clocks.
min_interruption_durationis measured by the VAD, in seconds of continuous speech. It is fast and does not care what the caller said.min_interruption_wordsis measured by the STT, in transcribed words. It cannot be satisfied until recognition catches up, so it always lands later than the VAD.
If you set min_interruption_words above zero, the words condition is the binding one. Your carefully lowered min_interruption_duration is doing exactly what it says, but the agent still will not stop until the word count is met, so from the outside the duration knob looks ignored. It is not ignored, it is just no longer the gate.
min_interruption_words to 0 and control sensitivity with the VAD and min_interruption_duration instead. If you keep a words gate, accept that your floor for time to yield is roughly the STT's first-token latency, and measure it so the number is not a surprise. For how to tune this trade-off without over-triggering on backchannels or under-triggering on real interruptions, see turn detection tuning: false vs missed barge-in.Mark the VAD-based speech onset and the STT's first partial transcript on the timeline. If the yield lands with the transcript and not with the onset, the words gate is in control.
Symptom 3
A short burst of phonemes leaks out after the bot is told to stop
Layer: the TTS chunk size and the buffers downstream of it
The decision to interrupt is correct and on time, but the caller still hears a fraction of a second of the bot, sometimes a whole syllable, before it goes quiet. This is not a logic bug. It is audio that was already produced and already in flight when the stop happened: a TTS chunk that was synthesized ahead, samples sitting in the output transport buffer, and on phone calls the jitter buffer and the telephony gateway between you and the caller. You cannot recall audio that has already left the building.
What you can do is keep less of it in flight. Smaller TTS audio chunks mean less is buffered ahead of the play head, so an interruption discards less pending audio. Make sure the interruption actually flushes the output transport buffer rather than only telling TTS to stop generating new audio, because the already-buffered samples are the ones the caller hears. The tail will never be exactly zero over a phone network, but it should be a small, bounded, consistent number rather than a noticeable chunk of a word.
Measure the agent-channel energy from the interruption event to true silence. A tail under roughly a couple hundred milliseconds is normal buffering. A tail that carries a recognizable syllable or word means audio is being buffered too far ahead, or the transport buffer is not being cleared on interruption.
Symptom 4
Pipecat's StartInterruptionFrame fires but the TTS does not stop
Layer: the output transport and any custom processors in the path
StartInterruptionFrame is a system frame. It is meant to travel out of band so it can jump ahead of queued work and reach every stage quickly. But "reach every stage" is the operative phrase: stopping barge-in means the interruption has to land at every processor that is still holding audio, which is the TTS service and the output transport, not just one of them. If the TTS stops producing new audio but the caller still hears the bot, the audio is downstream in the transport, and the transport has to handle the interruption by clearing what it has buffered.
Two things break this in practice. First, a custom processor you inserted into the pipeline that does not pass system frames through, so the interruption dies partway down the chain and never reaches the transport. Second, allow_interruptions being off, in which case the frame you expected may not be generated at all. Read the interruption handling for the exact version you run rather than trusting a blog post, because these internals move between releases.
Log the frame at the point it is emitted and again at the output transport. If it is emitted but never logged at the transport, a processor in between is swallowing it. If it reaches the transport and audio still plays, the transport is not flushing its buffer on interruption.
Symptom 5
The interruptible TTS guard passes, but audio still plays
Layer: where the guard sits relative to the buffered audio
Some TTS integrations wrap generation in an interruptible service that checks an interruption flag before it pushes each audio chunk onward. The idea is right: once an interruption is in progress, stop pushing. The failure is one of placement. If the guard is checked in the TTS stage but the chunks it already pushed are now sitting in the output transport, the guard "passing" only stops future chunks. The ones already downstream are past the guard and still play. A guard on new synthesis does nothing about buffered playback.
The lesson is the same as symptom 4 from a different angle: the stop has to be enforced at the place that is actually emitting sound. A check at the top of the TTS service is necessary but not sufficient. The output transport is the last stage that touches audio, so the interruption has to clear its buffer too, or a guard further up will keep passing while the caller keeps hearing the bot.
If new TTS output stops immediately on interruption but a steady tail keeps playing, the guard is working and the buffered audio is the culprit. Trace the interruption into the transport and confirm the buffer is flushed.
Symptom 6
After it yields, the bot repeats the sentence it was cut off on
Layer: context aggregation and frame ordering
When an interruption lands, the agent's half-spoken reply has to be truncated to what the caller actually heard and written back into the conversation context, so the LLM knows it only got partway through. If that truncation races with the frames that add the user's new turn, or the partial assistant turn is committed twice, or the interruption is processed out of order relative to the context aggregator, the model's next turn can restate the sentence it was interrupted on. From the caller's side the agent yields politely and then says the same thing again, which is arguably worse than not yielding.
This is a state and ordering bug, not an audio bug, so the recording alone will not fully explain it. You need the transcript and the context. Look at what actually got written into the assistant message after the interruption, and whether the user turn landed before or after it.
Line up the interruption event with the stored conversation turns. If the truncated assistant turn is missing, duplicated, or ordered after the user turn that interrupted it, the context aggregation around interruptions is the cause, not the TTS.
Symptom 7
The agent goes silent, but its status still says it is speaking
Layer: the session state events
Here the audio side actually worked. The bot stopped, the caller is talking, and yet the agent's reported state is still speaking. The speech handle for that turn finished or was cancelled, but the event that flips the session from speaking to listening did not fire, or fired late. That matters because downstream logic often gates on the state: analytics, your own barge-in counters, or the trigger for the next turn. If the state is stuck on speaking, the next turn can stall even though the audio is long gone, and the call feels dead.
Treat the state machine as a first-class thing to test, not a display detail. The audio stopping and the state flipping are two separate events, and they can disagree.
Put the agent-channel energy next to the state events. If the energy is at the noise floor while the state event stays on speaking, you have a desync between the speech handle finishing and the state transition firing.
The knobs, side by side
Once you know which clock is slow, these are the levers. Names and defaults drift between releases, so treat this as a map of what to look for and confirm the exact values against the version you have installed.
Pipecat
allow_interruptionson the pipeline params. If this is off, barge-in cannot happen. Check it first.- VAD window on the analyzer, for example
start_secs(how long speech must persist to count as a turn start) andstop_secs(silence before the turn is over). Shorterstart_secslowers detection latency but trips on shorter noises. - Words-based interruption strategy (a minimum-words rule). Powerful for ignoring backchannels, but it puts the STT on the critical path. Drop it if you need the fastest yield.
StartInterruptionFramehandling. Confirm the output transport clears its buffer and that any custom processor forwards system frames.- TTS chunk size. Smaller chunks buffer less ahead, so an interruption discards a shorter tail.
LiveKit
allow_interruptionson theAgentSession, also settable per turn. Off means the agent will finish no matter what.min_interruption_duration: seconds of detected speech before it counts as an interruption. VAD-timed, so it is your fast lever.min_interruption_words: words the STT must transcribe first. Above zero it becomes the binding gate and adds STT latency. Set to 0 for the fastest yield.min_endpointing_delayandmax_endpointing_delay: how long to wait before deciding the caller is done. These shape turn-end, and interact with how eagerly the agent takes and gives the floor.- Turn detection model vs plain VAD. A turn model reduces false interruptions but is another stage in the detection clock.
- These parameter names live on
AgentSessionin recentlivekit-agentsreleases and have moved before. Check the docs for the version you have installed rather than assuming these names are current.
A test you can run this afternoon
Guessing is expensive; a repeatable measurement is cheap. Here is a small protocol that turns "it feels like it interrupts badly" into numbers you can move.
- Record with barge-in on purpose. Run a handful of calls where you interrupt the agent at known points: early in a long sentence, mid-word, and right as it starts a new sentence. Keep the recordings dual-channel.
- Score three fixed things per interruption. Did the agent yield at all, how many seconds it took to yield, and did it keep talking over the caller after the cut-in. Those three are objective and they map straight onto the two clocks above.
- Set pass thresholds up front. For example: yields on every event, time to yield under a target you pick for your use case, and no sustained talk-over past the buffered tail. Write the thresholds down before you look at the results so you are not grading on a curve.
- Change one knob, re-measure. Move
min_interruption_words, or the VAD window, or the TTS chunk size, one at a time, and watch the two clocks separately. If a change helps detection but hurts cancellation, you will only see it because you split the number.
The trap is that a handful of hand-run calls is not your real traffic. Real callers interrupt in messier ways than you do on purpose, so a stack that passes your five scripted barge-ins can still fail on the long tail. The honest version of this test runs across a real batch of your own recorded calls and scores every barge-in event the same way, so the number reflects production, not a demo. For a fuller walk-through of building this measurement, including what counts as an event and how to set thresholds, see how to test voice agent barge-in: 3 objective signals.
Once you have fixed it, verify it
Prove the fix across a real batch of your calls
When you think barge-in is handled, confirm it on production traffic instead of a scripted demo. Send us 10 to 15 of your own recorded calls and we score every barge-in event on the same three fixed criteria this guide uses: did the agent yield, seconds to yield, and did it keep talking over the caller. You get a per-call scorecard and an aggregate reliability read in about a week, and it works with your own recordings from Vapi, Retell, Bland, Pipecat, or self-hosted LiveKit. Recordings are deleted after delivery, never used to train any model, we sign an NDA on request, and it's $499, one time.
FAQ
Why does my Pipecat agent talk over the user when interrupted?
Check three things in order. Is allow_interruptions on. Is a words-based interruption strategy holding the interruption until the STT emits enough words. And is StartInterruptionFrame reaching the output transport so buffered audio gets cleared, not just the TTS service. Confirm by overlaying the caller onset, the interruption event, and the agent audio envelope from a recording.
Why won't my LiveKit agent stop talking when interrupted?
Confirm allow_interruptions on the AgentSession first. Then look at min_interruption_duration, which sets how long detected speech must last, and min_interruption_words, which if above zero makes the agent wait for the STT to transcribe that many words before it yields. A short tail of audio after the stop is expected because audio already published to the room finishes playing out.
What is the difference between min_interruption_duration and min_interruption_words?
min_interruption_duration is measured by the VAD in seconds and fires fast. min_interruption_words is measured by the STT in transcribed words and only fires once recognition catches up, adding STT latency. If you set the words value above zero, it becomes the binding gate, which is why the duration setting can look ignored. Set words to 0 when you want the fastest yield.
Why does StartInterruptionFrame fire but the TTS keeps playing?
Because the frame has to reach every stage still holding audio, not only the TTS. If it stops synthesis but the caller still hears the bot, the audio is already in the output transport buffer or the telephony gateway, or a custom processor did not forward the system frame. Confirm the transport clears its buffer on interruption and that your processors pass system frames through.
Interrupts are slower than I want. How do I cancel bot speech faster?
Split time to yield into detection and cancellation. Lower detection by shortening the VAD start window and keeping min_interruption_words at 0 so you are not waiting on the STT. Lower cancellation by using smaller TTS chunks so less audio is buffered ahead, and by flushing the output transport buffer on interruption rather than only stopping new synthesis. Measure both halves before and after each change.
The agent goes silent but still reports that it is speaking. Why?
That is a state desync. The speech handle for the turn finished or was cancelled, but the state event that flips the agent from speaking to listening did not fire or fired late, so downstream logic that gates on the speaking state stalls the next turn. Confirm it by putting the agent-channel energy next to the state events: energy at the noise floor while the state stays on speaking is the tell.