Fix a Voice Agent Talking Over the Caller (Pipecat/LiveKit)

Q: Why does my Pipecat agent talk over the user when interrupted?

The usual causes are that allow_interruptions is off on the pipeline params, that a words based interruption strategy is holding the interruption until the STT has emitted enough words, or that the StartInterruptionFrame is reaching the TTS service but the output transport is still draining audio it already buffered. Confirm by lining up the caller speech onset in the recording against when the agent audio actually falls silent, and against your frame logs.

Q: Why won't my LiveKit agent stop talking when interrupted?

Check allow_interruptions on the AgentSession first. If it is on, the two knobs that most often keep the agent talking are min_interruption_duration, which sets how long detected speech must last before it counts, and min_interruption_words, which if set above zero makes the agent wait for the STT to transcribe that many words before it will stop. A short leftover of audio after the stop is expected because audio already published to the room finishes playing out.

Q: What is the difference between min_interruption_duration and min_interruption_words?

min_interruption_duration is measured by the VAD in seconds of continuous speech, so it fires quickly and independently of the words spoken. min_interruption_words is measured by the STT in transcribed words, so it only fires after recognition catches up, which adds the STT latency on top. If you set min_interruption_words above zero, the duration threshold alone will not interrupt, which is why the duration setting can look like it is being ignored.

Q: Why does StartInterruptionFrame fire but the TTS keeps playing?

StartInterruptionFrame is a system frame that has to reach every stage that holds audio, not just the TTS service. If it stops the TTS but the caller still hears the bot, the audio is usually already downstream in the output transport buffer or in the telephony gateway, or a custom processor in the pipeline did not pass the system frame through. Verify allow_interruptions is on, that your processors forward system frames, and that the output transport handles the interruption by clearing its buffer.

Q: Interrupts are slower than I want. How do I cancel bot speech faster?

Time to yield is detection latency plus cancellation latency. Lower detection latency by reducing the VAD start window and, on LiveKit, by keeping min_interruption_words at zero so you are not waiting on the STT. Lower cancellation latency by sending smaller TTS audio chunks so less is buffered ahead, and by making sure the interruption clears the output transport buffer rather than only stopping new synthesis. Measure both halves from a recording before and after each change.

Q: The agent goes silent but still reports that it is speaking. Why?

That is a state desync. The speech handle that represents the current turn finished or was cancelled, but the session state event that flips the agent from speaking to listening did not fire or fired late. Downstream logic that gates on the speaking state then stalls the next turn. Confirm it from the recording plus the logs: agent audio energy is at the noise floor while the state event stays on speaking.

You ship a voice agent. In the demo it feels sharp. Then real callers start interrupting it, and it plows straight through them, finishing a sentence nobody is listening to anymore. The caller talks louder, the agent keeps going, and the call falls apart.

The frustrating part is that "the agent won't stop talking" is not a single failure. In a cascaded stack, the moment a caller barges in has to travel through voice activity detection, a turn or interruption decision, and then all the way back down to whatever is still holding audio: the TTS service and the output transport. A stall or a dropped signal at any one of those stages produces the same audible symptom. So the fix depends entirely on which stage is stalling, and you cannot tell which one by listening alone.

This guide walks the pipeline, shows you how to instrument it so you stop guessing, then maps seven specific symptoms (including the ones people report most: StartInterruptionFrame firing but TTS continuing, and min_interruption_duration looking like it is ignored) onto the exact layer that causes each one.

The pipeline where barge-in actually happens

A cascaded voice agent is a loop. Caller audio comes in, gets turned into text, a turn decision decides whether the caller is really taking the floor, the LLM produces a reply, TTS turns that reply into audio, and the output transport streams it back. Barge-in is what happens when new caller audio arrives while the bottom half of that loop is still emitting the previous reply.

Barge-in success depends on the red path, not the black one. Detection is fast; stopping the audio already in flight is the hard part.

Two clocks decide whether barge-in feels good, and it helps to separate them because they are fixed in different places:

Detection latency: caller starts talking, and how long until your stack decides that is an interruption. This lives in the VAD window and the turn or interruption logic.
Cancellation latency: the decision is made, and how long until the last caller-audible sample of the bot actually stops. This lives in the TTS buffer and the output transport, plus any telephony gateway between you and the caller.

"Time to yield" is the sum of both. When people say interrupts are slower, how do I cancel bot speech faster, they usually tune one clock and ignore the other. You have to measure both.

Instrument it before you guess

Every root cause below is confirmed the same way: line up three timelines and see where they disagree. Caller speech onset, the agent's actual audio envelope, and your framework's interruption events. If you only ever listen to the mixed call, you will keep tuning the wrong clock.

1. Get a recording you can measure

Pull a stereo or dual-channel recording if you possibly can, with the caller on one channel and the agent on the other. A mixed mono file still shows you that talk-over happened, but it makes "when did the agent go quiet" much harder to measure because both voices sit in the same waveform. Most stacks can hand you separate tracks: telephony providers expose dual-channel recording, and the agent frameworks can record the input and output legs.

2. Measure time to yield from the envelope

You do not need anything fancy to get a first number. Compute a short-window energy envelope on each channel, find where the caller channel comes alive while the agent channel is still hot, and then find where the agent channel drops and stays down. The gap is your time to yield for that event.

# first-pass time-to-yield from a stereo call recording
# channel 0 = caller, channel 1 = agent (swap if yours is reversed)
import numpy as np, soundfile as sf

audio, sr = sf.read("call_01.wav")      # shape (n, 2)
caller, agent = audio[:, 0], audio[:, 1]

def envelope(x, sr, win_ms=20):
    w = int(sr * win_ms / 1000)
    frames = len(x) // w
    e = np.array([np.sqrt(np.mean(x[i*w:(i+1)*w]**2)) for i in range(frames)])
    t = np.arange(frames) * win_ms / 1000.0
    return t, e

tc, ec = envelope(caller, sr)
ta, ea = envelope(agent, sr)

thr = 0.02                             # tune to your noise floor
caller_on = ec > thr
agent_on  = ea > thr

for i in range(1, len(tc)):
    if caller_on[i] and agent_on[i]:       # overlap starts: caller cut in
        onset = tc[i]
        j = i
        while j < len(ea) and ea[j] > thr:  # wait for agent to fall silent
            j += 1
        yield_t = ta[j] if j < len(ta) else ta[-1]
        print(f"caller in {onset:.2f}s, agent silent {yield_t:.2f}s, "
              f"time to yield {yield_t-onset:.2f}s")
        break

This is a starting point, not a scorer. A fixed energy threshold trips on breaths and line noise, so for anything you report on, run a real VAD (for example Silero or WebRTC VAD) per channel and hold the "silent" decision across a few frames so a pause inside a word does not read as a yield. But even the crude version tells you the one thing you need next: is the delay before the agent stops, or after the decision to stop.

3. Turn on the framework's interruption logs

Both stacks emit events at exactly the moments you care about. Log them with timestamps and overlay them on the audio.

Pipecat emits frames as the turn changes: UserStartedSpeakingFrame when the VAD calls a turn start, StartInterruptionFrame when it decides to interrupt, and BotStoppedSpeakingFrame when the bot audio is done. The distance from UserStartedSpeakingFrame to StartInterruptionFrame is detection latency; from there to the audio actually stopping is cancellation latency.
LiveKit exposes the agent state and the current speech handle. Watch when agent_state moves off speaking and when the speech handle for the turn is interrupted, and compare those to the recording.

The tellIf the interruption event fires right on time but the audio keeps going, you have a cancellation problem, so look at TTS and the transport. If the audio stops promptly once the event fires but the event fires late, you have a detection problem, so look at the VAD window and the turn logic. That single split points you at the right half of the map below.

Symptom to root cause map

Find your symptom, jump to the layer, confirm it against the recording. The detail for each row is below the table.

Symptom	Layer to check	Root cause
Agent keeps talking for seconds after the caller cuts in	VAD / turn + toggle	Interruptions off, or a words gate is waiting on the STT
`min_interruption_duration` looks ignored	Turn logic + STT	`min_interruption_words` is above zero, so duration alone will not fire
A syllable of the bot leaks out after it stops	TTS chunk + transport buffer	Audio was already synthesized and buffered downstream
`StartInterruptionFrame` fires but TTS keeps going	Output transport / processors	The system frame did not reach or was not handled where the audio sits
Interruptible TTS guard passes, audio still plays	TTS vs transport order	Guard checked after the audio was already handed downstream
Bot repeats the same sentence after yielding	Context aggregation	Truncated turn not committed, or committed twice, out of order
Agent is silent but status still says speaking	Session state events	Speech handle done, but the state event never flipped

Symptom 1

The agent keeps talking for whole seconds after the caller cuts in

Layer: the interruption toggle, then the turn / words gate

Start at the cheapest cause. If allow_interruptions is off, nothing downstream can save you, because the pipeline never even tries to yield. In Pipecat that flag lives on the pipeline params; in LiveKit it lives on the AgentSession and can also be set per turn. Confirm it is on first.

If interruptions are on and the agent still runs for seconds, the common cause is a words gate. LiveKit's min_interruption_words and Pipecat's words-based interruption strategy both hold the interruption until the STT has transcribed enough words. That is a deliberate feature so a "mhm" or a cough does not stop the bot, but it means your time to yield now includes the STT's recognition latency, which on a streaming STT can easily be a second or more after the caller started.

Confirm from the recording

Overlay the caller onset and the interruption event. If the event fires only when the STT emits its first words (not when the VAD first hears speech), a words gate is your delay, and this is the same root cause as symptom 2.

Symptom 2

`min_interruption_duration` vs `min_interruption_words`: the duration looks like it is being ignored

Layer: the turn decision, gated on the STT

This one confuses people because both knobs sound like they gate the same thing, but they are measured by different components on different clocks.

min_interruption_duration is measured by the VAD, in seconds of continuous speech. It is fast and does not care what the caller said.
min_interruption_words is measured by the STT, in transcribed words. It cannot be satisfied until recognition catches up, so it always lands later than the VAD.

If you set min_interruption_words above zero, the words condition is the binding one. Your carefully lowered min_interruption_duration is doing exactly what it says, but the agent still will not stop until the word count is met, so from the outside the duration knob looks ignored. It is not ignored, it is just no longer the gate.

FixDecide what the words gate is buying you. If you need fast barge-in and can tolerate the odd backchannel stopping the bot, set min_interruption_words to 0 and control sensitivity with the VAD and min_interruption_duration instead. If you keep a words gate, accept that your floor for time to yield is roughly the STT's first-token latency, and measure it so the number is not a surprise. For how to tune this trade-off without over-triggering on backchannels or under-triggering on real interruptions, see turn detection tuning: false vs missed barge-in.

Confirm from the recording

Mark the VAD-based speech onset and the STT's first partial transcript on the timeline. If the yield lands with the transcript and not with the onset, the words gate is in control.

Symptom 3

A short burst of phonemes leaks out after the bot is told to stop

Layer: the TTS chunk size and the buffers downstream of it

The decision to interrupt is correct and on time, but the caller still hears a fraction of a second of the bot, sometimes a whole syllable, before it goes quiet. This is not a logic bug. It is audio that was already produced and already in flight when the stop happened: a TTS chunk that was synthesized ahead, samples sitting in the output transport buffer, and on phone calls the jitter buffer and the telephony gateway between you and the caller. You cannot recall audio that has already left the building.

What you can do is keep less of it in flight. Smaller TTS audio chunks mean less is buffered ahead of the play head, so an interruption discards less pending audio. Make sure the interruption actually flushes the output transport buffer rather than only telling TTS to stop generating new audio, because the already-buffered samples are the ones the caller hears. The tail will never be exactly zero over a phone network, but it should be a small, bounded, consistent number rather than a noticeable chunk of a word.

Confirm from the recording

Measure the agent-channel energy from the interruption event to true silence. A tail under roughly a couple hundred milliseconds is normal buffering. A tail that carries a recognizable syllable or word means audio is being buffered too far ahead, or the transport buffer is not being cleared on interruption.

Symptom 4

Pipecat's `StartInterruptionFrame` fires but the TTS does not stop

Layer: the output transport and any custom processors in the path

StartInterruptionFrame is a system frame. It is meant to travel out of band so it can jump ahead of queued work and reach every stage quickly. But "reach every stage" is the operative phrase: stopping barge-in means the interruption has to land at every processor that is still holding audio, which is the TTS service and the output transport, not just one of them. If the TTS stops producing new audio but the caller still hears the bot, the audio is downstream in the transport, and the transport has to handle the interruption by clearing what it has buffered.

Two things break this in practice. First, a custom processor you inserted into the pipeline that does not pass system frames through, so the interruption dies partway down the chain and never reaches the transport. Second, allow_interruptions being off, in which case the frame you expected may not be generated at all. Read the interruption handling for the exact version you run rather than trusting a blog post, because these internals move between releases.

Confirm from the recording and logs

Log the frame at the point it is emitted and again at the output transport. If it is emitted but never logged at the transport, a processor in between is swallowing it. If it reaches the transport and audio still plays, the transport is not flushing its buffer on interruption.

Symptom 5

The interruptible TTS guard passes, but audio still plays

Layer: where the guard sits relative to the buffered audio

Some TTS integrations wrap generation in an interruptible service that checks an interruption flag before it pushes each audio chunk onward. The idea is right: once an interruption is in progress, stop pushing. The failure is one of placement. If the guard is checked in the TTS stage but the chunks it already pushed are now sitting in the output transport, the guard "passing" only stops future chunks. The ones already downstream are past the guard and still play. A guard on new synthesis does nothing about buffered playback.

The lesson is the same as symptom 4 from a different angle: the stop has to be enforced at the place that is actually emitting sound. A check at the top of the TTS service is necessary but not sufficient. The output transport is the last stage that touches audio, so the interruption has to clear its buffer too, or a guard further up will keep passing while the caller keeps hearing the bot.

Confirm from the recording

If new TTS output stops immediately on interruption but a steady tail keeps playing, the guard is working and the buffered audio is the culprit. Trace the interruption into the transport and confirm the buffer is flushed.

Symptom 6

After it yields, the bot repeats the sentence it was cut off on

Layer: context aggregation and frame ordering

When an interruption lands, the agent's half-spoken reply has to be truncated to what the caller actually heard and written back into the conversation context, so the LLM knows it only got partway through. If that truncation races with the frames that add the user's new turn, or the partial assistant turn is committed twice, or the interruption is processed out of order relative to the context aggregator, the model's next turn can restate the sentence it was interrupted on. From the caller's side the agent yields politely and then says the same thing again, which is arguably worse than not yielding.

This is a state and ordering bug, not an audio bug, so the recording alone will not fully explain it. You need the transcript and the context. Look at what actually got written into the assistant message after the interruption, and whether the user turn landed before or after it.

Confirm from the transcript

Line up the interruption event with the stored conversation turns. If the truncated assistant turn is missing, duplicated, or ordered after the user turn that interrupted it, the context aggregation around interruptions is the cause, not the TTS.

Symptom 7

The agent goes silent, but its status still says it is speaking

Layer: the session state events

Here the audio side actually worked. The bot stopped, the caller is talking, and yet the agent's reported state is still speaking. The speech handle for that turn finished or was cancelled, but the event that flips the session from speaking to listening did not fire, or fired late. That matters because downstream logic often gates on the state: analytics, your own barge-in counters, or the trigger for the next turn. If the state is stuck on speaking, the next turn can stall even though the audio is long gone, and the call feels dead.

Treat the state machine as a first-class thing to test, not a display detail. The audio stopping and the state flipping are two separate events, and they can disagree.

Confirm from the recording and logs

Put the agent-channel energy next to the state events. If the energy is at the noise floor while the state event stays on speaking, you have a desync between the speech handle finishing and the state transition firing.

The knobs, side by side

Once you know which clock is slow, these are the levers. Names and defaults drift between releases, so treat this as a map of what to look for and confirm the exact values against the version you have installed.

Pipecat

allow_interruptions on the pipeline params. If this is off, barge-in cannot happen. Check it first.
VAD window on the analyzer, for example start_secs (how long speech must persist to count as a turn start) and stop_secs (silence before the turn is over). Shorter start_secs lowers detection latency but trips on shorter noises.
Words-based interruption strategy (a minimum-words rule). Powerful for ignoring backchannels, but it puts the STT on the critical path. Drop it if you need the fastest yield.
StartInterruptionFrame handling. Confirm the output transport clears its buffer and that any custom processor forwards system frames.
TTS chunk size. Smaller chunks buffer less ahead, so an interruption discards a shorter tail.

LiveKit

allow_interruptions on the AgentSession, also settable per turn. Off means the agent will finish no matter what.
min_interruption_duration: seconds of detected speech before it counts as an interruption. VAD-timed, so it is your fast lever.
min_interruption_words: words the STT must transcribe first. Above zero it becomes the binding gate and adds STT latency. Set to 0 for the fastest yield.
min_endpointing_delay and max_endpointing_delay: how long to wait before deciding the caller is done. These shape turn-end, and interact with how eagerly the agent takes and gives the floor.
Turn detection model vs plain VAD. A turn model reduces false interruptions but is another stage in the detection clock.
These parameter names live on AgentSession in recent livekit-agents releases and have moved before. Check the docs for the version you have installed rather than assuming these names are current.

A test you can run this afternoon

Guessing is expensive; a repeatable measurement is cheap. Here is a small protocol that turns "it feels like it interrupts badly" into numbers you can move.

Record with barge-in on purpose. Run a handful of calls where you interrupt the agent at known points: early in a long sentence, mid-word, and right as it starts a new sentence. Keep the recordings dual-channel.
Score three fixed things per interruption. Did the agent yield at all, how many seconds it took to yield, and did it keep talking over the caller after the cut-in. Those three are objective and they map straight onto the two clocks above.
Set pass thresholds up front. For example: yields on every event, time to yield under a target you pick for your use case, and no sustained talk-over past the buffered tail. Write the thresholds down before you look at the results so you are not grading on a curve.
Change one knob, re-measure. Move min_interruption_words, or the VAD window, or the TTS chunk size, one at a time, and watch the two clocks separately. If a change helps detection but hurts cancellation, you will only see it because you split the number.

The trap is that a handful of hand-run calls is not your real traffic. Real callers interrupt in messier ways than you do on purpose, so a stack that passes your five scripted barge-ins can still fail on the long tail. The honest version of this test runs across a real batch of your own recorded calls and scores every barge-in event the same way, so the number reflects production, not a demo. For a fuller walk-through of building this measurement, including what counts as an event and how to set thresholds, see how to test voice agent barge-in: 3 objective signals.

Once you have fixed it, verify it

Prove the fix across a real batch of your calls

When you think barge-in is handled, confirm it on production traffic instead of a scripted demo. Send us 10 to 15 of your own recorded calls and we score every barge-in event on the same three fixed criteria this guide uses: did the agent yield, seconds to yield, and did it keep talking over the caller. You get a per-call scorecard and an aggregate reliability read in about a week, and it works with your own recordings from Vapi, Retell, Bland, Pipecat, or self-hosted LiveKit. Recordings are deleted after delivery, never used to train any model, we sign an NDA on request, and it's $499, one time.

Get the barge-in report › Or pay $499 now ›

FAQ

Why does my Pipecat agent talk over the user when interrupted?

Check three things in order. Is allow_interruptions on. Is a words-based interruption strategy holding the interruption until the STT emits enough words. And is StartInterruptionFrame reaching the output transport so buffered audio gets cleared, not just the TTS service. Confirm by overlaying the caller onset, the interruption event, and the agent audio envelope from a recording.

Why won't my LiveKit agent stop talking when interrupted?

Confirm allow_interruptions on the AgentSession first. Then look at min_interruption_duration, which sets how long detected speech must last, and min_interruption_words, which if above zero makes the agent wait for the STT to transcribe that many words before it yields. A short tail of audio after the stop is expected because audio already published to the room finishes playing out.

What is the difference between min_interruption_duration and min_interruption_words?

min_interruption_duration is measured by the VAD in seconds and fires fast. min_interruption_words is measured by the STT in transcribed words and only fires once recognition catches up, adding STT latency. If you set the words value above zero, it becomes the binding gate, which is why the duration setting can look ignored. Set words to 0 when you want the fastest yield.

Why does StartInterruptionFrame fire but the TTS keeps playing?

Because the frame has to reach every stage still holding audio, not only the TTS. If it stops synthesis but the caller still hears the bot, the audio is already in the output transport buffer or the telephony gateway, or a custom processor did not forward the system frame. Confirm the transport clears its buffer on interruption and that your processors pass system frames through.

Interrupts are slower than I want. How do I cancel bot speech faster?

Split time to yield into detection and cancellation. Lower detection by shortening the VAD start window and keeping min_interruption_words at 0 so you are not waiting on the STT. Lower cancellation by using smaller TTS chunks so less audio is buffered ahead, and by flushing the output transport buffer on interruption rather than only stopping new synthesis. Measure both halves before and after each change.

The agent goes silent but still reports that it is speaking. Why?

That is a state desync. The speech handle for the turn finished or was cancelled, but the state event that flips the agent from speaking to listening did not fire or fired late, so downstream logic that gates on the speaking state stalls the next turn. Confirm it by putting the agent-channel energy next to the state events: energy at the noise floor while the state stays on speaking is the tell.

Why Your Voice Agent Keeps Talking Over the Caller (Pipecat and LiveKit Debugging Guide)

The pipeline where barge-in actually happens

Instrument it before you guess

1. Get a recording you can measure

2. Measure time to yield from the envelope

3. Turn on the framework's interruption logs

Symptom to root cause map

The agent keeps talking for whole seconds after the caller cuts in

min_interruption_duration vs min_interruption_words: the duration looks like it is being ignored

A short burst of phonemes leaks out after the bot is told to stop

Pipecat's StartInterruptionFrame fires but the TTS does not stop

The interruptible TTS guard passes, but audio still plays

After it yields, the bot repeats the sentence it was cut off on

The agent goes silent, but its status still says it is speaking

The knobs, side by side

Pipecat

LiveKit

A test you can run this afternoon

Prove the fix across a real batch of your calls

FAQ

`min_interruption_duration` vs `min_interruption_words`: the duration looks like it is being ignored

Pipecat's `StartInterruptionFrame` fires but the TTS does not stop