Voice agent engineering guide
Tuning Turn Detection and VAD: False Interruptions vs Missed Interruptions
Almost every voice agent team hits the same two-sided wall. Crank the sensitivity and the agent barges in on the user's own pauses and filler words. Back it off and the agent talks over real interruptions. This guide walks through every lever that moves that line, which direction each one pushes, and how to measure both kinds of error from your own recorded calls so tuning is data-driven instead of guesswork.
The one axis behind both failures
Barge-in is the moment a user starts speaking while the agent is still talking. Getting it right means the agent yields to the user quickly when they genuinely take the turn, and keeps talking when the user is only pausing, thinking, breathing, or dropping in a filler word. Those two goals pull in opposite directions, and that tension is the whole problem.
It helps to name the two errors precisely, because you tune against them by name:
- False barge-in (a false interruption). The agent yields when it should not have. The user paused mid-sentence, said um, dropped a backchannel like mhm or yeah, took a breath, or a TV was playing in the background, and the agent stopped talking or cut itself off. To the user, the agent seems twitchy and keeps abandoning its own sentences.
- Missed barge-in (a missed interruption, also called talk-over). The agent fails to yield when it should have. The user clearly started a new turn, and the agent kept talking over them for a second or more. To the user, the agent seems deaf and rude.
If you have tuned detectors before, this is precision and recall. Every knob in the stack sits somewhere on a single sensitivity axis. Push toward more sensitive and you cut missed barge-ins while adding false barge-ins. Push toward less sensitive and you cut false barge-ins while adding missed barge-ins and slowing the agent down. You cannot drive both errors to zero by turning one knob. You either pick a point on the curve that fits your product, or you move to a smarter detector that shifts the whole curve.
The levers, and which way each one pushes
Here are the controls that actually move barge-in behavior in a modern voice stack, whether you run Vapi, Retell, Bland, Pipecat, or self-hosted LiveKit. The parameter names differ between stacks, but the levers are the same underneath. The table shows what happens when you turn each one up.
| Lever | What it controls | Turn it up and you get | Cost |
|---|---|---|---|
| VAD speech threshold | The speech-probability cutoff a voice activity detector uses to call a frame speech vs non-speech. | Fewer false barge-ins. Soft noise and quiet backchannels stop counting as speech. | More missed barge-ins on soft-spoken users, and more talk-over. |
| Minimum interruption duration | How long detected user speech must last before it counts as an interruption. | Fewer false barge-ins. Coughs, clicks, and single syllables are ignored. | Adds that much latency to every real interruption, and can drop very short ones. |
| Minimum interruption words | How many transcribed words are required before an interruption is allowed to stop the agent. | Fewer false barge-ins. Filler and one-word backchannels no longer trigger a yield. | Waits for the transcriber, so it adds latency, and can miss short real interruptions like stop. |
| Endpointing / EOU silence timeout | How long a silence must last before the user is judged to have finished their turn. | Fewer premature turn-ends. The agent stops jumping in on mid-sentence pauses. | Slower responses at the end of every genuine turn. |
| Semantic turn detector | A model that predicts, from words or prosody, whether the user is actually done vs only pausing. | Shifts the whole curve. Fewer false barge-ins without paying as much missed barge-in. | Adds model inference to the loop, and it can be wrong on unusual phrasing. |
| Allow-interruptions flag | Whether user speech is allowed to stop the agent at all during its turn. | Guarantees no talk-over on the segments where it is on. | If it is off when you did not mean it to be, the agent can never yield. A frequent silent culprit. |
Directions are the reliable part. Exact defaults depend on your stack and version, so confirm them in your own config.
VAD threshold and the raw detector
A voice activity detector answers one narrow question: is there speech in this frame. Silero VAD, which many stacks use, returns a speech probability between 0 and 1 for each short window and compares it to a threshold. Lower the threshold and quieter or noisier audio starts counting as speech, which catches soft interruptions but also fires on background talk and breaths. Raise it and you get the opposite.
The important mental model: a VAD reports speech presence, not turn intent. It does not know whether the user is interrupting, agreeing, or thinking aloud. That is why a VAD alone is a bad barge-in detector, and why the other levers exist on top of it. If Silero VAD is not detecting an interruption, the usual causes are a threshold set too high, a minimum-silence or minimum-speech window that filters the interruption out, or an audio format mismatch. Silero expects 16 kHz mono in a fixed window size (512 samples at 16 kHz, which is 32 ms per window), and feeding it the wrong rate quietly degrades the probabilities.
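# Silero VAD, illustrative defaults. A lower threshold and
# shorter windows = more sensitive = fewer missed, more false.
from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

model = load_silero_vad()
audio = read_audio("call.wav", sampling_rate=16000)  # placeholder path, 16 kHz mono
speech_segments = get_speech_timestamps(
    audio, model,
    threshold=0.5,                # speech-probability cutoff
    min_speech_duration_ms=250,   # ignore blips shorter than this
    min_silence_duration_ms=100,  # silence before ending a segment
    speech_pad_ms=30,             # padding added to each side of a segment
)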
Parameter names from Silero VAD. Values shown are its documented defaults, not a recommendation. Confirm against your version.
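To see how the threshold and the minimum-speech window interact, here is a minimal sketch in plain Python. It is not Silero's implementation: the function `frames_to_segments` and the probability values are hypothetical, and the logic is deliberately simplified (no hysteresis, no padding). The only figure taken from the text is the 32 ms frame, which is 512 samples at 16 kHz.

```python
from typing import List, Tuple

FRAME_MS = 32  # 512 samples at 16 kHz


def frames_to_segments(
    probs: List[float],
    threshold: float = 0.5,
    min_speech_ms: int = 250,
    min_silence_ms: int = 100,
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments."""
    segments: List[Tuple[int, int]] = []
    start = None        # frame index where the open segment began
    silence_frames = 0  # consecutive sub-threshold frames inside a segment

    def close(end_frame: int) -> None:
        # Keep the segment only if it is long enough to count as speech.
        if (end_frame - start) * FRAME_MS >= min_speech_ms:
            segments.append((start * FRAME_MS, end_frame * FRAME_MS))

    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence_frames = 0
        elif start is not None:
            silence_frames += 1
            if silence_frames * FRAME_MS >= min_silence_ms:
                close(i - silence_frames + 1)  # index of first silent frame
                start, silence_frames = None, 0

    if start is not None:  # audio ended while a segment was still open
        close(len(probs) - silence_frames)
    return segments


# Made-up input: a 64 ms click, a gap, then 320 ms of clear speech.
probs = [0.9] * 2 + [0.1] * 5 + [0.8] * 10 + [0.1] * 4
# Made-up input: 320 ms of soft speech that never reaches 0.5.
quiet = [0.4] * 10 + [0.05] * 4
```

With the defaults, the click is dropped and only the longer segment survives. The soft speech is missed at a threshold of 0.5 and caught at 0.3, which is the same trade the table describes: lowering the cutoff recovers soft interruptions but also admits quieter noise.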
Minimum interruption duration and minimum interruption words
These are the two cheapest, highest-leverage knobs for the "turn detection too sensitive, keeps interrupting the user" problem. A minimum interruption duration says the user must keep speaking for, say, a few hundred milliseconds before the agent yields, which throws away coughs, clicks, and stray syllables. A minimum interruption words count says the transcriber must produce at least N words before the agent yields, which is how you get interruption detection to ignore filler words and one-word backchannels.
The trade is direct and worth stating plainly. Both add latency to real interruptions, because the agent now waits to be sure. A duration gate adds a fixed delay. A words gate adds however long transcription takes to emit those words, and if you set it to two or three words you will miss short but genuine interruptions like stop, no, or wait. Many teams land on a small duration gate plus a words gate of zero or one, then let a semantic detector handle the harder filler cases.
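The two gates can be sketched as a single check, shown here as a rough illustration rather than any framework's actual API. The class `InterruptionGate`, its field names, and the 300 ms value are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class InterruptionGate:
    """Hypothetical gate: yield only if both conditions are met."""

    min_duration_ms: int = 300  # illustrative value, not a recommendation
    min_words: int = 0          # 0 disables the words gate

    def should_yield(self, speech_ms: int, transcript: str) -> bool:
        # Duration gate: ignore coughs, clicks, and stray syllables.
        if speech_ms < self.min_duration_ms:
            return False
        # Words gate: require at least N transcribed words so far.
        return len(transcript.split()) >= self.min_words


light = InterruptionGate(min_duration_ms=300, min_words=1)
strict = InterruptionGate(min_duration_ms=300, min_words=3)
```

Under these assumptions, `light.should_yield(120, "")` rejects a short cough, and `light.should_yield(400, "stop")` yields. `strict.should_yield(400, "stop")` does not, which is the missed short interruption the paragraph above warns about.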
Endpointing and end-of-utterance timeouts
Endpointing is the other side of the same coin, and it is usually the real reason a "voice assistant starts speaking before the user finishes." Endpointing decides when the user's turn is over. If it is silence-based and the timeout is short, a normal mid-sentence pause (for example "I want to book a flight to, uh, Denver") gets scored as end-of-turn, the agent takes the floor, and now you have both a false end-of-turn and, a beat later, the agent talking over the user as they finish. Lengthen the minimum silence or the minimum endpointing delay and the agent stops cutting people off, at the cost of a slower reply after every genuine turn. Frameworks such as LiveKit Agents expose this directly as a minimum and maximum endpointing delay so you can bound both the responsiveness and the worst-case wait.
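A silence-based endpointer is simple enough to sketch in a few lines, which makes the failure easy to reproduce. The function `end_of_turn_ms`, the 20 ms frame size, and the timings below are assumptions for illustration, not values from any particular stack.

```python
from typing import Optional, Sequence


def end_of_turn_ms(
    frames: Sequence[bool],
    frame_ms: int = 20,
    silence_timeout_ms: int = 500,
) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None.

    `frames` holds one speech / non-speech flag per frame, as a VAD
    would produce them.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_timeout_ms:
                return (i + 1) * frame_ms
    return None


# Made-up utterance shaped like "I want to book a flight to, uh, Denver":
# 1000 ms speech, 400 ms pause, 600 ms speech, then 1000 ms of silence.
frames = [True] * 50 + [False] * 20 + [True] * 30 + [False] * 50

short_timeout = end_of_turn_ms(frames, silence_timeout_ms=300)
long_timeout = end_of_turn_ms(frames, silence_timeout_ms=500)
```

With a 300 ms timeout the turn is declared over at 1300 ms, in the middle of the pause and before the user has finished. With 500 ms the pause is survived and the turn ends at 2500 ms, which is 500 ms after the user actually stopped. That extra wait is the cost paid at the end of every genuine turn.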
Semantic turn detection: Smart Turn v3 and realtime semantic VAD
Everything above trades one error for the other along a fixed curve. Semantic turn detection is how you move the curve itself. Instead of asking only "is there silence," a semantic model asks "does this sound or read like a finished thought." Because it uses content and prosody rather than a raw gap, it can hold the turn through a pause or a filler word without you having to lengthen a blunt silence timer that also slows down real turn-ends.
Two concrete forms you are likely to reach for:
- Audio end-of-utterance models such as Smart Turn v3. These classify from the audio whether an utterance is complete or the speaker is merely pausing, and feed that into endpointing. Smart Turn v3 is multilingual and small enough to run quickly on CPU, so the end-of-utterance (EOU) latency it adds is modest while it meaningfully cuts premature endpointing. It is available to run in Pipecat pipelines and elsewhere.
- Realtime semantic VAD. The OpenAI Realtime API lets you choose the turn-detection mode. server_vad ends the turn on silence and exposes threshold, prefix_padding_ms, and silence_duration_ms. semantic_vad instead uses a model to judge whether the user is done, so a natural pause is less likely to be read as end-of-turn, which reduces false interruptions. Its eagerness setting (low, medium, high, or auto) trades how quickly it commits against how long it waits to be sure.
// OpenAI Realtime: silence-based turn taking
"turn_detection": {
  "type": "server_vad",
  "threshold": 0.5,
  "prefix_padding_ms": 300,
  "silence_duration_ms": 200
}

// OpenAI Realtime: model-judged turn taking,
// less likely to end a turn on a natural pause
"turn_detection": {
  "type": "semantic_vad",
  "eagerness": "auto"
}
Field names from the OpenAI Realtime turn-detection settings. Shown to illustrate the two modes, not tuned values.
A blunt threshold moves you along the false-vs-missed curve. A semantic turn detector moves the curve. Reach for the blunt knobs first because they are cheap and instant, and add a semantic detector when you cannot find any single threshold that gives you an acceptable rate of both errors at once.
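One way to picture how a semantic score moves the curve is to let it choose the silence timer instead of using one fixed timeout. The sketch below is an assumption about how such a combination could look, not a description of Smart Turn, OpenAI's semantic_vad, or any framework. The names `endpointing_delay_ms` and `turn_is_over`, the probability `p_complete`, and all default values are hypothetical.

```python
def endpointing_delay_ms(
    p_complete: float,
    min_delay_ms: int = 300,
    max_delay_ms: int = 1500,
    complete_threshold: float = 0.5,
) -> int:
    """Pick how much silence to require before ending the user's turn.

    p_complete is a hypothetical model score: the probability that the
    utterance so far is a finished thought.
    """
    if p_complete >= complete_threshold:
        return min_delay_ms  # sounds finished: reply quickly
    return max_delay_ms      # sounds unfinished: hold through the pause


def turn_is_over(silence_ms: int, p_complete: float) -> bool:
    """True once the observed silence exceeds the chosen delay."""
    return silence_ms >= endpointing_delay_ms(p_complete)
```

After 400 ms of silence, an utterance scored as complete ends the turn, while one scored as unfinished (for example, trailing off on "to, uh") does not until the longer bound is reached. The short delay keeps genuine turn-ends fast and the long delay bounds the worst-case wait when the model is wrong.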
How to measure both directions from recorded calls
Here is the part most teams skip, and it is the part that turns tuning from guesswork into engineering. You cannot improve what you only feel. You have to count both errors, on the same recordings, before and after every change. Tuning by listening to a demo call or two is how you ship an agent that felt fine in the office and talks over customers in production.
The measurement is a small labeling and counting job. Do it on real recorded calls, not scripted ones, because real calls carry the accents, phone-line codec artifacts, background noise, and hesitation patterns that actually trip the detector.
- Get clean audio, ideally dual-channel.
If your stack can record the user and the agent on separate channels, overlap becomes trivial to see: any span where both channels are hot is a talk-over candidate. If you only have a single mixed channel, segment it into who-is-speaking-when before you label. This is turn segmentation for scoring, not identifying any individual.
- Mark every candidate interruption.
Find every point where the user starts speaking while the agent is talking. Each one is a candidate. This is the denominator for both error rates.
- Label the user's intent at each candidate.
For each candidate, decide: was this a real turn-take, or was it a backchannel, filler, or noise. This is the human judgment the detector is trying to approximate, so it has to come from a person for the ground truth.
- Score the agent's response.
For real turn-takes: did the agent yield, and how many seconds did it take to go silent after the user's onset. For backchannels and filler: did the agent wrongly stop or hesitate. Now every candidate lands in one of the boxes below.
- Roll it up into two rates and one distribution.
Missed barge-in rate = real turn-takes the agent talked over, divided by all real turn-takes. False barge-in rate = non-turns the agent yielded to, divided by all non-turns (or per minute of agent speech). Time to yield = the distribution of seconds-to-silence on the real ones. Report the median and the tail, because the tail is what users remember.
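The candidate-marking step can be sketched as follows, assuming you already have per-channel speech segments (for example from running a VAD on each channel separately). The function name `interruption_candidates` and the segment times are made up for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of continuous speech


def interruption_candidates(
    user: List[Segment], agent: List[Segment]
) -> List[Tuple[float, float]]:
    """Find user segments that start while the agent is speaking.

    Returns (user_onset_s, overlap_s) pairs. Each one is a candidate
    that still needs a human label: real turn-take or not.
    """
    candidates = []
    for u_start, u_end in user:
        for a_start, a_end in agent:
            if a_start <= u_start < a_end:
                overlap = min(u_end, a_end) - u_start
                candidates.append((u_start, overlap))
                break  # a user onset falls in at most one agent segment
    return candidates


# Invented timings: the agent speaks twice, the user speaks three times.
agent_segments = [(0.0, 5.0), (8.0, 12.0)]
user_segments = [(2.0, 2.5), (6.0, 7.5), (9.0, 11.0)]
found = interruption_candidates(user_segments, agent_segments)
```

Here the user's first and third segments begin during agent speech and become candidates, while the second falls in a gap and is an ordinary turn. The overlap length is a useful hint when labeling: a 0.5 s overlap is often a backchannel, and a 2 s overlap is more likely talk-over, but the label itself still has to come from a person.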
| | Agent yielded | Agent kept talking |
|---|---|---|
| User really took the turn | Correct yield. Score its time to yield. | Missed barge-in (talk-over). |
| User only paused / filler / noise | False barge-in. | Correct hold. The agent kept its turn. |
The confusion matrix you are scoring toward. Both off-diagonal cells matter, and one knob trades one for the other.
Two things make this rigorous rather than anecdotal. First, if your stack lets you run the VAD or turn detector offline, replay the exact same audio through it at several threshold settings and count the boxes at each one. That gives you the real curve for your calls, not a vendor's benchmark. Second, always report both rates together. A single number like "yields 95 percent of the time" hides which error you are making, and the two are traded against each other. A team can cut talk-over to near zero and not notice they doubled their false barge-ins, because they were only watching one column.
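Once every candidate is labeled, the roll-up is a few lines of counting. This is a minimal sketch: the `Candidate` fields and the `score` function are hypothetical names, the batch is invented data for illustration, and the maximum stands in for the tail (with a larger batch you would report a high percentile instead).

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Candidate:
    real_turn: bool      # human label: did the user really take the turn?
    agent_yielded: bool  # did the agent stop talking?
    seconds_to_yield: Optional[float] = None  # only for correct yields


def score(cands: List[Candidate]) -> dict:
    """Compute both error rates and the time-to-yield summary."""
    real = [c for c in cands if c.real_turn]
    non = [c for c in cands if not c.real_turn]
    missed = sum(1 for c in real if not c.agent_yielded)
    false = sum(1 for c in non if c.agent_yielded)
    times = sorted(
        c.seconds_to_yield
        for c in real
        if c.agent_yielded and c.seconds_to_yield is not None
    )
    return {
        "missed_barge_in_rate": missed / len(real) if real else None,
        "false_barge_in_rate": false / len(non) if non else None,
        "median_time_to_yield_s": median(times) if times else None,
        "max_time_to_yield_s": times[-1] if times else None,
    }


# Invented batch: 4 real turn-takes and 4 backchannel / filler / noise events.
batch = [
    Candidate(True, True, 0.5),
    Candidate(True, True, 0.75),
    Candidate(True, True, 1.5),
    Candidate(True, False),
    Candidate(False, True),
    Candidate(False, False),
    Candidate(False, False),
    Candidate(False, False),
]
results = score(batch)
```

To trace the curve for your own calls, run the same scoring once per detector setting on the same labeled batch and keep the two rates side by side for each setting.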
A tuning loop that converges
Putting it together, the loop that reliably gets a team to a good operating point:
- Instrument first.
Log every VAD trigger, interruption event, and turn-end with timestamps, so a recording can be scored without re-listening to all of it.
- Measure the baseline both ways.
Run the scoring above on a real batch and write down your current false barge-in rate, missed barge-in rate, and time-to-yield distribution.
- Move the cheapest knob toward your dominant error.
If false barge-in dominates, raise the minimum interruption duration first, then consider a words gate or a semantic detector. If missed barge-in dominates, lower the VAD threshold or the duration gate, and confirm interruptions are even enabled.
- Re-measure on the same batch.
Confirm the error you targeted went down and watch how far the other one went up. Keep the change only if the trade is worth it for your product.
- Add a semantic detector when the blunt knobs run out.
If no single threshold gives you an acceptable rate of both, that is the signal to move the curve with semantic turn detection rather than keep trading one error for the other.
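The instrumentation step at the top of the loop can be as small as one JSON line per event. The sketch below is one possible shape, not a required schema: the function `log_event`, the event names, and the field names are all assumptions.

```python
import io
import json
import time
from typing import IO, Optional


def log_event(
    out: IO[str],
    kind: str,
    call_id: str,
    t: Optional[float] = None,
    **fields,
) -> dict:
    """Append one timestamped event as a JSON line and return it."""
    event = {
        "t": time.time() if t is None else t,  # seconds; pass call-relative time if you have it
        "call_id": call_id,
        "kind": kind,
        **fields,
    }
    out.write(json.dumps(event) + "\n")
    return event


# Usage with an in-memory buffer; in practice `out` would be a log file.
buf = io.StringIO()
log_event(buf, "vad_trigger", "call-001", t=12.40, prob=0.71)
log_event(buf, "agent_interrupted", "call-001", t=12.72)
events = [json.loads(line) for line in buf.getvalue().splitlines()]
```

With events like these next to the recording, time to yield falls out of the timestamps (here about 0.32 s between the trigger and the agent stopping), and a reviewer can jump straight to each candidate instead of listening to the whole call.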
Whether a snappier median time to yield is worth a few more false barge-ins is a product call, not a universal answer. A high-stakes support line and a casual assistant will pick different points on the curve. The engineering job is to make the trade visible and measured so the product call is an informed one.
Want this run rigorously on your own calls?
We will tell you which way your thresholds are erring.
Doing the measurement above well takes clean labeling and a consistent rubric. If you would rather see the numbers than build the harness, send us 10 to 15 of your own recorded calls. We score every barge-in event on three fixed, objective criteria (did the agent yield, seconds to yield, did it keep talking over the caller) and return a per-call scorecard plus an aggregate read in about a week. You will see plainly whether your current thresholds are erring toward false barge-in or toward talk-over, and by how much. It works with recordings from Vapi, Retell, Bland, Pipecat, or self-hosted LiveKit, with no integration.
One-time report, $499. No onboarding, and nothing connects to your system. You send recordings, we return numbers. Recordings are deleted after delivery and never used to train any model. NDA available on request.
Frequently asked
Why is my voice agent's turn detection too sensitive and interrupting the user?
Sensitivity is set too high, so the agent treats the user's own pauses, filler, backchannels, breaths, and background noise as a turn and yields to them. That is a false barge-in. Reduce it by raising the VAD speech threshold, raising the minimum interruption duration so short noises are ignored, requiring a minimum number of transcribed words before an interruption counts, or adding a semantic turn detector. Each lowers false interruptions but can add latency or increase missed interruptions, so you are choosing a point on a trade-off, not eliminating both.
Why does my voice assistant start speaking before the user finishes?
The endpointing or end-of-utterance timeout is too short, so a mid-sentence pause is scored as the end of the turn and the agent jumps in. Raise the minimum endpointing delay or the required silence duration, and consider a semantic end-of-utterance model that holds the turn through pauses and filler rather than ending it on the first gap. Holding longer costs response speed, so tune it against your measured rate of premature turn-ends.
How do I make interruption detection ignore filler words?
Require a minimum number of transcribed words before an interruption is allowed, which some agent frameworks expose as a minimum interruption words setting, or classify backchannels and filler such as um, uh, yeah, and mhm as non-turn-taking. The trade-off is that waiting for words adds latency to real interruptions, and a high word threshold can miss short but genuine interruptions like stop or no.
Silero VAD is not detecting an interruption. What is wrong?
A voice activity detector reports whether speech is present, not whether the user means to take the turn, so you still need interruption logic on top of it. If Silero VAD is not firing on a real interruption, check that audio is 16 kHz mono with the expected window size, lower the speech-probability threshold, and shorten the minimum silence and minimum speech durations so brief interruptions are not filtered out. If it fires too often instead, raise the threshold and the minimum speech duration.
What are Smart Turn v3, endpointing, and end-of-utterance latency?
Endpointing is deciding when the user has finished a turn. End-of-utterance models such as Smart Turn v3 predict from the audio whether an utterance is complete or the speaker is only pausing, so endpointing can wait through pauses and filler instead of cutting the user off. Smart Turn v3 is multilingual and runs quickly on CPU, so it adds little inference latency while reducing premature endpointing, a common source of the agent speaking before the user is done.
OpenAI Realtime semantic_vad vs server_vad, which is better for interruptions?
server_vad ends the turn on silence and exposes threshold, prefix_padding_ms, and silence_duration_ms, so it can end a turn during a natural pause and let the agent speak too early. semantic_vad uses a model to predict turn boundaries from content, so it is less likely to treat a mid-thought pause as end-of-turn, which reduces false interruptions. Its eagerness setting trades responsiveness against how long it waits before deciding the user is done. Which is better depends on whether your calls suffer more from talk-over or from the agent cutting users off.