How to Test Voice Agent Barge-In: Three Objective Signals from Recorded Calls

Barge-in is the moment a caller starts talking while your agent is still speaking. A good agent stops and listens. A bad one talks over the caller. This is a guide to measuring that from call audio alone, with no integration, using three objective signals you can extract from a WAV and a transcript.

A method write-up from the team at Attention Labs. Applies to any stack that exports recordings, including Vapi, Retell, Bland, Pipecat, and self-hosted LiveKit.

Why measure barge-in from the audio

Most teams evaluate interruption handling by ear. Someone plays a few clips, winces at the one where the agent kept reciting a menu over a frustrated caller, and calls it a known issue. That is not a measurement. It does not tell you how often it happens, how bad it is on average, or whether last week's prompt change helped.

The reason people fall back on listening by ear is that the "correct" signals seem to live inside the stack: when did the endpointer fire, when did the TTS get a stop command, when did the playout buffer flush. Those internals differ across Vapi, Retell, Bland, Pipecat, and self-hosted LiveKit, and half the time you cannot see them anyway. So the practical way to test voice agent barge-in is to ignore the internals and read what actually reached the caller: the recorded audio. The audio is the ground truth for what the caller experienced, and every stack can produce it.

Everything below reduces to three observable events on the recording: when the caller started speaking, when the agent's voice actually went quiet, and how much of the caller's turn the agent ran over. From those three you get objective numbers that compare cleanly across calls, across stacks, and across releases.

The three signals

These are the three fixed criteria we score, and the only three you need to characterize the barge-in failure mode. Keep them separate. An agent can pass one and fail another, and the mix tells you where the problem is.

  1. Did it yield (pass / fail). When the caller spoke over the agent, did the agent stop? A binary per interruption event.
  2. Seconds to yield (latency, in seconds). The gap from the caller's speech onset to the moment the agent's voice drops to silence.
  3. Overlap seconds (duration, in seconds). How long both voices ran at the same time after the caller began. The talk-over duration.

Signal 1 is the coarse pass/fail. Signal 2 is the one people mean when they say the agent feels laggy or interrupts back. Signal 3 is what makes callers give up, because a full second of two voices at once is unintelligible to a human. Together they are the metrics to track for voice agent interruptions, and each one falls straight out of the audio.

What you need before you start

  • The call recordings. WAV or anything you can decode to PCM. Ten to fifteen real calls that actually contain interruptions is enough to see the pattern.
  • Two-channel audio if you can get it. The cleanest setup has the caller on one channel and the agent's text-to-speech on the other. Most platforms can export dual-channel or stereo recordings. This removes the hardest source of error, which is deciding who is talking. If you only have a mono mix, the method still works but needs an extra separation step, covered under pitfalls.
  • A transcript with timestamps is helpful but optional. It lets you tell a real barge-in ("no, I said medium") from a backchannel ("uh huh") and it makes spot-checking fast. The energy math does not depend on it.
  • Python with numpy and soundfile, plus a voice activity detector. The examples below use a simple energy gate so they run with nothing extra, and note where a learned detector such as Silero VAD or webrtcvad is worth swapping in.
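
Before scoring anything, it can help to confirm that a recording really is two-channel and to note its sample rate. The sketch below uses only Python's standard-library wave module, which reads uncompressed PCM WAV (other encodings need a decoder such as soundfile). The helper name describe_wav and the synthetic demo file are illustrative assumptions, not part of the method.

```python
import os
import struct
import tempfile
import wave

def describe_wav(path):
    """Return basic facts about a PCM WAV file without loading the samples."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        frames = w.getnframes()
    return {
        "channels": channels,
        "sample_rate": rate,
        "duration_s": frames / rate,
        "dual_channel": channels == 2,
    }

# Demo on a synthetic one-second stereo file at 8 kHz (silence on both channels).
demo_path = os.path.join(tempfile.gettempdir(), "bargein_demo_stereo.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)          # 16-bit PCM
    w.setframerate(8000)
    w.writeframes(struct.pack("<2h", 0, 0) * 8000)   # 8000 stereo frames

info = describe_wav(demo_path)
print(info)
```

A file that reports two channels can still carry the same mix on both, so listening to each channel once is still worthwhile.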

Extracting each signal from a call

The whole method is four moves: mark the caller speech onset, mark the point the agent's voice energy drops, subtract to get the gap, and count the frames where both voices are active. Here it is end to end on one two-channel recording.

Step 1. Load the audio and compute short-time energy

Split the stereo file into the caller channel and the agent channel, then compute root-mean-square energy in short frames. Twenty millisecond frames with a ten millisecond hop give you a reading every ten milliseconds, which is fine resolution for turn-taking.

import numpy as np
import soundfile as sf

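# Two-channel recording: caller on ch0, agent TTS on ch1.
# Confirm the channel order for your platform before trusting results.
audio, sr = sf.read("call_0007.wav")   # float samples, shape (n_samples, 2)
if audio.ndim != 2 or audio.shape[1] != 2:
    raise ValueError("expected a two-channel recording; see the mono section below")
caller = audio[:, 0]
agent  = audio[:, 1]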

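def rms_db(x, sr, frame_ms=20, hop_ms=10):
    """Return frame start times (s) and RMS level (dB re full scale) per frame."""
    frame = int(sr * frame_ms / 1000)
    hop   = int(sr * hop_ms  / 1000)
    n = max(0, 1 + (len(x) - frame) // hop)   # 0 frames if the clip is shorter than one frame
    t  = np.empty(n)
    db = np.empty(n)
    for i in range(n):
        seg = x[i*hop : i*hop + frame]
        # The epsilon keeps digital silence finite (about -120 dB) instead of -inf.
        rms = np.sqrt(np.mean(seg*seg) + 1e-12)
        db[i] = 20.0 * np.log10(rms)
        t[i]  = (i*hop) / sr
    return t, db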

t, caller_db = rms_db(caller, sr)
_, agent_db  = rms_db(agent, sr)
hop = t[1] - t[0]           # 0.010 s per frame

Step 2. Turn energy into speech masks

A frame counts as voiced when it sits well above that channel's own noise floor. Estimating the floor from a low percentile of the channel makes this robust to gain differences between the two legs of the call. This is the point where a learned VAD earns its keep on noisy calls, but the energy gate is transparent and good enough to build intuition.

def voiced(db, margin_db=12.0):
    floor = np.percentile(db, 10)      # robust per-channel noise floor
    return db > (floor + margin_db)

caller_on = voiced(caller_db)
agent_on  = voiced(agent_db)

# On mono or noisy audio, replace the two lines above with a learned VAD,
# e.g. Silero VAD or webrtcvad, run per channel. The rest is unchanged.
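
One case the percentile floor may handle poorly is a channel that is digitally silent between utterances, which is plausible for a TTS leg (an assumption; check your own recordings). There the 10th percentile lands at the numerical floor of about -120 dB, the 12 dB margin puts the threshold far below any real speech, and faint residue counts as voiced. One possible guard, sketched here, is to clamp the threshold to an absolute minimum. The name voiced_clamped and the -60 dBFS value are assumptions to tune, not part of the original method.

```python
import numpy as np

def voiced_clamped(db, margin_db=12.0, min_threshold_db=-60.0):
    """Energy gate with a per-channel floor plus an absolute lower bound."""
    floor = np.percentile(db, 10)
    # Never let the threshold fall below min_threshold_db (assumed value).
    threshold = max(floor + margin_db, min_threshold_db)
    return db > threshold

# Synthetic frame levels: digital silence, faint residue, then speech.
db = np.array([-120.0] * 50 + [-90.0] * 10 + [-25.0] * 40)

plain = db > (np.percentile(db, 10) + 12.0)   # the unclamped gate
clamped = voiced_clamped(db)

print("unclamped voiced frames:", int(plain.sum()))
print("clamped voiced frames  :", int(clamped.sum()))
```

On this synthetic input the unclamped gate marks the faint -90 dB residue as speech, while the clamped gate keeps only the -25 dB frames.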

Step 3. Mark the caller's barge-in onset

A barge-in only counts when the caller starts talking while the agent is still talking. So the onset is the first caller frame that stays voiced for at least 150 milliseconds and lands during agent speech. The sustain requirement throws out lip smacks, keyboard clicks, and single-frame noise that would otherwise fire a false onset.

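min_run = int(round(0.150 / hop))     # 150 ms of sustained caller speech

onset = None
for i in range(len(caller_on) - min_run + 1):   # +1 so the last full window is checked
    # Agent still talking at frame i, and caller voiced for the next min_run frames.
    if agent_on[i] and caller_on[i:i+min_run].all():
        onset = i
        break

if onset is None:
    # The caller never spoke over the agent in this call. Not a barge-in event.
    # Stop here: step 4 needs a valid onset index.
    raise SystemExit("no barge-in in this call")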
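
The loop above stops at the first onset, but a call can contain several barge-ins. A minimal sketch for collecting all of them follows. It adds one condition the single-onset loop does not have: the caller run must begin at the candidate frame (a rising edge), so a caller who was already talking when the agent started is not counted. The function name find_onsets and the synthetic masks are illustrative assumptions.

```python
import numpy as np

def find_onsets(caller_on, agent_on, min_run):
    """Frame indices where a sustained caller run begins during agent speech."""
    onsets = []
    for i in range(len(caller_on) - min_run + 1):
        rising = i == 0 or not caller_on[i - 1]     # caller run starts here
        if rising and agent_on[i] and caller_on[i:i + min_run].all():
            onsets.append(i)
    return onsets

# Synthetic masks at a 10 ms hop (600 frames = 6 s).
agent_on = np.zeros(600, dtype=bool)
caller_on = np.zeros(600, dtype=bool)
agent_on[0:200] = True       # agent talks 0.0 to 2.0 s
agent_on[300:500] = True     # agent talks 3.0 to 5.0 s
caller_on[100:150] = True    # barge-in at 1.0 s
caller_on[220:260] = True    # caller speaks while agent is silent: normal turn
caller_on[400:405] = True    # 50 ms click: too short to count
caller_on[450:520] = True    # barge-in at 4.5 s

onsets = find_onsets(caller_on, agent_on, min_run=15)
print(onsets)
```

A one-frame dropout in the middle of a caller turn would register as a second onset here, so smoothing the mask or merging onsets that sit close together may be needed on real audio.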

Step 4. Mark where the agent's voice drops, then compute the signals

After the onset, find the first moment the agent goes quiet and stays quiet for at least 200 milliseconds. The sustain window matters: without it, the natural micro-pause between two words would read as a yield. That agent-stop point is the observable proxy for "the agent stopped talking," and it is exactly what the caller heard.

From there the three signals are arithmetic. Seconds to yield is the gap between onset and agent-stop. The agent yielded if its voice actually went quiet after the caller barged in; how good that yield was is a judgment you read off seconds to yield against the thresholds table below, not a pass/fail cutoff baked into the script. Overlap seconds is the total time both masks are true between onset and the stop.

quiet_run = int(round(0.200 / hop))    # 200 ms of agent silence = a real stop

stop, run = None, 0
for i in range(onset, len(agent_on)):
    run = run + 1 if not agent_on[i] else 0
    if run >= quiet_run:
        stop = i - quiet_run + 1       # first frame of the silence
        break

if stop is None:
    # Agent never went quiet after the caller barged in.
    yielded          = False
    seconds_to_yield = None
    overlap_seconds  = round(float(np.sum(agent_on[onset:] & caller_on[onset:]) * hop), 3)
else:
    seconds_to_yield = round(t[stop] - t[onset], 3)
    yielded          = True              # it went quiet; read the judgment off seconds_to_yield below
    overlap_seconds  = round(float(np.sum(agent_on[onset:stop] & caller_on[onset:stop]) * hop), 3)

print("Signal 1  did it yield     :", yielded)
print("Signal 2  seconds to yield :", seconds_to_yield)
print("Signal 3  overlap seconds  :", overlap_seconds)

That is the entire measurement. It reads audio, not internals, which is why the same script scores a Vapi recording, a Retell recording, a Bland recording, a Pipecat recording, or a self-hosted LiveKit recording with no changes. If you want to measure barge-in latency for one agent against another, run it on both sets of recordings and compare the distributions.
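
One way to gain confidence in the pipeline before running it on real calls is to feed it a synthetic recording with known ground truth. The sketch below restates the four steps as a function over in-memory channels and builds a fake call where the agent speaks from 0 to 2 s and the caller from 1 to 3 s, so all three signals should come out near 1.0 s of yield time and 1.0 s of overlap. The name score_channels, the tone-plus-noise signal, and the return format are assumptions for illustration.

```python
import numpy as np

def rms_db(x, sr, frame_ms=20, hop_ms=10):
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + (len(x) - frame) // hop
    t = np.arange(n) * hop / sr
    db = np.empty(n)
    for i in range(n):
        seg = x[i * hop : i * hop + frame]
        db[i] = 20.0 * np.log10(np.sqrt(np.mean(seg * seg) + 1e-12))
    return t, db

def voiced(db, margin_db=12.0):
    return db > (np.percentile(db, 10) + margin_db)

def score_channels(caller, agent, sr):
    """Steps 1 to 4 on in-memory channels; scores the first barge-in only."""
    t, caller_db = rms_db(caller, sr)
    _, agent_db = rms_db(agent, sr)
    hop = t[1] - t[0]
    caller_on, agent_on = voiced(caller_db), voiced(agent_db)
    min_run = int(round(0.150 / hop))
    quiet_run = int(round(0.200 / hop))

    onset = None
    for i in range(len(caller_on) - min_run + 1):
        if agent_on[i] and caller_on[i:i + min_run].all():
            onset = i
            break
    if onset is None:
        return {"onset": None, "yielded": None,
                "seconds_to_yield": None, "overlap_seconds": None}

    stop, run = None, 0
    for i in range(onset, len(agent_on)):
        run = run + 1 if not agent_on[i] else 0
        if run >= quiet_run:
            stop = i - quiet_run + 1      # first frame of the silence
            break

    end = len(agent_on) if stop is None else stop
    overlap = float(np.sum(agent_on[onset:end] & caller_on[onset:end]) * hop)
    return {
        "onset": float(t[onset]),
        "yielded": stop is not None,
        "seconds_to_yield": None if stop is None else float(t[stop] - t[onset]),
        "overlap_seconds": overlap,
    }

# Synthetic 4 s call at 8 kHz: a 220 Hz tone stands in for speech,
# low-level noise stands in for the line's noise floor.
sr = 8000
n = 4 * sr
time = np.arange(n) / sr
rng = np.random.default_rng(0)
tone = 0.3 * np.sin(2 * np.pi * 220.0 * time)
agent = rng.normal(0.0, 0.001, n) + tone * (time < 2.0)
caller = rng.normal(0.0, 0.001, n) + tone * ((time >= 1.0) & (time < 3.0))

result = score_channels(caller, agent, sr)
print(result)
```

The measured values can differ from the ground truth by a frame or two, because a 20 ms frame that only partly covers speech still reads as voiced.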

If you only have a mono mix

Single-channel recordings mix the caller and the agent into one waveform, so a naive energy gate cannot tell whose voice is whose, and the agent's own audio bleeding into the caller path can masquerade as a barge-in. Two options: request dual-channel exports from your platform, which most support, or run a diarization or source-separation pass first and treat its two output streams as the caller and agent channels. The four steps above are identical after that. Dual-channel is worth the config change, because it removes the largest error source in the whole pipeline. Mono mixdowns are especially common on telephony calls; see barge-in on 8kHz telephony for what phone audio does to this method specifically.

Reading the numbers

The measurement is objective. The thresholds are judgment, so treat the table below as sane engineering defaults, not laws. The anchor for "responsive" comes from conversation research: across languages, the gap between human turns clusters near 200 milliseconds. People notice when a turn transition drifts much past that, which is why an agent that takes more than a second to yield reads as talking over you even if it technically stopped.

| Signal | Good | Watch | Failing |
|---|---|---|---|
| Did it yield | yields on essentially every real barge-in | misses the occasional one | rides through the caller's turn |
| Seconds to yield | under ~0.5 s | ~0.5 to 1.0 s | over ~1.0 s, or never |
| Overlap seconds | under ~0.3 s | ~0.3 to 0.8 s | over ~0.8 s |

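Read the three together. As computed above, overlap is counted only between onset and stop, so it can never exceed seconds to yield. Overlap close to seconds to yield means the caller talked through the whole wait; when the agent's decision to stop was quick, that usually means its audio buffer kept playing out. A long seconds-to-yield with much lower overlap means the caller stopped partway through, and usually points at an endpointer that was slow to trust that the caller was really talking. The pattern points at the fix.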
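
The two numeric rows of the table can be encoded as a small lookup so every event gets the same reading. This is a sketch: the band edges are copied from the table, while the function names and the choice to put values exactly on an edge into "Watch" are assumptions. The "did it yield" row is a rate across a batch, so it is not graded per event here.

```python
def grade_seconds(value, good_below, failing_above):
    """Map a measurement in seconds onto the Good / Watch / Failing bands."""
    if value is None:
        return "Failing"        # the agent never went quiet
    if value < good_below:
        return "Good"
    if value > failing_above:
        return "Failing"
    return "Watch"

def grade_event(seconds_to_yield, overlap_seconds):
    return {
        "seconds_to_yield": grade_seconds(seconds_to_yield, 0.5, 1.0),
        "overlap_seconds": grade_seconds(overlap_seconds, 0.3, 0.8),
    }

fast = grade_event(0.42, 0.25)
middling = grade_event(0.70, 0.70)
never = grade_event(None, 2.40)
print(fast, middling, never, sep="\n")
```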

Pitfalls that quietly skew the results

  • Backchannels are not barge-ins. A caller saying "mm hm" or "yeah" to acknowledge is not asking the agent to stop, and a good agent keeps going. If you score those as failed yields you will punish correct behavior. This is where the timestamped transcript pays off: filter onsets whose caller speech is a short acknowledgment token before scoring.
  • Playout buffer tail. Even after the logic decides to stop, already-buffered TTS keeps playing for a moment. That tail is real audio the caller heard, so it belongs in overlap seconds. Do not "correct" for it. It is part of the experience you are measuring.
  • Echo on mono recordings. The agent's voice leaking into the caller leg can trip a caller-onset that never happened. Two-channel audio or an echo-cancelled caller channel removes this.
  • Endpointing versus interruption. If the agent had already finished its sentence and paused naturally, the caller speaking is a normal turn, not a barge-in. The "agent was still talking at onset" condition in step 3 guards against counting these.
  • Client-side versus server-side recording. Where the audio was captured changes the latency you see, because network jitter and buffering sit between the two. Keep the capture point consistent across every call in a batch so seconds-to-yield stays comparable.
  • Frame and sustain tuning. The 150 ms onset and 200 ms silence windows are defaults. Callers who speak in very short, fast bursts may need a shorter onset window. Noisy lines may need a learned VAD instead of the energy gate. Tune once against a handful of hand-labeled clips, then hold the settings fixed for the whole batch. For the tradeoffs behind these specific numbers, see turn detection tuning: false vs missed interruptions.
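
The backchannel filter from the first pitfall can be sketched against a timestamped transcript. The transcript layout below (a list of segments with speaker, start, end, and text) is a hypothetical format, and the token list, the 0.3 s tolerance, and the three-word limit are assumptions to adapt to your own data.

```python
import re

BACKCHANNELS = {"mm", "hm", "hmm", "uh", "huh", "yeah", "yep",
                "ok", "okay", "right", "sure"}

def is_backchannel(onset_s, transcript, tolerance_s=0.3, max_words=3):
    """True if the caller segment at this onset is only acknowledgment tokens."""
    for seg in transcript:
        if seg["speaker"] != "caller":
            continue
        # Allow the audio onset to land slightly before the transcript timestamp.
        if seg["start"] - tolerance_s <= onset_s <= seg["end"]:
            words = re.findall(r"[a-z']+", seg["text"].lower())
            return 0 < len(words) <= max_words and all(w in BACKCHANNELS for w in words)
    return False    # no matching segment: keep the onset and score it

transcript = [
    {"speaker": "agent",  "start": 10.0, "end": 18.0, "text": "Your options are small, medium, or large."},
    {"speaker": "caller", "start": 12.4, "end": 12.9, "text": "Mm hm."},
    {"speaker": "caller", "start": 15.1, "end": 16.8, "text": "No, I said medium."},
]

onsets_s = [12.45, 15.12]     # onset times in seconds from the audio
real_barge_ins = [o for o in onsets_s if not is_backchannel(o, transcript)]
print(real_barge_ins)
```

Onsets with no matching transcript segment are kept rather than dropped, so a transcription gap cannot silently hide a real barge-in.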

How to fix an agent that fails

Once the numbers point at a cause, the fixes are concrete. This is stack-level configuration, not a rewrite. For a step-by-step walkthrough of this exact failure mode, see fix a voice agent that talks over the caller.

  • Confirm interruption is actually enabled. Most voice frameworks have a barge-in or interruptible flag on the agent or session. If it is off, the agent will finish its turn no matter what. This is the single most common cause of a zero-yield result.
  • Tune input VAD sensitivity and endpointing. A long seconds-to-yield with low overlap usually means the input detector waits too long to accept caller speech. Lower the speech-start threshold or the required speech duration so the stop fires sooner.
  • Flush the TTS on interrupt. High overlap with a short decision time means buffered audio is still draining. Send a stop or clear to the TTS and drop the queued audio the instant an interruption is detected, rather than letting the current utterance finish.
  • Prefer full-duplex handling. An agent that cannot listen while it speaks physically cannot detect a barge-in until it pauses. If the pipeline is half-duplex, that is the ceiling on how fast it can yield.
  • Debounce backchannels deliberately. If you want the agent to ignore short acknowledgments, do it on purpose with a minimum interruption duration, and set it knowing it also delays real barge-ins. Measure both effects with the same three signals.

After any change, collect a comparable batch of new recordings (same scenarios, same capture point) and score it with the same settings as the baseline. The point of an objective measurement is that "it feels better now" becomes "median seconds to yield dropped and total overlap fell."
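
A before-and-after comparison can be as small as two summaries and a subtraction. In this sketch the event dictionaries use the same keys as the per-call script, the numbers are invented purely to show the arithmetic, and mean overlap per event is used instead of total overlap so that batches of different sizes stay comparable (a design choice, not something the method prescribes).

```python
from statistics import median

def summarize(events):
    """Collapse a list of scored barge-in events into three batch numbers."""
    yielded = [e for e in events if e["yielded"]]
    return {
        "yield_rate": len(yielded) / len(events),
        "median_seconds_to_yield":
            median(e["seconds_to_yield"] for e in yielded) if yielded else None,
        "mean_overlap": sum(e["overlap_seconds"] for e in events) / len(events),
    }

def event(yielded, seconds_to_yield, overlap_seconds):
    return {"yielded": yielded, "seconds_to_yield": seconds_to_yield,
            "overlap_seconds": overlap_seconds}

# Made-up values for illustration only, not measurements.
baseline = [event(True, 1.2, 0.9), event(True, 0.8, 0.8),
            event(False, None, 2.0), event(True, 1.0, 0.6)]
candidate = [event(True, 0.4, 0.3), event(True, 0.6, 0.5),
             event(True, 0.5, 0.2), event(True, 0.9, 0.7)]

before, after = summarize(baseline), summarize(candidate)
print("before:", before)
print("after :", after)
```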

Scoring across a batch to benchmark the agent

One call is an anecdote. To benchmark voice agent interruption handling, run the per-call script over every recording, collect one row per barge-in event (the per-call script above scores the first one in each call), and summarize the distribution rather than a single number.

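# score_call(path) is assumed to wrap steps 1 to 4 and return one dict per call:
# {"onset": ..., "yielded": ..., "seconds_to_yield": ..., "overlap_seconds": ...}
# with onset set to None when the call has no barge-in.
# recordings is a list of paths to the WAV files in the batch.
events = [score_call(path) for path in recordings]      # each row: the 3 signals
barge_ins = [e for e in events if e["onset"] is not None]

yield_rate  = np.mean([e["yielded"] for e in barge_ins])
median_tty  = np.median([e["seconds_to_yield"] for e in barge_ins
                         if e["seconds_to_yield"] is not None])
total_over  = np.sum([e["overlap_seconds"] for e in barge_ins])

print("yield rate         :", round(yield_rate, 3))
print("median s-to-yield  :", round(median_tty, 3))
print("total overlap (s)  :", round(total_over, 1))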

That is a scorecard: a yield rate, a median seconds-to-yield, and total overlap seconds across the batch, plus the two or three worst calls to listen to. It is stable enough to gate a release and simple enough to explain to a customer. And because it is the same three signals every time, you can compare last month's recorded calls against this month's and actually trust the delta.

A quick honesty check on your own labels: pull the five worst calls the script flags and listen to them. If they sound bad, your thresholds are aligned with human perception. If they sound fine, your onset or silence windows need a nudge. Calibrate against your ears once, then let the numbers do the work at scale.
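
Picking the worst calls to listen to can be sketched as a sort. The ranking rule here (events where the agent never yielded come first, then the highest overlap) is an assumption about what "worst" means, and the path key is a hypothetical field added so each row can be traced back to its recording.

```python
def worst_events(events, k=5):
    """Rank barge-in events from worst to best and return the top k."""
    def badness(e):
        never_yielded = 0 if e["yielded"] else 1
        return (never_yielded, e["overlap_seconds"])
    return sorted(events, key=badness, reverse=True)[:k]

# Made-up rows for illustration only.
events = [
    {"path": "a.wav", "yielded": True,  "overlap_seconds": 0.2},
    {"path": "b.wav", "yielded": False, "overlap_seconds": 3.1},
    {"path": "c.wav", "yielded": True,  "overlap_seconds": 1.4},
    {"path": "d.wav", "yielded": True,  "overlap_seconds": 0.6},
]

to_review = [e["path"] for e in worst_events(events, k=2)]
print(to_review)
```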

Want this run for you

The same three signals, scored by hand on your calls

Everything above is the exact scoring we run. If you would rather not build and tune the pipeline, send us 10 to 15 of your own recorded calls and we score every barge-in event on these three signals, by hand, and return a per-call scorecard plus an aggregate reliability report in about a week. $499, one time. No integration, nothing connects to your stack, and it works with recordings from Vapi, Retell, Bland, Pipecat, or self-hosted LiveKit.

Your recordings are deleted after the report is delivered, never used to train any model, and we will sign an NDA on request.

FAQ

How do you test voice agent barge-in?

You test it from the audio, not the internals. For each recording, mark when the caller starts speaking, mark when the agent's speech energy drops to silence, and measure three things: whether the agent yielded, how many seconds it took, and how many seconds it kept speaking over the caller. Because it reads audio, it works for any stack that exports recordings.

How do you measure barge-in latency for a voice agent?

Barge-in latency, the seconds to yield, is the gap between the caller's speech onset and the point where the agent's text-to-speech energy drops and stays low. Compute short-time RMS energy on the agent audio in 20 ms frames, find the caller onset with a voice activity detector, and subtract. Cross-linguistic conversation studies put the typical gap between human turns near 200 ms, so a responsive agent should yield well under a second.

What metrics should you track for voice agent interruptions?

Three objective signals cover the failure mode: did the agent yield when the caller spoke, seconds to yield, and overlap seconds. Aggregate them across calls into a yield rate, a median seconds-to-yield, and total overlap seconds to get a scorecard you can trust across releases.

How do you benchmark voice agent interruption handling across a stack?

Collect 10 to 15 representative recorded calls that contain real interruptions, score every barge-in event on the same three signals, and report the distribution rather than one number. Because the method reads audio, the same benchmark applies whether the calls came from Vapi, Retell, Bland, Pipecat, or self-hosted LiveKit.

How do you score whether a voice agent yielded on recorded calls?

For each interruption, the script counts a yield when the agent's speech energy falls back toward its channel noise floor after the caller onset and stays there for at least 200 ms. If the agent audio keeps its energy through the rest of the recording, it did not yield. How quickly it happened is read from seconds to yield rather than enforced as a cutoff in the script. Two-channel recordings make this reliable because the caller and agent sit on separate channels and cannot be confused.

Does this require integrating with my voice agent?

No. The method reads call audio and an optional transcript, so nothing connects to your system. It is a one-time measurement, not continuous monitoring, and it works with Vapi, Retell, Bland, Pipecat, self-hosted LiveKit, or any stack that can export recordings.