Powered by OpenAIRE graph
Preprint
Data sources: ZENODO

ATR: An Acoustic Turn-over Rate Metric for Evaluating Overlap Handling in Full-Duplex Spoken Dialogue Systems

Authors: Dike, Ifeanyi

Abstract

Turn-over Rate (TOR), introduced in Full-Duplex-Bench [5], measures how often a full-duplex spoken dialogue model yields the speaking floor during overlapping speech. TOR is computed from ASR transcripts, and this dependency affects the result: two commonly used ASR backends produce TOR estimates that diverge by up to 23.5 percentage points on the same audio, making cross-study comparisons unreliable. We propose the Acoustic Turn-over Rate (ATR), which replaces the ASR step with voice activity detection (VAD). ATR asks the same question as TOR (did the model's audio fall silent before the overlap window closed?) but answers it directly from the waveform, without transcription. Evaluated on Moshi [2] across all four Full-Duplex-Bench v1.5 [6] scenarios, three independent VAD backends (Silero, WebRTC, pyannote) agree within 7.6 percentage points on user interruption, roughly three times more consistent than ASR-based TOR on the same audio. ATR requires no language model, runs locally, and is language-agnostic, making it a practical and more reproducible alternative for benchmarking full-duplex speech systems.

Index Terms: full-duplex dialogue, turn-taking evaluation, voice activity detection, spoken dialogue systems, overlap handling
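To make the ATR question concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that some VAD backend has already produced speech segments for the model's audio channel as (start, end) pairs in seconds, and that each sample comes with a known overlap window. The function names, the `min_silence` tail length, and the exact decision rule (no detected model speech in the last `min_silence` seconds of the window) are all illustrative assumptions; the abstract does not specify them. The last helper shows how the spread between backends, in percentage points, could be computed.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of detected speech, in seconds


def model_yielded(speech: List[Segment], window: Segment,
                  min_silence: float = 0.2) -> bool:
    """Return True if no model speech is detected at the end of the overlap window.

    Assumption: "fell silent before the window closed" is approximated as
    "no VAD speech segment intersects the final `min_silence` seconds".
    """
    w_start, w_end = window
    tail_start = max(w_start, w_end - min_silence)
    # Any speech segment intersecting [tail_start, w_end] means still talking.
    return not any(s < w_end and e > tail_start for s, e in speech)


def acoustic_turn_over_rate(
        samples: List[Tuple[List[Segment], Segment]]) -> float:
    """Fraction of samples in which the model yielded the floor."""
    if not samples:
        raise ValueError("need at least one sample")
    yielded = sum(model_yielded(speech, window) for speech, window in samples)
    return yielded / len(samples)


def backend_spread_pp(rates: Dict[str, float]) -> float:
    """Largest disagreement between backends, in percentage points."""
    return 100.0 * (max(rates.values()) - min(rates.values()))


if __name__ == "__main__":
    # Two hypothetical samples sharing the overlap window 0.5 s to 2.0 s.
    samples = [
        ([(0.0, 1.0)], (0.5, 2.0)),  # model stops at 1.0 s: yields
        ([(0.0, 3.0)], (0.5, 2.0)),  # model talks through the window: holds
    ]
    print(acoustic_turn_over_rate(samples))
```

Because the rule only consumes speech segments, the same function can be run on the output of different VAD backends, and `backend_spread_pp` then summarizes how far their ATR estimates diverge.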
