AITW: The Annotated In-the-Wild Dataset for Filtering of In-the-Wild Speech Data

Version 1.1: Some missing entries in the Emilia file map have been updated and a number of YODAS archive URLs have been corrected.

The Annotated In-The-Wild (AITW) dataset accompanies the paper “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification”, accepted at Interspeech 2025 in Rotterdam. This dataset supports research into automated filtering of noisy or undesirable audio segments in large-scale, real-world speech corpora, particularly for training high-quality English TTS and ASR models.

AITW includes over 21,000 manually labeled audio samples (≈64 hours) from two popular in-the-wild speech datasets (Emilia and YODAS). Each audio clip is annotated at the utterance level with binary or numerical labels for five key properties.

Numerical labels:
- Speaker count

Binary labels:
- Non-English (foreign) language
- Background music
- Noisy or poor-quality speech
- Synthetic (spoofed) speech

Annotations were performed by expert annotators using a custom Label Studio interface, with consistent guidelines applied across all tasks. This dataset enables the benchmarking of multi-task classification models like Whilter and comparison with single-task baselines.

AITW is designed to foster further research in scalable speech data curation and low-resource dataset bootstrapping. We encourage contributions and improvements through the included Label Studio GUI.

Files include:
- Labeled audio metadata, along with file maps that map back to the data in YODAS and Emilia (.csv or .json)
- Interface config for Label Studio (.xml)

If you use this dataset, please cite:
W. Ravenscroft, G. Close, K. Bower-Morris, J. Stacey, D. Sityaev, K. Hong. “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification,” Interspeech 2025.

License: Creative Commons Attribution 4.0 International (CC BY 4.0)
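As a rough illustration of how the five utterance-level labels described above might be used for filtering, the following Python sketch parses a small CSV and keeps only clips with no flagged property. Everything specific here is an assumption made for illustration: the column names (`clip_id`, `speaker_count`, `non_english`, `background_music`, `noisy`, `synthetic`), the `0`/`1` encoding of the binary labels, the sample rows, and the choice to treat a single-speaker clip as the desired case. The actual metadata files may use different names and encodings, so check them before adapting this.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ClipLabels:
    """One annotated clip. Field names are hypothetical, not the dataset's own."""
    clip_id: str
    speaker_count: int       # numerical label
    non_english: bool        # binary labels below
    background_music: bool
    noisy: bool
    synthetic: bool


def parse_rows(csv_text: str) -> list[ClipLabels]:
    """Parse CSV text whose header uses the hypothetical column names above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ClipLabels(
            clip_id=row["clip_id"],
            speaker_count=int(row["speaker_count"]),
            non_english=row["non_english"] == "1",
            background_music=row["background_music"] == "1",
            noisy=row["noisy"] == "1",
            synthetic=row["synthetic"] == "1",
        )
        for row in reader
    ]


def is_clean(clip: ClipLabels) -> bool:
    """Assumed filtering rule: exactly one speaker and no binary flag set."""
    flagged = (
        clip.non_english
        or clip.background_music
        or clip.noisy
        or clip.synthetic
    )
    return clip.speaker_count == 1 and not flagged


# Made-up rows for illustration only; these are not taken from AITW.
SAMPLE = """clip_id,speaker_count,non_english,background_music,noisy,synthetic
a,1,0,0,0,0
b,2,0,0,0,0
c,1,0,1,0,0
"""

clips = parse_rows(SAMPLE)
clean_ids = [c.clip_id for c in clips if is_clean(c)]
print(clean_ids)
```

In this sketch the rule in `is_clean` is deliberately strict; a real curation pipeline might relax individual conditions (for example, tolerating background music for ASR training) depending on the target task.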

Keywords

In-The-Wild Data, Audio Classification, Speech Datasets
