Name: Zero-shot KWS for children’s speech using layer-wise features from SSL models
Keywords: Human-Computer Interaction, Signal Processing (eess.SP), FOS: Computer and information sciences, Sound (cs.SD), Sound, Artificial Intelligence (cs.AI), Artificial Intelligence, Audio and Speech Processing (eess.AS), Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering

Publication: Article, Preprint; 01 Nov 2025; Embargo end date: 01 Jan 2025; English; Publisher: Elsevier BV; Journal: Pattern Recognition Letters, volume 197, pages 304-311 (issn: 0167-8655,

Authors: Subham Kutum; Abhijit Sinha; Hemant Kumar Kathania; Sudarsana Reddy Kadiri; Mahesh Chandra Govil;

doi: 10.1016/j.patrec.2025.08.010, 10.48550/arxiv.2508.21248

arXiv: http://arxiv.org/abs/2508.21248



Abstract

Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children's speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children's speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children's speech. Notably, the Wav2Vec2 model, particularly layer 22, performed the best, delivering an ATWV score of 0.691, an MTWV score of 0.7003, and a probability of false alarm and probability of miss of 0.0164 and 0.0547, respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system's effectiveness across different age groups of children. To assess the system's robustness against noise, additional experiments were conducted using the best-performing layer of the best-performing Wav2Vec2 model. The results demonstrated a significant improvement over a traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated on an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing zero-shot KWS performance for children's speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.
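
The abstract reports ATWV and MTWV without defining them. As a rough illustration only, the sketch below follows the term-weighted value (TWV) definition from the NIST 2006 Spoken Term Detection evaluation, where TWV at a threshold is 1 - (P_miss + beta * P_FA) averaged over keywords, ATWV is the value at the system's actual decision threshold, and MTWV is the maximum over all thresholds. The `Detection` class, the function names, and the toy data are hypothetical, and the paper's actual Kaldi scoring configuration (including how its reported false-alarm probability is normalized) is not given here and may differ.

```python
from dataclasses import dataclass

# Assumption: NIST STD 2006 default weight, (C/V) * (1/Pr_kw - 1)
# with C/V = 0.1 and Pr_kw = 1e-4.
BETA = 999.9


@dataclass
class Detection:
    keyword: str
    score: float   # system confidence for this putative occurrence
    is_hit: bool   # True if it matches a reference occurrence


def twv(detections, n_true, speech_seconds, threshold, beta=BETA):
    """Term-weighted value at one threshold, averaged over keywords
    that occur at least once in the reference."""
    values = []
    for kw, n_ref in n_true.items():
        if n_ref == 0:
            continue  # P_miss is undefined for keywords with no reference occurrence
        kept = [d for d in detections if d.keyword == kw and d.score >= threshold]
        n_correct = sum(d.is_hit for d in kept)
        n_false = len(kept) - n_correct
        p_miss = 1.0 - n_correct / n_ref
        # Non-target trials are approximated as one per second of speech.
        p_fa = n_false / (speech_seconds - n_ref)
        values.append(1.0 - (p_miss + beta * p_fa))
    return sum(values) / len(values)


def mtwv(detections, n_true, speech_seconds, beta=BETA):
    """Maximum TWV over candidate thresholds (every observed score,
    plus +inf, which rejects all detections)."""
    thresholds = sorted({d.score for d in detections}) + [float("inf")]
    return max(twv(detections, n_true, speech_seconds, t, beta) for t in thresholds)


# Toy data (hypothetical): two keywords, 1000 seconds of test speech.
toy_detections = [
    Detection("yes", 0.9, True),
    Detection("yes", 0.6, True),
    Detection("yes", 0.3, False),
    Detection("no", 0.8, True),
    Detection("no", 0.4, False),
]
toy_n_true = {"yes": 2, "no": 1}

atwv_at_half = twv(toy_detections, toy_n_true, 1000.0, threshold=0.5)
best = mtwv(toy_detections, toy_n_true, 1000.0)
```

Because beta is large, a single false alarm per keyword costs roughly as much as missing every occurrence, which is why ATWV can never exceed MTWV and why a small gap between the two (as in the 0.691 and 0.7003 reported above) suggests a well-calibrated decision threshold.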

Accepted

Keywords

Human-Computer Interaction, Signal Processing (eess.SP), FOS: Computer and information sciences, Sound (cs.SD), Sound, Artificial Intelligence (cs.AI), Artificial Intelligence, Audio and Speech Processing (eess.AS), Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, Audio and Speech Processing, Human-Computer Interaction (cs.HC)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.



Related to Research communities

Knowmad Institut