Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training

Qi, Gege; Chen, Yuefeng; Mao, Xiaofeng; Jia, Xiaojun; Duan, Ranjie; Zhang, Rong; Xue, Hui

arXiv.org e-Print Archive

Preprint . 2023

Data sources: arXiv.org e-Print Archive

https://doi.org/10.21437/inter...

Article . 2023 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2023

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

Publication . Article, Preprint . 20 Aug 2023 . Embargo end date: 01 Jan 2023 . Publisher: ISCA . Journal: INTERSPEECH 2023

doi: 10.21437/interspeech.2023-1556 , 10.48550/arxiv.2307.12498

arXiv: 2307.12498

Abstract

Developing a practically robust automatic speech recognition (ASR) system is challenging, since the model should not only maintain its original performance on clean samples but also achieve consistent efficacy under small-volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (WAPAT). WAPAT uses adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in phoneme representation while preserving performance on clean samples. In addition, WAPAT uses the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient directions, resulting in improved generalization. Extensive experiments demonstrate the effectiveness of WAPAT on the End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-WAPAT outperforms the original model by a 6.28% WER reduction on ESB, achieving a new state of the art.
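The abstract does not specify how the augmented sample's phoneme representation enters the adversarial objective. The following is a minimal toy sketch of one possible reading, not the authors' implementation. It assumes a hypothetical linear phoneme classifier `W`, additive noise as a stand-in for WavAugment, and an L-infinity bounded perturbation `delta` that is pulled toward the augmented sample's phoneme distribution by signed-gradient steps. All names and hyperparameters (`guided_phoneme_perturbation`, `epsilon`, `step`, `n_steps`) are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(q, p, eps=1e-12):
    # KL(q || p) between two discrete phoneme distributions
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def guided_phoneme_perturbation(W, x, x_aug, epsilon=0.05, step=0.01, n_steps=10):
    """Toy sketch (assumption): search for delta with |delta|_inf <= epsilon so that
    the phoneme distribution of x + delta moves toward that of x_aug."""
    q = softmax(W @ x_aug)               # guide: phonemes of the augmented input
    delta = np.zeros_like(x)
    best_delta, best_kl = delta.copy(), kl(q, softmax(W @ x))
    for _ in range(n_steps):
        p = softmax(W @ (x + delta))
        grad = W.T @ (p - q)             # gradient of cross-entropy(q, p) w.r.t. delta
        delta = np.clip(delta - step * np.sign(grad), -epsilon, epsilon)
        cur = kl(q, softmax(W @ (x + delta)))
        if cur < best_kl:                # keep the best perturbation seen so far
            best_delta, best_kl = delta.copy(), cur
    return best_delta, best_kl

# Hypothetical data: 8 input features, 5 phoneme classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))
x = rng.normal(size=8)
x_aug = x + 0.1 * rng.normal(size=8)     # stand-in for a WavAugment transform
kl_before = kl(softmax(W @ x_aug), softmax(W @ x))
delta, kl_after = guided_phoneme_perturbation(W, x, x_aug)
```

In an adversarial training loop, `x + delta` would then be fed to the recognizer alongside the clean sample with the original transcript as the target. That step is omitted here because the abstract gives no details on the training loss.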

Related Organizations

Chinese Academy of Sciences (中国科学院)
China (People's Republic of)
Alibaba Group (China)
China (People's Republic of)

Keywords

FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

Related research

SpeechT5 software on GitHub (IsRelatedTo)
gpuRIR software on GitHub (IsRelatedTo)

Impact by BIP!

selected citations: 1 (citations derived from selected sources)
popularity: Average (the "current" impact or attention of an article in the research community at large, based on the underlying citation network)
influence: Average (the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of an article directly after its publication, based on the underlying citation network)