Minimum Word Error Rate Training for Speech Separation

The cocktail party problem, also known as a single-channel multi-talker problem, is a significant challenge to enhance the performance of automatic speech recognition (ASR) systems. Most existing speech separation model only concerns the signal-level performance, i.e., source-to-distortion ratio (SDR), via their cost/loss function, not a transcription-level performance. However, transcription-level measurement, such as word error rate (WER) is the ultimate measurement that can be used in the performance of ASR. Therefore we propose a new loss function that can directly consider both signal and transcription level performance with integrating both speech separation and speech recognition system. Moreover, we suggest the generalized integration architecture that can be applied to any combination of speech recognition/separation system regardless of their system environment. In this thesis, first, we review the techniques from the primary signal processing knowledge to deep learning techniques and introduce the detailed target and challenge in speech separation problem. Moreover, we analyze the several famous speech separation models derived from a deep learning approach. Then we introduce the new loss function with our detailed system architecture, including the step-by-step process from pre-processing to evaluation. We improve the performance of the existing model using our training approach. Our solution enhances average SDR from 0.10dB to 4.09dB and average WER from 92.7% to 55.7% using LibriSpeech dataset.

Master's thesis in Computer science

Country

Norway

Related Organizations

University of Stavanger
Norway

Keywords

datateknologi, informasjonsteknologi, datateknikk, VDP::Technology: 500::Information and communication technology: 550

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average

Green

Related to Research communities

UArctic