Feature-based robust techniques for speech recognition

Name: Feature-based robust techniques for speech recognition
Creator: Nguyen, Duc Hoang Ha
Keywords: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition, 004, 620


Data sources

- UnpayWall: Doctoral thesis, https://dr.ntu.edu.sg/bitstrea...
- Digital Repository of NTU: Thesis, 2017
- Crossref: Doctoral thesis, 2020, peer-reviewed, https://doi.org/10.32657/10356...
- DBLP: Doctoral thesis
- Microsoft Academic Graph: Thesis, https://dx.doi.org/10.32657/10...
- Bielefeld Academic Search Engine (BASE): Thesis, 2017, DR-NTU (Digital Repository at Nanyang Technological University, Singapore)

Publication: Doctoral thesis, Thesis . 28 Oct 2020 . Singapore . Publisher: Nanyang Technological University

Authors: Nguyen, Duc Hoang Ha

doi: 10.32657/10356/69839

handle: 10356/69839


Abstract

Automatic speech recognition (ASR) decodes speech signals into text. While ASR can recognise words accurately in clean environments, its accuracy degrades considerably under noisy conditions; the robustness of ASR systems in real-world applications therefore remains a challenge. In this thesis, speech feature enhancement and model adaptation for robust speech recognition are studied, and three novel methods to improve performance are introduced.

The first work proposes a modification of the spectral subtraction method to reduce the non-stationary characteristics of additive noise in the speech. The main idea is to first normalise the noise's characteristics towards a Gaussian noise model, and then tackle the remaining noise with a model compensation method. The strategy is to reduce the noise-handling problem to the back-end process. In this work, the back-end compensation is applied using the vector Taylor series (VTS) model compensation approach, and we call this method noise normalization VTS (NN-VTS).

The second work proposes an extension of particle filter compensation (PFC) for the large vocabulary continuous speech recognition (LVCSR) task. PFC is a method for tracking clean speech features that uses side information from hidden Markov models (HMMs) within the particle filter framework. However, under noisy conditions for sub-word-based LVCSR, obtaining an accurately aligned HMM state sequence that describes the underlying clean speech features is challenging, because the total number of triphone models involved can be very large. To improve the identification of the correct phone sequence, this work proposes to use a noisy-model HMM trained on noisy data to estimate the state sequence and a parallel clean-model HMM trained on clean data to generate the clean speech features. These two HMMs are trained jointly, and the alignment of states between the clean and noisy HMMs is obtained by the single pass retraining (SPR) technique. With this approach, the accuracy of the state sequence estimate is improved by the noisy-model HMM, and the accurately aligned states are obtained by the SPR technique. When the missing side information for PFC is available, a word error reduction of 28.46% compared with multi-condition training is observed on the Aurora-4 task.

The third work proposes a novel spectro-temporal transform framework to improve the word error rate in noisy and reverberant environments. Motivated by the finding that human speech comprehension relies on both the spectral content and the temporal envelope of the speech signal, a spectro-temporal (ST) transform framework is proposed. This framework adapts the features to minimize the mismatch between the input features and the training data using a cost function based on the Kullback-Leibler divergence. In our work, we examined two implementations to overcome the issue of limited adaptation data. The first implementation is a cross transform, which is a sparse spectro-temporal transform. The second implementation is a cascade of a temporal transform and a spectral transform. Experiments are conducted on the REVERB Challenge 2014 task, where clean and multi-condition trained acoustic models are tested with real reverberant and noisy speech. Experimental results confirm that temporal information is important for reverberant speech recognition and that the simultaneous use of spectral and temporal information for feature adaptation is effective.

Doctor of Philosophy (SCE)
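For readers unfamiliar with the baseline that the first work modifies, the following is a minimal sketch of textbook power spectral subtraction, not the thesis's modified method or NN-VTS. It assumes the leading frames contain only noise, and the names `spectral_subtraction`, `noise_frames`, `alpha` (over-subtraction factor) and `beta` (spectral floor) are illustrative choices rather than anything taken from the thesis.

```python
from typing import List


def spectral_subtraction(
    power_frames: List[List[float]],
    noise_frames: int = 5,
    alpha: float = 1.0,
    beta: float = 0.01,
) -> List[List[float]]:
    """Textbook power spectral subtraction with a spectral floor.

    power_frames: one list of per-bin power values for each time frame.
    noise_frames: number of leading frames assumed to be speech-free.
    alpha:        over-subtraction factor applied to the noise estimate.
    beta:         floor, as a fraction of the noisy power in each bin.
    """
    if not power_frames:
        return []
    n_bins = len(power_frames[0])
    lead = power_frames[:noise_frames]
    # Stationary noise estimate: per-bin mean over the leading frames.
    noise = [sum(frame[k] for frame in lead) / len(lead) for k in range(n_bins)]
    enhanced = []
    for frame in power_frames:
        # Subtract the scaled noise estimate, then floor the result so that
        # no bin becomes negative or implausibly small.
        enhanced.append(
            [max(frame[k] - alpha * noise[k], beta * frame[k]) for k in range(n_bins)]
        )
    return enhanced


if __name__ == "__main__":
    # Two noise-only frames followed by one frame with extra energy in bin 0.
    frames = [[1.0, 2.0], [1.0, 2.0], [5.0, 2.5]]
    print(spectral_subtraction(frames, noise_frames=2))
```

Because the noise estimate here is a single fixed average, this baseline handles only stationary noise well; the abstract's first work targets exactly the non-stationary residue that such an estimate leaves behind.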

Country

Singapore

Related Organizations

Nanyang Technological University
Singapore
