Powered by OpenAIRE graph
Edinburgh Research A...
DBLP
Doctoral thesis . 2024
Data sources: DBLP

Learning to adapt: meta-learning approaches for speaker adaptation

Authors: Klejch, Ondrej

Abstract

The performance of automatic speech recognition systems degrades rapidly when there is a mismatch between training and testing conditions. One way to compensate for this mismatch is to adapt an acoustic model to test conditions, for example by performing speaker adaptation. In this thesis we focus on the discriminative model-based speaker adaptation approach. The success of this approach relies on having a robust speaker adaptation procedure: we need to specify which parameters should be adapted and how they should be adapted. Unfortunately, tuning the speaker adaptation procedure requires considerable manual effort.

In this thesis we propose to formulate speaker adaptation as a meta-learning task. In meta-learning, learning occurs on two levels: a learner learns a task-specific model and a meta-learner learns how to train these task-specific models. In our case, the learner is a speaker-dependent model and the meta-learner learns to adapt a speaker-independent model into the speaker-dependent model. By using this formulation, we can automatically learn robust speaker adaptation procedures using gradient descent. In the experiments, we demonstrate that the meta-learning approach learns adaptation schedules that are competitive with adaptation procedures using handcrafted hyperparameters.

Subsequently, we show that speaker-adaptive training can be formulated as a meta-learning task as well. In contrast to the traditional approach, which maintains and optimises a copy of speaker-dependent parameters for each speaker during training, we embed the gradient-based adaptation directly into the training of the acoustic model. We hypothesise that this formulation should steer the training of the acoustic model towards parameters better suited for test-time speaker adaptation. We experimentally compare our approach with test-only adaptation of a standard baseline model and with SAT-LHUC, which represents a traditional speaker-adaptive training method. We show that the meta-learning speaker-adaptive training approach achieves results comparable to SAT-LHUC. However, neither the meta-learning approach nor SAT-LHUC outperforms the baseline approach after adaptation.

Consequently, we run a series of experimental ablations to determine why SAT-LHUC does not yield any improvements compared to the baseline approach. In these experiments we explore multiple factors, such as neural network architectures, normalisation techniques, activation functions and optimisers. We find that SAT-LHUC interferes with batch normalisation, and that it benefits from an increased hidden layer width and an increased model size. However, the baseline model benefits from increased capacity too; therefore, to obtain the best model it is still favourable to train a speaker-independent model with batch normalisation. As such, an effective way of training state-of-the-art SAT-LHUC models remains an open question.

Finally, we show that the performance of unsupervised speaker adaptation can be further improved by using discriminative adaptation with lattices obtained from a first-pass decoding as supervision, instead of the traditionally used one-best path transcriptions. We find that this proposed approach enables many more parameters to be adapted without overfitting being observed, and is successful even when the initial transcription has a WER in excess of 50%.
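The two-level formulation described in the abstract can be illustrated with a deliberately tiny sketch. It is not the thesis's implementation: it assumes a scalar linear model in place of an acoustic model, a single gradient step as the adaptation procedure, and a single learnable adaptation learning rate as the only thing the meta-learner controls. All names (`mse_grad`, `adapt`, `meta_gradient`, `make_speaker`, `meta_learn_rate`) and all data are hypothetical.

```python
# Toy sketch of meta-learning an adaptation learning rate.
# Assumption: a scalar model y = w * x stands in for the acoustic model,
# and each "speaker" is a different true slope. All names are hypothetical.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def adapt(w, alpha, xs, ys):
    """Learner: one gradient step from the shared weight to a speaker-dependent one."""
    return w - alpha * mse_grad(w, xs, ys)


def meta_gradient(w, alpha, adapt_set, eval_set):
    """Derivative of the held-out loss after adaptation with respect to alpha."""
    g_adapt = mse_grad(w, *adapt_set)
    w_adapted = w - alpha * g_adapt
    # Chain rule: dL_eval/dalpha = dL_eval/dw_adapted * dw_adapted/dalpha,
    # where dw_adapted/dalpha = -g_adapt.
    return mse_grad(w_adapted, *eval_set) * (-g_adapt)


def make_speaker(slope):
    """Build noise-free (adaptation, held-out) data for one synthetic speaker."""
    adapt_x = [1.0, 2.0, 3.0]
    eval_x = [0.5, 1.5, 2.5]
    return ((adapt_x, [slope * x for x in adapt_x]),
            (eval_x, [slope * x for x in eval_x]))


def meta_learn_rate(w_si=1.0, alpha=0.01, meta_lr=1e-4, epochs=200):
    """Meta-learner: tune alpha by gradient descent over a set of speakers."""
    speakers = [make_speaker(s) for s in (0.5, 1.5, 2.0, 3.0)]
    for _ in range(epochs):
        for adapt_set, eval_set in speakers:
            alpha -= meta_lr * meta_gradient(w_si, alpha, adapt_set, eval_set)
    return alpha


learned_alpha = meta_learn_rate()
print("learned adaptation learning rate:", learned_alpha)
```

In this quadratic toy problem the best single-step learning rate is 1 / (2 · mean(x²)) over the adaptation inputs, so the meta-learner has a known target to converge towards. A real system would differentiate through adaptation of neural network parameters with automatic differentiation rather than a hand-derived chain rule.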

Country
United Kingdom
Keywords

meta-learning, automatic speech recognition, speaker adaptation
