Improving Attention-based Sequence-to-sequence Models

Authors: Dou, Qingyun

Abstract

Attention-based models have achieved state-of-the-art performance in various sequence-to-sequence tasks, including Neural Machine Translation (NMT), Automatic Speech Recognition (ASR) and speech synthesis, also known as Text-To-Speech (TTS). These models are often autoregressive, which leads to high modeling capacity, but also makes training difficult. The standard training approach, teacher forcing, suffers from exposure bias: during training the model is guided with the reference output, but at inference the generated output must be used. To address this issue, scheduled sampling and professor forcing guide a model with both the reference and the generated output history. To facilitate convergence, they depend on a heuristic schedule or an auxiliary classifier, which can be difficult to tune. Alternatively, sequence-level training approaches guide the model with the generated output history and optimize a sequence-level criterion. However, many tasks, such as TTS, do not have a well-established sequence-level criterion. In addition, the generation process is often sequential, which is undesirable for parallelizable models such as the Transformer. This thesis introduces attention forcing and deliberation networks to improve attention-based sequence-to-sequence models. Attention forcing guides a model with the generated output history and the reference attention. The training criterion is a combination of maximum log-likelihood and the KL-divergence between the reference attention and the generated attention. This approach does not rely on a heuristic schedule or a classifier, and does not require a sequence-level criterion. Variations of attention forcing are proposed for more challenging application scenarios. For tasks such as NMT, the output space is multi-modal in the sense that, given an input, the distribution of the corresponding output can be multi-modal. Hence a selection scheme is introduced to automatically turn attention forcing on and off depending on the mode of attention. For parallelizable models, an approximation scheme is proposed to run attention forcing in parallel across time. Deliberation networks consist of multiple attention-based models. The output is generated in multiple passes, each one conditioned on the initial input and the free-running output of the previous pass. This thesis shows that deliberation networks can address exposure bias, which is essential for performance gains. In addition, various training approaches are discussed, and a separate training approach is proposed for its synergy with parallelizable models. Finally, for tasks where the output space is continuous, such as TTS, deliberation networks tend to ignore the free-running outputs, thus losing their benefits. To address this issue, a guided attention loss is proposed to regularize the corresponding attention, encouraging the use of the free-running outputs. TTS and NMT are investigated as example sequence-to-sequence tasks, and task-specific techniques are proposed, such as neural vocoder adaptation using attention forcing. The experiments demonstrate that attention forcing improves overall performance and diversity. It is also demonstrated that deliberation networks improve overall performance and reduce the chances of attention failure.
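The attention forcing criterion summarised above combines a maximum log-likelihood term on the output with a KL-divergence between the reference attention and the attention generated while the model follows its own output history. As an illustration only, the following is a minimal PyTorch-style sketch of such a combined loss; the function name, tensor shapes and the kl_weight balancing factor are assumptions made for the example, not details taken from the thesis.

    import torch
    import torch.nn.functional as F

    def attention_forcing_loss(pred_log_probs, target_tokens,
                               gen_attention, ref_attention,
                               kl_weight=1.0):
        """Hypothetical sketch: maximum log-likelihood plus an attention KL term.

        pred_log_probs: (batch, steps, vocab) log-probabilities from the model
                        guided by its own output history.
        target_tokens:  (batch, steps) reference output tokens.
        gen_attention:  (batch, steps, source_len) generated attention weights.
        ref_attention:  (batch, steps, source_len) reference attention weights.
        """
        # Negative log-likelihood of the reference output tokens
        # (the maximum log-likelihood part of the criterion).
        nll = F.nll_loss(pred_log_probs.transpose(1, 2), target_tokens)

        # KL(reference attention || generated attention); F.kl_div expects
        # log-probabilities as input and probabilities as target.
        kl = F.kl_div(gen_attention.clamp_min(1e-8).log(),
                      ref_attention, reduction="batchmean")

        return nll + kl_weight * kl

    # Toy usage with random tensors: batch of 2, 5 decoder steps,
    # vocabulary of 10, 7 source positions.
    log_probs = torch.randn(2, 5, 10).log_softmax(dim=-1)
    targets = torch.randint(0, 10, (2, 5))
    gen_att = torch.rand(2, 5, 7).softmax(dim=-1)
    ref_att = torch.rand(2, 5, 7).softmax(dim=-1)
    loss = attention_forcing_loss(log_probs, targets, gen_att, ref_att)

Under these assumptions, the KL term pulls the generated attention towards the reference attention, which is what allows the model to be guided by its own output history without a heuristic schedule or an auxiliary classifier.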

Keywords

machine learning, speech synthesis, sequence-to-sequence models, machine translation

  • Impact indicators (by BIP!)
    Selected citations (citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network, diachronically): 0
    Popularity (the "current" impact/attention, the "hype", of an article in the research community at large, based on the underlying citation network): Average
    Influence (the overall/total impact of an article in the research community at large, based on the underlying citation network, diachronically): Average
    Impulse (the initial momentum of an article directly after its publication, based on the underlying citation network): Average