Transformers without Tears: Improving the Normalization of Self-Attention

Publication type: Conference object, Article, Preprint, Other literature type
Date: 01 Jan 2019 (embargo end date: 01 Jan 2019)
Language: English
Publisher: Zenodo
Journal: CoRR, volume abs/1910.05895

Authors: Toan Q. Nguyen; Julian Salazar

doi: 10.5281/zenodo.3525483, 10.48550/arxiv.1910.05895, 10.5281/zenodo.3525484

arXiv: 1910.05895

Abstract

We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose $\ell_2$ normalization with a single scale parameter (ScaleNorm) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FixNorm). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT'15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT'14 English-German), ScaleNorm and FixNorm remain competitive but PreNorm degrades performance.

Accepted to IWSLT 2019 (oral); code is available at https://github.com/tnq177/transformers_without_tears
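
The three changes named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation (see the linked repository for that); it only shows the shape of each operation on a single vector. The function names, the `eps` clamp that guards against division by zero, and the `length` parameter of `fix_norm` are assumptions made for illustration.

```python
import math


def scale_norm(x, g, eps=1e-5):
    """ScaleNorm sketch: rescale x so its l2 norm equals the single scalar g.

    In a trained model g would be a learned parameter; here it is passed in.
    The eps clamp is an assumption to avoid dividing by zero.
    """
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / max(norm, eps) for v in x]


def fix_norm(embedding, length=1.0, eps=1e-5):
    """FixNorm sketch: normalize a word embedding to a fixed l2 length.

    `length` is a hypothetical parameter name used for this example.
    """
    return scale_norm(embedding, length, eps)


def post_norm_block(x, sublayer, norm):
    """PostNorm residual: add the residual first, then normalize the sum."""
    return norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer, norm):
    """PreNorm residual: normalize only the sublayer input; the skip path
    carries x through unchanged."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


if __name__ == "__main__":
    x = [3.0, 4.0]

    def toy_sublayer(v):
        # Stand-in for attention or a feed-forward layer (assumption).
        return [2.0 * t for t in v]

    def unit_norm(v):
        return scale_norm(v, 1.0)

    print(scale_norm(x, 2.0))
    print(pre_norm_block(x, toy_sublayer, unit_norm))
    print(post_norm_block(x, toy_sublayer, unit_norm))
```

The difference between the two residual blocks is where the normalization sits: in the PreNorm form the input reaches the output through an identity path, while in the PostNorm form every layer's output is renormalized.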

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning, Machine Learning (stat.ML), Computation and Language (cs.CL), Machine Learning (cs.LG)

Impact by BIP!

- selected citations: 4. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.