Publication: Article, Preprint, 01 Jan 2022
Embargo end date: 01 Jan 2021
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Authors: Xiong, Wenhan; Oğuz, Barlas; Gupta, Anchit; Chen, Xilun; Liskovich, Diana; Levy, Omer; Yih, Wen-tau; +1 Authors

doi: 10.18653/v1/2022.naacl-main.144, 10.48550/arxiv.2112.07210

arXiv: http://arxiv.org/abs/2112.07210

Simple Local Attentions Remain Competitive for Long-Context Tasks

Abstract

Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show that none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms. Further analysis on local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results: using disjoint local attentions, we are able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer (Beltagy et al., 2020) with half of its pretraining compute. The code to replicate our experiments can be found at https://github.com/pytorch/fairseq/tree/main/examples/xformers
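
To make the contrast between overlapping and disjoint local attention concrete, the following is a minimal illustrative sketch, not the authors' implementation (their code lives in the fairseq repository linked above). The function names, the window and block sizes, and the choice of a symmetric window are all assumptions made for illustration. It only builds boolean masks showing which token pairs each pattern allows.

```python
def sliding_window_mask(seq_len, window):
    """Overlapping local attention (hypothetical helper).

    mask[i][j] is True when token i may attend to token j,
    i.e. when j lies within window // 2 positions of i.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def disjoint_block_mask(seq_len, block):
    """Disjoint local attention (hypothetical helper).

    The sequence is cut into non-overlapping blocks of size `block`;
    a token attends only to tokens in its own block.
    """
    return [[i // block == j // block for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    overlap = sliding_window_mask(seq_len=8, window=4)
    disjoint = disjoint_block_mask(seq_len=8, block=4)
    # Tokens 3 and 4 are neighbours but sit in different blocks.
    print("sliding window lets 3 attend to 4:", overlap[3][4])
    print("disjoint blocks let 3 attend to 4:", disjoint[3][4])
    print("allowed pairs:", attended_pairs(overlap), attended_pairs(disjoint))
```

In the disjoint variant, neighbouring tokens on opposite sides of a block boundary cannot attend to each other directly, which is the overlap the abstract reports as unnecessary for good downstream results.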

NAACL 2022 Main Conference

Keywords

FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)

Related research

Long Context Question Answering via Supervised Contrastive Learning (2022)
Large-Context Question Answering with Cross-Lingual Transfer (2021)
The effect of long context exposure on cued conditioning and c-fos expression in the rat forebrain (2005)
Simple Local Attentions Remain Competitive for Long-Context Tasks (2022)
fairseq software on GitHub (related to this publication)

Impact by BIP!

citations: 6. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.