Preprint
Data sources: ZENODO

Dual-Head Attention Enables Length Generalization in Transformer Multiplication

Authors: Yan, Tianshi

Abstract

Transformers fail to generalize beyond their training lengths on arithmetic tasks. We argue that the root cause is geometric: dot-product attention projects onto the subspace spanned by the training data and cannot capture structural patterns that are orthogonal to content similarity. We introduce Dual-Head Attention, which adds Gram-Schmidt-orthogonalized sine heads alongside the standard cosine heads. On N×N integer multiplication, an 883K-parameter model trained on 1- to 6-digit operands achieves 80.6% exact-match accuracy on unseen 7- to 10-digit operands, where a standard Transformer of identical capacity scores near zero. The model uses no scratchpad and no task-specific positional encoding. Code: https://github.com/yzb3001313-star/Dual-Head-Attention-Enables-Length-Generalization
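
The abstract does not spell out how the sine heads are computed, so the following is only one possible reading, offered as a minimal sketch rather than the author's implementation (the linked repository is the authoritative source). It assumes that a "cosine head" scores a query-key pair by the cosine of the angle between them, and that a "sine head" scores the same pair by the norm of the Gram-Schmidt residual of the key after removing its component along the query, which equals the sine of that angle for unit vectors. The names `cos_sin_scores`, `dual_head_attention`, and the `scale` parameter are hypothetical.

```python
import numpy as np

def cos_sin_scores(Q, K, eps=1e-9):
    """Return (cos, sin) score matrices for query rows Q and key rows K.

    Assumption (not from the paper): both scores are computed on
    unit-normalized vectors.
    """
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    cos = Qn @ Kn.T  # shape (n_q, n_k)
    # Gram-Schmidt step: remove from each key its component along each query.
    # residual[i, j] = Kn[j] - cos[i, j] * Qn[i], which is orthogonal to Qn[i].
    residual = Kn[None, :, :] - cos[:, :, None] * Qn[:, None, :]
    sin = np.linalg.norm(residual, axis=-1)  # shape (n_q, n_k)
    return cos, sin

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_head_attention(Q, K, V, scale=1.0):
    """Hypothetical pairing of one cosine head and one sine head.

    Each head attends over the same values V with its own score matrix;
    the two outputs are concatenated along the feature axis.
    """
    cos, sin = cos_sin_scores(Q, K)
    out_cos = softmax(scale * cos) @ V
    out_sin = softmax(scale * sin) @ V
    return np.concatenate([out_cos, out_sin], axis=-1)
```

Under these assumptions the two scores satisfy cos² + sin² = 1 for every query-key pair, so the sine head carries exactly the part of the key that the dot product discards. Whether the actual model normalizes vectors, shares projections between heads, or combines head outputs by concatenation is not stated in the abstract.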
