
Using the Tuple-Structured Associative Recall task to isolate retrieval, we demonstrate that Transformer models learn high-magnitude spherical codes (sets of vectors with a guaranteed minimum pairwise angular separation) and can achieve perfect accuracy and robust length generalization down to single-digit head dimensions. We show by construction that attention's single-head retrieval capacity $N$ approaches the representational limit of the subspaces it projects from, and is therefore unbounded for real-valued (infinite-precision) representations. Given $b$ bits of precision per input coordinate, capacity scales as $N \approx 2^{b d_k}$, or equivalently $N \approx 2^{B}$ where $B = b d_k$ is the total bit budget per key. For a fixed budget $B$, a head dimension $d_k \geq 2$ does not increase capacity, but it does influence how efficiently a given spherical code can approach this representational limit.
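
As a minimal numerical sketch of this kind of construction (illustrative only; the key placement, scale, and argmax readout below are our own assumptions, not the paper's implementation), placing $N$ keys at equally spaced angles on a circle of large radius gives a $d_k = 2$ spherical code from which a single dot-product attention head retrieves every stored value:

```python
import numpy as np

# Illustrative sketch, not the paper's code: with head dimension d_k = 2, place
# N keys at equally spaced angles on a circle of large radius. This is a
# spherical code with minimum pairwise angular separation 2*pi/N, and the high
# magnitude makes the softmax sharp enough that one attention head reads out
# each stored value essentially exactly.

N = 1000            # number of stored key-value pairs (far exceeds d_k)
d_k = 2             # head dimension
scale = 1_000.0     # key/query magnitude ("high-magnitude" code): sharpens the softmax

angles = 2 * np.pi * np.arange(N) / N
keys = scale * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (N, d_k)
values = np.eye(N)                                                 # value i is the one-hot id of pair i

def retrieve(i: int) -> int:
    """Attend with a query aligned to key i and return the index read out."""
    q = keys[i]                                # query pointing at stored key i
    logits = keys @ q                          # dot-product attention scores, shape (N,)
    weights = np.exp(logits - logits.max())    # numerically stable softmax
    weights /= weights.sum()
    readout = weights @ values                 # attention-weighted value, ~one-hot at i
    return int(readout.argmax())

# All N pairs are retrieved correctly despite the tiny head dimension.
assert all(retrieve(i) == i for i in range(N))
print(f"retrieved all {N} pairs with d_k = {d_k}")
```

The large `scale` mirrors the high-magnitude codes described above: it concentrates nearly all softmax weight on the best-matching key, so the readout is effectively exact, and how many keys remain distinguishable under finite precision is what the $N \approx 2^{B}$ scaling bounds.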
