A U-shape in dot-product attention under gradient flow

We prove that gradient flow on a single dot-product self-attention layer with tied unembedding, trained on cross-entropy loss in a two-token setting, exhibits a non-monotone trajectory in the Frobenius norm of its interaction matrix $M$: starting from any \emph{sharp-wrong} initialization, in which both tokens attend strongly to the wrong target, there exists $T_1 \in (0, \infty)$ such that $\|M(t)\|_F^2$ is strictly decreasing on $[0, T_1]$ and diverges to $+\infty$ as $t \to \infty$. The proof rests on a row-sum conservation law for the logit matrix $L = X M X^\top$, which fixes $d^2 - 2$ scalar invariants of $M$ at their initial values and reduces the effective dynamics to two scalar coefficients. Borrowing the blacksmith's vocabulary, we describe training in three phases: heating (parameter contraction on $[0, T_1]$), forging (unbounded growth for $t > T_1$), and cooling (the asymptotic regime established in prior work). We characterize the first two phases.
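
As a rough illustration of how a row-sum conservation law can produce the U-shape, consider the following simplified sketch. It assumes, beyond what is stated above, that the two token embeddings (the rows $x_1, x_2$ of $X$) are orthonormal and that the loss is row-wise softmax cross-entropy on $L$ with one-hot targets $Y$; it ignores any further structure coming from the tied unembedding. The symbols $P$, $Y$, $c_i$, $\delta_i$, and $M_\perp$ are introduced here for illustration only and need not match the notation of the full proof. With $P$ the row-wise softmax of $L$,
\[
\dot M = -X^\top (P - Y) X, \qquad \dot L = X \dot M X^\top = -(P - Y), \qquad \dot L \mathbf{1} = -(P - Y)\mathbf{1} = 0,
\]
so each row sum $c_i$ of $L$ is conserved, as is the component $M_\perp$ of $M$ orthogonal to $\mathrm{span}\{x_i^\top x_j\}$. Row $i$ is then determined by the gap $\delta_i$ between its correct-target logit and its wrong-target logit, and
\[
\dot\delta_i = \frac{2}{1 + e^{\delta_i}} > 0, \qquad \|M\|_F^2 = \|M_\perp\|_F^2 + \frac{1}{2}\sum_{i=1}^{2}\bigl(c_i^2 + \delta_i^2\bigr), \qquad \frac{d}{dt}\|M\|_F^2 = \sum_{i=1}^{2} \frac{2\,\delta_i}{1 + e^{\delta_i}}.
\]
Under these assumptions, a sharp-wrong initialization corresponds to $\delta_1(0), \delta_2(0) \ll 0$: the derivative is negative while both gaps are negative and positive once both are positive, and since each $\delta_i$ increases without bound, $\|M(t)\|_F^2 \to +\infty$.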
