
We present CoDA-GQA-L, an attention mechanism that compresses the KV cache from O(n) to a fixed budget of W + M_e + M_s slots per layer, independent of sequence length, while retaining selective long-range context through dual memory banks. Applied to Mistral-7B-v0.3, CoDA-GQA-L achieves a bounded perplexity of 5.94 on WikiText-2 at 2,048 context with a fixed 218 KB per-layer cache, compared to >2 MB for the baseline (+23.5% PPL overhead, 9.5× memory reduction). The architecture combines three innovations: (1) Constrained Orthogonal Differential Attention (CoDA), which sharpens attention by subtracting a gated inhibitory stream produced via a learnable orthogonal rotation, saving D×D parameters per head; (2) a dual-bank bounded memory comprising an exact landmark bank for high-fidelity token retention and an EMA summary bank for thematic compression, with value-routed semantic matching that ensures position-invariant updates despite RoPE-at-write key storage; and (3) two custom Triton kernels, a fused differential FlashAttention kernel and a fused exact-bank routing kernel, that each replace ∼15 PyTorch kernel launches with single-pass GPU computation. A two-phase training protocol first teaches differential attention with full context (2,000 steps), then adapts the model to bounded memory (600 steps). Key results on Mistral-7B: 100% needle-in-haystack retrieval up to 16K tokens; a 5.7× reduction in bounded penalty from differential attention (in a 2×2 factorial ablation, both methods achieve 5.75 PPL unbounded, but GQA loses 1.09 PPL going bounded while CoDA loses only 0.19); minimal context-length degradation (5.94 at 2K vs. 5.95 at 4K); and a projected 1,100× compression at 70B/128K context. The trained checkpoint and all code (56 passing tests) are publicly available.

Trained checkpoint: huggingface.co/anthonym21/Mistral-7B-v0.3-CoDA-GQA-L
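The differential-attention idea in item (1) can be illustrated with a minimal single-head NumPy sketch. It is not the released implementation. It assumes that the inhibitory stream is obtained by rotating the query, that the orthogonal rotation is parameterized as pairwise (Givens-style) rotations with one angle per coordinate pair, and that the gate is a single scalar. The names `rotate_pairs`, `differential_attention`, `theta`, and `gate` are hypothetical, and causal masking, GQA head sharing, and the bounded memory banks are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rotate_pairs(q, theta):
    """Apply a block-diagonal orthogonal rotation to the last axis of q.

    Coordinate pairs (2i, 2i+1) are rotated by angle theta[i], so the map
    uses D/2 angles rather than a dense D x D projection (an assumed
    parameterization, chosen only for illustration).
    """
    q_even, q_odd = q[..., 0::2], q[..., 1::2]
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(q)
    out[..., 0::2] = q_even * cos - q_odd * sin
    out[..., 1::2] = q_even * sin + q_odd * cos
    return out


def differential_attention(q, k, v, theta, gate):
    """Signal attention minus a gated inhibitory attention map (no masking)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    signal = softmax(q @ k.T * scale)
    # Hypothetical inhibitory stream: same keys, orthogonally rotated query.
    inhibit = softmax(rotate_pairs(q, theta) @ k.T * scale)
    return (signal - gate * inhibit) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 6, 8  # toy sequence length and head dimension
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    theta = rng.standard_normal(d // 2)
    print(differential_attention(q, k, v, theta, gate=0.5).shape)
```

With `gate=0` the sketch reduces to ordinary softmax attention, and because the rotation is orthogonal it preserves the norm of each query vector, so the inhibitory stream changes only the query's direction.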
attention mechanisms, coda, cuda, constrained orthogonal differential attention, gqa
