ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite

CoDA-GQA-L: Bounded-Memory Differential Attention with Value-Routed Landmark Banks

Authors: Maio, Anthony D.

Abstract

We present CoDA-GQA-L, an attention mechanism that compresses the KV cache from O(n) to a fixed budget of W + M_e + M_s slots per layer, independent of sequence length, while retaining selective long-range context through dual memory banks. Applied to Mistral-7B-v0.3, CoDA-GQA-L achieves bounded perplexity of 5.94 on WikiText-2 at 2,048 context with a fixed 218 KB per-layer cache, compared to >2 MB for the baseline (+23.5% PPL overhead, 9.5× memory reduction). The architecture combines three innovations: (1) Constrained Orthogonal Differential Attention (CoDA), which sharpens attention by subtracting a gated inhibitory stream produced via learnable orthogonal rotation, saving D×D parameters per head; (2) a dual-bank bounded memory comprising an exact landmark bank for high-fidelity token retention and an EMA summary bank for thematic compression, with value-routed semantic matching that ensures position-invariant updates despite RoPE-at-write key storage; and (3) two custom Triton kernels (a fused differential FlashAttention kernel and a fused exact-bank routing kernel) that each replace ~15 PyTorch kernel launches with single-pass GPU computation. A two-phase training protocol first teaches differential attention with full context (2,000 steps), then adapts the model to bounded memory (600 steps). Key results on Mistral-7B: 100% needle-in-haystack retrieval up to 16K tokens; a 5.7× reduction in bounded penalty from differential attention (a 2×2 factorial ablation shows both methods achieve 5.75 PPL unbounded, but GQA loses +1.09 PPL going bounded while CoDA loses only +0.19); minimal context-length degradation (5.94 at 2K vs. 5.95 at 4K); and projected 1,100× compression at 70B/128K context. The trained checkpoint and all code (56 passing tests) are publicly available.

Trained checkpoint: huggingface.co/anthonym21/Mistral-7B-v0.3-CoDA-GQA-L
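The headline figures in the abstract can be cross-checked against one another with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract; the variable names are illustrative, and the "implied" quantities (the bounded GQA perplexity, the baseline perplexity, and the baseline cache size) are derived here rather than reported by the author. It also assumes the +23.5% overhead is measured relative to the unmodified baseline model.

```python
# Figures quoted in the abstract (Mistral-7B, WikiText-2, 2,048 context).
PPL_UNBOUNDED = 5.75         # both GQA and CoDA in the 2x2 ablation
GQA_BOUNDED_PENALTY = 1.09   # PPL lost by GQA when going bounded
CODA_BOUNDED_PENALTY = 0.19  # PPL lost by CoDA when going bounded
PPL_OVERHEAD = 0.235         # +23.5% (assumed relative to the baseline model)
CACHE_KB = 218               # fixed per-layer cache
MEMORY_REDUCTION = 9.5       # reported memory reduction factor

# Bounded perplexities: unbounded value plus the bounded penalty.
coda_bounded_ppl = PPL_UNBOUNDED + CODA_BOUNDED_PENALTY
gqa_bounded_ppl = PPL_UNBOUNDED + GQA_BOUNDED_PENALTY  # derived, not reported

# Ratio of the two penalties; the abstract rounds this to 5.7x.
penalty_ratio = GQA_BOUNDED_PENALTY / CODA_BOUNDED_PENALTY

# Quantities implied by the overhead and reduction figures (derived here).
implied_baseline_ppl = coda_bounded_ppl / (1 + PPL_OVERHEAD)
implied_baseline_cache_kb = CACHE_KB * MEMORY_REDUCTION

print(f"CoDA bounded PPL:        {coda_bounded_ppl:.2f}")
print(f"GQA bounded PPL:         {gqa_bounded_ppl:.2f}")
print(f"Penalty ratio:           {penalty_ratio:.2f}x")
print(f"Implied baseline PPL:    {implied_baseline_ppl:.2f}")
print(f"Implied baseline cache:  {implied_baseline_cache_kb:.0f} KB")
```

Under these assumptions the numbers are mutually consistent: the CoDA bounded perplexity matches the reported 5.94, the penalty ratio rounds to 5.7×, and the implied baseline cache exceeds 2 MB whether a megabyte is taken as 1,000 KB or 1,024 KB.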

Keywords

attention mechanisms, coda, cuda, constrained orthogonal differential attention, gqa
