Powered by OpenAIRE graph
Preprint
Data sources: ZENODO

CrystalCache: Cross-Domain Transfer from Cognitive Memory Crystallization to KV Cache Eviction in Long-Context LLMs

Authors: Lin, Po-Ting


Abstract

The Key–Value (KV) cache of long-context Large Language Models (LLMs) grows linearly with context length and is now the dominant memory bottleneck of long-context inference; at 128K tokens a single batch of bf16 KV for Llama-3-8B already exceeds the model weights themselves. Existing eviction methods fall into two generations. The first generation (H2O, SnapKV, StreamingLLM, Scissorhands) summarises each token by a single scalar and evicts at token granularity, producing "coverage holes" over semantically coherent passages. The second generation (ChunkKV, EpiCache, CAOTE, DefensiveKV, PyramidKV) advances along a single axis each — fixed-size grouping, signal fusion, or robust aggregation of repeated observations — but none simultaneously satisfies the four structural requirements of dynamic semantic boundaries, two independent scoring dimensions, an explicit rarity signal, and progressive (rather than binary) retention. We propose CrystalCache, a KV-cache eviction algorithm derived from the structural predictions of the Crystallization Memory Framework: that any system serving a memory function should describe each item along at least two independent axes (analogous to a crystal's structural extent and formation strength) and should organise items as a multi-branch trunk rather than a single block.
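The 128K-token claim can be checked with back-of-envelope arithmetic. The sketch below assumes the publicly documented Llama-3-8B attention configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and 2 bytes per bf16 value; these constants are assumptions drawn from the model card, not values stated in the abstract.

```python
# Rough KV-cache sizing for Llama-3-8B in bf16 (assumed config:
# 32 layers, 8 GQA KV heads, head_dim 128, 2 bytes per value).
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K and V planes
ctx = 128 * 1024                                      # 128K-token context
kv_total_gib = per_token * ctx / 2**30

weights_gib = 8.03e9 * BYTES / 2**30  # ~8.03B params in bf16

print(f"KV per token: {per_token / 1024:.0f} KiB")  # 128 KiB
print(f"KV @128K ctx: {kv_total_gib:.1f} GiB")      # 16.0 GiB
print(f"weights:      {weights_gib:.1f} GiB")       # 15.0 GiB
```

Under these assumptions the single-batch KV cache (16.0 GiB) indeed edges past the bf16 weights (~15.0 GiB), consistent with the abstract's claim.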
CrystalCache instantiates these predictions in four concurrent design moves: (1) it builds trunks — semantic units bounded by sentence punctuation and refined by co-attention — rather than fixed-size chunks or utterance clusters; (2) it scores each trunk along two independently computed dimensions, an associative crystallization term D (structural centrality in the trunk graph) and an encoding impact term M_i (attention salience plus a Von Restorff rarity term), and composes them as Score = max(D, α · normalize(log(1 + M_i))), providing two independent survival paths; (3) it injects an explicit token-frequency rarity signal U_i = 1 / (1 + log(1 + c_i)) directly into the score, a signal absent from all four contemporaneous works; and (4) it replaces binary retention with a two-stage branch dissolution procedure that performs proportional retention between trunks and M_i-ranked retention within trunks. On Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Qwen3-8B, across Needle-in-a-Haystack and a Delayed Association diagnostic at retention budgets β ∈ {0.3, 0.5}, CrystalCache wins all 3 × 2 × 2 = 12 retrieval comparisons against H2O, SnapKV, ChunkKV, StreamingLLM, and PyramidKV; on Qwen3-8B Needle (β = 0.5) it doubles the best baseline (0.333 vs. 0.167) and quadruples the weakest (vs. 0.083). Ablations identify the Von Restorff rarity term as the single most impactful component (−0.383 when removed), confirm that trunk-level eviction outperforms token-level (−0.317 when T_max = 1), and confirm that the dual-dimension max composition strictly beats either dimension alone. On the broader-coverage LongBench suite, CrystalCache is competitive but not leading, a trade-off we attribute to the spatial-coverage cost of trunk-level retention and discuss honestly as a limitation. 
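The dual-dimension composition and the rarity term above can be sketched directly. The following is an illustrative Python fragment only: the min–max normalisation, the helper names (`rarity`, `trunk_scores`), and the treatment of D and M_i as precomputed inputs are all assumptions, since the abstract does not specify how `normalize` is computed or how the two dimensions are obtained.

```python
import math

def rarity(c_i: int) -> float:
    """Von Restorff rarity term U_i = 1 / (1 + log(1 + c_i)),
    where c_i is a token's observed frequency count (assumed input)."""
    return 1.0 / (1.0 + math.log(1.0 + c_i))

def trunk_scores(D, M, alpha=0.5):
    """Compose Score = max(D, alpha * normalize(log(1 + M_i))).

    D: per-trunk crystallization scores (structural centrality).
    M: per-trunk encoding-impact scores (attention salience plus
       rarity, assumed precomputed).
    Min-max normalisation of log(1 + M_i) is an assumption."""
    logs = [math.log(1.0 + m) for m in M]
    lo, hi = min(logs), max(logs)
    span = (hi - lo) or 1.0
    norm = [(x - lo) / span for x in logs]
    # max() gives each trunk two independent survival paths:
    # high centrality D, or high normalised encoding impact.
    return [max(d, alpha * n) for d, n in zip(D, norm)]
```

Note how the `max` composition lets a trunk survive on either axis alone, matching the paper's "two independent survival paths" framing.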
The end-to-end system delivers 50–70% steady-state decode memory savings; the prefill overhead (54–64% at 16K–32K contexts) stems entirely from a CPU-bound NumPy O(n²) co-attention edge-extraction step and is therefore an engineering limitation, not an algorithmic one. Beyond the empirical result, the consistency of the 12/12 cross-model, cross-task, cross-budget gains constitutes computational corroboration of the structural predictions of the Crystallization Memory Framework: when a system serves a memory function, structural principles derived from biological memory transfer non-trivially to its design.
