Powered by OpenAIRE graph
Preprint
Data sources: ZENODO

ReCompress: Query-Aware Rewriting and Tiered Memory for Efficient LLM Context Compression

Authors: Kshirsagar, Parth Sanjay; Pandey, Kartikey


Abstract

Large language models face two compounding token inefficiencies: single-turn contexts contain irrelevant passages that consume budget without contributing to answers, and multi-turn conversations resend full history every call, causing cumulative cost to grow quadratically with conversation length. Deletion-based compression approaches are query-independent and cannot drop entire irrelevant passages; multi-turn memory systems lack explicit protection for the bridging facts that multi-hop reasoning depends on. We present ReCompress, a two-component system addressing both regimes. A query-aware rewriting compressor, distilled into a 1.5B student (Qwen2.5-1.5B + LoRA), outperforms bear-1.1 by +0.252 F1 on HotpotQA while emitting roughly 8.5× fewer tokens (48 vs. 409 at a ratio-0.3 compression instruction). The gain is significant on multi-hop question answering with distractors (HotpotQA, and the near-in-distribution 2WikiMultiHop, +0.180 F1) and positive-but-not-significant on more dissimilar tasks (MuSiQue, SQuAD) at n = 50; we make the narrower claim the data supports. We further audit the result against ourselves: the gap survives an independent solver, and a mask-the-answer probe shows a substantial share of the margin comes from reliably retaining the answer-bearing span at a 3.5% budget where deletion truncates it. A tiered multi-turn framework, RbD-Compress, holds the context sent to the solver flat through protected trauma memory, a versioned checkpoint stack with rollback, and Echidna, an intelligent trigger that reads trauma memory before compression decisions, at no measurable loss in answer quality, a flatness result we scope carefully against per-turn compression overhead and KV-caching assumptions. Our results show that query-aware rewriting and deletion-based compression serve complementary operating regimes.
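
The quadratic-growth claim can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed per-turn token count, and the fixed context budget are hypothetical, and the model deliberately ignores the per-turn compression overhead and KV caching that the abstract scopes its flatness result against. If every turn adds t tokens and call k resends all k turns, the total over n turns is t·n(n+1)/2, whereas a context held flat at a budget of b tokens costs b·n.

```python
def cumulative_tokens_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total tokens sent when every call resends the whole history.

    Call k carries k turns of history, so the total is
    tokens_per_turn * (1 + 2 + ... + turns), which grows quadratically.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def cumulative_tokens_flat_context(turns: int, context_budget: int) -> int:
    """Total tokens sent when each call carries a fixed-size context.

    The total grows linearly in the number of turns.
    """
    return turns * context_budget


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    tokens_per_turn = 200
    context_budget = 600

    for turns in (10, 20, 40):
        full = cumulative_tokens_full_history(turns, tokens_per_turn)
        flat = cumulative_tokens_flat_context(turns, context_budget)
        print(f"turns={turns:>2}  full history={full:>7}  flat context={flat:>6}")
```

Under these assumed numbers, doubling the conversation length roughly quadruples the full-history total while only doubling the flat-context total, which is the gap a flat context targets.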
