ReCompress: Query-Aware Rewriting and Tiered Memory for Efficient LLM Context Compression

Kshirsagar, Parth Sanjay; Pandey, Kartikey

Publication type: Preprint. Status: Under curation. Language: English. Publisher: Zenodo.

doi: 10.5281/zenodo.20786357

Abstract

Large language models face two compounding token inefficiencies: single-turn contexts contain irrelevant passages that consume budget without contributing to answers, and multi-turn conversations resend full history every call, causing cumulative cost to grow quadratically with conversation length. Deletion-based compression approaches are query-independent and cannot drop entire irrelevant passages; multi-turn memory systems lack explicit protection for the bridging facts that multi-hop reasoning depends on. We present ReCompress, a two-component system addressing both regimes. A query-aware rewriting compressor, distilled into a 1.5B student (Qwen2.5-1.5B + LoRA), outperforms bear-1.1 by +0.252 F1 on HotpotQA while emitting roughly 8.5× fewer tokens (48 vs. 409 at a ratio-0.3 compression instruction). The gain is significant on multi-hop question answering with distractors (HotpotQA, and the near-in-distribution 2WikiMultiHop, +0.180 F1) and positive-but-not-significant on more dissimilar tasks (MuSiQue, SQuAD) at n = 50; we make the narrower claim the data supports. We further audit the result against ourselves: the gap survives an independent solver, and a mask-the-answer probe shows a substantial share of the margin comes from reliably retaining the answer-bearing span at a 3.5% budget where deletion truncates it. A tiered multi-turn framework, RbD-Compress, holds the context sent to the solver flat through protected trauma memory, a versioned checkpoint stack with rollback, and Echidna, an intelligent trigger that reads trauma memory before compression decisions, at no measurable loss in answer quality; this is a flatness result we scope carefully against per-turn compression overhead and KV-caching assumptions. Our results show that query-aware rewriting and deletion-based compression serve complementary operating regimes.
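
The quadratic-growth claim in the abstract can be illustrated with a small arithmetic sketch. It is not the authors' code: the function names, the 200 tokens per turn, and the 1000-token context budget are all hypothetical values chosen for illustration, and the sketch ignores the per-turn compression overhead and KV caching that the abstract says the flatness result is scoped against. If every call resends the full history and each turn adds t tokens, n turns send t·n(n+1)/2 tokens in total, whereas a fixed per-call context of B tokens sends n·B.

```python
def cumulative_tokens_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total tokens sent when every call resends all turns so far.

    Call k carries k * tokens_per_turn tokens, so the total over the
    conversation is tokens_per_turn * turns * (turns + 1) / 2.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def cumulative_tokens_flat_context(turns: int, context_budget: int) -> int:
    """Total tokens sent when each call carries a fixed-size context."""
    return turns * context_budget


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    for n in (10, 20, 40):
        full = cumulative_tokens_full_history(n, tokens_per_turn=200)
        flat = cumulative_tokens_flat_context(n, context_budget=1000)
        print(f"turns={n:3d}  full_history={full:7d}  flat_context={flat:6d}")
```

Under these assumed numbers, doubling the number of turns roughly quadruples the full-history total while only doubling the flat-context total, which is the gap a flat context is meant to close.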