
# Compressing Code Context for LLM-based Issue Resolution

This artifact accompanies the paper **"Compressing Code Context for LLM-based Issue Resolution"**. It contains the implementation of two components:

1. **Oracle-Guided Context Distillation (OCD)**: an offline search-based pipeline that finds the minimal sufficient code context for resolving a bug (using HDD + genetic algorithm search with Docker-based evaluation).
2. **SWEzze Reranker (SWEzze)**: a fine-tuned sequence classification model that predicts which code segments to retain at inference time, without requiring expensive search.

---

## Directory Structure

```
artifact/
├── data/
│   ├── gpt-5.2/                   # Included artifact outputs for GPT-5.2
│   ├── qwen3-coder-next/          # Included artifact outputs for Qwen3-Coder-Next
│   └── deepseek-v3.2/             # Included artifact outputs for DeepSeek-V3.2
├── OCD/
│   ├── compress/                  # OCD pipeline: CLI, HDD/GA search, reward function
│   │   ├── cli/compress.py        # Main entrypoint for context compression
│   │   └── core/                  # HDD, GA, reward function implementations
│   ├── docker/                    # Docker container management (build, run, test)
│   ├── git/                       # Git operations (clone, checkout, apply patch)
│   ├── services/                  # Repository workspace management
│   └── shared/                    # Cross-cutting utilities
│       ├── model.py               # LLM backend (API / HF / vLLM)
│       ├── prompt.py              # Prompt templates (Agentless-style)
│       ├── editing.py             # Patch post-processing
│       ├── get_repo_structure.py  # Repository structure extraction
│       ├── constants.py           # SWE-bench/SWE-smith specs
│       └── ...
└── SWEzze/                        # SWEzze reranker (inference + training)
    ├── reranker.py                # PatchAwareRerankerCompressor
    ├── base.py                    # BaseCompressor interface
    ├── data/                      # Reranker training data preparation
    └── training/
        └── train_reranker.py      # Reranker training (pointwise / pairwise / support-aware)
```

---

## Included Data

The artifact includes precomputed outputs under `artifact/data/` for three models:

- `gpt-5.2`
- `qwen3-coder-next`
- `deepseek-v3.2`

Within each model directory, results are grouped by compression method:

- `swezze`
- `swepruner`
- `llmlingua`
- `longcodezip`
- `no_compression`
- `no_context`

Each `artifact/data/<model>/<method>/` directory contains:

- `compressed.jsonl`: the compressed-context outputs for that model/method setting
- `patches.jsonl`: the corresponding generated patches for the same instances

These files are included as ready-to-inspect artifact outputs for comparison, analysis, and case studies; they are not required to run the OCD pipeline or train the SWEzze reranker from scratch.

---

## Requirements

### System Requirements

- Python 3.9+
- Docker daemon (required for OCD pipeline evaluation)
- Git
- CUDA-capable GPU (required for vLLM backend and model training; optional for API backend)

### Python Dependencies

```bash
pip install docker gitpython datasets tqdm transformers torch openai python-dotenv \
    jsonlines libcst peft trl scikit-learn
```

For vLLM inference backend:

```bash
pip install vllm
```

### Agentless

```bash
git clone https://github.com/OpenAutoCoder/Agentless.git
cd Agentless
pip install -r requirements.txt
```

### SWE-bench

```bash
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

SWE-bench Docker images are downloaded automatically on first run. Ensure sufficient disk space (several GB per project family).

### SWE-smith

```bash
git clone https://github.com/SWE-bench/SWE-smith.git
cd SWE-smith
pip install -e .
```

Set `--dataset swesmith` when running the compression pipeline against SWE-smith instances.

---

## Environment Variables

Create a `.env` file in the project root (or export these variables):

```bash
# Required for API backend (OpenAI-compatible endpoint)
API_KEY=your_api_key_here
BASE_URL=https://api.openai.com/v1  # or your custom endpoint

# Required for cloning GitHub repositories (OCD pipeline)
GITHUB_ACCESS_TOKEN=your_github_token_here

# Optional: override HuggingFace mirror
HF_ENDPOINT=https://huggingface.co

# Optional: vLLM server configuration
VLLM_BASE_URL=http://localhost:8005/v1
VLLM_API_KEY=EMPTY
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_DATA_PARALLEL_SIZE=1
```

---

## Usage

### Part 1: Oracle-Guided Context Distillation (OCD)

The OCD pipeline takes Agentless output (fault localization + repair samples) and finds the minimal sufficient context for each instance.

#### Input Format

The input JSONL file must contain one record per instance:

```json
{
  "instance_id": "repo__owner.issue_number",
  "samples": [
    {
      "prompt": "",
      "patches": ["", "", ...],
      "found_files": ["path/to/file.py"],
      "found_edit_locs": {"path/to/file.py": ["function_name"]}
    }
  ]
}
```

Additionally, an `--auxiliary_data_path` directory is required containing per-instance subdirectories with:

- `coverage.json`: code coverage data for the instance
- `patch.diff`: the gold patch diff

#### Running the Compression Pipeline

```bash
python -m OCD.compress.cli.compress \
    --data_path /path/to/agentless_output.jsonl \
    --auxiliary_data_path /path/to/auxiliary_data \
    --model <model_name> \
    --backend api \
    --threads 4 \
    --majority_voting 5 \
    --dataset swebenchlite \
    --playground ./playground
```

**Key arguments:**

| Argument | Default | Description |
|----------|---------|-------------|
| `--data_path` | (required) | Path to input JSONL with Agentless repair samples |
| `--auxiliary_data_path` | `./auxiliary_data` | Directory with coverage data and gold patches |
| `--model` | `gpt-3.5-turbo` | LLM model name (API model or local model path) |
| `--backend` | `auto` | `api` (OpenAI-compatible), `vllm`, `hf`, or `auto` |
| `--threads` | `4` | Number of parallel compression threads |
| `--majority_voting` | `5` | Patch candidates for majority-vote evaluation |
| `--dataset` | `swesmith` | `swebenchlite` or `swesmith` |
| `--playground` | `./playground` | Directory for cloned repositories |
| `--instance_id` | None | Process a single instance (for debugging) |

#### Output Format

Results are written to `_compressed.jsonl`:

```json
{
  "instance_id": "repo__owner.issue_number",
  "issue_description": "...",
  "buggy_file": "path/to/file.py",
  "samples": [
    {
      "compression_method": "HDD",
      "initial_context": "",
      "compressed_context": "",
      "compression_ratio": 0.35
    }
  ]
}
```

Compression methods: `HDD` (passes original patches), `GA` (full genetic algorithm), `HEURISTIC+HDD`, `GA+HDD`, `EMPTY` (no context needed).

---

### Part 2: SWEzze Reranker

The `SWEzze` module provides a reranker-based compressor that scores and selects code segments without running the search pipeline.

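The score-and-select idea described above can be illustrated with a minimal sketch. This is not the `PatchAwareRerankerCompressor` implementation: the `Segment` class, the `select_segments` function, the whitespace-based token count, and the greedy selection policy are all assumptions made for illustration, and the scores are hard-coded stand-ins for reranker outputs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str     # one code segment, e.g. a function definition
    score: float  # relevance score; assumed to come from a reranker


def count_tokens(text: str) -> int:
    # Crude whitespace token count (assumption); a real system would use
    # the model's tokenizer instead.
    return len(text.split())


def select_segments(segments, budget_tokens):
    """Greedily keep the highest-scoring segments that fit the token budget."""
    ranked = sorted(range(len(segments)),
                    key=lambda i: segments[i].score, reverse=True)
    kept, used = [], 0
    for i in ranked:
        cost = count_tokens(segments[i].text)
        if used + cost <= budget_tokens:
            kept.append(i)
            used += cost
    # Emit retained segments in their original order so the context
    # still reads like the source file.
    return [segments[i] for i in sorted(kept)]


# Hypothetical segments with hard-coded scores.
segments = [
    Segment("def parse(x): return int(x)", 0.91),
    Segment("def unused_helper(): pass", 0.05),
    Segment("def validate(x): return x > 0", 0.74),
]
selected = select_segments(segments, budget_tokens=10)
print([s.text for s in selected])
```

With a budget of 10 whitespace tokens, the two high-scoring segments fill the budget and the low-scoring helper is dropped. Restoring source order after selection is one possible design choice; ordering by score is another.
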
#### Inference

```python
from SWEzze import PatchAwareRerankerCompressor

compressor = PatchAwareRerankerCompressor(
    model_name_or_path="Qwen/Qwen3-Reranker-0.6B",  # or path to fine-tuned checkpoint
    # adapter_path="./outputs/SWEzze_reranker/lora_adapter",  # optional LoRA adapter
    device="cuda",
    budget_tokens=4096,
)

compressed_context = compressor.compress(
    issue=issue_description,
    found_files=["path/to/file.py"],
    initial_context=full_code_context,
)
```

#### Training Data Preparation

Convert OCD output to reranker training format:

```bash
python -m SWEzze.data.prepare_reranker_data \
    --input /path/to/compressed.jsonl \
    --output ./data/reranker_train.json \
    --mode pointwise \
    --split
```

#### Training the Reranker

```bash
python -m SWEzze.training.train_reranker \
    --model_name Qwen/Qwen3-Reranker-0.6B \
    --train_data ./data/reranker_train.json \
    --val_data ./data/reranker_val.json \
    --output_dir ./outputs/SWEzze_reranker \
    --mode pointwise \
    --lora_r 64 \
    --lora_alpha 128 \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3 \
    --learning_rate 2e-4
```

The training script supports both base and support-aware training modes:

| Mode | Description |
|------|-------------|
| `pointwise` | Binary relevance labels per (query, passage) pair |
| `pairwise` | Contrastive loss over (query, positive, negative) triplets |
| `auto` | Automatically detect format from data |

---

## Notes

- **Docker daemon** must be running for the OCD pipeline (it launches Docker containers for each SWE-bench instance).
- The OCD pipeline downloads SWE-bench Docker images on first run. Ensure sufficient disk space (several GB per project).
- Repository cloning requires a valid `GITHUB_ACCESS_TOKEN`.
- The `--playground` directory stores cloned repositories between runs to avoid redundant cloning.
- Output files support incremental resumption: already-processed instances are skipped on restart.
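
The incremental resumption mentioned in the last note can be pictured with the following sketch. It is an assumption about how such skipping could work, not the pipeline's actual code: `load_done_ids`, `run_with_resume`, the stub processing function, and the demo file name are hypothetical; only the `instance_id` field and the JSONL format come from the output description above.

```python
import json
import os
import tempfile


def load_done_ids(output_path):
    """Return the instance_ids already present in an output JSONL file."""
    done = set()
    if not os.path.exists(output_path):
        return done
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                done.add(json.loads(line)["instance_id"])
    return done


def run_with_resume(instances, output_path, process):
    """Process only instances that are not yet recorded in output_path."""
    done = load_done_ids(output_path)
    processed = []
    with open(output_path, "a", encoding="utf-8") as out:
        for inst in instances:
            if inst["instance_id"] in done:
                continue  # already handled in an earlier run
            out.write(json.dumps(process(inst)) + "\n")
            out.flush()  # write each record promptly so an interruption loses little work
            processed.append(inst["instance_id"])
    return processed


# Demo with hypothetical instance ids and a stand-in processing function.
path = os.path.join(tempfile.mkdtemp(), "demo_compressed.jsonl")
instances = [{"instance_id": "a"}, {"instance_id": "b"}, {"instance_id": "c"}]
stub = lambda inst: {"instance_id": inst["instance_id"], "samples": []}
first_run = run_with_resume(instances[:2], path, stub)   # simulates an interrupted run
second_run = run_with_resume(instances, path, stub)      # restart over the full list
print(first_run, second_run)
```

Appending one JSON record per line keeps the output valid after every completed instance, which is what makes skipping on restart straightforward.
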
