Artifact for Paper: Compressing Code Context for LLM-based Issue Resolution

# Compressing Code Context for LLM-based Issue Resolution

This artifact accompanies the paper **"Compressing Code Context for LLM-based Issue Resolution"**. It contains the implementation of two components:

1. **Oracle-Guided Context Distillation (OCD)**: an offline search-based pipeline that finds the minimal sufficient code context for resolving a bug (using HDD + genetic algorithm search with Docker-based evaluation).
2. **SWEzze Reranker (SWEzze)**: a fine-tuned sequence classification model that predicts which code segments to retain at inference time, without requiring expensive search.

---

## Directory Structure

```
artifact/
├── data/
│   ├── gpt-5.2/                  # Included artifact outputs for GPT-5.2
│   ├── qwen3-coder-next/         # Included artifact outputs for Qwen3-Coder-Next
│   └── deepseek-v3.2/            # Included artifact outputs for DeepSeek-V3.2
├── OCD/
│   ├── compress/                 # OCD pipeline: CLI, HDD/GA search, reward function
│   │   ├── cli/compress.py       # Main entrypoint for context compression
│   │   └── core/                 # HDD, GA, reward function implementations
│   ├── docker/                   # Docker container management (build, run, test)
│   ├── git/                      # Git operations (clone, checkout, apply patch)
│   ├── services/                 # Repository workspace management
│   └── shared/                   # Cross-cutting utilities
│       ├── model.py              # LLM backend (API / HF / vLLM)
│       ├── prompt.py             # Prompt templates (Agentless-style)
│       ├── editing.py            # Patch post-processing
│       ├── get_repo_structure.py # Repository structure extraction
│       ├── constants.py          # SWE-bench/SWE-smith specs
│       └── ...
└── SWEzze/                       # SWEzze reranker (inference + training)
    ├── reranker.py               # PatchAwareRerankerCompressor
    ├── base.py                   # BaseCompressor interface
    ├── data/                     # Reranker training data preparation
    └── training/
        └── train_reranker.py     # Reranker training (pointwise / pairwise / support-aware)
```

---

## Included Data

The artifact includes precomputed outputs under `artifact/data/` for three models:

- `gpt-5.2`
- `qwen3-coder-next`
- `deepseek-v3.2`

Within each model directory, results are grouped by compression method:

- `swezze`
- `swepruner`
- `llmlingua`
- `longcodezip`
- `no_compression`
- `no_context`

Each `artifact/data/<model>/<method>/` directory contains:

- `compressed.jsonl`: the compressed-context outputs for that model/method setting
- `patches.jsonl`: the corresponding generated patches for the same instances

These files are included as ready-to-inspect artifact outputs for comparison, analysis, and case studies; they are not required to run the OCD pipeline or train the SWEzze reranker from scratch.

---

## Requirements

### System Requirements

- Python 3.9+
- Docker daemon (required for OCD pipeline evaluation)
- Git
- CUDA-capable GPU (required for vLLM backend and model training; optional for API backend)

### Python Dependencies

```bash
pip install docker gitpython datasets tqdm transformers torch openai python-dotenv \
    jsonlines libcst peft trl scikit-learn
```

For vLLM inference backend:

```bash
pip install vllm
```

### Agentless

```bash
git clone https://github.com/OpenAutoCoder/Agentless.git
cd Agentless
pip install -r requirements.txt
```

### SWE-bench

```bash
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

SWE-bench Docker images are downloaded automatically on first run. Ensure sufficient disk space (several GB per project family).

### SWE-smith

```bash
git clone https://github.com/SWE-bench/SWE-smith.git
cd SWE-smith
pip install -e .
```

Set `--dataset swesmith` when running the compression pipeline against SWE-smith instances.

---

## Environment Variables

Create a `.env` file in the project root (or export these variables):

```bash
# Required for API backend (OpenAI-compatible endpoint)
API_KEY=your_api_key_here
BASE_URL=https://api.openai.com/v1  # or your custom endpoint

# Required for cloning GitHub repositories (OCD pipeline)
GITHUB_ACCESS_TOKEN=your_github_token_here

# Optional: override HuggingFace mirror
HF_ENDPOINT=https://huggingface.co

# Optional: vLLM server configuration
VLLM_BASE_URL=http://localhost:8005/v1
VLLM_API_KEY=EMPTY
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_DATA_PARALLEL_SIZE=1
```

---

## Usage

### Part 1: Oracle-Guided Context Distillation (OCD)

The OCD pipeline takes Agentless output (fault localization + repair samples) and finds the minimal sufficient context for each instance.

#### Input Format

The input JSONL file must contain one record per instance:

```json
{
  "instance_id": "repo__owner.issue_number",
  "samples": [
    {
      "prompt": "",
      "patches": ["", "", ...],
      "found_files": ["path/to/file.py"],
      "found_edit_locs": {"path/to/file.py": ["function_name"]}
    }
  ]
}
```

Additionally, an `--auxiliary_data_path` directory is required containing per-instance subdirectories with:

- `coverage.json`: code coverage data for the instance
- `patch.diff`: the gold patch diff

#### Running the Compression Pipeline

```bash
python -m OCD.compress.cli.compress \
    --data_path /path/to/agentless_output.jsonl \
    --auxiliary_data_path /path/to/auxiliary_data \
    --model <model_name> \
    --backend api \
    --threads 4 \
    --majority_voting 5 \
    --dataset swebenchlite \
    --playground ./playground
```

**Key arguments:**

| Argument | Default | Description |
|----------|---------|-------------|
| `--data_path` | (required) | Path to input JSONL with Agentless repair samples |
| `--auxiliary_data_path` | `./auxiliary_data` | Directory with coverage data and gold patches |
| `--model` | `gpt-3.5-turbo` | LLM model name (API model or local model path) |
| `--backend` | `auto` | `api` (OpenAI-compatible), `vllm`, `hf`, or `auto` |
| `--threads` | `4` | Number of parallel compression threads |
| `--majority_voting` | `5` | Patch candidates for majority-vote evaluation |
| `--dataset` | `swesmith` | `swebenchlite` or `swesmith` |
| `--playground` | `./playground` | Directory for cloned repositories |
| `--instance_id` | None | Process a single instance (for debugging) |

#### Output Format

Results are written to `_compressed.jsonl`:

```json
{
  "instance_id": "repo__owner.issue_number",
  "issue_description": "...",
  "buggy_file": "path/to/file.py",
  "samples": [
    {
      "compression_method": "HDD",
      "initial_context": "",
      "compressed_context": "",
      "compression_ratio": 0.35
    }
  ]
}
```

Compression methods: `HDD` (passes original patches), `GA` (full genetic algorithm), `HEURISTIC+HDD`, `GA+HDD`, `EMPTY` (no context needed).

---

### Part 2: SWEzze Reranker

The `SWEzze` module provides a reranker-based compressor that scores and selects code segments without running the search pipeline.
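The idea of scoring segments and keeping only the best ones under a token budget can be sketched as follows. This is a minimal illustration under stated assumptions, not the SWEzze implementation: `Segment`, `count_tokens`, and `select_segments` are hypothetical names, the scores are made up, and the whitespace-based token count stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A candidate code segment with a relevance score (hypothetical structure)."""
    text: str
    score: float
    start_line: int


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude proxy for a model tokenizer.
    return len(text.split())


def select_segments(segments, budget_tokens):
    """Greedily keep the highest-scoring segments that still fit in the budget."""
    kept, used = [], 0
    for seg in sorted(segments, key=lambda s: s.score, reverse=True):
        cost = count_tokens(seg.text)
        if used + cost <= budget_tokens:
            kept.append(seg)
            used += cost
    # Restore source order so the retained context reads top to bottom.
    return sorted(kept, key=lambda s: s.start_line)


segments = [
    Segment("def parse(x):\n    return int(x)", 0.91, 10),
    Segment("import os\nimport sys", 0.12, 1),
    Segment("def helper(a, b):\n    return a + b", 0.55, 40),
]
selected = select_segments(segments, budget_tokens=10)
```

Greedy selection is only one possible policy; it favors high-scoring segments and may skip a mid-ranked segment that is too large for the remaining budget while still admitting a smaller, lower-ranked one.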
#### Inference

```python
from SWEzze import PatchAwareRerankerCompressor

compressor = PatchAwareRerankerCompressor(
    model_name_or_path="Qwen/Qwen3-Reranker-0.6B",  # or path to fine-tuned checkpoint
    # adapter_path="./outputs/SWEzze_reranker/lora_adapter",  # optional LoRA adapter
    device="cuda",
    budget_tokens=4096,
)

compressed_context = compressor.compress(
    issue=issue_description,
    found_files=["path/to/file.py"],
    initial_context=full_code_context,
)
```

#### Training Data Preparation

Convert OCD output to reranker training format:

```bash
python -m SWEzze.data.prepare_reranker_data \
    --input /path/to/compressed.jsonl \
    --output ./data/reranker_train.json \
    --mode pointwise \
    --split
```

#### Training the Reranker

```bash
python -m SWEzze.training.train_reranker \
    --model_name Qwen/Qwen3-Reranker-0.6B \
    --train_data ./data/reranker_train.json \
    --val_data ./data/reranker_val.json \
    --output_dir ./outputs/SWEzze_reranker \
    --mode pointwise \
    --lora_r 64 \
    --lora_alpha 128 \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3 \
    --learning_rate 2e-4
```

The training script supports both base and support-aware training modes:

| Mode | Description |
|------|-------------|
| `pointwise` | Binary relevance labels per (query, passage) pair |
| `pairwise` | Contrastive loss over (query, positive, negative) triplets |
| `auto` | Automatically detect format from data |

---

## Notes

- **Docker daemon** must be running for the OCD pipeline (it launches Docker containers for each SWE-bench instance).
- The OCD pipeline downloads SWE-bench Docker images on first run. Ensure sufficient disk space (several GB per project).
- Repository cloning requires a valid `GITHUB_ACCESS_TOKEN`.
- The `--playground` directory stores cloned repositories between runs to avoid redundant cloning.
- Output files support incremental resumption: already-processed instances are skipped on restart.
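For readers who post-process results themselves, the resumption behavior in the last note can be approximated as below. This is a sketch that assumes resumption is keyed on the `instance_id` field of each JSONL record; `load_processed_ids` and `pending_instances` are hypothetical helpers, not functions from the artifact.

```python
import json
from pathlib import Path


def load_processed_ids(output_path):
    """Collect the instance IDs already present in an output JSONL file."""
    path = Path(output_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["instance_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip malformed records, e.g. a last line cut off by a crash.
                continue
    return done


def pending_instances(instances, output_path):
    """Return only the input records whose IDs are not yet in the output file."""
    done = load_processed_ids(output_path)
    return [inst for inst in instances if inst["instance_id"] not in done]
```

Skipping malformed lines means an interrupted write causes that instance to be reprocessed rather than silently treated as complete.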
