ZENODO
Dataset, 2025. License: CC BY. Data source: ZENODO.

Replication package for "The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models"

Authors: Fernando Vallecillos Ruiz, Max Hort, Leon Moonen


Abstract

This repository contains the replication package for the paper "The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models" by Fernando Vallecillos Ruiz, Max Hort, and Leon Moonen, accepted for the research track of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025). A preprint of the paper is included. The source code is distributed under the MIT license; except for third-party datasets that come with their own licenses, all documentation, data, models, and results in this repository are distributed under the CC BY 4.0 license.

Repository Overview

This repository contains the scripts, data, and resources needed to replicate the experiments presented in our conference paper. It is organized to help researchers reproduce our results, conduct similar analyses, or build upon our work.

Repository Structure

analysis - Jupyter notebook scripts used to generate tables and visual analyses. These scripts assist in visualizing results, comparing metrics, and summarizing data from the experiments; their outputs can be exported for further use.
apr_training - The dataset used for the Automated Program Repair (APR) training phase. This data is used by the scripts in train_src/ for fine-tuning the models.
benchmarks - JSON files representing the benchmarks, specifically HumanEval-Java and Defects4J. In this work, we primarily focus on and revise HumanEval-Java.
inference_and_validation_src - Python scripts used to generate patches and validate them across the benchmarks. These scripts produce and assess the model outputs.
inference_scripts - Bash scripts that automate submitting inference and validation jobs to the compute cluster, enabling multiple iterations of inference and validation in a streamlined manner.
models* - The fine-tuned models used in the experiments. These models are the output of the fine-tuning process and are referenced by the inference scripts.
results - All model outputs in JSON format, generated during the inference process. These files represent the raw experimental results.
train_src - Python scripts for model fine-tuning, covering both full model training and parameter-efficient LoRA fine-tuning.
validation_benchmark_dataset - The benchmark datasets used during validation.

* All contents except the model files from the models/ folder are included in the compressed zip file in this Zenodo repository. The model files are uploaded separately to facilitate individual downloads, as several of them are relatively large (9.5-11.2 GB).

Detailed Folder Descriptions

Analysis (analysis/)

This folder contains Jupyter notebook scripts used to generate tables and visual analyses of the experimental data. These scripts are designed to assist in visualizing results, comparing performance metrics, and summarizing experimental outcomes. Researchers can export the generated tables to spreadsheets for further processing or visualization. The outputs help in validating the experiments' consistency and provide insight into the performance of the various model configurations.

Inference and Validation Source (inference_and_validation_src/)

The Python scripts in this folder generate patches and validate them against the predefined benchmarks. We use the Fire library to parse parameters and dispatch to the relevant methods. This folder contains:
- Scripts for generating patches directly from the benchmark data or using iterative approaches.
- Validation utilities for the Defects4J and HumanEval-Java benchmarks, which ensure that generated patches are functional and comply with benchmark requirements.

Key components include patch-generation logic, validation commands for HumanEval-Java and Defects4J, and utilities to verify the integrity of generated JSON files.

Training Source (train_src/)

This folder contains the scripts used for model fine-tuning:
- full_finetune.py: performs full fine-tuning of a model on a given training dataset, updating all trainable parameters to achieve optimal performance on the target task.
- lora_finetune.py: implements LoRA (Low-Rank Adaptation) fine-tuning, a parameter-efficient approach in which only a small subset of added parameters is updated, making it effective for resource-constrained settings.

Inference Scripts (inference_scripts/)

These Bash scripts automate the inference process by submitting multiple iterations of inference and validation jobs to the compute cluster. The scripts create job dependencies, ensuring that all necessary tasks complete in a logical sequence. The available inference scripts are:
- model_inferencing_adjustable_FULL_d4j_big.sh: executes inference for the specified model configuration with multiple iterations and outputs per iteration.
- model_inferencing_adjustable_FULL_d4j_lora_big.sh: like the previous script, but optimized for LoRA-based models.

Both scripts accept three parameters:
- MODEL: the name of the model, as found in the models/ folder.
- NUM_ITERATIONS: the number of iterations to run.
- NUM_OUTPUTS: the number of outputs generated in each iteration.

Citation and Zenodo links

We hope this package serves as a useful resource for reproducing and expanding upon our research results. Please cite this work by referring to the published paper:

Fernando Vallecillos Ruiz, Max Hort, and Leon Moonen, 2025. The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025), ACM, 12 pages.

@inproceedings{ruiz2025:art,
  title     = {{The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models}},
  author    = {Ruiz, Fernando Vallecillos and Hort, Max and Moonen, Leon},
  booktitle = {{Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE)}},
  year      = {2025},
  pages     = {12},
  publisher = {{ACM}},
  language  = {en}
}

The replication package is archived on Zenodo with DOI: 10.5281/zenodo.15294695.
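As background on why LoRA fine-tuning (as in train_src/lora_finetune.py) is parameter-efficient: LoRA freezes the pretrained weight matrix W and learns a low-rank update B @ A, so only r * (d + k) parameters are trained instead of d * k. A minimal NumPy sketch of this idea, with dimensions chosen for illustration (this is not the script's actual code):

```python
import numpy as np

d, k, r = 64, 64, 4                      # weight shape (d x k), LoRA rank r
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # trainable; zero-init so the update starts at 0

W_eff = W + B @ A                        # effective weight used in the forward pass

full_params = d * k                      # parameters a full fine-tune would update
lora_params = r * (d + k)                # parameters LoRA updates
print(f"trainable fraction: {lora_params / full_params:.3f}")  # prints 0.125
```

With rank 4 on a 64x64 weight, only 512 of 4096 parameters are trainable; at the scales of the models in this repository the fraction is far smaller still.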
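For concreteness, an invocation of the inference scripts described above might look as follows. The model name is invented for illustration, and the positional parameter order (MODEL, NUM_ITERATIONS, NUM_OUTPUTS) is assumed from the description; check the script header before running on your cluster.

```shell
# Hypothetical example; the model name is invented, and the
# positional parameter order is assumed: MODEL NUM_ITERATIONS NUM_OUTPUTS.
MODEL="my-finetuned-model"   # a directory name under models/
NUM_ITERATIONS=3             # rounds of iterative inference + validation
NUM_OUTPUTS=10               # patches generated per iteration

CMD="bash inference_scripts/model_inferencing_adjustable_FULL_d4j_big.sh $MODEL $NUM_ITERATIONS $NUM_OUTPUTS"
echo "$CMD"   # on the cluster, submit/run this command instead of echoing it
```

For LoRA-based models, substitute model_inferencing_adjustable_FULL_d4j_lora_big.sh with the same three parameters.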

Acknowledgements

This work is supported by the Research Council of Norway through the secureIT project (IKTPLUSS #288787), and by the European Union through the Horizon Europe Marie Skłodowska-Curie Actions (#101151798). The empirical evaluation made use of the Experimental Infrastructure for Exploration of Exascale Computing (eX3), financially supported by the Research Council of Norway under contract #270053. In addition, we acknowledge Sigma2, Norway for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland), and the LUMI consortium through the Research Council of Norway.

Keywords

software maintenance, automated program repair, software evolution, software testing, large language models
