The Patch Overfitting Problem in Automated Program Repair: Practical Magnitude and a Baseline for Realistic Benchmarking

Keywords: Automation, Evaluation strategies, UPC subject areas::Computer science::Software engineering, Empirical studies, Overfitting problem, Repair techniques, Automated program repair, Patch assessment, Program debugging, Software testing

Justyna Petke; Matias Martinez; Maria Kechagia; Aldeida Aleti; Federica Sarro




UPCommons. Portal del coneixement obert de la UPC

Conference object . 2024 . Peer-reviewed

License: CC BY

Full-Text: https://upcommons.upc.edu/bitstreams/d0235e78-a4cb-46af-ae10-1abc726324e2/download

Data sources: UPCommons. Portal del coneixement obert de la UPC

https://doi.org/10.1145/366352...

Article . 2024 . Peer-reviewed

License: CC BY

Data sources: Crossref

Recolector de Ciencia Abierta, RECOLECTA

Conference object . 2024 . Peer-reviewed

License: CC BY

Data sources: Recolector de Ciencia Abierta, RECOLECTA

http://dx.doi.org/10.1145/3663...

Conference object

Full-Text: https://dl.acm.org/doi/pdf/10.1145/3663529

Data sources: Sygma

DBLP

Conference object

Data sources: DBLP

http://dx.doi.org/10.1145/3663...

Conference object . 2024

Data sources: European Union Open Data Portal


Publication: Article, Conference object
Date: 10 Jul 2024
Country: Spain
Publisher: ACM
Journal: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering
Funded by: EC | EPIC, ARC | Discovery Projects - Gran...


doi: 10.1145/3663529.3663776

handle: 2117/421157



Abstract

Automated program repair (APR) techniques aim to generate patches for software bugs, relying mainly on testing to check patch validity. Because a test suite is an incomplete specification, a patch can pass every test (be plausible) and still be incorrect. The generation of a large number of such plausible yet incorrect patches is widely believed to hinder wider application of APR in practice, which has motivated research in automated patch assessment. We reflect on the validity of this motivation and carry out an empirical study to analyse the extent to which 10 APR tools suffer from the overfitting problem in practice. We observe that the number of plausible patches generated by any of the APR tools analysed for a given bug from the Defects4J dataset is remarkably low, with a median of 2, indicating that in most cases a developer needs to consider only 2 patches to be confident of either finding a fix or confirming that none exists. This study suggests that the overfitting problem might not be as bad as previously thought. We reflect on current evaluation strategies of automated patch assessment techniques and propose a Random Selection baseline to assess whether and when using such techniques is beneficial for reducing human effort. We advocate that future work evaluate the benefit of using patch overfitting assessment against the random baseline.
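The abstract does not spell out how the Random Selection baseline is computed, so the following is only an illustrative sketch of the underlying idea, not the authors' definition. It assumes a developer inspects the plausible patches for one bug in uniformly random order and stops at the first correct one, or inspects all of them if none is correct. The function names `expected_inspections` and `simulate` are hypothetical.

```python
import random
from fractions import Fraction


def expected_inspections(n_plausible: int, n_correct: int) -> Fraction:
    """Expected number of patches inspected under a uniformly random order.

    Assumption (not from the paper): inspection stops at the first correct
    patch; if no patch is correct, all plausible patches are inspected.
    """
    if n_plausible < 0 or not 0 <= n_correct <= n_plausible:
        raise ValueError("need 0 <= n_correct <= n_plausible")
    if n_correct == 0:
        # Every patch must be checked to confirm that no fix exists.
        return Fraction(n_plausible)
    # Expected position of the first of k marked items in a random
    # permutation of n items is (n + 1) / (k + 1).
    return Fraction(n_plausible + 1, n_correct + 1)


def simulate(n_plausible: int, n_correct: int,
             trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    labels = [True] * n_correct + [False] * (n_plausible - n_correct)
    total = 0
    for _ in range(trials):
        rng.shuffle(labels)
        if n_correct > 0:
            total += labels.index(True) + 1  # 1-based position of first fix
        else:
            total += n_plausible
    return total / trials


if __name__ == "__main__":
    # With 2 plausible patches, one of them correct (the median case
    # reported in the abstract has 2 plausible patches per bug).
    print(expected_inspections(2, 1))
    print(simulate(2, 1))
```

Under these assumptions, a patch assessment technique would be judged useful for a given bug only if it lets the developer inspect fewer patches than this random-order expectation.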

Peer Reviewed

Country

Spain

Related Organizations

Universitat Politècnica de Catalunya
Spain
University College London
United Kingdom
Monash University
Australia
UNIVERSITY COLLEGE LONDON, Bartlett School of Planning
United Kingdom

Keywords

Automation, Evaluation strategies, UPC subject areas::Computer science::Software engineering, Empirical studies, Overfitting problem, Repair techniques, Automated program repair, Patch assessment, Program debugging, Software testing, Software bug

Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.