Artifact of LLM-Guided Genetic Improvement: Envisioning Semantic Aware Automated Software Evolution

Abstract: We propose a new line of research on AI-powered Genetic Improvement (GI) that incorporates semantic-aware search. We take a first step toward it by augmenting GI with automated clustering of LLM edits. We provide initial empirical evidence that our proposal, dubbed PatchCat, automatically and effectively categorizes LLM-suggested patches. PatchCat identified 18 different types of software patches and categorized newly suggested patches with high accuracy. It also enabled detecting NoOp edits in advance and, prospectively, skipping test suite execution in many cases to save resources. These results, coupled with the fact that PatchCat works with small, local LLMs, are a promising step toward interpretable, efficient, and green GI.

Note: This is the artifact of the ASE NIER 2025 publication.

Examples for RQ2 (full example text):

Reasons for inconsistency in tagging:

Case of different prioritization by the model and humans: The model prioritizes importance differently than humans. For example, one entry was tagged Category #12 by the model and #9 by a human, who added the note: "12 could also be considered but 9 is more important. Also makes changes to the return, but not mentioned in the description". The short 15-word description generated via LLM for this entry was: "A Java code diff with 4 changes: catches ParseException, adds variable, and updates logic".

Case of incomplete or unclear summaries: Issues with the short 15-word description, such as not describing all changes or an unclear summary.
For example, when the "if" statement was modified:

    125d124
    final Locale locale = Locale.ENGLISH;
    > final SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
    > // assume no header date by default
    > boolean hasHeaderDate = false;
    129a132
    > hasHeaderDate = true;
    133a137,140
    > if (hasHeaderDate) {
    > // add a newline after the date field
    > header.append("\n");
    > }

but this was not clear from the summary: "SimpleDateFormat constructor and locale usage changed, with additional logic for header date detection".

Content of Files in the Artifact:

Datasets: Raw data is taken from https://zenodo.org/records/13381774.

Initial Manual Clustering: Clustering of 309 entries from the JUnit4 and JCodec projects, with LLM patches generated with the Mistral LLM. Size of dataset: 309. File: Patch Analysis-anon.xlsx.

Augmented Dataset: The initial dataset was manually clustered after data augmentation. Size of dataset: 5806 (unique). File: DataAugmentation_Approach_Patch Classification_subsection.xlsx

Validation (RQ1): Validation on unseen datasets (unseen projects and/or patches generated by an unseen LLM). Size of dataset: 218. File: RQ2-dataset-all_patch_summaries.xlsx

Statistics (RQ2): Data used to construct statistics of LLM-generated patches in Gin from ForArtifact.zip. Size of dataset: 3232. File: ForArtifact.zip

Dockers: The model, built via the offline approach for use in the online approach, is packaged in a Docker file, ready to test and use. File: model-in-a-docker-unseen-retrives-batch.tar.

Code: Clustering is taken from https://github.com/rashadulrakib/short-text-clustering-enhancement, but applied to a new dataset. The clustering scripts are developed on top of the short-text-clustering work. File: clustering.zip. Code of RQ1 is in ForArtifact.zip.
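
Illustration (not part of the artifact): the tagging inconsistencies described for RQ2 arise when the category assigned by the model differs from the one assigned by a human. A minimal Python sketch of how such disagreements could be counted is given here. The function name tag_agreement and the toy category numbers are hypothetical and chosen only for illustration; they are not taken from the artifact's code or data.

```python
from collections import Counter


def tag_agreement(model_tags, human_tags):
    """Return (agreement rate, Counter of (model, human) disagreement pairs).

    Hypothetical helper for illustration; not part of the artifact.
    """
    if len(model_tags) != len(human_tags):
        raise ValueError("tag lists must have the same length")
    if not model_tags:
        raise ValueError("tag lists must not be empty")
    # Count every position where the model and the human chose different categories.
    disagreements = Counter(
        (m, h) for m, h in zip(model_tags, human_tags) if m != h
    )
    matches = len(model_tags) - sum(disagreements.values())
    return matches / len(model_tags), disagreements


# Toy data invented for illustration only (not taken from the artifact).
model = [12, 3, 7, 12]
human = [9, 3, 7, 12]

rate, pairs = tag_agreement(model, human)
print(f"agreement: {rate:.2f}")
for (m, h), n in pairs.most_common():
    print(f"model #{m} vs human #{h}: {n} case(s)")
```

Listing the disagreement pairs by frequency makes it easier to see which category pairs, such as #12 versus #9 in the example above, are worth inspecting by hand.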

Related Organizations

King's College London
United Kingdom
University College London
United Kingdom
Johannes Gutenberg University of Mainz
Germany
University of Stirling
United Kingdom

Related to Research communities

FORTHEM Alliance