Evaluating Generated Commit Messages with Large Language Models

Keywords: Software Engineering (cs.SE), FOS: Computer and information sciences, Software Engineering

Zeng, Qunhong; Zhang, Yuxia; Ma, Zexiong; Jiang, Bo; Sun, Ningyuan; Stol, Klaas-Jan; Mou, Xingyu; Liu, Hui

arXiv.org e-Print Archive

Preprint . 2025

Data sources: arXiv.org e-Print Archive

https://dx.doi.org/10.48550/ar...

Article . 2025

License: CC BY

Data sources: Datacite

Publication: Article, Preprint. 01 Jan 2025. Embargo end date: 01 Jan 2025. Publisher: arXiv

doi: 10.48550/arxiv.2507.10906

arXiv: 2507.10906

Abstract

Commit messages are essential in software development as they serve to document and explain code changes. Yet, their quality often falls short in practice, with studies showing significant proportions of empty or inadequate messages. While automated commit message generation has advanced significantly, particularly with Large Language Models (LLMs), the evaluation of generated messages remains challenging. Traditional reference-based automatic metrics like BLEU, ROUGE-L, and METEOR have notable limitations in assessing commit message quality, as they assume a one-to-one mapping between code changes and commit messages, leading researchers to rely on resource-intensive human evaluation. This study investigates the potential of LLMs as automated evaluators for commit message quality. Through systematic experimentation with various prompt strategies and state-of-the-art LLMs, we demonstrate that LLMs combining Chain-of-Thought reasoning with few-shot demonstrations achieve near human-level evaluation proficiency. Our LLM-based evaluator significantly outperforms traditional metrics while maintaining acceptable reproducibility, robustness, and fairness levels despite some inherent variability. This work conducts a comprehensive preliminary study on using LLMs for commit message evaluation, offering a scalable alternative to human assessment while maintaining high-quality evaluation.
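The abstract names the best-performing setup as Chain-of-Thought reasoning combined with few-shot demonstrations, but this record does not reproduce the prompts. The following is a minimal sketch of how such an evaluation prompt could be assembled and its verdict parsed. It is not the authors' implementation: the `Demo` structure, the `build_prompt` and `parse_score` helpers, the prompt wording, and the 1-to-5 scale are all assumptions made for illustration, and the call to an actual LLM is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    """One few-shot demonstration (hypothetical structure, not from the paper)."""
    diff: str
    message: str
    reasoning: str
    score: int


def build_prompt(diff: str, message: str, demos: List[Demo]) -> str:
    """Assemble a few-shot, Chain-of-Thought evaluation prompt."""
    parts = [
        "You are reviewing a commit message for the code change shown.",
        # Assumed rubric: a single 1-5 quality score preceded by reasoning.
        "Reason step by step, then end with a line 'Score: N' (1 to 5).",
    ]
    # Each demonstration shows the reasoning before the score, which is
    # what makes the few-shot examples Chain-of-Thought examples.
    for demo in demos:
        parts.append(
            f"### Example\nDiff:\n{demo.diff}\nMessage: {demo.message}\n"
            f"Reasoning: {demo.reasoning}\nScore: {demo.score}"
        )
    # The task ends on 'Reasoning:' so the model explains before it scores.
    parts.append(f"### Task\nDiff:\n{diff}\nMessage: {message}\nReasoning:")
    return "\n\n".join(parts)


def parse_score(response: str) -> Optional[int]:
    """Return the last 'Score: N' value found in the model output, if any."""
    matches = re.findall(r"Score:\s*([1-5])\b", response)
    return int(matches[-1]) if matches else None
```

In use, the string returned by `build_prompt` would be sent to whichever model is being studied, and `parse_score` applied to the reply. Taking the last match guards against the model echoing a demonstration score earlier in its reasoning.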

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green