
arXiv: 2506.04772
Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.
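The abstract reports findings rather than an implementation, but the contrast between reference-based similarity metrics and LLM-as-a-judge evaluation can be illustrated with a small sketch. The snippet below assumes the `rouge_score` and `bert_score` Python packages; the judge prompt, its two rating dimensions, and the `call_llm` callable are illustrative placeholders, not the paper's actual evaluation setup.

```python
"""Minimal sketch of a hybrid revision-quality check: reference-based
similarity metrics (ROUGE-L, BERTScore) alongside an LLM-as-a-judge prompt.
Assumes the `rouge_score` and `bert_score` packages; the prompt wording and
the `call_llm` callable are hypothetical placeholders."""

from rouge_score import rouge_scorer
from bert_score import score as bert_score


def similarity_metrics(revision: str, reference: str) -> dict:
    """Reference-based metrics: high scores mean the revision is close to
    the gold reference, not necessarily that it is a good revision."""
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, revision)["rougeL"].fmeasure
    _, _, f1 = bert_score([revision], [reference], lang="en", verbose=False)
    return {"rougeL_f": rouge_l, "bertscore_f1": f1.item()}


# Illustrative judge prompt covering the two aspects discussed in the
# abstract: instruction-following and correctness.
JUDGE_PROMPT = """You are evaluating a revision of a scientific sentence.
Instruction given to the reviser: {instruction}
Original sentence: {original}
Revised sentence: {revision}

Rate each aspect on a 1-5 scale:
1. Does the revision follow the instruction?
2. Is the revision factually and grammatically correct?
Answer as JSON: {{"instruction_following": <int>, "correctness": <int>}}"""


def judge_revision(original: str, revision: str, instruction: str, call_llm) -> str:
    """LLM-as-a-judge call; `call_llm` is any function mapping a prompt
    string to the model's reply (e.g. a wrapper around a chat API)."""
    prompt = JUDGE_PROMPT.format(
        instruction=instruction, original=original, revision=revision
    )
    return call_llm(prompt)
```

In a hybrid readout along the lines the abstract describes, the judge's instruction-following and correctness ratings would be reported alongside the task-specific similarity scores rather than merged into a single number.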
V3 contains only the English version, accepted to ACL 2025 main (26 pages). V2 contains both the English (ACL 2025) and French (TALN 2025) versions (58 pages).
FOS: Computer and information sciences; Computation and Language (cs.CL)
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
