
arXiv: 2506.04772
Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.
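The abstract reports findings rather than an implementation, but the contrast between reference-based similarity metrics and LLM-as-a-judge evaluation can be illustrated with a small sketch. The snippet below assumes the `rouge_score` and `bert_score` Python packages; the judge prompt, its two rating dimensions, and the `call_llm` callable are illustrative placeholders, not the paper's actual evaluation setup.

```python
"""Minimal sketch of a hybrid revision-quality check: reference-based
similarity metrics (ROUGE-L, BERTScore) alongside an LLM-as-a-judge prompt.
Assumes the `rouge_score` and `bert_score` packages; the prompt wording and
the `call_llm` callable are hypothetical placeholders."""

from rouge_score import rouge_scorer
from bert_score import score as bert_score


def similarity_metrics(revision: str, reference: str) -> dict:
    """Reference-based metrics: high scores mean the revision is close to
    the gold reference, not necessarily that it is a good revision."""
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, revision)["rougeL"].fmeasure
    _, _, f1 = bert_score([revision], [reference], lang="en", verbose=False)
    return {"rougeL_f": rouge_l, "bertscore_f1": f1.item()}


# Illustrative judge prompt covering the two aspects discussed in the
# abstract: instruction-following and correctness.
JUDGE_PROMPT = """You are evaluating a revision of a scientific sentence.
Instruction given to the reviser: {instruction}
Original sentence: {original}
Revised sentence: {revision}

Rate each aspect on a 1-5 scale:
1. Does the revision follow the instruction?
2. Is the revision factually and grammatically correct?
Answer as JSON: {{"instruction_following": <int>, "correctness": <int>}}"""


def judge_revision(original: str, revision: str, instruction: str, call_llm) -> str:
    """LLM-as-a-judge call; `call_llm` is any function mapping a prompt
    string to the model's reply (e.g. a wrapper around a chat API)."""
    prompt = JUDGE_PROMPT.format(
        instruction=instruction, original=original, revision=revision
    )
    return call_llm(prompt)
```

In a hybrid readout along the lines the abstract describes, the judge's instruction-following and correctness ratings would be reported alongside the task-specific similarity scores rather than merged into a single number.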
V3 contains only the English version, accepted to ACL 2025 main (26 pages). V2 contains both the English (ACL 2025) and French (TALN 2025) versions (58 pages).
FOS: Computer and information sciences; Computation and Language (cs.CL)
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
