Quantified Reproducibility Assessment of NLP Results

Publication type: Article, Preprint, Conference object
Date: 01 Jan 2022
Embargo end date: 01 Jan 2022
Country: Ireland
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Publicly funded
Funded by: SFI | ADAPT: Centre for Digital..., EC | CONNEXIONs, EC | WELCOME +2 projects

Authors: Anya Belz; Maja Popovic; Simon Mille

doi: 10.18653/v1/2022.acl-long.2, 10.48550/arxiv.2204.05961

arXiv: 2204.05961

Abstract

This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and allows conclusions to be drawn about what changes to system and/or evaluation design might lead to improved reproducibility.
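The abstract does not spell out how the single degree-of-reproducibility score is computed. As a minimal sketch only, the example below assumes a precision measure common in metrology: the coefficient of variation over the original and reproduction scores, with a small-sample correction factor of (1 + 1/(4n)). The function name `cv_star` and the choice of this particular formula are assumptions made for illustration, not details taken from this record.

```python
import statistics


def cv_star(scores):
    """Small-sample-corrected coefficient of variation, in percent.

    Assumption for illustration: the score for one system/measure
    combination is computed from the original result plus its
    reproductions, all passed together in `scores`.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need the original score and at least one reproduction")
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(scores)
    cv = 100.0 * sd / abs(mean)
    # Correction factor that inflates the estimate for small n.
    return (1.0 + 1.0 / (4.0 * n)) * cv


# Hypothetical scores: one original result and two reproductions.
print(round(cv_star([27.0, 26.1, 28.4]), 2))
```

Under these assumptions a lower value means the reproductions agree more closely, and because the measure is relative to the mean, values can be compared across studies whose scores are on different scales.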

To be published in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL'22)

Country

Ireland

Related Organizations

Universitat Pompeu Fabra, Spain
Dublin City University, Ireland

Keywords

FOS: Computer and information sciences, Computer Science - Computation and Language, Machine learning, Computational linguistics, Computation and Language (cs.CL)

Impact by BIP!

selected citations: 6
popularity: Top 10%
influence: Average
impulse: Top 10%