A Study of Automatic Metrics for the Evaluation of Natural Language Explanations

Publication type: Article, Preprint, Conference object, Contribution for newspaper or weekly magazine

Date: 01 Jan 2021

Embargo end date: 01 Jan 2021

Country: United Kingdom

Publisher: Association for Computational Linguistics (ACL)

Journal: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Funded by: UKRI | EPSRC Centre for Doctoral..., UKRI | UK Robotics and Artificia...

Authors: Miruna-Adriana Clinciu; Arash Eshghi; Helen F. Hastie

doi: 10.18653/v1/2021.eacl-main.202, 10.48550/arxiv.2103.08545

arXiv: 2103.08545

handle: 20.500.11820/6031b13f-100b-49a0-ae3e-08eec96cdbff


Abstract

As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations. Here, we explore parallels between the generation of such explanations and the much-studied field of evaluation of Natural Language Generation (NLG). Specifically, we investigate which of the NLG evaluation measures map well to explanations. We present the ExBAN corpus: a crowd-sourced corpus of NL explanations for Bayesian Networks. We run correlations comparing human subjective ratings with NLG automatic measures. We find that embedding-based automatic NLG evaluation methods, such as BERTScore and BLEURT, have a higher correlation with human ratings, compared to word-overlap metrics, such as BLEU and ROUGE. This work has implications for Explainable AI and transparent robotic and autonomous systems.

Accepted at EACL 2021
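
The correlation analysis described in the abstract can be illustrated with a small sketch. It computes a Spearman rank correlation between human ratings and the scores of an automatic metric, using only the Python standard library. Spearman is an assumption made here as one common choice; the abstract does not say which coefficient the authors used. The names `human_ratings`, `metric_a`, and `metric_b` and all the numbers are invented for illustration and are not taken from the ExBAN corpus or from the paper's results.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every value tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data (NOT from the paper): one human rating per explanation,
# and the scores two hypothetical automatic metrics gave the same explanations.
human_ratings = [1, 2, 3, 4, 5, 6]
metric_a = [0.10, 0.35, 0.30, 0.60, 0.80, 0.90]
metric_b = [0.50, 0.20, 0.60, 0.10, 0.40, 0.30]

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(f"{name}: rho = {spearman(human_ratings, scores):.3f}")
```

In this setup, a metric whose ranking of explanations agrees more closely with the human ranking gets a coefficient nearer to 1. A real analysis would also report significance, for example with an existing statistics library rather than this hand-written version.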

Country

United Kingdom


Keywords

FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL)

Related research products (1)

COS software on GitHub (IsRelatedTo)

Impact by BIP!

- Selected citations: 15. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.