Evaluation of Semantic Answer Similarity Metrics

Publication: Article, Preprint

Date: 25 Jun 2022

Embargo end date: 01 Jan 2022

Publisher: Academy and Industry Research Collaboration Center (AIRCC)

Journal: Machine Learning & Applications

Authors: Farida Mustafazade; Peter F. Ebbinghaus

doi: 10.5121/csit.2022.121109, 10.5121/ijnlc.2022.11305, 10.5281/zenodo.6984757, 10.5281/zenodo.6984756, 10.48550/arxiv.2206.12664

arXiv: 2206.12664

Abstract

Existing general-purpose evaluation metrics for machine translation and natural language generation have several known issues, and question-answering (QA) systems are no different in that respect. To build robust QA systems, we need equally robust evaluation systems that verify whether model predictions for questions are similar to ground-truth annotations. Comparing similarity based on semantics, as opposed to pure string overlap, is important for comparing models fairly and for indicating more realistic acceptance criteria in real-life applications. We build on what is, to our knowledge, the first paper to use transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used both for training and as a benchmark.
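To illustrate why pure string overlap can misjudge co-referent answers, here is a minimal sketch of two common lexical QA metrics, exact match and token-level F1. The normalization steps (lowercasing, stripping punctuation and English articles) are an assumption in the style of common QA evaluation scripts, not the authors' implementation, and the name pairs are illustrative examples rather than entries from the paper's dataset.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed normalization for illustration only.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> bool:
    """True only if the normalized token sequences are identical."""
    return normalize(prediction) == normalize(truth)


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical co-referent pairs: same person, little or no shared tokens.
    pairs = [
        ("JFK", "John F. Kennedy"),
        ("Bill Clinton", "William Jefferson Clinton"),
    ]
    for prediction, truth in pairs:
        print(prediction, "|", truth,
              "| EM:", exact_match(prediction, truth),
              "| F1:", round(token_f1(prediction, truth), 2))
```

Both pairs refer to the same person, yet exact match rejects them and token F1 is zero for the first pair and low for the second. This is the gap that semantic similarity metrics such as the bi-encoder, cross-encoder and BERTScore models named in the abstract are meant to close.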

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Question-answering, semantic answer similarity, exact match, pre-trained language models, cross-encoder, bi-encoder, semantic textual similarity, automated data labelling, Computation and Language (cs.CL), Machine Learning (cs.LG)

Impact by BIP!

	selected citations: 1 (derived from selected sources; an alternative to the "Influence" indicator)
	popularity: Average (the "current" impact or attention of an article in the research community at large, based on the underlying citation network)
	influence: Average (the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
	impulse: Average (the initial momentum of an article directly after its publication, based on the underlying citation network)