Nibbling at the Hard Core of Word Sense Disambiguation

Publication: Article, Conference object

Date: 01 Jan 2022

Country: Italy

Publisher: Association for Computational Linguistics (ACL)

Journal: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Funded by: EC | ELG, EC | MOUSSE, EC | ELEXIS

Authors: Maru, Marco; Conia, Simone; Bevilacqua, Michele; Navigli, Roberto

doi: 10.18653/v1/2022.acl-long.324, 10.5281/zenodo.6800975, 10.5281/zenodo.6800974

handle: 11573/1639906

Abstract

With state-of-the-art systems having finally attained estimated human performance, Word Sense Disambiguation (WSD) has now joined the array of Natural Language Processing tasks that have seemingly been solved, thanks to the vast amounts of knowledge encoded into Transformer-based pre-trained language models. And yet, if we look below the surface of raw figures, it is easy to realize that current approaches still make trivial mistakes that a human would never make. In this work, we provide evidence showing why the F1 score metric should not simply be taken at face value and present an exhaustive analysis of the errors that seven of the most representative state-of-the-art systems for English all-words WSD make on traditional evaluation benchmarks. In addition, we produce and release a collection of test sets featuring (a) an amended version of the standard evaluation benchmark that fixes its lexical and semantic inaccuracies, (b) 42D, a challenge set devised to assess the resilience of systems with respect to least frequent word senses and senses not seen at training time, and (c) hardEN, a challenge set made up solely of instances which none of the investigated state-of-the-art systems can solve. We make all of the test sets and model predictions available to the research community at https://github.com/SapienzaNLP/wsd-hard-benchmark.
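
As an illustration only, the selection criterion stated for hardEN (instances that none of the investigated systems can solve) and the F1 score mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' actual procedure or the file format of the released benchmark: the function names `hard_instances` and `f1_score`, the instance ids, and the sense labels are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical layout: gold maps an instance id to its set of acceptable
# sense labels; each system's predictions map an instance id to one label.
Gold = Dict[str, Set[str]]
Predictions = Dict[str, str]


def f1_score(gold: Gold, preds: Predictions) -> float:
    """F1 over all gold instances; unanswered instances lower recall only."""
    answered = [i for i in gold if i in preds]
    correct = sum(preds[i] in gold[i] for i in answered)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hard_instances(gold: Gold, systems: List[Predictions]) -> List[str]:
    """Ids of instances that no system in `systems` answers correctly."""
    hard = []
    for instance_id, gold_senses in gold.items():
        solved = any(p.get(instance_id) in gold_senses for p in systems)
        if not solved:
            hard.append(instance_id)
    return sorted(hard)


# Made-up toy data (labels are placeholders, not real sense keys).
gold = {
    "d1.s1.t1": {"bank.sense1"},
    "d1.s1.t2": {"plant.sense2"},
    "d1.s2.t1": {"bass.sense1", "bass.sense2"},
}
system_a = {"d1.s1.t1": "bank.sense1", "d1.s1.t2": "plant.sense1", "d1.s2.t1": "bass.sense3"}
system_b = {"d1.s1.t1": "bank.sense2", "d1.s1.t2": "plant.sense1", "d1.s2.t1": "bass.sense2"}

print(hard_instances(gold, [system_a, system_b]))
print(round(f1_score(gold, system_a), 3))
```

In this toy setup the two systems fail on different instances, so a per-system F1 says little about which instances are hard for every system; the intersection of failures is what the sketch isolates.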

Country

Italy

Keywords

strategies, tools, standards for lexicographic resources (objective 3), WP3, word sense disambiguation; semantics; natural language processing; benchmark

Impact by BIP!

selected citations: 10

popularity: Top 10%

influence: Average

impulse: Top 10%