Publication: Article, Conference object

Date: 01 Feb 2014

Language: English

Publisher: Elsevier BV

Journal: Journal of Biomedical Informatics, volume 47, pages 83-90

issn: 1532-0464

Authors: Mark Stevenson; Bridget T. McInnes

doi: 10.1016/j.jbi.2013.09.009

pmid: 24076369

Determining the difficulty of Word Sense Disambiguation

Abstract

Automatic processing of biomedical documents is made difficult by the fact that many of the terms they contain are ambiguous. Word Sense Disambiguation (WSD) systems attempt to resolve these ambiguities and identify the correct meaning. However, the published literature on WSD systems for biomedical documents reports considerable differences in performance for different terms. The development of WSD systems is often expensive with respect to acquiring the necessary training data. It would therefore be useful to be able to predict in advance which terms WSD systems are likely to perform well or badly on. This paper explores various methods for estimating the performance of WSD systems on a wide range of ambiguous biomedical terms (including ambiguous words/phrases and abbreviations). The methods include both supervised and unsupervised approaches. The supervised approaches make use of information from labeled training data while the unsupervised ones rely on the UMLS Metathesaurus. The approaches are evaluated by comparing their predictions about how difficult disambiguation will be for ambiguous terms against the output of two WSD systems. We find the supervised methods are the best predictors of WSD difficulty, but are limited by their dependence on labeled training data. The unsupervised methods all perform well in some situations and can be applied more widely.
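
The evaluation idea in the abstract (comparing predicted difficulty per term against what WSD systems actually achieve) can be illustrated with a minimal sketch. The abstract does not name the difficulty measures or the comparison statistic, so everything specific here is an assumption made for illustration: the entropy of the sense distribution in labeled data stands in for a supervised difficulty predictor, Spearman rank correlation stands in for the comparison, and the terms, sense labels and accuracy figures are hypothetical toy values, not results from the paper.

```python
import math
from collections import Counter


def sense_entropy(labels):
    """Shannon entropy (bits) of the sense distribution in labeled examples.

    Assumption: higher entropy is used as a proxy for a harder term.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: sense labels per ambiguous term, and the
# accuracy some WSD system is imagined to reach on each term.
labeled = {
    "term_a": ["s1"] * 9 + ["s2"] * 1,
    "term_b": ["s1"] * 5 + ["s2"] * 5,
    "term_c": ["s1"] * 7 + ["s2"] * 3,
}
accuracy = {"term_a": 0.95, "term_b": 0.70, "term_c": 0.85}

terms = sorted(labeled)
difficulty = [sense_entropy(labeled[t]) for t in terms]
observed = [accuracy[t] for t in terms]
rho = spearman(difficulty, observed)
print(f"Spearman correlation between predicted difficulty and accuracy: {rho:.3f}")
```

In this toy setup a strongly negative correlation would mean the predictor ranks terms in the opposite order to the accuracy the system reaches, which is the behaviour one would want from a difficulty estimate. An unsupervised predictor could be dropped in by replacing `sense_entropy` with a score computed from a knowledge source instead of labeled examples.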

Related Organizations

University of Minnesota System, United States
University of Sheffield (Dept. Computer Science), United Kingdom
University of Minnesota Morris, United States
University of Sheffield, United Kingdom
University of Minnesota, United States

Keywords

Ambiguity, Knowledge Bases, MEDLINE, Health Informatics, NLP, Artificial Intelligence, Humans, WSD, Word Sense Disambiguation, Language, Natural Language Processing, Models, Statistical, Reproducibility of Results, Biomedical documents, Unified Medical Language System, Computer Science Applications, Semantics, Vocabulary, Controlled, Algorithms, Medical Informatics

Impact by BIP!

citations: 23. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.