Name: Evaluating the evaluation
Creator: Ellen M. Voorhees
Keywords: 9. Industry and infrastructure, 4. Education, 05 social sciences, 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, 0509 other social sciences, 7. Clean energy, 12. Responsible consumption

a case study using the TREC 2002 question answering track

Publication: Article, 01 Jan 2003
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL '03, volume 1, pages 181-188

Authors: Ellen M. Voorhees

doi: 10.3115/1073445.1073479

Abstract

Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.

Related Organizations

National Institute of Standards and Technology
United States

Impact by BIP!

Selected citations: 8
Popularity: Average
Influence: Top 10%
Impulse: Top 10%