A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation

Publication: Article, Conference object, Preprint
Date: 01 Jan 2020
Embargo end date: 01 Jan 2020
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Funded by: SNSF | Learning to Interact with..., CHIST-ERA | LIHLITH, EC | INODE

Authors: Jan Deriu; Katsiaryna Mlynchyk; Philippe Schläpfer; Álvaro Rodrigo; Dirk Von Gruenigen; Nicolas Kaiser; Kurt Stockinger; +2 Authors

doi: 10.18653/v1/2020.acl-main.84, 10.48550/arxiv.2004.07633, 10.21256/zhaw-20319

arXiv: 2004.07633

Abstract

In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation called Operation Trees (OT), which is based on the logical query plan in a database. This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of query tokens to OT operations. In our method, we randomly generate OTs from a context-free grammar. Annotators then write the natural language question that each OT represents. Finally, the annotators assign the tokens of the question to the OT operations. We apply the method to create a new corpus, OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases. We compare OTTA to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase performance significantly.
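The generation step described in the abstract (randomly sampling operation trees from a context-free grammar) can be illustrated with a minimal sketch. The grammar below is a hypothetical toy: the nonterminals (`ROOT`, `RESULT`, `SET`), the operation names, the `Node` class, and the `max_depth` cutoff are all assumptions made for illustration and are not taken from the paper or the OTTA corpus.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operation in a (toy) operation tree."""
    op: str
    children: list = field(default_factory=list)


# Hypothetical toy grammar: nonterminal -> list of (operation, child nonterminals).
# This is NOT the grammar used by the authors.
GRAMMAR = {
    "ROOT": [("Done", ["RESULT"])],
    "RESULT": [("Projection", ["SET"]), ("Count", ["SET"])],
    "SET": [("Filter", ["SET"]), ("Join", ["SET", "SET"]), ("GetData", [])],
}


def sample_tree(symbol, rng, max_depth):
    """Expand `symbol` by picking a random production, recursively."""
    rules = GRAMMAR[symbol]
    if max_depth <= 1:
        # Out of depth budget: prefer productions without children, if any exist.
        leaf_rules = [rule for rule in rules if not rule[1]]
        rules = leaf_rules or rules
    op, child_symbols = rng.choice(rules)
    return Node(op, [sample_tree(c, rng, max_depth - 1) for c in child_symbols])


def depth(node):
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in node.children), default=0)


def leaves(node):
    """All nodes without children, left to right."""
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]


def render(node):
    """Compact bracketed string form, e.g. Done(Count(GetData))."""
    if not node.children:
        return node.op
    return f"{node.op}({', '.join(render(c) for c in node.children)})"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sampled trees are reproducible
    for _ in range(3):
        print(render(sample_tree("ROOT", rng, max_depth=5)))
```

In an inverse annotation workflow of the kind the abstract describes, each sampled tree would be shown to an annotator, who writes the matching question and then assigns question tokens to the tree's operations. The depth cutoff here is only a simple way to guarantee termination in the sketch.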

Related Organizations

Zürcher Fachhochschule, Switzerland
Zurich University of Applied Sciences, Switzerland
National University of Distance Education, Spain
University of the Basque Country, Spain

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial intelligence, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Deep learning, Semantic parsing, 006: Special computer methods, Machine Learning (cs.LG), 400: Language and linguistics, Artificial Intelligence (cs.AI), Natural language interface to database, Computation and Language (cs.CL)

Impact by BIP!

selected citations: 7. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.