Efficient Enumeration Algorithms for Regular Document Spanners

Publication: Article, 08 Feb 2020, English. Publisher: Association for Computing Machinery (ACM). Journal: ACM Transactions on Database Systems, volume 45, pages 1-42 (ISSN: 0362-5915, eISSN: 1557-4644)

Authors: Florenzano Hernández, Fernando Alberto; Riveros Jaeger, Cristian; Ugarte, M.; Vansummeren, S.; Vrgoc, Domagoj

doi: 10.1145/3351451

handle: 2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/307225


Abstract

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to locate the data that a user wants to extract from a text document and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have efficient evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Toward this goal, we present a practical evaluation algorithm that allows output-linear delay enumeration of a spanner’s result after a precomputation phase that is linear in the document. Although the algorithm assumes that the spanner is specified in a syntactic variant of variable-set automata, we also study how it can be applied when the spanner is specified by general variable-set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner and provide a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
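To make the notion of a spanner's output concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm from the article: it simply tests every candidate span of the document against a pattern, which takes at least quadratic time before all results are known. The names `Span` and `naive_spanner` are hypothetical, and a single capture variable with 0-based half-open spans is assumed for simplicity.

```python
import re
from typing import Iterator, Tuple

# A span is a half-open interval [start, end) of 0-based character offsets.
# (Assumption for this sketch; the article may use a different convention.)
Span = Tuple[int, int]


def naive_spanner(pattern: str, document: str) -> Iterator[Span]:
    """Yield every span of `document` whose content fully matches `pattern`.

    Brute-force illustration only: all O(n^2) candidate spans are tested,
    so this does not achieve linear precomputation or output-linear delay.
    """
    compiled = re.compile(pattern)
    n = len(document)
    for start in range(n + 1):
        for end in range(start, n + 1):
            # Pattern.fullmatch(string, pos, endpos) restricts the match
            # to exactly document[start:end].
            if compiled.fullmatch(document, start, end):
                yield (start, end)


if __name__ == "__main__":
    # Every non-empty run of 'a' characters in "aab" is a separate output.
    print(list(naive_spanner(r"a+", "aab")))
    # The number of outputs can grow quadratically with the document length.
    print(len(list(naive_spanner(r"a+", "a" * 4))))
```

Even this tiny pattern produces quadratically many spans on a document of repeated `a` characters, which illustrates why the abstract emphasizes enumerating results with small delay rather than materializing them all at once.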

Keywords

Enumeration delay, Information extraction, General computer science, Automata, Formal languages and automata, Capture variables, Spanners, Information storage and retrieval of data, Nonnumerical algorithms


Impact by BIP!

- Selected citations: 19. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.