Inference of Regular Expressions for Text Extraction from Examples

Article, 01 May 2016, Italy
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal: IEEE Transactions on Knowledge and Data Engineering, volume 28, pages 1217-1230 (ISSN: 1041-4347)

Authors: Bartoli, Alberto; De Lorenzo, Andrea; Medvet, Eric; Tarlao, Fabiano

doi: 10.1109/tkde.2016.2515587

handle: 11368/2864925

Abstract

A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior. We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at http://regex.inginf.units.it.
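To make the idea of scoring candidate expressions on several performance indexes at once more concrete, here is a minimal Python sketch. It is not the authors' system or their fitness definition: the three objectives (span-level precision, span-level recall, expression length), the helper names `spans_of`, `extract`, `objectives`, `dominates` and `pareto_front`, and the toy examples are all assumptions chosen for illustration. The sketch only ranks a fixed list of hand-written candidates by Pareto dominance; an evolutionary search would additionally generate and mutate candidates.

```python
import re

def spans_of(text, snippets):
    """Return the (start, end) spans of every occurrence of each snippet in text."""
    spans = set()
    for s in snippets:
        start = text.find(s)
        while start != -1:
            spans.add((start, start + len(s)))
            start = text.find(s, start + 1)
    return spans

def extract(pattern, text):
    """Return the set of non-empty (start, end) spans matched by pattern."""
    return {m.span() for m in re.finditer(pattern, text) if m.end() > m.start()}

def objectives(pattern, examples):
    """Score a candidate as (precision, recall, length) over labelled examples.

    A match counts as correct only if its span equals a desired span exactly.
    """
    tp = fp = fn = 0
    for text, wanted in examples:
        found = extract(pattern, text)
        tp += len(found & wanted)
        fp += len(found - wanted)
        fn += len(wanted - found)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, len(pattern)

def dominates(a, b):
    """True if score a is no worse than b everywhere and strictly better somewhere.

    Precision and recall are maximised; length is minimised.
    """
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better

def pareto_front(candidates, examples):
    """Keep the candidates that no other candidate dominates."""
    scored = [(c, objectives(c, examples)) for c in candidates]
    return [c for c, s in scored if not any(dominates(t, s) for _, t in scored)]

# Hypothetical toy task: extract phone-like tokens.
raw = [
    ("Call 555-1234 or 555-9876 before 12 May.", ["555-1234", "555-9876"]),
    ("Fax: 555-0000", ["555-0000"]),
]
examples = [(text, spans_of(text, wanted)) for text, wanted in raw]
candidates = [r"\d+", r"\d{3}-\d{4}", r"\d\d\d-\d\d\d\d"]

for c in pareto_front(candidates, examples):
    print(c, objectives(c, examples))
```

Note that a very short but inaccurate expression can remain on the front simply because nothing shorter exists. That is one reason a multiobjective search still needs a final selection step, which this sketch leaves out.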

Country: Italy

Related Organizations: University of Trieste, Italy

Keywords

Genetic Programming; Information extraction; Programming by examples; Multiobjective optimization; Heuristic search

Impact by BIP!

	selected citations: 64
	popularity: Top 10%
	influence: Top 10%
	impulse: Top 10%

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
