Publication type: Part of book or chapter of book, Article, Conference object
Date: 01 Jan 2015
Language: English
Publisher: Springer International Publishing

Authors: De Mol, Robin; Bronselaer, Antoon; Nielandt, Joachim; De Tré, Guy;

doi: 10.1007/978-3-319-11313-5_50

handle: 1854/LU-5718316

Data Driven XPath Generation

Abstract

The XPath query language offers a standard for extracting information from HTML documents. For this purpose, the DOM tree representation, which models the hierarchical structure of the document, is typically used. One of the key aspects of HTML is the separation of data from the structure used to represent it. As a consequence, data extraction algorithms usually fail to identify the data when the structure of a document changes. This paper investigates how a set of table-oriented XPath queries can be adapted so that it copes with modifications to the DOM tree of an HTML document. The basic idea is that data already extracted in the past can be used to reconstruct XPath queries that retrieve the data from a different DOM tree. Experimental results show the accuracy of our method.
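
The basic idea can be illustrated with a minimal sketch. It is not the authors' algorithm, only one simple way to read the abstract: values extracted earlier are searched for in the new DOM tree, and a positional XPath is rebuilt for each node that contains one. The function names `absolute_xpath` and `regenerate_xpaths`, the sample document, and the exact-text matching rule are all assumptions made for illustration, and the input is assumed to be well-formed XHTML so that Python's standard `xml.etree.ElementTree` parser can read it.

```python
import xml.etree.ElementTree as ET


def absolute_xpath(element, parents):
    """Build a positional path such as /html/body/table/tr[2]/td[1]."""
    steps = []
    node = element
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            # The root element has no siblings to disambiguate.
            steps.append(node.tag)
        else:
            same_tag = [child for child in parent if child.tag == node.tag]
            if len(same_tag) > 1:
                # XPath positions are 1-based.
                steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
            else:
                steps.append(node.tag)
        node = parent
    return "/" + "/".join(reversed(steps))


def regenerate_xpaths(document, known_values):
    """Map each previously extracted value to an XPath in the new document.

    Hypothetical helper: matches on exact, whitespace-stripped element text
    and keeps only the first match per value.
    """
    root = ET.fromstring(document)
    parents = {child: parent for parent in root.iter() for child in parent}
    found = {}
    for element in root.iter():
        text = (element.text or "").strip()
        if text in known_values and text not in found:
            found[text] = absolute_xpath(element, parents)
    return found


# Illustrative document: the table has been wrapped in a new <div>,
# so a query written for the old layout would no longer match.
NEW_LAYOUT = """<html><body><div><table>
<tr><td>Name</td><td>Price</td></tr>
<tr><td>Widget</td><td>9.99</td></tr>
</table></div></body></html>"""

paths = regenerate_xpaths(NEW_LAYOUT, {"Widget", "9.99"})
for value, path in sorted(paths.items()):
    print(value, "->", path)
```

Exact text matching is the simplest possible choice here; it breaks as soon as the data itself changes between visits, which is the harder case a real method has to address.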

Related Organizations

Ghent University
Belgium

Keywords

HTML, Technology and Engineering, XPath Generation, WEB DATA EXTRACTION, Data Driven, INFORMATION EXTRACTION

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.