Fast detection of XML structural similarity

Publication: Article, Conference object

Date: 01 Feb 2005

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Journal: IEEE Transactions on Knowledge and Data Engineering, volume 17, pages 160-175 (ISSN: 1041-4347)

Authors: Sergio Flesca; G. Manco; E. Masciari; L. Pontieri; Andrea Pugliese

doi: 10.1109/tkde.2005.27

handle: 11588/761559 , 11588/763002 , 20.500.14243/201495 , 20.500.14243/13397 , 20.500.14243/196921 , 20.500.11770/123213 , 20.500.11770/167299

Abstract

Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to supporting the storage and retrieval of large collections of such documents. XML documents can be compared by structural similarity in order to group them into clusters, so that different storage, retrieval, and processing techniques can be effectively exploited. In this scenario, an efficient and effective similarity function is the key to a successful data management process. We present an approach for detecting structural similarity between XML documents which differs significantly from standard methods based on graph-matching algorithms and allows a significant reduction of the required computation costs. Our proposal roughly consists of linearizing the structure of each XML document by representing it as a numerical sequence and then comparing such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, which can focus on diverse structural facets. Moreover, the theory of the Discrete Fourier Transform is exploited to effectively and efficiently compare the encoded documents (i.e., signals) in the frequency domain. Experimental results reveal the effectiveness of the approach, also in comparison with standard methods.
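The pipeline outlined in the abstract (linearize the tag structure into a numerical sequence, then compare sequences in the frequency domain) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoding scheme (a positive integer for each start tag, its negation for the matching end tag), the zero-padding to a common length, the Euclidean distance over DFT magnitudes, and the function names `encode_structure`, `dft_magnitudes`, and `structural_distance` are all assumptions made for illustration.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def encode_structure(xml_text, tag_ids):
    """Linearize a document's tag structure into a list of integers.

    Assumed encoding (hypothetical, for illustration): each distinct tag
    name gets a positive id; a start tag emits +id, an end tag emits -id.
    Text content and attributes are ignored.
    """
    signal = []

    def visit(elem):
        code = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        signal.append(code)       # start tag
        for child in elem:
            visit(child)
        signal.append(-code)      # end tag

    visit(ET.fromstring(xml_text))
    return signal


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point DFT of the zero-padded signal (naive O(n^2))."""
    padded = signal + [0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(xml_a, xml_b):
    """Euclidean distance between the DFT magnitude spectra of two documents."""
    tag_ids = {}  # shared so both documents use the same tag ids
    a = encode_structure(xml_a, tag_ids)
    b = encode_structure(xml_b, tag_ids)
    n = max(len(a), len(b))  # assumption: zero-pad to a common length
    spec_a = dft_magnitudes(a, n)
    spec_b = dft_magnitudes(b, n)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(spec_a, spec_b)))


if __name__ == "__main__":
    same_structure = structural_distance("<a><b/><c/></a>",
                                         "<a><b>x</b><c>y</c></a>")
    extra_element = structural_distance("<a><b/><c/></a>",
                                        "<a><b/><b/><c/></a>")
    print(same_structure, extra_element)
```

Under these assumptions, two documents that differ only in text content map to the same signal and are at distance zero, while adding an element changes the spectrum. The naive DFT is used only to keep the sketch dependency-free; an FFT would be the natural choice for long documents.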

Related Organizations

University Federico II of Naples (Italy)
University of Calabria (Italy)
National Research Council (Italy)
Institute of High Performance Computing (Singapore)
Institute for high performance computing and networking (Italy)

Keywords

Web mining; mining methods and algorithms; XML/XSL/RDF; text mining; similarity measures

Impact by BIP!

Selected citations: 78. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Top 1%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Related to Research communities

Aurora Universities Network