Efficient exact similarity joins

Similarity joins play an important role in many application areas, such as near duplicate Web page detection, data integration and cleaning, record linkage, and pattern recognition. Consequently, there has been much interest in developing efficient algorithms for this fundamental operation. In this thesis, we investigate four important problems of similarity join. 1. set similarity joins 2. similarity joins with edit constraints 3. top-k set similarity joins 4. approximate entity extraction with edit constraints We first study the problem of set similarity join. Two filtering techniques are developed by exploiting the ordering of tokens. They can be adapted or combined with existing approaches to produce better quality results or improve the runtime efficiency in detecting near duplicate Web pages. To address the problem of similarity joins with edit constraints, we propose a novel perspective of analyzing mismatching q-grams to speed up the similarity join. Two new filtering methods are developed to handle non-clustered edit errors and clustered edit errors. Then we study the top-k set similarity join problem. A novel algorithm is proposed to answer top-k similarity join queries. Opposed to traditional similarity joins algorithms, the proposed algorithm can progressively compute join results without providing a similarity threshold. Finally, we study the problem of approximate entity extraction with edit constraints. We tackle the major technical problem in existing neighborhood generation algorithms. Two novel pruning techniques are proposed to reduce the size of neighborhood, making our approach a highly competitive solution to approximate entity extraction.

Country

Australia

Related Organizations

UNSW Sydney
Australia

Keywords

Near duplicate detection, Similarity join, Text and document retrieval, 004

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average

Green