Powered by OpenAIRE graph
UNSWorks
Doctoral thesis . 2010
License: CC BY NC ND
https://dx.doi.org/10.26190/un...
Data sources: Datacite
DBLP
Doctoral thesis
Data sources: DBLP

Efficient exact similarity joins

Authors: Xiao, Chuan;

Abstract

Similarity joins play an important role in many application areas, such as near-duplicate Web page detection, data integration and cleaning, record linkage, and pattern recognition. Consequently, there has been much interest in developing efficient algorithms for this fundamental operation. In this thesis, we investigate four important similarity join problems:

1. set similarity joins
2. similarity joins with edit constraints
3. top-k set similarity joins
4. approximate entity extraction with edit constraints

We first study the problem of set similarity join. Two filtering techniques are developed by exploiting the ordering of tokens. They can be adapted to or combined with existing approaches to produce better-quality results or improve runtime efficiency in detecting near-duplicate Web pages.

To address the problem of similarity joins with edit constraints, we propose a novel perspective of analyzing mismatching q-grams to speed up the similarity join. Two new filtering methods are developed to handle non-clustered edit errors and clustered edit errors.

Then we study the top-k set similarity join problem. A novel algorithm is proposed to answer top-k similarity join queries. Unlike traditional similarity join algorithms, the proposed algorithm can progressively compute join results without requiring a similarity threshold.

Finally, we study the problem of approximate entity extraction with edit constraints. We tackle the major technical problem in existing neighborhood generation algorithms. Two novel pruning techniques are proposed to reduce the size of the neighborhood, making our approach a highly competitive solution to approximate entity extraction.
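
To make the first problem concrete, the following is a minimal sketch of an exact set similarity join under a Jaccard threshold. It uses the generic, widely known prefix-filtering idea over a global token ordering (rarest tokens first); it is not the thesis's own pair of filtering techniques, which the abstract does not spell out. The function names `jaccard` and `prefix_filter_join`, the choice of Jaccard similarity, and the sample records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(records, threshold):
    """Return (i, j, sim) for all i < j with Jaccard(records[i], records[j]) >= threshold.

    Assumes 0 < threshold <= 1. Illustrative sketch only.
    """
    sets = [set(rec) for rec in records]
    # Global token ordering: rarest tokens first, ties broken by the token itself.
    freq = Counter(tok for rec in sets for tok in rec)
    index = defaultdict(list)  # token -> ids of earlier records whose prefix contains it
    results = []
    for rid, rec in enumerate(sets):
        tokens = sorted(rec, key=lambda t: (freq[t], t))
        # Any partner with Jaccard >= threshold shares at least ceil(threshold * |rec|)
        # tokens with rec, so the two prefixes below must share a token.
        # The small epsilon guards against floating-point round-up.
        min_overlap = ceil(threshold * len(tokens) - 1e-9)
        prefix = tokens[: len(tokens) - min_overlap + 1]
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verify each surviving candidate with the exact similarity.
        for cid in sorted(candidates):
            sim = jaccard(sets[cid], rec)
            if sim >= threshold:
                results.append((cid, rid, sim))
    return results


if __name__ == "__main__":
    docs = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"], ["b", "c", "d", "a"]]
    for i, j, sim in prefix_filter_join(docs, 0.6):
        print(i, j, round(sim, 2))
```

Only records that share a token within their short prefixes are ever compared, so most dissimilar pairs are pruned before the exact check, while every pair that meets the threshold is still returned.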

Country
Australia
Keywords

Near duplicate detection, Similarity join, Text and document retrieval, 004

  • BIP! impact indicators (all based on the underlying citation network)
    selected citations: 0 (derived from selected sources; an alternative to the "Influence" indicator)
    popularity: Average (the "current" impact/attention of the article in the research community at large)
    influence: Average (the overall/total impact of the article in the research community at large, diachronically)
    impulse: Average (the initial momentum of the article directly after its publication)
Open access route: Green