FrepJoin: an efficient partition-based algorithm for edit similarity join

Article, 01 Oct 2017, English
Publisher: Zhejiang University Press
Journal: Frontiers of Information Technology & Electronic Engineering, volume 18, pages 1499-1510 (ISSN: 2095-9184, eISSN: 2095-9230)

Authors: Jizhou Luo; Shengfei Shi; Hongzhi Wang; Jianzhong Li

doi: 10.1631/fitee.1601347

Abstract

String similarity join (SSJ) is essential for many applications in which near-duplicate objects must be found. This paper targets SSJ with edit distance constraints. Existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between subsets of strings, and they do not fully exploit statistics such as character frequencies. We develop a partition-based algorithm that uses such statistics. Frequency vectors are used to partition datasets into data chunks such that the dissimilarity between chunks is easy to detect. A novel algorithm is designed to accelerate SSJ over the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive the existing filters. Our algorithm notably outperforms alternative methods on real datasets.
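To make the role of character frequencies concrete, the following is a minimal sketch of a generic frequency-vector filter. It is not the filter or partitioning scheme proposed in the paper, whose details the abstract does not give. It relies on a standard bound: a single edit operation changes the surplus of characters on either side by at most one, so the larger of the two surpluses is a lower bound on the edit distance. The function names and the threshold parameter `tau` are assumptions introduced for illustration.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on edit distance derived from character frequency vectors."""
    fs, ft = Counter(s), Counter(t)
    # Counter subtraction keeps only positive counts, so each sum is the
    # number of characters one string has in excess of the other.
    surplus_s = sum((fs - ft).values())
    surplus_t = sum((ft - fs).values())
    return max(surplus_s, surplus_t)


def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming Levenshtein distance, kept to two rows."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


def within_threshold(s: str, t: str, tau: int) -> bool:
    """Filter-and-refine check: cheap filters first, exact verification last."""
    if abs(len(s) - len(t)) > tau:          # length filter
        return False
    if frequency_lower_bound(s, t) > tau:   # frequency filter
        return False
    return edit_distance(s, t) <= tau       # refine step
```

For example, "kitten" and "sitting" have a frequency lower bound of 3, so a join with `tau = 2` rejects the pair without running the dynamic program. The bound is not tight in general: "abc" and "cba" have identical frequency vectors but an edit distance of 2, so the refine step is still required for pairs that pass the filter.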

Related Organizations

Harbin Institute of Technology
China (People's Republic of)

Impact by BIP!

	selected citations	2
	popularity	Average
	influence	Average
	impulse	Average