Probabilistic Iterative Duplicate Detection

descriptionPublicationkeyboard_double_arrow_right Part of book or chapter of book , Conference object , Article 01 Jan 2005Publisher:Springer Berlin Heidelberg

Authors: Patrick Lehti; Peter Fankhauser;

doi: 10.1007/11575801_19

Probabilistic Iterative Duplicate Detection

- Summary
- Metrics

Abstract

The problem of identifying approximately duplicate records between databases is known, among others, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. In this paper we present an unsupervised, domain independent approach that starts with a broad alignment of potential duplicates, and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach supersedes other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain dependent approaches.

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	1
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

1

Average

Upload OA version

Are you the author of this publication? Upload your Open Access version to Zenodo!

It’s fast and easy, just two clicks!

uploadUpload now