Data & Knowledge Engineering
Article . 2017 . Peer-reviewed
License: Elsevier TDM
Data sources: Crossref, DBLP

A supervised gradient-based learning algorithm for optimized entity resolution

Authors: Orion Fausto Reyes-Galaviz; Witold Pedrycz; Ziyue He; Nick J. Pizzi

Abstract

The task of probabilistic record linkage is to find and link records that refer to the same entity across several disparate data sources. The accurate linking of records (entity resolution) is an important task for the healthcare industry, government, law enforcement, and the private sector. However, finding exact matches of an entity can be challenging because real-world data sources contain records with typographical, phonetic, or other types of errors (noise). Over the years, many comparison functions have been developed to relate pairs of records and produce a similarity score. With a pair of predefined thresholds, one may decide whether record pairs match, do not match, or require further clerical review. Nevertheless, finding appropriate comparison functions, identity descriptors (fields), threshold values, and efficient classifiers remains a challenging task. In this study, we propose a supervised gradient-based learning model that can adjust its structure and parameters based on matching scores coming from many comparison functions (applied to many fields) to efficiently classify the records. The design of this structure is transparent and can potentially allow us to locate which comparison functions and fields are more significant for correctly linking the records. To train this structure, we propose a novel performance index that helps the model learn to separate matched from non-matched records. Experiments on synthetic datasets affected by different levels of noise and on real-world datasets show the effectiveness of the algorithm, which can significantly reduce the number of false positives, false negatives, and records selected for review.
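To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the authors' model. It assumes two hypothetical fields (`name`, `city`), two simple comparison functions, and a weighted aggregation trained by plain gradient descent on the logistic loss, which stands in for the paper's performance index (not specified here). The thresholds 0.3 and 0.7 and all function names are illustrative assumptions.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Tuple


def ratio_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], tolerant of small typos."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a: str, b: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


# Hypothetical choice of fields and comparison functions (assumption).
FIELDS = ["name", "city"]
COMPARATORS = [ratio_similarity, exact_match]


def score_vector(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> List[float]:
    """One similarity score per (field, comparison function) combination."""
    return [cmp(rec_a[f], rec_b[f]) for f in FIELDS for cmp in COMPARATORS]


def predict(weights: Sequence[float], bias: float, scores: Sequence[float]) -> float:
    """Weighted aggregation of the scores, squashed to a match score in (0, 1)."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


def train(vectors, labels, lr=0.5, epochs=500) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean logistic loss (a stand-in objective)."""
    n, m = len(vectors[0]), len(vectors)
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for scores, y in zip(vectors, labels):
            err = predict(weights, bias, scores) - y  # d(loss)/dz for one pair
            grad_b += err
            for i, s in enumerate(scores):
                grad_w[i] += err * s
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def classify(score: float, lower: float = 0.3, upper: float = 0.7) -> str:
    """Two-threshold decision: match, non-match, or clerical review."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"


if __name__ == "__main__":
    # Made-up labeled score vectors (1 = same entity, 0 = different entities).
    vectors = [[1.0, 1.0, 1.0, 1.0], [0.9, 0.0, 1.0, 1.0],
               [0.2, 0.0, 0.1, 0.0], [0.3, 0.0, 0.2, 0.0]]
    labels = [1, 1, 0, 0]
    weights, bias = train(vectors, labels)
    pair = score_vector({"name": "Jon Smith", "city": "Edmonton"},
                        {"name": "John Smith", "city": "Edmonton"})
    print(weights)  # larger weights hint at more influential field/function pairs
    print(classify(predict(weights, bias, pair)))
```

In a sketch like this, the learned weights give a rough view of which field and comparison-function combinations drive the decision, which loosely mirrors the transparency the abstract describes. The width of the band between the two thresholds controls how many pairs are sent to clerical review.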

  • Impact indicators by BIP!
    selected citations: 17
    popularity: Top 10%
    influence: Top 10%
    impulse: Top 10%