
handle: 10852/101604, 11250/3055140, 20.500.12876/98882
Record linkage joins records that reside in separate files and are believed to refer to the same entity. In this paper we treat record linkage as a classification problem and adapt the maximum entropy classification method from machine learning to it, in both the supervised and unsupervised settings. The set of links is chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is fully automatic, unlike the classical approach, which generally requires clerical review to resolve the undecided cases.
Maximum entropy classification for record linkage
FOS: Computer and information sciences, Design of Experiments and Sample Surveys, Density ratio, Statistical Methodology, Survey sampling, 004, Methodology (stat.ME), Probabilistic linkage, False link, Missing match, Statistics - Methodology, Probability
| citations | 0 |
| popularity | Average |
| influence | Average |
| impulse | Average |
