Machine Learning-Based Name Matching: A Logistic Regression Perspective

In this study, we investigate the use of logistic regression in developing a name matching system. The primary objective was to build a system capable of identifying potential matches between a query and the names in a given dataset. To assess the similarity between names, we used two established measures: Levenshtein distance and fuzzywuzzy similarity. We first preprocessed the dataset by calculating the Levenshtein distance and fuzzywuzzy percentage of each name relative to the query, and appended these features to the dataset as additional columns. We then applied a logistic regression model, previously trained on a labeled dataset, to predict the likelihood that each entry in the dataset matches the query. These predictions were added as a new column, and the dataset was sorted in descending order of predicted likelihood to identify the most probable name matches. The resulting system provides a scalable and efficient approach that lets users enter a query and obtain a ranked list of potential name matches. Its accuracy and efficacy can be assessed further by comparing the predicted matches with known ground truth data. Our results demonstrate the effectiveness of the system in identifying potential matches based on the computed features and the trained logistic regression model. The system is applicable to tasks such as data integration, record linkage, and identity verification.
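The pipeline described in the abstract (compute similarity features against the query, score each name with a logistic model, sort by score) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the coefficients in `WEIGHTS` are hypothetical placeholders rather than values fitted on a labeled dataset, the function names are invented for this sketch, and the standard library's `difflib.SequenceMatcher` is swapped in as a stand-in for the fuzzywuzzy percentage so that the example has no third-party dependencies.

```python
import math
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity_percent(a: str, b: str) -> float:
    """Similarity in [0, 100]; a stand-in for a fuzzywuzzy-style percentage."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical coefficients for illustration only. In practice they would
# come from a logistic regression model fitted on labeled name pairs.
WEIGHTS = {"bias": -4.0, "lev": -0.5, "sim": 0.08}


def match_probability(query: str, name: str) -> float:
    """Logistic score computed from the two similarity features."""
    lev = levenshtein(query.lower(), name.lower())
    sim = similarity_percent(query, name)
    z = WEIGHTS["bias"] + WEIGHTS["lev"] * lev + WEIGHTS["sim"] * sim
    return 1.0 / (1.0 + math.exp(-z))


def rank_matches(query: str, names: list[str]) -> list[tuple[str, float]]:
    """Return (name, probability) pairs sorted from most to least likely."""
    scored = [(name, match_probability(query, name)) for name in names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, prob in rank_matches("Jon Smith", ["John Smith", "Jane Doe", "Jon Smith"]):
        print(f"{name}: {prob:.3f}")
```

With a real fitted model, only `WEIGHTS` (or the whole `match_probability` function) would need to change; the feature computation and the descending sort stay the same.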

Related Organizations

University of South Africa
South Africa

Keywords

Identity resolution, Record linkage, Match probability, Feature engineering, Machine learning, Levenshtein distance, Logistic regression, Data integration, Name matching, Preprocessing, Similarity measures
