Downloads provided by UsageCounts
In this study, we conducted experiments to investigate the use of logistic regression in developing a name matching system. The primary objective was to create a system capable of identifying potential matches between a query and the names in a given dataset. To assess the similarity between names, we employed two established techniques: Levenshtein distance and fuzzywuzzy similarity scores. We first preprocessed the dataset by calculating the Levenshtein distance and the fuzzywuzzy percentage between each name and the query, and appended these features to the dataset as additional columns. We then applied a logistic regression model, previously trained on a labeled dataset, to predict the likelihood that each entry in the dataset is a match for the query. These predictions were added to the dataset as a new column. Finally, we sorted the dataset in descending order of prediction value to identify the most probable name matches. The resulting system offers a scalable and efficient approach that lets users input a query and obtain a ranked list of potential name matches. Its accuracy and efficacy can be assessed further by comparing the predicted matches with known ground truth data. The results of our study demonstrate the effectiveness of the system in identifying potential matches based on the computed features and the trained logistic regression model. The system holds significant value in applications including data integration, record linkage, and identity verification.
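The pipeline described in the abstract (compute similarity features against the query, score each name with a logistic model, sort by score) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the standard-library `difflib.SequenceMatcher` stands in for the fuzzywuzzy percentage, the coefficients `INTERCEPT`, `W_DISTANCE`, and `W_SIMILARITY` are hypothetical hand-picked values rather than weights fitted on labeled data, and the function names and sample names are invented for the example.

```python
import math
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Similarity on a 0-100 scale; difflib is a stand-in for fuzzywuzzy."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Hypothetical coefficients, for illustration only. A real system would
# load weights fitted on labeled match / non-match pairs.
INTERCEPT, W_DISTANCE, W_SIMILARITY = -4.0, -0.5, 0.08


def match_probability(distance: int, similarity: float) -> float:
    """Logistic regression prediction: sigmoid of a linear score."""
    z = INTERCEPT + W_DISTANCE * distance + W_SIMILARITY * similarity
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative scores
    return e / (1.0 + e)


def rank_matches(query: str, names: list[str]) -> list[dict]:
    """Attach feature and prediction columns, then sort by prediction."""
    q = query.lower()
    rows = []
    for name in names:
        candidate = name.lower()
        distance = levenshtein(q, candidate)
        similarity = similarity_percent(q, candidate)
        rows.append({
            "name": name,
            "levenshtein": distance,
            "similarity": similarity,
            "prediction": match_probability(distance, similarity),
        })
    return sorted(rows, key=lambda row: row["prediction"], reverse=True)


if __name__ == "__main__":
    candidates = ["John Smith", "Jane Smyth", "Robert Brown", "Jon Smith"]
    for row in rank_matches("Jon Smith", candidates):
        print(f'{row["name"]:<14} {row["prediction"]:.3f}')
```

Each row here plays the role of a dataset entry with its appended feature and prediction columns. With a fitted model, only `match_probability` would change; the feature computation and the descending sort would stay the same.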
Identity resolution, Record linkage, Match probability, Feature engineering, Machine learning, Levenshtein distance, Logistic regression, Data integration, Name matching, Preprocessing, Similarity measures
| Indicator | Description | Value |
| --- | --- | --- |
| selected citations | These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 |
| popularity | This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| influence | This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| impulse | This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |

| Metric | Count |
| --- | --- |
| views | 43 |
| downloads | 39 |

Views provided by UsageCounts