Multiple Entity Reconciliation
Samoila, Lavinia Andreea
- Publisher: KTH, Skolan för informations- och kommunikationsteknik (ICT)
entity matching | data linkage | data quality | machine learning | text processing | Computer and Information Sciences | Data- och informationsvetenskap
Living in the age of "Big Data" is both a blessing and a curse. On the one hand, the raw data can be analysed and then used for weather predictions, user recommendations, targeted advertising and more. On the other hand, when data is aggregated from multiple sources, there is no guarantee that each source has stored the data in a format that is standardized, or even compatible with what the application requires. The available data therefore has to be parsed and converted to the desired form. This is where the problems start to arise: the correspondences between data instances that belong to the same domain but come from different sources are often far from straightforward. For example, in the film industry, information about movies (cast, characters, ratings etc.) can be found on numerous websites such as IMDb or Rotten Tomatoes. Finding and matching all the data referring to the same movie is a challenge. The aim of this project is to select the most efficient algorithm for automatically correlating movie-related information gathered from various websites. We have implemented a flexible application that allows us to compare the performance of multiple algorithms based on machine learning techniques. According to our experimental results, a well-chosen set of rules performs on par with a neural network, and these two proved to be the most effective classifiers for records containing movie information.
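To make the idea of a rule-based record classifier concrete, the following is a minimal sketch in Python. It is an illustration only: the record fields (`title`, `year`), the two rules and the thresholds are assumptions made for this example, not the rule set or the implementation evaluated in the thesis.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase the title, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(
        ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in title
    )
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def same_movie(rec_a, rec_b, title_threshold=0.85, max_year_gap=1):
    """Hypothetical rule set: two records match only if every rule passes.

    Rule 1: the titles are nearly identical after normalization.
    Rule 2: the release years differ by at most max_year_gap.
    Both thresholds are illustrative assumptions.
    """
    if title_similarity(rec_a["title"], rec_b["title"]) < title_threshold:
        return False
    if abs(rec_a["year"] - rec_b["year"]) > max_year_gap:
        return False
    return True


# Hypothetical records, as two different websites might report them.
site_a = {"title": "The Matrix", "year": 1999}
site_b = {"title": "the matrix!", "year": 1999}
site_c = {"title": "The Matrix Reloaded", "year": 2003}

print(same_movie(site_a, site_b))
print(same_movie(site_a, site_c))
```

In this sketch each rule acts as a veto, so a pair of records is classified as the same movie only when all rules agree. A learned classifier such as a neural network would instead combine similarity features like these with weights fitted to labelled record pairs.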