publication . Preprint . 2019

A survey of OpenRefine reconciliation services

Delpeuch, Antonin
Open Access English
  • Published: 19 Jun 2019
Abstract
We review the services implementing the OpenRefine reconciliation API, comparing their design to the state of the art in record linkage. Due to the design of the API, the matching scores returned by the services are of little help in guiding matching decisions. This suggests possible improvements to the API specification, which could improve user workflows by giving the client more control over the scoring mechanism.
Subjects
free text keywords: Computer Science - Information Retrieval, Computer Science - Databases
Funded by
EC| TheyBuyForYou
Project
TheyBuyForYou
Enabling procurement data value chains for economic development, demand management, competitive markets and vendor intelligence
  • Funder: European Commission (EC)
  • Project Code: 780247
  • Funding stream: H2020 | IA
