ZENODO
Other literature type . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Presentation . 2025
License: CC BY
Data sources: Datacite

Combining Algorithms and Human Expertise: OpenAIRE's Entity Disambiguation Method

Authors: Baglioni, Miriam; Atzori, Claudio; Končić, Ivana; Amodeo, Stefania; Bardi, Alessia; Pavone, Gina; Macan, Bojan; +1 Authors

Abstract

Lightning Talk at the International Digital Curation Conference 2025. The presentation examines OpenAIRE's solution to the “entity disambiguation” problem, presenting a hybrid data curation method that combines deduplication algorithms with the expertise of human curators to ensure high-quality, interoperable scholarly information. Entity disambiguation is central to building a robust and interconnected open scholarly communication system. It involves accurately identifying and differentiating entities such as authors, organisations, data sources and research results across various entity providers. This task is particularly complex in contexts like the OpenAIRE Graph, where metadata is collected from over 100,000 data sources. Metadata describing the same entity can be collected multiple times, with each record potentially providing different information, such as different Persistent Identifiers (PIDs) or names, for that entity. This heterogeneity poses several challenges to the disambiguation process. For example, the same organisation may be referenced under different names, in different languages, or by abbreviations. In some cases, even PIDs might not be effective, as different data providers may assign different identifiers to the same entity. Accurate entity disambiguation is therefore essential for ensuring data quality, improving search and discovery, facilitating knowledge graph construction, and supporting reliable research impact assessment. To address this challenge, OpenAIRE employs a deduplication algorithm, configured to handle different entity types, that identifies and merges duplicate entities. While the algorithm proves effective for research results, it needs to be complemented with human curation and validation when applied to organisations and data sources, since additional information may be needed.
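
As an illustration of the identify-and-merge step described above, here is a minimal Python sketch. It is not OpenAIRE's deduplication algorithm: the record layout, the `normalise_name` and `group_duplicates` helpers, and the sample identifiers are all hypothetical, and the matching rule (two records belong together if they share a PID or a normalised name) is a deliberately simple stand-in chosen for the example.

```python
import unicodedata
from collections import defaultdict


def normalise_name(name):
    """Lowercase a name, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def group_duplicates(records):
    """Group record ids that share a PID or a normalised name (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # matching key -> index of the first record carrying it
    for idx, record in enumerate(records):
        keys = [("pid", pid) for pid in record.get("pids", [])]
        keys.append(("name", normalise_name(record["name"])))
        for key in keys:
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, record in enumerate(records):
        groups[find(idx)].append(record["id"])
    return sorted(groups.values())


# Hypothetical sample records; the PIDs are made-up placeholders.
records = [
    {"id": "a", "name": "Università di Pisa", "pids": ["pid:example-1"]},
    {"id": "b", "name": "University of Pisa", "pids": ["pid:example-1", "pid:example-2"]},
    {"id": "c", "name": "UNIVERSITA DI PISA", "pids": []},
    {"id": "d", "name": "Another Institute", "pids": ["pid:example-3"]},
]
groups = group_duplicates(records)
```

In this sample, records `a` and `b` are linked by a shared identifier and `a` and `c` by a shared normalised name, so all three end up in one group. A rule this simple would miss translated names that share no PID and could wrongly merge distinct entities with identical names, which is the kind of case the abstract assigns to human curators.
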
OpenAIRE's data source disambiguation relies primarily on the OpenAIRE technical team, which oversees the deduplication process and ensures accurate matches across the DRIS, FAIRsharing, re3data, and OpenDOAR registries. While the algorithm automates much of the process, human experts verify matches, address discrepancies and actively search for matches not proposed by the algorithm. External stakeholders, such as data source managers, can also contribute by submitting suggestions through a dedicated ticketing system. So far, OpenAIRE has curated almost 3,935 groups comprising a total of 8,140 data sources. To address organisational disambiguation, OpenAIRE developed OpenOrgs, a hybrid system combining automated processes and human expertise. The tool works on organisational data aggregated by the OpenAIRE infrastructure from multiple sources (ROR registry, funder databases, CRIS systems, and others), automatically compares metadata, and suggests potential merged entities to human curators. These curators, authorised experts in their respective research landscapes, validate merged entities, identify additional duplicates, and enrich organisational records with missing information such as PIDs, alternative names, and hierarchical relationships. With over 100 curators from 40 countries, OpenOrgs has curated more than 100,000 organisations to date. A dataset containing all the OpenOrgs organisations can be found on Zenodo (https://doi.org/10.5281/zenodo.13271358). This presentation demonstrates how OpenAIRE's entity disambiguation techniques and OpenOrgs aim to be game-changers for the research community by building and maintaining an integrated open scholarly communication system in the years to come.
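
The suggest-then-validate workflow described above can be sketched in Python as follows. This is an assumption-laden illustration, not the OpenOrgs implementation: the function names, the record layout, the 0.85 threshold and the use of the standard library's `difflib.SequenceMatcher` for name similarity are all choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def suggest_candidates(orgs, threshold=0.85):
    """Return (id, id, score) for pairs whose names are similar enough to review."""
    suggestions = []
    for left, right in combinations(orgs, 2):
        score = SequenceMatcher(
            None, left["name"].lower(), right["name"].lower()
        ).ratio()
        if score >= threshold:
            suggestions.append((left["id"], right["id"], score))
    return suggestions


def apply_decisions(suggestions, decisions):
    """Split suggestions into curator-approved merges and pairs awaiting review.

    `decisions` maps an (id, id) pair to "approve" or "reject"; rejected pairs
    are dropped, and pairs with no decision yet stay pending.
    """
    approved, pending = [], []
    for left_id, right_id, _score in suggestions:
        verdict = decisions.get((left_id, right_id))
        if verdict == "approve":
            approved.append((left_id, right_id))
        elif verdict is None:
            pending.append((left_id, right_id))
    return approved, pending


# Hypothetical organisation records used only for this example.
orgs = [
    {"id": "o1", "name": "Example Research Institute"},
    {"id": "o2", "name": "Example Research Institut"},
    {"id": "o3", "name": "Harbour Polytechnic"},
]
suggestions = suggest_candidates(orgs)
approved, pending = apply_decisions(suggestions, {("o1", "o2"): "approve"})
```

The point of the design is that the automated step only proposes pairs and nothing is merged until a curator approves it. Comparing every pair is quadratic in the number of records, so a sketch like this would need a blocking or indexing step before it could be applied at the scale mentioned in the abstract.
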

Keywords

IDCC25, Lightning Talks, Curation infrastructure: Innovative research and exploration into new curation methods, Curation infrastructure: Proposals for new approaches to large-scale service delivery, Entity Disambiguation, Hybrid data curation, Open Scholarly Communication

  • BIP!
    Impact by BIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average