ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO
Data sources: Datacite

Datasets for OntoClue Project

Authors: Ravinder, Rohitha; Geist, Lukas; Rebholz-Schuhmann, Dietrich; Castro, Leyla Jael

Abstract

Description

This release contains the datasets and files associated with the OntoClue project, which investigates various text embedding techniques for assessing document-to-document similarity in biomedical literature. The project primarily uses the RELISH Corpus [1], a comprehensive expert-curated dataset that includes relevance annotations for document pairs based on their similarity. This release includes datasets for establishing ground truth, as well as the retrieved titles and abstracts for all PMIDs in the RELISH database. The files also contain preprocessed tokens for use in text embedding neural network models, as well as tokens annotated with the MeSH (Medical Subject Headings) [2] vocabulary.

Data Structure and Files

  • missing_pmids.tsv: List of PMIDs for which titles and abstracts could not be retrieved.
  • relevance_matrix.tsv: Ground truth dataset file derived from the RELISH JSON file, containing 189,634 document pairs with three columns: PMID1 (reference article), PMID2 (assessed article), and relevance (relevance score between the two documents). It consists of 68,479 completely relevant pairs, 65,406 partially relevant pairs, and 55,749 irrelevant pairs.
  • relish_documents.tsv: Retrieved RELISH documents, including PMID, title, and abstract (163,189 articles).
  • relish_bert_input_text.zip: Preprocessed titles and abstracts for use with BERT-based models.
  • relish_preprocessed_normal_tokens.zip: Document text preprocessed for use with all embedding approaches.
  • relish_normal_split_datasets.zip: Preprocessed document text split into training, validation, and test datasets.
  • relish_xml_files.zip: RELISH articles retrieved as XML files.
  • relish_annotated_xml_files.zip: Annotated XML files of RELISH articles (163,189 articles).
  • relish_preprocessed_annotated_tokens.zip: Document text preprocessed for use with all embedding approaches, with annotations.
  • relish_annotated_split_datasets.zip: Preprocessed and annotated document text split into training, validation, and test datasets.
  • relish_ground_truth_split_datasets.zip: Ground truth dataset split into training, validation, and test datasets.

Data Collection

The RELISH dataset v1 was downloaded from the corresponding FigShare record [3] on January 24th, 2022. The dataset, in JSON format, contains PubMed IDs (PMIDs) along with relevance assessments for document pairs. Using the BioC API, we retrieved XML files containing the PMID, title, and abstract for each unique entry in the RELISH JSON file. Any PMIDs that could not be retrieved, or that lacked a title and abstract, were recorded as missing. In total, 163,189 XML files were successfully retrieved. These XML files were also converted into a TSV file with three columns: PMID, title, and abstract. The text from the titles and abstracts was further preprocessed for use in the various embedding approaches.

References

[1] Brown, Peter; RELISH Consortium; Zhou, Yaoqi (2019). Large expert-curated database for benchmarking document similarity detection in biomedical literature search. Database, Volume 2019, baz085. https://doi.org/10.1093/database/baz085
[2] Lipscomb, C. E. (2000). Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3), 265–266.
[3] Brown, Peter (2019). RELISH_v1. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7722905.v1
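
As a quick sanity check on the ground truth file, the pairs in relevance_matrix.tsv can be tallied per relevance value. The following is a minimal sketch, not part of the OntoClue code base: it assumes the file has a header row naming the three columns PMID1, PMID2, and relevance, and it makes no assumption about how relevance is encoded (the values are treated as opaque strings). The function name count_relevance and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Iterable


def count_relevance(lines: Iterable[str]) -> Counter:
    """Count document pairs per relevance value in a tab-separated file.

    Assumes a header row with a column named "relevance" (an assumption
    about the file layout, not confirmed by the dataset description).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts: Counter = Counter()
    for row in reader:
        counts[row["relevance"]] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows for illustration only; real PMIDs and scores differ.
    sample = io.StringIO(
        "PMID1\tPMID2\trelevance\n"
        "111\t222\t2\n"
        "111\t333\t0\n"
        "444\t555\t2\n"
    )
    for value, n_pairs in sorted(count_relevance(sample).items()):
        print(value, n_pairs)
```

To run it on the real file, pass an open file handle, for example open("relevance_matrix.tsv", newline="", encoding="utf-8"). If the layout assumption holds, the per-value counts should sum to the 189,634 pairs stated in the description.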

Keywords

MeSH, pubmed articles, recommendation systems, document-to-document similarity, semantics
