ZENODO
Dataset . 2024
License: CC BY
Data sources: Datacite

AI4PROFHEALTH - Automatic Silver Gazetteer for Named Entity Recognition and Normalization

Authors: Becerra-Tomé, Alberto; Rodríguez Miret, Jan; Rodríguez Ortega, Miguel; Marsol Torrent, Sergi; Lima-López, Salvador; Farré-Maduell, Eulàlia; Krallinger, Martin

Abstract

This dataset comprises a professions gazetteer generated from terminology automatically extracted from the MESINESP2 corpus, a manually annotated corpus in which domain experts have labeled a set of scientific literature, clinical trials, and patent abstracts, as well as clinical case reports. The silver gazetteer for mention classification and normalization is created by combining the predictions of automatic Named Entity Recognition models with normalization via Entity Linking to three controlled vocabularies: SNOMED CT, NCBI, and ESCO.

The sources are 265,025 distinct documents, of which 249,538 come from the MESINESP2 corpora and 15,487 are clinical cases from open clinical journals. From these, 5,682,000 mentions are extracted, and 4,909,966 (86.42%) are normalized to at least one of the ontologies:

  • SNOMED CT (4,909,966) for diseases, symptoms, drugs, locations, occupations, procedures, and species
  • ESCO (215,140) for occupations
  • NCBI (1,469,256) for species

The repository contains a .tsv file with the following columns:

  • filenameid: A unique identifier combining the file name and the mention span within the text, so that each extracted mention is uniquely traceable. Example: biblio-1000005#239#256 refers to a mention spanning characters 239–256 in the file named biblio-1000005.
  • span: The text span (mention) extracted from the document, representing a term or phrase identified in the dataset. Example: centro oncológico.
  • source: The corpus from which the mention was extracted. Possible values: mesinesp2, clinical_cases.
  • filename: The name of the file from which the mention was extracted. Example: biblio-1000005.
  • mention_class: Categories or semantic tags assigned to the mention, describing its type or context in the text. Example: ['ENFERMEDAD', 'SINTOMA'].
  • codes_esco: The normalized codes from the European Skills, Competences, Qualifications, and Occupations (ESCO) vocabulary for the mention (if applicable). This field may be empty if no ESCO mapping exists. Example: 30629002.
  • terms_esco: The human-readable ESCO terms corresponding to codes_esco. Example: ['responsable de recursos', 'director de recursos', 'directora de recursos'].
  • codes_ncbi: The normalized codes from the NCBI Taxonomy vocabulary for species (if applicable). This field may be empty if no NCBI mapping exists.
  • terms_ncbi: The human-readable NCBI Taxonomy terms corresponding to codes_ncbi. Example: ['Lacandoniaceae', 'Pandanaceae R.Br., 1810', 'Pandanaceae', 'Familia'].
  • codes_sct: The normalized codes from the SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) vocabulary for diseases, symptoms, drugs, locations, occupations, procedures, and species (if applicable). Example: 22232009.
  • terms_sct: The human-readable SNOMED CT terms corresponding to codes_sct. Example: ['adjudicador de regulaciones del seguro nacional'].
  • sct_sem_tag: The semantic category tag assigned by SNOMED CT to describe the general classification of the mention. Example: environment.

Suggestion: If you load the dataset using Python, it is recommended to read the columns containing lists as follows:

    import ast
    df["mention_class"] = df["mention_class"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

License

This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). This means you are free to:

  • Share: Copy and redistribute the material in any medium or format.
  • Adapt: Remix, transform, and build upon the material for any purpose, even commercially.

Attribution Requirement: Please credit the dataset creators appropriately, provide a link to the license, and indicate if changes were made.
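
As a minimal sketch of the loading suggestion above, the following standard-library Python reads a tab-separated file with the described columns, parses list-valued cells with ast.literal_eval, and splits filenameid into the file name and character offsets. The helper names (parse_list_cell, split_filenameid, read_gazetteer), the choice of LIST_COLUMNS, and the sample row are assumptions made for illustration. The sample row is assembled from the per-column examples in the description and is not a real record from the dataset.

```python
import ast
import csv
import io

# Assumption: these are the columns whose examples are shown as Python-style lists.
LIST_COLUMNS = ("mention_class", "terms_esco", "terms_ncbi", "terms_sct")


def parse_list_cell(value):
    """Turn a cell such as "['A', 'B']" into a list; leave other values untouched."""
    if not isinstance(value, str):
        return value  # e.g. NaN from pandas or None from csv
    text = value.strip()
    if not text:
        return []  # empty cell: no mapping for this mention
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return value  # not a Python literal, keep the raw string


def split_filenameid(filenameid):
    """Split 'name#start#end' into (name, start, end) with integer offsets."""
    # rsplit keeps any '#' that might occur inside the file name itself.
    filename, start, end = filenameid.rsplit("#", 2)
    return filename, int(start), int(end)


def read_gazetteer(handle):
    """Read rows from an open TSV handle and decode list cells and offsets."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in LIST_COLUMNS:
            if column in row:
                row[column] = parse_list_cell(row[column])
        _, row["start"], row["end"] = split_filenameid(row["filenameid"])
        rows.append(row)
    return rows


# Synthetic one-row sample (hypothetical, built from the documented examples).
SAMPLE = (
    "filenameid\tspan\tsource\tfilename\tmention_class\tterms_esco\n"
    "biblio-1000005#239#256\tcentro oncológico\tmesinesp2\tbiblio-1000005\t"
    "['ENFERMEDAD', 'SINTOMA']\t\n"
)

rows = read_gazetteer(io.StringIO(SAMPLE))
print(rows[0]["mention_class"], rows[0]["start"], rows[0]["end"])
```

With pandas, the same helper could be applied column by column, for example df[column].apply(parse_list_cell) for each name in LIST_COLUMNS. Whether the codes_* columns are also stored as lists is not stated in the description, so they are left as raw strings here.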

Contact

If you have any questions or suggestions, please contact us at: Martin Krallinger ()

Additional resources and corpora

If you are interested, you might want to check out these corpora and resources:

  • MESINESP-2 (Corpus of manually indexed records with DeCS/MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts, different document collection)
  • MEDDOPROF corpus
  • Codes Reference List (for MEDDOPROF-NORM)
  • Annotation Guidelines
  • Occupations Gazetteer

This resource has been funded by the Spanish National Proyectos I+D+i 2020 AI4ProfHealth project PID2020-119266RA-I00 (PID2020-119266RA-I0/AEI/10.13039/501100011033).

Keywords

normalization, entity grounding, ner, employment status, corpus, clinical nlp, entity linking, nlp, gazetteer, bionlp
