
AI4PROFHEALTH - Automatic Silver Gazetteer for Named Entity Recognition and Normalization

Keywords: normalization, entity grounding, ner, employment status, corpus, clinical nlp, entity linking, nlp, gazetteer, bionlp

Research data > Dataset | 23 Nov 2024 | Publisher: Zenodo

Authors: Becerra-Tomé, Alberto; Rodríguez Miret, Jan; Rodríguez Ortega, Miguel; Marsol Torrent, Sergi; Lima-López, Salvador; Farré-Maduell, Eulàlia; Krallinger, Martin;

doi: 10.5281/zenodo.14210425, 10.5281/zenodo.14210424


Abstract

This dataset comprises a professions gazetteer generated with automatically extracted terminology from the Mesinesp2 corpus, a manually annotated corpus in which domain experts have labeled a set of scientific literature, clinical trials, and patent abstracts, as well as clinical case reports. A silver gazetteer for mention classification and normalization is created by combining the predictions of automatic Named Entity Recognition models with normalization using Entity Linking to three controlled vocabularies: SNOMED CT, NCBI and ESCO.

The sources are 265,025 different documents, of which 249,538 correspond to the MESINESP2 corpora and 15,487 to clinical cases from open clinical journals. From them, 5,682,000 mentions are extracted and 4,909,966 (86.42%) are normalized to any of the ontologies: SNOMED CT (4,909,966) for diseases, symptoms, drugs, locations, occupations, procedures and species; ESCO (215,140) for occupations; and NCBI (1,469,256) for species.

The repository contains a .tsv file with the following columns:

- filenameid: A unique identifier combining the file name and mention span within the text. This ensures each extracted mention is uniquely traceable. Example: biblio-1000005#239#256 refers to a mention spanning characters 239–256 in the file with the name biblio-1000005.
- span: The specific text span (mention) extracted from the document, representing a term or phrase identified in the dataset. Example: centro oncológico.
- source: The origin of the document, indicating the corpus from which the mention was extracted. Possible values: mesinesp2, clinical_cases.
- filename: The name of the file from which the mention was extracted. Example: biblio-1000005.
- mention_class: Categories or semantic tags assigned to the mention, describing its type or context in the text. Example: ['ENFERMEDAD', 'SINTOMA'].
- codes_esco: The normalized ontology codes from the European Skills, Competences, Qualifications, and Occupations (ESCO) vocabulary for the identified mention (if applicable). This field may be empty if no ESCO mapping exists. Example: 30629002.
- terms_esco: The human-readable terms from the ESCO ontology corresponding to the codes_esco. Example: ['responsable de recursos', 'director de recursos', 'directora de recursos'].
- codes_ncbi: The normalized ontology codes from the NCBI Taxonomy vocabulary for species (if applicable). This field may be empty if no NCBI mapping exists.
- terms_ncbi: The human-readable terms from the NCBI Taxonomy vocabulary corresponding to the codes_ncbi. Example: ['Lacandoniaceae', 'Pandanaceae R.Br., 1810', 'Pandanaceae', 'Familia'].
- codes_sct: The normalized ontology codes from the SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) vocabulary for diseases, symptoms, drugs, locations, occupations, procedures, and species (if applicable). Example: 22232009.
- terms_sct: The human-readable terms from the SNOMED CT ontology corresponding to the codes_sct. Example: ['adjudicador de regulaciones del seguro nacional'].
- sct_sem_tag: The semantic category tag assigned by SNOMED CT to describe the general classification of the mention. Example: environment.

Suggestion

If you load the dataset using Python, it is recommended to read the columns containing lists as follows:

    import ast
    df["mention_class"] = df["mention_class"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

License

This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). This means you are free to:

- Share: Copy and redistribute the material in any medium or format.
- Adapt: Remix, transform, and build upon the material for any purpose, even commercially.

Attribution Requirement: Please credit the dataset creators appropriately, provide a link to the license, and indicate if changes were made.
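Returning to the column descriptions and the loading suggestion above, the following is a minimal, dependency-free sketch of how the .tsv file might be read with the Python standard library instead of a DataFrame. The helper names (`parse_list_cell`, `split_filenameid`, `read_gazetteer`), the choice of which columns to treat as lists, and the added `file`, `start` and `end` keys are assumptions made for illustration and are not part of the dataset. The sample row is assembled from the per-column examples in the description and is not a real record.

```python
import ast
import csv
import io

# Assumption: these are the columns whose cells hold Python-style list
# literals, judging from the examples in the column descriptions.
LIST_COLUMNS = ("mention_class", "terms_esco", "terms_ncbi", "terms_sct")


def parse_list_cell(value):
    """Turn a cell like "['A', 'B']" into a list; leave other text untouched."""
    if not value:
        return []  # empty or missing cell: no mapping for this mention
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # not a Python literal, keep the raw string
    return parsed if isinstance(parsed, list) else value


def split_filenameid(filenameid):
    """Split 'name#start#end' into (name, start, end) with integer offsets."""
    name, start, end = filenameid.rsplit("#", 2)
    return name, int(start), int(end)


def read_gazetteer(handle):
    """Read tab-separated rows from an open text handle into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in LIST_COLUMNS:
            if column in row:
                row[column] = parse_list_cell(row[column])
        # Hypothetical convenience keys derived from filenameid.
        row["file"], row["start"], row["end"] = split_filenameid(row["filenameid"])
        rows.append(row)
    return rows


# Illustrative row built from the examples in the column descriptions.
SAMPLE = (
    "filenameid\tspan\tsource\tfilename\tmention_class\n"
    "biblio-1000005#239#256\tcentro oncológico\tmesinesp2\t"
    "biblio-1000005\t['ENFERMEDAD', 'SINTOMA']\n"
)

rows = read_gazetteer(io.StringIO(SAMPLE))
print(rows[0]["mention_class"], rows[0]["start"], rows[0]["end"])
```

To read the real file, pass an open handle, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded .tsv was saved. `ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which matches the recommendation in the dataset description.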
Contact

If you have any questions or suggestions, please contact Martin Krallinger.

Additional resources and corpora

If you are interested, you might want to check out these corpora and resources:

- MESINESP-2 (Corpus of manually indexed records with DeCS/MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts, different document collection)
- MEDDOPROF corpus
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
- Occupations Gazetteer

This resource has been funded by the Spanish National Proyectos I+D+i 2020 AI4ProfHealth project PID2020-119266RA-I00 (PID2020-119266RA-I0/AEI/10.13039/501100011033).

Related Organizations

Barcelona Supercomputing Center
Spain


Impact by BIP!

- citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
