<script type="text/javascript">
<!--
document.write('<div id="oa_widget"></div>');
document.write('<script type="text/javascript" src="https://www.openaire.eu/index.php?option=com_openaire&view=widget&format=raw&projectId=undefined&type=result"></script>');
-->
</script>
This dataset comprises a professions gazetteer generated with automatically extracted terminology from the Mesinesp2 corpus, a manually annotated corpus in which domain experts have labeled a set of scientific literature, clinical trials, and patent abstracts, as well as clinical case reports. A silver gazetteer for mention classification and normalization is created combining the predictions of automatic Named Entity Recognition models and normalization using Entity Linking to three controlled vocabularies SNOMED CT, NCBI and ESCO. The sources are 265,025 different documents, where 249,538 correspond to MESINESP2 Corpora and 15,487 to clinical cases from open clinical journals. From them, 5,682,000 mentions are extracted and 4,909,966 (86.42%) are normalized to any of the ontologies: SNOMED CT (4,909,966) for diseases, symptoms, drugs, locations, occupations, procedures and species; ESCO (215,140) for occupations; and NCBI (1,469,256) for species. The repository contains a .tsv file with the following columns: filenameid: A unique identifier combining the file name and mention span within the text. This ensures each extracted mention is uniquely traceable. Example: biblio-1000005#239#256 refers to a mention spanning characters 239–256 in the file with the name biblio-1000005. span: The specific text span (mention) extracted from the document, representing a term or phrase identified in the dataset. Example: centro oncológico. source: The origin of the document, indicating the corpus from which the mention was extracted. Possible values: mesinesp2, clinical_cases. filename: The name of the file from which the mention was extracted. Example: biblio-1000005. mention_class: Categories or semantic tags assigned to the mention, describing its type or context in the text. Example: ['ENFERMEDAD', 'SINTOMA']. codes_esco: The normalized ontology codes from the European Skills, Competences, Qualifications, and Occupations (ESCO) vocabulary for the identified mention (if applicable). This field may be empty if no ESCO mapping exists. Example: 30629002. terms_esco: The human-readable terms from the ESCO ontology corresponding to the codes_esco. Example: ['responsable de recursos', 'director de recursos', 'directora de recursos']. codes_ncbi: The normalized ontology codes from the NCBI Taxonomy vocabulary for species (if applicable). This field may be empty if no NCBI mapping exists. terms_ncbi: The human-readable terms from the NCBI Taxonomy vocabulary corresponding to the codes_ncbi. Example: ['Lacandoniaceae', 'Pandanaceae R.Br., 1810', 'Pandanaceae', 'Familia']. codes_sct: The normalized ontology codes from SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) vocabulary for diseases, symptoms, drugs, locations, occupations, procedures, and species (if applicable). Example: 22232009. terms_sct: The human-readable terms from the SNOMED CT ontology corresponding to the codes_sct. Example: ['adjudicador de regulaciones del seguro nacional']. sct_sem_tag: The semantic category tag assigned by SNOMED CT to describe the general classification of the mention. Example: environment. Suggestion: If you load the dataset using python, it is recommended to read the columns containing lists as follows import ast df["mention_class"] = df["mention_class"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x) License This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). This means you are free to: Share: Copy and redistribute the material in any medium or format. Adapt: Remix, transform, and build upon the material for any purpose, even commercially. Attribution Requirement: Please credit the dataset creators appropriately, provide a link to the license, and indicate if changes were made. Contact If you have any questions or suggestions, please contact us at: Martin Krallinger () Additional resources and corpora If you are interested, you might want to check out these corpora and resources: MESINESP-2 (Corpus of manually indexed records with DeCS /MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts, different document collection) MEDDOPROF corpus Codes Reference List (for MEDDOPROF-NORM) Annotation Guidelines Occupations Gazetteer
This resource been funded by the Spanish National Proyectos I+D+i 2020 AI4ProfHealth project PID2020-119266RA-I00 (PID2020-119266RA-I0/AEI/10.13039/501100011033).
normalization, entity grounding, ner, employment status, corpus, clinical nlp, entity linking, nlp, gazetteer, bionlp
normalization, entity grounding, ner, employment status, corpus, clinical nlp, entity linking, nlp, gazetteer, bionlp
citations This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |