
CodiEsp corpus: Spanish clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020


Research data > Dataset, 23 Jan 2020, Spanish, Publisher: Zenodo

Authors: Antonio Miranda; Aitor Gonzalez-Agirre; Martin Krallinger

doi: 10.5281/zenodo.3693570, 10.5281/zenodo.3633048, 10.5281/zenodo.3758054

Abstract

Introduction

These are the train, development, test and background sets of the CodiEsp corpus. The train and development sets have gold standard annotations. The unannotated background and test sets are distributed together. All documents are released in the context of the CodiEsp track for CLEF eHealth 2020 (http://temu.bsc.es/codiesp/).

The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish, and CIE10 is the coding terminology (it is the Spanish version of ICD10-CM and ICD10-PCS).

The CodiEsp corpus has been randomly sampled into three subsets: the train set with 500 clinical cases, and the development and test sets with 250 clinical cases each. The test set is released together with the background set (2751 clinical cases). CodiEsp participants must submit predictions for both the test and background sets, but they are evaluated only on the test set.

Zip structure

Three folders: train, dev and test. They contain the files for the train, development and test corpora, respectively.

The train and dev folders have:

- 3 tab-separated files with the annotation information relevant for each of the 3 sub-tracks of CodiEsp.
- A subfolder named text_files with the plain text files of the clinical cases.
- A subfolder named text_files_en with the plain text files machine-translated to English. Due to the translation process, these text files are split into sentences.

The test folder has only the text_files and text_files_en subfolders with the plain text files.

Corpus format description

The CodiEsp corpus is distributed in plain text in UTF-8 encoding, where each clinical case is stored as a single file whose name is the clinical case identifier. Annotations are released in tab-separated files. Since the CodiEsp track has 3 sub-tracks, every annotated set of documents (train and development) has 3 tab-separated files associated with it.

For the sub-tracks CodiEsp-D and CodiEsp-P, the file has the following fields:

articleID ICD10-code

Tab-separated files for the sub-track CodiEsp-X contain extra fields that provide the text-reference and its position:

articleID label ICD10-code text-reference reference-position

Corpus summary statistics

The final collection of 1000 clinical cases that make up the corpus had a total of 16,504 sentences, with an average of 16.5 sentences per clinical case. It contains a total of 396,988 words, with an average of 396.2 words per clinical case.

For more information, visit the track webpage: http://temu.bsc.es/codiesp/
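The corpus format described above could be read with a few lines of Python. The following is a minimal sketch, not an official loader. The function names are hypothetical, the annotation file paths are left to the caller because the abstract does not name them, and the `.txt` suffix for clinical case files is an assumption. The reference-position field is kept as a raw string, since its internal format is not specified here.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Field order for CodiEsp-X files, as listed in the corpus description.
X_FIELDS = ["articleID", "label", "ICD10-code", "text-reference", "reference-position"]


def read_code_annotations(tsv_path):
    """Read a CodiEsp-D or CodiEsp-P file (articleID, ICD10-code per line).

    Returns a dict mapping each articleID to the list of its codes.
    """
    codes_by_case = defaultdict(list)
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            codes_by_case[row[0]].append(row[1])
    return dict(codes_by_case)


def read_reference_annotations(tsv_path):
    """Read a CodiEsp-X file into a list of dicts keyed by X_FIELDS."""
    records = []
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) != len(X_FIELDS):
                continue  # skip lines that do not have exactly five fields
            records.append(dict(zip(X_FIELDS, row)))
    return records


def read_case_text(text_dir, article_id, suffix=".txt"):
    """Read one clinical case; the file name is the case identifier.

    The ".txt" suffix is an assumption, not stated in the description.
    """
    return (Path(text_dir) / f"{article_id}{suffix}").read_text(encoding="utf-8")
```

With these helpers, the codes returned by `read_code_annotations` for an articleID can be paired with the text returned by `read_case_text` from the text_files or text_files_en subfolder of the same set.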

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Related Organizations

Barcelona Supercomputing Center
Spain

Keywords

ICD10, clinical case, medical Informatics, hierarchical Multi-label Classification, CIE10, text categorization, multi-label Classification, supervised machine learning, NLP, eHealth CLEF, clinical coding, text Mining

Related research

- CodiEsp corpus: gold standard Spanish clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020 (2020, IsVersionOf)
- CodiEsp corpus: gold standard Spanish clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020 (2020, IsAmongTopNSimilarDocuments)
- CodiEsp codes: list of valid CIE10 codes for the CodiEsp task (2020, IsAmongTopNSimilarDocuments)
- CodiEsp Silver Standard: Participant predictions in eHealth CLEF2020 - Spanish clinical cases coded in ICD10 (CIE10) (2020, IsAmongTopNSimilarDocuments)
- CodiEsp corpus training and development set: Spanish clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020 (2020, IsAmongTopNSimilarDocuments)

Impact by BIP!

- citations: 1. An alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Usage by UsageCounts

- views: 53
- downloads: 5

SDGs: 3. Good health

Related to Research communities

Knowmad Institut
