CARMEN-I is a corpus of 2,000 de-identified clinical records generated at the Hospital Clínic of Barcelona (HCB) from March 2020 to March 2022, during the height of the COVID-19 pandemic, and developed in collaboration with the Barcelona Supercomputing Center (BSC). It consists of discharge letters, referrals and radiology reports written mainly in Spanish, with some sections in Catalan. The corpus covers patients admitted with COVID-19 and includes a wide variety of comorbidities, such as kidney failure, chronic cardiovascular and respiratory diseases, malignancies and immunosuppression. CARMEN-I has been exhaustively anonymized and validated by hospital physicians, natural language processing experts and linguists, following detailed annotation guidelines and replacing original sensitive data elements with synthetic equivalents. A subset of the corpus has been annotated by experts with key medical concepts, namely symptoms, diseases, procedures, medications, species and humans (including family members), using an annotation scheme based on previously released biomedical corpora such as DisTEMIST, ProcTEMIST and LivingNER. This repository contains the Spanish version of the anonymization protocol. The document describes the protocol created for the data anonymization process, as well as the control mechanisms put in place for this purpose. It also includes addenda to the MEDDOCAN guidelines for the annotation of sensitive data, criteria for the inclusion and exclusion of documents, and a list of indirect identifiers. CARMEN-I is available on PhysioNet on request.
Other relevant links:
CARMEN-I Clinical Entities Annotation Guidelines (Spanish version): zenodo.org/doi/10.5281/zenodo.10171539
CARMEN-I Clinical Entities Annotation Guidelines (English version): zenodo.org/doi/10.5281/zenodo.10171646
CARMEN-I Anonymization Protocol (Spanish version): zenodo.org/doi/10.5281/zenodo.10171660
CARMEN-I Anonymization Protocol (English version): zenodo.org/doi/10.5281/zenodo.10171681
MEDDOCAN Anonymization Corpus: zenodo.org/doi/10.5281/zenodo.4279322
MEDDOCAN Anonymization Guidelines: zenodo.org/doi/10.5281/zenodo.4279337

If you use this document, please cite:
@article{LimaLopez2025,
  author = {Salvador Lima-López and Eulàlia Farré-Maduell and Luis Gasco and Jan Rodríguez-Miret and Santiago Frid and Xavier Pastor and Xavier Borrat and Martin Krallinger},
  title = {A textual dataset of de-identified health records in Spanish and Catalan for medical entity recognition and anonymization},
  journal = {Scientific Data},
  volume = {12},
  pages = {Article 1088},
  year = {2025},
  publisher = {Nature Publishing Group},
  doi = {10.1038/s41597-025-05320-1},
  url = {https://www.nature.com/articles/s41597-025-05320-1}
}
de-identification, bionlp, anonymization