
MultiGraSCCo - A Multilingual version of the Graz Synthetic Clinical text Corpus with Annotations of Personal Information

This repository is an external resource of:

Baroud, I., Otto, C., Czehmann, V., Hovhannisyan, C., Raithel, L., Möller, S., Roller, R. (2026). MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers. arXiv preprint arXiv:2603.08879.

The Graz Synthetic Clinical text Corpus (GraSCCo) is a dataset of artificially generated semi-structured and unstructured German-language clinical summaries. These summaries are written as letters from the hospital to the patient's GP after in-patient or out-patient care. Further details:

Stefan Schulz. (2022). GraSCCo (Version v1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6539131

Modersohn L, Schulz S, Lohr C, Hahn U. GRASCCO - The First Publicly Shareable, Multiply-Alienated German Clinical Text Corpus. Stud Health Technol Inform. 2022;296:66-72. doi:10.3233/SHTI220805

This work extends the annotations of Protected Health Information (PHI) in GraSCCo, introduced in the following resource, with annotations of Indirect Personal Identifiers (IPI) such as information about the patient's family, lifestyle, and socioeconomic and criminal history:

Lohr, C., Matthies, F., Jakob, F., Modersohn, L., Riedel, A., Hahn, U., Kiser, R., Boeker, M., & Meineke, F. (2024). GraSCCo_PHI - Graz Synthetic Clinical text Corpus with Protected Health Information Annotations [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11502329

We use and further develop the following guidelines for annotating IPIs in GraSCCo:

Baroud, I., Raithel, L., Möller, S., & Roller, R. (2025). MIMIC_III_IPI - Discharge Summaries from MIMIC-III with Indirect Personal Identifiers Annotations [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15372705

In this work, GraSCCo, together with its annotations of direct and indirect personal information, was translated into 9 languages from 6 language families and 3 scripts. Including the German original, MultiGraSCCo covers the following language families and languages: German, English (Germanic); Italian, French (Romance); Arabic (Semitic); Polish, Russian, Ukrainian (Slavic); Turkish (Turkic); and Persian (Indo-Iranian).

The repository contains the PHI and IPI annotations in JSON format for all 10 languages, as well as the IPI annotation guidelines.
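
Because the layout of the JSON annotation files is not described here, a first step when working with them might be to inspect their structure before writing a parser. The following is a minimal sketch in Python that makes no assumption about the schema: it only counts which key paths occur in a parsed JSON value. The function names (`describe`, `describe_file`, `report`) are illustrative and not part of the repository, and any directory passed to `report` is the user's own download location.

```python
import json
from collections import Counter
from pathlib import Path


def describe(node, prefix="$", counts=None):
    """Recursively count the key paths that occur in a parsed JSON value.

    Dictionary keys extend the path with ".key"; list items extend it
    with "[]", so repeated annotation entries collapse onto one path.
    """
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}"
            counts[path] += 1
            describe(value, path, counts)
    elif isinstance(node, list):
        for item in node:
            describe(item, f"{prefix}[]", counts)
    return counts


def describe_file(path):
    """Load one JSON file as UTF-8 and return its key-path counts."""
    with open(path, encoding="utf-8") as handle:
        return describe(json.load(handle))


def report(root):
    """Print the key-path counts for every .json file below `root`.

    `root` is whatever folder the files were downloaded to; no specific
    folder or file naming scheme is assumed.
    """
    for json_path in sorted(Path(root).glob("**/*.json")):
        print(json_path)
        for key_path, n in sorted(describe_file(json_path).items()):
            print(f"  {key_path}: {n}")
```

Calling `report` on the folder holding the downloaded files prints one block per file. Reading the files explicitly as UTF-8 matters here, since the corpus includes Cyrillic and Arabic-script text.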
