https://doi.org/10.2139/ssrn.6...
Article . 2026 . Peer-reviewed
Data sources: Crossref
ZENODO
Conference object . 2026
License: CC BY
Data sources: Datacite
ZENODO
Conference object . 2026
License: CC BY
Data sources: Datacite

mailcom: Pseudonymization Tool for Textual Data

Authors: Le, Kim Tuyen; Gärtner, Laura; Fleischle, Felix; Schoeller, Thore; Große, Sybille; Ulusoy, Inga


Abstract

The rapid growth of data and its use by Artificial Intelligence applications has heightened concerns about data privacy. Researchers often need to analyze datasets that contain personal information, sometimes paired with sensitive attributes such as medical records or political views. To support such analyses without exposing identifiable content, the Scientific Software Center (SSC) of Heidelberg University developed the mailcom package for pseudonymization. This capability is especially important when web-hosted Large Language Models are used for downstream analysis. As a use case, we applied mailcom to a multilingual email corpus in Spanish, French, and Portuguese contributed by multiple donors as part of a pilot study, in collaboration with the research group of Sybille Große (Department of Romance Studies, Heidelberg University). To protect donor privacy, sensitive information such as names, email addresses, and numbers is extracted and pseudonymized. The package processes text from email subjects and bodies in eml and html formats, as well as from csv rows, making it applicable to a wide range of textual data beyond email. mailcom is built entirely on open-source libraries and is designed for configurability and extensibility. Its core features are: (i) language identification, (ii) named-entity recognition, (iii) extraction of temporal expressions, and (iv) de-identification of sensitive data via pseudonyms. The three aforementioned languages are supported by default, with options to add further languages and change back-end libraries via configuration. We present these features in end-to-end processing pipelines using examples from our use case. The main parts are: (1) the general workflow from raw text to pseudonymized output; (2) the default libraries and techniques (e.g. eml-parser, spaCy, langid, langdetect, transformers, and rule-based methods); and (3) mechanisms for adapting to new languages, transformer pipelines, and spaCy models with minimal effort.
Since pseudonymized outputs still require human review to guarantee full anonymization, the package serves as a scalable pre-processing layer that reduces manual work while establishing a principled baseline of privacy protection. This reproducible, privacy-aware tool enables empirical research on digital text under current data-ethics and governance standards.
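
To make the de-identification step concrete, the following is a minimal, standard-library-only sketch of the rule-based part of such a pipeline. It is not the mailcom API: the function `pseudonymize`, the placeholder `[email]`, the digit mask `x`, and the pseudonym list are all hypothetical choices made for illustration. In the real package, names would come from a named-entity recognizer such as spaCy; here they are passed in by hand.

```python
import re

# Hypothetical patterns for illustration; not taken from mailcom.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGIT_RE = re.compile(r"\d")


def pseudonymize(text, names, pseudonyms=("Alex", "Sam", "Kim")):
    """Replace email addresses, known names, and digits in `text`.

    `names` stands in for the output of a named-entity recognizer.
    Returns the rewritten text and the name-to-pseudonym mapping.
    """
    # 1. Mask email addresses first, so name fragments inside
    #    addresses are not rewritten separately.
    out = EMAIL_RE.sub("[email]", text)

    # 2. Map each distinct name to one pseudonym, reused consistently.
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = pseudonyms[len(mapping) % len(pseudonyms)]
        out = re.sub(rf"\b{re.escape(name)}\b", mapping[name], out)

    # 3. Mask every remaining digit (phone numbers, IDs, and so on).
    out = DIGIT_RE.sub("x", out)
    return out, mapping


if __name__ == "__main__":
    sample = (
        "Hola María, escribe a maria.lopez@example.org "
        "o llama al 555 1234. Saludos, Pedro"
    )
    result, used = pseudonymize(sample, ["María", "Pedro"])
    print(result)
    print(used)
```

Because each name maps to the same pseudonym throughout a text, references stay coherent for downstream analysis, while the returned mapping can be kept separately or discarded. A sketch like this still misses names the recognizer did not find, which is why human review of the output remains necessary.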

Keywords

pseudonymization, sensitive data, named entity recognition, Natural Language Processing
