mailcom: Pseudonymization Tool for Textual Data

The rapid growth of data and its use by Artificial Intelligence applications has heightened concerns about data privacy. Researchers often need to analyze datasets that contain personal information, sometimes paired with sensitive attributes such as medical records or political views. To support such analyses without exposing identifiable content, the Scientific Software Center (SSC) of Heidelberg University developed the mailcom package for pseudonymization. This capability is especially important when employing web-hosted Large Language Models for downstream analysis. As a use case, we applied mailcom to a multilingual email corpus in Spanish, French, and Portuguese contributed by multiple donors as part of a pilot study, in collaboration with the research group of Sybille Große (Department of Romance Studies, Heidelberg University). To protect donor privacy, sensitive information such as names, email addresses, and numbers is extracted and pseudonymized. The package processes text from email subjects and bodies in eml and html formats, as well as from csv rows, making it applicable to a wide range of textual data beyond email. mailcom is built entirely on open-source libraries and is designed for configurability and extensibility. Its core features are: (i) language identification, (ii) named-entity recognition, (iii) extraction of temporal expressions, and (iv) de-identification of sensitive data via pseudonyms. The three languages mentioned above are supported by default, with options to add further languages and change back-end libraries via configuration. We present these features in end-to-end processing pipelines using examples from our use case. The main parts are: (1) the general workflow from raw text to pseudonymized output; (2) the default libraries and techniques (e.g. eml-parser, spaCy, langid, langdetect, transformers, and rule-based methods); and (3) mechanisms for adapting to new languages, transformer pipelines, and spaCy models with minimal effort.
Since pseudonymized outputs still require human review to guarantee full anonymization, the package serves as a scalable pre-processing layer that reduces manual work while establishing a principled baseline of privacy protection. This reproducible, privacy-aware tool enables empirical research on digital text under current data-ethics and governance standards.
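
To make the final de-identification step concrete, the following is a minimal sketch in Python using only the standard library. It is not mailcom's API: the function name pseudonymize, the "[email]" placeholder, the digit masking with "0", and the caller-supplied list of names are all assumptions made for illustration. In the package itself, names would come from the named-entity recognition step rather than from a hand-written list.

```python
import re

# Illustrative patterns only; real email and number detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGIT_RE = re.compile(r"\d")


def pseudonymize(text, names, pseudonyms):
    """Replace known names, email addresses, and digits in `text`.

    `names` stands in for the output of a named-entity recognizer and
    `pseudonyms` is a non-empty pool of replacement names (both hypothetical).
    """
    # Give every distinct name a stable pseudonym, cycling through the pool
    # if there are more names than pseudonyms.
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = pseudonyms[len(mapping) % len(pseudonyms)]

    # Mask email addresses first so that names inside an address are not
    # partially replaced.
    text = EMAIL_RE.sub("[email]", text)

    # Replace all names in a single pass, longest first, so that a pseudonym
    # that happens to equal another name is not replaced a second time.
    if mapping:
        alternatives = sorted(mapping, key=len, reverse=True)
        name_re = re.compile(
            r"\b(?:" + "|".join(re.escape(n) for n in alternatives) + r")\b"
        )
        text = name_re.sub(lambda m: mapping[m.group(0)], text)

    # Mask every remaining digit.
    return DIGIT_RE.sub("0", text)


if __name__ == "__main__":
    sample = "Hola María, escribe a maria.lopez@example.org o llama al 555 1234."
    print(pseudonymize(sample, ["María"], ["Ana", "Luis"]))
    # prints: Hola Ana, escribe a [email] o llama al 000 0000.
```

Keeping the name-to-pseudonym mapping stable within a text means that repeated mentions of the same person stay linked after replacement, which preserves the structure of a conversation for later analysis. As noted above, output of this kind still needs human review before it can be treated as anonymized.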

Related Organizations

Heidelberg University
Germany

Keywords

pseudonymization, sensitive data, named entity recognition, Natural Language Processing

Related to Research communities

Digital Humanities and Cultural Heritage
