ZENODO
Software . 2026
License: CC BY
Data sources: Datacite

Supporting Privacy-Aware Requirements Engineering through Topic Modeling of AI Training Data

Abstract

Artifacts related to the paper "Supporting Privacy-Aware Requirements Engineering through Topic Modeling of AI Training Data".

Background: AI models are increasingly trained on large volumes of raw textual data that may contain privacy-sensitive information, raising risks related to data protection and regulatory compliance. Due to the scale and unstructured nature of such data, manual inspection is costly and difficult to sustain. From a Requirements Engineering (RE) perspective, this creates the need for systematic approaches to identify privacy risks before data are incorporated into AI systems.

Goal: This study proposes a semi-automated approach based on topic modeling to support privacy-aware RE in the analysis of AI training data.

Method: The approach combines text preprocessing and Latent Dirichlet Allocation (LDA) to extract latent topics from raw textual data. These topics are interpreted through expert judgment supported by human reviewers and LLM-based agents to identify privacy-sensitive themes and assign privacy risk levels to documents.

Results: The results indicate that the approach enables the identification of privacy-relevant topics and supports the classification of documents according to risk levels. By integrating topic modeling with expert interpretation, the approach provides practical support for dataset screening and privacy-oriented decision-making in RE.

Conclusion: The study demonstrates that topic modeling can be effectively used as a decision-support mechanism for privacy-aware Requirements Engineering, extending RE practices to the governance of AI training data.
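The pipeline outlined under Method (preprocessing, LDA, then expert-driven risk assignment) can be illustrated with a minimal, standard-library-only sketch. It is not the artifact's actual code: the stop-word list, the collapsed Gibbs sampler used to fit LDA, the hyperparameters `alpha` and `beta`, the function names, and the `medium`/`high` thresholds are all assumptions chosen for illustration. In particular, the record does not say how risk levels are computed; here a document's risk is assumed to depend on the share of its topic weight that falls on topics an expert has flagged as privacy-sensitive.

```python
import random
import re

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "and", "for", "was", "with", "his", "her", "this", "that"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens of 3+ letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]{3,}", text.lower()) if t not in STOP_WORDS]


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (doc_topic, topics): per-document topic distributions and,
    for each topic, its vocabulary sorted by descending count.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    n_dk = [[0] * n_topics for _ in docs]            # topic counts per document
    n_kw = [[0] * n_vocab for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                             # total tokens per topic
    z = []                                           # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][wi] -= 1
                n_k[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wi] + beta) / (n_k[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wi] += 1
                n_k[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in n_dk[d]]
        for d, doc in enumerate(docs)
    ]
    topics = [
        [vocab[i] for i in sorted(range(n_vocab), key=lambda i: -row[i])]
        for row in n_kw
    ]
    return doc_topic, topics


def risk_level(topic_weights, sensitive_topics, medium=0.3, high=0.6):
    """Map the weight on expert-flagged topics to a risk label (thresholds assumed)."""
    share = sum(topic_weights[k] for k in sensitive_topics)
    if share >= high:
        return "high"
    if share >= medium:
        return "medium"
    return "low"


if __name__ == "__main__":
    raw = [
        "Patient diagnosis and medical record with home address",
        "Football league match score and season table",
        "Medical record lists the patient diagnosis",
        "League table after the football match",
    ]
    docs = [preprocess(t) for t in raw]
    doc_topic, topics = fit_lda(docs, n_topics=2)
    for k, words in enumerate(topics):
        print(f"topic {k}: {words[:5]}")
    # A reviewer would inspect the printed topics and flag the sensitive ones;
    # flagging topic 0 here is arbitrary and only demonstrates the call.
    for text, weights in zip(raw, doc_topic):
        print(risk_level(weights, sensitive_topics={0}), "|", text)
```

The topic indices produced by the sampler carry no inherent meaning, so the set passed as `sensitive_topics` has to come from a human (or LLM-assisted) reading of each topic's top words, which mirrors the expert-interpretation step described in the abstract.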
