
Artifacts related to the paper: Supporting Privacy-Aware Requirements Engineering through Topic Modeling of AI Training Data

Abstract

Background: AI models are increasingly trained on large volumes of raw textual data that may contain privacy-sensitive information, raising risks related to data protection and regulatory compliance. Due to the scale and unstructured nature of such data, manual inspection is costly and difficult to sustain. From a Requirements Engineering (RE) perspective, this creates the need for systematic approaches to identify privacy risks before data are incorporated into AI systems.

Goal: This study proposes a semi-automated approach based on topic modeling to support privacy-aware RE in the analysis of AI training data.

Method: The approach combines text preprocessing and Latent Dirichlet Allocation (LDA) to extract latent topics from raw textual data. These topics are interpreted through expert judgment supported by human reviewers and LLM-based agents to identify privacy-sensitive themes and assign privacy risk levels to documents.

Results: The results indicate that the approach enables the identification of privacy-relevant topics and supports the classification of documents according to risk levels. By integrating topic modeling with expert interpretation, the approach provides practical support for dataset screening and privacy-oriented decision-making in RE.

Conclusion: The study demonstrates that topic modeling can be effectively used as a decision-support mechanism for privacy-aware Requirements Engineering, extending RE practices to the governance of AI training data.
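As a rough illustration of the final step in the Method (assigning privacy risk levels to documents once reviewers have flagged some topics as privacy-sensitive), the sketch below sums each document's topic proportions over the flagged topics and applies two thresholds. This is not taken from the paper's artifacts: the function names, the thresholds (0.2 and 0.5), the topic indices, and the sample proportions are all hypothetical, and the document-topic proportions are assumed to come from an already fitted LDA model.

```python
from typing import Dict, List, Set


def sensitive_mass(doc_topics: List[float], sensitive: Set[int]) -> float:
    """Sum a document's topic proportions over the topics flagged as sensitive."""
    return sum(p for k, p in enumerate(doc_topics) if k in sensitive)


def risk_level(
    doc_topics: List[float],
    sensitive: Set[int],
    medium: float = 0.2,  # hypothetical threshold, not from the paper
    high: float = 0.5,    # hypothetical threshold, not from the paper
) -> str:
    """Map the sensitive topic mass of one document to a coarse risk label."""
    mass = sensitive_mass(doc_topics, sensitive)
    if mass >= high:
        return "high"
    if mass >= medium:
        return "medium"
    return "low"


# Hypothetical document-topic proportions, as an LDA model might produce them.
doc_topic_matrix: Dict[str, List[float]] = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.30, 0.40, 0.30],
    "doc_c": [0.05, 0.05, 0.90],
}

# Assumption: reviewers judged topic 0 to be privacy-sensitive.
sensitive_topics: Set[int] = {0}

levels = {
    doc: risk_level(topics, sensitive_topics)
    for doc, topics in doc_topic_matrix.items()
}
print(levels)
```

In practice the thresholds and the set of sensitive topics would be set by the human reviewers, which is where the expert judgment described in the Method enters.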
