
This repository contains the complete research pipeline for automatically analyzing privacy policies in a multilingual setting (validated for English, German, French, and Italian). The project operationalizes this pipeline in the context of (1) a revision in Swiss privacy law and (2) the use of automated policy generators.

## Repository Structure

This repository is organized into three main subdirectories, each serving a specific purpose in the research pipeline:

### 1. Analysis

The Analysis directory contains most statistical analyses and data processing scripts.

Key Components:
- Data preprocessing (including text cleaning and removal of personal identifiers with Presidio)
- Creation of the three relevant groups for the analysis (CH, CH & EU, and EU)
- Main statistical and semantic analysis scripts presented in the paper
- Analysis of policy clusters based on generator use

### 2. Data

The Data directory contains the relevant datasets and corpora used throughout the paper.

Key Components:
- Annotated datasets:
  1. LLM-annotated full original dataset after removal of personal identifiers with Presidio ("swiss-gdpr_annotated.parquet")
  2. Final annotated and grouped dataset used for all statistical analyses ("swiss-gdpr_annotated_groups.parquet")
  3. Log of the LLM annotations ("run.log")
- CrUX dataset used for website popularity rankings as well as the website budgeting list used to scrape the initial dataset
- Embeddings of the policies used for the cluster analysis

### 3. LLM

The LLM directory contains all files related to the LLM-based data analysis.

Key Components:
- The codebooks, human annotations (reference benchmark), and evaluations for all three initial annotation phases ("Annotations")
- The validation of the models against the final set of human annotations ("Validation")
- The scripts for the large-scale policy evaluation using OpenAI's GPT-5 ("Evaluation")

## Citation

If you use this work in your research, please cite it as:

```
Accepted at PETS '26
[Citation details to be added upon publication]
```

We kindly ask you to cite the paper and not the dataset itself. Please find a more detailed list of funding sources in the paper's Acknowledgments section.

## License

This work and its artifacts are licensed under a CC-BY 4.0 license.

## Contact

For questions about this research, please contact Luka Nenadic at lnenadic@ethz.ch.
