Dataset for paper "The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok"

This dataset accompanies the paper “The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok”. It is designed to support the analysis of video interactions, ad classifications, and user engagement patterns. It contains records of video interactions, including metadata about the videos, user demographics, and ad classifications, allowing full replication of the results presented in the paper.

The video excerpts included in this dataset are used solely as units of content for analytical purposes. They do not represent, reflect, or imply the personal views, intentions, or stance of the individuals who created them. Content should be interpreted as data artifacts, not as statements attributable to any person.

To minimize the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes, upon verification of an email address associated with an academic organisation.

Paper: TBA (currently under review)
Preprint: https://arxiv.org/abs/2603.05653
GitHub repository: https://github.com/kinit-sk/ai-auditology-advertising-and-minor-profiling-tiktok

References

If you use this dataset in any publication, project, tool, or in any other form, please cite the following paper:

@misc{solarova2026dsasblindspotalgorithmic,
  title={The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok},
  author={Sara Solarova and Matej Mosnar and Matus Tibensky and Jan Jakubcik and Adrian Bindas and Simon Liska and Filip Hossner and Matúš Mesarčík and Ivan Srba},
  year={2026},
  eprint={2603.05653},
  archivePrefix={arXiv},
  primaryClass={cs.CY},
  url={https://arxiv.org/abs/2603.05653},
}

Dataset Description

The logs of videos presented to individual simulated users are provided in the ai-auditology-advertising-and-minor-profiling-tiktok_video_data.csv file.
It is structured into 31 columns, capturing details such as session and video identifiers, timestamps, ad classifications, visual indicators, user demographics, and video metadata.

| Column Name | Data Type | Description | Example Value |
|---|---|---|---|
| session_id | string | Session identifier captured during browsing | 1765302414.743265 |
| video_id | string | Platform video identifier | [anonymized] |
| timestamp | datetime | Timestamp when the record was captured | 2025-12-09T17:47:56.296448 |
| is_ad | boolean | Whether the video was classified as an ad | false |
| ad_type | string (nullable) | Ad classification type when is_ad is true | other |
| ad_topic | string (nullable) | Detected topic for ad content | beauty |
| visual_indicators | array[string] | List of visual indicators used to classify ads | ["hashtag #clearskin"] |
| reasoning | string | Model reasoning for the ad classification | No disclosure label visible. |
| interaction_number | integer | Sequential interaction count within the session | 1 |
| search_term | string | Search term used to find the content | clear skin |
| video_action_skip | boolean | Whether the user skipped the video | False |
| video_action_watch | boolean | Whether the user watched the video | True |
| video_action_like | boolean | Whether the user liked the video | True |
| video_action_bookmark | boolean | Whether the user bookmarked the video | True |
| video_time_watch_loop_start | float (nullable) | Timestamp when watch loop started | 1765302470.8245792 |
| video_time_watch_loop_end | float (nullable) | Timestamp when watch loop ended | 1765302477.842666 |
| video_time_skip | float (nullable) | Timestamp when the video was skipped | nan |
| video_time_like | float (nullable) | Timestamp when the video was liked | 1765302471.8269806 |
| video_time_bookmark | float (nullable) | Timestamp when the video was bookmarked | 1765302477.3054323 |
| video_time_predict_interaction | float (nullable) | Timestamp for predicted interaction (if any) | nan |
| topic | string | User interest topic used for personalization | beauty |
| gender | string | User gender | female |
| country_code | string | User country code | DE |
| date_of_birth | date | User date of birth | 2009-11-29 |
| agent | string | Agent identifier added during processing | Beauty_minor |
| video_url | string | Full URL to the video | https://www.tiktok.com/[anonymized] |
| video_author | string | Account handle of the video author | [anonymized] |
| video_description | string | Video description text | little bonus - your waist? nonexistent #chiaseeds #guthealth |
| video_time_duration | float | Video duration in seconds | 25.866667 |
| video_transcript | string (nullable) | Auto-transcribed video text if available | nan |
| video_transcript_language | string (nullable) | Language of the transcript | nan |

Manual annotations of selected videos (used to assess the accuracy of the ad type and topic classification model) are provided in ai-auditology-advertising-and-minor-profiling-tiktok_annotator_1.csv and ai-auditology-advertising-and-minor-profiling-tiktok_annotator_2.csv, for the first and second human annotator respectively.

Ethical considerations

Most of the ethical, legal and societal issues tied to this dataset are described in the Ethical Considerations section of the associated paper. The most severe risks are tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, and the erosion of some privacy rights such as the right to erasure.

The research from which this dataset resulted was carried out as part of a research project that obtained approval from the organisational Ethics Committee (decision of December 17, 2024). To minimise any potential legal and ethical issues, we directly involved legal and ethics experts in this project. Researchers and research engineers conducting this auditing study also participated in four ethics assessment workshops together with ethics and legal experts, where relevant ethical and legal challenges were identified and appropriate mitigations were proposed.
First, the execution of sockpuppeting audits requires creating automated bots and using them for data collection, which is a potential violation of the terms of service of the social media platforms. However, this breach of ToS is permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. This work directly addresses such a systemic risk by assessing social media platforms' compliance with obligations imposed by legislation, specifically the prohibition of profiling-based advertising to minors laid down in Article 28(2) of the DSA, as foreseen by Recital 83 of the DSA.

Second, the interaction of the bots with the content on the platform may impact the platform and society (e.g., by increasing the view or like count). To limit this impact, we minimise the number of bots that we run. When it comes to data, we collect only publicly available metadata.

To mitigate potential biases and inaccuracies inherent in the Large Vision Model (LVM) used for advertisement classification, we implemented a multi-layered validation process. This included both ad-hoc and systematic manual audits of dataset subsets. Data failing to meet accuracy benchmarks were excluded, and we have reported the estimated error rates accordingly. To prioritize ethical standards and researcher well-being, all manual annotations were conducted solely by the study’s authors, following expert ethical guidelines.

Finally, to support users' rights to rectification and erasure in case of the publication of incorrect or sensitive information, we provide a procedure for them to request the removal of their posts from the dataset or to flag inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.
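
Usage note for the video log described under Dataset Description: the sketch below shows one possible way to read the CSV with the Python standard library and convert the documented boolean, nullable float, datetime, and date columns into typed values. It assumes a comma-delimited file with a header row matching the column names, booleans spelled either "false" or "False" (both appear in the example values), and missing numeric values stored as the literal "nan". The helper names (parse_bool, parse_optional_float, age_on, parse_row, load_video_log) and the derived age_at_capture field are hypothetical and are not part of the dataset or the authors' code. The inline sample row uses only a subset of the columns and is built from the example values in the column table.

```python
import csv
import io
from datetime import date, datetime

# Column groups follow the column table; the parsing rules are assumptions.
BOOL_COLUMNS = ("is_ad", "video_action_skip", "video_action_watch",
                "video_action_like", "video_action_bookmark")
FLOAT_COLUMNS = ("video_time_watch_loop_start", "video_time_watch_loop_end",
                 "video_time_skip", "video_time_like", "video_time_bookmark",
                 "video_time_predict_interaction", "video_time_duration")


def parse_bool(text):
    """Accept both 'false' and 'False' spellings."""
    return text.strip().lower() == "true"


def parse_optional_float(text):
    """Map empty cells and the literal 'nan' to None; otherwise return a float."""
    cleaned = text.strip()
    if cleaned == "" or cleaned.lower() == "nan":
        return None
    return float(cleaned)


def age_on(day, born):
    """Whole years elapsed between a birth date and a later date."""
    had_birthday = (day.month, day.day) >= (born.month, born.day)
    return day.year - born.year - (0 if had_birthday else 1)


def parse_row(raw):
    """Convert one csv.DictReader row (all strings) into typed values."""
    row = dict(raw)
    for name in BOOL_COLUMNS:
        if name in row:
            row[name] = parse_bool(row[name])
    for name in FLOAT_COLUMNS:
        if name in row:
            row[name] = parse_optional_float(row[name])
    row["timestamp"] = datetime.fromisoformat(row["timestamp"])
    row["date_of_birth"] = date.fromisoformat(row["date_of_birth"])
    # Hypothetical derived field: age of the simulated user at capture time.
    row["age_at_capture"] = age_on(row["timestamp"].date(), row["date_of_birth"])
    return row


def load_video_log(handle):
    """Read typed rows from an open text file or file-like object."""
    return [parse_row(raw) for raw in csv.DictReader(handle)]


# Illustrative sample with a subset of columns, using the table's example values.
SAMPLE = io.StringIO(
    "session_id,timestamp,is_ad,video_action_watch,video_time_skip,"
    "video_time_duration,date_of_birth,agent\n"
    "1765302414.743265,2025-12-09T17:47:56.296448,false,True,nan,"
    "25.866667,2009-11-29,Beauty_minor\n"
)
rows = load_video_log(SAMPLE)
print(rows[0]["is_ad"], rows[0]["video_time_skip"], rows[0]["age_at_capture"])
```

To read the actual file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"), where the UTF-8 encoding is again an assumption. The session_id column is left as a string on purpose, since the table documents it as a string even though its example value looks numeric.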
