
Dataset for the Generative AI Detection Task (Subtask 2) @ PAN 2025.

As large language models (LLMs) like GPT-4o, Claude 3.5, and Gemini 1.5-pro become increasingly accessible, machine-generated content is proliferating across diverse domains, including news, social media, education, and academia. These models produce highly fluent and coherent text, making them valuable for automating various writing tasks. However, their widespread use also raises concerns about misinformation, academic integrity, and content authenticity. Identifying the degree of human and machine involvement in text creation is crucial for addressing these challenges.

In this shared task, we focus on Human-AI Collaborative Text Classification, where the goal is to categorize documents that have been co-authored by humans and LLMs. Specifically, we aim to classify texts into six distinct categories based on the nature of human and machine contributions:

- Fully human-written: The document is entirely authored by a human without any AI assistance.
- Human-initiated, then machine-continued: A human starts writing, and an AI model completes the text.
- Human-written, then machine-polished: The text is initially written by a human but later refined or edited by an AI model.
- Machine-written, then machine-humanized (obfuscated): An AI generates the text, which is later modified to obscure its machine origin.
- Machine-written, then human-edited: The content is generated by an AI but subsequently edited or refined by a human.
- Deeply-mixed text: The document contains interwoven sections written by both humans and AI, without a clear separation.

Label Distribution:

| Label Category | Train | Dev |
| --- | --- | --- |
| Machine-written, then machine-humanized | 91,232 | 10,137 |
| Human-written, then machine-polished | 95,398 | 12,289 |
| Fully human-written | 75,270 | 12,330 |
| Human-initiated, then machine-continued | 10,740 | 37,170 |
| Deeply-mixed text (human + machine parts) | 14,910 | 225 |
| Machine-written, then human-edited | 1,368 | 510 |
| Total | 288,918 | 72,661 |
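The label distribution is heavily imbalanced, and the train and dev splits are skewed differently. As a minimal sketch, the Python snippet below turns the published counts into per-class shares and inverse-frequency weights. The weighting formula (total / (number of classes × class count)) is a common heuristic and an assumption here, not something the task description prescribes; the names `LABEL_COUNTS`, `class_shares`, and `balanced_weights` are hypothetical helpers introduced for illustration.

```python
# Counts copied from the label distribution table: (train, dev).
LABEL_COUNTS = {
    "Machine-written, then machine-humanized": (91_232, 10_137),
    "Human-written, then machine-polished": (95_398, 12_289),
    "Fully human-written": (75_270, 12_330),
    "Human-initiated, then machine-continued": (10_740, 37_170),
    "Deeply-mixed text (human + machine parts)": (14_910, 225),
    "Machine-written, then human-edited": (1_368, 510),
}


def class_shares(counts):
    """Return each label's fraction of the split."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count).

    This heuristic is an assumption for illustration, not part of the task.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Split the (train, dev) pairs into one dictionary per split.
train = {label: pair[0] for label, pair in LABEL_COUNTS.items()}
dev = {label: pair[1] for label, pair in LABEL_COUNTS.items()}

train_shares = class_shares(train)
dev_shares = class_shares(dev)
train_weights = balanced_weights(train)

if __name__ == "__main__":
    for label in LABEL_COUNTS:
        print(
            f"{label}: train {train_shares[label]:.1%}, "
            f"dev {dev_shares[label]:.1%}, "
            f"train weight {train_weights[label]:.2f}"
        )
```

Weights like these could be passed to a weighted loss during training. Because the dev shares differ sharply from the train shares for some labels, per-class metrics on dev may be more informative than overall accuracy.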
