CEAID Adversarial Subset

CEAID is a benchmark dataset (described in a paper) for machine-generated text detection in 7 Central European languages (Croatian, Czech, German, Hungarian, Polish, Slovak, and Slovenian) and two domains (news and social media). It contains 188,098 texts, of which about 23k are human-written and about 165k are generated by 8 multilingual large language models.

This dataset is an extension of CEAID for evaluating the adversarial robustness of machine-generated text detection methods. It contains a carefully balanced, pseudorandomly selected subset of 100 texts per domain and language for each of the machine-generated and human-written classes. It further contains adversarially modified counterparts of the machine-generated samples, one for each of the two attacks used (homoglyph attack and paraphrasing). In total, it contains 1,400 human-written and 4,200 machine-generated samples (1,400 original, 1,400 homoglyph, and 1,400 paraphrased).

The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

Disclaimer

Because of its data sources (the original CEAID is a subset of a combination of news articles from MULTITuDEv3 and social-media texts from MultiSocial), the dataset may contain harmful, disinformation, or offensive content. The MultiSocial dataset description states that, based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (which are less likely to include machine-generated texts), the labeling of human-written text might not be 100% accurate. The anonymization procedure might not have hidden all sensitive or personal content; thus, use the data cautiously (if you feel affected by such content, report the issues found to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

Data

The dataset has the following fields:

'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing the large language model that generated the text, or the string "human" for a human-written text; the string after the "_" character identifies the original/homoglyph/paraphrased subset,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset of the given text (whether it originated in CEAID, MULTITuDE, or MultiSocial),
'domain' - "news" for news articles, "social_media" for social-media texts.

Basic statistics (number of samples):

language   original human   original machine   homoglyph machine   paraphrased machine
cs         200              200                200                 200
de         200              200                200                 200
hr         200              200                200                 200
hu         200              200                200                 200
pl         200              200                200                 200
sk         200              200                200                 200
sl         200              200                200                 200
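
The field descriptions above suggest that 'multi_label' can be split into a generator name and a subset name, which makes it easy to recompute the per-language statistics. The following is a minimal sketch only: the helper names (split_multi_label, tally) and the sample records are hypothetical, the exact 'multi_label' strings in the released files may differ, and treating a label without a recognized suffix as "original" is an assumption.

```python
from collections import Counter

# Subset names taken from the field description; the exact spelling used in
# the released files is an assumption.
SUBSETS = ("original", "homoglyph", "paraphrased")


def split_multi_label(multi_label):
    """Split a 'multi_label' value into (generator, subset).

    The split is done at the LAST underscore so that generator names
    containing underscores stay intact.
    """
    generator, sep, subset = multi_label.rpartition("_")
    if sep and subset in SUBSETS:
        return generator, subset
    # Assumption: a label without a recognized suffix is an unmodified sample.
    return multi_label, "original"


def tally(records):
    """Count samples per (language, subset, class)."""
    counts = Counter()
    for record in records:
        _, subset = split_multi_label(record["multi_label"])
        cls = "human" if record["label"] == 0 else "machine"
        counts[(record["language"], subset, cls)] += 1
    return counts


# Hypothetical records for illustration; these are not taken from the dataset.
records = [
    {"label": 0, "multi_label": "human_original", "language": "sk"},
    {"label": 1, "multi_label": "example-model_original", "language": "sk"},
    {"label": 1, "multi_label": "example_model_homoglyph", "language": "sk"},
    {"label": 1, "multi_label": "example-model_paraphrased", "language": "cs"},
]

counts = tally(records)
for key in sorted(counts):
    print(key, counts[key])
```

Run on the full dataset, a tally of this kind should reproduce the table above (200 samples per language in each of the four columns), provided the label format matches the assumptions stated here.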

