RU-AI: A Large Multimodal Dataset for Machine-Generated Content Detection

Keywords: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Liting Huang; Zhihao Zhang; Yiran Zhang; Xiyue Zhou; Shoujin Wang

arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

https://doi.org/10.1145/370171...

Article . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2024

License: CC BY

Data sources: Datacite

Publication: Article, Preprint
Date: 08 May 2025
Embargo end date: 01 Jan 2024
Publisher: ACM
Journal: Companion Proceedings of the ACM on Web Conference 2025

DOI: 10.1145/3701716.3715306, 10.48550/arxiv.2406.04906

arXiv: 2406.04906

Abstract

Recent generative AI models can create realistic, human-like content, and this capability is significantly transforming how people communicate, create and work. Machine-generated content is a double-edged sword. On one hand, it can benefit society when used appropriately. On the other hand, it may mislead people and pose threats to society, especially when mixed with natural content created by humans. Hence, there is an urgent need to develop effective methods to detect machine-generated content. However, the lack of aligned multimodal datasets has inhibited the development of such methods, particularly in triple-modality settings (e.g., text, image, and voice). In this paper, we introduce RU-AI, a new large-scale multimodal dataset for robust and effective detection of machine-generated content in text, image and voice. Our dataset is built on three large publicly available datasets, Flickr8K, COCO and Places205, by adding their corresponding AI duplicates, resulting in a total of 1,475,370 instances. In addition, we created a noise variant of the dataset for testing the robustness of detection models. We conducted extensive experiments with current SOTA detection methods on our dataset. The results reveal that existing models still struggle to achieve accurate and robust detection on it. We hope that this new dataset can promote research in the field of machine-generated content detection, fostering the responsible use of generative AI. The source code and datasets are available at https://github.com/ZhihaoZhang97/RU-AI.

Accepted by WWW'25 Resource Track

Related Organizations

UNSW Sydney, Australia
University of Technology Sydney (UTS), Australia
Macquarie University, Australia
University of Sydney, Australia

Related research

RU-Lang8 software on GitHub (IsRelatedTo)

Impact by BIP!

Selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: Green
