
In recent years, the field of artificial intelligence has witnessed a remarkable surge in the generation of synthetic images, driven by advances in deep learning. These synthetic images, often created through complex algorithms, closely mimic real photographs, blurring the line between reality and artificiality. This proliferation of synthetic visuals presents a pressing challenge: how to accurately and reliably distinguish between genuine and generated images. This article explores the task of detecting images generated by text-to-image diffusion models, highlighting the challenges and peculiarities of this field. To evaluate this, we consider images generated from captions in the MSCOCO and Wikimedia datasets using two state-of-the-art models: Stable Diffusion and GLIDE. Our experiments show that the generated images can be detected using simple multi-layer perceptrons (MLPs) operating on features extracted by CLIP or RoBERTa, or using traditional convolutional neural networks (CNNs). The latter achieve remarkable performance, particularly when pretrained on large datasets. We also observe that models trained on images generated by Stable Diffusion can occasionally detect images generated by GLIDE, but only on the MSCOCO dataset; the reverse is not true. Lastly, we find that incorporating the textual information associated with the images can in some cases improve generalization, especially when the textual features are closely related to the visual ones. We also find that the type of subject depicted in an image can significantly affect performance. This work provides insights into the feasibility of detecting generated images and has implications for security and privacy concerns in real-world applications. The code to reproduce our results is available at: https://github.com/davide-coccomini/Detecting-Images-Generated-by-Diffusers.
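The detection approach described above — a simple MLP classifier trained on image embeddings — can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' actual pipeline: it substitutes randomly generated vectors for real CLIP embeddings (which would normally come from a model such as `openai/clip-vit-base-patch32`), and the feature dimension, network size, and training settings are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for CLIP image embeddings: in the real setting,
# `real` and `fake` would be feature vectors extracted from authentic
# photographs and diffusion-generated images, respectively.
rng = np.random.default_rng(0)
dim, n = 64, 200  # illustrative feature dimension and per-class sample count
real = rng.normal(loc=0.3, scale=1.0, size=(n, dim))
fake = rng.normal(loc=-0.3, scale=1.0, size=(n, dim))
X = np.vstack([real, fake])
y = np.array([1.0] * n + [0.0] * n)  # 1 = real, 0 = generated

# One-hidden-layer MLP with tanh activation and a sigmoid output,
# trained with full-batch gradient descent on the cross-entropy loss.
hidden = 16
W1 = rng.normal(0.0, 0.1, size=(dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.1, size=(hidden, 1)); b2 = np.zeros(1)
lr = 0.5

for _ in range(300):
    h = np.tanh(X @ W1 + b1)              # hidden activations
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # predicted P(real)
    grad = (p - y[:, None]) / len(y)      # dLoss/dlogits (cross-entropy)
    dW2 = h.T @ grad; db2 = grad.sum(axis=0)
    dh = (grad @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

acc = float((((p.ravel() > 0.5)) == (y == 1.0)).mean())
print(f"training accuracy: {acc:.2f}")
```

On real data, the synthetic feature generators would be replaced by a CLIP (or RoBERTa, for captions) encoder, and accuracy would be measured on a held-out split rather than the training set.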
Keywords: Deepfake detection, Synthetic image detection, Computer vision, Deep learning, Multimodal machine learning, Artificial intelligence, CLIP, Transformers, Convolutional neural networks
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | Citations derived from selected sources (an alternative to the "Influence" indicator) | 11 |
| Popularity | "Current" impact/attention of the article in the research community, based on the underlying citation network | Top 10% |
| Influence | Overall/total impact of the article in the research community, based on the underlying citation network (diachronically) | Top 10% |
| Impulse | Initial momentum of the article directly after its publication, based on the underlying citation network | Top 10% |
