Deepfake detection: challenges and solutions

Deepfakes can seriously affect the spread of fake news and people's lives in general, and they grow more dangerous every day. Moderating online content and databases is vital to mitigate this phenomenon, but developing systems that distinguish fake from genuine content comes with its own challenges: (a) Limited generalization: most deepfake detection models are trained on a specific type of deepfake and struggle to detect deepfakes generated with different techniques. (b) Real-world peculiarities: when deepfake detectors are applied in the real world, many complications arise, for example handling videos with multiple people in the same scene, or coping with faces that move towards or away from the camera. In our analysis, we first focused on the generalization problem, trying to understand whether a particular deep learning architecture is better able to abstract the concept of deepfake, to the extent that it can detect images or videos manipulated even with novel techniques. In [2] and [5] we compared Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) of various kinds in a cross-forgery setting, where a detector is tested on manipulation techniques it did not see during training. The comparison revealed the superiority of ViTs, which are less tied to the specific anomalies they see during training. Then, noting the scarcity of ViT-based methods, and even more so of methods based on hybrid architectures, we developed our first real deepfake detector. In [1] we created a new architecture that combines an EfficientNet-B0 with a Cross ViT, which we named Convolutional Cross Vision Transformer. Thanks to its local-global attention mechanism and to the features extracted by the CNN, the model detected deepfake videos effectively, achieving state-of-the-art (SOTA) results on the DFDC [6] and FaceForensics++ [9] datasets while keeping the number of parameters low.
The model was also used to participate in the competition presented in [7]. In [4] we designed a new type of Convolutional TimeSformer that takes into account both the spatial position of faces in the frame and their temporal position in the video. Thanks to a novel attention mechanism and positional embedding, it can also handle multiple identities and is robust to variations in face size. Our method surpassed the SOTA in in-dataset tests on [8] and performed robustly in real-world situations. Future work will mainly focus on making deepfake detectors more robust to other real-world problems. We also want detectors to assess video veracity by combining additional information, such as text, context, and the reputation of the account disseminating the video. Finally, as we started doing in [3], we will work on the more general problem of synthetic content detection.
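
The cross-forgery comparison mentioned above can be illustrated with a minimal sketch. It assumes one training set and one test set per manipulation technique, and it uses a toy threshold classifier in place of a CNN or ViT; the names `cross_forgery_matrix`, `train_threshold`, `method_a` and `method_b` are hypothetical and do not come from the cited works.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label), 1 = fake, 0 = real
Predictor = Callable[[Sequence[float]], int]


def accuracy(predict: Predictor, samples: List[Sample]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if not samples:
        raise ValueError("empty test set")
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def cross_forgery_matrix(
    train: Callable[[List[Sample]], Predictor],
    train_sets: Dict[str, List[Sample]],
    test_sets: Dict[str, List[Sample]],
) -> Dict[str, Dict[str, float]]:
    """Train one detector per forgery method and score it on every method."""
    matrix: Dict[str, Dict[str, float]] = {}
    for train_method, train_samples in train_sets.items():
        predict = train(train_samples)
        matrix[train_method] = {
            test_method: accuracy(predict, test_samples)
            for test_method, test_samples in test_sets.items()
        }
    return matrix


def train_threshold(samples: List[Sample]) -> Predictor:
    """Toy stand-in for a detector (assumption, not a real model): threshold
    on the first feature, midway between the real and fake training means."""
    fake = [features[0] for features, label in samples if label == 1]
    real = [features[0] for features, label in samples if label == 0]
    threshold = (sum(fake) / len(fake) + sum(real) / len(real)) / 2
    return lambda features: int(features[0] > threshold)


# Made-up data: method_a leaves a strong artifact, method_b a weak one.
toy = {
    "method_a": [([0.9], 1), ([0.8], 1), ([0.1], 0), ([0.2], 0)],
    "method_b": [([0.4], 1), ([0.45], 1), ([0.1], 0), ([0.2], 0)],
}
# For brevity the same toy sets serve as training and test data.
matrix = cross_forgery_matrix(train_threshold, toy, toy)
for train_method, row in matrix.items():
    print(train_method, row)
```

In this toy setting the detector trained on `method_a` is perfect on `method_a` but drops to chance level on `method_b`, which is the kind of generalization gap a cross-forgery evaluation is meant to expose. The off-diagonal entries of the matrix are the ones to compare across architectures.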

Country

Italy

Related Organizations

University of Pisa, Italy
National Research Council, Italy
Institute of Information Science and Technologies "A. Faedo", Italy

Keywords

Deepfake detection, Deep Learning, Computer vision

Impact by BIP!

Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
