
Deepfakes pose a growing threat to the integrity of visual media, necessitating robust detection methods. However, existing detection approaches still struggle to reliably identify forged images and videos, particularly as modern deepfakes become increasingly realistic and indistinguishable to human viewers. This paper proposes a deepfake detection approach based on CLIP-derived vision transformers (SigLIP-2), combined with a multi-task design for classification and manipulated-region localization. The models are evaluated on three public benchmarks of increasing complexity: HiDF, SIDA, and CIFake. Our detector achieves state-of-the-art results on all three. On HiDF, it achieves an AUC of 0.931 for deepfake video detection, improving by ~0.20 over the best prior method (EB4), and a similarly high AUC of 0.968 on images. On SIDA, the model reaches 99.1% accuracy, substantially outperforming the previous 93.5% baseline while correctly localizing most tampered pixels. It also exceeds 95% accuracy on CIFake, with an AUC of 0.986. The proposed model substantially advances detection performance on challenging, realistic forgeries, providing both high accuracy and interpretable localization to support practical deepfake mitigation.
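The multi-task design described above (a shared vision-transformer backbone feeding both a real/fake classification head and a manipulated-region localization head) can be sketched roughly as follows. All dimensions, the head structure, and the `backbone` stand-in are illustrative assumptions, not the paper's actual SigLIP-2 configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (assumptions, not the paper's config):
# a 224x224 image split into 14x14 = 196 patches, 768-dim tokens.
NUM_PATCHES, DIM = 196, 768

def backbone(image_tokens):
    """Stand-in for the ViT backbone: returns per-patch features.

    In practice this would be the SigLIP-2 transformer encoder output.
    """
    return image_tokens

def classification_head(tokens, w_cls):
    """Mean-pool patch tokens, then a linear real/fake logit."""
    pooled = tokens.mean(axis=0)   # (DIM,)
    return pooled @ w_cls          # scalar logit

def localization_head(tokens, w_loc):
    """Per-patch linear logit, reshaped into a 14x14 manipulation map."""
    logits = tokens @ w_loc        # (NUM_PATCHES,)
    return logits.reshape(14, 14)

tokens = backbone(rng.standard_normal((NUM_PATCHES, DIM)))
w_cls = rng.standard_normal(DIM)
w_loc = rng.standard_normal(DIM)

fake_logit = classification_head(tokens, w_cls)  # one score per image
mask = localization_head(tokens, w_loc)          # coarse tampering map
```

Both heads share the same patch tokens, which is what lets the classifier's decision be accompanied by an interpretable localization of the tampered region.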
