Benchmarking and Cross-Platform Evaluation of Public Deepfake Detection Models on Viral Real-World Media

Abstract: Deepfakes pose serious risks to public trust and information integrity, yet the reliability of off-the-shelf detection tools on real-world deepfakes remains unclear. This study evaluates publicly accessible deepfake detection tools on 20 viral political and celebrity videos. We hypothesised that publicly available detectors would show inconsistent accuracy and produce both false positives and false negatives when applied to in-the-wild videos. To test this, we ran 20 viral clips (10 confirmed deepfakes, 10 authentic controls) through two public detection platforms, Deepware AI Scanner and UB Media Forensics Lab's DeepFake-O-Meter, and recorded ensemble and per-model likelihoods across more than ten detectors. The results showed substantial cross-platform disagreement, including frequent false positives and false negatives. One platform's ensemble flagged only a minority of confirmed deepfakes, while the research platform produced extreme per-model score variance, so that sensitivity depended strongly on how an intermediate "Suspicious" label was treated. Depending on the binary mapping used, measured sensitivity varied widely, while specificity remained high for this sample. These results demonstrate the shortcomings of existing detection technologies and the pressing need for more reliable, transparent, and robust deepfake forensic techniques. We conclude that current public detectors provide useful signals but are not yet reliable as sole arbiters of authenticity for viral content. We recommend publishing full per-video numeric outputs and versioned model identifiers, and pairing automated screening with human expert review.
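The dependence of sensitivity on the treatment of the "Suspicious" label can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the label names "Deepfake" and "Authentic", the helper functions, and the toy labels, which are invented and are not the study's per-video results.

```python
def binarize(label: str, suspicious_is_fake: bool) -> bool:
    """Map a three-way label to a binary 'flagged as deepfake' decision."""
    if label == "Deepfake":
        return True
    if label == "Suspicious":
        return suspicious_is_fake  # the mapping choice under discussion
    return False  # "Authentic"


def sensitivity_specificity(labels, truths, suspicious_is_fake):
    """Return (sensitivity, specificity) for one binary mapping."""
    preds = [binarize(label, suspicious_is_fake) for label in labels]
    tp = sum(1 for p, t in zip(preds, truths) if p and t)
    fn = sum(1 for p, t in zip(preds, truths) if not p and t)
    tn = sum(1 for p, t in zip(preds, truths) if not p and not t)
    fp = sum(1 for p, t in zip(preds, truths) if p and not t)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data (NOT the study's results): 10 deepfakes, 10 authentic clips.
truths = [True] * 10 + [False] * 10
labels = (
    ["Deepfake"] * 3 + ["Suspicious"] * 4 + ["Authentic"] * 3  # confirmed deepfakes
    + ["Authentic"] * 9 + ["Suspicious"] * 1                   # authentic controls
)

strict = sensitivity_specificity(labels, truths, suspicious_is_fake=False)
lenient = sensitivity_specificity(labels, truths, suspicious_is_fake=True)
print("Suspicious counted as authentic:", strict)
print("Suspicious counted as deepfake: ", lenient)
```

On this invented sample, counting "Suspicious" as a detection raises sensitivity from 0.3 to 0.7 while specificity falls only from 1.0 to 0.9, which is the kind of mapping sensitivity the abstract describes.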
