
doi: 10.2139/ssrn.6248880
Video anomaly detection identifies unusual events in surveillance videos and is crucial for public safety. Recent open-vocabulary video anomaly detection (OVVAD) methods leverage vision-language models to recognize unseen anomaly categories, but they share a fundamental limitation: they categorize anomalies from instantaneous frame-text similarity and therefore fail to capture the temporal dynamics essential for distinguishing complex anomalies. We propose DTVAD, a temporal-aware OVVAD framework that captures multi-scale temporal dependencies through dilated convolutions applied to frame-text cost volumes. Our framework introduces three components: a temporal-aware anomaly module that models temporal patterns; a consistency loss that enforces branch alignment; and a contrastive loss that prevents representation collapse. Experimental results on UCF-Crime and XD-Violence show that DTVAD outperforms recent state-of-the-art methods in open-vocabulary anomaly detection and categorization, validating the benefit of our architectural designs and training objectives.
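The central idea in the abstract, dilated convolutions applied along the time axis of a frame-text cost volume, can be illustrated with a minimal sketch. It is not DTVAD's implementation: the abstract gives no architectural details, so the function names (`cost_volume`, `dilated_conv1d`, `multi_scale_temporal`), the cosine-similarity cost, the fixed averaging kernel (a real model would learn its weights), the dilation rates, and the zero padding are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cost_volume(frame_feats, text_feats):
    """Assumed cost volume: a T x C matrix of frame-text similarities,
    for T frame embeddings and C category text embeddings."""
    return [[cosine(f, t) for t in text_feats] for f in frame_feats]


def dilated_conv1d(seq, kernel, dilation):
    """1D dilated convolution over time with zero padding ("same" length).
    Kernel taps are spaced `dilation` frames apart, so larger dilations
    cover a longer temporal context with the same number of weights."""
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - half) * dilation
            if 0 <= idx < len(seq):  # taps outside the clip contribute zero
                acc += w * seq[idx]
        out.append(acc)
    return out


def multi_scale_temporal(cost, kernel=(1 / 3, 1 / 3, 1 / 3), dilations=(1, 2, 4)):
    """Average several dilated convolutions per category column.
    The averaging kernel and the dilation rates are hypothetical choices."""
    n_frames, n_classes = len(cost), len(cost[0])
    out = [[0.0] * n_classes for _ in range(n_frames)]
    for c in range(n_classes):
        column = [cost[t][c] for t in range(n_frames)]
        for d in dilations:
            smoothed = dilated_conv1d(column, kernel, d)
            for t in range(n_frames):
                out[t][c] += smoothed[t] / len(dilations)
    return out
```

With dilation 2 and a three-tap kernel, each output frame aggregates similarities from frames t-2, t, and t+2, which is how a category score comes to depend on temporal context rather than on a single frame.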
