Combining Global and Local Attention with Positional Encoding for Video Summarization

Publication type: Article, Preprint, Conference object
Date: 01 Nov 2021
Publisher: IEEE
Journal: 2021 IEEE International Symposium on Multimedia (ISM)
Funded by: UKRI | Deep Learning from Crawle..., EC | MIRROR

Authors: Evlampios Apostolidis; Georgios Balaouras; Vasileios Mezaris; Ioannis Patras

doi: 10.1109/ism52913.2021.00045, 10.5281/zenodo.6683784, 10.5281/zenodo.6683785

Abstract

This paper presents a new method for supervised video summarization. To overcome drawbacks of existing RNN-based summarization architectures that relate to the modeling of long-range frame dependencies and the ability to parallelize the training process, the developed model relies on self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches that model frame dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to discover different modelings of the frame dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames, which is of major importance when producing a video summary. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. An ablation study that focuses on our main proposed components, namely the use of global and local multi-head attention mechanisms in combination with an absolute positional encoding component, shows their relative contributions to the overall summarization performance.
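The idea of pairing global and local attention with an absolute positional encoding can be illustrated with a minimal sketch. This is not the authors' implementation. Several simplifying assumptions are made here: a single attention head with identity query/key/value projections (no learned weights), a sinusoidal encoding as the absolute positional encoding, positions counted over the whole video, non-overlapping equal-length segments for the local attention, and fusion of the global and local outputs by element-wise addition. All function names are hypothetical.

```python
import math


def positional_encoding(n_frames, dim):
    """Sinusoidal absolute positional encoding (assumed variant)."""
    pe = [[0.0] * dim for _ in range(n_frames)]
    for pos in range(n_frames):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the inputs themselves
    (identity projections), to keep the sketch free of learned weights.
    """
    dim = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(dim)])
    return out


def global_local_attention(frames, n_segments):
    """Fuse attention over the full sequence with attention inside segments."""
    n, dim = len(frames), len(frames[0])
    pe = positional_encoding(n, dim)
    # Inject the temporal position of each frame into its feature vector.
    x = [[f + p for f, p in zip(frame, enc)] for frame, enc in zip(frames, pe)]

    # Global view: every frame attends to every other frame.
    global_out = self_attention(x)

    # Local view: frames attend only within their own segment.
    seg_len = math.ceil(n / n_segments)
    local_out = []
    for start in range(0, n, seg_len):
        local_out.extend(self_attention(x[start:start + seg_len]))

    # Assumed fusion: element-wise addition of the two views.
    return [[g + l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(global_out, local_out)]


# Toy input: 8 frames with 4-dimensional features, split into 2 segments.
frames = [[0.1 * (i + j) for j in range(4)] for i in range(8)]
fused = global_local_attention(frames, n_segments=2)
```

In a full model, the fused representation would typically be passed to a trainable network that outputs one importance score per frame; that part is omitted here because the abstract does not describe it.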

Keywords

self-attention, video summarization, positional encoding, multi-head attention, supervised learning

Impact by BIP!

- selected citations: 61. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.