Weakly Supervised Dense Video Captioning

Publication: Article, Preprint, Conference object
Date: 01 Jul 2017
Embargo end date: 01 Jan 2017
Publisher: IEEE
Journal: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Zhiqiang Shen; Jianguo Li; Zhou Su; Minjun Li; Yurong Chen; Yu-Gang Jiang; Xiangyang Xue

doi: 10.1109/cvpr.2017.548, 10.48550/arxiv.1704.01502

arXiv: 1704.01502

Abstract

This paper focuses on a novel and challenging vision task, dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. The proposed method is trained without explicit annotation of fine-grained sentence to video region-sequence correspondence, but is only based on weak video-level sentence annotations. It differs from existing video captioning systems in three technical aspects. First, we propose lexical fully convolutional neural networks (Lexical-FCN) with weakly supervised multi-instance multi-label learning to weakly link video regions with lexical labels. Second, we introduce a novel submodular maximization scheme to generate multiple informative and diverse region-sequences based on the Lexical-FCN outputs. A winner-takes-all scheme is adopted to weakly associate sentences to region-sequences in the training phase. Third, a sequence-to-sequence learning based language model is trained with the weakly supervised information obtained through the association process. We show that the proposed method can not only produce informative and diverse dense captions, but also outperform state-of-the-art single video captioning methods by a large margin.
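
The abstract mentions a submodular maximization scheme for picking region-sequences that are both informative and diverse, but it does not state the objective. As a rough illustration only, the sketch below runs the standard greedy algorithm on a simple coverage-style objective that is monotone submodular. The objective, the function names `coverage` and `greedy_select`, and the toy confidence matrix are all assumptions made for this example; they are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical setup: conf[i][k] is the confidence that candidate
# region-sequence i expresses lexical label k. Values are made up.


def coverage(selected: Sequence[int], conf: List[List[float]]) -> float:
    """Assumed objective: for each label, take the best confidence that any
    selected candidate offers, then sum over labels. This function is
    monotone and submodular, so greedy selection is a natural fit."""
    if not selected:
        return 0.0
    n_labels = len(conf[0])
    return sum(max(conf[i][k] for i in selected) for k in range(n_labels))


def greedy_select(conf: List[List[float]], budget: int) -> List[int]:
    """Repeatedly add the candidate with the largest marginal gain."""
    selected: List[int] = []
    remaining = set(range(len(conf)))
    while remaining and len(selected) < budget:
        base = coverage(selected, conf)
        best, best_gain = None, 0.0
        for i in sorted(remaining):
            gain = coverage(selected + [i], conf) - base
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no candidate adds anything new
            break
        selected.append(best)
        remaining.remove(best)
    return selected


conf = [
    [0.9, 0.8, 0.0, 0.0],  # candidate 0
    [0.8, 0.9, 0.1, 0.0],  # candidate 1: nearly redundant with candidate 0
    [0.0, 0.1, 0.7, 0.6],  # candidate 2: covers different labels
]
picked = greedy_select(conf, budget=2)
print(picked)
```

With this toy input the greedy pass picks candidate 1 first, then candidate 2 rather than the near-duplicate candidate 0, because a redundant candidate adds little marginal coverage. That is the sense in which a submodular objective can favor diversity alongside informativeness.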

To appear in CVPR 2017

Related Organizations

Fudan University
China (People's Republic of)
Intel (United States)
United States

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Impact by BIP!

Selected citations: 85. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Top 1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.