Publication: Article, Preprint (01 Dec 2015)

Embargo end date: 01 Jan 2015

Publisher: IEEE

Journal: 2015 IEEE International Conference on Computer Vision (ICCV)

Funded by: NSF | NRI: Collaborative Resear..., NSF | RI: Large: Collaborative ..., NSF | EAGER: Quantifying and Re...

Authors: Jeff Donahue; Subhashini Venugopalan; Kate Saenko; Trevor Darrell; Raymond J. Mooney; Marcus Rohrbach

doi: 10.1109/iccv.2015.515 , 10.48550/arxiv.1505.00487

arXiv: http://arxiv.org/abs/1505.00487

Sequence to Sequence -- Video to Text

Abstract

Real-world videos often have complex dynamics, and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (a sequence of frames) and output (a sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to generate a description of the event in the video clip. Our model is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e., a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
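The abstract's central idea, reading a variable-length sequence of frames and then emitting a variable-length sequence of words, can be illustrated with a small sketch. The code below is not the authors' implementation: it uses pure-Python LSTM cells with random, untrained weights, a separate encoder and decoder cell (an assumption made for readability), a toy vocabulary, and hypothetical names (`LSTMCell`, `build`, `caption`, `VOCAB`). It only shows how the recurrent state carries information from the frames into the word generation loop.

```python
import math
import random

VOCAB = ["<eos>", "a", "man", "is", "cooking"]  # toy vocabulary (assumption)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def one_hot(index):
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]


class LSTMCell:
    """A single LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n_in = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {name: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                         for _ in range(hidden_size)] for name in "ifoc"}
        self.b = {name: [0.0] * hidden_size for name in "ifoc"}

    def _gate(self, name, z, activation):
        return [activation(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(self.W[name], self.b[name])]

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        o = self._gate("o", z, sigmoid)
        cand = self._gate("c", z, math.tanh)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def build(feat_dim=4, hidden=8, seed=0):
    """Create an encoder cell, a decoder cell and an output projection."""
    rng = random.Random(seed)
    encoder = LSTMCell(feat_dim, hidden, rng)
    decoder = LSTMCell(len(VOCAB), hidden, rng)
    W_out = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in VOCAB]
    return encoder, decoder, W_out


def caption(frames, encoder, decoder, W_out, max_len=10):
    """Greedy decoding: encode all frames, then emit words until <eos>."""
    h = [0.0] * encoder.hidden_size
    c = [0.0] * encoder.hidden_size
    # Encoding stage: one step per frame feature vector, any number of frames.
    for x in frames:
        h, c = encoder.step(x, h, c)
    # Decoding stage: feed back the previous word; <eos> doubles as start token.
    words = []
    prev = one_hot(VOCAB.index("<eos>"))
    for _ in range(max_len):
        h, c = decoder.step(prev, h, c)
        scores = [sum(w * v for w, v in zip(row, h)) for row in W_out]
        idx = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[idx] == "<eos>":
            break
        words.append(VOCAB[idx])
        prev = one_hot(idx)
    return words


if __name__ == "__main__":
    enc, dec, W_out = build()
    rng = random.Random(1)
    for n_frames in (3, 7):  # clips of different lengths use the same model
        frames = [[rng.random() for _ in range(4)] for _ in range(n_frames)]
        print(n_frames, caption(frames, enc, dec, W_out))
```

Because the weights are random, the emitted words are meaningless. A trained model would learn the weights from video-sentence pairs, and the frame vectors would come from a visual feature extractor rather than a random generator.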

ICCV 2015 camera-ready. Includes code, project page and LSMDC challenge results

Related Organizations

University of California System
United States
University of California, Berkeley
United States
University of California, San Francisco
United States
The University of Texas at Austin
United States
University of Massachusetts System
United States

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

5 Research products

- The Long-Short Story of Movie Description (2015): IsAmongTopNSimilarDocuments
- Video Captioning with Transferred Semantic Attributes (2017): IsAmongTopNSimilarDocuments
- coco-caption software on GitHub: IsRelatedTo
- caffe software on GitHub: IsRelatedTo
- caffe software on GitHub: IsRelatedTo

Impact by BIP!

- citations: 997. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 0.1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 0.1%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 0.1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.