Video Question Answering with Iterative Video-Text Co-tokenization

Publication: Part of book or chapter of book, Article, Preprint. 01 Jan 2022. Embargo end date: 01 Jan 2022. English. Publisher: Springer Nature Switzerland

Authors: Piergiovanni, AJ; Morton, Kairo; Kuo, Weicheng; Ryoo, Michael S.; Angelova, Anelia

doi: 10.1007/978-3-031-20059-5_5, 10.48550/arxiv.2208.00934

arXiv: 2208.00934

Abstract

Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.

ECCV 2022

Related Organizations

Google (United States)
United States

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Related research

Hierarchical Relational Attention for Video Question Answering (2018), listed among the top similar documents

Impact by BIP!

selected citations: 11. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.