Question-Aware Tube-Switch Network for Video Question Answering

Publication: Article, 15 Oct 2019
Publisher: ACM
Journal: Proceedings of the 27th ACM International Conference on Multimedia

Authors: Tianhao Yang; Zheng-Jun Zha; Hongtao Xie; Meng Wang; Hanwang Zhang

doi: 10.1145/3343031.3350969

Abstract

Video Question Answering (VideoQA), the task of answering questions about videos, involves rich spatio-temporal content (e.g., appearance and motion) and requires a multi-hop reasoning process. However, existing methods usually handle appearance and motion separately and fail to synchronize the attention over appearance and motion features. They therefore neglect two key properties of VideoQA: (1) appearance and motion features are usually concomitant and complementary to each other at the time-slice level, and some questions rely on joint representations of both kinds of features at some point in the video; (2) appearance and motion differ in importance across the steps of multi-step reasoning. In this paper, we propose a novel Question-Aware Tube-Switch Network (TSN) for video question answering, which contains (1) a Mix module that synchronously combines the appearance and motion representations at the time-slice level, achieving fine-grained temporal alignment and correspondence between appearance and motion at every time slice, and (2) a Switch module that adaptively chooses the appearance or the motion tube as primary at each reasoning step, guiding the multi-hop reasoning process. To train TSN end to end, we use the Gumbel-Softmax strategy to account for the discrete tube-switch process. Extensive experiments on two benchmarks, MSVD-QA and MSRVTT-QA, demonstrate that the proposed TSN consistently outperforms state-of-the-art methods on all metrics.
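The abstract names Gumbel-Softmax as the way the discrete tube-switch decision is made trainable. The following is a minimal sketch of that sampling step in plain Python, not the authors' implementation: the function names `gumbel_softmax` and `switch_step`, the two-way logits, and the temperature `tau` are assumptions introduced here for illustration. It perturbs the logits with Gumbel noise, applies a temperature-scaled softmax to obtain a relaxed (soft) choice, and takes the argmax as the hard choice of primary tube.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return (soft, hard): a relaxed sample and its one-hot argmax.

    Hypothetical helper for illustration; `tau` is the softmax temperature.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        # u is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)

    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]

    # Hard one-hot choice at the position of the largest soft weight.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


def switch_step(appearance, motion, logits, tau=1.0, rng=random):
    """Pick the primary tube for one reasoning step (assumed two-way switch).

    logits[0] scores the appearance tube, logits[1] the motion tube.
    """
    soft, hard = gumbel_softmax(logits, tau, rng)
    primary = appearance if hard[0] == 1.0 else motion
    return primary, soft
```

In an automatic-differentiation framework, a common pattern is the straight-through estimator: the hard one-hot choice is used in the forward pass while gradients flow through the soft weights. The sketch above shows only the sampling, since the source does not specify how the switch logits are computed or how gradients are routed.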

Related Organizations

Hefei University of Technology
China (People's Republic of)
University of Science and Technology of China
China (People's Republic of)
Nanyang Technological University
Singapore

Related research

Spatiotemporal-Textual Co-Attention Network for Video Question Answering (2019)
Hierarchical Relational Attention for Video Question Answering (2018)

Impact by BIP!

selected citations: 19. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.