DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

Type: Article, Preprint, Other literature type

Date: 01 Jan 2022

Embargo end date: 01 Jan 2021

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Journal: IEEE Transactions on Multimedia, volume 24, pages 3369-3380 (ISSN: 1520-9210, eISSN: 1941-0077)

Authors: Jianyu Wang; Bing-Kun Bao; Changsheng Xu

doi: 10.1109/tmm.2021.3097171, 10.48550/arxiv.2107.04768

arXiv: 2107.04768

Abstract

Video question answering (VideoQA) is a challenging task that requires agents to understand rich video content and perform spatio-temporal reasoning. However, existing graph-based methods perform multi-step reasoning poorly because they neglect two properties of VideoQA: (1) even for the same video, different questions may require different numbers of video clips or objects to infer the answer through relational reasoning; (2) during reasoning, appearance and motion features have a complicated interdependence, being both correlated with and complementary to each other. Based on these observations, we propose a Dual-Visual Graph Reasoning Unit (DualVGR) that reasons over videos in an end-to-end fashion. The first contribution of DualVGR is an explainable Query Punishment Module, which filters out irrelevant visual features through multiple cycles of reasoning. The second contribution is a Video-based Multi-view Graph Attention Network, which captures the relations between appearance and motion features. DualVGR achieves state-of-the-art performance on the MSVD-QA and SVQA benchmark datasets and competitive results on the MSRVTT-QA benchmark dataset. Our code is available at https://github.com/MMIR/DualVGR-VideoQA.

12 pages, 12 figures

Related Organizations

- Chinese Academy of Sciences, China (People's Republic of)
- Institute of Automation, China (People's Republic of)
- Nanjing University of Posts and Telecommunications, China (People's Republic of)

Keywords

FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)

Related research

Hierarchical Relational Attention for Video Question Answering (2018)

Impact by BIP!

- Selected citations: 42. These citations are derived from selected sources. This is an alternative to the "Influence" indicator.
- Popularity: Top 10%. Reflects the "current" impact or attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. Reflects the overall impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 1%. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.