
Video Question Answering (VideoQA) requires understanding the connections among context-specific video parts that are temporally distributed. Humans can focus on temporally distributed video scenes and also find correspondences or relationships among these segments. To achieve a similar capability, this paper proposes a hierarchical relational attention mechanism. The proposed VideoQA model derives attention over temporal segments, i.e., video features, conditioned on each question word. The contextual relevance of these temporal segments is then captured to derive the final video representation, which leads to better reasoning capability. We evaluate the proposed approach on the MSRVTT-QA and MSVD-QA datasets and establish its superior performance over the state of the art.
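As a rough illustration of the two attention stages described above, the following PyTorch sketch computes per-question-word attention over temporal video segments, then weighs the resulting per-word video summaries by a learned relevance score to form the final video representation. The dimensions, layer names, and scoring functions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGuidedTemporalAttention(nn.Module):
    """Sketch of hierarchical relational attention (assumed design):
    stage 1 attends over temporal segments per question word,
    stage 2 weighs the per-word summaries into one video vector."""

    def __init__(self, d_word=300, d_vid=2048, d_hid=512):
        super().__init__()
        self.q_proj = nn.Linear(d_word, d_hid)  # project question words
        self.v_proj = nn.Linear(d_vid, d_hid)   # project video segments
        self.rel_score = nn.Linear(d_hid, 1)    # stage-2 relevance score

    def forward(self, words, segments):
        # words:    (B, Nw, d_word) question word embeddings
        # segments: (B, Nt, d_vid)  temporal video segment features
        q = self.q_proj(words)                    # (B, Nw, d_hid)
        v = self.v_proj(segments)                 # (B, Nt, d_hid)

        # Stage 1: each question word attends over temporal segments.
        scores = torch.bmm(q, v.transpose(1, 2))  # (B, Nw, Nt)
        alpha = F.softmax(scores, dim=-1)
        per_word_vid = torch.bmm(alpha, v)        # (B, Nw, d_hid)

        # Stage 2: weigh per-word summaries by contextual relevance
        # to obtain a single video representation.
        beta = F.softmax(self.rel_score(per_word_vid), dim=1)  # (B, Nw, 1)
        return (beta * per_word_vid).sum(dim=1)   # (B, d_hid)
```

The two-stage design mirrors the description in the abstract: the first softmax localizes question-relevant temporal segments, and the second aggregates across words so that segments consistently relevant to the question dominate the final representation.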
scene understanding, hierarchical relational attention, Visual Question Answering (VQA)
