Structured Attentions for Visual Question Answering

Preprint (English)
Zhu, Chen; Zhao, Yanpeng; Huang, Shuaiyi; Tu, Kewei; Ma, Yi
  • Subject: Computer Science - Computer Vision and Pattern Recognition

Visual attention, which assigns weights to image regions according to their relevance to a question, is considered indispensable by most Visual Question Answering models. Although questions may involve complex relations among multiple regions, few attention […]
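The attention mechanism the abstract describes can be sketched as question-guided soft attention: each region receives a relevance score against the question embedding, the scores are normalized with a softmax, and the weighted sum of region features is passed on. This is a minimal illustrative sketch assuming dot-product scoring; the function names, shapes, and scoring rule are assumptions for exposition, not the paper's model.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_attention(region_feats, question_vec):
    """Question-guided soft attention over image regions (illustrative).

    region_feats: (R, D) array, one D-dim feature per image region.
    question_vec: (D,) question embedding.
    Returns (weights, attended): the (R,) attention distribution and
    the (D,) attention-weighted combination of region features.
    """
    scores = region_feats @ question_vec   # relevance of each region to the question
    weights = softmax(scores)              # normalized attention over regions
    attended = weights @ region_feats      # weighted sum of region features
    return weights, attended
```

Models that capture relations among multiple regions, as the abstract motivates, would replace the independent per-region softmax with a structured distribution over subsets of regions.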