Image Captioning with Deep Bidirectional LSTMs

Preprint (English)
Wang, Cheng ; Yang, Haojin ; Bartz, Christian ; Meinel, Christoph (2016)
  • Subject: Computer Science - Computation and Language | Computer Science - Computer Vision and Pattern Recognition | Computer Science - Multimedia

This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks, and is capable of learning long-term visual-language interactions by exploiting both history and future context in a high-level semantic space. Two novel deep bidirectional variants, which increase the depth of the nonlinear transition in different ways, are proposed to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirroring are proposed to prevent overfitting when training deep models. We visualize the evolution of the bidirectional LSTM's internal states over time and qualitatively analyze how our models "translate" an image into a sentence. The proposed models are evaluated on caption generation and image-sentence retrieval tasks using three benchmark datasets: Flickr8K, Flickr30K, and MSCOCO. We demonstrate that bidirectional LSTM models achieve performance highly competitive with the state of the art on caption generation, even without integrating additional mechanisms (e.g., object detection or attention models), and significantly outperform recent methods on retrieval tasks.
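To make the architecture described above concrete, the sketch below shows one plausible reading of it: a CNN feature vector conditions two separate LSTMs that read the caption left-to-right and right-to-left, each predicting words in its own direction. This is a minimal illustrative example in PyTorch, not the authors' implementation; the class name `BiLSTMCaptioner`, the layer sizes, and the way the image feature is injected as a first "token" are all assumptions for demonstration.

```python
# Minimal sketch (PyTorch) of a bidirectional LSTM captioner: a CNN encodes
# the image, and two separate LSTMs read the caption in forward and backward
# order. Names and hyperparameters are illustrative, not the paper's code.
import torch
import torch.nn as nn

class BiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=4096, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # project CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        # Two separate LSTMs: one reads left-to-right, one right-to-left.
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fwd_out = nn.Linear(hidden_dim, vocab_size)
        self.bwd_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_feats, captions):
        # Prepend the projected image feature as the first step so the visual
        # context conditions both directions (one assumed design choice).
        img = self.img_proj(cnn_feats).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions)                     # (B, T, E)
        fwd_in = torch.cat([img, words], dim=1)          # left-to-right input
        bwd_in = torch.cat([img, words.flip(1)], dim=1)  # right-to-left input
        fwd_h, _ = self.fwd_lstm(fwd_in)
        bwd_h, _ = self.bwd_lstm(bwd_in)
        # Per-direction word logits over the vocabulary at every time step.
        return self.fwd_out(fwd_h), self.bwd_out(bwd_h)

feats = torch.randn(2, 4096)             # e.g. CNN fc-layer features
caps = torch.randint(0, 1000, (2, 12))   # toy token ids
fwd_logits, bwd_logits = BiLSTMCaptioner(vocab_size=1000)(feats, caps)
print(fwd_logits.shape, bwd_logits.shape)  # torch.Size([2, 13, 1000]) each
```

In a setup like this, each direction would be trained with cross-entropy against the next (or, for the backward LSTM, previous) word, and at inference the two directions' sentence probabilities can be compared or combined to pick the final caption.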