publication . Conference object . Preprint . 2016

Image Captioning with Deep Bidirectional LSTMs

Wang, Cheng; Yang, Haojin; Bartz, Christian; Meinel, Christoph
Open Access
  • Published: 04 Apr 2016
  • Publisher: ACM Press
Abstract
Comment: accepted by ACMMM 2016 as full paper and oral presentation
Subjects
free text keywords: Artificial intelligence, Sentence, Closed captioning, Computer vision, Machine learning, Visual language, Attention model, Overfitting, Computer science, Speech recognition, Deep learning, Convolutional neural network, Object detection, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Multimedia
References (first 15 of 40)

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.

[2] X. Chen and C. Lawrence Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431, 2015.

[3] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.

[4] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.

[5] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, and J. Platt. From captions to visual concepts and back. In CVPR, pages 1473–1482, 2015.

[6] F. Feng, X. Wang, and R. Li. Cross-modal retrieval with correspondence autoencoder. In ACMMM, pages 7–16. ACM, 2014.

[7] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.

[8] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649. IEEE, 2013.

[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACMMM, pages 675–678. ACM, 2014.

[10] A. Karpathy, A. Joulin, and F-F. Li. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, pages 1889–1897, 2014.

[11] A. Karpathy and F-F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.

[12] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In ICML, pages 595–603, 2014.

[13] R. Kiros, R. Salakhutdinov, and R. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.

[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[15] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 35(12):2891–2903, 2013.
