Publication . Article . Preprint . 2020 . Embargo end date: 01 Jan 2020

UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning

Lam, Quan Hoang; Le, Quang Duy; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy
Open Access
Published: 01 Feb 2020
Publisher: arXiv
Image Captioning, the task of automatically generating captions for images, has in recent years attracted attention from researchers in many fields of computer science, including computer vision, natural language processing and machine learning. This paper contributes to research on the Image Captioning task by extending it to a new language: Vietnamese. So far, no Image Captioning dataset exists for the Vietnamese language, so building one is the first fundamental step toward developing Vietnamese Image Captioning. To this end, we first build a dataset, which we call UIT-ViIC, containing manually written captions for images from the Microsoft COCO dataset that relate to sports played with balls. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images. We then evaluate our dataset on deep neural network models and compare the results with those for an English dataset and for two Vietnamese datasets built by different methods. UIT-ViIC is published on our lab website for research purposes.
Comment: Submitted to the 2020 ICCCI Conference (The 12th International Conference on Computational Collective Intelligence)
Subjects by Vocabulary

ACM Computing Classification System: ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION


Computation and Language (cs.CL), FOS: Computer and information sciences, Computer Science - Computation and Language
