Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition

Publication: Article, Preprint, Conference object
Date: 01 Jul 2017
Embargo end date: 01 Jan 2017
Publisher: IEEE
Journal: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Yufei Wang; Zhe Lin; Xiaohui Shen; Scott Cohen; Garrison W. Cottrell

doi: 10.1109/cvpr.2017.780, 10.48550/arxiv.1704.06972

arXiv: 1704.06972



Abstract

Recently, there has been a lot of interest in automatically generating descriptions for an image. Most existing language-model based approaches for this task learn to generate an image description word by word in its original word order. However, for humans, it is more natural to locate the objects and their relationships first, and then elaborate on each object, describing notable attributes. We present a coarse-to-fine method that decomposes the original image description into a skeleton sentence and its attributes, and generates the skeleton sentence and attribute phrases separately. By this decomposition, our method can generate more accurate and novel descriptions than the previous state-of-the-art. Experimental results on the MS-COCO and a larger scale Stock3M datasets show that our algorithm yields consistent improvements across different evaluation metrics, especially on the SPICE metric, which has much higher correlation with human ratings than the conventional metrics. Furthermore, our algorithm can generate descriptions with varied length, benefiting from the separate control of the skeleton and attributes. This enables image description generation that better accommodates user preferences.
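The decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the paper generates skeletons and attributes with learned models, whereas the sketch below only shows the data decomposition and the length control it allows. The hand-written token annotations, the function names `decompose` and `recombine`, and the `max_attr_words` parameter are all hypothetical, and the sketch assumes each skeleton head word is unique and that attributes precede the word they modify.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical annotation: (word, head). head is None for skeleton words,
# otherwise it names the skeleton word that this attribute word modifies.
Token = Tuple[str, Optional[str]]


def decompose(tokens: List[Token]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split an annotated caption into a skeleton and per-object attributes."""
    skeleton: List[str] = []
    attributes: Dict[str, List[str]] = {}
    for word, head in tokens:
        if head is None:
            skeleton.append(word)
        else:
            attributes.setdefault(head, []).append(word)
    return skeleton, attributes


def recombine(skeleton: List[str],
              attributes: Dict[str, List[str]],
              max_attr_words: int) -> str:
    """Rebuild a caption, keeping at most max_attr_words attributes per object."""
    out: List[str] = []
    for word in skeleton:
        attrs = attributes.get(word, [])
        # Keep the attribute words closest to the head word.
        kept = attrs[-max_attr_words:] if max_attr_words > 0 else []
        out.extend(kept)
        out.append(word)
    return " ".join(out)


# Hand-annotated example caption (an assumption, not taken from the paper).
caption: List[Token] = [
    ("a", None), ("small", "dog"), ("brown", "dog"), ("dog", None),
    ("sitting", None), ("on", None), ("a", None),
    ("wooden", "bench"), ("bench", None),
]

skeleton, attributes = decompose(caption)
short_caption = recombine(skeleton, attributes, 0)
medium_caption = recombine(skeleton, attributes, 1)
full_caption = recombine(skeleton, attributes, 2)
```

Varying `max_attr_words` yields captions of different lengths from the same skeleton, which mirrors in miniature the separate control of skeleton and attributes mentioned in the abstract.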

Accepted by CVPR 2017

Related Organizations

Adobe Systems (United States)
United States
University of California, San Diego
United States
University of California
United States
University of California, San Francisco
United States

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Impact by BIP!

selected citations: 60. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Open access route: Green

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
