Publication: Article, Other literature type, Preprint

Date: 01 Jan 2022

Embargo end date: 01 Jan 2022

Publisher: Association for Computational Linguistics (ACL)

Journal: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Funded by: NSERC | unidentified

Authors: Benno Krojer; Vaibhav Adlakha; Vibhav Vineet; Yash Goyal; Edoardo Maria Ponti; Siva Reddy

doi: 10.18653/v1/2022.acl-long.241, 10.5281/zenodo.6518944, 10.60692/1p1hs-33n09, 10.5281/zenodo.6518943, 10.60692/tzwnk-zd096, 10.48550/arxiv.2203.15867

arXiv: http://arxiv.org/abs/2203.15867

Image Retrieval from Contextual Descriptions

Abstract

This upload contains the images of our dataset. For the rest, please refer to: https://github.com/McGill-NLP/imagecode

Abstract: The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans. Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDe will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences.
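The retrieval setup described in the abstract (pick one image out of a candidate set given a description, then report accuracy) can be illustrated with a minimal sketch of bi-encoder-style scoring. This is not the authors' code: the function names `cosine`, `retrieve`, and `accuracy` are hypothetical, the vectors are toy stand-ins for embeddings that a model such as CLIP would produce, and cosine similarity is assumed as the scoring function.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(text_vec: Vector, image_vecs: List[Vector]) -> int:
    """Return the index of the candidate image most similar to the description."""
    scores = [cosine(text_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[Tuple[Vector, List[Vector], int]]) -> float:
    """Percentage of examples where the retrieved index equals the target index."""
    correct = sum(
        1 for text_vec, image_vecs, target in examples
        if retrieve(text_vec, image_vecs) == target
    )
    return 100.0 * correct / len(examples)


# Toy data (assumption): 10 candidates that differ only slightly from each
# other, loosely mimicking a set of minimally contrastive images.
candidates = [[1.0, 0.1 * i, 0.0] for i in range(10)]

examples = [
    ([1.0, 0.3, 0.0], candidates, 3),  # description matches candidate 3
    ([1.0, 0.0, 0.0], candidates, 1),  # description matches candidate 0, target is 1
]

print(retrieve(examples[0][0], candidates))
print(accuracy(examples))
```

In this toy run the first description is retrieved correctly and the second is not, so the accuracy is 50.0. A real evaluation would replace the toy vectors with model-produced embeddings, and cross-encoders such as ViLBERT would score each description-image pair jointly rather than by comparing separate embeddings.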

Related Organizations

Samsung (South Korea) | Korea (Republic of)
Microsoft Research (India) | India
McGill University | Canada
Centre Universitaire de Mila | Algeria
Microsoft Research (United Kingdom) | United Kingdom

Keywords

FOS: Computer and information sciences, Artificial intelligence, Computer Vision and Pattern Recognition (cs.CV), Language Grounding, Computer Science - Computer Vision and Pattern Recognition, Image Retrieval, NLP, Image Feature Retrieval and Recognition Techniques, Artificial Intelligence, Shape Matching and Object Recognition, Image (mathematics), Information retrieval, Object Recognition, Multi-Modality, Computer Science - Computation and Language, Cross-Modal Retrieval, Vision-and-Language, Computer science, Computer Science, Physical Sciences, Computer vision, Computer Vision and Pattern Recognition, Content-Based Image Retrieval, Image retrieval, Computation and Language (cs.CL), Feature Matching

Related research products (2)

CLIP software on GitHub (IsRelatedTo)
imagecode software on GitHub (IsRelatedTo)

Impact by BIP!

citations: 9. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.