Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Publication: Article, Conference object, Preprint, Other literature type

Date: 01 Jan 2022

Embargo end date: 01 Jan 2022

Country: Netherlands

Publisher: Association for Computational Linguistics (ACL)

Journal: Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)

Funded by: EC | NL4XAI

Authors: Michele Cafagna; Kees van Deemter; Albert Gatt

doi: 10.18653/v1/2022.umios-1.6, 10.5281/zenodo.7669907, 10.5281/zenodo.7669908, 10.48550/arxiv.2211.04971, 10.5281/zenodo.10723000

arXiv: 2211.04971

Abstract

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.

Country

Netherlands

Keywords

FOS: Computer and information sciences, image captioning, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, multimodal grounding, vision and language, Computation and Language (cs.CL)

Impact by BIP!

selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).

popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.

influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).

impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.