Vision-Language Pretraining for Variable-Shot Image Classification

Publication: Part of book or chapter of book, Article, Conference object | 01 Jan 2025 | English | Publisher: Springer Nature Singapore

Authors: Papadopoulos, Sotirios; Ioannidis, Konstantinos; Vrochidis, Stefanos; Kompatsiaris, Ioannis (Yiannis); Patras, Ioannis

DOI: 10.1007/978-981-96-2071-5_21, 10.5281/zenodo.14024546, 10.5281/zenodo.14024545


Abstract

Contrastively pretrained vision-language models (VLMs) such as CLIP have shown impressive zero-shot classification performance without any classification-specific training. They create a common embedding space by contrastively pretraining an image encoder and a text encoder to align positive image-text pairs and repel negative pairs. Zero-shot classification of an image can then be performed by measuring the cosine similarities between the image embedding and the embeddings of texts that describe the classes. However, relevant works do not address the scenario in which a few image examples are available for some (but not all) classes. In this novel task, which we term variable-shot (v-shot) classification, these models fail due to the modality gap of the embedding space, i.e. the fact that image-to-image similarities are higher than image-to-text ones. To address this, we propose to enable v-shot capabilities in pre-trained VLMs with minimal training complexity by re-projecting the embeddings of frozen pre-trained image encoders using a shallow network, RectNet, which we train both with the standard CLIP contrastive loss function and with a novel modality alignment loss function specifically constructed to bridge the modality gap. Finally, we introduce three v-shot classification benchmarks, on which the proposed architecture achieves 32.22%, 29.58% and 45.15% increases in top-1 classification accuracy, respectively.
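The failure mode described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the three-dimensional vectors below are hypothetical toy embeddings, and the third coordinate is an assumed stand-in for the modality gap (positive for images, negative for texts). Classification picks the class whose prototype has the highest cosine similarity to the query image embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(query, prototypes):
    """Return the class whose prototype is most similar to the query."""
    return max(prototypes, key=lambda name: cosine(query, prototypes[name]))

# Hypothetical toy embeddings: the first two coordinates carry class
# content (cat, dog); the third is an assumed modality offset
# (+1 for images, -1 for texts).
text_cat = [1.0, 0.0, -1.0]
text_dog = [0.0, 1.0, -1.0]
image_dog_example = [0.2, 1.0, 1.0]   # a single "shot" for the dog class
query_cat_image = [1.0, 0.2, 1.0]     # an image of a cat to classify

# Zero-shot: every class is represented by a text embedding.
zero_shot = {"cat": text_cat, "dog": text_dog}

# Variable-shot: "dog" has an image example, "cat" only has text.
v_shot = {"cat": text_cat, "dog": image_dog_example}

zero_shot_prediction = classify(query_cat_image, zero_shot)
v_shot_prediction = classify(query_cat_image, v_shot)
```

In this toy setup the zero-shot prediction is correct, because both prototypes sit on the text side of the gap and only class content separates them. In the v-shot setup the image prototype for the wrong class wins, since image-to-image similarity dominates image-to-text similarity regardless of class content.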

Related Organizations

Queen Mary University of London, United Kingdom
Centre for Research and Technology Hellas, Greece

Keywords

Machine learning, Computer vision

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
