Contrastively pretrained vision-language models (VLMs) such as CLIP achieve impressive zero-shot classification performance without any classification-specific training. They learn a common embedding space by contrastively pretraining an image encoder and a text encoder to align positive image-text pairs and repel negative ones. An image can then be classified zero-shot by measuring the cosine similarity between its embedding and the embeddings of texts that describe the classes. However, existing work does not address the scenario in which a few image examples are available for some, but not all, classes. In this novel task, which we term variable-shot (v-shot) classification, these models fail because of the modality gap in the embedding space, i.e., the fact that image-to-image similarities are higher than image-to-text ones. To address this, we propose to enable v-shot capabilities in pretrained VLMs with minimal training complexity by re-projecting the embeddings of frozen pretrained image encoders with a shallow network, RectNet, which we train with both the standard CLIP contrastive loss and a novel modality alignment loss specifically constructed to bridge the modality gap. Finally, we introduce three v-shot classification benchmarks, on which the proposed architecture achieves increases of 32.22%, 29.58%, and 45.15% in top-1 classification accuracy, respectively.
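The cosine-similarity classification rule and the v-shot failure mode described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional vectors are hand-crafted toy embeddings (not CLIP outputs), the constant third coordinate is a stand-in for the modality gap, and the helper names `l2_normalize` and `classify_by_cosine` are hypothetical. It does not implement RectNet or the proposed losses.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_by_cosine(image_emb, class_embs):
    """Return (predicted class index, cosine similarities) for one image embedding."""
    image_emb = l2_normalize(np.asarray(image_emb, dtype=float))
    class_embs = l2_normalize(np.asarray(class_embs, dtype=float))
    sims = class_embs @ image_emb
    return int(np.argmax(sims)), sims


# Toy embeddings (assumed, hand-crafted): text vectors lie in the first two
# dimensions, while image vectors share a constant offset in the third one.
cat_text = [1.0, 0.0, 0.0]
dog_text = [0.0, 1.0, 0.0]
dog_image_example = [0.0, 1.0, 2.0]  # a single image "shot" for the dog class
query_cat_image = [0.9, 0.1, 2.0]    # query image that actually shows a cat

# Zero-shot: every class is represented by a text embedding.
zero_shot_pred, zero_shot_sims = classify_by_cosine(
    query_cat_image, [cat_text, dog_text]
)

# V-shot: "cat" is still described by text, "dog" by an image example.
v_shot_pred, v_shot_sims = classify_by_cosine(
    query_cat_image, [cat_text, dog_image_example]
)
```

In this toy setup the zero-shot comparison selects index 0 (cat), whereas the mixed text/image comparison selects index 1 (dog): the query is closer to any image embedding than to any text embedding, so the class that happens to have an image example wins regardless of content.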
Machine learning, Computer vision