How does the performance of Flamingo compare to PaLI and BLIVA in zero-shot cross-modal retrieval tasks, particularly on benchmarks like MSCOCO and Flickr30K?

SOVEREIGN Research Kernel

Publication type: Report; Status: Under curation; Language: English; Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20440902

Abstract

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach to allow the VLM to deal with multi

Research goal: How does the performance of Flamingo compare to PaLI and BLIVA in zero-shot cross-modal retrieval tasks, particularly on benchmarks like MSCOCO and Flickr30K?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.8/10.
