
Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing vision-language Model with Multi-Modal In-Context Learning(MMICL), a new approach to allow the VLM to deal with multiResearch goal: How does the performance of Flamingo compare to PaLI and BLIVA in zero-shot cross-modal retrieval tasks, particularly on benchmarks like MSCOCO and Flickr30K?Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.8/10.
