
Vision-Language Pretrained (VLP) models have achieved impressive performance on multimodal tasks, including text-image retrieval, based on dense representations. Meanwhile, Learned Sparse Retrieval (LSR) has gained traction in text-only settings due to its interpretability and efficiency with fast term-based lookup via inverted indexes. Inspired by these advantages, recent work has extended LSR to the multimodal domain. However, these methods often rely on computationally expensive contrastive pre-training, or distillation from a frozen dense model, which limits the potential for mutual enhancResearch goal: What is the comparative effect of graph sparsity versus density on the F1-score performance of retrieval-augmented generation models in zero-shot document categorization?Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 7.6/10.
