ViMoE: Vision Mixture of Experts with Multimodal Context Awareness

Publication: Article, 31 Jan 2026
Publisher: GSC Online Press
Journal: World Journal of Advanced Research and Reviews, volume 29, pages 1886-1901 (eISSN: 2581-9615)

Authors: Chinda, Adele

DOI: 10.30574/wjarr.2026.29.1.0242, 10.5281/zenodo.18477601, 10.5281/zenodo.18477602

Abstract

Multimodal large language models (MLLMs) rely heavily on vision encoders to understand diverse image content. While recent approaches have explored combining multiple vision experts to address the limitations of single encoders, they typically perform image-level expert selection and fusion, ignoring the spatial heterogeneity within images where different regions may benefit from different experts. In this paper, we propose ViMoE (Vision Mixture of Experts with Multimodal Context Awareness), a novel MLLM that introduces three key innovations: (1) Token-Level Sparse Expert Activation (TLSEA) that enables different spatial tokens to utilize different expert combinations, allowing fine-grained, content-aware feature extraction; (2) Hierarchical Context Aggregation (HCA) that captures multi-scale visual context to guide expert routing at different granularities; and (3) Expert Confidence Calibration (ECC) that learns to estimate and calibrate expert contribution confidence to reduce noise from unreliable features. Through these innovations, ViMoE achieves more precise expert utilization by recognizing that a single image often contains diverse content requiring different visual expertise. Extensive experiments demonstrate that ViMoE achieves significant improvements over state-of-the-art methods across challenging multimodal benchmarks including MME, MMBench, and various VQA tasks, while maintaining computational efficiency through sparse activation patterns. Code is available at: https://arrel.github.io/vimoe/
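
The token-level routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function `route_tokens`, the inputs `router_logits` and `confidences`, and the choice to multiply each gate by a per-expert confidence before renormalising are all assumptions made for illustration. The sketch covers only sparse top-k expert selection per token with a simple confidence weighting; hierarchical context aggregation is not modelled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(router_logits, expert_features, confidences, top_k=2):
    """Fuse expert features per token with sparse top-k gating.

    router_logits[t][e]   : hypothetical routing score of expert e for token t
    expert_features[e][t] : feature vector produced by expert e for token t
    confidences[e]        : hypothetical confidence in (0, 1] for expert e
    Returns (fused, chosen): one fused vector and one expert list per token.
    """
    fused, chosen = [], []
    for t, logits in enumerate(router_logits):
        # Sparse activation: keep only the top_k highest-scoring experts.
        top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
        gates = softmax([logits[e] for e in top])
        # Assumed calibration step: scale gates by confidence, then renormalise.
        weights = [g * confidences[e] for g, e in zip(gates, top)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        dim = len(expert_features[top[0]][t])
        fused.append([
            sum(w * expert_features[e][t][d] for w, e in zip(weights, top))
            for d in range(dim)
        ])
        chosen.append(top)
    return fused, chosen


if __name__ == "__main__":
    # Two tokens, three experts, two-dimensional features (toy values).
    logits = [[2.0, 0.1, -1.0], [-1.0, 0.2, 3.0]]
    feats = [
        [[1.0, 0.0], [1.0, 0.0]],  # expert 0
        [[0.0, 1.0], [0.0, 1.0]],  # expert 1
        [[1.0, 1.0], [1.0, 1.0]],  # expert 2
    ]
    fused, chosen = route_tokens(logits, feats, [1.0, 1.0, 1.0], top_k=2)
    print(chosen)  # different tokens select different expert subsets
```

In this toy input the two tokens select different expert pairs, which is the behaviour that image-level selection cannot express. Only the selected experts contribute to each fused vector, so the cost of fusion per token scales with `top_k` rather than with the total number of experts.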

Related Organizations

Georgia State University
United States

Keywords

Confidence calibration, Hierarchical context aggregation, Sparse expert activation, Vision Mixture of Experts, Token-level routing, Multimodal large language model

Impact by BIP!

	selected citations: 0. Derived from selected sources; an alternative to the "influence" indicator.
	popularity: Average. The "current" impact or attention of the article in the research community at large, based on the underlying citation network.
	influence: Average. The overall impact of the article in the research community at large, based on the underlying citation network (diachronically).
	impulse: Average. The initial momentum of the article directly after its publication, based on the underlying citation network.
