ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching

SOVEREIGN Research Kernel

Data sources: ZENODO

Publication type: Report | Status: Under curation | Language: English | Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20417892

Abstract

Sparse Mixture-of-Experts (MoE) models can outperform dense large language models at similar computation by activating only a small set of experts per token. However, stacking many expert modules introduces substantial parameter memory, which makes MoE models difficult to deploy in memory-constrained environments such as single-GPU devices. Offloading alleviates this issue by storing inactive experts in CPU memory and loading them on demand, but existing methods remain limited: static caches disregard input-dependent routing, and methods that train separate models to predict expert usage ahead [...]

Research goal: Does ExpertFlow's offloading and caching mechanism maintain inference throughput gains without degrading object-level hallucination metrics (e.g., POPE) across different MoE-VLM architectures when compared to static cache baselines?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
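To make the contrast in the abstract concrete, the following is a minimal sketch of expert offloading with two toy caches: a static cache that pins a fixed set of experts, and an on-demand cache that loads experts when the router asks for them and evicts the least recently used one. It is not ExpertFlow's method, and it does not model prediction of future experts. The class names, the routing trace, and the string placeholders standing in for expert weights are all assumptions made for illustration; a cache miss here only counts what would be a CPU-to-GPU transfer in a real system.

```python
from collections import OrderedDict


class StaticExpertCache:
    """Toy static cache: a fixed set of experts stays resident (hypothetical)."""

    def __init__(self, pinned_ids):
        self.pinned = set(pinned_ids)
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.pinned:
            self.hits += 1
        else:
            # Stands in for loading the expert from CPU memory every time.
            self.misses += 1
        return f"weights[{expert_id}]"


class OnDemandExpertCache:
    """Toy on-demand cache with least-recently-used eviction (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> placeholder weights
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # stands in for a CPU-to-GPU transfer
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[expert_id] = f"weights[{expert_id}]"
        return self.resident[expert_id]


# Made-up routing trace: the input shifts from experts {0, 1} to {4, 5}.
trace = [0, 1, 0, 1, 4, 5, 4, 5, 4, 5]

static_cache = StaticExpertCache(pinned_ids=[0, 1])
on_demand_cache = OnDemandExpertCache(capacity=2)
for expert_id in trace:
    static_cache.fetch(expert_id)
    on_demand_cache.fetch(expert_id)

print("static   :", static_cache.hits, "hits,", static_cache.misses, "misses")
print("on-demand:", on_demand_cache.hits, "hits,", on_demand_cache.misses, "misses")
```

On this trace the static cache keeps missing once routing moves away from its pinned experts, while the on-demand cache pays two misses at the shift and then hits again. That is the input-dependent behavior a static cache cannot follow.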
