Powered by OpenAIRE graph
Report
Data sources: ZENODO

Cut MoE Inference Costs by 60–80%

Authors: Tang, Rujing;


Abstract

Modular MoE restructures how expert weights are stored, routed, and updated. Instead of keeping every expert resident across 5–8 GPUs, we extract a frozen shared core, compress the per-expert residuals by 8–16× using hierarchical shared-core extraction combined with S2LC (Shared Spectral Low-Rank Compression), and load only the active domain module on demand. The result: 1–2 GPUs per instance, sub-millisecond domain switching, and the ability to add or roll back capabilities without retraining.
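To make the storage arithmetic concrete, here is a minimal sketch of the general idea of a shared core plus low-rank per-expert residuals. The abstract does not specify how the core is extracted or how S2LC works internally, so this sketch makes two loudly stated assumptions: the core is taken as the element-wise mean of the expert matrices, and each residual is compressed with a plain truncated SVD. All function names and matrix sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def split_shared_core(expert_weights):
    """Assumption: the shared core is the element-wise mean of all experts."""
    core = np.mean(expert_weights, axis=0)
    residuals = [w - core for w in expert_weights]
    return core, residuals


def low_rank_factors(residual, rank):
    """Truncated SVD (a stand-in for S2LC): residual ~= a @ b."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (d_out, rank), singular values folded in
    b = vt[:rank, :]            # shape (rank, d_in)
    return a, b


def compression_ratio(d_out, d_in, rank):
    """Dense residual parameters divided by factored parameters.

    The shared core is not counted: it is stored once for all experts.
    """
    return (d_out * d_in) / (rank * (d_out + d_in))


def reconstruct(core, a, b):
    """Rebuild one expert's weights from the core and its factors."""
    return core + a @ b


# Synthetic experts: a common base plus a rank-1 perturbation per expert,
# so every residual has rank at most 4 (the number of experts).
rng = np.random.default_rng(0)
d_out, d_in, n_experts = 128, 128, 4
base = rng.standard_normal((d_out, d_in))
experts = [
    base + rng.standard_normal((d_out, 1)) @ rng.standard_normal((1, d_in))
    for _ in range(n_experts)
]

core, residuals = split_shared_core(experts)
factors = [low_rank_factors(r, rank=4) for r in residuals]
restored = [reconstruct(core, a, b) for a, b in factors]
```

For a 128×128 residual, rank 8 yields an 8× reduction and rank 4 yields 16×, which brackets the range quoted in the abstract. Real expert residuals are not exactly low-rank, so the achievable rank at a given accuracy depends on the model and is not something this toy example can establish.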
