How do different expert routing strategies in MambaFormer affect throughput and FLOPs per token efficiency on

SOVEREIGN Research Kernel

Report

Data sources: ZENODO

Publication: Report | Under curation | English | Publisher: Zenodo

Authors: SOVEREIGN Research Kernel;

doi: 10.5281/zenodo.20435649

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSe

Research goal: How do different expert routing strategies in MambaFormer affect throughput and FLOPs per token efficiency on code generation tasks?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.3/10.
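The parameter counts quoted in the abstract (236B total, 21B activated per token) give a rough sense of what sparse routing buys in FLOPs per token. The sketch below is illustrative only: it assumes the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token and ignores attention-score cost, which depends on sequence length. The function names are hypothetical and do not come from the report.

```python
# Back-of-the-envelope FLOPs-per-token estimate for a sparse MoE model.
# Parameter counts are the figures quoted in the abstract; the
# 2-FLOPs-per-parameter rule is an assumption, not a result from the report.

TOTAL_PARAMS = 236e9   # total parameters
ACTIVE_PARAMS = 21e9   # parameters activated for each token


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that a single token touches."""
    return active / total


def approx_forward_flops_per_token(active_params: float) -> float:
    """Assumed rule of thumb: one multiply and one add per active parameter."""
    return 2.0 * active_params


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
sparse_flops = approx_forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = approx_forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction: {frac:.1%}")
print(f"approx. forward FLOPs/token (sparse): {sparse_flops:.2e}")
print(f"approx. forward FLOPs/token (if all parameters were dense): {dense_flops:.2e}")
```

Under these assumptions, a routing strategy that changes the number of experts activated per token changes `ACTIVE_PARAMS`, and with it the FLOPs-per-token estimate, while measured throughput also depends on load balance and communication, which this estimate does not capture.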
