Other literature type · 2026
License: CC BY
Data source: ZENODO

BitMamba-2: Efficient Scaling of 1.58-bit State Space Models

Authors: Salazar, Jesus


Abstract

The scaling of Large Language Models (LLMs) is traditionally constrained by the quadratic complexity of Transformers and the memory bandwidth bottleneck associated with high-precision weights. While State Space Models (SSMs) like Mamba have addressed the sequence scaling limitation with linear-time complexity, the memory footprint remains a challenge for edge deployment. In this work, we introduce BitMamba-2, a hybrid architecture that integrates the 1.58-bit ternary quantization of BitNet into the Mamba-2 framework. We train two models from scratch: a 255M parameter baseline and a scaled-up 1B parameter model, utilizing a high-quality dataset mix comprising FineWeb-Edu, Cosmopedia, and The Stack-Dedup. Our experiments, conducted on Google Cloud TPU v6e hardware, demonstrate strong scaling laws for ternary SSMs. The 1B model achieves a 7.8% improvement in ARC-Easy accuracy (63.3%) and a dramatic reduction in perplexity (from 51.69 to 29.62) compared to the 255M baseline. Furthermore, we demonstrate that BitMamba-2 enables high-performance inference on consumer CPUs, achieving ∼53 tokens/second on an Intel i3 processor with a memory footprint of just 621 MB. Code and pre-trained checkpoints are publicly available at https://github.com/Zhayr1/BitMamba-2, https://huggingface.co/Zhayr1/BitMamba-2-1B, and https://huggingface.co/Zhayr1/BitMamba-2-0.25B.
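The "1.58-bit ternary quantization of BitNet" mentioned above refers to constraining weights to the set {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight). A minimal NumPy sketch of the absmean ternary scheme used in BitNet b1.58 is shown below; the function name and epsilon are illustrative, not taken from the BitMamba-2 codebase:

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor scale,
    in the style of BitNet b1.58's absmean scheme (illustrative sketch)."""
    scale = np.mean(np.abs(w)) + eps                   # absmean scaling factor
    w_ternary = np.clip(np.round(w / scale), -1, 1)    # values in {-1, 0, +1}
    return w_ternary.astype(np.int8), scale

# The dequantized approximation is w ≈ scale * w_ternary.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
wq, s = absmean_ternary_quantize(w)
print(np.unique(wq))  # subset of [-1, 0, 1]
```

Because each weight needs only ~1.58 bits plus one shared scale, a model of this form can fit in a memory footprint far below its FP16 equivalent, which is what enables the CPU inference numbers reported in the abstract.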

Keywords

Neural Scaling Laws, Large Language Models, Quantization, Mamba, Efficient Inference, Green AI, Ternary Weights, 1.58-bit, BitNet, State Space Models
