ZENODO
Preprint . 2026
License: CC BY-NC-ND
Data sources: Datacite

4-Pipeline Parallel Dispatch for Low-Bit GPU Neural Inference

Authors: Pirolo, Andres


Abstract

This work extends the adaptive per-block mode selector introduced in Part I with three main contributions. First, we introduce a fourth pipeline mode that encodes quaternary weights {0,1,2,3} using a dual-plane XNOR representation. Second, we present a compiler optimization ablation demonstrating that standard GPU optimization techniques, such as switch statements and shared-memory padding, do not improve performance on the Adreno 740 compiler. Third, we provide the first empirical characterization of multi-layer parallel dispatch overlap on a commodity mobile GPU. The quaternary pipeline achieves throughput between the ternary and binary modes and outperforms binary by up to 3.1× at low batch sizes, revealing a previously undocumented regime where instruction count per thread dominates arithmetic intensity. The compiler ablation shows that if-else cascades and switch statements produce equivalent throughput on Adreno 740, indicating that the compiler already generates efficient jump tables. Additionally, shared-memory padding techniques reported in prior GPU optimization literature degrade performance by up to 2.7%, confirming that Adreno 740 does not exhibit the shared-memory bank-conflict patterns characteristic of NVIDIA architectures. Finally, multi-layer parallel dispatch, that is, submitting compute shaders for multiple layers without synchronization barriers between them, achieves 34–44% execution overlap on the validation GPU. Overlap efficiency scales with layer size: at N = 16384, two-layer dispatch achieves 44.3% overlap, yielding a 1.58× forward-pass speedup for 32-layer transformer inference. Asymmetric pipeline pairing consistently outperforms symmetric pairing, revealing a scheduling property with direct implications for neural network architecture design. See license in repository zip.
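The core idea behind the quaternary mode can be illustrated with a minimal sketch. The following is not the paper's GPU kernel; it is an illustrative Python reconstruction, assuming the "dual-plane" representation means splitting each 2-bit weight into a high and a low bit-plane, so that a dot product with binary activations reduces to two masked popcounts. (The sketch uses AND masking on {0,1} activations for simplicity; an XNOR formulation over ±1 values would follow the same two-plane structure.)

```python
import random

def pack_bits(bits):
    """Pack a list of 0/1 ints into a single integer bitmask."""
    mask = 0
    for i, b in enumerate(bits):
        mask |= (b & 1) << i
    return mask

def quaternary_dot(weights, activations):
    """Dot product of quaternary weights {0,1,2,3} with binary
    activations {0,1} via two bit-plane popcounts.

    Each weight is w = 2*plane1 + plane0, so
    sum(w*x) = 2*popcount(plane1 & x) + popcount(plane0 & x):
    one popcount pass per plane instead of per-element multiplies.
    """
    plane0 = pack_bits([w & 1 for w in weights])         # low bit-plane
    plane1 = pack_bits([(w >> 1) & 1 for w in weights])  # high bit-plane
    x = pack_bits(activations)
    return 2 * bin(plane1 & x).count("1") + bin(plane0 & x).count("1")

# Cross-check against the plain integer dot product.
w = [random.randrange(4) for _ in range(32)]
x = [random.randrange(2) for _ in range(32)]
assert quaternary_dot(w, x) == sum(wi * xi for wi, xi in zip(w, x))
```

On a GPU, each plane's masked popcount maps to the same bitwise/popcount instruction sequence used by the binary pipeline, which is why the quaternary mode lands between the binary and ternary modes in throughput.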

License Note. This work is released under the PolyForm Noncommercial License 1.0.0 and is free for academic, research, and student use. Researchers, educators, and students are encouraged to study, reproduce, and build upon the methods described here. The scope of this work applies to any neural network system, regardless of architecture, model type, parameter count, or scale, and to implementations running on any hardware platform (including CPUs, GPUs, mobile processors, embedded devices, and specialized accelerators). As this work has not undergone formal peer review, it is shared in the spirit of open academic exchange. Students and researchers may learn from both the strengths and potential errors of the ideas presented here.

Keywords

BNN
