
This work extends the adaptive per-block mode selector introduced in Part I with three main contributions. First, we introduce a fourth pipeline mode that encodes quaternary weights {0,1,2,3} using a dual-plane XNOR representation. Second, we present a compiler optimization ablation demonstrating that standard GPU optimization techniques, such as switch statements and shared-memory padding, do not improve performance under the Adreno 740 compiler. Third, we provide the first empirical characterization of multi-layer parallel dispatch overlap on a commodity mobile GPU. The quaternary pipeline achieves throughput between the ternary and binary modes and outperforms binary by up to 3.1× at low batch sizes, revealing a previously undocumented regime where instruction count per thread dominates over arithmetic intensity. The compiler ablation shows that if-else cascades and switch statements produce equivalent throughput on Adreno 740, indicating that the compiler already generates efficient jump tables. Additionally, shared-memory padding techniques reported in prior GPU optimization literature degrade performance by up to 2.7%, confirming that Adreno 740 does not exhibit the shared-memory bank-conflict patterns characteristic of NVIDIA architectures. Finally, multi-layer parallel dispatch, which submits compute shaders for multiple layers without synchronization barriers between them, achieves 34–44% execution overlap on the validation GPU. Overlap efficiency scales with layer size: at N = 16384, two-layer dispatch achieves 44.3% overlap, yielding a 1.58× forward-pass speedup for 32-layer transformer inference. Asymmetric pipeline pairing consistently outperforms symmetric pairing, revealing a scheduling property with direct implications for neural network architecture design. See the license note below and the license file in the repository archive.
License Note. This work is released under the PolyForm Noncommercial License 1.0.0 and is free for academic, research, and student use. Researchers, educators, and students are encouraged to study, reproduce, and build upon the methods described here. The scope of this work applies to any neural network system, regardless of architecture, model type, parameter count, or scale, and to implementations running on any hardware platform (including CPUs, GPUs, mobile processors, embedded devices, and specialized accelerators). As this work has not undergone formal peer review, it is shared in the spirit of open academic exchange. Students and researchers may learn from both the strengths and potential errors of the ideas presented here.
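To make the dual-plane encoding described in the abstract concrete, the sketch below shows one plausible CPU-side realization in C++. It rests on assumptions not stated above: each quaternary weight w ∈ {0,1,2,3} is mapped to the symmetric level s = 2w − 3 ∈ {−3,−1,+1,+3} and decomposed as s = 2·hi + lo with hi, lo ∈ {−1,+1}, so each bit-plane can reuse the standard binary XNOR/popcount kernel against sign-packed activations. The function names (`xnor_dot32`, `quaternary_dot`) are hypothetical, and the actual pipeline in this work runs as mobile-GPU compute shaders rather than on the CPU; this is a minimal illustration of the arithmetic, not the implementation.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <cstdio>

// Standard XNOR/popcount dot product over one 32-lane word.
// Both operands are bipolar vectors packed 32 per word: bit = 1 means +1,
// bit = 0 means -1.  dot = (#matching lanes) - (#mismatching lanes).
static inline int xnor_dot32(uint32_t a, uint32_t b) {
    return 2 * static_cast<int>(std::popcount(~(a ^ b))) - 32;
}

// Dual-plane quaternary dot product (illustrative assumption, see lead-in):
// each weight w in {0,1,2,3} maps to s = 2w - 3 in {-3,-1,+1,+3} and is
// split as s = 2*hi + lo with hi, lo in {-1,+1}, so
//   <s, x> = 2*<hi, x> + <lo, x>
// and each plane reuses the binary XNOR/popcount kernel.
static inline int quaternary_dot(const uint32_t* act,
                                 const uint32_t* w_hi,
                                 const uint32_t* w_lo,
                                 int n_words) {
    int acc = 0;
    for (int i = 0; i < n_words; ++i) {
        acc += 2 * xnor_dot32(act[i], w_hi[i]) + xnor_dot32(act[i], w_lo[i]);
    }
    return acc;
}

int main() {
    // Toy example: 32 activations of +1 against 32 weights at level 3
    // (hi = +1, lo = +1).  Expected result: 32 * 3 = 96.
    uint32_t act  = 0xFFFFFFFFu;
    uint32_t w_hi = 0xFFFFFFFFu;
    uint32_t w_lo = 0xFFFFFFFFu;
    std::printf("dot = %d\n", quaternary_dot(&act, &w_hi, &w_lo, 1));
    return 0;
}
```

In a compute-shader setting the same two-plane accumulation would be evaluated per thread with hardware popcount and reduced across the workgroup; the cost relative to the binary mode is essentially one extra XNOR/popcount per packed word, which is consistent with the throughput landing between the ternary and binary pipelines.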
