BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems

Burlachenko, Konstantin; Richtárik, Peter

arXiv.org e-Print Archive

Preprint . 2025

Data sources: arXiv.org e-Print Archive

https://dx.doi.org/10.48550/ar...

Article . 2025

License: CC BY

Data sources: Datacite

Publication: Article, Preprint. Date: 01 Jan 2025. Embargo end date: 01 Jan 2025. Publisher: arXiv

Authors: Burlachenko, Konstantin; Richtárik, Peter;

doi: 10.48550/arxiv.2503.13795

arXiv: 2503.13795

Abstract

In this work, we introduce BurTorch, a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations through an exceptionally efficient CPU-based backpropagation (Rumelhart et al., 1986; Linnainmaa, 1970) implementation. Although modern DL frameworks rely on compiler-like optimizations internally, BurTorch takes a different path. It adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research. By eliminating the overhead of large frameworks and making efficient implementation choices, BurTorch achieves orders-of-magnitude improvements in performance and memory efficiency when computing $\nabla f(x)$ on a CPU. BurTorch features a compact codebase designed to achieve two key goals simultaneously. First, it provides a user experience similar to script-based programming environments. Second, it dramatically minimizes runtime overheads. In large DL frameworks, the primary source of memory overhead for relatively small computation graphs $f(x)$ is feature-heavy implementations. We benchmarked BurTorch against widely used DL frameworks in their execution modes: JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016); and several standalone libraries: Autograd (Maclaurin et al., 2015), Micrograd (Karpathy, 2020), Apple MLX (Hannun et al., 2023). For small compute graphs, BurTorch outperforms best-practice solutions by up to $\times 2000$ in runtime and reduces memory consumption by up to $\times 3500$. For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a $\times 20$ speedup and reduces memory consumption by up to $\times 80$ compared to PyTorch.

46 pages, 7 figures, 19 tables
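
To make concrete what "computing $\nabla f(x)$ for a small compute graph" involves, here is a minimal sketch of scalar reverse-mode automatic differentiation (backpropagation). It is written in Python in the general style of Micrograd, which the abstract cites as a baseline. It is not BurTorch's code or API: the `Value` class and all of its method names are hypothetical and chosen only for illustration.

```python
import math


class Value:
    """Scalar node in a computation graph (illustrative, hypothetical API)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0                   # accumulates d(output)/d(this node)
        self._parents = parents           # nodes this value was computed from
        self._backward = lambda: None     # pushes this node's grad to its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order nodes topologically, then apply the chain rule from the
        # output back towards the inputs.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# f(x, w, b) = tanh(w * x + b), evaluated where w * x + b = 0
x, w, b = Value(0.5), Value(2.0), Value(-1.0)
y = (w * x + b).tanh()
y.backward()
print(y.data, x.grad, w.grad, b.grad)  # 0.0 2.0 0.5 1.0
```

Each node in this sketch allocates a Python object and a closure, so even a tiny graph carries per-node interpreter overhead. That is one plausible reading of the kind of cost the abstract says a compact implementation in a compiled language can avoid.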

Keywords

G.4, FOS: Computer and information sciences, 65Y10, Computer Science - Machine Learning, C.1.3, C.5.3, I.2.6, I.2.8, Computer Science - Mathematical Software, C.3, Mathematical Software (cs.MS), Machine Learning (cs.LG)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
