
arXiv: 2304.05301
The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithms employed for collective communications (i.e., collective algorithms) play a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited to a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially on modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for all possible combinations of network topologies and collective patterns requires heavy engineering and validation effort. To address these challenges, this paper presents TACOS, an autonomous synthesizer that automatically generates topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. TACOS also scales better, with polynomial synthesis times, in contrast to NP-hard approaches, which scale only to systems with tens of NPUs; TACOS can synthesize for 40K NPUs in just 2.52 hours.
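To make the synthesis target concrete, below is a minimal illustrative sketch of what "synthesizing a collective algorithm" means in this setting; it is not TACOS's actual method. It greedily builds an All-Gather schedule (an All-Reduce is commonly decomposed into a Reduce-Scatter followed by an All-Gather) over an arbitrary directed topology: at each step it picks the link-chunk transfer that delivers a missing chunk earliest. The per-chunk link costs, the greedy earliest-finish heuristic, and all names here are assumptions made for this example.

```python
# Illustrative sketch only (assumed greedy heuristic, NOT the TACOS
# algorithm): synthesize an All-Gather schedule for an arbitrary topology.
# An All-Reduce can then be built as Reduce-Scatter + All-Gather.

def synthesize_all_gather(links):
    """links: dict mapping (src, dst) -> per-chunk transfer time.
    Returns a list of (start_time, src, dst, chunk) transfer events."""
    npus = {n for edge in links for n in edge}
    holds = {n: {n} for n in npus}         # chunk i initially lives on NPU i
    link_free = {e: 0.0 for e in links}    # time each link becomes idle
    arrival = {(n, n): 0.0 for n in npus}  # when chunk c arrives at NPU n
    schedule = []
    while any(len(holds[n]) < len(npus) for n in npus):
        best = None
        for (src, dst), cost in links.items():
            for c in holds[src] - holds[dst]:   # chunks dst still misses
                start = max(link_free[(src, dst)], arrival[(c, src)])
                if best is None or start + cost < best[0]:
                    best = (start + cost, start, src, dst, c)
        assert best is not None, "topology must be strongly connected"
        finish, start, src, dst, c = best
        holds[dst].add(c)
        link_free[(src, dst)] = finish
        arrival[(c, dst)] = finish
        schedule.append((start, src, dst, c))
    return schedule

# Example: a heterogeneous 4-NPU bidirectional ring plus one fast shortcut.
links = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0,
         (1, 0): 1.0, (2, 1): 1.0, (3, 2): 1.0, (0, 3): 1.0,
         (0, 2): 0.5, (2, 0): 0.5}
for start, src, dst, chunk in synthesize_all_gather(links):
    print(f"t={start:.1f}: send chunk {chunk} over link {src}->{dst}")
```

Even this naive greedy variant tends to route traffic over the faster shortcut link when it shortens completion time; the point of a synthesizer like TACOS is to make such topology-dependent routing choices automatically, and to do so at scales where hand-designed algorithms or NP-hard formulations break down.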
Comments: 12 main pages, 21 figures, 5 tables; artifact appendix attached.
FOS: Computer and information sciences. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
