LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming

Publication type: Article, Preprint, Conference object
Date: 17 Nov 2024
Embargo end date: 01 Jan 2024
Publisher: IEEE
Journal: SC24: International Conference for High Performance Computing, Networking, Storage and Analysis
Funded by: EC | RED-SEA

Authors: Shen, Siyuan; Huang, Langwen; Chrapek, Marcin; Schneider, Timo; Dayal, Jai; Gajbe, Manisha; Wisniewski, Robert; +1 Authors

doi: 10.1109/sc41406.2024.00070, 10.48550/arxiv.2404.14193

arXiv: 2404.14193

handle: 20.500.11850/714252

Abstract

The shift towards high-bandwidth networks driven by AI workloads in data centers and HPC clusters has unintentionally aggravated network latency, adversely affecting the performance of communication-intensive HPC applications. As large-scale MPI applications often exhibit significant differences in their network latency tolerance, it is crucial to accurately determine the extent of network latency an application can withstand without significant performance degradation. Current approaches to assessing this metric often rely on specialized hardware or network simulators, which can be inflexible and time-consuming. In response, we introduce LLAMP, a novel toolchain that offers an efficient, analytical approach to evaluating HPC applications' network latency tolerance using the LogGPS model and linear programming. LLAMP equips software developers and network architects with essential insights for optimizing HPC infrastructures and strategically deploying applications to minimize latency impacts. Through our validation on a variety of MPI applications like MILC, LULESH, and LAMMPS, we demonstrate our tool's high accuracy, with relative prediction errors generally below 2%. Additionally, we include a case study of the ICON weather and climate model to illustrate LLAMP's broad applicability in evaluating collective algorithms and network topologies.
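As a rough illustration of the question the abstract poses (how much network latency an application can absorb before it slows down noticeably), the toy Python sketch below may help. It is not LLAMP's code and does not reproduce its LogGPS formulation: the execution graph, the node names, the costs in microseconds, and the helpers `path_lines`, `runtime`, and `latency_tolerance` are all hypothetical. The sketch assumes that each edge of a small execution graph costs a fixed amount plus some multiple of the latency L, so that the runtime is the longest path through the graph. The same quantity could be posed as a linear program (maximize L subject to one precedence constraint per edge and a bound on the finish time); for a graph this small, enumerating paths gives the same answer directly.

```python
# Hypothetical toy execution graph for two MPI ranks (not taken from the paper).
# EDGES[u] = [(v, fixed_cost_us, latency_count)]; an edge costs
# fixed_cost_us + latency_count * L, where L is the network latency in us.
EDGES = {
    "start": [("r0_compute", 0.0, 0), ("r1_compute", 0.0, 0)],
    "r0_compute": [("r0_send", 50.0, 0)],
    "r1_compute": [("r1_recv", 80.0, 0)],
    "r0_send": [("r1_recv", 2.0, 1), ("r0_tail", 1.0, 0)],  # message crosses the network once
    "r1_recv": [("end", 30.0, 0)],
    "r0_tail": [("end", 100.0, 0)],
    "end": [],
}


def path_lines(node="start"):
    """Return (fixed_cost, latency_count) for every path from node to 'end'."""
    if node == "end":
        return [(0.0, 0)]
    return [
        (cost + a, count + b)
        for (nxt, cost, count) in EDGES[node]
        for (a, b) in path_lines(nxt)
    ]


def runtime(latency_us):
    """Predicted runtime: the longest path, a piecewise-linear function of L."""
    return max(a + b * latency_us for a, b in path_lines())


def latency_tolerance(baseline_latency_us, slowdown):
    """Largest L whose runtime stays within (1 + slowdown) times the baseline runtime."""
    budget = (1.0 + slowdown) * runtime(baseline_latency_us)
    # Every latency-dependent path must satisfy a + b * L <= budget.
    bounds = [(budget - a) / b for a, b in path_lines() if b > 0]
    return min(bounds) if bounds else float("inf")


if __name__ == "__main__":
    print("baseline runtime (us):", runtime(0.0))
    print("latency tolerated at 1% slowdown (us):", latency_tolerance(0.0, 0.01))
```

In this toy graph only one path crosses the network, and it has about 69 µs of slack against the compute-bound critical path of 151 µs, so a 1% slowdown budget is reached at a latency of about 70.51 µs. Real traces have far too many paths to enumerate, which is one reason an optimization-based formulation is attractive.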

19 pages

Related Organizations

Department of Computer Science, Spain
Department of Computer Sciences, Austria
Hewlett Packard Enterprise (United States), United States
Samsung (South Korea), Korea (Republic of)
ETH Zurich, Switzerland

Keywords

Computer Science - Networking and Internet Architecture, Networking and Internet Architecture (cs.NI), Performance (cs.PF), FOS: Computer and information sciences, Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing, C.4, Distributed, Parallel, and Cluster Computing (cs.DC), Network latency tolerance; linear programming; MPI applications; high-performance computing

Related research

cloverleaf software on GitHub (IsRelatedTo)
lammps software on GitHub (IsRelatedTo)
lammps software on GitHub (IsRelatedTo)
LULESH software on GitHub (IsRelatedTo)

Impact by BIP!

selected citations: 0
popularity: Average
influence: Average
impulse: Average