
High-performance computing (HPC) is vital for resource-intensive scientific workflows such as genome sequencing, weather prediction, and deep neural network (DNN) training. However, using resources efficiently during job execution requires selecting the right endpoint for each job's specific requirements. Clusters such as OSC and TACC differ in architecture and policy, which makes it difficult for users to fine-tune their resource allocations. We introduce HARP (HPC Application Resource Predictor) as part of the ICICLE project (AI4CI), which aims to democratize AI and foster interdisciplinary collaboration. HARP profiles applications and recommends an optimal resource allocation, reducing costs without compromising workflow execution. We propose extensions to improve HARP's accuracy: a new loss function biased towards overestimation, a metric prioritizing underpredictions, a cost function penalizing underestimation, and memory modeling that leverages the formula from the ZeRO paper. Our framework simulates executions to build regression models that accurately capture resource consumption and allocations. HARP is available for download on Linux-based systems, and the latest release enables API-based storage and integration with TAPIS.
Resource Estimation, Optimal Allocations, HPC, Walltime Prediction, AI4CI
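The abstract mentions a loss function biased towards overestimation. One common way to obtain such a bias is to weight underestimates more heavily than overestimates; the sketch below illustrates that idea only and is not HARP's actual loss. The function name and the `under_weight` parameter are hypothetical.

```python
def asymmetric_squared_error(actual, predicted, under_weight=4.0):
    """Mean squared error that penalizes underestimates more than overestimates.

    `under_weight` is a hypothetical tuning knob (an assumption, not taken
    from HARP): squared errors where predicted < actual are multiplied by it,
    so a model fitted with this loss tends to err on the side of overestimation.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        err = p - a
        # Underestimation (err < 0) risks a killed job, so it costs more.
        weight = under_weight if err < 0 else 1.0
        total += weight * err * err
    return total / len(actual)


# Same absolute error of 2 units, different penalty depending on direction.
under = asymmetric_squared_error([10.0], [8.0])   # predicted too little
over = asymmetric_squared_error([10.0], [12.0])   # predicted too much
```

With the default weight, an underestimate costs four times as much as an overestimate of the same size, which reflects the intuition that an under-provisioned job may fail outright while an over-provisioned one only wastes some allocation.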
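For the memory modeling, the abstract refers to a formula from the ZeRO paper without saying which one. A plausible candidate is ZeRO's accounting of model-state memory for mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for optimizer states, partitioned across data-parallel GPUs depending on the ZeRO stage. The sketch below assumes that formula; the function name is hypothetical, and activations, buffers, and fragmentation are not covered.

```python
def zero_model_state_bytes(num_params, num_gpus=1, stage=0, k=12):
    """Per-GPU model-state memory (bytes) under mixed-precision Adam.

    Assumes the ZeRO accounting: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, k bytes/param for optimizer states
    (k = 12 for Adam). `stage` selects what is partitioned across GPUs:
      0: nothing (plain data parallelism)
      1: optimizer states
      2: optimizer states and gradients
      3: optimizer states, gradients, and parameters
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if stage == 0:
        return (2 + 2 + k) * num_params
    if stage == 1:
        return 4 * num_params + k * num_params / num_gpus
    if stage == 2:
        return 2 * num_params + (2 + k) * num_params / num_gpus
    if stage == 3:
        return (2 + 2 + k) * num_params / num_gpus
    raise ValueError("stage must be 0, 1, 2, or 3")


# A 7.5B-parameter model: unpartitioned vs. fully partitioned over 64 GPUs.
baseline_gb = zero_model_state_bytes(7.5e9) / 1e9
stage3_gb = zero_model_state_bytes(7.5e9, num_gpus=64, stage=3) / 1e9
```

An estimate like this gives a lower bound on GPU memory for a DNN training job, which a predictor could combine with profiled measurements rather than rely on alone.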
