Conference object
Data sources: ZENODO

VGAC: Predictive Queue Intelligence for GPU Cluster Observability

Authors: Andrew, espira;

Abstract

This talk introduces VGAC (Visualize, Gate, Advise, Calibrate), a calibration-first approach to GPU-cluster queue intelligence. Submit-time queue-risk predictions are useful only if their probabilities are reliable: outcomes predicted at "70% risk" should occur about 70% of the time. The talk presents empirical evidence from an Amazon EKS cluster (582 jobs) and an AWS ParallelCluster Slurm deployment showing that (i) Expected Calibration Error (ECE), not AUROC, is the deployment-relevant metric; (ii) a small feature set dominated by queue depth is sufficient for strong discrimination; and (iii) when the cluster scheduler changes (EKS → Slurm), discrimination mostly transfers but calibration does not, which motivates treating per-cluster recalibration as a first-class operational concern. The talk closes with the integration path: validating admission webhooks on Kubernetes, sacctmgr-based advisories on Slurm, and a roadmap toward conformal prediction intervals. The full paper, code, sample data, trained calibrators, and a one-command reproducibility notebook are released as a companion artifact at Reliability-First-Queue-Risk.
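
The distinction between discrimination (AUROC) and calibration (ECE) in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: the function names, the equal-width 10-bin ECE definition, and the synthetic job outcomes are all assumptions made for illustration. It shows that a strictly monotone distortion of the predicted probabilities leaves the ranking, and therefore AUROC, unchanged while sharply degrading calibration.

```python
import random

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE (an assumed, common definition): the
    sample-weighted mean of |observed rate - mean predicted prob| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in b) / len(b)
        rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(rate - mean_p)
    return ece

def auroc(probs, labels):
    """AUROC as the probability that a random positive is scored above
    a random negative, counting ties as one half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in for queue-risk outcomes (NOT the talk's data):
# each job has a true risk p, and the label is drawn with that probability.
rng = random.Random(0)
true_p = [rng.random() for _ in range(2000)]
labels = [1 if rng.random() < p else 0 for p in true_p]

calibrated = true_p                   # predictions equal to the true risk
distorted = [p ** 3 for p in true_p]  # same ranking, wrong probabilities

print("AUROC calibrated:", auroc(calibrated, labels))
print("AUROC distorted: ", auroc(distorted, labels))
print("ECE calibrated:  ", expected_calibration_error(calibrated, labels))
print("ECE distorted:   ", expected_calibration_error(distorted, labels))
```

Under these assumptions the two AUROC values coincide because cubing preserves the ordering of scores, whereas the ECE of the distorted scores is much larger. This is the same pattern the abstract describes for a scheduler change: ranking quality can carry over while the probabilities themselves need recalibration.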
