
GPU Behavior Genome: Stable, Change-Sensitive Embeddings for Fleet-Level GPU Telemetry in NASA HPC introduces GBG, a self-supervised representation learning system that produces a per-GPU fingerprint: a compact embedding that remains stable under normal workload drift yet reacts quickly to meaningful configuration, firmware, or cooling changes. Unlike current DCGM dashboards and rule-based monitoring, GBG provides a semantic identity for each GPU across workloads and maintenance cycles. It enables:

- Early warning of degradation and misconfigurations with few-shot checks
- Fleet-scale forensics, answering “which nodes looked like those that later failed?”
- Cross-generation transfer across GPU families, with safe onboarding for new architectures
- Adaptive verification via safety-aware contextual bandits that balance certainty with operational budgets
- Explainability through Integrated Gradients and TimeSHAP evidence packs for operator trust

Benchmarked against strong baselines (DCGM+rules, SR-CNN, Matrix Profile, LSTM-AE, Isolation Forest), GBG achieves high stability, accurate detection of staged changes, and efficient fleet-level operation with bounded overhead. Designed for NASA HPC clusters but generalizable to large-scale GPU fleets, GBG reframes monitoring from “threshold and react” to “fingerprint and verify.” This work provides reproducibility artifacts, evaluation protocols, and deployment guidance, establishing a blueprint for embedding-centric GPU observability in mission operations and beyond.
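To make the "fingerprint and verify" idea concrete, the following is a minimal sketch of how per-GPU embeddings could support a drift check and the forensics query quoted above. It is not the GBG implementation: the abstract does not specify the distance measure, so cosine similarity is an assumption here, and the function names (`drift_from_baseline`, `nearest_to_failed`), GPU identifiers, and three-dimensional toy vectors are hypothetical placeholders for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drift_from_baseline(baseline, current):
    """Cosine distance between a GPU's reference fingerprint and its latest one.

    Assumption: a small value means "same behavior"; the alert threshold
    would have to be calibrated on real telemetry.
    """
    return 1.0 - cosine_similarity(baseline, current)


def nearest_to_failed(fleet, failed_ids, top_k=3):
    """Rank non-failed GPUs by their highest similarity to any failed GPU."""
    if not failed_ids:
        return []
    scores = []
    for gpu_id, emb in fleet.items():
        if gpu_id in failed_ids:
            continue
        best = max(cosine_similarity(emb, fleet[f]) for f in failed_ids)
        scores.append((gpu_id, best))
    # Most similar to a failed unit first.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Hypothetical toy fingerprints; real embeddings would come from the trained encoder.
fleet = {
    "node01-gpu0": [1.0, 0.0, 0.0],
    "node02-gpu0": [0.9, 0.1, 0.0],
    "node03-gpu0": [0.0, 1.0, 0.0],
    "node04-gpu0": [0.0, 0.9, 0.1],
}
failed = {"node03-gpu0"}

ranked = nearest_to_failed(fleet, failed, top_k=2)
for gpu_id, score in ranked:
    print(f"{gpu_id}: similarity to a failed GPU = {score:.3f}")
```

In this toy fleet the GPU whose vector points in nearly the same direction as the failed unit ranks first. A production system would likely replace the linear scan with an approximate nearest-neighbor index, which is a design choice outside what the abstract describes.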
Self-supervised learning, NASA HPC, Fleet-scale observability, GPU telemetry, Contextual bandits, TimeSHAP / Integrated Gradients, Predictive maintenance, Anomaly detection, Misconfiguration detection, Representation learning, DCGM monitoring, Cross-generation transfer, Change-point detection
