
The rapid growth of artificial intelligence (AI) has led to increased reliance on power-intensive Graphics Processing Units (GPUs), which are essential for training and deploying large-scale models. However, the escalating energy demands of AI workloads pose sustainability challenges, necessitating efficient power management strategies to reduce carbon footprints. Optimizing GPU server power consumption is complex because of the interdependence of its components, and conventional methods often involve trade-offs: increasing fan speed improves cooling but raises overall power usage, whereas lowering GPU clock frequencies conserves energy at the cost of longer computation times. To address these challenges, we propose a data-driven optimization framework based on offline reinforcement learning (RL). Our approach collects operational data from a custom-designed workload that simulates varying server loads, capturing key metrics such as power consumption, temperature, and core frequency, and uses a reward function that balances power efficiency with performance. The RL agent learns from pre-collected server logs, enabling intelligent real-time GPU clock control decisions without costly live experiments; periodic fan speed adjustments and pre-training of the Q-network further improve overall efficiency. Experimental results show that our method reduces power consumption by 3.62% while shortening computation time by 1.51% on synthetic workloads. For LLaMA-2 fine-tuning, power consumption decreases by 6.40% with only a minor 1.27% increase in computation time, confirming the method's practical effectiveness. The framework was validated on the recent NVIDIA L40S GPU, demonstrating compatibility with cutting-edge hardware.
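To make the data-collection and control loop concrete, the sketch below shows how the logged state (power, temperature, core clock) could be sampled and how a clock-scaling action might be applied through NVML. This is a minimal illustration, assuming the `pynvml` Python bindings; the `reward` weighting (`lam`, `p_ref`, `t_ref`) is a hypothetical stand-in for the paper's power/performance trade-off, whose exact form is not given in the abstract.

```python
# Minimal sketch: sample the metrics logged for offline RL and apply a
# discrete clock action. Assumes the `pynvml` NVML bindings and one GPU.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def read_state():
    """Sample the state features mentioned in the abstract:
    power draw, temperature, and core (SM) clock frequency."""
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    clock_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    return power_w, temp_c, clock_mhz

def reward(power_w, step_time_s, lam=0.5, p_ref=300.0, t_ref=1.0):
    """Hypothetical reward balancing power against computation time.
    lam, p_ref, and t_ref are illustrative constants, not the paper's values."""
    return -(power_w / p_ref) - lam * (step_time_s / t_ref)

def apply_action(clock_mhz):
    """Lock the GPU core clock to the agent's chosen frequency
    (mirrors `nvidia-smi -lgc`; requires admin privileges and a
    Volta-or-newer GPU such as the L40S)."""
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, clock_mhz, clock_mhz)
```

In an offline RL setting, such (state, action, reward) tuples would be recorded while the custom workload runs and later replayed to train the Q-network, so no live experimentation on the production server is required.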
Data-driven optimization, dynamic GPU clock scaling, GPU server power management, offline reinforcement learning
