Slurm - Power Management
Energy Efficient Cluster Computing
Effectively manage the power demand of continuously growing HPC resources…
- …make energy consumption quantifiable…thereby evaluate the efficiency of energy consumption
- …enable HPC infrastructure to plan and bill power consumption in advance
- …develop knowledge and tools to find the best tradeoff between energy and performance
- …limit accounts/users to a defined energy budget (alongside other resource limits)
- …report power consumption to raise user awareness and encourage optimization
Ultimate goal…maximum performance/throughput within a given energy budget
Most energy consumption in an HPC cluster…
- …result of application computation allocating compute resources at scale
- …should be treated as a job characteristic (similar to CPU time)
Capabilities of the SLURM workload management system related to power management…
- …integration of capabilities to monitor energy consumption
- …required power per node (Watts)
- …estimate per job power consumption (Joules)
- …collection of corresponding energy metrics in the SLURM accounting database
- …report energy consumption per account and users [^55]
- …to identify power-efficient workload patterns and application parameters
- …helps to implement application code with energy-efficiency in mind
- …allow users control of energy efficiency of execution [^60] [^62]
Energy Monitoring
Collect energy consumption data for…
- Job/step accounting – Running and total energy consumption by a job or step
- Job/step profiling – Profile of power use by a job/step over time, per node
- Hardware monitoring – Instantaneous power and cumulative energy consumption per node
Collect resource usage data for accounting, profiling and monitoring…
- …energy consumption data generated in-band from hardware sensors
- …loaded by `slurmd` on each compute node
- …called by `jobacct_gather` plugin to collect accounting data (jobs/steps)
- …called via RPC from the `slurmctld` to collect energy consumption data for nodes
- …calls `acct_gather_profile` plugin to provide energy data samples for profiling
`acct_gather_profile` data reporting…
- …for running jobs, energy accounting data is reported by `sstat`
- If accounting database is configured…
- …energy accounting data is included in accounting records
- …reported by `sacct` and `sreport`
- Energy consumption for nodes reported by `scontrol show node`
- Metrics…
- Cumulative/total energy consumption is reported in units of joules
- Instantaneous rate of energy consumption (power) is reported in units of watts
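The two units above are related by time (1 W = 1 J/s), which makes it easy to derive average power and kWh figures from accounting records. A minimal sketch, assuming the records were exported with something like `sacct --parsable2 --noheader --format=JobID,ElapsedRaw,ConsumedEnergyRaw`; the sample record and the helper names are illustrative and not taken from a real cluster:

```python
def joules_to_kwh(joules: float) -> float:
    """Convert energy in joules to kilowatt-hours (1 kWh = 3.6e6 J)."""
    return joules / 3.6e6


def average_watts(joules: float, elapsed_seconds: float) -> float:
    """Average power over a period: energy divided by time (1 W = 1 J/s)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return joules / elapsed_seconds


def parse_sacct_line(line: str) -> dict:
    """Parse one 'JobID|ElapsedRaw|ConsumedEnergyRaw' record (assumed layout)."""
    job_id, elapsed, energy = line.strip().split("|")
    return {"job_id": job_id, "elapsed_s": int(elapsed), "energy_j": int(energy)}


# Illustrative record: a job that ran for one hour and consumed 1.8 MJ
record = parse_sacct_line("4242|3600|1800000")
print(record["job_id"],
      average_watts(record["energy_j"], record["elapsed_s"]), "W")
print(joules_to_kwh(record["energy_j"]), "kWh")
```

Records with an empty energy field (for example when no energy plugin is enabled) would need extra handling that this sketch leaves out.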
`AcctGatherNodeFreq` configures the global sampling interval for node accounting.
Plugins to be loaded must be specified in `slurm.conf`:
# Frequency of node energy sampling
AcctGatherNodeFreq=<seconds> # default 0 ...disables node energy sampling
# ...enable plugins
AcctGatherEnergyType=acct_gather_energy/ipmi
AcctGatherFilesystemType=acct_gather_filesystem/lustre
AcctGatherProfileType=acct_gather_profile/influxdb
AcctGatherInterconnectType=acct_gather_interconnect/ofed
Plugins read options from a dedicated file acct_gather.conf
acct_gather_energy/ipmi
In-band hardware sensors on compute nodes provided by the BMC…
- …energy consumption data is read via IPMI interface
- …requires FreeIPMI version 1.2.1 or later
Limitations…IPMI energy data includes all node energy consumption
- …reliable only for jobs/steps using unshared whole node allocation
- …basically jobs with option `--exclusive`
Options used for `AcctGatherEnergyType=acct_gather_energy/ipmi` in `acct_gather.conf`:
# (dedicated) IPMI user...
EnergyIPMIUsername=USERNAME
EnergyIPMIPassword=PASSWORD
# number of seconds between BMC access samples
EnergyIPMIFrequency=<number>
# specify the ids of the sensors to use
EnergyIPMIPowerSensors=<key=values>
Specifying sensors with `EnergyIPMIPowerSensors=`…
- …multiple `<key=values>` separated by `;`
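To make the `<key=values>` layout concrete, here is a small parser sketch. It assumes that each value is a comma-separated list of numeric sensor ids; the key names and ids in the example string are hypothetical, since real ids depend on the BMC of the node:

```python
def parse_power_sensors(value: str) -> dict:
    """Split 'key=id,id;key=id' into {key: [ids]} (assumed value layout)."""
    sensors = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        key, sep, ids = entry.partition("=")
        if not sep or not key or not ids:
            raise ValueError(f"malformed entry: {entry!r}")
        sensors[key.strip()] = [int(i) for i in ids.split(",")]
    return sensors


# Hypothetical sensor ids, for illustration only
example = "Node=16,19;Socket0=16;Socket1=19"
print(parse_power_sensors(example))
```

A check like this can be useful for validating a configuration line before it is rolled out to the compute nodes.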
acct_gather_energy/rapl
Energy consumption data is collected from hardware sensors via the RAPL interface…
- …requires CPUs with support for RAPL
- …Linux `msr` module must be loaded (`sudo modprobe msr`)
- …doesn’t read any options from `acct_gather.conf`
Limitations …RAPL energy data includes CPU, DRAM and cache energy
- …poor precision of energy accounting measurements for short jobs
- …depends on sampling rate `JobAcctGatherFrequency` and `EnergyIPMIFrequency`
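The effect of the sampling rate on short jobs can be illustrated with a back-of-the-envelope model. This is a simplification and not how Slurm computes its accounting: it assumes roughly constant power draw and that up to one sampling interval of energy may be missed or attributed to the wrong job, so the relative uncertainty is bounded by interval divided by job duration. The function name is hypothetical:

```python
def sampling_uncertainty(interval_s: float, duration_s: float) -> float:
    """Rough upper bound on relative energy error (simplified model).

    Assumes constant power and at most one sampling interval of
    unattributed energy; the result is capped at 1.0 (100 %).
    """
    if interval_s < 0 or duration_s <= 0:
        raise ValueError("interval must be >= 0 and duration must be > 0")
    return min(1.0, interval_s / duration_s)


# Same 30 s sampling interval, jobs of increasing length
for duration in (60, 600, 3600):
    print(f"{duration:>5} s job: up to {sampling_uncertainty(30, duration):.1%}")
```

Under these assumptions a shorter sampling interval tightens the bound, at the cost of more frequent sensor reads on the node.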
Example…
# configuration
>>> scontrol show config | grep ^AcctGather
AcctGatherEnergyType = acct_gather_energy/rapl
AcctGatherNodeFreq = 30 sec
# reporting of energy consumption on a node
>>> scontrol show node $node
...
CurrentWatts=121 LowestJoules=69447 ConsumedJoules=8726863
...
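For monitoring scripts, the energy fields in the output above can be extracted with a few lines of Python. A minimal sketch that works on the `key=value` text shown in the example; the function name is hypothetical, and any non-numeric value is skipped rather than interpreted:

```python
import re


def parse_energy_fields(text: str) -> dict:
    """Extract CurrentWatts, LowestJoules and ConsumedJoules as integers."""
    wanted = ("CurrentWatts", "LowestJoules", "ConsumedJoules")
    fields = {}
    for key, value in re.findall(r"(\w+)=(\S+)", text):
        if key in wanted and value.isdigit():
            fields[key] = int(value)
    return fields


# Line copied from the example output above
sample = "   CurrentWatts=121 LowestJoules=69447 ConsumedJoules=8726863"
print(parse_energy_fields(sample))
```

In practice the text would come from `scontrol show node <name>`, for example captured with `subprocess.run`.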
ext_sensors Plugins
Collects energy and temperature data generated by out-of-band sensors (like wattmeters)…
- …loaded by `slurmctld` on management node
- …independently of the `acct_gather` plugins
- …does not support power profiling or energy reporting for running jobs/steps with `sstat`
Configuration in slurm.conf
# enable plugins...
ExtSensorsType=ext_sensors/rrd
# Frequency of node energy sampling controlled by:
ExtSensorsFreq=<seconds> # Default 0...disables node energy sampling
Plugins configured in dedicated `ext_sensors.conf` file
Energy Saving
Mechanism to save energy…
- …power save mode
- …reduce power consumption by hibernation and dynamic voltage/frequency scaling
- …power-off idle resources (nodes, individual devices like CPUs/GPUs)
- …power management…power caps
- …limit total power consumption across all nodes
- …limit available power in advance to adapt to dynamic energy prices
- …throttles resources/performance available to users for a given energy budget
Power Save Mode
https://slurm.schedmd.com/power_save.html
Mechanism to throttle CPUs and/or power down idle nodes
- …nodes idle for a period of time placed in a power saving mode
- …restored to normal operation once work is assigned
- Power saving modes…
- …use a `cpufreq` governor to limit CPU frequency and voltage
- …power down and resume nodes
- …use a `ResumeProgram=` to remove nodes from power saving mode…
- …requires either wake-on-LAN or power-on over the IPMI interface
- …`ResumeRate=` limits the number of nodes resumed per minute…
- …prevent an instantaneous surge in power demand
- …boot nodes so that power demand increases gradually
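The throttling effect of `ResumeRate=` can be estimated with simple arithmetic. A minimal sketch, assuming the value is interpreted as nodes per minute (as described in the Slurm power saving guide) and that a value of 0 means no limit; the function name is hypothetical:

```python
import math


def minutes_to_resume(nodes: int, resume_rate: int) -> int:
    """Number of one-minute windows needed to issue all resume requests.

    Assumes resume_rate is in nodes per minute; 0 is treated as unlimited.
    """
    if nodes < 0 or resume_rate < 0:
        raise ValueError("nodes and resume_rate must be non-negative")
    if nodes == 0 or resume_rate == 0:
        return 0
    return math.ceil(nodes / resume_rate)


# Waking 1000 powered-down nodes at 100 nodes per minute
print(minutes_to_resume(1000, 100))
```

The estimate covers only the issuing of resume requests; the boot time of each node comes on top of it.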
Power Management
https://slurm.schedmd.com/power_mgmt.html
Monitors actual power consumption…
- …with a power cap configured for the system
- …dynamically re-allocates power available per node…
- …based upon actual real-time usage
- …evenly distributing power cap across all nodes
- …optimizes throughput within power cap
- …responds quickly to changes in application power consumption
- …nodes using most of their power cap have the cap increased
- …nodes with newly initiated jobs have power cap reset
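The re-allocation idea can be illustrated with a toy model. This is not Slurm's implementation: the thresholds, rates and function name below are hypothetical parameters chosen for illustration, and the reset of caps for newly started jobs is not modelled. Nodes that leave much of their cap unused have it lowered, and the freed power is shared among nodes running close to their cap:

```python
def rebalance_caps(caps, usage, total_cap,
                   lower=0.90, upper=0.95,
                   decrease_rate=0.5, increase_rate=0.2):
    """Return new per-node caps in watts (toy model, hypothetical parameters)."""
    new_caps = dict(caps)
    # Step 1: shrink caps of nodes that leave headroom unused
    for node, cap in caps.items():
        if usage[node] < lower * cap:
            new_caps[node] = cap - decrease_rate * (cap - usage[node])
    # Step 2: power not assigned to any node is available for redistribution
    pool = total_cap - sum(new_caps.values())
    # Step 3: raise caps of nodes running close to their cap,
    # limited to increase_rate of their previous cap
    hungry = [n for n, cap in caps.items() if usage[n] >= upper * cap]
    if hungry and pool > 0:
        share = pool / len(hungry)
        for node in hungry:
            new_caps[node] += min(share, increase_rate * caps[node])
    return new_caps


# Three nodes start with an even share of a 900 W system cap
caps = {"n1": 300, "n2": 300, "n3": 300}
usage = {"n1": 295, "n2": 100, "n3": 290}
print(rebalance_caps(caps, usage, total_cap=900))
```

In this example the lightly loaded node `n2` gives up part of its cap, which is split between `n1` and `n3`, while the sum of all caps stays within the system cap.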
References
- Power capping in SLURM, CEA (2013)
- EcoFreq: compute with cleaner energy via carbon-aware power scaling, IT/EE-Palaver, GSI
- Energy Accounting and Control with SLURM Resource and Job Management Systems
- Energy Aware Runtime (EAR) energy management framework for super computers
- Energy Efficiency in HPC, Bull (2016)
- Profiling Power Consumption of Jobs with SLURM (2020)
- Energy Efficiency Features of the Modern HPC Hardware and Energy Consumption Measurement
- A Survey of the Research on Power Management Techniques for High Performance Systems
- Power Saving with Slurm, SLUG23
- CATS: The Climate Aware Task Scheduler, FOSDEM’24