Slurm - Power Management

Energy Efficient Cluster Computing

HPC
Published

June 21, 2022

Modified

February 8, 2024

Effectively manage the associated power of continuously growing HPC resources…

Ultimate goal…maximum performance/throughput within a given energy budget

Most energy consumption in an HPC cluster…

Capabilities of the SLURM workload management system related to power management…

Energy Monitoring

Collect energy consumption data for…

  • Job/step accounting – Running and total energy consumption by a job or step
  • Job/step profiling – Profile of power use by a job/step over time, per node
  • Hardware monitoring – Instantaneous power and cumulative energy consumption per node

Collect resource usage data for accounting, profiling and monitoring…

  • …energy consumption data generated in-band from hardware sensors
  • …loaded by slurmd on each compute node
  • …called by jobacct_gather plugin to collect accounting data (jobs/steps)
  • …called via RPC from the slurmctld to collect energy consumption data for nodes
  • …calls acct_gather_profile plugin to provide energy data samples for profiling

acct_gather_profile data reporting…

  • …for running jobs, energy accounting data is reported by sstat
  • If accounting database is configured…
    • …energy accounting data is included in accounting records
    • …reported by sacct and sreport
  • Energy consumption for nodes reported by scontrol show node
  • Metrics…
    • Cumulative/total energy consumption is reported in units of joules
    • Instantaneous rate of energy consumption (power) is reported in watts
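Since the cumulative counter is in joules and power is in watts (joules per second), the average power between two samples follows from the difference of two readings. A minimal sketch; the function name and the sample readings are illustrative and not part of Slurm:

```python
def average_power_watts(joules_start, joules_end, seconds):
    """Average power (W) between two cumulative energy readings (J)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (joules_end - joules_start) / seconds


# Two hypothetical readings of a cumulative counter taken 30 s apart
print(average_power_watts(8_723_233, 8_726_863, 30))  # 121.0
```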

AcctGatherNodeFreq configures global sampling interval for node accounting

Plugin to be loaded must be specified in slurm.conf:

# Frequency of node energy sampling
AcctGatherNodeFreq=<seconds> # default 0 ...disables node energy sampling
# ...enable plugins
AcctGatherEnergyType=acct_gather_energy/ipmi
AcctGatherFilesystemType=acct_gather_filesystem/lustre
AcctGatherProfileType=acct_gather_profile/influxdb
AcctGatherInterconnectType=acct_gather_interconnect/ofed

Plugins read options from a dedicated file acct_gather.conf

acct_gather_energy/ipmi

In-band hardware sensors on compute nodes provided by the BMC…

  • …energy consumption data is read via IPMI interface
  • …requires FreeIPMI version 1.2.1 or later

Limitations…IPMI energy data includes all node energy consumption

  • …reliable only for jobs/steps using unshared whole node allocation
  • …basically jobs with option --exclusive

Options used for AcctGatherEnergyType=acct_gather_energy/ipmi in acct_gather.conf

# (dedicated) IPMI user...
EnergyIPMIUsername=USERNAME
EnergyIPMIPassword=PASSWORD
# number of seconds between BMC access samples
EnergyIPMIFrequency=<number>
# specify the IDs of the sensors to use
EnergyIPMIPowerSensors=<key=values>

Specifying sensors with EnergyIPMIPowerSensors=

  • …multiple <key=values> entries are separated by ;
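To illustrate the format, the sketch below splits such a value into its entries. It assumes that the values of one key are a comma-separated list of sensor IDs; the keys and IDs in the example string are hypothetical, so check the IDs reported by the BMC of the actual hardware.

```python
def parse_power_sensors(value):
    """Split 'key=ids;key=ids' into {key: [id, ...]}."""
    sensors = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        key, _, ids = entry.partition("=")
        # Assumption: sensor IDs are integers separated by commas
        sensors[key.strip()] = [int(i) for i in ids.split(",") if i.strip()]
    return sensors


# Hypothetical sensor layout, for illustration only
example = "Node=16,19,23,26;Socket0=16,23;Socket1=19,26"
print(parse_power_sensors(example))
```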

acct_gather_energy/rapl

Energy consumption data is collected from hardware sensors via the RAPL interface…

  • …requires CPUs with support for RAPL
  • …Linux msr module must be loaded (sudo modprobe msr)
  • …doesn’t read any options from acct_gather.conf

Limitations …RAPL energy data includes CPU, DRAM and cache energy

  • …poor precision of energy accounting measurements for short jobs
  • …depends on sampling rate JobAcctGatherFrequency and EnergyIPMIFrequency
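Why short jobs suffer can be seen with a deliberately simplified model: if energy is only observed at sample times, up to about one sampling interval of consumption may be attributed imprecisely, so the relative uncertainty grows as the runtime approaches the interval. This model is an assumption made for illustration, not a description of how Slurm computes job energy.

```python
def relative_sampling_uncertainty(sample_interval_s, job_runtime_s):
    """Rough upper bound on the share of a job's runtime that may fall
    between samples (simplified model, capped at 1.0)."""
    if sample_interval_s <= 0 or job_runtime_s <= 0:
        raise ValueError("interval and runtime must be positive")
    return min(1.0, sample_interval_s / job_runtime_s)


# Hypothetical 30 s sampling interval, jobs of 10 s, 1 min and 1 h
for runtime in (10, 60, 3600):
    print(runtime, relative_sampling_uncertainty(30, runtime))
```

With these numbers a 10-second job can fall entirely between two samples, while for a one-hour job the effect is below one percent.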

Example…

# configuration
>>> scontrol show config | grep ^AcctGather
AcctGatherEnergyType = acct_gather_energy/rapl
AcctGatherNodeFreq = 30 sec

# reporting of energy consumption on a node
>>> scontrol show node $node
...
CurrentWatts=121 LowestJoules=69447 ConsumedJoules=8726863
...
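For scripting, the energy fields can be pulled out of that output with a small parser. A minimal sketch that only assumes the `Name=<integer>` layout shown above; fields with a non-numeric value are simply skipped, and the function name is illustrative:

```python
import re


def parse_node_energy(scontrol_output):
    """Extract the integer energy fields from `scontrol show node` text."""
    pattern = r"(CurrentWatts|LowestJoules|ConsumedJoules)=(\d+)"
    return {name: int(value) for name, value in re.findall(pattern, scontrol_output)}


sample = "CurrentWatts=121 LowestJoules=69447 ConsumedJoules=8726863"
print(parse_node_energy(sample))
# {'CurrentWatts': 121, 'LowestJoules': 69447, 'ConsumedJoules': 8726863}
```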

ext_sensors Plugins

Collects energy and temperature data generated by out-of-band sensors (like wattmeters)…

  • …loaded by slurmctld on management node
  • …independently of the acct_gather plugins
  • …does not support power profiling or energy reporting for running jobs/steps via sstat

Configuration in slurm.conf

# enable plugins...
ExtSensorsType=ext_sensors/rrd
# Frequency of node energy sampling controlled by:
ExtSensorsFreq=<seconds>   # Default 0...disables node energy sampling

Plugins configured in dedicated ext_sensors.conf file

Energy Saving

Mechanism to save energy…

  • power save mode
    • …reduce power consumption by hibernation and dynamic voltage/frequency scaling
    • …power-off idle resources (nodes, individual devices like CPUs/GPUs)
  • power management (power caps)
    • …limit total power consumption across all nodes
    • …limit available power in advance to adapt to dynamic energy prices
    • …throttles resources/performance available to users for a given energy budget

Power Save Mode

https://slurm.schedmd.com/power_save.html

Mechanism to throttle CPUs and/or power down idle nodes

  • …nodes idle for a period of time placed in a power saving mode
  • …restored to normal operation once work is assigned
  • Power saving modes…
    • …use a cpufreq governor to limit CPU frequency and voltage
    • …power down and resume nodes

ResumeProgram= removes nodes from power saving mode…

  • …requires either wake-on-LAN or power-on over the IPMI interface
  • ResumeRate= limits the number of nodes resumed in parallel…
    • …prevent an instantaneous surge in power demand
    • …boot nodes gradually so that power demand rises step by step
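The effect of the rate limit can be estimated with simple arithmetic. The sketch below assumes ResumeRate is given in nodes per minute and that nodes are resumed at a steady rate; the per-node boot power is a hypothetical figure, and the function names are illustrative.

```python
import math


def minutes_to_resume(node_count, resume_rate_per_min):
    """Whole minutes needed to resume node_count nodes at a steady rate."""
    if resume_rate_per_min <= 0:
        raise ValueError("resume_rate_per_min must be positive")
    return math.ceil(node_count / resume_rate_per_min)


def added_watts_per_minute(resume_rate_per_min, boot_watts_per_node):
    """Upper bound on the power demand added during each minute."""
    return resume_rate_per_min * boot_watts_per_node


# Hypothetical cluster: 1000 powered-down nodes, 300 nodes/min, 400 W per node
print(minutes_to_resume(1000, 300))      # 4
print(added_watts_per_minute(300, 400))  # 120000
```

A lower rate stretches the ramp-up over more minutes and lowers the per-minute step in power demand accordingly.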

Power Management

https://slurm.schedmd.com/power_mgmt.html

Monitors actual power consumption…

  • …a power cap can be configured for the system
  • …dynamically re-allocates power available per node…
    • …based upon actual real-time usage
    • …evenly distributing power cap across all nodes
  • …optimizes throughput within power cap
    • …responds quickly to changes in application power consumption
    • …nodes using most of their power cap have the cap increased
    • …nodes with newly initiated jobs have power cap reset
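The redistribution idea can be sketched as a single rebalancing pass. This is a toy model and not Slurm's actual algorithm: the thresholds (`upper`, `lower`), the reduction `step`, and all names are hypothetical choices made for illustration.

```python
def rebalance_caps(caps, usage, total_cap, upper=0.95, lower=0.90, step=0.25):
    """One rebalancing pass; returns new per-node power caps in watts."""
    new_caps = dict(caps)
    # 1. Lower the cap of nodes using well below their cap (never below usage).
    for node, cap in caps.items():
        if usage[node] < lower * cap:
            new_caps[node] = max(usage[node], cap * (1 - step))
    # 2. Power still available under the system-wide cap.
    spare = total_cap - sum(new_caps.values())
    # 3. Share the spare power evenly among nodes running close to their cap.
    hungry = [n for n, cap in caps.items() if usage[n] >= upper * cap]
    if hungry and spare > 0:
        for node in hungry:
            new_caps[node] += spare / len(hungry)
    return new_caps


# Start from an even split of a 600 W system cap across two nodes
caps = {"n1": 300.0, "n2": 300.0}
usage = {"n1": 295.0, "n2": 100.0}  # hypothetical measured power draw
print(rebalance_caps(caps, usage, total_cap=600.0))
# {'n1': 375.0, 'n2': 225.0}
```

The busy node gains the power that the lightly loaded node gives up, while the sum of all caps stays within the system cap.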
