Linux Power Management

Published: May 23, 2022

Modified: August 11, 2022

Overview

Power management…

  • …reduces operational costs for energy
    • …lowers heat emission (decreases load on cooling facilities)
    • …increases longevity of devices (panels, drives, etc.)
    • …increases battery life (on mobile devices)
  • …balances performance requirements against power-saving configuration

Monitoring Levels…

  • Facility level…
    • …PDUs (Power Distribution Units)
    • …individual power sockets
    • …no per node measurements
  • System-level…
    • …typically vendor specific sensor readings from motherboards
    • …no per application measurements
    • …no cause suggestion (CPU, RAM, I/O…)
  • Application-level…
    • …supported by the operating system
    • …details on CPU, RAM, I/O resources
    • …no per function measurements
  • Function-level…
    • …cost of computations
    • …cost of memory access
    • …cost of I/O operations

Hardware Support

Hardware may support power regulation…

  • Display…blanking, dimming, power-save mode
  • Graphics device…power down
  • Storage device…spin down, power down
  • Network devices…wake on LAN
  • USB…wake on access
  • BIOS…
    • …ACPI platform behavior
    • …CPU power management
    • …chassis & CPU fan throttling
    • …thermal configuration limits

Energy Measurement

Informally, the terms energy and power are often used interchangeably…

  • …but they have distinct technical definitions
  • Energy
    • …quantity that represents the capacity to perform work
    • …standard (SI) unit of energy is the joule
  • Power
    • …rate at which energy is consumed (transferred or converted)
    • …standard unit of power is the watt
    • …1 watt = 1 joule/second
  • …generally advised…
    • …joules for small, time-limited measurements (energy usage of a function)
    • …watts for longer or indefinitely running programs/systems
  • Electrical energy…
    • …often expressed in units of kilowatt-hours (kWh)
    • …1 kWh = 1000 watts for 3600 seconds = 3.6 megajoules
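
The relation between the units can be sketched in a few lines of shell. This is only an illustration, assuming awk is installed; the function names energy_joules and joules_to_kwh are made up for this example.

```shell
# energy_joules WATTS SECONDS -> energy in joules (energy = power x time)
energy_joules() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.0f\n", p * t }'
}

# joules_to_kwh JOULES -> energy in kWh (1 kWh = 3.6 MJ)
joules_to_kwh() {
  awk -v e="$1" 'BEGIN { printf "%.3f\n", e / 3600000 }'
}

energy_joules 1000 3600    # 1000 W sustained for one hour, prints 3600000
joules_to_kwh 3600000      # prints 1.000
```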

Three measurement techniques:

  • Instant power measurement…
    • …requires some sort of physical instrumentation
    • Attached ammeter or an internal measurement of a component…
    • …accuracy depends on the sampling frequency
    • Implemented by the hardware vendor of the corresponding component
  • Time measurement…
    • …based on the drainage of the energy stored in a battery
    • …suitable to cross-validate findings from the other two techniques
  • Model estimation…
    • …no dependency on physical instrumentation
    • …possible to isolate the energy consumption of a single application
    • …requires one of the other two techniques for calibration

Power Consumption

…in Watts (unit of power)…

  • …varies widely over time depending on workload
  • …changes when a process transitions from being idle to running…
  • …in other words when CPU, RAM, I/O resources are allocated
  • …reconfiguration of hardware can change power consumption

Power Supply Unit

PSU (Power Supply Unit)

  • …sources power from the primary source (wall outlet)
  • …converts AC (Alternating Current) power to DC (Direct Current) power
  • …delivers power to the motherboard (all connected components)
  • …voltage the components need varies from 3.3V to 12V
  • Two types…
    • Linear power supplies…built-in transformer that steps down the voltage
    • Switch-mode power supply…uses switches for voltage regulation

Power supply efficiency rating…

  • …AC/DC conversion…some power is lost and converted to heat
  • …power supply must reach at least 80% efficiency at 20%, 50% and 100% load to be certified

80 Plus certification levels…

                        Efficiency
Certification Levels    at 10% Load     at 20% Load     at 50% Load     at 100% Load
80 PLUS                 —               80%             80%             80%
80 PLUS Bronze          —               82%             85%             82%
80 PLUS Silver          —               85%             88%             85%
80 PLUS Gold            —               87%             90%             87%
80 PLUS Platinum        —               90%             92%             89%
80 PLUS Titanium        90%             92%             94%             90%
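
What an efficiency rating means for the power drawn at the wall can be estimated with a small calculation. The sketch below is illustrative only; the function names wall_power and wasted_power are invented here, and the efficiency is assumed to be constant at the given load.

```shell
# wall_power DC_WATTS EFFICIENCY_PERCENT -> AC power drawn from the outlet
wall_power() {
  awk -v dc="$1" -v eff="$2" 'BEGIN { printf "%.1f\n", dc * 100 / eff }'
}

# wasted_power DC_WATTS EFFICIENCY_PERCENT -> power turned into heat by the PSU
wasted_power() {
  awk -v dc="$1" -v eff="$2" 'BEGIN { printf "%.1f\n", dc * 100 / eff - dc }'
}

wall_power 400 80      # 400 W DC load at 80% efficiency, prints 500.0
wasted_power 400 80    # prints 100.0
```

The same 400 W load on a unit running at 90% efficiency would draw about 444 W, so the difference between certification levels shows up directly as heat.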

Power supply labels…

  • AC input…
    • …input voltage, current and frequency the PSU accepts for conversion into Direct Current
    • …for example 120-220V, 10A-5A, 50Hz-60Hz
  • DC output…
    • …range of voltages supplied to the motherboard (all connected hardware)
    • …for example +3.3V & +5V/22A/130W, +12V/50A/408W
    • …maximum combined wattage, e.g. 500W

dmidecode prints information on the power supply units…

>>> dmidecode -t 39
Handle 0x004D, DMI type 39, 22 bytes
System Power Supply
        Power Unit Group: 1
        Location: Upper Slot
        ...
        Max Power Capacity: 2000 W
        Status: Not Present
        Type: Switching
        Input Voltage Range Switching: Auto-switch
        Plugged: Yes
        Hot Replaceable: No
        ...
Handle 0x004E, DMI type 39, 22 bytes
System Power Supply
        Power Unit Group: 2
        Location: Lower Slot
        ...

powerstat

…tool to measure power consumption

  • https://github.com/ColinIanKing/powerstat
  • Measures from…
    • …battery power source
    • …RAPL (Running Average Power Limit) interface
  • Calculates…
    • …average,
    • …standard deviation
    • …minimum & maximum
    • …geometric mean

powerjoular

Battery Power

Capacity…

  • … less than 75% is usually a sign that you should renew your battery
  • Wh (Watt hour)…used to measure its capacity
    • Watts x Hours = Wh…measurement for energy (power sustained over time)
    • 1250 Wh battery…maximum 100 watts for 12.50 hours
    • 60 W light bulb stays on for two hours…uses 120 Wh of energy
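
The runtime arithmetic above as a rough sketch, assuming a constant power draw; runtime_hours is a name made up for this example.

```shell
# runtime_hours CAPACITY_WH DRAW_WATTS -> hours until the stored energy is used up
runtime_hours() {
  awk -v wh="$1" -v w="$2" 'BEGIN { printf "%.2f\n", wh / w }'
}

runtime_hours 1250 100   # prints 12.50
```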

sysfs

sysfs values for battery current and voltage…

# current battery values... may include power_now
cat /sys/class/power_supply/BAT*/*_now
# ...otherwise multiply current_now (µA) by voltage_now (µV)
# from that directory; the product is in pW, so divide by 10^12
awk -v i="$(cat /sys/class/power_supply/BAT0/current_now)" \
    -v u="$(cat /sys/class/power_supply/BAT0/voltage_now)" \
    'BEGIN { printf "%.1f W\n", i * u / 1e12 }'

Battery charge in percent…

cat /sys/class/power_supply/BAT0/capacity

upower

…display the battery status

>>> battery=$(upower -e | grep 'BAT') && echo $battery 
/org/freedesktop/UPower/devices/battery_BAT0
>>> upower -i $battery | grep -e state -e percent -e capacity
    state:               charging   # or fully-charged
    percentage:          63%        # amount of energy left in the power source
    capacity:            67,6017%   # capacity of the battery will reduce with age
>>> upower -i $battery | grep energy                         
    energy:              31,616 Wh  # energy currently available in the power source
    energy-empty:        0 Wh
    energy-full:         45,9724 Wh # energy in the power source when it's considered full
    energy-full-design:  68,0048 Wh # energy the power source is designed to hold
    energy-rate:         23,256 W   # energy being drained from the source
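
The capacity value reported by upower is the ratio of energy-full to energy-full-design. A minimal sketch of that calculation follows; battery_health is a hypothetical helper name, and the numbers must be passed with a decimal point (the output above uses a locale with decimal commas).

```shell
# battery_health ENERGY_FULL ENERGY_FULL_DESIGN -> remaining capacity in percent
battery_health() {
  awk -v full="$1" -v design="$2" 'BEGIN { printf "%.1f\n", full * 100 / design }'
}

battery_health 45.9724 68.0048   # values from the upower output above, prints 67.6
```

On batteries whose driver provides them, the same two values may be available as energy_full and energy_full_design under /sys/class/power_supply/BAT0/; that is an assumption about the hardware, other drivers expose charge_full instead.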

RAPL Power Meter

RAPL (Running Average Power Limit)…

  • …based on on-chip power sensors
  • …interface for exposing power meters and power limits
  • …exposed through MSRs and the PCI Express config space
  • …used by turbostat and powertop

Since Intel Sandy Bridge generation (released in 2011)…

  • …developed in combination with Dynamic Voltage and Frequency Scaling
  • Measurements can encompass four domains…
    • …package (total power consumption)
    • …(CPU) core
    • …uncore (neither core nor DRAM, e.g. integrated GPU)
    • …DRAM
  • Power domain…
    • …exposed with an MSR (Model-Specific Register)
    • Uses energy units (each corresponds to a processor-dependent energy value in joules)…
    • …updated approximately every millisecond

Linux Power Capping Framework
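
The powercap framework exposes the RAPL energy counters through sysfs. The sketch below derives average power from two counter readings; rapl_watts is a name invented for this example, and the zone path in the comments (intel-rapl:0, usually the first package) is an assumption that depends on the system. Reading energy_uj may require root.

```shell
# rapl_watts E0_UJ E1_UJ SECONDS MAX_RANGE_UJ -> average power in watts
# E0/E1 are two readings of energy_uj (microjoules) taken SECONDS apart.
rapl_watts() {
  awk -v e0="$1" -v e1="$2" -v dt="$3" -v max="$4" 'BEGIN {
    d = e1 - e0
    if (d < 0) d += max          # counter wrapped around between the readings
    printf "%.2f\n", d / 1000000 / dt
  }'
}

rapl_watts 1000000 16000000 1 262143328850   # 15 J in 1 s, prints 15.00

# Possible use on a real system (paths are assumptions, adjust the zone):
#   zone=/sys/class/powercap/intel-rapl:0
#   e0=$(cat "$zone/energy_uj"); sleep 1; e1=$(cat "$zone/energy_uj")
#   rapl_watts "$e0" "$e1" 1 "$(cat "$zone/max_energy_range_uj")"
```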

IPMI/BMC Power Sensors

IPMI (Intelligent Platform Management Interface)…packages {ipmitool,freeipmi}*.rpm

  • …only aggregate and approximate data available
  • …built-in component power sensors are vendor specific

ipmi-dcmi & ipmi-oem

ipmi-oem…used to execute OEM specific IPMI commands…

  • …no guarantees that commands work on any particular motherboard
  • …consult vendor documentation for details
  • ipmi-oem --list to print supported OEM IDs and commands
>>> ipmi-oem intelnm get-node-manager-statistics mode=globalpower
Current Power                                 : 277 Watts
Minimum Power                                 : 40 Watts
Maximum Power                                 : 665 Watts
Average Power                                 : 403 Watts
...
>>> ipmi-oem dell get-power-consumption-data
...
Cumulative Energy            : 199.93
...
Peak Amp                     : 0.90 A
...
Peak Watt                    : 197 W

DCMI (Data Center Manageability Interface)…

  • …read power statistics…get/set power limits
    • …cannot directly access most vendor specific sensors
    • …provides low-resolution data from supported motherboards
  • …builds on top of IPMI 2.0
    • …introduces power monitoring sensor requirement
    • …offers power sampling rates on the order of seconds in the best case
# check if power management is supported...
>>> ipmi-dcmi --get-dcmi-capability-info | grep -i power
Power Management / Monitoring Support              : Available
Power Management Device Slave Address              : 10h
Power Management Controller Device Revision        : 1
Power Management Controller Channel Number         : 0
# ...system power statistics
>>> ipmi-dcmi --get-system-power-statistics | grep -i power
Current Power                        : 36 Watts
Minimum Power over sampling duration : 36 watts
Maximum Power over sampling duration : 270 watts
Average Power over sampling duration : 39 watts
Power Measurement                    : Active
# ...statistics for different time periods
>>> ipmi-dcmi --get-enhanced-system-power-statistics
# alternative
>>> ipmitool dcmi power reading
Instantaneous power reading:                    36 Watts
Minimum during sampling period:                 36 Watts
Maximum during sampling period:                270 Watts
Average power reading over sample period:       39 Watts
...

Prometheus IPMI-Exporter

IPMI Exporter, GitHub

https://github.com/prometheus-community/ipmi_exporter

Uses FreeIPMI ipmi-dcmi --get-system-power-statistics

>>> curl -s localhost:9290/metrics | grep dcmi
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts.
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 54
ipmi_up{collector="dcmi"} 1

Power Management Subsystem

Provides a unified sysfs interface located in the /sys/power/ directory

ACPI Power States

Linux supports two power management implementations…

  • APM - Advanced Power Management (deprecated)
  • ACPI - Advanced Configuration and Power Interface
    • …CPU & device power management
    • …thermal management (fans)
    • …button & lid events
    • …power sources (AC/battery)

Global system states (Gx states)…

Gx State        Software runs   Latency   Power consumption   OS restart   Safe disassemble
G0 Working      Yes             0         Large               No           No
G1 Sleeping     No              >0        Smaller             No           No
G2 Soft Off     No              Long      Very near 0         Yes          No
G3 Power Off    No              Long      RTC battery         Yes          Yes

S-States & D-States

S-States (System States)…

  • …platform-wide power state transitions
  • S1
    • …CPU stops executing, cache is flushed
    • …power to CPU and RAM is maintained
  • S2…deeper sleep…CPU is powered off
  • S3…standby…suspend-to-RAM
  • S4…hibernation…suspend-to-disk
  • S5…soft power off…power for wake-up event maintained

D-States (Device States)…

Sleep States

Sleep…low-power states of the entire system…

  • …user space code cannot be executed
  • …system activity is significantly reduced
  • /sys/power/state lists the supported sleep states…
    • freeze for suspend-to-idle
    • standby
    • mem for suspend-to-RAM
    • disk for hibernation

Four system sleep states…

  • Suspend-to-Idle…
    • …light-weight variant of system suspend
    • More energy saved relative to runtime idle…
    • …freezing user space
    • …suspending timekeeping
    • …all I/O devices in low-power states
    • Woken up by in-band interrupts (including devices)
  • Standby…
    • …offers moderate energy savings…
    • …no operating state is lost
    • Like Suspend-to-Idle plus…
    • …nonboot CPUs are taken offline
    • …all low-level system functions are suspended
    • Rely on platform for wakeup functionality
  • Suspend-to-RAM
    • …offers significant energy savings…
    • …everything put into a low-power state except memory
    • State of devices and CPUs is saved and held in memory
    • Peripheral buses may lose power (depends on configuration)
  • Hibernation…
    • …offers the greatest energy savings
    • Stops all system activity and creates a snapshot image of memory…
    • …image written into persistent storage
    • Power is cut from almost all of its hardware components…
    • …including memory…
    • …except for a limited set of wakeup devices
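
A small sketch that maps the keywords found in /sys/power/state to the sleep states described above. The helper name describe_sleep_states is made up, and the mapping only follows the labels used in these notes.

```shell
# describe_sleep_states "KEYWORDS" -> one line per keyword with its sleep state
describe_sleep_states() {
  for s in $1; do                # intentionally unquoted: split on whitespace
    case "$s" in
      freeze)  echo "freeze: suspend-to-idle" ;;
      standby) echo "standby: standby" ;;
      mem)     echo "mem: suspend-to-RAM" ;;
      disk)    echo "disk: hibernation" ;;
      *)       echo "$s: unknown" ;;
    esac
  done
}

describe_sleep_states "freeze mem disk"
# On a real system: describe_sleep_states "$(cat /sys/power/state)"
```

Writing one of the listed keywords to /sys/power/state as root (for example echo mem > /sys/power/state) starts the corresponding transition.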

C-States (Idle)

Processor idle state…

  • …execution of a program is suspended
  • …part of the processor hardware is not used
  • …allows power drawn by the processor to be reduced
  • Special “idle” task…
    • …runnable if there are no other runnable tasks assigned to a CPU
    • …cause the processor to be put into one of its idle states
  • …idle task executes the idle loop
    • Calls the CPUIdle governor to select an idle state…
    • …runs the driver to put the hardware into that idle state

An idle CPU deactivates components or uses lower performance settings (known as C-states)

  • Reflect the capability of an idle CPU to turn off unused components
  • …downside is that they introduce latency (more time to go back to C0)
  • …disable deepest sleep states to increase overall performance
  • Deeper sleep states can save large amounts of energy…
  • Kernel command line argument processor.max_cstate=0 disables sleep

Table of CPU C-States…(incomplete)

C0     Operation        CPU fully turned on
C1     Halt             Stops main CPU internal clock with software...
C1E    Enhanced Halt    ...reduce CPU voltage
C2     Stop Clock       Stops main CPU internal and external clock via hardware
C3     Deep Sleep       Stops all CPU internal and external clocks...
C4     Deeper Sleep     ...reduce CPU voltage
C6     Deep Power Down  Reduce CPU internal voltage (including to 0V)
C7     Deep Energy Save Flush L3 cache and cut power if able

Driver

Currently loaded kernel driver for CPUIdle and its governor…

>>> cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle
>>> cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
  • acpi_idle
    • …retrieves available sleep states (C-states) from the ACPI BIOS tables
  • intel_idle
    • …serves recent Intel CPUs (Nehalem, Westmere, Sandy Bridge, Atoms or newer)
    • Knows the sleep state capabilities of the processor and ignores ACPI BIOS
    • Presents the kernel with the duration of the target residency and exit latency…
    • …used by CPU idle menu governor to predict how long the CPU will be idle

Latency

>>> cpupower idle-info
...
Number of idle states: 5
Available idle states: POLL C1 C1E C3 C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 4712
Duration: 81820
...
C6:
Flags/Description: MWAIT 0x20
Latency: 133
Usage: 51679556
Duration: 779100553418
  • POLL…not a real idle state…
    • …used if the kernel knows that work has to be processed very soon
    • …entering real hardware idle state may result in a performance penalty
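
The values printed by cpupower idle-info are also available per CPU in sysfs. The following is a sketch under the assumption that the cpuidle directory of CPU 0 exists at the default path; list_idle_states is a name chosen for this example.

```shell
# list_idle_states [DIR] -> name, exit latency (µs) and usage count per idle state
list_idle_states() {
  base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
  for d in "$base"/state*; do
    [ -r "$d/name" ] || continue     # skip if the glob matched nothing
    printf '%s latency=%sus usage=%s\n' \
      "$(cat "$d/name")" "$(cat "$d/latency")" "$(cat "$d/usage")"
  done
}

# On a real system: list_idle_states
```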

Relative use of specific C-States…

>>> cpupower monitor -m Idle_Stats
              | Idle_Stats
 PKG|CORE| CPU| POLL | C1   | C1E  | C3   | C6
   0|   0|   0|  0.00|  0.00|  0.00|  0.00| 96.72
   0|   0|  28|  0.00|  0.11|  0.19|  0.26| 97.58
...

CPU Frequency Scaling

Modern processors…

  • …capable of operating in a number of different clock frequency and voltage configurations
  • Referred to as Operating Performance Points or P-states (in ACPI terminology)
  • Higher clock frequency and higher voltage…
    • …more instructions executed over a unit of time
    • …more energy is consumed over a unit of time
  • Trade-off between CPU capacity and power drawn by the CPU
  • Hardware interfaces allow CPUs to be switched between different frequency/voltage configurations
  • …used along with algorithms to estimate the required CPU capacity

CPUFreq (CPU Frequency scaling)…Linux kernel sub-system…

  • Linux Kernel Guide - CPU Performance Scaling
  • …supports CPU performance scaling…
    • Scaling governors…estimate the required CPU capacity
    • Scaling drivers…interface with the hardware
  • Scaling algorithms for P-state selection are platform-independent…
    • …in the majority of cases, unless…
    • …the algorithms are based on information provided by the hardware itself
  • sysfs interface located in /sys/devices/system/cpu/

Scaling governors

…algorithms to compute the desired CPU frequency

Linux CPUFreq Governors…https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt

  • performance…maximum CPU frequency…no power saving benefit
    • Highest possible clock frequency…statically set within the scaling_min_freq/scaling_max_freq limits
    • Suitable for hours of a heavy workload…CPU is rarely or never idle
  • powersave…minimum CPU frequency…maximum power savings
    • Lowest possible clock frequency…statically set within the scaling_min_freq/scaling_max_freq limits
    • More of a speed limiter for the CPU than a power saver
    • Useful in systems and environments where overheating can be a problem
  • ondemand…CPU frequency dynamically according to current load
    • Maximum clock frequency on load…minimum clock frequency on idle
    • At the expense of latency between frequency switching
    • Compromise between heat emission, power consumption, performance, and manageability
  • conservative…similar to ondemand
    • …switches between frequencies more gradually
    • Adjusts to a clock frequency that it considers best for the load
    • Even greater latency than the ondemand governor
  • userspace…application specified CPU frequencies (requires root)
  • schedutil…scheduler-driven CPU frequency
# list available governors... 
cpupower frequency-info --governors
# set the governor temporarily
cpupower frequency-set --governor powersave
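
The active governor of each CPU can also be read from sysfs. A sketch that counts CPUs per governor follows; governor_summary is a made-up name, and the default path assumes the usual cpufreq sysfs layout.

```shell
# governor_summary [DIR] -> "GOVERNOR COUNT" for every governor currently in use
governor_summary() {
  base="${1:-/sys/devices/system/cpu}"
  cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# On a real system: governor_summary
```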

Scaling drivers

…implement the CPU-specific details of setting frequencies

# list modules supported by the kernel
ls /usr/lib/modules/$(uname -r)/kernel/drivers/cpufreq/

Commonly used modules…

  • acpi_cpufreq utilizes the ACPI Processor Performance States
  • intel_pstate…Intel CPUs Sandy Bridge and newer CPUs
  • amd_pstate…AMD Ryzen (some Zen 2 and newer) processors

Run-time configuration…/sys/devices/system/cpu/cpu*/cpufreq/scaling*

# CPU model
>>> grep 'model name' /proc/cpuinfo | uniq
model name      : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
# driver used by CPUfreq
>>> cpupower frequency-info --driver
analyzing CPU 0:
  driver: intel_cpufreq

P-States & T-State

T-States (Throttling the CPU through ACPI)…

  • …legacy mechanism, used before frequency scaling and ACPI “C” states were available
  • Does not decrease clock frequency…
  • …may interfere with the CPU reaching the higher C states

P-States (Processor Performance States)…CPU frequency…

  • …operational states that relate to CPU frequency and voltage
  • Higher P-state…lower frequency and voltage (power consumption)
    • P0 - Always the highest-performance state
    • P1 to Pn incrementally reduce processor speed

CPU Frequency

turbostat

turbostat (package kernel-tools*.rpm) columns concerning CPU frequency…

>>> turbostat -qn1 -s Package,Core,CPU,Avg_MHz,Busy%,Bzy_MHz,TSC_MHz
Package Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz
-       -       -       7       0.44    1691    2400
0       0       0       16      1.10    1482    2400
0       0       28      50      3.74    1324    2400
0       1       1       14      0.92    1555    2400
0       1       29      39      2.10    1866    2400
...
1       0       14      1       0.10    1422    2400
1       0       42      2       0.15    1498    2400
...
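  • Avg_MHz…average frequency over the whole interval (idle time included), based on the APERF MSR
  • Busy%…share of the interval the CPU spent in C0 (not idle), in percent
  • Bzy_MHz…average frequency while the CPU was busy, based on the APERF/MPERF ratio
  • TSC_MHz…fixed frequency of the TSC (Time Stamp Counter)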
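
The columns are related: the average frequency is roughly the busy frequency weighted by the busy share. A quick check against the rows above, with avg_mhz as a name made up for this sketch:

```shell
# avg_mhz BUSY_PERCENT BZY_MHZ -> approximate Avg_MHz
avg_mhz() {
  awk -v busy="$1" -v bzy="$2" 'BEGIN { printf "%.0f\n", busy * bzy / 100 }'
}

avg_mhz 1.10 1482   # CPU 0 in the output above, prints 16
avg_mhz 3.74 1324   # CPU 28 in the output above, prints 50
```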

powertop

PowerTOP Project…

https://github.com/fenrus75/powertop

  • …estimate of the total power usage of the system
  • …show individual power usage for each…
    • …process
    • …device
    • …kernel task, timer, interrupt
# calibrate the power estimation engine
sudo powertop --calibrate
# ...running on battery power if working with mobile device

Use Tab and Shift+Tab to cycle through tabs…

  • Idle stats…C-states for all processors and cores
  • Frequency stats… P-states including the Turbo mode
  • Device Stats…
  • Tunables…suggestions for optimizing the system for lower power consumption

Power-Profile Services

tuned

…pronounced “tune-D”

Configuration

tuned.service optimizes the performance profile of a node…

# typically installed by default
sudo dnf install -y tuned tuned-utils
sudo systemctl enable --now tuned

Relevant directories…

/etc/tuned/tuned-main.conf         # global configuration
man 5 tuned-main.conf
/etc/tuned/**/tuned.conf           # custom profiles 
/usr/lib/tuned                     # distribution-specific profiles
/var/log/tuned                     # log files

Profiles

Manage profiles with the tuned-adm commands:

  • Profiles in two broad categories…
    • power saving
    • performance-boosting (high-throughput, low latency)
  • Predefined profiles…packages tuned-profiles*
tuned-adm active     # current active profile
          list       # available profiles
          recommend  # most suitable profile
          verify     # is the active profile applied?

# activate a combination of multiple profiles...
tuned-adm profile virtual-guest powersave # ...for example on a virtual machine
reboot

Profile customisation… (do not modify /usr/lib/tuned)

# modify an existing profile...
cp -r /usr/lib/tuned/powersave /etc/tuned
vim /etc/tuned/powersave/tuned.conf

Create a new profile…

# site specific name recommended...
>>> mkdir /etc/tuned/site-powersave
>>> cat /etc/tuned/site-powersave/tuned.conf
[main]
include=powersave
# customize by overrides
...

powertop integration…

dnf install -y tuned-utils powertop
powertop2tuned site-profile
# enable what you need by uncommenting lines
vim /etc/tuned/site-profile/tuned.conf
tuned-adm profile site-profile

Plugins

Get a list of available plugins

rpm -ql tuned | grep 'plugins/plugin_.*\.py$'

Two types of plugins…

  • Monitoring plugins can be used by tuning plugins for dynamic tuning
    • …automatically instantiated whenever their metrics are needed
    • disk…disk load (number of I/O operations) per device
    • net…network load (number of transferred packets) per network card
    • load…CPU load per CPU
  • Tuning plugins tune an individual subsystem
    • …can have multiple devices (wildcards to match all devices)
    • cpu…sets the CPU governor
    • net…wake-on-lan to the values…speed according to the interface utilization
    • sysctl…various settings specified by the plugin parameters
    • usb…autosuspend timeout of USB devices
    • vm…transparent huge pages
    • audio…autosuspend timeout for audio codecs
    • disk…ALPM, ASPM…disk spindown timeout
    • mounts…barriers for mounts
    • script…execution of an external script
    • sysfs…various settings specified by the plugin parameters
    • video…powersave levels on video cards
    • bootloader…kernel boot command line…Grub configuration

tuned.conf contains sections to configure plugin instances…

# name of the plugin instance
[PLUGIN]                   
# type of the tuning plugin
type=TYPE
# list of devices...can contain a list, wildcard (*), negation (!)
devices=DEVICES
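
A sketch of how a small custom profile with one plugin section could be generated. It assumes, as in the profiles shipped with TuneD, that a section named after the plugin ([cpu]) needs no explicit type= line; write_site_profile is a hypothetical helper, and the default directory reuses the site-powersave name from these notes.

```shell
# write_site_profile [DIR] -> creates DIR/tuned.conf extending the powersave profile
write_site_profile() {
  dir="${1:-/etc/tuned/site-powersave}"
  mkdir -p "$dir" || return 1
  cat > "$dir/tuned.conf" <<'EOF'
[main]
include=powersave

[cpu]
governor=ondemand|powersave
EOF
}

# As root: write_site_profile && tuned-adm profile site-powersave
```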

cpu Plugin

Modern CPUs capable of operating at different clock frequency and voltage configurations

  • …higher clock frequency requires a higher voltage (and thermal heat output)
  • …trade-off between the CPU capacity and power consumption
  • Referred to as Operating Performance Points or P-states (in ACPI terminology)
  • Linux kernel offers CPU performance scaling via the CPUFreq subsystem

The cpu TuneD plugin options…

  • governor…sets the CPU scaling governor
    • Multiple governors are separated using | (represents or)…
    • …set the first governor that is available on the system
  • sampling_down_factor…sampling rate…
    • …determines how frequently the governor checks to tune the CPU
    • Recommended setting for jitter reduction, values 1 to 100000
  • energy_perf_bias…Energy Performance Bias (EPB)
    • …energy vs. performance policy via x86 Model Specific Registers
  • Read the plugin help text for more information…
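
The fallback semantics of the governor option can be illustrated in shell. This is a sketch of the behaviour described above, not TuneD's actual implementation; pick_governor is an invented name, and the second argument stands for the space-separated content of a scaling_available_governors file.

```shell
# pick_governor "A|B|..." "AVAILABLE GOVERNORS" -> first wanted governor that is available
pick_governor() {
  wanted="$1"
  available="$2"
  old_ifs="$IFS"
  IFS='|'                          # split the wanted list on the | separator
  for g in $wanted; do
    case " $available " in
      *" $g "*)
        IFS="$old_ifs"
        echo "$g"
        return 0
        ;;
    esac
  done
  IFS="$old_ifs"
  return 1                         # none of the wanted governors is available
}

pick_governor "conservative|powersave" "performance powersave"   # prints powersave
```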

Configuration in default tuned profiles…

>>> grep -R -e '^governor' /usr/lib/tuned/
/usr/lib/tuned/accelerator-performance/tuned.conf:governor=performance
/usr/lib/tuned/balanced/tuned.conf:governor=conservative|powersave
/usr/lib/tuned/latency-performance/tuned.conf:governor=performance
/usr/lib/tuned/powersave/tuned.conf:governor=ondemand|powersave
/usr/lib/tuned/throughput-performance/tuned.conf:governor=performance