Linux Power Management
Overview
Power management…
- …reducing operational costs for energy
- …lower heat emission (decrease load on cooling facilities)
- …increase longevity of devices (panels, drives, etc.)
- …increase battery life (on mobile devices)
- Balance performance requirements and power saving configuration
Monitoring Levels…
- Facility level…
- …PDUs (Power Distribution Units)
- …individual power sockets
- …no per node measurements
- System-level…
- …typically vendor specific sensor readings from motherboards
- …no per application measurements
- …no cause suggestion (CPU, RAM, I/O…)
- Application-level…
- …supported by the operating system
- …details on CPU, RAM, I/O resources
- …no per function measurements
- Function-level…
- …cost of computations
- …cost of memory access
- …cost of I/O operations
Hardware Support
Hardware may support power regulation…
- Display…blanking, dimming, power-save mode
- Graphics device…power down
- Storage device…spin down, power down
- Network devices…wake on LAN
- USB…wake on access
- BIOS…
- …ACPI platform behavior
- …CPU power management
- …chassis & CPU fan throttling
- …thermal configuration limits
Energy Measurement
Informally, the terms energy and power are often used interchangeably…
- …but they have distinct technical definitions
- Energy
- …quantity that represents the capacity to perform work
- …standard (SI) unit of energy is the joule
- Power
- …rate at which energy is consumed (transferred or converted)
- …standard unit of power is the watt
- …1 watt = 1 joule/second
- …generally advised…
- …Joule for small and time limited measurements (energy usage of a function)
- …Watt is better suited for longer or indefinitely running programs/systems
- Electrical energy…
- …often expressed in units of kilowatt-hours (kWh)
- …1 kWh = 1000 watts for 3600 seconds = 3.6 megajoules
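The relation between power and energy above can be checked with a few lines of shell. This is a minimal sketch; the function names are made up for illustration, and awk is used only because the shell lacks floating-point arithmetic.

```shell
#!/usr/bin/env bash
# energy (J) = power (W) x time (s)
joules_from_watts() {   # usage: joules_from_watts WATTS SECONDS
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.0f\n", p * t }'
}

# 1 kWh = 1000 W x 3600 s = 3600000 J
kwh_from_joules() {     # usage: kwh_from_joules JOULES
    awk -v e="$1" 'BEGIN { printf "%.3f\n", e / 3600000 }'
}

joules_from_watts 1000 3600   # 1 kW for one hour, prints 3600000
kwh_from_joules 3600000       # prints 1.000
```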
Three measurement techniques:
- Instant power measurement…
- …require some sort of physical instrumentation
- Attached ampere meter or an internal measurement of a component…
- …accuracy depends on the sampling frequency
- Implemented by the hardware vendor of the corresponding component
- Time measurement…
- …based on the drainage of the energy stored in a battery
- …suitable to cross-validate findings from the other two techniques
- Model estimation…
- …no dependency on physical instrumentation
- …possible to isolate the energy consumption of a single application
- …require one of the other two techniques for calibration
Power Consumption
…in Watts (unit of power)…
- …varies widely over time depending on workload
- …changes when a process transitions from being idle to running…
- …in other words when CPU, RAM, I/O resources are allocated
- …reconfiguration of hardware can change power consumption
Power Supply Unit
PSU (Power Supply Unit)
- …sources power from the primary source (wall outlet)
- …converts AC (Alternating Current) power to the DC (Direct Current) power
- …delivers power to the motherboard (all connected components)
- …voltage the components need varies from 3.3V to 12V
- Two types…
- Linear power supplies…built-in transformer that steps down the voltage
- Switch-mode power supply…uses switches for voltage regulation
Power supply efficiency rating…
- …AC/DC conversion…power is wasted and converted to heat
- …power supply must reach 80% efficiency to be certified
80 Plus certification levels…
Certification level | Efficiency at 10% load | at 20% load | at 50% load | at 100% load |
---|---|---|---|---|
80 PLUS | - | 80% | 80% | 80% |
80 PLUS Bronze | - | 82% | 85% | 82% |
80 PLUS Silver | - | 85% | 88% | 85% |
80 PLUS Gold | - | 87% | 90% | 87% |
80 PLUS Platinum | - | 90% | 92% | 89% |
80 PLUS Titanium | 90% | 92% | 94% | 90% |
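The efficiency rating translates directly into power drawn from the wall and power lost as heat. The sketch below assumes the simple relation input = output / efficiency; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# usage: psu_input_watts DC_OUTPUT_WATTS EFFICIENCY_PERCENT
psu_input_watts() {
    awk -v out="$1" -v eff="$2" 'BEGIN { printf "%.1f\n", out * 100 / eff }'
}

# usage: psu_loss_watts DC_OUTPUT_WATTS EFFICIENCY_PERCENT
psu_loss_watts() {
    awk -v out="$1" -v eff="$2" 'BEGIN { printf "%.1f\n", out * 100 / eff - out }'
}

# 250 W DC load on a supply that is 90% efficient at that load
psu_input_watts 250 90   # prints 277.8 (drawn from the wall)
psu_loss_watts 250 90    # prints 27.8 (converted to heat)
```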
Power supply labels…
- AC input…
- …amount of power the PSU can convert into Direct Current
- …for example 120-220V + 10A-5A + 50Hz-60Hz
- DC output…
- …range of voltage supplied to the motherboard (all connected hardware)
- …for example +3.3V & +5V/22A/130W, +12V/50A/408W
- …maximum combined wattage, e.g. 500W
dmidecode
prints information on the power supply units…
>>> dmidecode -t 39
Handle 0x004D, DMI type 39, 22 bytes
System Power Supply
Power Unit Group: 1
Location: Upper Slot
...
Max Power Capacity: 2000 W
Status: Not Present
Type: Switching
Input Voltage Range Switching: Auto-switch
Plugged: Yes
Hot Replaceable: No
...
Handle 0x004E, DMI type 39, 22 bytes
System Power Supply
Power Unit Group: 2
Location: Lower Slot
...
powerstat
…tool to measure power consumption
- https://github.com/ColinIanKing/powerstat
- Measures from…
- …battery power source
- …RAPL (Running Average Power Limit) interface
- Calculates…
- …average,
- …standard deviation
- …minimum & maximum
- …geometric mean
powerjoular
- Project page & source code…
Battery Power
Capacity…
- … less than 75% is usually a sign that you should renew your battery
- Wh (Watt hour)…used to measure its capacity
- Watts x Hours = Wh…measurement for power over time (an hour)
- 1250 Wh battery…maximum 100 watts for 12.50 hours
- 60 W light bulb stays on for two hours…use 120Wh of energy
sysfs
sysfs values for battery current and voltage…
# current battery values... may include power_now
cat /sys/class/power_supply/BAT*/*_now
# ...otherwise multiply current_now (µA) by voltage_now (µV)
# and divide by 10^12 to get the power in watts
awk -v i="$(cat /sys/class/power_supply/BAT0/current_now)" \
    -v u="$(cat /sys/class/power_supply/BAT0/voltage_now)" \
    'BEGIN { printf "%.1f W\n", i * u / 1e12 }'
Battery charge in percent…
cat /sys/class/power_supply/BAT0/capacity
upower
…display the battery status
>>> battery=$(upower -e | grep 'BAT') && echo $battery
/org/freedesktop/UPower/devices/battery_BAT0
>>> upower -i $battery | grep -e state -e percent -e capacity
state: charging # or fully-charged
percentage: 63% # amount of energy left in the power source
capacity: 67,6017% # capacity of the battery will reduce with age
>>> upower -i $battery | grep energy
energy: 31,616 Wh # energy currently available in the power source
energy-empty: 0 Wh
energy-full: 45,9724 Wh # energy in the power source when it's considered full
energy-full-design: 68,0048 Wh # energy the power source is designed to hold
energy-rate: 23,256 W # energy being drained from the source
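Two figures can be derived from the energy values: battery health (full capacity relative to design capacity) and the runtime at a given draw. A minimal sketch with hypothetical function names follows; the same numbers are usually also available in sysfs (energy_full and energy_full_design), although some battery drivers expose charge_* files instead.

```shell
#!/usr/bin/env bash
# usage: battery_health_percent ENERGY_FULL ENERGY_FULL_DESIGN  (same unit for both)
battery_health_percent() {
    awk -v full="$1" -v design="$2" 'BEGIN { printf "%.1f\n", full / design * 100 }'
}

# usage: battery_runtime_hours ENERGY_WH DRAW_WATTS
battery_runtime_hours() {
    awk -v e="$1" -v p="$2" 'BEGIN { printf "%.2f\n", e / p }'
}

battery_health_percent 45.9724 68.0048   # prints 67.6
battery_runtime_hours 1250 100           # prints 12.50
```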
RAPL Power Meter
RAPL (Running Average Power Limit)…
- …based on on-chip power sensors
- …interface for exposing power meters and power limits
- …exposed through MSRs and the PCI Express config space
- …used by turbostat and powertop
Since Intel Sandy Bridge generation (released in 2011)…
- …developed in combination with Dynamic Voltage and Frequency Scaling
- Measurements can encompass four domains…
- …package (total power consumption)
- …(CPU) core
- …uncore (neither core nor DRAM, e.g. integrated GPU)
- …DRAM
- Power domain…
- …exposed with an MSR (Model-Specific Register)
- Use energy units (correspond to a processor-dependent energy value in joules)…
- …updated approximately every millisecond
Linux Power Capping Framework…
- https://www.kernel.org/doc/html/latest/power/powercap/powercap.html
sysfs interface located in /sys/devices/virtual/powercap/intel-rapl
- Power zones intel-rapl:{0,1,...} represent CPU packages
- Powercap, Sysfs C Bindings and Utilities
- PowerAPI, Software-Defined Power Meters
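Average power can be derived from two readings of a zone's energy counter. The sketch below assumes the powercap files energy_uj (microjoules) and max_energy_range_uj (value at which the counter wraps), and handles at most one wrap-around between samples; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# usage: rapl_avg_watts START_UJ END_UJ MAX_RANGE_UJ SECONDS
rapl_avg_watts() {
    awk -v a="$1" -v b="$2" -v max="$3" -v t="$4" 'BEGIN {
        delta = b - a
        if (delta < 0) delta += max      # counter wrapped around once
        printf "%.2f\n", delta / 1000000 / t
    }'
}

# sample package 0 for one second (reading energy_uj usually requires root)
zone=/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0
if [ -r "$zone/energy_uj" ]; then
    start=$(cat "$zone/energy_uj"); sleep 1; end=$(cat "$zone/energy_uj")
    range=$(cat "$zone/max_energy_range_uj")
    echo "$(cat "$zone/name"): $(rapl_avg_watts "$start" "$end" "$range" 1) W"
fi
```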
IPMI/BMC Power Sensors
IPMI (Intelligent Platform Management Interface)…packages {ipmitool,freeipmi}*.rpm
- …only aggregate and approximate data available
- …built-in component power sensor vendor specific
ipmi-dcmi & ipmi-oem
ipmi-oem…used to execute OEM specific IPMI commands…
- …no guarantees that commands work on any particular motherboard
- …consult vendor documentation for details
- …ipmi-oem --list to print supported OEM IDs and commands
>>> ipmi-oem intelnm get-node-manager-statistics mode=globalpower
Current Power : 277 Watts
Minimum Power : 40 Watts
Maximum Power : 665 Watts
Average Power : 403 Watts
...
>>> ipmi-oem dell get-power-consumption-data
...
Cumulative Energy : 199.93
...
Peak Amp : 0.90 A
...
Peak Watt : 197 W
DCMI (Data Center Manageability Interface)…
- …read power statistics…get/set power limits
- …cannot directly access most vendor specific sensors
- …provides low-resolution data from supported motherboards
- …builds on top of IPMI 2.0
- …introduces power monitoring sensor requirement
- …offer power sampling rates on the order of seconds in the best case
# check if power management is supported...
>>> ipmi-dcmi --get-dcmi-capability-info | grep -i power
Power Management / Monitoring Support : Available
Power Management Device Slave Address : 10h
Power Management Controller Device Revision : 1
Power Management Controller Channel Number : 0
# ...system power statistics
>>> ipmi-dcmi --get-system-power-statistics | grep -i power
Current Power : 36 Watts
Minimum Power over sampling duration : 36 watts
Maximum Power over sampling duration : 270 watts
Average Power over sampling duration : 39 watts
Power Measurement : Active
# ...statistics for different time periods
>>> ipmi-dcmi --get-enhanced-system-power-statistics
# alternative
>>> ipmitool dcmi power reading
Instantaneous power reading: 36 Watts
Minimum during sampling period: 36 Watts
Maximum during sampling period: 270 Watts
Average power reading over sample period: 39 Watts
...
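For scripting, the current reading can be extracted from the ipmi-dcmi output shown above. This is a sketch that assumes the "Current Power : N Watts" line format from that sample output; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# usage: ipmi-dcmi --get-system-power-statistics | dcmi_current_watts
dcmi_current_watts() {
    # take the value after the colon and strip everything but digits
    awk -F: '/^Current Power/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}
```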
Prometheus IPMI-Exporter
IPMI Exporter, GitHub
https://github.com/prometheus-community/ipmi_exporter
Uses FreeIPMI ipmi-dcmi --get-system-power-statistics…
>>> curl -s localhost:9290/metrics | grep dcmi
# HELP ipmi_dcmi_power_consumption_watts Current power consumption in Watts.
# TYPE ipmi_dcmi_power_consumption_watts gauge
ipmi_dcmi_power_consumption_watts 54
ipmi_up{collector="dcmi"} 1
Power Management Subsystem
Provides a unified sysfs interface located in the /sys/power/ directory
ACPI Power States
Linux supports two power management implementations…
- APM - Advanced Power Management (deprecated)
- ACPI - Advanced Configuration and Power Interface
- …CPU & device power management
- …thermal management (fans)
- …button & lid events
- …power sources (AC/battery)
Global system states (Gx states)…
Gx | State | Software runs | Latency | Power consumption | OS restart | Safe disassemble |
---|---|---|---|---|---|---|
G0 | Working | Yes | 0 | Large | No | No |
G1 | Sleeping | No | >0 | Smaller | No | No |
G2 | Soft Off | No | Long | Very near 0 | Yes | No |
G3 | Power Off | No | Long | RTC battery | Yes | Yes |
S-States & D-States
S-States (System States)…
- …platform-wide power state transitions
- S1…
- …CPU stops executing, cache is flushed
- …power to CPU and RAM is maintained
- S2…deeper sleep…CPU is powered off
- S3…standby…suspend-to-RAM
- S4…hibernation…suspend-to-disk
- S5…soft power off…power for wake-up event maintained
D-States (Device States)…
- …device-specific power states transitions
- D0…fully on
- D1…intermediate state defined by device
- D2…intermediate state defined by device
- D3…powered off
- Linux Kernel - PCI Power Management
Sleep States
Sleep…low-power states of the entire system…
- …user space code cannot be executed
- …system activity is significantly reduced
/sys/power/state lists the supported sleep states…
- freeze for suspend-to-idle
- standby
- mem for suspend-to-RAM
- disk for hibernation
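Whether a given sleep state is offered can be tested before trying to enter it. A minimal sketch, assuming /sys/power/state holds a space-separated list such as "freeze mem disk"; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# usage: supports_sleep_state STATE [FILE]
# returns 0 if STATE is listed in FILE (default: /sys/power/state)
supports_sleep_state() {
    local state=$1 file=${2:-/sys/power/state}
    # one state per line, then match the whole line exactly
    tr ' ' '\n' < "$file" | grep -qx -- "$state"
}

# example:
#   supports_sleep_state mem && echo "suspend-to-RAM available"
```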
Four system sleep states…
- Suspend-to-Idle…
- …light-weight variant of system suspend
- More energy saved relative to runtime idle…
- …freezing user space
- …suspending timekeeping
- …all I/O devices in low-power states
- Woken up by in-band interrupts (including devices)
- Standby…
- …offers moderate energy savings…
- …no operating state is lost
- Like Suspend-to-Idle plus…
- …nonboot CPUs are taken offline
- …all low-level system functions are suspended
- Rely on platform for wakeup functionality
- Suspend-to-RAM
- …offers significant energy savings…
- …everything put into a low-power state except memory
- State of devices and CPUs is saved and held in memory
- Peripheral buses may lose power (depends on configuration)
- Hibernation…
- …offers the greatest energy savings
- Stops all system activity and creates a snapshot image of memory…
- …image written into persistent storage
- Power is cut from almost all of its hardware components…
- …including memory…
- …except for a limited set of wakeup devices
C-States (Idle)
Processor idle state…
- …execution of a program is suspended
- …part of the processor hardware not used
- …allows power drawn by the processor to be reduced
- Special “idle” task…
- …runnable if there are no other runnable tasks assigned to a CPU
- …cause the processor to be put into one of its idle states
- …idle task executes the idle loop
- Calls the CPUIdle governor to select an idle state…
- …then the CPUIdle driver to put the hardware into that idle state
Idle CPUs deactivate components or use lower performance settings (known as C-states)
- Reflect the capability of an idle CPU to turn off unused components
- …downside is that they introduce latency (more time to go back to C0)
- …disable deepest sleep states to increase overall performance
- Deeper sleep states can save large amounts of energy…
- Kernel command line argument processor.max_cstate=0 disables sleep
Table of CPU C-States…(incomplete)
C-state | Name | Description |
---|---|---|
C0 | Operation | CPU fully turned on |
C1 | Halt | Stops main CPU internal clock with software... |
C1E | Enhanced Halt | ...reduce CPU voltage |
C2 | Stop Clock | Stops main CPU internal and external clock via hardware |
C3 | Deep Sleep | Stops all CPU internal and external clocks... |
C4 | Deeper Sleep | ...reduce CPU voltage |
C6 | Deep Power Down | Reduce CPU internal voltage (including to 0V) |
C7 | Deep Energy Save | Flush L3 cache and cut power if able |
Driver
Currently loaded kernel driver for CPUIdle and its governor…
>>> cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle
>>> cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
- acpi_idle…retrieves available sleep states (C-states) from the ACPI BIOS tables
- intel_idle…serves recent Intel CPUs (Nehalem, Westmere, Sandy Bridge, Atoms or newer)
- Knows the sleep state capabilities of the processor and ignores ACPI BIOS
- Presents the kernel with the duration of the target residency and exit latency…
- …used by the CPUIdle menu governor to predict how long the CPU will be idle
Latency
>>> cpupower idle-info
...
Number of idle states: 5
Available idle states: POLL C1 C1E C3 C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 4712
Duration: 81820
...
C6:
Flags/Description: MWAIT 0x20
Latency: 133
Usage: 51679556
Duration: 779100553418
POLL…not a real idle state…
- …used if the kernel knows that work has to be processed very soon
- …entering real hardware idle state may result in a performance penalty
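The values printed by cpupower idle-info are also available per CPU in sysfs. The sketch below assumes the cpuidle files name, latency (exit latency in µs), usage (number of entries) and time (total residency in µs); the function name is hypothetical.

```shell
#!/usr/bin/env bash
# usage: list_idle_states [CPU_DIR]   (default: /sys/devices/system/cpu/cpu0)
list_idle_states() {
    local cpu_dir=${1:-/sys/devices/system/cpu/cpu0} state
    for state in "$cpu_dir"/cpuidle/state*; do
        [ -r "$state/name" ] || continue   # no cpuidle support or no match
        printf '%s latency=%sus usage=%s time=%sus\n' \
            "$(cat "$state/name")" "$(cat "$state/latency")" \
            "$(cat "$state/usage")" "$(cat "$state/time")"
    done
}

list_idle_states
```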
Relative use of specific C-States…
>>> cpupower monitor -m Idle_Stats
| Idle_Stats
PKG|CORE| CPU| POLL | C1 | C1E | C3 | C6
0| 0| 0| 0.00| 0.00| 0.00| 0.00| 96.72
0| 0| 28| 0.00| 0.11| 0.19| 0.26| 97.58
...
CPU Frequency Scaling
Modern processors…
- …capable of operating in a number of different clock frequency and voltage configurations
- Referred to as Operating Performance Points or P-states (in ACPI terminology)
- Higher clock frequency and higher voltage…
- …more instructions executed over a unit of time
- …more energy is consumed over a unit of time
- Trade-off between the CPU capacity vs power drawn by the CPU
- Hardware interfaces allow CPUs to be switched between different frequency/voltage configurations
- …used along with algorithms to estimate the required CPU capacity
CPUFreq (CPU Frequency scaling)…Linux kernel sub-system…
- Linux Kernel Guide - CPU Performance Scaling
- …supports CPU performance scaling…
- Scaling governors…estimate the required CPU capacity
- Scaling drivers…interface with the hardware
- Scaling algorithms for P-state selection in a platform-independent form
- …in the majority of cases unless…
- …algorithms based on information provided by the hardware itself
sysfs interface located in /sys/devices/system/cpu/
Scaling governors
…algorithms to compute the desired CPU frequency
Linux CPUFreq Governors…https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt
- performance…maximum CPU frequency…no power saving benefit
- Highest possible clock frequency…statically sets the frequency to scaling_max_freq
- Suitable for hours of a heavy workload…CPU is rarely or never idle
- powersave…minimum CPU frequency…maximum power savings
- Lowest possible clock frequency…statically sets the frequency to scaling_min_freq
- More of a speed limiter for the CPU than a power saver
- Useful in systems and environments where overheating can be a problem
- ondemand…CPU frequency dynamically according to current load
- Maximum clock frequency on load…minimum clock frequency on idle
- At the expense of latency between frequency switching
- Compromise between heat emission, power consumption, performance, and manageability
- conservative…similar to ondemand
- …switches between frequencies more gradually
- Adjusts to a clock frequency that it considers best for the load
- Even greater latency than the ondemand governor
- userspace…application specified CPU frequencies (requires root)
- schedutil…scheduler-driven CPU frequency
# list available governors...
cpupower frequency-info --governors
# set the governor temporarily
cpupower frequency-set --governor powersave
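Without cpupower, the active governor and current frequency can be read per CPU from sysfs. A minimal sketch, assuming the cpufreq files scaling_governor and scaling_cur_freq (in kHz) are present; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# usage: show_cpufreq [CPU_ROOT]   (default: /sys/devices/system/cpu)
show_cpufreq() {
    local root=${1:-/sys/devices/system/cpu} dir
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -r "$dir/scaling_governor" ] || continue   # no CPUFreq support
        printf '%s %s %s kHz\n' "$(basename "$(dirname "$dir")")" \
            "$(cat "$dir/scaling_governor")" "$(cat "$dir/scaling_cur_freq")"
    done
}

show_cpufreq
```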
Scaling drivers
…implement the CPU-specific details of setting frequencies
# list modules supported by the kernel
ls /usr/lib/modules/$(uname -r)/kernel/drivers/cpufreq/
Commonly used modules…
- acpi_cpufreq…utilizes the ACPI Processor Performance States
- intel_pstate…Intel Sandy Bridge and newer CPUs
- amd_pstate…AMD Ryzen (some Zen 2 and newer) processors
Run-time configuration…/sys/devices/system/cpu/cpu*/cpufreq/scaling*
# CPU model
>>> grep 'model name' /proc/cpuinfo | uniq
model name : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
# driver used by CPUFreq
>>> cpupower frequency-info --driver
analyzing CPU 0:
driver: intel_cpufreq
P-States & T-States
T-States (Throttling the CPU through ACPI)…
- …legacy used before frequency scaling and ACPI “C” states were available
- Does not decrease clock frequency…
- …may interfere with the CPU reaching the higher C states
P-States (Processor Performance States)…CPU frequency…
- …operational states that relate to CPU frequency and voltage
- Higher P-state…lower frequency and voltage (power consumption)
- P0…always the highest-performance state
- P1 to Pn…incrementally reduce processor speed
CPU Frequency
turbostat
turbostat (package kernel-tools*.rpm) columns concerning CPU frequency…
>>> turbostat -qn1 -s Package,Core,CPU,Avg_MHz,Busy%,Bzy_MHz,TSC_MHz
Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz
- - - 7 0.44 1691 2400
0 0 0 16 1.10 1482 2400
0 0 28 50 3.74 1324 2400
0 1 1 14 0.92 1555 2400
0 1 29 39 2.10 1866 2400
...
1 0 14 1 0.10 1422 2400
1 0 42 2 0.15 1498 2400
...
- Avg_MHz…average frequency, based on APERF MSR registers
- Busy%…CPU usage in percent
- Bzy_MHz…busy frequency, based on MPERF registers
- TSC_MHz…fixed frequency, TSC (Time Stamp Counter)
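The three columns are related: Avg_MHz is roughly Busy% of Bzy_MHz, since the CPU only clocks while it is not idle. The sketch below checks this against the sample rows above; the function name is hypothetical and the result is rounded to whole MHz.

```shell
#!/usr/bin/env bash
# usage: avg_mhz BUSY_PERCENT BZY_MHZ
avg_mhz() {
    awk -v busy="$1" -v bzy="$2" 'BEGIN { printf "%.0f\n", busy * bzy / 100 }'
}

avg_mhz 0.44 1691   # summary row, prints 7
avg_mhz 1.10 1482   # CPU 0, prints 16
avg_mhz 3.74 1324   # CPU 28, prints 50
```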
References…
- CoreFreq, CPU monitoring software
powertop
PowerTOP Project…
https://github.com/fenrus75/powertop
- …estimate of the total power usage of the system
- …show individual power usage for each…
- …process
- …device
- …kernel task, timer, interrupt
# calibrate the power estimation engine
sudo powertop --calibrate
# ...running on battery power if working with mobile device
Use Tab and Shift+Tab to cycle through tabs…
- Idle stats…C-states for all processors and cores
- Frequency stats… P-states including the Turbo mode
- Device Stats…
- Tunables…suggestions for optimizing the system for lower power consumption
Power-Profile Services
- power-profiles-daemon, FreeDesktop
- TLP - Optimize Linux Laptop Battery Life
tuned
…pronounced “tune-D”
- Tunes system settings dynamically depending on usage…
- …specific monitoring plugin per hardware subsystem
- …changes between lower or higher power saving modes
- References to the TuneD Project
Configuration
tuned.service optimizes the performance profile of a node…
# typically installed by default
sudo dnf install -y tuned tuned-utils
sudo systemctl enable --now tuned
Relevant directories…
/etc/tuned/tuned-main.conf # global configuration
man 5 tuned-main.conf
/etc/tuned/**/tuned.conf # custom profiles
/usr/lib/tuned # distribution-specific profiles
/var/log/tuned # log files
Profiles
Manage profiles with the tuned-adm commands:
- Profiles in two broad categories…
- power saving
- performance-boosting (high-throughput, low latency)
- Predefined profiles…packages tuned-profiles*
tuned-adm active     # current active profile
tuned-adm list       # available profiles
tuned-adm recommend  # most suitable profile
tuned-adm verify     # is the active profile applied?
# activate a combination of multiple profiles...
tuned-adm profile virtual-guest powersave # ...for example on a virtual machine
reboot
Profile customisation… (do not modify /usr/lib/tuned)
# modify an existing profile...
cp -r /usr/lib/tuned/powersave /etc/tuned
vim /etc/tuned/powersave/tuned.conf
Create a new profile…
# site specific name recommended...
>>> mkdir /etc/tuned/site-powersave
>>> cat /etc/tuned/site-powersave/tuned.conf
[main]
include=powersave
# customize by overrides
...
powertop integration…
dnf install -y tuned-utils powertop
powertop2tuned site-profile
# enable what you need by uncommenting lines
vim /etc/tuned/site-profile/tuned.conf
tuned-adm profile site-profile
Plugins
Get a list of available plugins
rpm -ql tuned | grep 'plugins/plugin_.*\.py$'
Two types of plugins…
- …monitoring plugins can be used by tuning plugins for dynamic tuning
- automatically instantiated whenever their metrics are needed
- disk…disk load (number of IO operations) per device
- net…network load (number of transferred packets) per network card
- load…CPU load per CPU
- …tuning plugins tune an individual subsystem
- …can have multiple devices (wildcards to match all devices)
- cpu…sets the CPU governor
- net…wake-on-lan to the values…speed according to the interface utilization
- sysctl…various settings specified by the plugin parameters
- usb…autosuspend timeout of USB devices
- vm…transparent huge pages
- audio…autosuspend timeout for audio codecs
- disk…ALPM, ASPM…disk spindown timeout
- mounts…barriers for mounts
- script…execution of an external script
- sysfs…various settings specified by the plugin parameters
- video…powersave levels on video cards
- bootloader…kernel boot command line… Grub configuration
tuned.conf contains sections to configure plugin instances…
# name of the plugin instance
[PLUGIN]
# type of the tuning plugin
type=TYPE
# list of devices...can contain a list, wildcard (*), negation (!)
devices=DEVICES
cpu Plugin
Modern CPUs are capable of operating at different clock frequency and voltage configurations
- …higher clock frequency requires a higher voltage (and increases heat output)
- …trade-off between the CPU capacity and power consumption
- Referred to as Operating Performance Points or P-states (in ACPI terminology)
- Linux kernel offers CPU performance scaling via the CPUFreq subsystem
The cpu TuneD plugin options…
- governor…sets the CPU scaling governor
- Multiple governors are separated using | (represents or)…
- …set the first governor that is available on the system
- sampling_down_factor…sampling rate…
- …determines how frequently the governor checks to tune the CPU
- Recommended setting for jitter reduction, values 1 to 100000
- energy_perf_bias…Energy Performance Bias (EPB)
- …energy vs. performance policy via x86 Model Specific Registers
- Read the plugin help text for more information…
Configuration in default tuned profiles…
>>> grep -R -e '^governor' /usr/lib/tuned/
/usr/lib/tuned/accelerator-performance/tuned.conf:governor=performance
/usr/lib/tuned/balanced/tuned.conf:governor=conservative|powersave
/usr/lib/tuned/latency-performance/tuned.conf:governor=performance
/usr/lib/tuned/powersave/tuned.conf:governor=ondemand|powersave
/usr/lib/tuned/throughput-performance/tuned.conf:governor=performance
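A site-specific profile that overrides only the governor could be generated as sketched below. The profile name site-powersave is the example name used earlier; the function name and the chosen governor list are assumptions for illustration, not a recommended setting.

```shell
#!/usr/bin/env bash
# usage: write_site_profile [TUNED_DIR]   (default: /etc/tuned, requires root)
write_site_profile() {
    local dir=${1:-/etc/tuned}/site-powersave
    mkdir -p "$dir" || return
    cat > "$dir/tuned.conf" <<'EOF'
[main]
include=powersave

[cpu]
# the first governor available on the system is used
governor=conservative|powersave
EOF
}

# example:
#   write_site_profile && tuned-adm profile site-powersave
```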