Applies to openSUSE Leap 42.1

11 Power Management

Abstract

Power management aims at reducing operating costs for energy and cooling systems while at the same time keeping the performance of a system at a level that matches the current requirements. Thus, power management is always a matter of balancing the actual performance needs and power saving options for a system. Power management can be implemented and used at different levels of the system. A set of specifications for power management functions of devices and the operating system interface to them has been defined in the Advanced Configuration and Power Interface (ACPI). As power savings in server environments can primarily be achieved at the processor level, this chapter introduces some main concepts and highlights some tools for analyzing and influencing relevant parameters.

11.1 Power Management at CPU Level

At the CPU level, you can control power usage in various ways. For example by using idling power states (C-states), changing CPU frequency (P-states), and throttling the CPU (T-states). The following sections give a short introduction to each approach and its significance for power savings. Detailed specifications can be found at http://www.acpi.info/spec.htm.

11.1.1 C-States (Processor Operating States)

Modern processors have several power saving modes called C-states. They reflect the capability of an idle processor to turn off unused components in order to save power.

When a processor is in the C0 state, it is executing instructions. A processor running in any other C-state is idle. The higher the C number, the deeper the CPU sleep mode: more components are shut down to save power. Deeper sleep states can save large amounts of energy. Their downside is that they introduce latency. This means, it takes more time for the CPU to go back to C0. Depending on workload (threads waking up, triggering CPU usage and then going back to sleep again for a short period of time) and hardware (for example, interrupt activity of a network device), disabling the deepest sleep states can significantly increase overall performance. For details on how to do so, refer to Section 11.3.2, “Viewing Kernel Idle Statistics with cpupower.

Some states also have submodes with different power saving latency levels. Which C-states and submodes are supported depends on the respective processor. However, C1 is always available.

Table 11.1, “C-States” gives an overview of the most common C-states.

Table 11.1: C-States

Mode

Definition

C0

Operational state. CPU fully turned on.

C1

First idle state. Stops CPU main internal clocks via software. Bus interface unit and APIC are kept running at full speed.

C2

Stops CPU main internal clocks via hardware. State in which the processor maintains all software-visible states, but may take longer to wake up through interrupts.

C3

Stops all CPU internal clocks. The processor does not need to keep its cache coherent, but maintains other states. Some processors have variations of the C3 state that differ in how long it takes to wake the processor through interrupts.

To avoid needless power consumption, it is recommended to test your workloads with deep sleep states enabled versus deep sleep states disabled. For more information, refer to Section 11.3.2, “Viewing Kernel Idle Statistics with cpupower or the cpupower-idle-set(1) man page.

11.1.2 P-States (Processor Performance States)

While a processor operates (in C0 state), it can be in one of several CPU performance states (P-states). Whereas C-states are idle states (all but C0), P-states are operational states that relate to CPU frequency and voltage.

The higher the P-state, the lower the frequency and voltage at which the processor runs. The number of P-states is processor-specific and the implementation differs across the various types. However, P0 is always the highest-performance state (except for Section 11.1.3, “Turbo Features”). Higher P-state numbers represent slower processor speeds and lower power consumption. For example, a processor in P3 state runs more slowly and uses less power than a processor running in the P1 state. To operate at any P-state, the processor must be in the C0 state, which means that it is working and not idling. The CPU P-states are also defined in the ACPI specification, see http://www.acpi.info/spec.htm.

C-states and P-states can vary independently of one another.

11.1.3 Turbo Features

Turbo features allow to dynamically overtick active CPU cores while other cores are in deep sleep states. This increases the performance of active threads while still complying with Thermal Design Power (TDP) limits.

However, the conditions under which a CPU core can use turbo frequencies are architecture-specific. Learn how to evaluate the efficiency of those new features in Section 11.3, “The cpupower Tools”.

11.2 In-Kernel Governors

The in-kernel governors belong to the Linux kernel CPUfreq infrastructure and can be used to dynamically scale processor frequencies at runtime. You can think of the governors as a sort of preconfigured power scheme for the CPU. The CPUfreq governors use P-states to change frequencies and lower power consumption. The dynamic governors can switch between CPU frequencies, based on CPU usage, to allow for power savings while not sacrificing performance.

The following governors are available with the CPUfreq subsystem:

Performance Governor

The CPU frequency is statically set to the highest possible for maximum performance. Consequently, saving power is not the focus of this governor.

See also Section 11.5.1, “Tuning Options for P-States”.

Powersave Governor

The CPU frequency is statically set to the lowest possible. This can have severe impact on the performance, as the system will never rise above this frequency no matter how busy the processors are.

However, using this governor often does not lead to the expected power savings as the highest savings can usually be achieved at idle through entering C-states. With the powersave governor, processes run at the lowest frequency and thus take longer to finish. This means it takes longer until the system can go into an idle C-state.

Tuning options: The range of minimum frequencies available to the governor can be adjusted (for example, with the cpupower command line tool).

On-demand Governor

The kernel implementation of a dynamic CPU frequency policy: The governor monitors the processor usage. As soon as it exceeds a certain threshold, the governor will set the frequency to the highest available. If the usage is less than the threshold, the next lowest frequency is used. If the system continues to be underemployed, the frequency is again reduced until the lowest available frequency is set.

Important
Important: Drivers and In-kernel Governors

Not all drivers use the in-kernel governors to dynamically scale power frequency at runtime. For example, the intel_pstate driver adjusts power frequency itself. Use the cpupower frequency-info command to find out which driver your system uses.

11.3 The cpupower Tools

The cpupower tools are designed to give an overview of all CPU power-related parameters that are supported on a given machine, including turbo (or boost) states. Use the tool set to view and modify settings of the kernel-related CPUfreq and cpuidle systems as well as other settings not related to frequency scaling or idle states. The integrated monitoring framework can access both, kernel-related parameters and hardware statistics, and is thus ideally suited for performance benchmarks. It also helps you to identify the dependencies between turbo and idle states.

After installing the cpupower package, view the available cpupower subcommands with cpupower --help. Access the general man page with man cpupower, and the man pages of the subcommands with man cpupower-subcommand.

11.3.1 Viewing Current Settings with cpupower

The cpupower frequency-info command shows the statistics of the cpufreq driver used in the Kernel. Additionally, it shows if turbo (boost) states are supported and enabled in the BIOS. Run without any options, it shows an output similar to the following:

Example 11.1: Example Output of cpupower frequency-info
root # cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 3.80 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 3.80 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 3.40 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
    3500 MHz max turbo 4 active cores
    3600 MHz max turbo 3 active cores
    3600 MHz max turbo 2 active cores
    3800 MHz max turbo 1 active cores

To get the current values for all CPUs, use cpupower -c all frequency-info.

11.3.2 Viewing Kernel Idle Statistics with cpupower

The idle-info subcommand shows the statistics of the cpuidle driver used in the Kernel. It works on all architectures that use the cpuidle Kernel framework.

Example 11.2: Example Output of cpupower idle-info
root # cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu

Analyzing CPU 0:
Number of idle states: 6
Available idle states: POLL C1-SNB C1E-SNB C3-SNB C6-SNB C7-SNB
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 163128
Duration: 17585669
C1-SNB:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 16170005
Duration: 697658910
C1E-SNB:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 4421617
Duration: 757797385
C3-SNB:
Flags/Description: MWAIT 0x10
Latency: 80
Usage: 2135929
Duration: 735042875
C6-SNB:
Flags/Description: MWAIT 0x20
Latency: 104
Usage: 53268
Duration: 229366052
C7-SNB:
Flags/Description: MWAIT 0x30
Latency: 109
Usage: 62593595
Duration: 324631233978

After finding out which processor idle states are supported with cpupower idle-info, individual states can be disabled using the cpupower idle-set command. Typically one wants to disable the deepest sleep state, for example:

cpupower idle-set -d 5

Or, for disabling all CPUs with latencies equal to or higher than 80:

cpupower idle-set -D 80

11.3.3 Monitoring Kernel and Hardware Statistics with cpupower

Use the monitor subcommand to report processor topology, and monitor frequency and idle power state statistics over a certain period of time. The default interval is 1 second, but it can be changed with the -i. Independent processor sleep states and frequency counters are implemented in the tool—some retrieved from kernel statistics, others reading out hardware registers. The available monitors depend on the underlying hardware and the system. List them with cpupower monitor -l. For a description of the individual monitors, refer to the cpupower-monitor man page.

The monitor subcommand allows you to execute performance benchmarks. To compare Kernel statistics with hardware statistics for specific workloads, concatenate the respective command, for example:

cpupower monitor db_test.sh
Example 11.3: Example cpupower monitor Output
root # cpupower monitor
|Mperf                   || Idle_Stats
 1                         2 
CPU | C0   | Cx   | Freq || POLL | C1   | C2   | C3
   0|  3.71| 96.29|  2833||  0.00|  0.00|  0.02| 96.32
   1| 100.0| -0.00|  2833||  0.00|  0.00|  0.00|  0.00
   2|  9.06| 90.94|  1983||  0.00|  7.69|  6.98| 76.45
   3|  7.43| 92.57|  2039||  0.00|  2.60| 12.62| 77.52

1

Mperf shows the average frequency of a CPU, including boost frequencies, over a period of time. Additionally, it shows the percentage of time the CPU has been active (C0) or in any sleep state (Cx). As the turbo states are managed by the BIOS, it is impossible to get the frequency values at a given instant. On modern processors with turbo features the Mperf monitor is the only way to find out about the frequency a certain CPU has been running in.

2

Idle_Stats shows the statistics of the cpuidle kernel subsystem. The kernel updates these values every time an idle state is entered or left. Therefore there can be some inaccuracy when cores are in an idle state for some time when the measure starts or ends.

Apart from the (general) monitors in the example above, other architecture-specific monitors are available. For detailed information, refer to the cpupower-monitor man page.

By comparing the values of the individual monitors, you can find correlations and dependencies and evaluate how well the power saving mechanism works for a certain workload. In Example 11.3 you can see that CPU 0 is idle (the value of Cx is near 100%), but runs at a very high frequency. This is because the CPUs 0 and 1 have the same frequency values which means that there is a dependency between them.

11.3.4 Modifying Current Settings with cpupower

You can use cpupower frequency-set command as root to modify current settings. It allows you to set values for the minimum or maximum CPU frequency the governor may select or to create a new governor. With the -c option, you can also specify for which of the processors the settings should be modified. That makes it easy to use a consistent policy across all processors without adjusting the settings for each processor individually. For more details and the available options, refer to the cpupower-freqency-set man page or run cpupower frequency-set --help.

11.4 Monitoring Power Consumption with powerTOP

You can monitor system power consumption with powerTOP. It helps you to identify the reasons for unnecessary high power consumption (for example, processes that are mainly responsible for waking up a processor from its idle state) and to optimize your system settings to avoid these. It supports both Intel and AMD processors. The powertop package is available from the SUSE Linux Enterprise SDK.

The SDK is a module for SUSE Linux Enterprise and is available via an online channel from the SUSE Customer Center. Alternatively, go to http://download.suse.com/, search for SUSE Linux Enterprise Software Development Kit and download it from there. Refer to Book “Start-Up”, Chapter 10 “Installing Add-On Products” for details.

powerTOP combines various sources of information (analysis of programs, device drivers, kernel options, amounts and sources of interrupts waking up processors from sleep states) and shows them in one screen. Example 11.4, “Example powerTOP Output” shows which information categories are available:

Example 11.4: Example powerTOP Output
Cn               Avg  residency       P-states   (frequencies)
1                 2      3              4            5
C0 (cpu running)        (11.6%)       2.00 Ghz       0.1%
polling         0.0ms   ( 0.0%)       2.00 Ghz       0.0%
C1              4.4ms   (57.3%)       1.87 Ghz       0.0%
C2             10.0ms   (31.1%)       1064 Mhz      99.9%


Wakeups-from-idle per second : 11.2     interval: 5.0s 6
no ACPI power usage estimate available 7


Top causes for wakeups: 8
96.2% (826.0)       <interrupt> : extra timer interrupt
 0.9% (  8.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
 0.3% (  2.4)       <interrupt> : megasas
 0.2% (  2.0)     <kernel core> : clocksource_watchdog (clocksource_watchdog)
 0.2% (  1.6)       <interrupt> : eth1-TxRx-0
 0.1% (  1.0)       <interrupt> : eth1-TxRx-4

[...]

Suggestion: 9 Enable SATA ALPM link power management via:
echo min_power > /sys/class/scsi_host/host0/link_power_management_policy
or press the S key.

1

The column shows the C-states. When working, the CPU is in state 0, when resting it is in some state greater than 0, depending on which C-states are available and how deep the CPU is sleeping.

2

The column shows average time in milliseconds spent in the particular C-state.

3

The column shows the percentages of time spent in various C-states. For considerable power savings during idle, the CPU should be in deeper C-states most of the time. In addition, the longer the average time spent in these C-states, the more power is saved.

4

The column shows the frequencies the processor and kernel driver support on your system.

5

The column shows the amount of time the CPU cores stayed in different frequencies during the measuring period.

6

Shows how often the CPU is awoken per second (number of interrupts). The lower the number, the better. The interval value is the powerTOP refresh interval which can be controlled with the -t option. The default time to gather data is 5 seconds.

7

When running powerTOP on a laptop, this line displays the ACPI information on how much power is currently being used and the estimated time until discharge of the battery. On servers, this information is not available.

8

Shows what is causing the system to be more active than needed. powerTOP displays the top items causing your CPU to awake during the sampling period.

9

Suggestions on how to improve power usage for this machine.

For more information, refer to the powerTOP project page at https://01.org/powertop.

11.5 Special Tuning Options

The following sections highlight some of the most relevant settings that you might want to touch.

11.5.1 Tuning Options for P-States

The CPUfreq subsystem offers several tuning options for P-states: You can switch between the different governors, influence minimum or maximum CPU frequency to be used or change individual governor parameters.

To switch to another governor at runtime, use cpupower frequency-set with the -g option. For example, running the following command (as root) will activate the performance governor:

cpupower frequency-set -g performance

To set values for the minimum or maximum CPU frequency the governor may select, use the -d or -u option, respectively.

11.6 Troubleshooting

BIOS options enabled?

To use C-states or P-states, check your BIOS options:

  • To use C-states, make sure to enable CPU C State or similar options to benefit from power savings at idle.

  • To use P-states and the CPUfreq governors, make sure to enable Processor Performance States options or similar.

In case of a CPU upgrade, make sure to upgrade your BIOS, too. The BIOS needs to know the new CPU and its frequency stepping to pass this information on to the operating system.

Log file information?

Check the systemd journal (see Book “Reference”, Chapter 11 “journalctl: Query the systemd Journal”) for any output regarding the CPUfreq subsystem. Only severe errors are reported there.

If you suspect problems with the CPUfreq subsystem on your machine, you can also enable additional debug output. To do so, either use cpufreq.debug=7 as boot parameter or execute the following command as root:

echo 7 > /sys/module/cpufreq/parameters/debug

This will cause CPUfreq to log more information to dmesg on state transitions, which is useful for diagnosis. But as this additional output of kernel messages can be rather comprehensive, use it only if you are fairly sure that a problem exists.

11.7 For More Information

Platforms with a Baseboard Management Controller (BMC) may have additional power management configuration options accessible via the service processor. These configurations are vendor specific and therefore not subject of this guide. For more information, refer to the manuals provided by your vendor. For example, HP ProLiant Server Power Management on SUSE Linux Enterprise Server 11—Integration Note provides detailed information how the HP platform specific power management features interact with the Linux Kernel. The paper is available from http://h18004.www1.hp.com/products/servers/technology/whitepapers/os-techwp.html.

Print this page