HPM Vsphere7 Perf
VMware, Inc. 3401 Hillview Avenue Palo Alto CA 94304 USA Tel 877-486-9273 Fax 650-427-5001 www.vmware.com
Copyright © 2021 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property
laws. VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents. VMware is a registered trademark
or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks
of their respective companies.
Table of Contents
Executive Summary
Introduction
Background
Hardware
Results
LoadGen
SPECpower
VMmark
References
Acknowledgments
Appendix A
Executive Summary
While observing the performance and power consumption of a range of workloads, this
paper recommends the use of power policies as follows: (a) the "Balanced" policy is the
default and recommended policy in vSphere because it maximizes performance-per-watt
overall and provides an optimal point between performance and power consumption for a
wide variety of application characteristics; (b) the "High Performance" policy is preferred
for latency-sensitive applications, at the cost of higher power consumption; and (c) the
"Low Power" policy is preferred for low-utilization servers, yielding higher power savings
at the cost of performance.
Introduction
The power used by servers and data centers accounts for ~1.5% of the total power
consumption in the world [3]. CPUs in modern data centers consume ~30% of the overall
system power. As CPU hardware complexity has grown significantly in recent years,
computer architects have introduced several power-focused enhancements. These
designs have enabled operating systems to capitalize on architectural tools to improve
performance-per-watt on various applications. The introduction of demand-based
switching (DBS) from Intel has been the critical factor behind power savings in modern
processors.
Background
The ACPI standard defines C-States (commonly known as idle states or power states) and
P-States (widely known as operational states or performance states). Hardware vendors
such as Intel and AMD have introduced hardware support for P-States and C-States,
which are accessible through the on-chip power control unit (PCU) [6] [7].
C-States are power states that aid in saving power by turning off sub-sections of the CPU
when not in use. CPUs are designed to support various power levels. C0-state is the
operational state where all the components are active, and the processor can actively
execute instructions. C1 is a shallow state where the clock is gated (switched off).
However, all the modules remain active, and the processor can go back to the active C0
state instantaneously.
Furthermore, C2-Cn are sleep states in which specific sections of the CPU are turned off.
The higher the C-State, the deeper into sleep mode the CPU goes, so higher C-States
yield greater power savings. However, it takes more time for the CPU to return to the
operational state from deeper sleep states. Enabling deeper C-States in the BIOS
therefore does not affect throughput while the CPU is active, but the latency to wake
from a deeper sleep state is higher.
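The latency/power tradeoff above can be sketched as a tiny model: a governor should pick a deep C-state only when the expected idle period comfortably covers its exit latency. The states, wattages, latencies, and the 2x margin below are all hypothetical, chosen purely to illustrate the tradeoff.

```python
# Illustrative model: a deeper C-state only pays off if the idle period is
# long enough to amortize its entry/exit latency. All numbers are
# hypothetical, chosen only to illustrate the tradeoff described above.

def best_cstate(idle_us, states):
    """Pick the deepest C-state whose exit latency fits the idle period."""
    eligible = [s for s in states if s["exit_latency_us"] * 2 <= idle_us]
    # Deeper states have a higher depth index and lower power; pick the deepest.
    return max(eligible, key=lambda s: s["depth"])

CSTATES = [
    {"name": "C0", "depth": 0, "power_w": 3.0, "exit_latency_us": 0},
    {"name": "C1", "depth": 1, "power_w": 1.0, "exit_latency_us": 2},
    {"name": "C2", "depth": 2, "power_w": 0.1, "exit_latency_us": 50},
]

print(best_cstate(10, CSTATES)["name"])    # short idle -> C1
print(best_cstate(1000, CSTATES)["name"])  # long idle  -> C2
```

A short idle period cannot justify C2's long wakeup, so the shallow C1 wins; a long idle period makes the deep state worthwhile.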
P-States correspond to different performance levels that are applied while the processor
is actively executing instructions. P-States are relevant only when the processor is in the
active C0-state. A P-State is a combined frequency and voltage operating point, defined
as a performance state in the ACPI specification. Frequency and voltage are scaled
together from one P-State to the next. This process is referred to as dynamic voltage and
frequency scaling (DVFS). Hardware vendors such as Intel incorporate several hardware-
level P-States that divide the energy and frequency demands into several tiers. P0 is the
highest frequency (with the highest voltage) and is often referred to as "turbo mode". P1
is the nominal/base operating frequency. Higher P-States correspond to lower operating
frequencies. P-States play a crucial role in saving CPU power when the workload does not
fully load a CPU.
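The DVFS effect can be illustrated with the standard dynamic-power approximation (power scales roughly as C·V²·f): lowering frequency and voltage together gives a super-linear power reduction. The voltage/frequency points below are assumed values for illustration, not figures from this paper or any specific CPU datasheet.

```python
# Sketch of why DVFS saves power: dynamic CPU power scales roughly as
# C * V^2 * f, so lowering frequency *and* voltage together yields a
# super-linear reduction. The (volts, GHz) pairs below are hypothetical.

def relative_dynamic_power(v, f, v0, f0):
    """Dynamic power of operating point (v, f) relative to (v0, f0)."""
    return (v / v0) ** 2 * (f / f0)

P0 = (1.10, 2.4)   # assumed nominal/base point
P7 = (0.90, 1.6)   # assumed mid-range P-state

ratio = relative_dynamic_power(*P7, *P0)
print(f"P7 uses ~{ratio:.0%} of P0 dynamic power")  # ~45%
```

Note the frequency alone dropped by a third, but dynamic power dropped by more than half because the voltage term enters squared.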
[Figure 1: P-States (P0–Pn) apply only within the active C0 state; C-States (C0–Cn) are the processor power states.]
An example of how to select the BIOS settings to allow vSphere to control power
management in a Dell PowerEdge system is shown below in Figure 2. Similar options can
be found in servers from other vendors as well.
3. The default System Profile on Dell systems is Performance Per Watt (DAPC). Change
this to Performance Per Watt (OS) to transfer control to vSphere/ESXi.
• High Performance: This policy tries to maximize performance by disabling C-State and
P-State management. It always keeps the CPU in the highest possible operating
frequency (P0-State) and only uses the top two shallow C-States (C0 when running and
C1 when idle). The High-Performance policy is geared toward latency-sensitive
applications and provides predictable and consistent performance.
• Low Power: This policy is designed to reduce power consumption compared to the
other policies substantially. The P-State and C-State selection algorithms are more
aggressive toward power saving. This policy, however, can impact the performance of
latency-sensitive applications.
• Custom: This policy, by default, has the same settings as Balanced. However, it
provides the user the ability to tune individual parameters for specific use cases.
Steps to change the power policies using the vSphere Web Client:
1. Select the host from the inventory and click the Manage tab and then select Settings.
3. Click Change Policy and select the policy of interest as shown in Figure 3.
When the Custom policy is selected, vSphere provides various options to the user, as
shown in Figure 4. Note that changing parameters in the Custom policy without a
comprehensive understanding of the options may negatively impact performance and
power consumption.
Power management parameters that affect the Custom policy have descriptions that begin
with the phrase "In custom policy." All other power parameters affect all power
management policies. Table 1A in Appendix A describes these parameters in detail.
Table 1 summarizes the use of P-States and C-States in each of the vSphere power
policies. To recap, the High Performance policy always requests the highest operating
frequency (P0) in hardware and does not use deep sleep states (C2). It is designed for
latency-sensitive workloads. The Low Power policy's P-State and C-State management
focuses on power savings, which generally comes at a performance cost. Finally, the
Balanced policy is designed to be an optimal middle ground for a wide variety of
applications.
Experimental Setup
In this section, we describe the hardware and software setup used for evaluating the
power management features of vSphere.
Hardware
Table 2 shows the test machine configuration used for this paper. The test system is a
Dell PowerEdge R640 server housing an Intel Xeon Platinum 8260 from the Cascade
Lake architecture. The System Profile in the BIOS settings was changed to Performance
Per Watt (OS), as shown in Figure 2. All other BIOS options were left at their default values.
# Sockets: 2
CPU: Intel Xeon Platinum 8260
Physical cores: 24 × 2
Logical cores: 48 × 2
• Power Usage: The total host power reading is obtained from the internal sensors
embedded in the host’s power supply unit through the Intelligent Platform
Management Interface (IPMI) driver. This reading includes CPU Power, Memory Power,
and Other Power (PCIe devices, cooling fans, etc.). If a particular machine does not
have a built-in sensor or vSphere cannot read the values, ‘0’ is displayed.
• P-State MHz: Provides the available P-States and the respective operating frequency of
each P-State on the Intel Xeon Platinum 8260. This metric is specific to a particular CPU
and can vary among different CPU families and vendors.
• CPU: The test system has 96 logical cores (hyper-threads). Each row gives the power
statistics for one logical core.
• C-State Selection: The available C-States are shown as different columns. The available
C-states are numbered consecutively without gaps (C2, in this case, corresponds to
Intel’s C6 deep C-state). Each row shows the percentage of time the logical CPU votes
to stay in that particular C-State.
• P-State Selection: The screenshot shows 16 different P-States, P0 to P15. P0 has the
highest clock frequency (2.401 GHz), and P15 has the lowest clock frequency (1.00
GHz). Each logical core on the physical core can vote for a particular P-State
(frequency), but the P-State assignment happens at the physical core level. Thus, the
lower P-State vote between the two requests is selected. On the other hand, if a
processor has a socket-level P-State granularity, then the chosen operating P-State
would be the lowest vote across all cores on the socket.
• %Aperf/Mperf: This column shows the ratio of Aperf to Mperf, which indicates the
actual frequency at which the core is running. A value of 100 means the core is running
at nominal/base frequency; anything higher indicates running in turbo and anything
lower suggests the core is running at lower frequencies. Note that the %Aperf/Mperf
counter is updated only when the core is in the running state (C0-state). So, the
counter value is not valid when the core spends time in C1 or C2 idle sleep states.
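Two of the semantics described above, per-physical-core P-State resolution and the %Aperf/Mperf ratio, can be sketched in a few lines. The numbers are illustrative; as in the text, a lower P-State index means a higher frequency, so the lower vote of the two sibling hyper-threads wins.

```python
# Two small helpers mirroring the esxtop semantics described above.
# Values are illustrative; lower P-state index = higher frequency.

def core_pstate(votes):
    """P-states are assigned per physical core: the lower (faster) vote of
    the sibling hyper-threads is the one that takes effect."""
    return min(votes)

def effective_ghz(aperf_mperf_pct, base_ghz):
    """%Aperf/Mperf of 100 means base frequency; above 100 means turbo,
    below 100 means a reduced-frequency P-state."""
    return base_ghz * aperf_mperf_pct / 100.0

print(core_pstate([3, 7]))                 # siblings vote P3 and P7 -> P3
print(round(effective_ghz(130, 2.4), 2))   # 3.12 (running in turbo)
```

A socket-level granularity, as mentioned above, would simply apply `min()` across all core votes on the socket instead of across two sibling threads.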
Some vsi nodes can also be queried to see the core's actual time in deep C-States
through the C-state residency counters. Note that this depends on the architectural
support provided by the vendor. For example, the following command can be used:
Figure 6 shows the expanded view of the available P-States, the equivalent operating
frequency, and the socket TDP. Note that the difference between P0 and P1 is shown to
be only 1 MHz. The main difference between P0 and P1 is that P0 enables turbo mode,
whereas P1 does not. Turbo frequency is hardware controlled (invisible to vSphere) and is
based on the load, number of active threads, power, and the thermal budget. The OS can
only request P0, and the hardware determines the actual frequency of operation.
[Figure 6 data, reconstructed: boost range up to 3.70 GHz; all-core max turbo 3.10 GHz; P0 at ~2.41 GHz, 165 W; P2 at 2.30 GHz, 154 W; P3 at 2.20 GHz, 146 W; ...; P15 (lowest operating frequency) at 1.00 GHz, 57 W.]
Figure 6: Intel Xeon Platinum 8260 operating states and equivalent power estimates
Results
This section walks through the experimental results on the test machine for several
workloads. The study utilizes a synthetic microbenchmark and three real-life workloads to demonstrate the
impact of the vSphere power management policies on metrics such as latency,
performance, and power consumption.
• Power management disabled (P-States and C-States are disabled) – orange line
• Only P-state management enabled – blue line
• Both P-State and C-State management enabled – green line
In this case, only P-state management is enabled. While active, the system can operate
between P0 and P15, depending on the load. During idle times, the system will be in the
C1 state (C2 is disabled). During high load levels (~100% load), the processor operates in
P0, and consequently, we see that enabling P-state management saves only ~1% power.
However, as the load decreases, enabling P-States provides an average power gain of
~6% (with a maximum of ~8%, which is the difference between the orange and blue lines).
Note that enabling P-state management does not affect idle power consumption because
it only uses C1 when idle.
In this case, both C-State and P-State management are enabled. The system can operate
between P0 to P15 while active and C1, and C2 are used during idle times. As the load
reduces, we can observe a clear distinction between the orange and green lines. This is
due to the system going into a deeper C2 state during idle times in low load levels. An
average power gain of ~9% can be observed across load levels (with a maximum of
~13%). When the system is idle, ~13% gain can be observed when power management is
enabled due to the use of the C2 state.
[Figure 7: Total host power (W) at CPU loads from 100% down to active idle for three configurations: P-States & C-States disabled, P-States enabled, and P-States & C-States enabled.]
Figure 8 shows the total host power consumed while executing the LoadGen
microbenchmark at various thread counts and intensity levels. The X-Axis shows the number of
threads, with ten intensity levels from 10% to 100% for each thread count. The Y-Axis
shows the total host power consumed. Again, the contrast between the three policies can
be observed, indicating the impact of P-States and C-States. Note that LoadGen applies
thread affinity: the first 48 threads are pinned to the CPUs in the first physical socket, and
subsequent threads are pinned to the cores in the second socket, in that order.
We can draw a clear distinction between High Performance and Balanced/Low Power in
the case of low thread count. This is precisely due to the use of the C2 state to save power
on the idle cores. However, when the load increases, the power consumption in all three
policies is comparable, indicating that policy selection does not constrain the power
budget at high load. Another key difference is the rate at which power consumption increases for the three
policies as the intensity increases within each thread count. This is because each policy is
tuned with different thresholds for P-State management. Finally, we can see that all three
policies consume the same amount of power when the system is fully loaded.
[Figure 8 data: power (W) on the Y-Axis; the X-Axis sweeps 1–96 threads with ten intensity levels per thread count.]
Figure 8: Host power consumption for various vSphere power policies for the LoadGen workload
SPECpower
The execution of the benchmark takes ~70 minutes. The benchmark goes through various
phases of execution:
• Calibration Phase: The benchmark runs on the system with the maximum throughput
possible, determined by running the workload unconstrained for at least three
calibration levels. The maximum throughput is selected as the average of the
throughput achieved during the final two calibration levels.
• Execution Phase: The workload is then run in a controlled manner, with delays
inserted into the workload stream, to obtain total throughputs of 100%, 90%, 80%, 70%,
60%, 50%, 40%, 30%, 20%, and 10% of the maximum throughput calculated in the
calibration phase. During each of these target loads, the power characteristics of the
system under test (SUT) and the temperature are recorded.
• Inference Phase: Finally, the power characteristics and temperature are measured and
recorded during an idle interval during which the SUT processes no Java transactions.
When the benchmark completes, an efficiency score (performance per watt),
maximum performance, and power consumption for each load level are generated.
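The calibration arithmetic described in the phases above can be sketched as follows: the maximum throughput is the average of the final two calibration levels, and each graduated load targets a fixed fraction of it. The calibration throughputs below are made-up numbers.

```python
# Sketch of the SPECpower calibration arithmetic described above.
# Calibration throughputs (ops/sec) are illustrative.

def target_loads(calibration_ops):
    # Max throughput = average of the final two calibration levels.
    max_ops = sum(calibration_ops[-2:]) / 2
    # Graduated targets from 100% down to 10% of the calibrated maximum.
    fractions = [x / 10 for x in range(10, 0, -1)]
    return {f"{int(f * 100)}%": round(max_ops * f) for f in fractions}

targets = target_loads([310_000, 322_000, 318_000])
print(targets["100%"], targets["50%"])  # 320000 160000
```

During the run, delays are inserted into the workload stream so the measured throughput tracks each of these targets while power is recorded.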
We execute SPECpower on a RHEL 7.8 VM with 96 vCPUs and 764 GB vRAM while
monitoring total host power consumption. Figure 9 shows the total host power consumed
and the percentage power savings in different policies compared to High Performance.
The primary Y-Axis on the graph shows the total host power consumed in Watts for each
of the load levels on the system. The secondary Y-Axis represents the percentage savings
in Balanced and Low Power policies compared to High Performance in each of those load
levels.
We find that during higher load levels, the power consumption in all three policies is
close. However, the difference in power consumption starts to manifest as the load
decreases. Starting at 50% CPU load, the Balanced and Low Power policies start saving
power compared to High Performance with a peak power savings of ~17% at 10% load and
~30% for the idle system.
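The percentage-savings series compared here is a simple relative difference against the High Performance baseline at the same load level; the wattages below are illustrative, not the measured values behind Figure 9.

```python
# Percentage power savings of a policy relative to the High Performance
# baseline at the same load level. Wattages are illustrative only.

def pct_savings(high_perf_w, policy_w):
    return 100.0 * (high_perf_w - policy_w) / high_perf_w

print(round(pct_savings(200.0, 166.0), 1))  # 17.0 (% saved at this level)
```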
[Figure 9 data: host power (W) on the primary Y-Axis, % power savings on the secondary Y-Axis, CPU load from 100% to active idle on the X-Axis. Series: High Performance, Balanced, Low Power, % Savings (Balanced), % Savings (Low Power).]
Figure 9: Host power consumption for various vSphere power policies during a SPECpower run
Figure 10 shows the SPECpower performance metric (ssj_ops) on the primary Y-Axis and
the average host power consumed during each load level on the secondary Y-Axis. Note
that SPECpower is a throughput-oriented workload, and all three policies achieve
similar performance. However, the Balanced and Low Power policies reduce power
consumption at lower load levels by using P-States and C-States.
[Figure 10 data: ssj_ops on the primary Y-Axis and average host power (W) on the secondary Y-Axis, across CPU loads from 100% to active idle.]
Figure 10: SPECpower performance & average host power consumption for vSphere power policies
Figure 11 shows the performance per watt for the SPECpower benchmark on the test
machine. The higher performance per watt seen for Balanced and Low Power, especially
for load levels less than 50%, is directly attributed to the extra power savings in those two
policies.
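Performance per watt at each load level is throughput divided by average power; an overall SPECpower-style efficiency figure divides total ops by total power across all levels, including active idle. The inputs below are illustrative, not measured values from this study.

```python
# Per-level performance-per-watt, plus an overall efficiency figure computed
# as total ssj_ops divided by total average power across levels (including
# active idle). All numbers below are illustrative.

levels = [            # (ssj_ops, avg_power_w) per target load
    (1_500_000, 300.0),
    (750_000, 220.0),
    (0, 120.0),       # active idle: draws power but performs no ops
]

per_level_ppw = [ops / watts for ops, watts in levels]
overall_ppw = sum(ops for ops, _ in levels) / sum(w for _, w in levels)

print(round(per_level_ppw[0]))  # 5000
print(round(overall_ppw))       # 3516
```

Because idle power appears in the denominator with zero ops in the numerator, policies that cut idle power (Balanced, Low Power) lift the overall efficiency score even when peak throughput is unchanged.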
[Figure 11: SPECpower performance per watt (thousands) across CPU loads from 100% to active idle for the three power policies.]
Overall, we observe that using Balanced or Low Power policies for throughput-oriented
applications like SPECpower can result in power savings with no loss in performance.
VMmark
A cloud environment typically comprises several diverse workloads on a virtualization
platform — a collection of physical servers accessing shared storage and network
resources. VMmark is a freely available benchmark suite from VMware used by hardware
vendors and others to evaluate virtual platforms’ performance, scalability, and power
consumption [11] [12].
The VMmark benchmark suite uses a tiled approach. Each tile consists of 19 VMs requiring
47 vCPUs with ~190 GB of DRAM and ~800 GB of storage space.
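From the per-tile figures above, the aggregate footprint of a multi-tile run is straightforward to estimate; the sketch below just scales the stated per-tile requirements.

```python
# Back-of-envelope sizing from the per-tile figures above: each VMmark tile
# needs 47 vCPUs, ~190 GB of DRAM, and ~800 GB of storage.

def tile_requirements(tiles, vcpus=47, dram_gb=190, storage_gb=800):
    return {"vcpus": tiles * vcpus,
            "dram_gb": tiles * dram_gb,
            "storage_gb": tiles * storage_gb}

print(tile_requirements(4))
# {'vcpus': 188, 'dram_gb': 760, 'storage_gb': 3200}
```

This is why even a single tile (47 vCPUs) keeps nearly all 48 physical cores of the test system active, as noted in the power results below.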
We ran VMmark with 1-tile, 2-tiles, 3-tiles, and 4-tiles on the test systems. Figure 12 shows
the VMmark score obtained for each of the runs. The secondary Y-Axis shows the PCPU
percentage utilization time in each of the runs. The VMmark score depends on the latency
observed in the different workloads. The High Performance policy achieves lower
latencies by avoiding the deeper C2 state and, consequently, earns a better overall score.
[Figure 12: VMmark scores (primary Y-Axis) and PCPU % utilization (secondary Y-Axis) for 1-, 2-, 3-, and 4-tile runs under the High Performance, Balanced, and Low Power policies.]
Table 3 shows the individual average benchmark score and the overall VMmark score for
a 4-tile execution. We notice that the web simulation benchmark (Weathervane) results
are similar with no noticeable difference in performance. However, the e-commerce
simulation benchmarks (DVD Store) show visible performance variations. For example,
the High Performance profile has a performance improvement of ~5% over the Balanced
profile and ~10% over the Low Power profile. As a result, the final VMmark score shows
the expected behavior of the power policies, with High Performance providing the best
performance result.
Table 3 (excerpt), High Performance row: 3,549 | 565 | 671 | 419 | 282 | overall VMmark score 3.61
Table 4 shows the average host power consumed for the three power policies during the runs.
Again, we observed very similar power consumption in all three policies. This is because
even one tile of VMmark needs 47 vCPUs and can keep all the available physical cores
(48) of the underlying system active.
Table 4: Average host power for vSphere power policies for the VMmark run
Figure 13 provides a run-time power consumption view during the execution of a 2-tile
VMmark run. The three different lines indicate the power consumed for the three different
power policies. We observed that power consumption for all policies is very similar;
however, we notice that the High Performance policy has higher peaks than Balanced and
Low Power. We can also observe the difference in power consumption during the
ramp-up and ramp-down stages.
[Figure 13 data: run-time host power (W, 200–500) for 2-tile runs under High Performance, Balanced, and Low Power.]
Figure 13: 2-Tile VMmark run for host power consumption for vSphere power policies
Figure 14 shows the host power consumption when running 1, 2, 3, and 4 tiles on the test
system in the Balanced policy. We observed the machine reaches saturation at 4-Tiles as
the host power consumption stays steady at its peak (550 W).
[Figure 14 data: run-time host power (W, 200–550) for 1-, 2-, 3-, and 4-tile runs under the Balanced policy.]
Figure 14: Host power consumption during VMmark tile execution on Balanced policy
View Planner
The results presented in the paper include five different loads, changing the number
of VDI VMs from 48 to 96, 144, 192, and 240, representing 1x to 5x CPU overcommitment.
View Planner was run in local mode with the standard profile workload. View Planner
workload mix consists of multiple applications running in the desktop virtual machines and
performing user operations. The quality of service (QoS) shown in milliseconds is the
metric of interest for the benchmark.
Table 5 shows the quality of service as the 95th-percentile latency in milliseconds. The first
three columns are CPU-sensitive (Group A), and the last three columns are storage-
sensitive (Group B). The default thresholds are 1,000 ms for Group A and 6,000 ms for
Group B. For the CPU-sensitive group, we see a minor reduction in latency with High
Performance; overall, the results are nearly identical for all three policies. For the
storage-sensitive group, we observe a slight increase in latency when moving from the
High Performance to the Low Power policy. However, this did not affect overall system
performance.
Figure 15 shows the average host power consumed in watts for various VM runs on the
primary Y-Axis. The secondary Y-Axis represents the average CPU utilization and the
maximum CPU utilization during each run. We observed that the High Performance policy
is consuming higher power, especially for 48 and 96 VMs. This is in line with our prior
observations because the C2 idle state is disabled in High Performance—when CPU
utilization is low, idle power consumption is higher. As the CPU load increases, the C2
residency in Balanced and Low Power decreases significantly, eroding the differences
between High Performance and the other two policies.
[Figure 15 data ("View Planner (VDI): Avg. Host Power & Avg. CPU Utilization"): average host power (W) on the primary Y-Axis and average/maximum CPU utilization (%) on the secondary Y-Axis for each VM count.]
Figure 15: VDI host power consumption and CPU utilization for vSphere power policies
Figure 16 shows the run-time host power consumption for a 96-VM VDI run. The workload
(standard_profile) runs five iterations, which can be observed in the power fluctuations.
[Figure 16 data: run-time host power (W, 0–450) for the 96-VM run under High Performance, Balanced, and Low Power.]
Figure 16: 96-VM VDI run of host power consumption for vSphere power policies
Figure 17 shows the host power consumption while executing the workload on 48, 96, 144,
192, and 240 VMs on the test system in the Balanced policy. The machine reaches
saturation at 240 VMs, as the host power consumption peaks throughout the execution.
[Figure 17 data: run-time host power (W, 0–500) for 48-, 96-, 144-, 192-, and 240-VM runs under the Balanced policy.]
Figure 17: VDI run of host power consumption with the Balanced policy
On the other hand, if the idle cores are in the C2 state, they consume much less power,
leaving higher thermal and power budgets for the active cores. For instance, in the test
system used for the study with the Cascade Lake processor, the max turbo frequency
achieved is 3.10 GHz when all cores are running. But when only four cores are active, the
max turbo frequency can go up to 3.70 GHz, given all idle cores are in C2. However, if the
idle cores are in C1, the max turbo frequency will be limited to 3.10 GHz.
Since Balanced and Low Power policies use deeper C2-State, they can achieve higher
turbo frequencies than the High Performance policy, especially when few cores are active.
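The turbo behavior described above can be sketched as a lookup: the 3.70 GHz and 3.10 GHz points come from the text, while the four-active-core threshold is an assumption for illustration (real turbo binning is finer-grained and entirely hardware controlled).

```python
# Sketch of the turbo behavior described above for the test CPU. The
# frequencies come from the text; the 4-core threshold is an assumed,
# simplified stand-in for the hardware's per-bin turbo tables.

def max_turbo_ghz(active_cores, idle_cstate):
    if idle_cstate != "C2":
        return 3.10          # C1 idle cores still draw power: all-core cap
    return 3.70 if active_cores <= 4 else 3.10

print(max_turbo_ghz(4, "C2"))   # 3.7
print(max_turbo_ghz(4, "C1"))   # 3.1
```

This is the mechanism behind the counterintuitive result below: policies that park idle cores in C2 free up thermal/power budget, letting the few active cores clock higher than under High Performance.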
We ran SPECpower in a 4-vCPU VM on the 48-core system under High Performance and
Balanced policies. Figure 18 shows performance on the primary Y-Axis and the average
host power consumed on the secondary Y-Axis in the 4-vCPU VM. While only 4 CPU
cores are active on the entire system, all other cores remain idle for most of the execution
period.
[Figure 18: SPECpower in a 4-vCPU VM — ssj_ops on the primary Y-Axis and host power (W) on the secondary Y-Axis, across CPU loads from 100% to active idle.]
Figure 19 shows the %aperf/mperf from esxtop while executing SPECpower in High
Performance and Balanced policies. We notice that on active cores, High Performance
achieves a maximum %aperf/mperf of ~130% (~3.10 GHz), and Balanced achieves a
maximum of ~151% (~3.70 GHz). This indicates that there can be cases when the Balanced
policy can provide slightly better performance than the High Performance policy.
[Figure 19: %aperf/mperf across CPU loads from 100% to active idle for the High Performance and Balanced policies.]
Best Practices
• Configure your BIOS settings to allow vSphere the most flexibility in using the power
management features offered by your hardware. Specifically, select the OS Control
mode (Performance Per Watt (OS)) under the BIOS power management options.
• To achieve the best performance-per-watt for various workloads, leave the power
policy setting at the default, Balanced.
• Turn Turbo mode on with all C-States enabled to get the maximum performance and
power benefits.
• For maximum power savings, switch the power policy to Low Power.
• If the host is idle or remains under-utilized for a prolonged amount of time, select Low
Power profile to maximize power savings.
High Performance — Latency-sensitive applications (higher CPU load): reduces latency at the cost of higher idle power.
Balanced — A mix of latency and throughput: the balance between latency and throughput.
Low Power — Low CPU load (underutilized systems): maximizes power savings at the cost of slightly increased latency.
Custom — User configuration: allows the user to tune P-States and C-States based on application requirements.
Acknowledgments
The authors thank all members of the VMware Performance Engineering team who
contributed to and reviewed this paper. In addition, the authors would like to
acknowledge Qasim Ali for providing constant technical background, valuable feedback,
and support for the article. They also thank Tim Mann, Xunjia (Richard) Lu, and Tony Lin
for helping with the experimental section and providing insights to interpret the results.
Appendix A
Table 1A: Power management parameters

Power.CStateMaxLatency — Do not use C-States whose latency is greater than this value.
Power.CStatePredictionCoef — A parameter in the ESXi algorithm for predicting how long a CPU that becomes idle will remain idle. Changing this value is not recommended.
Power.CStateResidencyCoef — When a CPU becomes idle, choose the deepest C-State whose latency multiplied by this value is less than the host's prediction of how long the CPU will remain idle. Larger values make ESXi more conservative about using deep C-States; smaller values are more aggressive.
Power.MaxCpuLoad — Use P-States to save power on a CPU only when the CPU is busy for less than the given percentage of real time.
Power.MaxFreqPct — Do not use P-States faster than the given percentage of full CPU speed, rounded up to the next available P-State.
Power.MinFreqPct — Do not use any P-States slower than the given percentage of full CPU speed.
Power.PerfBias — Performance Energy Bias Hint (Intel only). Sets an MSR on Intel processors to an Intel-recommended value. For example, Intel recommends 0 for high performance, 6 for balanced, and 15 for low power. Other values are undefined.
Power.TimerHz — Controls how many times per second ESXi reevaluates which P-State each CPU should be operating in.
Power.UseCStates — Use deep ACPI C-States (C2 or below) when the processor is idle.
Power.UsePStates — Use ACPI P-States to save power when the processor is busy.
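The selection rule implied by Power.CStateResidencyCoef can be sketched as follows. The latency values are hypothetical, and the real ESXi implementation is more involved; this only mirrors the rule stated in the parameter description: pick the deepest C-state whose latency times the coefficient is below the predicted idle time.

```python
# Sketch of the rule described for Power.CStateResidencyCoef: choose the
# deepest C-state whose latency * coefficient is less than the predicted
# idle time. Latencies (microseconds) are hypothetical illustration values.

def pick_cstate(predicted_idle_us, coef, latencies_us):
    """latencies_us: mapping like {'C1': 2, 'C2': 50}, shallow to deep."""
    chosen = "C1"  # the shallow state is always an acceptable fallback
    for state, lat in latencies_us.items():
        if lat * coef < predicted_idle_us:
            chosen = state  # deeper states appear later; last match wins
    return chosen

LAT = {"C1": 2, "C2": 50}
print(pick_cstate(1000, 5, LAT))  # C2: 50 * 5 = 250 < 1000
print(pick_cstate(100, 5, LAT))   # C1: 250 >= 100, so stay shallow
```

A larger coefficient shrinks the set of idle periods that qualify for C2, which matches the description: larger values make ESXi more conservative about deep C-states.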