HA and DRS
Technical Deepdive
Frank Denneman is a Consulting Architect working for VMware as part of the Professional
Services Organization. Frank works primarily with large Enterprise customers and Service
Providers. He is focused on designing large vSphere Infrastructures and specializes in Resource
Management, DRS in general and storage. Frank is a VMware Certified Professional and among the
first VMware Certified Design Experts (VCDX 029). Frank is the owner of FrankDenneman.nl which
has recently been voted number 6 worldwide on vsphere-land.com. He can be followed on twitter
at http://twitter.com/FrankDenneman.
Table of Contents
About the Authors
Acknowledgements
Foreword
Introduction to VMware High Availability
How Does High Availability Work?
Pre-requisites
Firewall Requirements
Configuring VMware High Availability
Components of High Availability
VPXA
VMAP Plug-In
AAM
Nodes
Promoting Nodes
Failover Coordinator
Preferred Primary
High Availability Constructs
Isolation Response
Split-Brain
Isolation Detection
Selecting an Additional Isolation Address
Failure Detection Time
Adding Resiliency to HA (Network Redundancy)
Single Service Console with vmnics in Active/Standby Configuration
Secondary Management Network
Admission Control
Admission Control Policy
Admission Control Mechanisms
Host Failures Cluster Tolerates
Unbalanced Configurations and Impact on Slot Calculation
Percentage of Cluster Resources Reserved
Failover Host
Impact of Admission Control Policy
Host Failures Cluster Tolerates
Percentage of Cluster Resources Reserved
Specify a Failover Host
Recommendations
VM Monitoring
Why Do You Need VM/Application Monitoring?
How Does VM/App Monitoring Work?
Is AAM enabling VM/App Monitoring?
Screenshots
vSphere 4.1 HA and DRS Integration
Affinity Rules
Resource Fragmentation
DPM
Flattened Shares
Summarizing
What is VMware DRS?
Cluster Level Resource Management
Requirements
Rules
VM-VM Affinity Rules
VM-Host Affinity Rules
Impact of Rules on Organization
Virtual Machine Automation Level
Impact of VM Automation Level on DRS Load Balancing Calculation
Resource Pools and Controls
Root Resource Pool
Resource Pools
Resource pools and simultaneous vMotions
Under Committed versus Over Committed
Resource Allocation Settings
Shares
Reservation
VM Level Scheduling: CPU vs Memory
Impact of Reservations on VMware HA Slot Sizes.
Behavior of Resource Pool Level Memory Reservations
Setting a VM Level Reservation inside a Resource Pool
VMkernel CPU reservation for vMotion
Reservations Are Not Limits.
Memory Overhead Reservation
Expandable Reservation
Limits
CPU Resource Scheduling
Memory Scheduler
Distributed Power Management
Enable DPM
Templates
DPM Threshold and the Recommendation Rankings
Evaluating Resource Utilization
Virtual Machine Demand and ESX Host Capacity Calculation
Evaluating Power-On and Power-Off Recommendations
Resource LowScore and HighScore
Host Power-On Recommendations
Host Power-Off Recommendations
DPM Power-Off Cost/Benefit Analysis
Integration with DRS and High Availability
Distributed Resource Scheduler
High Availability
DPM awareness of High Availability Primary Nodes
DPM Standby Mode
DPM WOL Magic Packet
Baseboard Management Controller
Protocol Selection Order
DPM and Host Failure Worst Case Scenario
DRS, DPM and VMware Fault Tolerance
DPM Scheduled Tasks
Summarizing
Appendix A Basic Design Principles
VMware High Availability
VMware Distributed Resource Scheduler
Appendix B HA Advanced Settings
Acknowledgements
The authors of this book work for VMware. The opinions expressed here are the authors' personal
opinions. Content published was not read or approved in advance by VMware and does not
necessarily reflect the views and opinions of VMware. This is the authors' book, not a VMware book.
First of all we would like to thank our VMware management team (Steve Beck, Director; Rob
Jenkins, Director) for supporting us on this and other projects.
A special thanks goes out to our Technical Reviewers: fellow VCDX Panel Member Craig Risinger
(VMware PSO), Marc Sevigny (VMware HA Engineering), Anne Holler (VMware DRS Engineering)
and Bouke Groenescheij (Jume.nl) for their very valuable feedback and for keeping us honest.
A very special thanks to our families and friends for supporting this project. Without your support
we could not have done this.
We would like to dedicate this book to the VMware Community. We highly appreciate all the effort
everyone is putting in to take VMware, Virtualization and Cloud to the next level. This is our gift to
you.
Foreword
Since its inception, server virtualization has forever changed how we build and manage the
traditional x86 datacenter. In its early days of providing an enterprise-ready hypervisor, VMware
focused their initial virtualization efforts to meet the need for server consolidation. Increased
optimization of low-utilized systems and lowering datacenter costs of cooling, electricity, and floor
space requirements was a surefire recipe for VMware's early success. Shortly after introducing
virtualization solutions, customers started to see the significant advantages introduced by the
increased portability and recoverability that were all of a sudden available.
It's this increased portability and recoverability that significantly drove VMware's adoption during
its highest growth period. Recovery capabilities and options that were once reserved for the most
critical of workloads within the world's largest organizations became broadly available to the
masses. Replication, High-Availability, and Fault Tolerance were once synonymous with Expensive
Enterprise Solutions, but are now available to even the smallest of companies. Data protection
enhancements, when combined with intelligent resource management, placed
VMware squarely at the top of the market leadership board. VMware's virtualization platform can
provide near instant recovery time with increasingly more recent recovery points in a properly
designed environment.
Now, if you've read this far, you likely understand the significant benefits that virtualization can
provide, and are probably well on your way to building out your virtual infrastructure and strategy.
The capabilities provided by VMware are not ultimately what dictates the success and failure of a
virtualization project, especially as increasingly more critical applications are introduced and
require greater availability and recoverability service levels. It takes a well-designed virtual
infrastructure and a full understanding of how the business requirements of the organization align
to the capabilities of the platform.
This book is going to arm you with the information necessary to understand the in-depth details of
what VMware can provide you when it comes to improving the availability of your systems. This
will help you better prepare for, and align to, the requirements of your business as well as set the
proper expectations with the key stakeholders within the IT organization. Duncan and Frank have
poured their extensive field experience into this book to enable you to drive broader virtualization
adoption across more complex and critical applications. This book will enable you to make the
most educated decisions as you attempt to achieve the next level of maturity within your virtual
environment.
Scott Herold
Lead Architect, Virtualization Business, Quest Software
Part 1
VMware High Availability
Chapter 1
Introduction to VMware High Availability
VMware High Availability (HA) provides a simple and cost effective clustering solution to increase
uptime for virtual machines. HA uses a heartbeat mechanism to detect a host or virtual machine
failure. In the event of a host failure, affected virtual machines are automatically restarted on other
production hosts within the cluster with spare capacity. In the case of a failure caused by the Guest
OS, HA restarts the failed virtual machine on the same host. This feature is called VM Monitoring,
but is sometimes also referred to as VM HA.
Figure 1: High Availability in action
Unlike many other clustering solutions, HA is literally configured and enabled with 4 clicks.
However, HA is not, and let's repeat it, is not a 1:1 replacement for solutions like Microsoft
Clustering Services (MSCS). MSCS and, for instance, Linux Clustering are stateful clustering solutions
where the state of the service or application is preserved when one of the nodes fails. The service is
transitioned to one of the other nodes and it should resume with limited downtime or loss of data.
With HA the virtual machine is literally restarted and this incurs downtime. HA is a form of stateless
clustering.
One might ask why you would want to use HA when a virtual machine is restarted and service is
temporarily lost. The answer is simple: not all virtual machines (or services) need 99.999% uptime.
For many services the type of availability HA provides is more than sufficient. Stateful clustering
does not guarantee 100% uptime, can be complex and needs special skills and training. One example
is managing patches and updates/upgrades in an MSCS environment; this could even cause more
downtime if not operated correctly. Just as a service or application is restarted during an MSCS
failover, the affected virtual machines are restarted by HA.
Besides that, HA reduces complexity, costs (associated with downtime and MSCS), resource
overhead and unplanned downtime for minimal additional costs. It is important to note that HA,
contrary to MSCS, does not require any changes to the guest as HA is provided on the hypervisor
level. Also, VM Monitoring does not require any additional software or OS modifications except for
VMware Tools, which should be installed anyway.
We can't think of a single reason not to use it.
Pre-requisites
For those who want to configure HA, the following items are the pre-requisites in order for HA to
function correctly:
We recommend against using a mixed cluster, by which we mean a single cluster containing both
ESX and ESXi hosts. Differences in build numbers have led to serious issues in the past when using
VMware FT (KB article 1013637).
Firewall Requirements
The following list contains the ports that are used by HA for communication. If your environment
contains firewalls ensure these ports are opened for HA to function correctly.
High Availability port settings:
When the HA cluster has been created ESX hosts can be added to the cluster simply by dragging
them into the cluster. When an ESX host is added to the cluster the HA agent will be loaded.
Chapter 2
Components of High Availability
Now that we know what the pre-requisites are and how to configure HA, the next step is to
describe which components form HA. This is still a high-level overview, however; there is more
under the covers that we will explain in the following chapters. The following diagram depicts a
two-host cluster and shows the key HA components.
Figure 3: Components of High Availability
As you can clearly see there are three major components that form the foundation for HA:
VPXA
VMAP
AAM
VPXA
The first and probably the most important is VPXA. This is not an HA agent, but it is the vCenter
agent and it allows your vCenter Server to interact with your ESX host. It also takes care of
stopping and starting virtual machines if and when needed.
HA is loosely coupled with vCenter Server. Although HA is configured by vCenter Server, it does not
need vCenter to manage an HA failover. It is comforting to know that in case of a host failure
containing the virtualized vCenter server, HA takes care of the failure and restarts the vCenter
server on another host, including all other configured virtual machines from that failed host.
When a virtual vCenter is used we do however recommend setting the correct restart priorities
within HA to avoid any dependency problems.
It's highly recommended to register ESX hosts with their FQDN in vCenter. VMware vCenter
supplies the name resolution information that HA needs to function. HA stores this locally in a file
called FT_HOSTS. In other words, from an HA perspective there is no need to create local host files
and it is our recommendation to avoid using local host files. They are too static and will make
troubleshooting more difficult.
To stress this point even more: as of vSphere 4.0 Update 1, host files (i.e. /etc/hosts) are corrected
automatically by HA. In other words, if you have made a typo or, for example, forgot to add the short
name, HA will correct the host file to make sure nothing interferes with HA.
VMAP Plug-In
Next on the list is VMAP. Where vpxa is the process for vCenter to communicate with the host,
VMAP is the translator between the HA agent (AAM) and vpxa. When vpxa wants to communicate with
the AAM agent, VMAP will translate this into understandable instructions for the AAM agent. A good
example of what VMAP would translate is the state of a virtual machine: is it powered on or
powered off? Pre-vSphere 4.0, VMAP was a separate process instead of a plugin linked into vpxa.
VMAP is loaded into vpxa at runtime when a host is added to an HA cluster.
The vpxa communicates with VMAP and VMAP communicates with AAM. When AAM has received
and flushed the info, it will tell VMAP, and VMAP in turn will acknowledge to vpxa that the info has
been processed. The VMAP plug-in acts as a proxy for communication to AAM.
One thing you are probably wondering is why we need VMAP in the first place. Wouldn't this be
something vpxa or AAM should be able to do? The answer is yes, either vpxa or AAM should be able
to carry this functionality. However, when HA was first introduced it was architecturally more
prudent to create a separate process for dealing with this, which has now been turned into a plugin.
AAM
That brings us to our next and final component, the AAM agent. The AAM agent is the core of HA
and actually stands for Automated Availability Manager. As stated above, AAM was originally
developed by Legato. It is responsible for many tasks such as communicating host resource
information, virtual machine states and HA properties to other hosts in the cluster. AAM stores all
this info in a database and ensures consistency by replicating this database amongst all primary
nodes. (Primary nodes are discussed in more detail in chapter 4.) It is often mentioned that HA uses
an in-memory database only; this is not the case! The data is stored in a database on local storage or
in FLASH memory on diskless ESXi hosts.
One of the other tasks AAM is responsible for is the mechanism with which HA detects
isolations/failures: heartbeats.
All this makes the AAM agent one of the most important processes on an ESX host, when HA is
enabled of course, but we are assuming for now it is. The engineers recognized the importance and
added an extra level of resiliency to HA. The agent is multi-process and each process acts as a
watchdog for the other. If one of the processes dies the watchdog functionality will pick up on this
and restart the process to ensure HA functionality remains without anyone ever noticing it failed. It
is also resilient to network interruptions and component failures. Inter-host communication
automatically uses another communication path (if the host is configured with redundant
management networks) in the case of a network failure. The underlying message framework
guarantees exactly-once message delivery.
Chapter 3
Nodes
Now that you know what the components of HA are, it is time to start talking about one of the
most crucial concepts when it comes to designing HA clusters.
Everyone who has implemented VMware VI3 or vSphere knows that multiple hosts can form a
cluster. A cluster can best be seen as a collection of resources. These resources can be carved up
with the use of VMware Distributed Resource Scheduler (DRS) into separate pools of resources or
used to increase availability by enabling HA.
Before we discuss the various options one has during the configuration of HA, there is one
important aspect that needs to be discussed first, and that is the concept of nodes. It is important to
understand the concept of nodes, as the way they work can and will influence your design.
The following diagram depicts the concepts of nodes:
Figure 4: Primary and secondary hosts
An HA cluster consists of hosts, or nodes as HA calls them. There are two types of nodes. A node is
either a primary or a secondary node. This concept was introduced to enable scaling up to 32 hosts
in a cluster and each type of node has a different role. Primary nodes hold cluster settings and all
node states. The data a primary node holds is stored in a persistent database and synchronized
between primaries as depicted in the diagram above.
An example of node state data would be host resource usage. In case vCenter is not available the
primary nodes will always have a very recent calculation of the resource utilization and can take
this into account when a failover needs to occur. Secondary nodes send their state info to primary
nodes. This will be sent when changes occur, generally within seconds after a change. As of vSphere
4.1 by default every host will send an update of its status every 10 seconds. Pre-vSphere 4.1 this
used to be every second.
This interval can be controlled by an advanced setting called das.sensorPollingFreq. As stated
before, the default value of this advanced setting is 10. Although a smaller value will lead to a more
up-to-date view of the status of the cluster overall, it will also increase the amount of traffic between
nodes. It is not recommended to decrease this value as it might lead to decreased scalability due to
the overhead of these status updates. The maximum value of the advanced setting is 30.
As discussed earlier, HA uses a heartbeat mechanism to detect possible outages or network
isolation; in other words, it is used to detect a failed or isolated node. A node itself, however, will
recognize it is isolated by the fact that it isn't receiving heartbeats from any of the other nodes.
Nodes send a heartbeat to each other. Primary nodes send heartbeats to all primary nodes and all
secondary nodes. Secondary nodes send their heartbeats to all primary nodes, but not to
secondaries. Nodes send out these heartbeats every second by default. However, this is a
configurable value through the use of the following cluster advanced setting:
das.failuredetectioninterval. We do however not recommend changing this interval as it was
carefully selected by VMware.
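To make the heartbeat topology concrete, the following minimal Python sketch works out which nodes each host sends heartbeats to. The host names and the heartbeat_targets function are hypothetical and only illustrate the rule described above; this is not how the AAM agent itself is implemented.

```python
from typing import Dict, List


def heartbeat_targets(roles: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each node to the nodes it sends heartbeats to.

    `roles` maps a (hypothetical) host name to "primary" or "secondary".
    """
    primaries = sorted(h for h, r in roles.items() if r == "primary")
    targets: Dict[str, List[str]] = {}
    for host, role in roles.items():
        if role == "primary":
            # Primaries send heartbeats to all other primaries and to all secondaries.
            targets[host] = sorted(h for h in roles if h != host)
        else:
            # Secondaries send heartbeats to primaries only, never to other secondaries.
            targets[host] = list(primaries)
    return targets


cluster = {
    "esx01": "primary",
    "esx02": "primary",
    "esx03": "secondary",
    "esx04": "secondary",
}
for host, peers in heartbeat_targets(cluster).items():
    print(host, "->", ", ".join(peers))
```

Because secondaries never exchange heartbeats with each other, the number of heartbeat paths grows with the number of primaries rather than with the square of the cluster size.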
The first 5 hosts that join the HA cluster are automatically selected as primary nodes. All other
nodes are automatically selected as secondary nodes. When you do a reconfigure for HA, the
primary nodes and secondary nodes are selected again; this is virtually random.
Except for the first host that is added to the cluster, any host that joins the cluster must
communicate with an existing primary node to complete its configuration. At least one primary host
must be available for HA to operate correctly. If all primary hosts are unavailable, you will not be
able to add or remove a host from your cluster.
The vCenter client normally does not show which host is a primary node and which is a secondary
node. As of vCenter 4.1 a new feature has been added which is called Operational Status and can
be found in the HA section of the cluster's Summary tab. It will give details around errors and will
show the primary and secondary nodes. There is one gotcha however; it will only show which
nodes are primary and secondary in case of an error.
This however can also be revealed from the Service Console or via PowerCLI. The following are two
examples of how to list the primary nodes via the Service Console (ESX 4.0):
With PowerCLI the primary nodes can be listed with the following lines of code:
Now that you have seen that it is possible to list all nodes with the CLI, you probably
wonder what else is possible. Let's start with a warning: this is not supported! Currently the
supported limit of primaries is 5. This is a soft limit, however. It is possible to manually add a 6th
primary, but this is neither supported nor encouraged.
Having more than 5 primaries in a cluster will significantly increase network and CPU overhead.
There should be no reason to increase the number of primaries beyond 5. For the purpose of
education we will demonstrate how to promote a secondary node to primary and vice versa.
To promote a node:
To demote a node:
Figure 9: Demote node command
This method, however, is unsupported and there is no guarantee it will keep working in the
future. On earlier versions of ESX, ftcli should be used. This command can't be run, however,
without setting the required environment variables first. You can execute
/config/agent_env.[platform] to set these.
Promoting Nodes
A common misunderstanding about HA with regards to primary and secondary nodes is the
re-election process. When does a re-election, or promotion, occur?
It is a common misconception that a promotion of a secondary occurs when a primary node fails.
This is not the case. Let's stress that: this is not the case! The promotion of a secondary node to
primary only occurs in one of the following scenarios:
This is particularly important for the operational aspect of a virtualized environment. When a host
fails, it is important to ensure its role is migrated to one of the other hosts in case it was an HA
primary node. To put it simply: when a host fails, we recommend placing it in maintenance mode,
disconnecting it or removing it from the cluster to avoid any risks!
If all primary hosts fail simultaneously no HA initiated restart of the virtual machines can take
place. HA needs at least one primary node to restart virtual machines. This is why you can configure
HA to tolerate only up to 4 host failures when you have selected the Host Failures Cluster Tolerates
Admission Control Policy (remember, there are 5 primaries). The number of primaries is definitely
something to take into account when designing for uptime.
Failover Coordinator
As explained in the previous section, you will need at least one primary to restart virtual machines.
The reason for this is that one of the primary nodes will hold the failover coordinator role. This
role is randomly assigned to a primary node and is sometimes also referred to as the active
primary. We will use the term failover coordinator for now.
The failover coordinator coordinates the restart of virtual machines on the remaining primary and
secondary hosts. The coordinator takes restart priorities into account when coordinating the restarts.
Pre-vSphere 4.1, when multiple hosts failed at the same time, it would handle the restarts
serially. In other words, it would restart the virtual machines of the first failed host (taking restart
priorities into account) and then restart the virtual machines of the host that failed second (again
taking restart priorities into account). As of vSphere 4.1 this mechanism has been significantly
improved. In the case of multiple near-simultaneous host failures, all the host failures that occur
within 15 seconds will have all their VMs aggregated and prioritized before the power-on operations occur.
If the failover coordinator fails, one of the other primaries will take over. This node is again
randomly selected from the pool of available primary nodes. Like any other process within the HA
stack, the failover coordinator process is carefully watched by the watchdog functionality of HA.
Pre-vSphere 4.1 the failover coordinator would decide where a virtual machine would be restarted.
Basically it would check which host had the highest percentage of unreserved and available
memory and CPU and select it to restart that particular virtual machine. For the next virtual
machine HA would repeat the same exercise: select the host with the highest percentage of
unreserved memory and CPU and restart the virtual machine.
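The pre-vSphere 4.1 placement logic can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual HA algorithm: the Host class, the place function and all figures are hypothetical, and the source does not say how the CPU and memory percentages are combined, so the sketch simply takes the lower of the two. It also does not check whether the chosen host can actually satisfy the reservation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Host:
    name: str
    cpu_total_mhz: int
    cpu_unreserved_mhz: int
    mem_total_mb: int
    mem_unreserved_mb: int

    def headroom(self) -> float:
        """Fraction of unreserved capacity (assumption: the lower of CPU and memory)."""
        cpu_pct = self.cpu_unreserved_mhz / self.cpu_total_mhz
        mem_pct = self.mem_unreserved_mb / self.mem_total_mb
        return min(cpu_pct, mem_pct)


def place(vms: List[Tuple[str, int, int]], hosts: List[Host]) -> Dict[str, str]:
    """Place VMs one at a time, re-evaluating the hosts after every placement.

    `vms` holds (name, cpu_reservation_mhz, mem_reservation_mb) tuples,
    already sorted by restart priority.
    """
    placement: Dict[str, str] = {}
    for vm_name, cpu_res, mem_res in vms:
        # Pick the host with the highest percentage of unreserved resources.
        target = max(hosts, key=Host.headroom)
        target.cpu_unreserved_mhz -= cpu_res
        target.mem_unreserved_mb -= mem_res
        placement[vm_name] = target.name
    return placement


hosts = [
    Host("esx01", 10000, 8000, 32768, 16384),
    Host("esx02", 10000, 6000, 32768, 24576),
]
vms = [("vm1", 2000, 8192), ("vm2", 500, 1024), ("vm3", 500, 1024)]
print(place(vms, hosts))
```

Because every decision is made per virtual machine against the current state of the hosts, the result depends on the order in which the virtual machines are restarted.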
HA does not coordinate with DRS when making the decision on where to place virtual machines.
Instead, HA would rely on DRS after the fact: as soon as the virtual machines were restarted, DRS
would kick in and redistribute the load if and when needed.
As of vSphere 4.1 virtual machines will be evenly distributed across hosts to lighten the load on the
hostd service and to get quicker power-on results. HA then relies on DRS to redistribute the load
later if required. This improvement results in faster restarts of the virtual machines and less stress
on the ESX hosts. DRS also re-parents the virtual machine when it is booted up, as virtual machines
are failed over into the root resource pool by default. This re-parenting process, however, already
existed pre-vSphere 4.1.
The failover coordinator can restart up to 32 VMs concurrently per host. The number of concurrent
failovers can be controlled by an advanced setting called das.perHostConcurrentFailoversLimit. As
stated the default value is 32. Setting a larger value will allow more VMs to be restarted
concurrently and might reduce the overall VM recovery time, but the average latency to recover
individual VMs might increase.
In blade environments it is particularly important to factor the primary nodes and failover
coordinator concept into your design. When designing a multi chassis environment the impact of a
single chassis failure needs to be taken into account. When all primary nodes reside in a single
chassis and the chassis fails, no virtual machines will be restarted as the failover coordinator is the
only one who initiates the restart of your virtual machines. When it is unavailable, no restart will
take place.
It is a best practice to have the primaries distributed amongst the chassis so that, if an entire chassis
fails or a rack loses power, there is still a running primary to coordinate the failover. This can even
be extended in very large environments by having no more than 2 hosts of a cluster in a chassis.
The following diagram depicts the scenario where four 8-host clusters are spread across four
chassis.
Basic design principle: In blade environments, divide hosts over all blade chassis and never exceed
four hosts per chassis to avoid having all primary nodes in a single chassis.
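A quick way to verify this design principle against an inventory is sketched below in Python. The host-to-chassis mapping, the names and the chassis_violations function are hypothetical; the only rule taken from the text is that, with 5 primaries, no chassis should hold more than four hosts of the same cluster.

```python
from collections import Counter
from typing import Dict

MAX_PRIMARIES = 5
# With at most 4 hosts of a cluster per chassis, at least one primary lives elsewhere.
MAX_HOSTS_PER_CHASSIS = MAX_PRIMARIES - 1


def chassis_violations(host_to_chassis: Dict[str, str]) -> Dict[str, int]:
    """Return the chassis that hold too many hosts of one cluster, with their host count."""
    counts = Counter(host_to_chassis.values())
    return {chassis: n for chassis, n in counts.items() if n > MAX_HOSTS_PER_CHASSIS}


# An 8-host cluster spread evenly over four chassis, two hosts per chassis.
spread = {f"esx{i:02d}": f"chassis{(i - 1) // 2 + 1}" for i in range(1, 9)}
print(chassis_violations(spread))

# The same cluster with six hosts packed into one chassis.
packed = {f"esx{i:02d}": ("chassis1" if i <= 6 else "chassis2") for i in range(1, 9)}
print(chassis_violations(packed))
```

Run per cluster, a check like this flags layouts in which a single chassis failure could take out all primary nodes at once.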
Preferred Primary
With vSphere 4.1 a new advanced setting has been introduced. This setting is not even
experimental; it is currently considered unsupported. We don't recommend that anyone use it in a
production environment. If you do want to play around with it, use your test environment.
This new advanced setting is called das.preferredPrimaries. With this setting multiple hosts of a
cluster can be manually designated as a preferred node during the primary node election process.
The list of nodes can either be comma or space separated and both hostnames and IP addresses are
allowed. Below you can find an example of what this would typically look like. The = sign has been
used as a divider between the setting and the value.
das.preferredPrimaries = hostname1,hostname2,hostname3
or
das.preferredPrimaries = 192.168.1.1 192.168.1.2 192.168.1.3
As shown, there is no need to specify 5 hosts; you can specify any number of hosts. If you specify 5
hosts or fewer and all of them are available, they will become the primary nodes in your cluster. If you
specify more than 5 hosts, the first 5 hosts of your list will become primary.
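The value format described above can be illustrated with a small Python sketch. The parse_preferred_primaries function is hypothetical and only mimics the documented behaviour (comma or space separated entries, the first 5 win); it does not model host availability and says nothing about how HA parses the setting internally.

```python
import re
from typing import List

MAX_PRIMARIES = 5


def parse_preferred_primaries(value: str, limit: int = MAX_PRIMARIES) -> List[str]:
    """Split a das.preferredPrimaries value and keep the first `limit` entries."""
    # Entries may be separated by commas, whitespace, or a mix of both.
    entries = [e for e in re.split(r"[,\s]+", value.strip()) if e]
    return entries[:limit]


print(parse_preferred_primaries("hostname1,hostname2,hostname3"))
print(parse_preferred_primaries("192.168.1.1 192.168.1.2 192.168.1.3"))
print(parse_preferred_primaries("h1,h2,h3,h4,h5,h6,h7"))
```

With seven entries in the list, only the first five are kept as preferred primaries; the remaining two would be treated like any other host.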
Again, please be warned that this is considered unsupported at the time of writing. Please verify in
the VMware Availability Guide or online in the knowledge base (kb.vmware.com) what the support
status of this feature is before even thinking about implementing it.
A workaround used by some pre-vSphere 4.1 was the promote/demote option of HA's CLI,
as described earlier in this chapter. Although this solution could fairly simply be scripted, it is
unsupported and, as opposed to das.preferredPrimaries, a rather static solution.
Chapter 4
High Availability Constructs
When configuring HA two major decisions will need to be made.
Isolation Response
Admission Control
Both are important to how HA behaves. Both will also have an impact on availability. It is really
important to understand these concepts. Both concepts have specific caveats. Without a good
understanding of these it is very easy to increase downtime instead of decreasing downtime.
Isolation Response
One of the first decisions that will need to be made when HA is configured is the isolation
response. The isolation response refers to the action that HA takes for its VMs when the host has
lost its connection with the network. This does not necessarily mean that the whole network is
down; it could just be this host's network ports or just the ports that are used by HA for the
heartbeat. Even if your virtual machine has a network connection and only your heartbeat
network is isolated, the isolation response is triggered.
Today there are three isolation responses: Power off, Leave powered on and Shut down. The
isolation response answers the question of what a host should do when it has detected that it is
isolated from the network. Whichever of the following three options is chosen as the isolation
response, the remaining non-isolated hosts will always try to restart the virtual machines:
Power off: When network isolation occurs all virtual machines are powered off. It is a hard
stop or, to put it bluntly, the power cable of the VMs will be pulled out!
Shut down: When network isolation occurs all virtual machines running on the host will be
shut down using VMware Tools. If this is not successful within 5 minutes, a power off will
be executed. This timeout value can be adjusted by setting the advanced option
das.isolationShutdownTimeout. If VMware Tools is not installed, a power off will be
initiated immediately.
Leave powered on: When network isolation occurs on the host, the state of the virtual
machines remains unchanged.
This setting can be changed on the cluster settings under virtual machine options.
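The decision an isolated host makes per virtual machine can be summarized in a short Python sketch. The IsolationResponse enum and the isolation_actions function are hypothetical names used for illustration only; the 5 minute default comes from the text, and expressing it in seconds is an assumption made for this sketch.

```python
from enum import Enum
from typing import List, Optional


class IsolationResponse(Enum):
    POWER_OFF = "Power off"
    SHUT_DOWN = "Shut down"
    LEAVE_POWERED_ON = "Leave powered on"


# Default of 5 minutes; modelling the timeout in seconds is an assumption of this sketch.
DEFAULT_SHUTDOWN_TIMEOUT_S = 300


def isolation_actions(
    response: IsolationResponse,
    tools_installed: bool,
    shutdown_seconds: Optional[float] = None,
    timeout_s: float = DEFAULT_SHUTDOWN_TIMEOUT_S,
) -> List[str]:
    """Return the ordered actions taken for one VM on an isolated host.

    `shutdown_seconds` is how long the guest shutdown takes, or None if it never completes.
    """
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return []  # the state of the VM remains unchanged
    if response is IsolationResponse.POWER_OFF:
        return ["power off"]  # hard stop
    # Shut down: a clean guest shutdown requires VMware Tools.
    if not tools_installed:
        return ["power off"]
    if shutdown_seconds is not None and shutdown_seconds <= timeout_s:
        return ["guest shutdown"]
    # The guest shutdown did not finish within the timeout, so fall back to a power off.
    return ["guest shutdown", "power off"]


print(isolation_actions(IsolationResponse.SHUT_DOWN, tools_installed=True, shutdown_seconds=45))
print(isolation_actions(IsolationResponse.SHUT_DOWN, tools_installed=True, shutdown_seconds=None))
print(isolation_actions(IsolationResponse.SHUT_DOWN, tools_installed=False))
```

Note that this only models the isolated host. Regardless of the outcome, the remaining hosts in the cluster will attempt to restart the virtual machines.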
The default setting for the isolation response has changed multiple times over the last couple of
years. Up to ESX 3.5 U2 / vCenter 2.5 U2 the default isolation response when creating a new cluster
was Power off. This changed to Leave powered on as of ESX 3.5 U3 / vCenter 2.5 U3. However
with vSphere 4.0 this has changed again. The default setting for newly created clusters, at the time
of writing, is Shut down, which might not be the desired response. When installing a new
environment, you might want to change the default setting based on your customer's requirements
or constraints.
The question remains, which setting should you use? The obvious answer applies here; it depends.
We prefer Shut down because we do not want to use a degraded host to run our virtual machines
on and it will shut down your virtual machines in a clean manner. Many people, however, prefer to use
Leave powered on because it eliminates the chances of having a false positive and the associated
downtime with a false positive. A false positive in this case is an isolated heartbeat network but a
non-isolated virtual machine network and a non-isolated iSCSI / NFS network.
That leaves the question of how the other HA nodes know if the host is isolated or failed.
HA actually does not know the difference. The other HA nodes will try to restart the affected virtual
machines in either case. When the host is unavailable, a restart attempt will take place no matter
which isolation response has been selected. If a host is merely isolated, the non-isolated hosts will
not be able to restart the affected virtual machines. The reason for this is the fact that the host that
is running the virtual machine has a lock on the VMDK and swap files. None of the hosts will be able
to boot a virtual machine when the files are locked. For those who don't know, ESX locks files to
prevent the possibility of multiple ESX hosts starting the same virtual machine. However, when a
host fails, this lock expires and a restart can occur.
To reiterate, the remaining nodes will always try to restart the failed virtual machines. The
possible lock on the VMDK files belonging to these virtual machines, in the case of an isolation
event, prevents them from being started. This assumes that the isolated host can still reach the files,
which might not be true if the files are accessed through the network on iSCSI, NFS, or FCoE based
storage. HA however will repeatedly try starting the failed virtual machines when a restart is
unsuccessful.
The amount of retries is configurable as of vCenter 2.5 U4 with the advanced option
das.maxvmrestartcount. The default value is 5. Pre-vCenter 2.5 U4 HA would keep retrying
forever which could lead to serious problems as described in KB article 1009625 where multiple
virtual machines would be registered on multiple hosts simultaneously leading to a confusing and
inconsistent state. (http://kb.vmware.com/kb/1009625)
HA will try to start the virtual machine on one of your hosts in the affected cluster; if this is
unsuccessful on that host, the restart count will be increased by 1. The next restart attempt will
then occur after two minutes. If that one fails, the next will occur after 4 minutes, and if that one
fails the following will occur after 8 minutes until the das.maxvmrestartcount has been reached.
T+0 Restart
T+2 Restart retry 1
T+4 Restart retry 2
T+8 Restart retry 3
T+8 Restart retry 4
T+8 Restart retry 5
As shown in the list above and depicted in the diagram below, a successful power-on
attempt could take up to 30 minutes when multiple power-on attempts are unsuccessful.
However, HA does not give a guarantee, and a successful power-on attempt might never take
place.
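The retry timeline above can be sketched in a few lines of Python. This is only an illustration, not HA's actual implementation: the function name and parameters are hypothetical, and it assumes that each delay is measured from the previous attempt and that the delay stays at 8 minutes once reached, which is the reading that matches the 30-minute total.

```python
from itertools import accumulate


def restart_schedule(max_restart_count=5, delays=(2, 4, 8)):
    """Return the minute at which each restart attempt takes place.

    Index 0 is the initial restart; the following entries are the retries.
    Assumption: delays are relative to the previous attempt and the last
    delay (8 minutes) is reused for all remaining retries.
    """
    gaps = [delays[min(i, len(delays) - 1)] for i in range(max_restart_count)]
    return [0] + list(accumulate(gaps))


print(restart_schedule())  # [0, 2, 6, 14, 22, 30]
```

With the default das.maxvmrestartcount of 5, the last retry lands 30 minutes after the first restart attempt.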
Figure 12: High Availability restart timeline
Split-Brain
When creating your design, make sure you understand the isolation response setting. For instance
when using an iSCSI array or NFS based storage choosing Leave powered on as your default
isolation response might lead to a split-brain situation.
A split-brain situation can occur when the VMDK file lock times out. This could happen when the
iSCSI, FCoE or NFS network is also unavailable. In this case the virtual machine is restarted on
a different host while it is not powered off on the original host, because the selected isolation
response is Leave powered on. This could potentially leave vCenter in an inconsistent state, as
two VMs with the same UUID would be reported as running on both hosts. This would cause a
ping-pong effect where the VM would appear to live on ESX host 1 at one moment and on ESX
host 2 soon after.
VMware's engineers have recognized this as a potential risk and developed a solution for this
unwanted situation. (This is not well documented, but briefly explained by one of the engineers on the
VMTN Community forums: http://communities.vmware.com/message/1488426#1488426.)
In short, as of version 4.0 Update 2 ESX detects that the lock on the VMDK has been lost, raises a
question asking whether the virtual machine should be powered off, and auto-answers the question with yes.
However, you will only see this question if you directly connect to the ESX host. HA will generate an
event for this auto-answer though, which is viewable within vCenter. Below you can find a
screenshot of this question.
Figure 13: Virtual machine message
As stated above, as of ESX 4 update 2 the question will be auto-answered and the virtual machine
will be powered off to recover from the split brain scenario.
The question still remains: with iSCSI or NFS, should you power off virtual machines or leave them
powered on?
As described above, in earlier versions "Leave powered on" could lead to a split-brain scenario. You
would end up seeing virtual machines ping-ponging between hosts, as vCenter would not
know where a virtual machine resided while it was active in memory on two hosts. As of ESX 4.0 Update 2, this is
no longer the case and it should be safe to use Leave powered on.
Isolation Detection
We have explained what the options are to respond to an isolation event. However, we have not
extensively discussed how isolation is detected. This is one of the key mechanisms of HA. Isolation
detection is a mechanism that takes place on the host that is isolated. The remaining, non-isolated
hosts don't know if that host has failed completely or if it is isolated from the network; they only
know it is unavailable.
The mechanism is fairly straightforward though and works as earlier explained with heartbeats.
When a node receives no heartbeats from any of the other nodes for 13 seconds (default setting)
HA will ping the isolation address. Remember primary nodes send heartbeats to primaries and
secondaries, secondary nodes send heartbeats only to primaries.
The isolation address is the gateway specified for the Service Console network (or management
network on ESXi), but there is a possibility to specify one or multiple additional isolation addresses
with an advanced setting. This advanced setting is called das.isolationaddress and could be used
to reduce the chances of having a false positive. We recommend setting at least one additional
isolation address.
Figure 14: das.isolationaddress
When isolation has been confirmed, meaning no heartbeats have been received and HA was unable
to ping any of the isolation addresses, HA will execute the isolation response. This could be any of
the above-described options: power off, shut down or leave powered on.
If only one heartbeat is received or just a single isolation address can be pinged the isolation
response will not be triggered, which is exactly what you want.
The default value for failure detection (das.failuredetectiontime) is 15 seconds. In other words, the
failed or isolated host will be declared failed by the other hosts in the HA cluster on the fifteenth
second, and a restart will be initiated by the failover coordinator after one of the primaries has
verified that the failed or isolated host is unavailable by pinging the host on its management
network.
It should be noted that in the case of a dual management network setup both addresses will be
pinged and 1 second will need to be added to the timeline, meaning that the failover coordinator
will initiate the restart on the 17th second.
Let's stress that again: a restart will be initiated after one of the primary nodes has tried to ping all
of the management network addresses of the failed host.
Let's assume the isolation response is Power off. The isolation response Power off will be
triggered by the isolated host 1 second before the das.failuredetectiontime elapses. In other words, a
Power off will be initiated on the fourteenth second. A restart will be initiated on the sixteenth
second by the failover coordinator if the host has a single management network.
Does this mean that you can end up with your virtual machines being down and HA not restarting
them?
Yes, when the heartbeat returns between the 14th and 16th second the Power off might have
already been initiated. The restart however will not be initiated because the received heartbeat
indicates that the host is not isolated anymore.
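The timings discussed in this section can be collected in a small Python sketch. It is an illustration under stated assumptions rather than a description of HA internals: the function and key names are hypothetical, and the rule of one extra second per management network address is generalized from the two cases given here (sixteenth second for a single management network, seventeenth for a dual setup).

```python
def ha_timeline(failure_detection_time=15, management_networks=1):
    """Return the second at which each step occurs after heartbeats stop."""
    return {
        # The isolated host triggers its isolation response 1 second
        # before das.failuredetectiontime elapses.
        "isolation_response": failure_detection_time - 1,
        # The remaining hosts declare the host failed when it elapses.
        "declared_failed": failure_detection_time,
        # Assumption: one additional second per management address pinged.
        "restart_initiated": failure_detection_time + management_networks,
    }


single = ha_timeline()
dual = ha_timeline(management_networks=2)
print(single["isolation_response"], single["restart_initiated"])  # 14 16
print(dual["restart_initiated"])  # 17
```

The gap between the isolation response and the restart is the window in which a returning heartbeat can leave virtual machines powered off without a restart being initiated.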
Chapter 5
Adding Resiliency to HA (Network Redundancy)
In the previous chapter we have extensively
covered Isolation Detection that triggers the selected Isolation Response and the impact of a false
positive. The Isolation Response enables HA to restart virtual machines when Power off or Shut
down has been selected and the host is isolated from the network.
To increase resiliency of the heartbeat network (Service Console for ESX and Management
Network for ESXi) VMware introduced the concept of NIC teaming.
NIC teaming is the process of grouping together several physical nics into one single logical nic,
which can be used for network fault tolerance and load balancing.
Using this mechanism it is possible to add redundancy to the Management Network or Service
Console network to decrease the chances of a false positive. (This is of course also possible for other
Portgroups but that is not the topic of this book.) Another option is configuring a secondary
Management Network or Service Console network. VMware supports both of these configurations,
and each has its own pros and cons, which are listed in the sections below. To simplify the concepts
we used ESX as an example; however, these recommendations are also valid for ESXi. We have
included the vMotion (VMkernel) network in our examples as combining the Service Console and
the VMkernel is the most commonly used configuration and a VMware best practice.
Single Service Console with vmnics in Active/Standby Configuration
2 physical NICs
VLAN trunking
Recommended:
2 physical switches
The vSwitch should be configured as follows:
Each portgroup has a VLAN ID assigned and runs dedicated on its own physical NIC; only in the
case of a failure is it switched over to the standby NIC. We highly recommend setting failback to
No to avoid chances of a false positive, which can occur when a physical switch routes no traffic
during boot but the ports are reported as up. (NIC Teaming Tab)
Pros: Only 2 NICs in total are needed for the Service Console and VMkernel, especially useful in
Blade environments. This setup is also less complex.
Cons: Just a single active path for heartbeats.
The following diagram depicts the active/standby scenario:
Figure 16: Active-standby Service Console network layout
Secondary Management Network
3 physical NICs
VLAN trunking
Recommended:
2 physical switches
The vSwitch should be configured as follows:
vSwitch0 3 Physical NICs (vmnic0, vmnic1 and vmnic2)
3 Portgroups (Service Console, secondary Service Console and VMkernel)
The primary Service Console runs on vSwitch0 and is active on vmnic0, with a VLAN assigned on
either the physical switch port or the portgroup and is connected to the first physical switch. (We
recommend using a VLAN trunk for all network connections for consistency and flexibility.)
The secondary Service Console will be active on vmnic2 and connected to the second physical
switch.
The VMkernel is active on vmnic1 and standby on vmnic2.
Pros - Decreased chances of false alarms due to Spanning Tree problems, as the setup contains
two Service Consoles that are each connected to a single, separate physical switch. In addition, both Service
Consoles will be used for the heartbeat mechanism, which increases resiliency.
Cons - Need to set advanced settings. It is mandatory to set an additional isolation address
(das.isolationaddress2) in order for the secondary Service Console to verify network isolation via a
different route.
The following diagram depicts the secondary Service Console scenario:
Figure 17: Secondary management network
The question remains: which would we recommend? Both scenarios are fully supported and
provide a highly redundant environment either way. Redundancy for the Service Console or
Management Network is important for HA to function correctly and avoid false alarms about the
host being isolated from the network. We, however, recommend the first scenario. Redundant NICs
for your Service Console add a sufficient level of resilience without leading to an overly complex
environment.
Chapter 6
Admission Control
Admission Control is often misunderstood and disabled because of this. However, Admission
Control is a must when availability needs to be guaranteed, and isn't that the reason for enabling HA
in the first place?
What is HA Admission Control about? Why does HA contain Admission Control?
The Availability Guide, a.k.a. the HA bible, states the following:
vCenter Server uses Admission Control to ensure that sufficient resources are available in
a cluster to provide failover protection and to ensure that virtual machine resource
reservations are respected.
Admission Control guarantees capacity is available for an HA initiated failover by reserving
resources within a cluster. It calculates the capacity required for a failover based on available
resources. In other words if a host is placed into maintenance mode, or disconnected, it is taken out
of the equation. Available resources also means that the virtualization overhead has already been
subtracted from the total. To give an example: Service Console memory and VMkernel memory are
subtracted from the total amount of memory, which results in the available memory for the virtual
machines.
There is one gotcha with Admission Control that we want to bring to your attention before drilling
into the different policies.
When Admission Control is set to strict, VMware Distributed Power Management in no way will
violate availability constraints. This means that it will always ensure multiple hosts are up and
running. (For more info on how DPM calculates read Chapter 18)
When Admission Control was disabled and DPM was enabled in a pre-vSphere 4.1 environment you
could have ended up with all but one ESX host placed in sleep mode, which could lead to potential
issues when that particular host failed or resources were scarce as there would be no host available
to power-on your virtual machines. (KB: http://kb.vmware.com/kb/1007006)
With vSphere 4.1, however, if there are not enough resources to power on all virtual machines, DPM will be
asked to take hosts out of standby mode to make more resources available, and the virtual machines
can then get powered on by HA when those hosts are back online.
Below we have listed all three options currently available as the Admission Control Policy. Each
option has a different mechanism to ensure resources are available for a failover and each option
has its caveats.
more details on memory overhead per virtual machine configuration) The following example will
clarify what worst-case actually means.
Example - If virtual machine VM1 has 2GHz of CPU reserved and 1024MB of memory reserved
and virtual machine VM2 has 1GHz of CPU reserved and 2048MB of memory reserved, the slot
size for memory will be 2048MB (+memory overhead) and the slot size for CPU will be 2GHz. It is a
combination of the highest reservation of both virtual machines. Reservations defined at the
Resource Pool level, however, will not affect HA slot size calculations.
Basic design principle: Be really careful with reservations. If there's no need to have them on a per
virtual machine basis, don't configure them, especially when using Host Failures Cluster Tolerates.
If reservations are needed, resort to resource pool based reservations.
Now that we know the worst-case scenario is always taken into account when it comes to slot size
calculations, we will describe what dictates the amount of available slots per cluster.
We will need to know what the slot size for memory and CPU is first. Then we will divide the total
available CPU resources of a host by the CPU slot size and the total available memory resources of a
host by the memory slot size. This leaves us with a number of slots for both memory and CPU. The most
restrictive number, again the worst-case scenario, is the number of slots for this host. If you have 25
CPU slots but only 5 memory slots, the amount of available slots for this host will be 5, as HA
will always take the worst-case scenario into account to guarantee all virtual machines can be
powered on in case of a failure or isolation.
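The slot calculation described above can be written out as a short Python sketch. The function names are hypothetical, memory overhead is left out for simplicity, and the host capacity in the usage lines (50,000MHz and 10,240MB) is an invented figure chosen only to reproduce the 25 CPU slots versus 5 memory slots case.

```python
def slot_size(reservations):
    """Return (cpu_slot_mhz, mem_slot_mb) for (cpu_mhz, mem_mb) reservations.

    The slot size combines the highest CPU reservation and the highest
    memory reservation, which may come from different virtual machines.
    """
    cpu_slot = max(cpu for cpu, _ in reservations)
    mem_slot = max(mem for _, mem in reservations)
    return cpu_slot, mem_slot


def slots_per_host(host_cpu_mhz, host_mem_mb, cpu_slot_mhz, mem_slot_mb):
    """The most restrictive of the CPU and memory slot counts wins."""
    cpu_slots = host_cpu_mhz // cpu_slot_mhz
    mem_slots = host_mem_mb // mem_slot_mb
    return min(cpu_slots, mem_slots)


# VM1: 2GHz / 1024MB reserved, VM2: 1GHz / 2048MB reserved
cpu_slot, mem_slot = slot_size([(2000, 1024), (1000, 2048)])
print(cpu_slot, mem_slot)  # 2000 2048
print(slots_per_host(50000, 10240, cpu_slot, mem_slot))  # 5
```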
A question we receive a lot is: how do I know what my slot size is? The details around slot sizes
can be monitored on the HA section of the Cluster's summary tab by clicking the Advanced
Runtime Info line.
Figure 19: High Availability cluster summary tab
This will show the following screen that specifies the slot size and more useful details around the
amount of slots available.
As you can see using reservations on a per-VM basis can lead to very conservative consolidation
ratios. However, with vSphere this is something that is configurable. If you have just one virtual
machine with a really high reservation you can set the following advanced settings to lower the slot
size used for these calculations: das.slotCpuInMHz or das.slotMemInMB.
To avoid not being able to power on the virtual machine with high reservations the virtual machine
will take up multiple slots. When you are low on resources this could mean that you are not able to
power-on this high reservation virtual machine as resources may be fragmented throughout the
cluster instead of available on a single host. As of vSphere 4.1 HA will notify DRS that a power-on
attempt was unsuccessful and a request will be made to defragment the resources to accommodate
the remaining virtual machines that need to be powered on.
The following diagram depicts a scenario where a virtual machine spans multiple slots:
Notice that because the memory slot size has been manually set to 1024MB one of the virtual
machines (grouped with dotted lines) spans multiple slots due to a 4GB memory reservation. As
you might have noticed, none of the hosts has 4 slots left. Although in total there are enough slots
available, they are fragmented and HA will not be able to power-on this particular virtual machine
directly but will request DRS to defragment the resources to accommodate this virtual machine's
resource requirements.
Admission control does not take fragmentation of slots into account when slot sizes are manually
defined with advanced settings. It will take the number of slots this virtual machine will consume
into account by subtracting them from the total number of available slots, but it will not verify the
amount of available slots per host to ensure failover. As stated earlier though HA will request DRS,
as of vSphere 4.1, to defragment the resources. However, this is no guarantee for a successful
power-on attempt or slot availability.
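To make the fragmentation problem concrete, here is a minimal Python sketch. The function names and the free-slot counts per host are hypothetical; only the 1024MB slot size and the 4GB reservation come from the scenario above.

```python
import math


def slots_needed(mem_reservation_mb, mem_slot_mb):
    """Number of slots a virtual machine spans for its memory reservation."""
    return math.ceil(mem_reservation_mb / mem_slot_mb)


def placement_status(free_slots_per_host, needed):
    """Classify whether a virtual machine can be powered on directly."""
    if sum(free_slots_per_host) < needed:
        return "insufficient"  # not enough slots in the whole cluster
    if max(free_slots_per_host) >= needed:
        return "fits"  # at least one host can take the virtual machine
    return "fragmented"  # enough slots in total, but on no single host


needed = slots_needed(4096, 1024)
print(needed)  # 4
print(placement_status([3, 2, 3], needed))  # fragmented
```

Admission control, as described above, only performs the first check (total slots) when slot sizes are set manually; the per-host check is what fails in the fragmented case.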
Basic design principle: Avoid using advanced settings to decrease the slot size, as it could lead to
more down time and adds an extra layer of complexity. If there is a large discrepancy in size and
reservations are set, it might help to put similar sized virtual machines into their own cluster.
The sixth host is a brand new host that has just been bought and as prices of memory dropped
immensely the decision was made to buy 32GB instead of 16GB.
The cluster contains a virtual machine that has 1 vCPU and 4GB of memory. A 1024MB memory
reservation has been defined on this virtual machine. As explained earlier a reservation will dictate
the slot size, which in this case leads to a memory slot size of 1024MB+memory overhead. For the
sake of simplicity we will however calculate with 1024MB.
When Admission Control is enabled and a number of host failures has been specified as the
Admission Control Policy, the amount of slots will be calculated per host and the cluster in total.
This will result in:
ESX01 - 16 Slots
ESX02 - 16 Slots
ESX03 - 16 Slots
ESX04 - 16 Slots
ESX05 - 16 Slots
ESX06 - 32 Slots
As Admission Control is enabled a worst-case scenario is taken into account. When a single host
failure has been specified, this means that the host with the largest number of slots will be taken
out of the equation. In other words for our cluster this would result in:
ESX01 + ESX02 + ESX03 + ESX04 + ESX05 = 80 slots available
Although you have doubled the amount of memory in one of your hosts you are still stuck with only
80 slots in total. As clearly demonstrated there is absolutely no point in buying additional memory
for a single host when your cluster is designed with Admission Control enabled and a number of
host failures as the Admission Control Policy has been selected.
In our example the memory slot size happened to be the most restrictive; the same principle
applies when the CPU slot size is most restrictive.
Basic design principle: When using Admission Control, balance your clusters and be conservative
with reservations as they lead to decreased consolidation ratios.
Now what would happen in the scenario above when the number of allowed host failures is set to 2?
In this case ESX06 is taken out of the equation and one of any of the remaining hosts in the cluster is
also taken out. It would result in 64 slots. This makes sense, doesn't it?
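The worst-case reasoning for this unbalanced cluster can be sketched in Python as follows. The function name is hypothetical; the slot counts are the ones listed for ESX01 through ESX06.

```python
def slots_after_failures(slots_per_host, host_failures):
    """Slots left when the hosts with the most slots are assumed to fail."""
    remaining = sorted(slots_per_host)
    if host_failures > 0:
        # Worst case: drop the largest hosts first.
        remaining = remaining[:-host_failures]
    return sum(remaining)


cluster = [16, 16, 16, 16, 16, 32]  # ESX01 .. ESX06
print(slots_after_failures(cluster, 1))  # 80
print(slots_after_failures(cluster, 2))  # 64
```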
Can you avoid large HA slot sizes due to reservations without resorting to advanced settings? That's
the question we get almost daily. The answer used to be NO if per virtual machine reservations
were required. HA uses reservations to calculate the slot size and, pre-vSphere, there was no way to tell HA to ignore
them without using advanced settings. With vSphere, the new Percentage method is
an alternative.
In other words:
((total amount of available resources - total reserved virtual machine resources) / total amount of
available resources) <= (percentage HA should reserve as spare capacity)
Here total reserved virtual machine resources include the default reservation of 256MHz and the
memory overhead of the virtual machine.
Let's use a diagram to make it a bit clearer:
Figure 23: Percentage of cluster resources reserved
Total cluster resources are 24GHz (CPU) and 96GB (MEM). This would lead to the following
calculations:
((24GHz - (2GHz + 1GHz + 256MHz + 4GHz)) / 24GHz) = 69 % available
((96GB - (1.1GB + 114MB + 626MB + 3.2GB)) / 96GB) = 94 % available
As you can see, the amount of memory differs from the diagram. Even if a reservation has been set,
the amount of memory overhead is added to the reservation. For both metrics HA Admission
Control will constantly check if the policy has been violated or not. When either of the two
thresholds is reached, memory or CPU, Admission Control will disallow powering on any additional
virtual machines. These thresholds can be monitored on the HA section of the Cluster's summary
tab.
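The percentage check can be expressed as a small Python sketch using the CPU figures from the calculation above. The function names are hypothetical, and the reserved percentages of 25 and 70 in the usage lines are invented values used only to show both outcomes.

```python
def available_percentage(total, reservations):
    """Percentage of total resources not claimed by reservations."""
    return (total - sum(reservations)) / total * 100


def policy_violated(total, reservations, reserved_percentage):
    """True when the available share has dropped to the reserved share."""
    return available_percentage(total, reservations) <= reserved_percentage


cpu_total_mhz = 24000
cpu_reserved_mhz = [2000, 1000, 256, 4000]  # 256MHz is the default reservation

pct = available_percentage(cpu_total_mhz, cpu_reserved_mhz)
print(int(pct))  # 69
print(policy_violated(cpu_total_mhz, cpu_reserved_mhz, 25))  # False
print(policy_violated(cpu_total_mhz, cpu_reserved_mhz, 70))  # True
```

The exact value is roughly 69.8, which the calculation above rounds down to 69. The same function applies to memory, with the memory overhead added to each reservation.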
If you have an unbalanced cluster (hosts with different sizes of CPU or memory resources) your
percentage should be equal or preferably larger than the percentage of resources provided by the
largest host. This way you ensure that all virtual machines residing on this host can be restarted in
case of a host failure.
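As a rough worked example of this rule of thumb, the following Python sketch computes the share of cluster resources provided by the largest host. The function name is hypothetical, and the cluster of five 16GB hosts plus one 32GB host is an assumption borrowed from the earlier unbalanced cluster example.

```python
def minimum_reserved_percentage(host_capacities):
    """Share of total cluster resources provided by the largest host."""
    return max(host_capacities) / sum(host_capacities) * 100


memory_gb = [16, 16, 16, 16, 16, 32]
print(round(minimum_reserved_percentage(memory_gb), 1))  # 28.6
```

In this assumed cluster the reserved percentage would need to be at least 29% to cover the loss of the largest host, and the same check should be repeated for CPU.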
As explained earlier, this Admission Control Policy does not use slots; as such, resources might be
fragmented throughout the cluster. Although as of vSphere 4.1 DRS is notified to rebalance the
cluster, if needed, to accommodate these virtual machines' resource requirements, a guarantee
cannot be given. We recommend ensuring you have at least one host with enough available capacity
to boot the largest virtual machine (reservation CPU/MEM). Also make sure you select the highest
restart priority for this virtual machine (of course depending on the SLA) to ensure it will be able to
boot.
The following diagram will make it more obvious. You have 5 hosts, each with roughly 80%
memory usage, and you have configured HA to reserve 20% of resources. A host fails and all virtual
machines will need to fail over. One of those virtual machines has a 4GB memory reservation. As you
can imagine, the first power-on attempt for this particular virtual machine will fail due to the fact
that none of the hosts has enough memory available to guarantee it.
Basic design principle: Although vSphere 4.1 will utilize DRS to try to accommodate the resource
requirements of this virtual machine, a guarantee cannot be given. Do the math; verify that any
single host has enough resources to power-on your largest virtual machine. Also take restart
priority into account for this/these virtual machine(s).
Failover Host
The third option one could choose is a designated Failover host. This is commonly referred to as a
hot standby. There is not much to tell about this mechanism, as it is a case of what you see is what
you get. When you designate a host as a failover host it will not participate in DRS. You will not be
able to power on virtual machines on this host! It is almost like it is in maintenance mode and it will
only be used in case a failover needs to occur.
Chapter 7
Impact of Admission Control Policy
As with any decision when architecting your environment there is an impact. This especially goes
for the Admission Control Policy. The first decision that will need to be made is if Admission Control
is enabled or not. We recommend enabling Admission Control, but carefully select the policy and
ensure it fits your or your customer's needs.
Pros:
Fully automated (When a host is added to a cluster, HA re-calculates how many slots are
available.)
Ensures failover by calculating slot sizes.
Cons:
Can be very conservative and inflexible when reservations are used as the largest reservation
dictates slot sizes.
Percentage based Admission Control is the latest addition to the HA Admission Control Policy. The
percentage based Admission Control is based on per VM reservation calculations instead of slots.
Pros:
Cons:
Manual calculations are needed when adding additional hosts to a cluster if the number of host failures
needs to remain unchanged.
Unbalanced clusters can be a problem when the chosen percentage is too low and resources are
fragmented, which means failover of a virtual machine can't be guaranteed as the reservation of this
virtual machine might not be available as resources on a single host.
Pros:
Cons:
Recommendations
We have been asked many times for our recommendation on Admission Control and it is difficult to
answer as each policy has its pros and cons. However, we generally recommend a Percentage based
Admission Control Policy. It is the most flexible policy as it uses the actual reservation per virtual
machine instead of taking a worst-case scenario approach like the number of host failures does.
However, the number of host failures policy guarantees the failover level under all circumstances.
Percentage based is less restrictive, but offers a weaker guarantee that HA will be able
to restart all virtual machines in all scenarios. With the added level of integration between HA and DRS we believe
a Percentage based Admission Control Policy will fit most environments.
Basic design principle: Do the math, and take customer requirements into account.
We recommend using a Percentage based Admission Control Policy, as it is the most
flexible policy.
Chapter 8
VM Monitoring
VM monitoring or VM level HA is an often overlooked but really powerful feature of HA. The reason
for this is most likely that it is disabled by default and relatively new compared to HA. We have
tried to gather all the info we could around VM Monitoring but it is a pretty straightforward
product that actually does what you expect it would do.
With vSphere 4.1 VMware also introduced VM and Application Monitoring. Application Monitoring
is a brand new feature that Application Developers can leverage to increase resiliency as shown in
the screenshot below.
Figure 26: VM and Application Monitoring
As of writing there was little information around Application Monitoring besides the fact that the
Guest SDK is used by application developers or partners, like for instance Symantec, to develop
solutions against the SDK. In the case of Symantec a simplified version of Veritas Cluster Server
(VCS) is used to enable application availability monitoring including of course responding to issues.
Note that it is not a multi-node clustering solution like VCS itself but a single node solution.
Symantec ApplicationHA as it is called is triggered to get the application up and running again by
restarting it. Symantec's ApplicationHA is aware of dependencies and knows in which order
services should be started or stopped. If however for whatever reason this fails for an "X" amount
(configurable option within ApplicationHA) of times HA will be asked to take action. This action will
be a restart of the virtual machine.
Although Application Monitoring is relatively new and there are only a few partners currently
exploring the capabilities it does add a whole new level of resiliency in our opinion. We have tested
ApplicationHA by Symantec and personally feel it is the missing link. It enables you as System
Admin to integrate your virtualization layer with your application layer. It ensures you as a System
Admin that services, which are protected, are restarted in the correct order and it avoids the
common pitfalls associated with restarts and maintenance.
When enabling VM/App Monitoring, the level of sensitivity can be configured. The default setting
should fit most situations. Low sensitivity basically means that the amount of allowed missed
heartbeats is higher and as such the chances of running into a false positive are lower. However if a
failure occurs and the sensitivity level is set to low the experienced downtime will be higher. When
quick action is required in case of a possible failure high sensitivity can be selected, and as
expected this is the opposite of low sensitivity.
Table 1: VM monitoring sensitivity
It is important to remember, though, that VM Monitoring does not infinitely reboot virtual machines,
to avoid a problem from repeating. By default, when a virtual machine has been rebooted three
times within an hour, no further attempts will be taken, unless of course the specified time has
elapsed. The following advanced settings can be set to change this default behavior:
High Availability advanced settings:
das.maxFailureWindow
das.iostatsInterval - Amount of seconds VM Monitoring will look back to see if any Storage
or Network I/O has taken place before deciding to reboot a virtual machine in the case no VMware
Tools heartbeats are received. The default value is 120 seconds.
Although the heartbeat produced by VMware Tools is reliable, VMware added a further verification
mechanism. To avoid false positives, VM Monitoring also monitors I/O activity of the virtual
machine. When heartbeats are not received AND no disk or network activity occurred over the last
120 seconds (per default) the virtual machine will be reset. This 120 second interval can be
modified of course by changing the advanced setting das.iostatsInterval as described above.
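The decision logic described in this section can be summarized in a short Python sketch. It is a simplification under stated assumptions: the function and parameter names are hypothetical, and the caller is assumed to track how many resets already happened within the failure window (one hour by default).

```python
def should_reset(heartbeat_missing, seconds_since_last_io, resets_in_window,
                 iostats_interval=120, max_resets=3):
    """Decide whether VM Monitoring would reset a virtual machine."""
    if not heartbeat_missing:
        return False  # VMware Tools heartbeats are still arriving
    if seconds_since_last_io < iostats_interval:
        return False  # recent disk or network I/O: likely a false positive
    # Stop after the maximum number of resets within the failure window.
    return resets_in_window < max_resets


print(should_reset(True, 300, 0))  # True
print(should_reset(True, 30, 0))   # False
print(should_reset(True, 300, 3))  # False
```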
Screenshots
The cool thing about VM Monitoring is the fact that it takes screenshots of the VM console. They are
taken right before a virtual machine is reset by VM Monitoring. This has been added as of vCenter
4.0. It is a very useful feature when a virtual machine freezes every once in a while for no
apparent reason. This screenshot can be used to debug the virtual machine operating system, if and
when needed, and is stored in the virtual machine's working directory.
Chapter 9
vSphere 4.1 HA and DRS Integration
HA integrates on multiple levels with DRS as of vSphere 4.1. It is a huge improvement and it is
something that we wanted to stress as it has changed the behavior and the reliability of HA.
Affinity Rules
VMware introduced VM-Host affinity rules with vSphere 4.1. VM-Host affinity rules are specified
within the DRS configuration and are typically used to bind a group of virtual machines to a group
of hosts.
There are two types of VM-Host affinity rules: must and should. If a rule is created of the type
must, HA will need to adhere to this rule when a failover needs to occur. If it is not
possible to perform a failover without violating the rule, the failover will not be performed. Affinity
rules are covered in-depth in the DRS section of this book.
Resource Fragmentation
As of vSphere 4.1 HA is closely integrated with DRS. When a failover occurs HA will first check if
there are resources available on that host for the failover. If for instance that particular virtual
machine has a very large reservation, and the Admission Control Policy is based on a percentage, it
could happen that resources are fragmented across multiple hosts. (For more details on this
scenario see Chapter 7.) HA, as of vSphere 4.1, will ask DRS to defragment the resources to
accommodate this virtual machine's resource requirements. Although HA will request a
defragmentation of resources, a guarantee cannot be given. As such, even with this additional
integration you should still be cautious when it comes to resource fragmentation.
DPM
In the past there was barely any integration between DRS/DPM and HA. Especially when DPM was
enabled this could lead to some weird behavior when resources were scarce and an HA failover
would need to happen. With vSphere 4.1 this has changed. In such cases, HA will use DRS to try to
adjust the cluster (for example, by bringing hosts out of standby mode or migrating virtual
machines to defragment the cluster resources) so that HA can perform the failovers.
Flattened Shares
Pre-vSphere 4.1 an issue could arise when custom shares had been set on a virtual machine. When
HA fails over a virtual machine it will power on the virtual machine in the Root Resource Pool.
However, the virtual machine's shares were scaled for its appropriate place in the resource pool
hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either too
many or too few resources relative to its entitlement.
A scenario where this can occur would be the following:
VM1 has 1000 shares and Resource Pool A has 2000 shares. Resource Pool A contains two virtual
machines, VM2 and VM3, and each is entitled to 50% of those 2000 shares. The following diagram depicts this scenario:
Figure 28: Flatten shares starting point
When the host fails, both VM2 and VM3 will end up on the same level as VM1. However, as a
custom shares value of 10,000 was specified on both VM2 and VM3, they will completely blow away
VM1 in times of contention. This is depicted in the following diagram:
This situation would persist until the next invocation of DRS re-parented the virtual machines to
their original Resource Pool. To address this issue, as of vSphere 4.1 DRS will flatten the virtual
machine's shares and limits before failover. This flattening process ensures that the virtual
machine will get the resources it would have received if it had failed over to the correct Resource
Pool. This scenario is depicted in the following diagram. Note that both VM2 and VM3 are placed
under the Root Resource Pool with a shares value of 1000.
Figure 30: Flatten shares after host failure before DRS invocation
Of course when DRS is invoked both VM2 and VM3 will be re-parented under Resource Pool A and
will receive the amount of shares they had originally assigned again.
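The arithmetic behind the flattening example can be sketched in a few lines of Python. This is a simplified model rather than VMware's implementation: the function name `flatten_shares` and the proportional scaling rule are assumptions chosen to reproduce the numbers in the scenario above (Resource Pool A with 2000 shares, two virtual machines with a custom value of 10,000 each).

```python
def flatten_shares(pool_shares, vm_shares):
    """Scale each VM's shares so that together they equal the parent pool's shares.

    Hypothetical helper: models flattening as simple proportional scaling.
    """
    total = sum(vm_shares.values())
    if total == 0:
        raise ValueError("resource pool has no shares to distribute")
    return {vm: pool_shares * shares / total for vm, shares in vm_shares.items()}


# Resource Pool A holds 2000 shares; VM2 and VM3 each carry a custom value of 10,000.
flattened = flatten_shares(2000, {"VM2": 10_000, "VM3": 10_000})
print(flattened)  # each VM ends up with 1000 shares at the root level
```

With these values VM2 and VM3 compete with VM1 (1000 shares) on equal terms instead of outweighing it ten to one.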
Chapter 10
Summarizing
The integration of HA with DRS has been vastly improved and so has HA in general. We hope
everyone sees the benefits of these improvements and of HA and VM and Application Monitoring in
general. We have tried to simplify some of the concepts to make them easier to understand; still, we
acknowledge that some concepts are difficult to grasp. We hope, though, that after reading this
section of the book everyone is confident enough to make the changes to HA needed to increase
resiliency and, essentially, the uptime of your environment, because that is what it is all about.
If there are any questions please do not hesitate to reach out to either of the authors.
Part 2
VMware Distributed Resource Scheduler
Chapter 11
What is VMware DRS?
VMware Distributed Resource Scheduler (DRS) is an infrastructure service run by VMware vCenter
Server (vCenter). DRS aggregates ESX host resources into clusters and automatically distributes
these resources to the virtual machines.
DRS monitors resource usage and continuously optimizes the virtual machine resource distribution
across ESX hosts.
DRS computes the resource entitlement for each virtual machine based on static resource allocation
settings and dynamic settings such as active usage and level of contention.
DRS attempts to satisfy the virtual machine resource entitlement with the resources available in the
cluster by leveraging vMotion. vMotion is used either to migrate virtual machines to alternative
ESX hosts with more available resources or to migrate virtual machines away to free up resources.
Because DRS is an automated solution and easy to configure, we recommend enabling DRS to
achieve higher consolidation ratios at low costs.
A DRS-enabled cluster is often referred to as a DRS cluster. In vSphere 4.1, a DRS cluster can
manage up to 32 hosts and 3000 VMs.
Initial placement
Load balancing - DRS distributes virtual machine workload across the ESX hosts inside the
cluster. DRS continuously monitors the active workload and the available resources, compares
the results to the ideal resource distribution, and performs or recommends virtual machine
migrations to ensure workloads receive the resources to which they are entitled, with the goal of
allocating resources to maximize workload performance.
Power management
Constraint correction
Requirements
In order for DRS to function correctly the environment must meet the following requirements:
For DRS to allow automatic load-balancing, vMotion is required. For initial placement though,
vMotion is not a requirement.
The cluster size, the combination of workload types, virtual machine management and the number
of virtual machines all affect the behavior and performance of vCenter and therefore influence
the performance of the DRS threads. This in turn can impact the performance of the virtual
machines, because load-balancing migration recommendations and resource entitlement
calculations become slow or insufficient.
For example, vCenter servers in Virtual Desktop Infrastructure environments experience more load
due to the number of virtual machines and the higher frequency of virtual machine power state
changes, which leads to DRS threads being invoked more often. (Table 3 of Chapter 14 lists the
events invoking DRS calculations.)
Separate workloads
In large environments it is recommended to separate VDI workloads and server workloads and
assign different clusters to each workload to reduce the DRS invocations. By isolating server
workloads from VDI workloads, only the VDI cluster experiences increased DRS invocations,
reducing the complexity and the number of calculations performed by DRS per cluster.
Amount of clusters
A lower number of virtual machines inside a cluster will reduce the number of load-balancing
calculations, therefore ensuring fast DRS performance. A lower number of virtual machines
generally results in a smaller number of hosts per cluster. However, the potential danger is creating
too many small clusters: having 200 x 3 host clusters instead of 100 x 6 host clusters could
drive up CPU utilization of vCenter, as each cluster will at least invoke the periodic load-balancing every 5 minutes.
It is believed that the current sweet spot ranges between 16 and 24 hosts
per cluster, offering sufficient options to load-balance the virtual machines across the hosts inside
the cluster without introducing too many DRS threads in vCenter.
Chapter 12
DRS Cluster Settings
When DRS is enabled on the cluster you need to select the automation level and set the migration
threshold. DRS settings can be modified when the cluster is in use and without disruption of
service. The following steps however will show you how to create a cluster and how to enable
DRS:
1. Select the Hosts & Clusters view.
2. Right-click your Datacenter in the Inventory tree and click New Cluster
3. Give the new cluster an appropriate name. We recommend at a minimum including the location
of the cluster and a sequence number, for example ams-hadrs-001.
4. In the Cluster Features section of the page, select Turn On VMware DRS and click Next
5. Verify the Automation Level is set to Fully Automated and select Next
6. Leave the Swapfile Policy set to default and click Next
7. Click Finish to complete the creation of the cluster
Figure 32: Enable DRS
Automation Level
The automation level determines the level of autonomy of DRS, ranging from generating placement
and load-balancing recommendations to automatically applying the generated recommendations.
Three automation levels exist:
Manual - DRS generates recommendations for initial placement of the virtual machines and if the
cluster becomes unbalanced, DRS suggests migration recommendations for the virtual machines.
The recommendations will not be automatically applied, but they will be applied if an administrator
accepts each one.
Partially automated -
Fully automated - DRS places the virtual machine on the most suitable host when it is
powered on. If the cluster is unbalanced, DRS migrates the virtual machine to a more suitable host.
Table 2: DRS automation level
Initial Placement
Initial placement occurs when a virtual machine is powered on or resumed. By default DRS selects
an ESX host based on the virtual machine resource entitlement. If the cluster is configured with the
manual automation level, DRS creates a prioritized list of recommended hosts for virtual machine
placement. This list is presented to the user to help the user select the appropriate host.
The DRS policy settings, discussed in a later chapter, are also taken into account.
administrators should check the recommendations after each DRS invocation to solve the cluster
imbalance. Besides inefficiency, it is possible that DRS rules are violated because the administrators
apply the recommendations infrequently. DRS rules are explained in section Rules of chapter 16.
The automation level of the cluster can be changed without disrupting virtual machines. It is easy
to change, so why not try fully automated for a while to get comfortable with it?
Chapter 13
Resource Management
As stated before, the primary goal of DRS is to ensure each virtual machine receives its entitled
resources and to do this it rebalances virtual machine workload across the hosts in the cluster. But
contrary to popular belief DRS is not concerned with performance per se. Instead DRS focuses on
whether each virtual machine inside the cluster or resource pool gets its specified resource
allocation. DRS examines the current demand and contention in the environment and uses the
resource allocation settings of the virtual machine to determine the resource entitlement for it.
By trying to ensure that all virtual machines receive enough resources to satisfy their resource
entitlement, DRS assumes that a virtual machine should not have any performance problems if it
receives those resources. In other words, the entitlements are adequate to ensure the virtual
machine's performance goals. To satisfy each virtual machine resource entitlement, DRS
dynamically moves a virtual machine across the cluster to optimize the cluster load balance. To do
this as effectively as possible, DRS computes the cluster imbalance and creates recommendations
for migrating virtual machines to solve the resource supply and resource demand imbalance.
To properly interact with the local resource scheduler of each ESX host, DRS converts cluster level
resource pool settings into host level settings. Let us look at the scheduler architecture.
The global scheduler supervises the entire cluster and the local scheduler manages the resource
allocation of the virtual machines on each host. Both DRS and the local ESX host scheduler use the
virtual machine's resource allocation settings to compute the resource entitlement of the virtual
machine. This entitlement is the allocation of resources that a virtual machine should receive.
DRS relies on host level scheduling to allocate resources. The global scheduler calculates resource
entitlement when virtual machines are placed inside a resource pool and sends these calculations
to the host. The host-level CPU and memory schedulers handle the resource entitlement of the
virtual machine.
Resource Entitlement
Every virtual machine has a resource entitlement for CPU and memory. This is how much of the
physical resources ESX thinks the virtual machine should get; it is the target ESX defines for how
much resource to give the virtual machine. By default this will be everything the virtual machine
wants, unless there are too few resources to meet the aggregated demand of all virtual machines (in
other words, contention), or an artificial limit is imposed. A virtual machine's resource
entitlement changes as the virtual machine runs.
A virtual machine's resource entitlement is based on a static entitlement and a dynamic entitlement,
which in turn consist of static settings and dynamic metrics. The static entitlement consists of
the resource allocation settings: reservations, shares, and limits. The dynamic entitlement consists of
dynamic metrics, such as the estimated active memory (also known as the working set size),
the CPU demand (an estimate of the amount of CPU the virtual machine would consume if no
contention existed) and the utilization or degree of contention of the host.
Resource allocation settings:
Shares - Shares specify the relative importance of the virtual machine. Shares are always
measured against other powered-on sibling virtual machines and resource pools on the
host.
Limit - Also referred to as MAX. A limit specifies an upper bound for resources that can be
allocated to a virtual machine. By default, if no limit is explicitly set, the limit will implicitly
be the amount installed as virtual hardware in the virtual machine.
Working set - Estimated amount of active memory of the virtual machine.
CPU demand - Estimated amount of CPU the virtual machine would consume if there were no
contention.
Idle memory tax - Mechanism by which ESX reallocates unreserved idle memory from
virtual machines.
It is very important to know that if a cluster or host is under-committed the virtual machine
resource entitlement will be the same as its resource demand. In other words, the virtual machine
will be allocated whatever it wants to consume within its configured limit. The virtual machine will
receive its CPU cycles and the memory pages issued by the virtual machine will be mapped on
machine pages (physical memory of the ESX host). A limit is the only exception; the ESX host can
still revert to swapping when a memory limit is set on the virtual machine because this introduces
an arbitrary cap not due to genuine contention.
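As a rough illustration of the under-committed case, the memory entitlement can be modeled as the demand capped by the effective limit. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, reservations are ignored, and the only rules encoded are the two given in the text (entitlement equals demand, and an unset limit defaults to the configured virtual hardware size).

```python
def undercommitted_memory_entitlement(demand_mb, configured_mb, limit_mb=None):
    """Entitlement of a VM on an under-committed host (simplified, hypothetical model)."""
    # Without an explicit limit, the configured virtual hardware size acts as the limit.
    effective_limit = configured_mb if limit_mb is None else min(limit_mb, configured_mb)
    # With no contention the VM gets what it asks for, but never more than its limit.
    return min(demand_mb, effective_limit)


print(undercommitted_memory_entitlement(3000, 4096))        # demand fits: 3000
print(undercommitted_memory_entitlement(3000, 4096, 2048))  # capped by the limit: 2048
```

The second call shows why a limit is the exception: the virtual machine wants 3000 MB but is entitled to only 2048 MB even though the host has memory to spare.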
When a cluster is overcommitted, the cluster might experience more resource demand than its
current capacity; in that case, DRS and the VMkernel will distribute and allocate resources based on
the resource entitlement of each virtual machine
idle memory tax to the virtual machine and all the inactive memory can be reallocated to other
virtual machines during contention. The amount of shares determines which virtual machine has
priority over other virtual machines. ESX compares the shares value of each sibling virtual machine
and selects victims to confiscate memory from.
This is the memory entitlement of the virtual machine. If the virtual machine uses more than its
entitlement during contention, the excess memory is ballooned, compressed or swapped depending
on the free memory state of the ESX host. The host will keep on reclaiming memory until the virtual
machine resource usage is at or below its resource entitlement.
Chapter 14
Calculating DRS Recommendations
DRS takes several metrics into account when calculating migration recommendations to load
balance the cluster: the current resource demand of the virtual machines, host resource availability
and the applied high-level resource policies. The following section explores how DRS uses these
metrics to create a new and better placement for a virtual machine than its existing location,
while still satisfying all the requirements and constraints.
All recommendations generated by DRS that have not been applied are retired at the
next invocation of DRS. DRS might generate the exact same recommendation again if the imbalance is not
solved. The interval at which the DRS algorithm is invoked can be controlled through the vpxd
configuration file (vpxd.cfg) with the following option:
vpxd config file:
<config>
  <drm>
    <pollPeriodSec>300</pollPeriodSec>
  </drm>
</config>
The default frequency is 300 seconds, but can be set to anything in the range of 60 seconds to 3600
seconds. It is strongly discouraged to change the default value. A less frequent interval might
reduce the number of vMotions and therefore overhead but would risk leaving the cluster
imbalanced for a longer period of time. Shortening the interval will likely generate extra overhead,
for little added benefit.
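To check a planned value before editing the file, the fragment can be parsed and validated against the documented range. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `read_poll_period` and the inline fragment are illustrative, and the code reads a string rather than the real vpxd.cfg on a vCenter server.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment with the same element names as the option shown above.
VPXD_FRAGMENT = """
<config>
  <drm>
    <pollPeriodSec>300</pollPeriodSec>
  </drm>
</config>
"""


def read_poll_period(xml_text, default=300):
    """Return the DRS invocation interval in seconds, validating the 60-3600 range."""
    root = ET.fromstring(xml_text.strip())
    node = root.find("./drm/pollPeriodSec")
    if node is None or node.text is None:
        return default  # option absent: DRS uses the default of 300 seconds
    value = int(node.text.strip())
    if not 60 <= value <= 3600:
        raise ValueError(f"pollPeriodSec {value} is outside the range 60-3600")
    return value


print(read_poll_period(VPXD_FRAGMENT))
```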
MaxMovesPerHost
Adjusting the interval impacts the number of migrations DRS will recommend. There are limits to
how many migrations DRS will recommend per interval per ESX host. A limit is imposed because
there is no advantage in recommending migrations that cannot be completed before the next DRS
invocation. During the next re-evaluation cycle, virtual machine resource demand can have changed,
rendering the previous recommendations obsolete.
Please note that there is no limit on max moves per host for a host entering maintenance or standby
mode, but only a limit on the maximum number of moves per host for load balancing. This can, but
usually shouldn't, be changed by setting the DRS Advanced Option MaxMovesPerHost. The
default value of this parameter is 8.
In vSphere 4.1, the limit on moves per host is dynamic, based on how many moves DRS estimates
can be completed during one DRS evaluation interval.
The MaxMovesPerHost value adapts to the maximum number of concurrent vMotions per
host and the average migration time observed from previous migrations. These improvements
make DRS less conservative compared to that in vSphere 4.0 and allow DRS to reach a steady state
more quickly when a significant load imbalance in the cluster exists.
The MaxMovesPerHost parameter still exists, but can be exceeded by DRS. Because the
limit adapts to the DRS invocation frequency and the average time per
vMotion, there is no need to tweak the value.
Recommendation Calculation
To generate a migration recommendation, DRS executes a series of calculations and passes in which
it determines the level of cluster imbalance and which virtual machines it needs to migrate to solve
the imbalance.
Constraints Correction
Before DRS runs its load-balancing pass, it runs a pass to consider and correct constraints,
including:
Evacuating hosts that the user requested enter maintenance or standby mode.
Correcting Mandatory VM-Host affinity/anti-affinity rule violations.
Correcting VM/VM anti-affinity mode violations.
Correcting VM/VM affinity mode violations.
Correcting host resource overcommitment (rare, since DRS is controlling resources).
These constraints are respected during load-balancing. The constraints may cause imbalance, and
that imbalance may not be fixable due to these constraints. The imbalance info on the cluster
summary page informs the administrator if there was an unfixable imbalance.
VM-Host affinity rules are a special case. More details about VM-Host affinity rules can be found in
chapter 16. But a quick primer on VM-Host affinity rules should help you to understand the
Constraints Correction Pass better. vSphere 4.1 introduces VM-Host affinity/anti affinity rules in
addition to the VM-VM affinity (or anti-affinity) rules.
The VM-Host affinity (or anti-affinity) rules specify which virtual machines must or should run on a
group of ESX hosts. Two types of VM-Host affinity/anti-affinity rules exist: Must-rules and Should-rules.
Must-rules are mandatory rules for HA, DRS and DPM. The Should-rule is a preferential rule
for DRS and DPM, and both DRS and DPM use their best effort to apply the Should-rules.
The Should-rules are a special case in the DRS algorithm. The entire DRS algorithm (constraint
correction + load-balancing) is executed and DRS tries to place the virtual machines listed in the
Should-rules. DRS essentially treats the Should-rules as hard rules during this phase. If all virtual
machines can be placed without introducing violations or over-utilized hosts, the results are output.
If the Should-rules introduce constraint violations or over-utilization of hosts, the DRS algorithm is
run again with the Should-rules dropped (since they are best effort) and only the Must-rules in place.
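The two-pass handling of Should-rules described above can be sketched as follows. This is only an outline of the control flow: `run_with_should_rules` and `toy_solve` are hypothetical names, and the toy solver is a stand-in that simply fails when a rule pins a virtual machine to a host flagged as full. It does not reflect how DRS actually evaluates placements.

```python
def run_with_should_rules(solve, must_rules, should_rules):
    """Try Must- and Should-rules together; fall back to Must-rules only.

    solve(rules) must return (placement, clean), where clean is True when the
    pass produced no violations and no over-utilized hosts.
    """
    placement, clean = solve(list(must_rules) + list(should_rules))
    if clean:
        return placement, "must+should"
    # Should-rules are best effort: drop them and rerun with the Must-rules only.
    placement, _ = solve(list(must_rules))
    return placement, "must-only"


def toy_solve(rules):
    # Hypothetical stand-in for the DRS passes: each rule is a (vm, host) pair.
    full_hosts = {"esx02"}
    placement = {vm: host for vm, host in rules}
    clean = not any(host in full_hosts for host in placement.values())
    return placement, clean


result = run_with_should_rules(toy_solve, [("vm01", "esx01")], [("vm02", "esx02")])
print(result)  # the Should-rule targets a full host, so only the Must-rule is kept
```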
Imbalance Calculation
DRS needs to establish whether the cluster is imbalanced. It does this by comparing the "current host load
standard deviation" metric to the "target host load standard deviation". If the Current Host Load
Standard Deviation (CHLSD) exceeds the Target Host Load Standard Deviation (THLSD), the cluster
is considered imbalanced. To calculate the CHLSD and THLSD, DRS first needs to determine the load of
each host. It does this by computing the resource entitlement of each active virtual machine on
the host and summing the entitlements of all virtual machines on that host. This sum is divided by the
capacity of the host, and the resulting value is called the host's normalized entitlement:
Normalized entitlement = sum (VM entitlement) / (capacity of host)
The memory capacity of the host is lower than the amount of installed physical memory. The
capacity of the host is calculated by subtracting the VMkernel overhead, the Service Console
overhead and a 6% reservation from the installed memory. In a cluster with VMware High
Availability (HA) enabled and HA Admission Control enabled (default), DRS maintains excess
powered-on capacity to meet the High Availability settings. This information is displayed on the
resource allocation tab of the cluster in vCenter.
The outcome of sum (VM entitlement) / (capacity of host) becomes the load metric of the host.
The standard deviation of this value across all the hosts in the cluster is the CHLSD. Given
this value and the migration threshold, DRS computes the Target Host Load Standard Deviation
(THLSD).
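A small worked example may help to make the load metric concrete. The sketch below computes the normalized entitlement per host and the standard deviation across hosts. The host figures are invented for illustration, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which variant DRS uses.

```python
from statistics import pstdev


def normalized_entitlement(vm_entitlements, host_capacity):
    """Load metric of one host: sum (VM entitlement) / (capacity of host)."""
    return sum(vm_entitlements) / host_capacity


def current_host_load_std_dev(hosts):
    """CHLSD: standard deviation of the host load metric across all hosts."""
    loads = [normalized_entitlement(vms, capacity) for vms, capacity in hosts]
    return pstdev(loads)


# Illustrative cluster: (VM entitlements in MHz, host capacity in MHz).
hosts = [
    ([2000, 3000], 10000),  # load 0.5
    ([1000, 2000], 10000),  # load 0.3
    ([4000], 10000),        # load 0.4
]
chlsd = current_host_load_std_dev(hosts)
print(round(chlsd, 3))  # roughly 0.082
```

The cluster would be considered imbalanced whenever this value exceeds the THLSD derived from the migration threshold.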
Every migration recommendation from DRS has a priority level which indicates how beneficial the
migration is expected to be. The conservative migration threshold setting generates only
priority-one recommendations, which are mandatory recommendations. With the aggressive
migration threshold setting, the cluster will be less tolerant of imbalance and will also generate
priority-five recommendations, which are expected to produce only very modest improvements.
DRS procedure:
While (load imbalance metric > threshold) {
    move = GetBestMove();
    If no good migration is found:
        stop;
    Else:
        Add move to the list of recommendations;
        Update cluster to the state after the move is added;
}
While the cluster is imbalanced (Current host load standard deviation > Target host load standard
deviation), DRS selects a virtual machine to migrate based on specific criteria and simulates the
migration in the cluster. In this simulation, DRS computes the possible Current Host Load Standard
Deviation after the migration. If the CHLSD is still above the threshold it will repeat the procedure
but if this migration solves the imbalance, it will stop after adding it to the migration
recommendation list.
The GetBestMove procedure aims to find the virtual machine that will give the best improvement in
the cluster wide imbalance. The GetBestMove procedure consists of the following instructions:
GetBestMove procedure:
GetBestMove() {
    For each virtual machine v:
        For each host h that is not Source Host:
            If h is lightly loaded compared to Source Host:
                If Cost-Benefit and Risk Analysis accepted:
                    simulate move v to h;
                    measure new cluster-wide load imbalance metric as g;
    Return move v that gives least cluster-wide imbalance g;
}
This procedure tries to find the migration that will offer the best improvement. DRS cycles through
each DRS-enabled virtual machine and each host that is not the source host. Only hosts that are
using fewer resources than the source host are considered.
After the cost-benefit and risk analysis is completed and the results are accepted a migration of the
virtual machine to the host is simulated. DRS will measure the new cluster-wide load imbalance
metric. DRS does this for all the virtual machines and compares the result of all tried combinations
(VM<->Host) and returns the vMotion that results in the least cluster imbalance.
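The two procedures can be combined into a small runnable model. This is a sketch under simplifying assumptions, not the DRS implementation: the imbalance metric is the population standard deviation of host loads, the cost-benefit and risk analysis is reduced to a pluggable `accept` callback that accepts everything by default, and all names and figures are hypothetical.

```python
from statistics import pstdev


def host_loads(placement, capacity, demand):
    """Normalized load per host for a given VM-to-host placement."""
    used = {host: 0.0 for host in capacity}
    for vm, host in placement.items():
        used[host] += demand[vm]
    return {host: used[host] / capacity[host] for host in capacity}


def imbalance(placement, capacity, demand):
    return pstdev(list(host_loads(placement, capacity, demand).values()))


def get_best_move(placement, capacity, demand, accept=lambda vm, src, dst: True):
    """Return (imbalance, vm, destination) of the best simulated move, or None."""
    loads = host_loads(placement, capacity, demand)
    best = None
    for vm, src in placement.items():
        for dst in capacity:
            if dst == src or loads[dst] >= loads[src]:
                continue  # only hosts more lightly loaded than the source qualify
            if not accept(vm, src, dst):
                continue  # rejected by the cost-benefit and risk analysis
            trial = dict(placement)
            trial[vm] = dst  # simulate the migration
            g = imbalance(trial, capacity, demand)
            if best is None or g < best[0]:
                best = (g, vm, dst)
    return best


def recommend(placement, capacity, demand, threshold):
    placement = dict(placement)
    moves = []
    while True:
        current = imbalance(placement, capacity, demand)
        if current <= threshold:
            break
        best = get_best_move(placement, capacity, demand)
        if best is None or best[0] >= current:
            break  # no good migration found
        _, vm, dst = best
        moves.append((vm, placement[vm], dst))
        placement[vm] = dst  # update the cluster state after the move
    return moves, placement


capacity = {"esx1": 10.0, "esx2": 10.0}
demand = {"a": 4.0, "b": 4.0, "c": 2.0}
moves, final = recommend({"a": "esx1", "b": "esx1", "c": "esx1"}, capacity, demand, 0.15)
print(moves)
```

In this toy cluster a single move of virtual machine a from esx1 to esx2 brings the imbalance below the threshold, so the loop stops after one recommendation.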
Basic design principle: The number of clusters and virtual machines managed by vCenter influences
the number of calculations, which impacts the performance of vCenter. Take this into account when
sizing the vCenter server.
Cost - During migration vMotion tries to reserve 30% of a physical CPU on both the source and
destination host. A shadow virtual machine is created during the vMotion process on the
destination host; the memory consumption by this shadow virtual machine is also factored in to the
cost of the recommendation. At the end of the vMotion process the migrated virtual
machine has a short period of downtime in which a snapshot of the virtual machine is made and
the virtual machine is resumed on the destination host. This brief downtime is approximately one
second or less and is not disruptive to virtual machine connections.
NOTE
The term downtime needs some clarification. Downtime indicates the interruption of
service of the virtual machine. Downtime incurred during vMotion is usually measured in
milliseconds. But because there is an interruption of service, although negligible, it needs to
be factored in.
Benefit - Due to the migration of the virtual machine, resources are freed up on the source host and the
virtual machine itself receives more resources due to the availability of resources on its new host.
The migration of workload will result in a much more balanced cluster.
Risk - Risk accounts for the possibility of irregular loads. Irregular load indicates inconsistent and spiky
demand workloads.
The cost-benefit and risk analysis results in a resource gain, whether positive or negative. VMware
used the following chart to illustrate this resource gain.
Figure 36: Resource gain calculation
The X-axis of the chart displays the progress of time and the Y-axis shows the absolute positive or
negative gain of the virtual machine on both source and destination hosts. The resource gain is in
terms of the absolute units MHz or MB depending on the type being measured.
DRS uses historical data for this calculation. DRS starts with determining how much resources the
virtual machine is consuming, by using the metrics Host CPU: active and Host Memory: active. After
establishing the consumed resources, it will predict the amount of time of this workload; this is
called stable time. After stable time, DRS becomes conservative and it assumes that the virtual
machine will run at the worst possible load listed in the history (up to 60 minutes) until the next
DRS invocation time. Imagine what impact adjusting the invocation interval has on the analysis.
The net resource gain is calculated for each of the periods and weighted by the length of the period.
In the example shown, migration gain is lower as there is an extra migration cost. After migration,
there is a period where the gain is positive and when the stable period ends, the worst gain turns
out to be negative. The areas are added together and the sum is used to decide if the move should
be rejected. DRS will only recommend a migration if it has an acceptable result of the cost-benefit
and risk analysis.
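The weighting described above amounts to summing gain multiplied by duration over the periods. The sketch below shows that arithmetic with invented numbers. The period values, the function name and the rule that a move is accepted only when the total is positive are all assumptions made for illustration, since the text only states that the sum is used to decide whether the move is rejected.

```python
def weighted_net_gain(periods):
    """Sum of (gain x duration) over all periods; gain in MHz or MB, duration in seconds."""
    return sum(gain * duration for gain, duration in periods)


# Hypothetical periods within one 300-second DRS interval.
periods = [
    (-200, 30),  # during migration: the migration cost lowers the gain
    (500, 180),  # stable time: the move pays off
    (-100, 90),  # after stable time: worst-case load assumed until the next invocation
]
total = weighted_net_gain(periods)
accepted = total > 0  # assumed acceptance rule
print(total, accepted)  # 75000 True
```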
Basic design principle: Although DRS migrates busier virtual machines to gain the
most improvement of cluster balance, this does not justify the use of big virtual machines. Virtual
machines with smaller memory sizes or fewer virtual CPUs provide more placement
opportunities for DRS.
Virtual machines with larger memory sizes and/or more virtual CPUs add more constraints
to the selection and migration process. It is therefore recommended to configure the size
of the virtual machine to what it actually needs, preventing oversizing.
For each migration recommendation, the priority level is limited to the integral range priority 2 to
priority 5 (inclusive) and is calculated according to the following formula:
6 - ceil(LoadImbalanceMetric / 0.1 * sqrt (NumberOfHostsInCluster)).
Here, ceil (x) is the smallest integral value not less than x. LoadImbalanceMetric is the current host
load standard deviation shown on the cluster's summary page of the vSphere Client. For each host,
compute the load on the host as sum (expected VM loads) / (capacity of host). Then compute the
standard deviation of the host load metric across all hosts to determine LoadImbalanceMetric.
In other words, ceil rounds the value up to an integer (a whole number, like 1, 2, 3, etc.).
Let us use this formula in an example. According to the screenshot, the 3-host cluster has a current
host load standard deviation of 0.022. According to the formula, the calculation would be:
6 - ceil (0.022 / 0.1 * sqrt(3))
Figure 37: DRS summary
This would result in a priority level of 5 for the migration recommendation if the cluster was
imbalanced. We created a workflow diagram to help visualize the flow of the DRS imbalance
calculation process.
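The priority calculation is easy to reproduce. The sketch below evaluates the formula exactly as printed, reading the division and multiplication left to right (an assumption about the intended grouping), and clamps the result to the stated range of priority 2 to priority 5. The function name is hypothetical.

```python
import math


def priority_level(load_imbalance_metric, hosts_in_cluster):
    """6 - ceil(LoadImbalanceMetric / 0.1 * sqrt(NumberOfHostsInCluster)), limited to 2..5."""
    raw = 6 - math.ceil(load_imbalance_metric / 0.1 * math.sqrt(hosts_in_cluster))
    return max(2, min(5, raw))


# The 3-host cluster from the example, with a standard deviation of 0.022:
# 0.022 / 0.1 * sqrt(3) is about 0.38, ceil gives 1, and 6 - 1 = 5.
print(priority_level(0.022, 3))  # 5
```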
Chapter 15
Influence DRS Recommendations
Some DRS settings and features can influence the DRS migration recommendations. This chapter takes
a closer look at the various settings and the impact they can have on the DRS processes.
Level 1 (conservative)
When the conservative migration threshold level is selected, only mandatory moves (priority-one
recommendations) are executed. The DRS cluster will not invoke load-balancing migrations.
Mandatory moves are issued when:
It is possible that a mandatory move will cause a violation on another host. If this happens, DRS will
move virtual machines to fix the new violation. This scenario is possible when multiple rules exist
on the cluster. It is not uncommon to see several migrations to satisfy the configured DRS rules.
Level 3 (moderate)
The level 3 migration threshold is the default migration threshold when creating DRS clusters. The
moderate migration threshold applies priority-one, -two and -three recommendations,
which promise a good improvement in the cluster's load balance.
Level 5 (aggressive)
Level 5 is the right-most setting on the migration threshold slider and applies all five
priority levels of recommendations. Every recommendation which promises even a slight improvement
in the cluster's load balance is applied.
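The relationship between the threshold level and the priorities that get applied can be expressed as a simple filter. The sketch below assumes that a threshold of level N applies recommendations with priority 1 through N. This matches the three levels described above, but levels 2 and 4 are extrapolated, and the function name and sample recommendations are hypothetical.

```python
def recommendations_to_apply(recommendations, migration_threshold):
    """Keep the (description, priority) pairs allowed by the migration threshold."""
    if not 1 <= migration_threshold <= 5:
        raise ValueError("migration threshold must be between 1 and 5")
    return [rec for rec in recommendations if rec[1] <= migration_threshold]


pending = [
    ("evacuate host entering maintenance mode", 1),
    ("move VM07 to esx03", 3),
    ("move VM12 to esx02", 5),
]
print(recommendations_to_apply(pending, 1))  # conservative: mandatory moves only
print(recommendations_to_apply(pending, 3))  # moderate: priorities one to three
```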
A priority level 1 (five-star) recommendation should always be applied, but a list of several priority
level 5 recommendations could also collectively affect the cluster negatively if those recommendations are
not applied.
Although the cost benefit risk analysis takes unstable workloads into account, selecting an
aggressive migration threshold when hosting virtual machines with varying loads in a cluster can
lead to a higher possibility of wasted migrations. A moderate migration threshold is more suitable
in such a scenario. Aggressive thresholds, level 4 and 5 are considered suitable for clusters with
equal-sized hosts, relatively constant workload demands and little to few DRS.
The default moderate migration threshold provides sufficient balance without excessive migration
activity. It is typically aggressive enough to maintain workload balance across hosts without
creating excessive overhead caused by too-frequent migrations.
Basic design principle: Select a moderate migration threshold if the cluster hosts virtual
machines with varying workloads.
Rules
VMware vSphere 4.1 contains two types of affinity rules: Virtual Machine to Host (VM-Host) rules
and Virtual Machine to Virtual Machine (VM-VM) rules. A VM-Host affinity rule specifies the affinity
between a group of virtual machines and a group of ESX hosts inside the cluster, whereas a VM-VM
affinity rule only specifies the affinity between individual virtual machines.
Affinity rules can specify whether the virtual machines should stay together and run on specified hosts
(affinity rules) or whether they are not allowed to run on the same host (anti-affinity).
Components
Virtual machine DRS groups and ESX host DRS groups are quite self-explanatory, so let's dive into
the designations component straight away.
Designations
Two different types of VM-Host rules are available: a VM-Host affinity rule can either be a must
rule or a should rule. The must rule is a mandatory rule for HA, DRS and DPM; it forces the virtual
machines to run on the ESX hosts specified in the ESX host DRS Group.
The should rule is a preferential rule for DRS and DPM. DRS and DPM use their best effort to try to
confine the virtual machines to, or keep them off, the ESX hosts named in the rule, but DRS
and DPM can violate should rules if following them would compromise certain key operations. HA is
not aware of preferential rules because DRS will not communicate these rules to HA.
HA, DRS and DPM must take the mandatory rules into account when generating or executing
operations. HA, DRS and DPM will never take any action that results in the violation of mandatory
affinity rules. Because of this, mandatory rules place more constraints on VM mobility, making it
more difficult for DRS to balance load and enforce resource allocation policies. HA and DPM
operations are constrained as well, for example mandatory rules will:
Due to their limiting behavior, it is recommended to use mandatory rules sparingly and only for
specific cases, such as licensing requirements. Preferential rules can be used to meet availability
requirements such as separating virtual machines between blade enclosures.
DPM
DPM does not place an ESX host into standby mode if the result would violate a mandatory rule and
will power-on ESX hosts if these are needed to meet the requirements of the mandatory rules.
High Availability
Due to the DRS-HA integration in vSphere 4.1, HA respects only mandatory (must) rules. During an
ESX host failure event, HA uses an archived list of hosts provided by DRS and places the virtual
machines only on a compatible host, i.e. one of the hosts that are allowed by the mandatory rules.
HA is unaware of the preferential (should) rules, so HA might unknowingly violate the rule during
placement of virtual machines after an ESX failure, but the violation will be corrected by the next
DRS invocation.
Let us take a look at a configuration which is very likely to be widely implemented soon: the Oracle
must affinity rule.
1. Place all Oracle virtual machines in a Cluster VM DRS group. (VM01, VM03, VM11, VM20)
2. Place all Oracle licensed ESX hosts in a Cluster Host DRS Group (ESX01, ESX02, ESX09,
ESX10)
3. Select Must run on Host in Group
In this scenario, DRS never places, migrates, or recommends placement of a host-affined virtual
machine on a host that is not listed in the Cluster Host DRS Group (ESX01-ESX06 & ESX09-ESX14).
This means that DRS will never place the virtual machine on an unlicensed host: not
for maintenance mode, not for DPM power saving and not after an ESX host failure event.
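The effect of a must rule on placement can be sketched in a few lines of Python. This is an illustrative model only and not how DRS is implemented: the function name compatible_hosts, the variable names and the six-host cluster are assumptions made for the example, while the group members are taken from the numbered steps above.

```python
# Illustrative model of a VM-Host "must run on" rule; all names are hypothetical.
ORACLE_VMS = {"VM01", "VM03", "VM11", "VM20"}        # Cluster VM DRS group
ORACLE_HOSTS = {"ESX01", "ESX02", "ESX09", "ESX10"}  # Cluster Host DRS group


def compatible_hosts(vm, cluster_hosts):
    """Return the hosts on which `vm` may be placed under the must rule."""
    if vm in ORACLE_VMS:
        # A mandatory rule is never violated: only hosts in the host group qualify.
        return [host for host in cluster_hosts if host in ORACLE_HOSTS]
    # Virtual machines outside the VM group are not constrained by this rule.
    return list(cluster_hosts)


# Assumed cluster: four licensed hosts plus two unlicensed ones.
cluster = ["ESX01", "ESX02", "ESX03", "ESX09", "ESX10", "ESX11"]
print(compatible_hosts("VM01", cluster))  # only the licensed hosts
print(compatible_hosts("VM02", cluster))  # every host in the cluster
```

In this model an Oracle virtual machine has no compatible host at all when every licensed host is unavailable, which is exactly the loss of mobility that makes mandatory rules something to use sparingly.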
This virtual-machine-to-host affinity rule makes it possible to run Oracle inside big clusters without
having to license all the ESX hosts. Oracle licenses can create a constraint on the design. Normally
separate smaller clusters were deployed for Oracle database virtual machines, increasing both
OPEX and CAPEX of the environment. These new rules allow the Oracle virtual machines to run
inside the main cluster with other virtual machines without having to license all the ESX hosts
inside the cluster.
You can customize the automation level for individual virtual machines in a DRS cluster to override
the automation level set on the cluster. This might be necessary to meet certain availability or
business requirements.
There are five automation level modes:
Fully Automated
Partially Automated
Manual
Default (cluster automation level)
Disabled
If the automation level of a virtual machine is set to disabled, DRS does not migrate that virtual
machine or provide migration recommendations for it. By setting the automation mode of the
virtual machine to manual, maintenance mode is able to evacuate this virtual machine
automatically.
Chapter 16
Resource Pools and Controls
As we progress using virtualization, most administrators spend less time setting up environments
and more time on resource management of the virtual infrastructure. VMware introduced DRS,
clusters and resource pools to help simplify resource management. Clusters aggregate ESX host
capacity into one large pool and create an independent layer between the resource providers (ESX
hosts) and resource consumers (virtual machines).
This independent layer has several advantages; one of these advantages is subdividing the cluster
capacity into smaller resource pools. Resource pools do not "carve up" the physical resources of the
cluster into portions that can be used exclusively by their member virtual machines; instead they
guarantee, limit and prioritize the access of their member virtual machines to a certain amount of
cluster capacity. The cluster provides resources based on resource allocation controls set at the resource
pool level.
These resource allocation controls (reservations, limits and shares) are similar to virtual machine
resource allocation settings. But how do these settings work at resource pool level and what
impact do they have on the virtual machine workloads? Let us explore the construct called resource
pool a bit more.
When calculating the root resource pool, vCenter will exclude resources reserved for the
virtualization layer, such as the Service Console and VMkernel. The amount of resources required to
satisfy HA failover (assuming HA Admission Control is enabled) will be shown in the root resource
pool as reserved, whereas the amount of resources used by the Service Console and VMkernel will
not even show up in the capacity of the root resource pool.
Resource Pools
As stated before a VMware cluster allocates resources from hosts (resource providers) to virtual
machines (resource consumers). Resource pools are in between and are both resource providers
and consumers; a resource pool provides resources to virtual machines, but consumes resources
from the cluster.
Figure 41: Resource pools
Apart from the root resource pool, each resource pool has a parent resource pool. A resource pool
contains children, which can be other resource pools or virtual machines. In the example pictured
above, the root resource pool is parent of resource pool 1 and 2. Resource pool 1 is the child of the
root resource pool but also the parent of resource pool 3, vm3 and vm4.
NOTE
Placing virtual machines at the same level as resource pools is not a recommended configuration!
The maximum resource pool tree depth is 8, excluding the 4 resource pools created
internally on each ESX host. These internal resource pools are independent of the DRS resource
pools. To avoid complicated proportional share calculations and complex DRS resource entitlement
calculations, we advise not exceeding a resource pool depth of 2. The flatter the
resource pool tree, the easier it is to manage.
techniques and mechanisms to allocate resources according to the virtual machines resource
entitlement. Similar to virtual machines, resource pools have reservations, limit and shares
parameters for CPU and memory resources. Expandable reservation is the only setting that exists
on the resource pool level and not at the virtual machine level.
Dividing these values across the hosts is based on the amount of running active virtual machines,
their VM resource allocation settings and their current utilization. Once the parent resource
allocation settings are propagated to the host local RP tree, the local host CPU and memory
scheduler takes care of the actual resource allocation.
Limit - Also referred to as MAX (maximum). Limit specifies an upper bound for resources that can
be allocated to a virtual machine or resource pool.
Expandable reservation - This allows the resource pool, once it has reserved as much
capacity as defined in its own Reservation setting, to reserve even more. The extra reservation is
taken from the unreserved capacity in the parent of this resource pool.
Shares
Shares specify the priority for the virtual machine or resource pool relative to other resource pools
and/or virtual machines with the same parent in the resource hierarchy. There is no official term
for this (yet), so let us use the term sibling share level. The relative priority is
calculated across all siblings in relation to their sibling share level.
Figure 44: Parent-child relation
The key point is that share values can be compared directly only among siblings: the ratio of
the shares of VM1:VM2 tells which VM has higher priority, but the ratio of VM2:VM3 does not tell
which VM has higher priority.
Contrary to reservation and limits which are specified in absolute numbers, shares are relative to
the other virtual machines and resource pools. In consequence when more virtual machines
become active in the same hierarchical level, the relative share of the resources allocated to the
virtual machine will change.
When configuring shares you can select one of the three predefined settings; High, Normal or Low,
which specify share values with a 4:2:1 ratio, or select the Custom setting to specify a more
granular value.
Because shares determine relative priority on the same hierarchical level, the absolute values do
not matter: configuring resource pools 1 and 2 with share values of 10 and 20 respectively has the
same effect as configuring them with share values of 10000 and 20000. Use care when
selecting the custom setting, as a custom value can leave a single virtual machine owning a
disproportionately large portion of the shares, or in other words, priority.
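The claim that only the ratio between sibling share values matters can be checked with a small calculation. The sketch below is illustrative; the helper name sibling_fractions and the pool labels rp1 and rp2 are assumptions, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def sibling_fractions(shares):
    """Map each sibling to its fraction of the parent's resources under contention."""
    total = sum(shares.values())
    return {name: Fraction(value, total) for name, value in shares.items()}


small = sibling_fractions({"rp1": 10, "rp2": 20})
large = sibling_fractions({"rp1": 10000, "rp2": 20000})

print(small)           # rp1 gets 1/3 and rp2 gets 2/3
print(small == large)  # True: scaling every value by the same factor changes nothing
```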
The default behavior of shares is that they scale with the size of the virtual machine: a virtual
machine set to share level normal and configured with 1 vCPU and 1024 MB of memory will receive
1000 CPU shares and 10240 memory shares. The pre-defined share settings have the following
values:
Table 4: Share values
When a virtual machine is created, the amount of shares the virtual machine receives is based upon
the amount of configured vCPUs and default share level (low, normal and high); a virtual machine
set to share level normal receives 1000 shares per vCPU. This means that without changing any
share settings, a virtual machine that is configured with more vCPUs and memory is entitled to a
correspondingly larger amount of physical resources during contention. The assumption is that a
virtual machine with more vCPUs actually needs more CPU resources. In practice, this is not always
true.
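As a rough sketch of how the default share values scale with virtual machine size, the calculation can be written out as follows. The per-vCPU and per-MB values for the normal level come from the text; the high and low values are derived here from the 4:2:1 ratio mentioned earlier, and the function name default_shares is an assumption.

```python
# Normal-level values are stated in the text; high and low follow the 4:2:1 ratio.
SHARES_PER_VCPU = {"high": 2000, "normal": 1000, "low": 500}
SHARES_PER_MB = {"high": 20, "normal": 10, "low": 5}


def default_shares(vcpus, memory_mb, level="normal"):
    """Return (cpu_shares, memory_shares) for a VM at a predefined share level."""
    return vcpus * SHARES_PER_VCPU[level], memory_mb * SHARES_PER_MB[level]


print(default_shares(1, 1024))       # (1000, 10240)
print(default_shares(4, 16 * 1024))  # (4000, 163840)
```

The second call uses 4 vCPUs and 16GB, which reproduces the 4000 CPU shares and 163840 memory shares quoted for a default resource pool.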
NOTE
This method of assigning shares based on the amount of vCPU implicitly indicates (and
explicitly controls) that a virtual machine with multiple vCPUs is more important and has a
higher priority than a virtual machine with a lower amount of vCPUs. This situation does
not always reflect the business side requirements; the virtual machine with fewer vCPUs
might contain an application which is more important to the business than the resource
intensive application running on the virtual machine with multiple vCPUs. The level of
importance to the business does not automatically equal bigger virtual machines, and vice
versa.
By default a resource pool is configured similarly to a virtual machine with 4 vCPUs and 16GB set at
normal level, i.e. 4000 CPU shares and 163840 memory shares. Caution must be taken when placing
virtual machines on the same hierarchical level as resource pools, as virtual machines can end up
with a higher priority than intended.
Placing virtual machines at the same level as resource pools is something we see very often,
sometimes by design and sometimes by accident (and it is not recommended). During a manual vMotion,
the administrator needs to select a resource pool, and by default the root resource pool is selected.
If this step is overlooked by the administrator, the virtual machine ends up in the root resource
pool. In this scenario the virtual machine can be denied resources by its siblings, or can itself
deny resources to sibling virtual machines or resource pools.
vSphere 4.1 introduces a mechanism called flattened shares, explained in chapter 10, vSphere 4.1
HA and DRS Integration. Please be aware that this only occurs when HA fails over the virtual
machine.
Resource pool share settings do not influence share settings on virtual machines. When creating a
virtual machine inside a resource pool with a share value set to High, the share level on the virtual
machine is still set to the default value of normal. This is because these VM-level shares indicate the
relative importance of the virtual machine within its hierarchical level i.e. to its siblings, not to
virtual machines in other resource pools.
So how do resource pool shares affect virtual machine workloads? As mentioned before, DRS
mirrors the resource pool hierarchy to each host, and each ESX host's local CPU and memory
schedulers retrieve the resource allocation settings of each active virtual machine from this tree.
DRS divides the resource pool cluster level share amount across the mirrored local trees based on
the amount of running active virtual machines, their VM-level shares amounts and their current
utilization. Once the resource allocation settings are propagated to the host local RP tree, the local
host CPU and memory scheduler takes care of the actual resource allocation.
For the sake of simplicity, let's forget the previous example of nested resource pools and use a
2-host cluster. In this cluster a resource pool is created and the default resource allocation settings
are used. By default a resource pool is configured similarly to a 4 vCPU and 16GB virtual machine at
normal level, i.e. 4000 CPU shares and 163840 memory shares. (This example uses memory shares;
the same applies to CPU shares.) Four virtual machines running inside the resource pool are
configured as follows:
Table 5: Share configuration scenario
Let us assume that all the virtual machines are running equal and stable workloads. DRS will
balance the virtual machines across both hosts and create the following resource pool mapping:
Figure 45: Resource pool mapping
The amount of shares specified on virtual machines vm1, vm2 and vm3 totals 20480, which equals
half of the amount of total configured shares inside the resource pool. In this example DRS decides
to place the virtual machines on ESX host ESX1 and therefore assigns half of the share value of the
resource pool to the resource pool 1 mirrored in the host level resource pool tree.
At this point resource pool 2 is created, but with a different share configuration: resource pool 2 is
configured with double the amount of shares of resource pool 1, i.e. 327680. Inside the resource
pool, the virtual machines are identical to the virtual machines in resource pool 1.
Figure 46: Share ratio result
The local resource pool tree of ESX1 is updated with resource pool 2; this resource pool is
configured with twice the amount of shares of resource pool 1. By introducing 327680 shares, the
total amount of shares active on the host is increased to 491520 (163840+327680). Resource pool 2
owns 327680 of the total of 491520, which equals roughly 66.7 percent.
Due to the 67%-33% ratio at resource pool level, the local resource scheduler will allocate more
resources to resource pool 2. The resources allocated to resource pool 2 are subdivided between
the virtual machines based on their hierarchical level (sibling share level). This means that virtual
machine VM5 is entitled to get 50% of the resource pool's resources during contention, which is
basically 33% of the ESX host's resources (1/2 of a pool which is 2/3 of the host = 2/6 = 1/3 of the
host). The %shares column in the resource allocation tab of the cluster displays the amount of
shares each object gets per parent share level.
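The arithmetic of this example can be verified with a short calculation. The share values and the 50% sibling share level of VM5 are taken from the text; the variable names are assumptions made for this sketch.

```python
from fractions import Fraction

# Memory shares of the two resource pools mirrored on ESX1 (values from the text).
pool_shares = {"rp1": 163840, "rp2": 327680}
total_shares = sum(pool_shares.values())

# Fraction of the host that resource pool 2 is entitled to during contention.
rp2_of_host = Fraction(pool_shares["rp2"], total_shares)

# VM5 holds half of the shares among its siblings inside resource pool 2.
vm5_of_pool = Fraction(1, 2)
vm5_of_host = rp2_of_host * vm5_of_pool

print(total_shares)  # 491520
print(rp2_of_host)   # 2/3
print(vm5_of_host)   # 1/3
```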
To emphasize, this example used virtual machines with equal and stable workloads. During normal
conditions, some virtual machines have a higher utilization than others. As the working set is part
of the resource entitlement calculation, active workload is accounted for when dividing the
resource pool shares and resources between hosts and local resource pool trees. Because resource
usage changes over time, this division is recalculated every time DRS is invoked, and the
distribution of resources keeps changing as appropriate.
Reservation
A reservation is a guaranteed lower bound of resources that is reserved for the resource pool or
virtual machine to ensure availability of physical resources at all times, even during resource
contention. Reservations can be set at resource pool level and virtual machine level. When setting a
reservation at resource pool level, you will guarantee a certain amount of resources for all its
children collectively, though they may contend with one another. To exclusively guarantee
resources for a specific virtual machine a VM-level memory reservation has to be set.
We recommend implementing this workaround very sparingly as creating a resource pool for each
VM creates a lot of administrative overhead and makes the host and cluster view a very unpleasant
environment to work in.
reserve it. Therefore shares on virtual machine level become much more important than actually
perceived by many.
NOTE
We must stop treating shares as the redheaded stepchild of the resource allocation settings
family and realize how important share ratios really are. Besides shares, active memory
usage and the configured memory size will impact the resource entitlement and thus
performance of the virtual machine.
Now setting a memory reservation on a resource pool level has its own weaknesses. As stated
before, resource pool reservations do not flow to virtual machines, so they will not influence HA slot
sizes. Using only resource pool reservations and not virtual machine reservations can lead to
(temporary) performance loss if a host failover occurs. When virtual machines are restarted by HA,
they are not restarted in the correct resource pool but in the root resource pool, which can lead to
(temporary) starvation. vCenter 4.1 uses the flattened shares mechanism when restarting virtual
machines in the root resource pool. The next DRS run moves the virtual machine from the root
resource pool back to the correct resource pool, but until that point in time the
virtual machine needs to do without any memory reservations.
By default all vMotions initiated by DRS are of low priority. The distinction between high and low
priority vMotions is that a low priority vMotion tries to reserve a percentage of a core, but will
proceed regardless of how much it actually received. High priority vMotions are designed to fail if
they cannot reserve sufficient resources.
In the picture above, resource pool 1 has a 10GB memory reservation configured, but the total
amount of configured memory of its virtual machines (16GB) is greater than the reservation.
During contention, the amount of reserved memory of resource pool 1 is divided between its virtual
machines based on their resource entitlements. In the example above, the 6GB of memory
exceeding the reservation is allocated from the unreserved memory pool based on the
proportional share level of resource pool 1 and resource pool 2. The maximum amount of memory
that resource pool 1 can allocate is the combined configured memory of all its virtual
machines plus their memory overhead reservations.
Static overhead
Static overhead is the minimum overhead that is required for the virtual machine startup. DRS and
the VMkernel use this metric for Admission Control and vMotion calculations. The destination ESX
host must be able to back the virtual machine reservation and the static overhead, otherwise the
vMotion will fail.
Dynamic overhead
Once the virtual machine has started up, the virtual machine monitor (VMM) can request additional
memory space. The VMM will request the space, but the VMkernel is not required to supply it. If the
VMM does not obtain the extra memory space, the virtual machine will continue to function but this
could lead to performance degradation. The VMkernel treats virtual machine overhead reservation
the same as VM-level memory reservation and it will not reclaim this memory once it has been
used.
Admission control
As mentioned earlier, DRS and the VMkernel will not allow a virtual machine to be powered on if
reservations cannot be guaranteed. The effective memory reservation for a virtual
machine is the user-configured memory reservation (VM-level reservation) plus the overhead
reservation. Consequently, during the design phase of a resource pool, the memory overhead of its
virtual machines must be included in the calculation of the memory reservation specified on the
resource pool. The behavior of dynamic overhead must also be taken into account. Table 3.2 of the
vSphere Resource Management guide lists the overhead memory of virtual machines. The table
listed below is an excerpt from the Resource Management guide and lists the most common ones.
Table 6: Virtual machine memory overhead (in MB)
Please be aware of the fact that memory overheads can grow with each new release of ESX, so keep
this in mind when upgrading to a new version. Verify the documentation of the virtual machine
memory overhead and check the specified memory reservation on the resource pool.
Expandable Reservation
The setting expandable reservation only exists on a resource pool level, although it is used to
allocate resources for virtual machine level reservations, including virtual machine memory
overhead reservation.
Expandable reservation is used by Admission Control. If the expandable reservation setting is
selected, Admission Control considers the capacity in the ancestor resource pool tree as available
for satisfying VM-level reservations. If the expandable reservation is not selected, Admission
Control considers only the resources available of the resource pool to satisfy the reservation.
A simple way to think of it is this: Add the VM-level reservations, plus implicit overhead
reservations, of every VM running in the resource pool. That sum cannot be greater than the
resource-pool level reservation, unless Expandable is checked.
Note that this has nothing to do with how much memory can be configured in or used by the VMs in
the resource pool; it's only about what they can reserve.
Figure 48: Expandable reservation workflow
When a virtual machine is powered on, it will search for unreserved capacity through the resource
pool tree. It will only consider unreserved resources from its ancestors but not siblings. Ancestors
are direct parents of the resource pool, parents of the parents, etc.
The search for unreserved capacity stops when a resource pool is configured without the
expandable reservation selected or when a limit is set. When the requested capacity would allocate
more resources than the limit of the parent resource pool specifies, the request is rejected, and the
virtual machine will not be started.
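The traversal described above can be modelled with a small, deliberately simplified sketch. It is not the actual Admission Control implementation: the ResourcePool class, its field names and the can_power_on function are assumptions, memory is the only resource considered, and a limit is modelled as a cap on what a pool may have reserved in total.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourcePool:
    name: str
    reservation_mb: int                      # the pool's own Reservation setting
    reserved_mb: int = 0                     # part of it already claimed by children
    expandable: bool = False                 # Expandable Reservation checkbox
    limit_mb: Optional[int] = None           # None means unlimited
    parent: Optional["ResourcePool"] = None  # None for the root resource pool

    def unreserved_mb(self) -> int:
        return max(0, self.reservation_mb - self.reserved_mb)


def can_power_on(pool, vm_reservation_mb, overhead_mb):
    """Check whether the effective reservation can be satisfied from pool or ancestors."""
    needed = vm_reservation_mb + overhead_mb  # VM-level reservation plus overhead
    current = pool
    while current is not None:
        # A limit on the pool rejects the request outright.
        if current.limit_mb is not None and current.reserved_mb + needed > current.limit_mb:
            return False
        needed -= min(needed, current.unreserved_mb())
        if needed == 0:
            return True
        # Only an expandable pool may continue the search in its parent.
        if not current.expandable:
            return False
        current = current.parent
    return False  # ran out of ancestors with capacity still missing


root = ResourcePool("root", reservation_mb=8192, reserved_mb=4096)
rp1 = ResourcePool("rp1", reservation_mb=2048, reserved_mb=1536, expandable=True, parent=root)
print(can_power_on(rp1, vm_reservation_mb=1024, overhead_mb=128))
```

With these hypothetical numbers the request needs 1152 MB, of which resource pool 1 can supply 512 MB itself; the remainder is found in the root resource pool only because resource pool 1 is expandable. Siblings are never consulted, since the loop follows parent links only.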
Figure 49: Traversing expandable reservation
Limits
A limit is an artificial cap on the usage of a resource. A limit is the complete opposite of a
reservation: where the reservation is a guaranteed lower bound of resources, the limit is a
guaranteed upper bound of resources. A virtual machine or resource pool is prevented from using more
physical resources than its configured limit. Even when there are plenty of resources available, the
limit will prohibit the virtual machine or resource pool from making use of these available
resources. Setting limits on resource pool or virtual machine level can affect the performance of the
virtual machines, but limits can negatively affect the rest of the environment as well.
The VMkernel CPU scheduler behaves differently from the VMkernel memory scheduler when it
comes to limiting physical resources. Let us take a quick look at the differences between the
VMkernel CPU and memory scheduler when setting limits.
Memory Scheduler
If a memory limit is set on the virtual machine, the virtual machine is not allowed to consume more
physical memory than its configured limit. If a virtual machine is configured with 4GB of memory,
but the administrator sets the memory limit to 1 GB, the virtual machine is able to consume 1 GB of
physical memory; the other 3 GB of memory space will be supplied by ballooning or the swap file. A 4
GB swap file is created for the virtual machine to ensure availability of memory space for the virtual
machine.
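The numbers in this example can be reproduced with a small helper. This is a sketch under stated assumptions: the function name memory_backing and its return keys are hypothetical, and the swap file is sized as configured memory minus the VM-level reservation, which matches the 4 GB swap file in the text when no reservation is set.

```python
def memory_backing(configured_mb, limit_mb=None, reservation_mb=0):
    """Split a VM's configured memory into physical and non-physical backing."""
    # Without a limit, the VM may be backed by physical memory up to its configured size.
    physical_max = configured_mb if limit_mb is None else min(configured_mb, limit_mb)
    return {
        "physical_max_mb": physical_max,
        # Memory above the limit has to come from ballooning or the swap file.
        "balloon_or_swap_mb": configured_mb - physical_max,
        # Assumption: swap file size is configured memory minus the reservation.
        "swap_file_mb": configured_mb - reservation_mb,
    }


print(memory_backing(configured_mb=4096, limit_mb=1024))
```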
The Guest OS inside the virtual machine is unaware of the specified limit. As such, setting limits can
have an impact on the performance of the application inside the virtual machine, even if it does not
always consume more memory than the limit threshold. When modern operating systems boot,
one of the first things they do is check how much RAM they have available and then tune their
caching algorithms and memory management accordingly. Applications such as SQL, Oracle and
JVMs do much the same thing. As stated before, the limit is not exposed to the operating system
itself, and as such the application will suffer and so will the service provided to the user. The funny
thing about this is that although the application might request everything it can, it might not even
need it. In that case, which is more common than we think, it is better to decrease provisioned
memory than to apply a limit, as the limit will impose an avoidable and unwanted performance impact
in most cases while lowering the memory most likely will not.
DRS will divide the limit between the ESX hosts based on the amount of active virtual machines
inside the resource pool and the aggregated resource entitlement of the virtual machines. DRS
calculates the amount of maximum allowed resources for each ESX host and pushes the resource
allocation information to each ESX host inside the cluster.
The VMkernel memory scheduler only knows of the part of the DRS resource tree that is relevant to
its own local node. The memory scheduler will divide the amount of resources between the virtual
machines belonging to the same resource pool. The memory scheduler uses the same mechanism as
the CPU scheduler and assigns a limit based on the resource entitlement of the virtual machine.
Because the limit is set on the resource pool, the memory scheduler is free to allocate resources
within the pool as required. This results in dynamically limiting the availability of physical
resources, as the limit for a virtual machine is related to the resource utilization of its sibling
virtual machines in the resource pool.
In the scenario when other virtual machines are dormant, an active virtual machine can possibly
allocate up to its configured memory, contrary to a per-VM limit which is always active regardless
of the resource utilization of other virtual machines. If it is necessary to set limits, using a resource
pool level limit is preferred over per-VM limits.
Figure 52: Dividing of resource pool limit
machine reaches the limit of the local resource pool tree, the large pages (2MB) will be broken into
small pages (4KB) to allow reclamation.
Because the limit obstructs the virtual machine from using physical resources above the specified
limit, the VMkernel will back the remaining memory request by ballooning, compression-cache or
the swap file.
Ballooning, compressing and swapping virtual machine memory has an impact on the ESX host and
possibly the SAN infrastructure. The VMkernel needs to use resources to communicate and run the
balloon driver and needs to store the memory pages inside a SAN-based swap file, consuming
bandwidth and creating additional load on the storage processors. Some administrators might
ignore the additional load created by swap and balloon but if the virtual machine is sized properly
to reflect its workload or SLA, these overhead situations will not occur.
Chapter 17
Distributed Power Management
With ESX 3.5, VMware introduced Distributed Power Management (DPM). DPM provides power
savings by dynamically sizing the cluster capacity to match the virtual machine resource demand.
DPM will dynamically consolidate virtual machines onto fewer ESX hosts and power down excess
ESX hosts during periods of low resource utilization. If the resource demand increases, ESX hosts
are powered back on and the virtual machines are redistributed among all available ESX hosts in
the ESX cluster.
The goal of DPM is to keep the cluster utilization within a specific DPM target range, but at the same
time take various cluster settings, virtual machines settings and requirements into account when
generating DPM recommendations. After DPM has determined the maximum number of hosts
needed to handle the resource demand of the virtual machines, it leverages the DRS algorithm to
distribute the virtual machines across the number of hosts before placing the target ESX hosts into
standby mode.
Enable DPM
DPM is disabled by default and can be enabled by selecting the power management modes Manual
or Automatic. Due to DPM using DRS to migrate the virtual machines off the ESX hosts, DRS must be
enabled first before DPM can be enabled on the cluster.
Figure 53: DPM settings
DPM can be set to run in either manual or automated mode for the cluster. All hosts inside the
cluster will inherit the default cluster setting, but in addition a per-host setting can be set as well.
This setting overrides the cluster default. Per-host settings are only meaningful when DPM is
enabled. A use case for overriding the default cluster setting is when VMware Fault Tolerance
protected virtual machines are running inside the cluster. See section DRS, DPM and VMware Fault
Tolerance of chapter 18 for more info about the constraints Fault Tolerance introduces to DPM.
Each power management mode operates differently:
Power Management State and DPM behavior
The power management mode setting, manual or automatic, can differ from the DRS automation
setting, and the DPM and DRS thresholds can differ from each other as well. The combination of both
mechanisms affects the role of the user and the automatic application of the recommendations
generated by DRS and DPM.
Table 7: Effect of combining DPM and DRS
Templates
While DPM leverages DRS to migrate all active virtual machines on the host before powering down
the host, the registered templates are not moved. This means that templates registered on the ESX
host placed in standby mode will not be accessible as long as the host is in standby mode.
Setting the DPM threshold to the most conservative level will result in DPM generating only priority
level 1 recommendations, according to the accompanying text below the threshold slider:
DPM will not generate power-off recommendations; this effectively means that the automatic DPM
power saving mode is disabled. The user is able to place the server in the standby mode manually,
but DPM will only power-on ESX hosts when the cluster fails to meet certain HA or custom capacity
requirements or constraints.
The DemandCapacityRatioTarget is the utilization target of the ESX host; by default this is set at
63%. The DemandCapacityRatioToleranceHost specifies the tolerance around the utilization target
for each host; by default this is set at 18%. This means that DPM will try to keep the ESX host
resource utilization centered at the 63% sweet spot, plus or minus 18 percent, resulting in a range
between 45 and 81 percent. If the resource utilization of both CPU and memory resources of an ESX
host falls below 45%, DPM evaluates power-off operations. If the utilization of either CPU or
memory resources exceeds 81 percent, DPM evaluates power-on operations of standby
ESX hosts.
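The target range and the asymmetry between the two conditions (both resources for power-off, either resource for power-on) can be expressed as a minimal sketch. The function names and return strings are assumptions for illustration; only the default values of 63 and 18 and the both/either logic come from the text.

```python
def dpm_target_range(target=63, tolerance=18):
    """Return the (low, high) utilization bounds in percent."""
    return target - tolerance, target + tolerance


def dpm_evaluation(cpu_util, mem_util, target=63, tolerance=18):
    """Classify a host's utilization (in percent) against the DPM target range."""
    low, high = dpm_target_range(target, tolerance)
    # Either resource above the range is enough to consider powering on a host.
    if cpu_util > high or mem_util > high:
        return "evaluate power-on"
    # Both resources must be below the range before power-off is considered.
    if cpu_util < low and mem_util < low:
        return "evaluate power-off"
    return "within target range"


print(dpm_target_range())      # (45, 81)
print(dpm_evaluation(30, 40))  # evaluate power-off
print(dpm_evaluation(30, 50))  # within target range
print(dpm_evaluation(85, 30))  # evaluate power-on
```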
The sweet spot of 63 percent is based on in-house testing and feedback from customers. Both the
DemandCapacityRatioTarget and DemandCapacityRatioToleranceHost values can be modified by the
user: the DemandCapacityRatioTarget can be set within the range of 40 to 90%, and the allowed input
range of the DemandCapacityRatioToleranceHost is between 10 and 40%. It is
recommended to use the default values and to only modify the values when you fully understand
the impact.
DPM calculates the ESX host resource demand as the sum of the demand of each active virtual
machine, where the demand of a virtual machine is its average demand over a historical period of
interest plus two standard deviations. The demand itself is a combination of the
virtual machine's working set (active memory) and an estimation of unsatisfied demand during
periods of contention. By using historical data over a longer period of time instead of the
virtual machine's current demand, DPM ensures that the evaluated virtual machine demand is
representative of the virtual machine's normal workload behavior. Shorter periods of time, or
current demand alone, can include short-term spikes in resource demand; if DPM reacted to these
it would unnecessarily generate power-on and power-off recommendations. Not only does this
negatively affect the power-saving efficiency, it also has an impact on the current resource
utilization, as DRS will try to load-balance the active virtual machines across a constantly changing
landscape of available hosts. Finding a proper balance between providing resources and resource
demand can be quite difficult: underestimating resource demand can result in lower performance,
while overestimating resource demand can lead to less optimal power savings.
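To make the "average plus two standard deviations" calculation concrete, here is a minimal Python sketch. The function names are hypothetical, the samples are assumed to be demand measurements in a single unit such as MHz or MB, and the use of the population standard deviation is our assumption; the text does not state which variant DPM uses.

```python
from statistics import mean, pstdev


def vm_demand_estimate(samples):
    """Average demand over the period of interest plus two standard deviations."""
    if not samples:
        return 0.0
    # pstdev (population standard deviation) is an assumption made for this sketch.
    return mean(samples) + 2 * pstdev(samples)


def host_demand(vm_histories):
    """Host demand: the sum of the demand estimates of its active virtual machines."""
    return sum(vm_demand_estimate(samples) for samples in vm_histories)
```

A virtual machine with a flat history of 500 yields an estimate of 500, while one that alternates between 400 and 600 yields 700: the same average, but the added deviation term accounts for its variability.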
DPM uses two periods of interest when calculating the average demand. The period of interest DPM
uses when evaluating virtual machine demand that can possibly lead to power-on operations is 300
seconds (5 minutes). DPM uses a longer period when evaluating resource demand that may lead to
power-off operations: it evaluates the virtual machine workload of the past 2400 seconds (40
minutes).
By using a shorter period of time for evaluating power-on operations, DPM is able to
respond to a demand increase relatively quickly. A longer period is used to evaluate power-off
operations so that DPM will respond slowly to a decrease in workload demand. DPM must be
absolutely sure that it will not negatively impact virtual machine performance.
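The effect of the two periods of interest can be shown with a small Python sketch. The sample history, the one-minute sampling interval and the function name are invented for the illustration; only the 300 and 2400 second periods come from the text.

```python
POWER_ON_PERIOD_S = 300     # 5 minutes, used when evaluating power-on operations
POWER_OFF_PERIOD_S = 2400   # 40 minutes, used when evaluating power-off operations


def average_demand(history, now, period_s):
    """Average the demand samples that fall within the last period_s seconds.

    history is a list of (timestamp_in_seconds, demand) tuples.
    """
    window = [demand for ts, demand in history if now - period_s <= ts <= now]
    return sum(window) / len(window) if window else 0.0


# One sample per minute: a steady demand of 1000, then a rise to 3000 in the last 5 minutes.
history = [(ts, 3000 if ts >= 2100 else 1000) for ts in range(0, 2401, 60)]
power_on_view = average_demand(history, now=2400, period_s=POWER_ON_PERIOD_S)
power_off_view = average_demand(history, now=2400, period_s=POWER_OFF_PERIOD_S)
```

The short period sees the full increase, so a power-on evaluation reacts quickly, whereas over the 40 minute period the same increase barely moves the average. The reverse holds for a short dip in demand, which is why power-off decisions are slow to follow it.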
Providing adequate resources for workload demand is considered more important by DPM than
rapid response to decreasing workloads, so DPM gives performance a higher priority than
saving power. This also becomes visible when reviewing the rules for power-on and power-off
recommendations: a power-off recommendation is only
applied when the ESX host is below the specified target utilization range AND there are no power-on recommendations active.
or smaller virtual machines are considered first before heavily loaded hosts in the same group. If the
cluster contains heterogeneously sized hosts, DPM considers hosts in order of critical resource
capacity. Hosts with more critical resources (CPU or memory) are sorted before the other hosts in
their group. For power-on recommendations, larger capacity hosts are favored first; smaller capacity
hosts are favored first for power-off recommendations.
Table 9: DPM preference
If the sort process discovers equal hosts with respect to the capacity or evacuation cost, DPM will
randomize the order of these hosts for a wear-leveling effect. Be aware that sorting of the hosts for
power-on or power-off recommendations does not determine the actual order for the selection
process to power-on or power-off hosts. In addition, DPM might not strictly
adhere to its host sort order if doing so would lead to choosing a host with far more
capacity than needed while a smaller capacity host that can adequately handle the demand is also
available. But under normal circumstances DPM generates the power-off recommendation based on
the resource LowScore and HighScore.
memLowScore = Sum across all hosts below target utilization (target utilization - host utilization)
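As a sketch, the LowScore formula can be written as a short Python function. We assume the formula takes the difference between the target utilization and the utilization of each host below it; the function name is our own, and which value serves as the target (the 63% sweet spot or the lower bound of the range) is left as a parameter because the formula does not spell it out.

```python
def low_score(host_utilizations, target):
    """Sum of (target - utilization) over all hosts whose utilization is below target.

    Utilization values and target are in percent of one resource (CPU or memory).
    """
    return sum(target - util for util in host_utilizations if util < target)
```

For example, low_score([30, 40, 70], 45) returns 20: the first two hosts contribute 15 and 5, and the third host is not below the target and contributes nothing.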
DPM is aware of which resource is more critical and will use and process this in the evaluation. If
the hosts are overcommitted on memory, DPM determines that memory is the critical resource and
will prioritize memory over CPU recommendations.
machines across all the hosts inside the cluster, even migration to the hosts that are currently
placed in standby mode. By using the HighScore calculation, DPM determines the impact that a
power-up operation has on the current utilization ratio. It needs to determine how much
this power-up operation reduces the distance of the resource utilization from the
target utilization, or how much it reduces the number of highly utilized hosts. DPM compares
the HighScore value of the cluster in its current state (standby host still down) to the HighScore
value of the simulations. If a simulation in which a standby host is powered on offers an improved
HighScore value, DPM will generate a power-on recommendation for that specific host. In some cases
constraints can limit the host selection, such as the inability to
migrate virtual machines to the candidate host if it were powered on, or the expectation that the
virtual machines that would move to the candidate host will not reduce load on the highly utilized
hosts in the cluster.
DPM continues to run simulations as long as there are hosts in the cluster exceeding the target
utilization range. DPM is very efficient in homogeneously sized clusters, as it will skip every host
that is identical, with regard to physical resources or vMotion compatibility, to any host that has
already been rejected for a power-on operation during the simulation.
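The power-on simulation loop, including the shortcut for identical hosts, might look like the following Python sketch. Everything here is illustrative: the `simulate` callback, the hardware signature used to decide that two hosts are identical, and the assumption that a HighScore of zero means no host exceeds the target utilization range are ours, not part of DPM's documented interface.

```python
def power_on_recommendations(high_score, standby_hosts, simulate):
    """Return the standby hosts whose simulated power-on improves the HighScore.

    standby_hosts: list of (name, signature) tuples in DPM's sort order, where
                   signature is a hypothetical key for identical hardware.
    simulate:      hypothetical callback (name, already_recommended) -> HighScore.
    """
    rejected_signatures = set()
    recommended = []
    for name, signature in standby_hosts:
        if high_score <= 0:
            break  # assumed: no host exceeds the target utilization range any more
        if signature in rejected_signatures:
            continue  # identical to a host that was already rejected, skip the simulation
        simulated = simulate(name, recommended)
        if simulated < high_score:
            recommended.append(name)
            high_score = simulated
        else:
            rejected_signatures.add(signature)
    return recommended
```

In a homogeneous cluster one rejected host is enough to skip all the others, which is where the efficiency mentioned above comes from.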
DPM compares the LowScore value of the cluster with all the candidate hosts active to the
LowScore value of the simulations. If a simulation offers an improvement of the LowScore and the
HighScore value does not increase, DPM generates a power-off recommendation. This power-off
recommendation also contains virtual machine migration recommendations for the virtual
machines running on this particular host. DRS will not indicate these virtual machine migration
recommendations as priority level 1 migrations. If DRS is set to the conservative migration
threshold level, then DRS will only generate priority level 1 migration recommendations. These
priority level 1 migrations are mandatory moves and only address constraint violations, such as
anti-affinity rules or an ESX host entering maintenance mode. This threshold does not generate
non-mandatory recommendations to rebalance the workload of the virtual machines across the ESX
hosts in the cluster; therefore setting the migration threshold of DRS to generate priority level 1
recommendations will effectively disable DPM.
Basic design principle: When DPM is activated make sure DRS is not set to the conservative
threshold level.
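The acceptance rule for power-off simulations described above can be summarized in a short Python sketch. The function name and the data layout are hypothetical, and we assume that an "improved" LowScore means a lower value, since the score sums how far hosts sit below the target utilization.

```python
def power_off_candidates(current_low, current_high, simulations):
    """Return the hosts whose simulated power-off passes the score comparison.

    simulations maps a candidate host name to a tuple
    (simulated LowScore, simulated HighScore) with that host powered off.
    """
    return [
        host
        for host, (sim_low, sim_high) in simulations.items()
        # The LowScore must improve and the HighScore must not increase.
        if sim_low < current_low and sim_high <= current_high
    ]
```

A candidate that lowers the LowScore but pushes another host above the target range, and therefore raises the HighScore, is not recommended.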
DPM will not power down a host if doing so violates the minimum powered-on capacity specified by the
settings MinPoweredOnCpuCapacity and MinPoweredOnMemCapacity. Another reason for DPM
not to select a specific candidate host can be based on DRS constraints or objectives. For example, a
host might be rejected for power-off if the virtual machines that need to be migrated can only
be moved to hosts that would become too heavily utilized. This situation can occur when multiple DRS
(anti-)affinity rules are active in the cluster. A third factor is that DPM does not select a candidate
host to power down if the power-off cost/benefit analysis run by DPM indicates a negative or
non-existent benefit.
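The first of these checks, the minimum powered-on capacity, is simple enough to sketch in Python. The host dictionaries and the function name are hypothetical, and we assume here that MinPoweredOnCpuCapacity is expressed in MHz and MinPoweredOnMemCapacity in MB.

```python
def violates_min_capacity(powered_on_hosts, candidate, min_cpu_mhz, min_mem_mb):
    """True if powering off `candidate` drops the cluster below the minimum capacity.

    powered_on_hosts: list of dicts with the keys "name", "cpu_mhz" and "mem_mb".
    min_cpu_mhz:      stands in for MinPoweredOnCpuCapacity (unit assumed to be MHz).
    min_mem_mb:       stands in for MinPoweredOnMemCapacity (unit assumed to be MB).
    """
    remaining = [host for host in powered_on_hosts if host["name"] != candidate]
    cpu = sum(host["cpu_mhz"] for host in remaining)
    mem = sum(host["mem_mb"] for host in remaining)
    return cpu < min_cpu_mhz or mem < min_mem_mb
```

A candidate that fails this check is rejected before the DRS constraints and the cost/benefit analysis even come into play.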
Similar to power-on recommendations, DPM continues to run simulations as long as the cluster
contains ESX hosts below the target utilization range. (Considering both resources in case of a
power-off.)
DPM runs the power-off cost/benefit analysis, which compares the costs and risks associated with a
power-off operation to the benefit of powering off the host. DPM will only accept a host power-off
recommendation if the benefits meet or exceed the performance impact multiplied by the
PowerPerformanceRatio setting. The default value is 40 but can be modified to a value in the range
between 0 and 500. As always, change this setting only if you understand the impact of modifying
it. Both cost and benefit calculations include both CPU and memory resources.
The power-off benefits and power-off cost are calculated as follows. The power-off benefit analysis
calculates the StableOffTime value, which indicates the amount of time the candidate host is
expected to be powered-off until the cluster needs its resources because of an anticipated increase
in virtual machine workload. The time that the virtual machine workload is stable and no power-up
operations are required is called the ClusterStableTime. DPM will use the virtual machine
stabletime, calculated by DRS cost-benefit-risk analysis, as input for the ClusterStableTime
calculation.
The time it takes from applying the power-off recommendation to the power-off state is taken into
account as well. The analysis breaks this time down into two sections and calculates this as the sum
of the time it takes migrating all active virtual machines off the host (HostEvacuationTime) and the
time it takes to power off the host (HostPowerOffTime). These values are combined in the sum:
StableOffTime = ClusterStableTime - (HostEvacuationTime + HostPowerOffTime)
The power-off cost is calculated as the summation of the following estimated resource costs:
• Migration of the active virtual machines running on the candidate host to other ESX hosts
• Unsatisfied virtual machine resource demand during the power-on of the candidate host at the end of
  the ClusterStableTime
• Migration of virtual machines back onto the candidate host
The last two bullet points can only be estimated by DPM; DPM calculates the number of hosts that
need to be available at the end of the ClusterStableTime. This calculation is somewhat of a worst-case scenario, as DPM expects all the virtual machines to generate heavy workloads at the end of the
ClusterStableTime, thereby generating a conservative value.
As previously mentioned, DPM will only recommend a power-off operation if the benefit equals or
exceeds the performance impact. If the ClusterStableTime is low, this can result in
a StableOffTime equal to or even less than zero. In this scenario, DPM will stop evaluating the
candidate host for a power-off operation recommendation because it will not offer any benefit.
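Putting the pieces of the cost/benefit analysis together gives the following Python sketch. It assumes that the StableOffTime is the ClusterStableTime minus the evacuation and power-off times, that all times are in seconds, and that benefit and cost are already available as plain numbers in the same unit; how DPM actually quantifies them is not covered here, and the function names are our own.

```python
def stable_off_time(cluster_stable_time, host_evacuation_time, host_power_off_time):
    """Time the host is expected to stay powered off, in seconds."""
    return cluster_stable_time - (host_evacuation_time + host_power_off_time)


def passes_cost_benefit(benefit, cost, cluster_stable_time, host_evacuation_time,
                        host_power_off_time, power_performance_ratio=40):
    """Hypothetical acceptance check for a single power-off candidate."""
    off_time = stable_off_time(cluster_stable_time, host_evacuation_time,
                               host_power_off_time)
    if off_time <= 0:
        return False  # the host would be needed again before it is powered off
    # The benefit must meet or exceed the performance impact times the ratio.
    return benefit >= cost * power_performance_ratio
```

With a ClusterStableTime of 3600 seconds, an evacuation time of 300 seconds and a power-off time of 120 seconds, the StableOffTime is 3180 seconds. If the ClusterStableTime is only 400 seconds the result is negative and the candidate is dropped, whatever its benefit.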
Chapter 18
Integration with DRS and High Availability
Distributed Resource Scheduler
DPM tries to match the availability of resources in the cluster to the virtual machine workload and
resource demand. During the recommendation generation process, DRS what-if mode is executed to
ensure that the power operation recommendations do not violate the DRS constraints and
objectives. In addition, DRS has the ability to bring ESX hosts out of standby mode to acquire the
necessary resources to match the resource demand created by an unexpected increase of virtual
machine workloads. DRS does not distinguish between ESX hosts that are placed in standby mode
by DPM and hosts placed in standby mode manually by the administrator. DRS might therefore undo a
manually applied standby mode the next time it runs.
High Availability
If HA strict Admission Control is enabled (default), DPM will maintain the necessary level of
powered-on capacity to meet the configured HA failover capacity. HA places a constraint on DPM to
prevent it from powering down too many ESX hosts if doing so would violate the Admission Control
Policy. Unlike disconnected hosts or hosts in maintenance mode, an ESX host in standby mode still has
its unreserved resources considered by HA for Admission Control, because the host can be brought out
of standby mode if the resources are required.
If HA strict Admission Control is disabled, the failover constraints are not passed on to DPM.
Because no constraint to keep enough resources available is enforced, DPM will generate power-off recommendations and place ESX hosts in standby mode regardless of the impact on the
HA failover requirements. However, starting with vCenter 4.1, if a failure happens and HA cannot
restart some virtual machines due to insufficient powered-on hosts, HA will ask DRS/DPM to
power-on hosts to accommodate the restart of those virtual machines.
vSphere 4.0 HA clusters have a soft limit of 40 virtual machines per host once the cluster
contains more than eight hosts. If the cluster contains a maximum of eight hosts, the maximum
number of virtual machines per host is far higher than 40. DPM does not consider
the soft limit and can create a scenario where the remaining nine hosts end up with more than 40
virtual machines each. Fortunately, in vSphere 4.1 this soft limit on the number of virtual
machines for clusters of more than eight ESX hosts has been removed.
DPM does not take the different HA node roles into account when selecting a host
or multiple hosts for power-down recommendations, simply because DPM is not aware of the
different types of HA nodes. To avoid DPM powering down all HA primary nodes, DPM will explicitly
disable HA on the host before placing it into standby mode. By disabling the HA agent on the
host, HA will trigger a new primary node election, resulting in a re-election of primary nodes for
each former primary that is put into standby mode.
Another valuable option in this situation is to configure HA with Admission Control enabled.
Admission Control will ensure that enough resources will be available and will not allow DPM to
power down ESX hosts if it would violate failover requirements.
DRS-FT integration requires EVC to be enabled on the cluster. Many companies do not enable EVC
on their ESX clusters, based on either FUD (Fear, Uncertainty and Doubt) about performance loss or
the argument that they create homogeneous clusters and do not intend to expand them with new
types of hardware. The advantages and improvements that DRS-FT integration offers in both
performance and reduced complexity in cluster design and operational procedures shed some
new light on the discussion of whether to enable EVC in a homogeneous cluster. If EVC is not enabled,
vCenter will revert to vSphere 4.0 behavior and apply the DRS disabled setting to the FT virtual
machines.
By scheduling the DPM disable task more than one hour in advance of the morning peak, DRS will
have the time to rebalance the virtual machines across all active hosts inside the cluster, and the
Transparent Page Sharing process can collapse the memory pages shared by the virtual machines
on the ESX hosts. By powering up all ESX hosts early, the ESX cluster will be ready to accommodate
load increases.
Chapter 19
Summarizing
Improvements were made to DRS in vSphere 4.1. Better integration and more efficient algorithms
allow DRS to reach a steady state more quickly when there is significant load imbalance in the
cluster.
We have tried to simplify some of the concepts to make them easier to understand; still, we
acknowledge that some concepts are difficult to grasp. We hope though that after reading this
section of the book everyone is confident enough to create and configure DRS clusters to achieve
higher consolidation ratios at low costs.
If there are any questions please do not hesitate to reach out to either of the authors.
Appendix
Appendix A Basic Design Principles
VMware High Availability
Avoid using static host files as it leads to inconsistency which makes troubleshooting
difficult.
In blade environments, divide hosts over all blade chassis and never exceed four hosts per
chassis to avoid having all primary nodes in a single chassis.
For network-based storage (iSCSI, NFS, FCoE) it is recommended (pre-vSphere 4.0 Update
2) to set the isolation response to "Shut Down" or "Power off". It is also recommended to
have a secondary Service Console (ESX) or Management Network (ESXi) running on the
same vSwitch as the storage network to detect a storage outage and avoid false positives for
isolation detection.
Keep das.failuredetectiontime low for fast responses to failures.
If an isolation validation address has been added, das.isolationaddress, add 5000 to the
default das.failuredetectiontime (15000).
Be really careful with reservations: if there's no need to have them on a per virtual machine
basis, don't configure them, especially when using Host Failures Cluster Tolerates. If
reservations are needed, resort to resource pool based reservations.
Avoid using advanced settings to decrease the slot size as it could lead to more down time
and adds an extra layer of complexity. If there is a large discrepancy in size and reservations
are set it might help to put similar sized virtual machines into their own cluster.
When using Admission Control, balance your clusters and be conservative with reservations
as it leads to decreased consolidation ratios.
Although vSphere 4.1 will utilize DRS to try to accommodate for the resource requirements
of this virtual machine a guarantee cannot be given. Do the math; verify that any single host
has enough resources to power-on your largest virtual machine. Also take restart priority
into account for this/these virtual machine(s).
Admission Control guarantees enough capacity is available for virtual machine failover. As
such we recommend enabling it.
Do the math, and take customer requirements into account. We recommend using a
Percentage based Admission Control Policy as it is the most flexible policy.
VM Monitoring can substantially increase availability. It is part of the HA stack and we
heavily recommend using it!
das.allowNetwork[x] - Enables the use of port group names to control the networks used
for HA, where [x] is a number between 0 and 10. You can set the value to be "Service
Console 2" or "Management Network" to use (only) the networks associated with those port
group names in the networking configuration. These networks need to be compatible for
HA to configure successfully. Please note that the number [x] has no relationship with the
network; it only gives you the option to specify multiple networks.
setting it to true will disable the warning. HA must be reconfigured to make the
configuration issue go away.
das.vmMemoryMinMB - The minimum default slot size used for calculating failover
capacity. Higher values will reserve more space for failovers. Do not confuse with
das.slotMemInMB.
das.vmCpuMinMHz - The minimum default slot size used for calculating failover capacity.
Higher values will reserve more space for failovers. Do not confuse with das.slotCpuInMHz.
das.slotMemInMB - Sets the slot size for memory to the specified value. This advanced
setting can be used when a virtual machine with a large memory reservation skews the slot
size, as this will typically result in a conservative number of available slots.
das.slotCpuInMHz - Sets the slot size for CPU to the specified value. This advanced setting
can be used when a virtual machine with a large CPU reservation skews the slot size, as this
will typically result in a conservative number of available slots.
das.sensorPollingFreq - Set the time interval for status updates. As of vSphere 4.1 the
default value of this setting is 10. It can be configured between 1 and 30. It is not
recommended to decrease this value as it might lead to less scalability due to the overhead
of the status updates.
das.failureInterval (VM Monitoring) - The polling interval for failures. Default value is 30
seconds.