vSphere HA and Datastore Access Outages: Current Capabilities Deep-Dive and Tech Preview
#vmworldinf
Disclaimer
Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features
discussed or presented have not been determined.
vSphere
Local Availability
Disaster Recovery
vSphere High Availability (this talk)
vSphere Fault Tolerance
vMotion and Storage vMotion
Data Protection
vSphere HA Recap
vSphere HA minimizes unplanned downtime
Provides automatic VM recovery in minutes
Protects against 3 types of failures
  Infrastructure: host failures, VM crashes
  Connectivity: host network isolated, datastore incurs PDL
  Application: guest OS hangs/crashes, application hangs/crashes
Talk Focus
Loss of accessibility is due to
  Network or switch failure
  Array, NFS server, etc. misconfiguration
VM manageability and availability are affected
  Applications with vdisks on inaccessible datastores hang, crash, or fail
  May not be able to manage VMs on the affected hosts
  vSphere HA's protection impacted
1. Impact of datastore inaccessibility on HA failover workflows
2. Expanding HA protection against datastore inaccessibility
Objectives
  Learn how HA workflows are impacted by datastore accessibility
  Understand how vSphere 5.0/5.1 reduces the impact of inaccessibility
  Preview the future: protecting VMs against datastore inaccessibility
Agenda: Part 1
1. Impact of datastore inaccessibility on HA failover workflows
  Architecture overview
  Datastore usage
  HA workflows and responses
2. Expanding HA protection against datastore inaccessibility
The HA agent on each host
  Called the Fault Domain Manager (FDM)
  Provides all the HA on-host functionality
Operation
  vCenter Server (VC) manages the cluster
  Failover operations are independent of VC
  FDMs communicate over
    Management network
    Datastores
The FDM master
  Monitors hosts and VMs
  Manages VM restarts after failures
  Reports cluster state to VC
The FDM slave
  Forwards critical state changes to the master
  Restarts VMs when directed by the master
  Elects a new master
Datastore Usage
[Diagram relating datastores, HA host states, and responses; edge labels: used in, influence, triggers]
Datastores are used when the management network is not available
Heartbeat datastores
  Used by a master to monitor a partitioned/isolated slave
  Enable a master to detect VM power state changes
  VC chooses two (by default) for each host
  Reselected after datastore accessibility changes
State               Source
Election            FDM on the host
Running (Master)    VC
Connected (Slave)   Master
Unreachable         Master or VC
Isolated            FDM on the host, reported by master
Partitioned         Master
Dead                Master
[Callout on the same State/Source table: determined using datastore communication]
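The State/Source listing can be captured as a small lookup. This is only a sketch: it assumes the flattened State and Source columns pair up one-to-one in the order listed, and `HostState`, `STATE_SOURCE`, and `source_of` are illustrative names, not part of any vSphere API.

```python
from enum import Enum


class HostState(Enum):
    """HA host states as labeled on the slide."""
    ELECTION = "Election"
    MASTER = "Running (Master)"
    SLAVE = "Connected (Slave)"
    UNREACHABLE = "Unreachable"
    ISOLATED = "Isolated"
    PARTITIONED = "Partitioned"
    DEAD = "Dead"


# Assumption: the i-th state goes with the i-th source on the slide.
STATE_SOURCE = {
    HostState.ELECTION: "FDM on the host",
    HostState.MASTER: "VC",
    HostState.SLAVE: "Master",
    HostState.UNREACHABLE: "Master or VC",
    HostState.ISOLATED: "FDM on the host, reported by master",
    HostState.PARTITIONED: "Master",
    HostState.DEAD: "Master",
}


def source_of(label: str) -> str:
    """Return who determines the state with the given slide label."""
    return STATE_SOURCE[HostState(label)]
```

For example, `source_of("Isolated")` looks up which component declares a host isolated under the assumed pairing.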
Dead
* See slide notes
If all FDMs are isolated, all will apply isolation responses
VMs not restarted until master has access to VM datastores
Best practices
  Redundant management networks
  Reconfigure storage to reduce likelihood of inaccessible datastores
  Use "leave powered on" isolation option
If the isolated host and the master both have access to heartbeat datastores
  Master will attempt failover on power off notification
Otherwise
  Master will declare host dead and start failover immediately*
Same situation applies to partitioned hosts
* More info in backup slides
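The master's choice for an isolated (or partitioned) host can be sketched as a small decision function. This is a minimal illustration of the two branches stated above, not the FDM implementation; the names `IsolationView` and `master_action`, and the `"wait"` result for the case where no power off notification has arrived yet, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class IsolationView:
    host_sees_heartbeat_ds: bool    # isolated host can reach heartbeat datastores
    master_sees_heartbeat_ds: bool  # master can reach heartbeat datastores
    power_off_notified: bool        # master has seen a VM power off notification


def master_action(view: IsolationView) -> str:
    """Return the master's action for VMs on an isolated or partitioned host."""
    if view.host_sees_heartbeat_ds and view.master_sees_heartbeat_ds:
        # Both sides share heartbeat datastores: fail over only once a
        # power off is reported (waiting otherwise is an assumption).
        return "failover" if view.power_off_notified else "wait"
    # No shared datastore channel: declare the host dead and fail over now.
    return "declare_dead_and_failover"
```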
[Flowchart: failover workflow with decision points "Found place?" and "Restarted?", each with Yes/No branches]
Case 1: home datastore of VM is not accessible on master's host
  Master will proxy all accesses via a slave with access
Case 2: master may not know the VM is protected
  Reason #1: VM's home datastore is inaccessible
    VM can't be powered on in any case
    Master will retry once datastore is accessible
  Reason #2: partition, multiple masters, other master owns VM
    But other master knows and will restart it if needed
Case 1: host manageability impacted by datastore inaccessibility
  Master will retry failovers on another host after timeout
  Could take a long time to restart failed VMs
  vSphere 5.0 and 5.1 enhancements significantly reduce impact
Case 2: one of VM's datastores is inaccessible on some/all hosts
  Master will retry but could exhaust 5 retries before success
  Future opportunity to enhance HA
Both are discussed next in part 2 of this session
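The retry behavior in the two cases above can be sketched as a bounded loop over candidate hosts. This is a hedged illustration only: `restart_with_retries` and `try_restart` are hypothetical names, the round-robin host choice is an assumption, and "5 retries" is read here as one initial attempt plus up to five retries.

```python
from typing import Callable, Optional, Sequence

MAX_RETRIES = 5  # from the slide; interpreted as retries after a first attempt


def restart_with_retries(
    vm: str,
    candidate_hosts: Sequence[str],
    try_restart: Callable[[str, str], bool],
    max_retries: int = MAX_RETRIES,
) -> Optional[str]:
    """Try to restart `vm`, moving to another host after each failure.

    Returns the host that accepted the VM, or None if every attempt failed.
    `try_restart(vm, host)` stands in for a restart attempt that may fail
    or time out when a datastore is inaccessible on that host.
    """
    hosts = list(candidate_hosts)
    if not hosts:
        return None
    for attempt in range(1 + max_retries):
        host = hosts[attempt % len(hosts)]  # assumed round-robin placement
        if try_restart(vm, host):
            return host
    return None  # retries exhausted before any host succeeded
```

If the datastore is inaccessible on every candidate host, the loop exhausts its retries without success, which is the gap Case 2 describes.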
Agenda: Part 2
1. Impact of datastore inaccessibility on HA failover workflows
2. Expanding HA protection against datastore inaccessibility
Improve VM availability by ensuring
  1. VMs are manageable
  2. VMs that use the datastore are moved to healthy hosts
Address #1 by enhancing ESX, #2 by enhancing HA
PDL and APD are ESX storage-device states that indicate inaccessibility
PDL (Permanent Device Loss): device is permanently inaccessible
  E.g., caused by removing a LUN using array management software
  ESX infers state from
    SCSI sense codes returned by an array
    iSCSI login reject (target is gone or access not authorized)
  Device must be recreated to restore normal operation
APD (All Paths Down): device is possibly temporarily inaccessible
  E.g., caused by unplugging a network cable
  Device could become accessible at any time
Idea: if a datastore is under APD/PDL, fail I/Os quickly
  Impacted operations are notified faster, which allows others to proceed
ESX PDL Support (vSphere 5.0)
  When under PDL, I/Os are failed immediately
ESX APD Support (vSphere 5.1)
  When under APD, non-guest I/Os are failed immediately once an initial delay has elapsed
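The fast-fail idea can be sketched as a per-I/O decision. This is an illustration under stated assumptions, not ESX code: `DeviceState` and `io_disposition` are hypothetical names, the delay is passed in as a parameter because the slide gives no value, and retrying guest I/O during APD is an assumption (the slide only says non-guest I/Os are failed).

```python
from enum import Enum


class DeviceState(Enum):
    NORMAL = "normal"
    PDL = "pdl"  # Permanent Device Loss
    APD = "apd"  # All Paths Down


def io_disposition(
    state: DeviceState,
    is_guest_io: bool,
    seconds_in_apd: float,
    apd_delay_s: float,
) -> str:
    """Decide what happens to one I/O: "issue", "retry", or "fail"."""
    if state is DeviceState.PDL:
        # vSphere 5.0 behavior from the slide: fail immediately under PDL.
        return "fail"
    if state is DeviceState.APD:
        # vSphere 5.1 behavior from the slide: after the delay, non-guest
        # I/Os are failed immediately.
        if not is_guest_io and seconds_in_apd >= apd_delay_s:
            return "fail"
        return "retry"  # assumption: other I/Os keep being retried
    return "issue"
```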
Technical Direction
  Restart VMs with datastores under APD/PDL on a healthy host
  Response is fully configurable and automatic
5.0U1 introduced initial support for PDL
  Terminates a VM on first guest-issued I/O to a PDL virtual disk
  Once a VM has been terminated, vSphere HA will restart it
  Enabled using advanced options. See slide notes for details.
We are exploring a significant extension to this mechanism
Design goals
  Add support for APD
  Triggered by PDL/APD declaration rather than guest I/Os
  Full customization of responses (e.g., event only option)
  Full user interface and detailed reporting
  VM placement sensitive to accessibility
VM Component Protection
Caveat: what follows is a prototype and a feature based on it may look quite different, if and when we offer it
[Flowchart, shown in several builds. Node labels: Datastore inaccessible; PDL; APD; Determine per-VM response; No action; Failover; APD cleared; End]
Demo Overview
[Diagram of the demo environment. Labels: OR, EX, UB, WS1, WS2, FC Switch, Ethernet Switch, SAN, NFS]
Several platform enhancements in recent years
  vSphere 5.0: PDL support
  vSphere 5.1: APD support
  vSphere 5.0U1: HA restarts VMs if they fail during a PDL/APD
The future: HA recovering VMs impacted by PDL/APD
  Comprehensive: APD and PDL, and covers all VM I/Os
  Configurable: various levels of VM remediation
  Usable: enable with 1 click, detailed error reporting
Please send us your feedback on the proposed feature
Session Summary
vSphere HA failure coverage extended in 5.0U1 to cover PDL
HA vision:
  Full coverage of datastore accessibility outages
  Extend coverage of failures for more applications
  Extend HA coverage to multi-VM applications
Questions?
INF-BCO2807
It can't communicate with it over the network, but it can see its datastore heartbeats
Results in:
  Another master elected
  VC reports one master's view of the cluster
  A VM running in the other partition will be
    monitored via the heartbeat datastores
    restarted if it fails (in master's partition)
[Diagram: hosts ESX 1, ESX 2, ESX 3, ESX 4]
Troubleshooting
HA Agent Logging
HA 5.0+ writes operational information to a single log file called fdm.log
A configurable number of historical copies are kept to assist with debugging
File contains a record of, for example,
  Inventory updates relating to VMs, the host, and datastores received from the host management agent (hostd)
  Processing of configuration updates sent to a master by vCenter Server
  Significant actions taken by the HA agent, such as protecting a VM or restarting a VM
  Messages sent by a slave to a master and by a master to a slave
Default location
  ESXi 5.0+: /var/log/fdm.log (historical copies in /var/run/log)
  Earlier ESX versions: /var/log/vmware/fdm (all files in the same directory)
Notes
  See vSphere HA best practices guide for recommended log capacities
  HA log files are designed to assist VMware support in diagnosing problems, and the format may change at any time. Thus, for reporting, we recommend you rely on the vCenter Server HA-related events, alarms, config issues, and VM/host properties
Noteworthy modules are
  Cluster module: responsible for cluster functions
  Invt module: responsible for caching key inventory details
  Policy module: responsible for deciding what to do on a failure
  Placement module: responsible for placing failed VMs
  Execution module: responsible for restarting VMs
  Monitor modules: responsible for periodic health checks
  FDM module: responsible for communication with vCenter Server
Heartbeat files are named X-hb, where X is the (SDK API) moID of the host
Master periodically reads heartbeats of all partitioned / isolated slaves
Power-on files are named X-poweron, where X is the (SDK API) moID of the host
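The naming scheme can be sketched with two small helpers plus a filter for the slaves whose heartbeats a master reads. This is an illustration only: the function names and the example moID `host-42` are hypothetical, the suffix is assumed to be spelled `poweron`, and no directory layout is implied.

```python
from typing import Dict, List


def heartbeat_file(host_moid: str) -> str:
    """Name of a host's heartbeat file: X-hb, where X is the host moID."""
    return f"{host_moid}-hb"


def poweron_file(host_moid: str) -> str:
    """Name of a host's power-on file: X-poweron, where X is the host moID."""
    return f"{host_moid}-poweron"


def heartbeat_files_to_read(slave_states: Dict[str, str]) -> List[str]:
    """Heartbeat files a master would read, given moID -> state labels.

    Per the slide, only partitioned and isolated slaves are read.
    """
    watched = {"Partitioned", "Isolated"}
    return [
        heartbeat_file(moid)
        for moid, state in sorted(slave_states.items())
        if state in watched
    ]
```

For instance, `heartbeat_file("host-42")` yields `host-42-hb` under this scheme.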