
Availability Best Practices - Availability Using Multiple Service Domains
By Jsavit-Oracle on Jul 17, 2013

This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical
Domains).

Availability Using Multiple Service Domains


Oracle VM Server for SPARC lets the administrator create multiple service domains to provide virtual I/O devices to
guests. While not mandatory, this powerful feature adds resiliency and reduces single points of failure. Resiliency is
provided by redundant I/O devices, strengthened by providing the I/O from different service domains.

Domain types
First, a review of material previously presented in Best Practices - which domain types should be used to run
applications, re-stated here with more focus on availability.
Traditional virtual machine designs use a "thick" hypervisor that provides all virtualization functions: core virtualization
(creating and running virtual machines), management control point, resource manager, live migration (if available),
and I/O. If there is a crash in code providing one of those functions, it can take the entire system down with it.
In contrast, Oracle VM Server for SPARC offloads management and I/O functionality from the hypervisor to domains
(virtual machines). This is a modern, modular alternative to older monolithic designs, and permits a simpler and
smaller hypervisor. This enhances reliability and security because a smaller body of code is easier to develop and
test, and has a smaller attack surface for bugs and exploits. It reduces single points of failure by assigning
responsibilities to multiple system components that can be configured for redundancy.
Oracle VM Server for SPARC defines the following types of domain, each with its own role:

Guest domain - uses virtual network and disk devices provided by one or more service domains. This is
where applications typically run.
There can be as many domains as fit in available CPU and memory, up to a limit of 128 per T-series server or
M5-32 physical domain.

I/O domain - any domain that has been assigned physical I/O devices, which may be a PCIe bus root
complex, a PCI device, or a SR-IOV (Single Root I/O Virtualization) virtual function. It has native performance
and functionality for the devices it owns, unmediated by any virtualization layer. It can host its own applications
to make use of native I/O, and can be a service domain to provide virtual I/O device services to guests.
A server can have as many I/O domains as there are assignable physical devices. In these articles we will refer
to I/O domains that own a PCIe bus, also called "root complex domains". There can be as many such domains
as there are assignable "PCIe root complexes". For example, there are up to 4 root complex I/O domains on a
T4-4 since there is one PCIe root per socket and there are 4 sockets. A T5 system has two such buses per
socket, so an eight-socket T5-8 server can have 16 I/O domains.

Service domain - provides virtual network and disk devices to guest domains. It is almost always an I/O
domain with a PCIe bus, because otherwise it has no physical devices to serve out as virtual devices. Running
applications in a service domain is not recommended, so that the performance and availability of the virtual
device services are not compromised.

Control domain - is the management control point for the server, used to configure domains and manage
resources. It is the first domain to boot on a power-up. It is always an I/O domain since it needs physical disk
and network devices, and is usually a service domain as well. It doesn't have to be a service domain, but
there's no reason not to make use of the bus and I/O devices needed to bring up the control domain.
There is one control domain per T-series server or M5-32 physical domain.

These domain roles separate the components that provide different services. You can create redundant network and
disk configurations that tolerate the loss of an I/O device, and you can also create multiple service domains for
redundant access to I/O. This is not mandatory: you can choose to have a single primary domain that is the sole I/O
and service domain, and still provide redundant I/O devices. Multiple service domains add a further level of
redundancy and fault tolerance: guest domains can continue working even if a service domain or the control domain is down.
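For example, once an alternate service domain is serving I/O, a guest can be given one virtual disk path and one
virtual network device from each service domain. A minimal sketch, assuming hypothetical service names
(primary-vds0, alternate-vds0, primary-vsw0, alternate-vsw0), a guest named ldg1, and a shared LUN visible to both
domains (device path elided), might look like this:

t5# ldm add-vdsdev mpgroup=mpgrp1 /dev/dsk/c0t...d0s2 vol1@primary-vds0
t5# ldm add-vdsdev mpgroup=mpgrp1 /dev/dsk/c0t...d0s2 vol1@alternate-vds0
t5# ldm add-vdisk disk0 vol1@primary-vds0 ldg1
t5# ldm add-vnet net0 primary-vsw0 ldg1
t5# ldm add-vnet net1 alternate-vsw0 ldg1

The two back-end exports in the same mpgroup give the guest's single virtual disk a path through either service
domain, and the two virtual network devices can be paired with IP multipathing (IPMP) inside the guest. Later
articles in this series cover these configurations in detail.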

Rolling upgrades
Multiple service domains can be used for "rolling upgrades" in which service domains are serially updated and
rebooted without disrupting guest domain operation. This provides "serviceability", the "S" in "Reliability, Availability,
and Serviceability" (RAS) for continuous application availability during a planned service domain outage.
For example, you might update Solaris in the service domain to apply fixes or get the latest version of a device driver,
or even (in the case of the control domain) update the logical domains manager (the ldomsmanager package) to
add new functionality. (Note: there is never a problem with doing a pkg update to create a new boot environment
while guest domains continue to run. Solaris 11 permits system updates during normal operation. There is only a
domain outage during the reboot that brings the new boot environment into operation, and that brief outage can be
made invisible to guests by using multiple service domains.)
If you have two service domains providing virtual disk and network services, you can update one of them and reboot
it. While that domain is rebooting, all virtual I/O continues on the other service domain. When the first service
domain is up again, you can reboot the other one, providing continuous operation while upgrading system software.
This avoids taking a planned application outage for upgrades, which is often difficult to schedule due to business
requirements. It is also an efficient alternative to "evacuating" a server by using live migration to move guests off the
box during maintenance and then moving them back. The advantages over evacuation are that overall capacity isn't
reduced during maintenance, since all boxes remain in service, and the delay and overhead of migrating guest virtual
machines off and then back onto a server is eliminated. Oracle VM Server for SPARC can use the "evacuation with
live migration" method - but doesn't have to.

Okay, you convinced me - now tell me how to create an alternate service domain
Let's discuss details for configuring alternate service domains. The reference material is at Oracle VM Server for
SPARC Administration Guide - Assigning PCIe Buses.
The basic process is to identify which buses to keep on the control domain, which initially has all hardware resources,
and which ones to assign to alternate service domains (which means they are also I/O domains), and then assign
them.

1. Ensure your system is physically configured with network and disk devices on multiple buses so the
control domain and I/O domains can operate using their own physical I/O. Alternate service domains should not
depend on the primary domain's disk or network devices; that would create a single point of failure based on
the control domain - defeating the purpose of the exercise.

2. Identify which PCIe buses are needed by the control domain. At a minimum, that includes buses with
the control domain's boot disk and the network link used to log into it, plus backend devices used for virtual devices
it serves to guests.

3. Identify which PCIe buses are needed by the service domains. As with the control domain, this includes
devices needed so the domain can boot, plus any devices used for virtual network and disk backends.

4. Move selected buses from the control domain to the service domain. This will require a delayed
reconfiguration and reboot of the control domain because physical devices cannot yet be dynamically
reconfigured.

5. From the control domain, define virtual disk and network services using both the control and service domains
(a brief sketch follows the graphic note below).

In the following graphic, we've retained PCI bus pci_0 for the control domain, and assigned the other
bus pci_1 (on a two-bus server) to an I/O domain.
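As a hedged sketch of step 5 (the names are illustrative, and the alternate domain is assumed to already own its
own bus with a usable network device), the control domain could define a virtual disk service and a virtual switch
in each domain:

t5# ldm add-vds primary-vds0 primary
t5# ldm add-vds alternate-vds0 alternate
t5# ldm add-vsw net-dev=net0 primary-vsw0 primary
t5# ldm add-vsw net-dev=net0 alternate-vsw0 alternate

Guest virtual disks and virtual network devices can then be defined against the services in either domain, which is
what makes the redundant guest configurations described earlier possible.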

Illustration with a T5-2


Let's demonstrate the process on a T5-2. This lab machine has a small I/O configuration, but enough to demonstrate
the techniques used. First, let's see the configuration. Note that a T5-2 has four PCIe buses, since T5 systems have

two buses per socket, and that the ldm list -l -o physio command displays both the bus device and its pseudonym,
showing the correspondence between the pci@NNN and pci_N bus identification formats.

t5# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-c--  UART    256   511G     0.2%  0.0%  24m

t5# ldm list-io
NAME                   TYPE   BUS     DOMAIN   STATUS
----                   ----   ---     ------   ------
pci_0                  BUS    pci_0   primary
pci_1                  BUS    pci_1   primary
pci_2                  BUS    pci_2   primary
pci_3                  BUS    pci_3   primary
... snip for brevity ...

t5# ldm list -l -o physio
IO
    DEVICE     PSEUDONYM   OPTIONS
    pci@340    pci_1
    pci@300    pci_0
    pci@3c0    pci_3
    pci@380    pci_2
... snip for brevity ...


This shows all four bus device names and pseudonyms.

Identify which PCIe buses are needed for control domain disks
This section illustrates the method shown in Assigning PCIe Buses. First, ensure that the ZFS root pool is named
'rpool' as expected, and get the names of disk devices in the pool.

t5# df /
/                  (rpool/ROOT/solaris-1):270788000 blocks 270788000 files

t5# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c0t5000CCA03C2C7B50d0s0  ONLINE       0     0     0

errors: No known data errors


On current systems, boot disks are managed with Solaris I/O multipathing, hence the long device name. Use
the mpathadm command to determine which PCIe bus it is connected to, first obtaining the "Initiator Port Name", and
then getting the "OS Device File", which contains the bus information:

t5# mpathadm show lu /dev/rdsk/c0t5000CCA03C2C7B50d0s0
Logical Unit:  /dev/rdsk/c0t5000CCA03C2C7B50d0s2
... snip for brevity ...
        Paths:
                Initiator Port Name:  w508002000138a190
                Target Port Name:     w5000cca03c2c7b51
... snip for brevity ...

t5# mpathadm show initiator-port w508002000138a190
Initiator Port:  w508002000138a190
        Transport Type:  unknown
        OS Device File:  /devices/pci@300/pci@1/pci@0/pci@2/scsi@0/iport@1

This tells us that the boot disk is on bus pci@300, which is also referred to as pci_0, as shown by the output
from ldm list -l -o physio primary. We cannot remove this bus from the control domain!

For very old machines


On older systems like UltraSPARC T2 Plus, the disk may not be managed by Solaris I/O multipathing, and you'll see
traditional "cNtNdN" disk device names. Simply use ls -l on the device path to get the bus name. The example
below uses a mirrored ZFS root pool, and both disks are on bus pci@400, which has pseudonym pci_0.

t5240# df /
/                  (rpool/ROOT/solaris-1):60673718 blocks 60673718 files

t5240# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h39m with 0 errors on Fri Jul 12 08:08:16 2013
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c3t0d0s0  ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0

errors: No known data errors

t5240# ls -l /dev/dsk/c3t0d0s0
lrwxrwxrwx   1 root     root          49 Jul 24  2012 /dev/dsk/c3t0d0s0 -> ../../devices/pci@400/pci@0/pci@8/scsi@0/sd@0,0:a
t5240# ls -l /dev/dsk/c3t1d0s0
lrwxrwxrwx   1 root     root          49 Jul 24  2012 /dev/dsk/c3t1d0s0 -> ../../devices/pci@400/pci@0/pci@8/scsi@0/sd@1,0:a
t5240# ldm list-io
NAME                   TYPE   BUS     DOMAIN   STATUS
----                   ----   ---     ------   ------
pci_0                  BUS    pci_0   primary
pci_1                  BUS    pci_1   primary
... snip for brevity ...


t5240# ldm list -l -o physio primary
NAME
primary

IO
    DEVICE     PSEUDONYM   OPTIONS
    pci@400    pci_0
    pci@500    pci_1
... snip for brevity ...


The control domain on this T5240 requires bus pci@400, which has pseudonym pci_0. The other bus on this two-bus
system can be used for an alternate service domain.
To summarize: we've identified the PCI bus needed for the control domain boot disks, which cannot be removed.

Identify which PCIe buses are needed for the control domain network

A similar process is used to determine which buses are needed for network access. In this example net0 is used to
log into the control domain, and also for a virtual switch. That is the "vanity name" for ixgbe0.

t5# dladm show-phys net0
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         10000  full      ixgbe0

t5# ls -l /dev/ixgbe0
lrwxrwxrwx   1 root     root          53 May  9 11:04 /dev/ixgbe0 -> ../devices/pci@300/pci@1/pci@0/pci@1/network@0:ixgbe0
t5# ls -l /dev/ixgbe1
lrwxrwxrwx   1 root     root          55 May  9 11:04 /dev/ixgbe1 -> ../devices/pci@300/pci@1/pci@0/pci@1/network@0,1:ixgbe1
t5# ls -l /dev/ixgbe2
lrwxrwxrwx   1 root     root          53 May  9 11:04 /dev/ixgbe2 -> ../devices/pci@3c0/pci@1/pci@0/pci@1/network@0:ixgbe2
t5# ls -l /dev/ixgbe3
lrwxrwxrwx   1 root     root          55 May  9 11:04 /dev/ixgbe3 -> ../devices/pci@3c0/pci@1/pci@0/pci@1/network@0,1:ixgbe3
Based on these commands, pci@300 is used for ixgbe0 and ixgbe1, while pci@3c0 is used for ixgbe2 and ixgbe3.

Move selected buses from the control domain to the service domain
The preceding commands have shown that the control domain uses pci@300 (also called pci_0) for both network
and disk, and we can reassign any other bus to an alternate service + I/O domain. This will require a delayed
reconfiguration to remove physical I/O from the control domain, and then we'll define a new domain and give it that
bus. We first save the system configuration to the service processor so we can fall back to the prior state if we need
to.

t5# ldm add-spconfig 20130617-originalbus


t5# ldm start-reconf primary
Initiating a delayed reconfiguration operation on the primary domain.
All configuration changes for other domains are disabled until the primary
domain reboots, at which time the new configuration for the primary domain
will also take effect.
t5# ldm remove-io pci_2 primary
t5# ldm remove-io pci_3 primary
t5# shutdown -i6 -y -g0

After the reboot, we see that the two removed buses are no longer assigned to the primary domain:

t5# ldm list-io
NAME                   TYPE   BUS     DOMAIN   STATUS
----                   ----   ---     ------   ------
pci_0                  BUS    pci_0   primary
pci_1                  BUS    pci_1   primary
pci_2                  BUS    pci_2
pci_3                  BUS    pci_3
... snip for brevity ...


At this point we create the alternate I/O domain. This is done using normal ldm commands with the exception that
we do not define virtual disk and network devices, because it will use physical I/O devices on the buses it owns.

t5# ldm create alternate


t5# ldm set-core 1 alternate
t5# ldm set-mem 8g alternate
t5# ldm add-io pci_3 alternate
t5# ldm add-io pci_2 alternate
t5# ldm bind alternate
t5# ldm start alternate
LDom alternate started
t5# ldm add-spconfig 20130617-alternate
At this point, just install Solaris in the alternate domain using whatever process seems convenient. A network install
could be used, or a virtual ISO install image could be temporarily added from the control domain, and then removed
after installation.
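As one hedged example of the temporary-ISO approach (the image path is illustrative, and a virtual disk service such
as primary-vds0 is assumed to exist on the control domain):

t5# ldm add-vdsdev /path/to/sol-11-sparc.iso iso@primary-vds0
t5# ldm add-vdisk s11iso iso@primary-vds0 alternate

...boot the alternate domain from the virtual disk and install Solaris...

t5# ldm remove-vdisk s11iso alternate
t5# ldm remove-vdsdev iso@primary-vds0

Removing the virtual disk afterwards keeps the alternate domain free of any dependency on the control domain's I/O.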

Summary
Oracle VM Server for SPARC lets you configure multiple service and I/O domains in order to provide resilient I/O
services to guests. Following articles will show exactly how to configure such domains for highly available guest I/O.

Coming next
This article shows how to split out buses to create an alternate I/O domain, but we haven't made it into
a service domain yet. The next article will complete the picture, using an even larger server, to illustrate how to create
the I/O domain and then define redundant network and disk services.
