Table of Contents
Data Reliability With Logical Domains
About This Article
Data Availability And Reliability Overview
Internal Storage Reliability With Logical Domains
Hardware RAID
Reliability In I/O Domains
Reliability In Guest Domains
I/O Reliability And Availability On Sun Platforms
Example: Sun SPARC Enterprise T2000 Server I/O Architecture
I/O Architecture Summary
Installing And Configuring Logical Domains
Configuring Hardware RAID
About Hardware RAID Disk Sets
Implementing Hardware RAID
Creating A Logical Domain
Configuring ZFS In The I/O Domain
Implementing ZFS And Cloning Guest Domain Volumes
Create A ZFS Pool
Clone And Unconfigure
Clone And Run The Unconfigured System
Save Your Configuration
Configuring Volume Management In Guest Domains
Setting Up Volume Management Through Network Install
Set Up The Guest Domain
Rules File Setup
Profile File Setup
Sysidcfg File Setup
Add Install Client
Install
Housekeeping
Test
Conclusion
About The Author
Acknowledgments
References
Ordering Sun Documents
Chapter 1
Data Reliability With Logical Domains
Figure 1. Logical Domains supports multiple guest domains, each with its own secure partition of the server hardware. I/O is handled by I/O domains that own one or more PCI buses.
Logical Domains allows individual PCI root nexus nodes (referred to as PCI buses in this article) to be allocated to I/O domains so that each one owns a PCI bus and any devices connected to it. On servers with multiple PCI buses, multiple I/O domains can be configured to provide multiple paths from guest domains to I/O resources. If an I/O domain, its PCI bus, or its peripherals fail, or the domain needs to be rebooted, a guest domain can continue to access its I/O resources through the second I/O domain, given an appropriate configuration in the guest domain.
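As a rough sketch of how a PCI bus is moved to a second I/O domain on a Sun SPARC Enterprise T2000 server, whose two buses appear as pci@780 and pci@7c0 (the domain name is illustrative, and removing a bus from the control domain requires a delayed reconfiguration and a reboot):

fraser# ldm create alternate
fraser# ldm remove-io pci@7c0 primary
fraser# ldm add-io pci@7c0 alternate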
In the realm of data reliability, Sun servers with CoolThreads technology, combined with the Solaris 10 Operating System, offer several tools to build reliability through redundancy. These tools can be used at three different architectural levels to increase data reliability using the server's on-board disk drives and disk controller:

• Hardware-level redundancy can be established using the RAID capabilities of the server's built-in disk controller.
• An I/O domain can support I/O redundancy through software features including the Zettabyte File System (ZFS) and Solaris Volume Manager software. With reliability established in the I/O domain, reliable storage can be provided to guest domains.
• Guest domains can use the same software features to implement redundancy in the guest itself: ZFS and Solaris Volume Manager software can be configured within a logical domain using virtual disks.
This Sun BluePrints article discusses the approaches and trade-offs for establishing data
reliability through redundancy using the resources of the server itself, namely its own
PCI bus(es), built-in disk controller, and disk drives. Logical Domains opens up a realm of
possibilities for implementing I/O availability and reliability techniques in the virtual
world, and this article is the first in a series to address this broad topic. Perhaps not
surprisingly, the techniques discussed in this article all have their benefits and
drawbacks. The choice of which ones to implement is one that has to be made after
carefully comparing each solution's benefits to your datacenter requirements.
• Chapter 6, Configuring ZFS In The I/O Domain, shows the benefits of establishing reliability using ZFS, in particular using snapshots and clones with Logical Domains.
• Chapter 7, Configuring Volume Management In Guest Domains, shows how to set up Solaris Volume Manager software in a guest domain using JumpStart software.
Chapter 2
Internal Storage Reliability With Logical Domains
RAID can be implemented at any of the three architectural levels discussed here, and configuration examples are provided later in the article. In general,
the lower in the stack you go, the more efficient the RAID implementation should be.
Hardware RAID, implemented in the disk controller, offloads the CPU from
implementing RAID, and should be the most efficient. Software RAID implemented in a
guest domain must replicate every write to at least two virtual disks, imposing the
overhead of traversing through an I/O domain for every write.
All Sun servers with CoolThreads technology have a single controller for internal disk
drives, so only a single I/O domain can be used to access internal disks. Given this fact,
for access to internal storage, reliability can be considered independently from
availability.
Hardware RAID
Every Sun server with CoolThreads technology comes with an internal RAID controller
that supports RAID 0 (striping) and RAID 1 (mirroring). RAID 1 allows pairs of disks to be
configured into RAID groups where the disk mirroring is handled in hardware. This
approach is easy to manage, and is transparent to both the I/O and the guest domains.
With hardware RAID mirroring the disks, there are three ways in which this reliable storage
can be provided to guest domains:
1. An entire mirrored volume can be provided as a virtual disk to a guest domain. This disk can be partitioned and used as a boot disk.
2. A partition of a mirrored disk can be provided to a guest domain. Partitions provided to guest domains cannot be partitioned as virtual disks can.
3. A flat file created on a mirrored disk can be provided as a virtual disk to a guest domain. This disk can be partitioned and used as a boot disk.
The most common technique is Option 3, providing a flat file to a guest domain as a virtual disk, because many such virtual disks can be created on a single mirrored disk, and these virtual disks can be partitioned and booted from the guest domain. Option 1, allocating a mirrored disk set per guest domain, would significantly limit the number of possible domains on most systems. Option 2, allocating a partition of a mirrored disk, limits the number of virtual disks that can be created because only a fixed number of partitions can be created on a physical disk; these virtual disks also cannot be used as boot disks in Logical Domains.
Solaris Volume Manager software can be used to create RAID sets that can be
provided as virtual disks to guest domains in two ways: entire volumes can be
exported as virtual disks, and flat files can be created on a volume and then exported
as a virtual disk. Again, the most common approach would be to create flat files that
provide partitionable virtual disks to guest domains.
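As a sketch of the first approach, with illustrative names (d10 is an existing Solaris Volume Manager mirror, vol1 an unused volume name, and guest1 an existing domain), an entire volume could be exported like this:

fraser# ldm add-vdsdev /dev/md/dsk/d10 vol1@primary-vds0
fraser# ldm add-vdisk vdisk0 vol1@primary-vds0 guest1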
The Zettabyte File System (ZFS) incorporates volume management and filesystem technology into a single interface that can manage both. Disks can be allocated to a ZFS pool, with the ZFS volume management capabilities establishing replication. ZFS supports mirroring and RAID-Z, which is similar to RAID 5. The most common approach using ZFS is to allocate flat files to guest domains so that they can be partitioned and used as boot disks. Beginning with LDoms version 1.0.3, you can also provide a ZFS filesystem to a guest domain.
If the choice is between Solaris Volume Manager software and hardware RAID,
hardware RAID is the most transparent of the choices. Still, if an IT organization has
standardized on Solaris Volume Manager software, the organization's current best
practices can be directly translated into a Logical Domains environment.
If the choice is between either RAID technology and ZFS, there are some important ZFS
features that help in implementing Logical Domains. Using ZFS snapshots and cloning,
a flat file containing a guest domain's virtual disk can be frozen in time and used as a
backup. One or more snapshots can be taken and used as points in time to which a
Logical Domain can be rolled back. Snapshots can also be used as point-in-time volume
copies and transferred to magnetic tape. Unlike volume copies, snapshots are instant
and space efficient. The only space used by a snapshot is the volume's metadata and
any blocks changed in the primary volume since the snapshot was taken.
ZFS allows read-only snapshots to be turned into clones. A clone is a read/write volume
created from a snapshot that is equally instant and space efficient. Clones can be used
to create virtual disks for a number of Logical Domains and for a number of purposes.
For example:
• A golden master environment can be created and then cloned for use by multiple Logical Domains. Horizontally scaled applications (such as Web servers) are prime candidates for using this technology.
• A single environment can be cloned and used for various project phases. Clones of virtual disks can be used for development, test, and production environments.
• An existing production environment can be cloned and used for testing operating system and application patches before they are put into production.
In all cases, the benefits of ZFS clones are the same: they can be created nearly
instantly, and they use storage efficiently, helping to raise return on investment.
Chapter 3
I/O Reliability And Availability On Sun Platforms
Figure 2. The Sun SPARC Enterprise T2000 server's I/O configuration illustrates how internal and external I/O resources can interface to guest domains through two I/O domains.
The illustration highlights the fact that all internal drives are connected to a single PCI bus, so all internal storage is accessed via one I/O domain. If the server's two PCI buses are each allocated to one of two I/O domains, however, two host bus adapters can be configured to provide access to external storage through two PCI Express slots via two I/O domains. Redundancy and resiliency can then be established in the guest domain with software mirroring such as that provided by Solaris Volume Manager software. The Sun SPARC Enterprise T2000 server is an interesting example here because it is the only server in this series with its built-in Ethernet ports split between two PCI buses. This architecture allows IP Multipathing (IPMP) to be used for highly available network access through two I/O domains using internal ports.
[Table: number of PCI buses, disk controllers per PCI bus, built-in Ethernet ports, and expansion slots for each CoolThreads technology-powered server model]
The number of PCI buses indicates the maximum number of I/O domains that can be
configured. By default, the first I/O domain that you create owns all PCI buses;
creating a second I/O domain requires some additional steps.
Servers with a single PCI bus are limited to using a single I/O domain for all I/O.
Servers with two PCI buses can have two I/O domains. For example, if a second I/O domain is created on a Sun SPARC Enterprise T5240 server, and PCI bus B is allocated to it, the domain can access four built-in Ethernet ports and three PCI Express expansion slots.
All of the servers having two PCI buses connect to at least one expansion slot per bus. This allows PCI Express host bus adapters and/or network interface controllers to be installed for redundant access to storage and/or networks through two I/O domains. This includes the Sun SPARC Enterprise T5240, T5140, and T2000 servers.
The Sun SPARC Enterprise T2000 server is the only product having built-in Ethernet
ports configured across both PCI buses. This server can support two I/O domains,
each connected to two Ethernet ports, allowing IP Multipathing (IPMP) to be
configured in guest domains with no additional hardware required.
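A minimal link-based IPMP sketch inside such a guest, assuming two virtual networks vnet0 and vnet1 (one from each I/O domain) and an illustrative address, is just a matter of placing both interfaces in the same IPMP group:

# /etc/hostname.vnet0
192.168.0.58 group ipmp0 up
# /etc/hostname.vnet1
group ipmp0 up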
Chapter 4
Installing And Configuring Logical Domains
In order to set up Logical Domains on Sun servers with CoolThreads technology, you'll need to install the required set of Logical Domains software, upgrade your server firmware (if necessary), and reboot your server with Logical Domains enabled. These procedures are well described in external documents, and the recommended steps are outlined here:
1. It's a good idea to set up a network install server. With a network install server, you'll be able to quickly and easily install Solaris software into guest domains using JumpStart software, or install your own OS and application combinations using Solaris Flash software. For Logical Domains version 1.0.2 and earlier, a network install server is a requirement, as these versions do not provide access from guest domains to built-in DVD drives. For Logical Domains 1.0.3 and later, you can boot a guest domain from a DVD and install the Solaris OS interactively.

2. If you plan to use hardware RAID, update the firmware in your disk controller if necessary, and set up hardware RAID as described in the next chapter before you proceed to the next step. To obtain the firmware, consult the Product Notes document for your server, available at http://docs.sun.com/app/docs/prod/coolthreads.srvr#hic. You'll find a patch that contains instructions for determining your current firmware version and the latest firmware to install. You'll need to update firmware while booted from the network or from the install CD since this process re-partitions the disks that you configure into RAID sets.

3. Install the Solaris 10 8/07 release or later. This OS version is required on some of the newer CoolThreads technology-enabled servers, and it incorporates patches that are needed with earlier versions. You can check the Product Notes issued with each version of the Solaris OS and LDoms software for the required minimum supported versions of each component.

4. Visit http://www.sun.com/servers/coolthreads/ldoms/get.jsp to obtain the latest Logical Domain Manager software, links to the latest Administration Guide, and the latest Release Notes. LDoms version 1.0.2 works with the Solaris 10 8/07 OS.

5. The Logical Domains Administration Guide describes how to update the system firmware. This is important because the LDoms hypervisor runs in firmware and must be synchronized with the version configured in the OS. Note that the firmware revision number is not related to the Logical Domains version number.

6. Install the Logical Domains Manager following the instructions contained in the package that you downloaded.

7. Set up your control and I/O domain as described in the Sun BluePrints article cited in the sidebar, or as described in the Logical Domains Administration Guide.
Chapter 5
Configuring Hardware RAID
A typical Logical Domains configuration uses one domain to act as the server's control domain and first service and I/O domain. Hardware RAID, available on all Sun servers with CoolThreads technology, is an excellent way to provide a highly reliable boot disk for the combined control, service, and I/O domain. This domain's reliability is of critical importance: if it fails, all I/O to on-board disk drives for all guest domains fails as well.

This chapter describes how to configure disk mirroring for the first two disk drives on a four-disk server. These configuration steps were performed on a Sun SPARC Enterprise T2000 server; however, the commands are the same on other platforms.

In this example, we will combine disks 0 and 1, both Sun 72 GB drives, into a single mirrored RAID set. After the sequence of steps is complete, the format command will display a choice of drives that makes it appear the server has one fewer disk drive: the c1t1d0 drive no longer appears.
In contrast to Solaris Volume Manager software, hardware RAID mirrors only whole
disks at a time, so regardless of how the disk is partitioned, all of its data is mirrored.
The drive appearing at c1t0d0 should be partitioned and the OS installed on it only
after the RAID set is configured by the following steps. Any data previously stored on the
mirrored disks will be lost.
The raidctl command allows disks to be specified using the standard controller-target-disk format (such as c1t0d0), or in an x.y.z format; you can learn the possible x, y, and z values by typing the raidctl command with no arguments. Since we already know the controller-target-disk way of specifying disks, we use this form to create a RAID 1 set that mirrors disks c1t0d0 and c1t1d0:
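A likely form of the command, following the raidctl(1M) syntax for creating a RAID 1 volume from a primary and a secondary disk:

fraser# raidctl -c c1t0d0 c1t1d0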
After asking you to confirm this destructive operation, the raidctl command provides status reports as it configures the RAID set, concluding by telling you the identity of the new volume that it has created:
Creating RAID volume will destroy all data on spare space of member
disks, proceed (yes/no)? y
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Physical disk 0 created.
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Physical disk 1 created.
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Volume 0 created.
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Volume 0 is |enabled||optimal|
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Volume 0 is |enabled||optimal|
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Physical disk (target 1) is |out of sync||online|
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Volume 0 is |enabled||degraded|
/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2 (mpt0):
Volume 0 is |enabled||resyncing||degraded|
Volume c1t0d0 is created successfully!
When you format the disk, it's important to let the format command auto-configure
the disk type before you partition and label the drive. Type the format command and
select disk 0, assuming that you have combined drives 0 and 1 into a single volume.
Use the type command to display the selection of possible disk types:
format> type
Select 0 to auto-configure the drive, and you will see that the format command
recognizes the newly-created RAID set. Then label the disk:
Note that the disk size has been reduced slightly to give the RAID controller room on the disk to store its metadata. Now you can partition the disk as part of the OS install process, and at any time you can type raidctl -l to see the status of the RAID set. The output from this command, issued just after the RAID set was created, indicates that the controller is synchronizing the two drives. When the synchronization is complete, you'll see the status change to OPTIMAL.
# raidctl -l c1t0d0
Volume                  Size    Stripe  Status   Cache  RAID
        Sub                     Size                    Level
                Disk
----------------------------------------------------------------
c1t0d0                  68.3G   N/A     SYNC     N/A    RAID1
        0.0.0           68.3G            GOOD
        0.1.0           68.3G            GOOD
After you've installed the Solaris 10 OS on the RAID set, you need to install and configure the Logical Domains software. You create a virtual disk service and virtual switch that provide storage and networking to guest domains. You can create flat files whose contents are mirrored, connect them as boot disks to guest domains, and then use your network install server to install the Solaris OS in each domain. This example illustrates creating a guest domain guest2 with 2 virtual CPUs, 2 GB of main memory, and an 8 GB boot disk. The guest domain's network is connected to the existing virtual switch named primary-vsw0. The domain is set not to boot automatically.
fraser# mkfile 8g /domains/s2image.img
fraser# ldm create guest2
fraser# ldm set-vcpu 2 guest2
fraser# ldm set-mem 2g guest2
fraser# ldm add-vnet vnet0 primary-vsw0 guest2
fraser# ldm add-vdsdev /domains/s2image.img vol2@primary-vds0
fraser# ldm add-vdisk vdisk0 vol2@primary-vds0 guest2
fraser# ldm bind guest2
fraser# ldm set-variable auto-boot\?=false guest2
Assuming that you plan to do a network install into this domain, one of the key pieces of information that you'll need to set up JumpStart software is the domain's network MAC address on the virtual switch connecting the server to the physical network:
fraser# ldm list-bindings guest2
NAME    STATE  FLAGS  CONS  VCPU  MEMORY  UTIL  UPTIME
guest2  bound  -----  5001  2     2G

MAC
    00:14:4f:fb:08:50

VCPU
    VID  PID  UTIL  STRAND
    0    8           100%
    1    9           100%

MEMORY
    RA         PA           SIZE
    0x8000000  0x188000000  2G

VARIABLES
    auto-boot?=false

NETWORK
    NAME   SERVICE               DEVICE     MAC
    vnet0  primary-vsw0@primary  network@0  00:14:4f:fa:2b:e4
        PEER                  MAC
        primary-vsw0@primary  00:14:4f:fb:ae:95
        vnet0@guest1          00:14:4f:fb:58:90

DISK
    NAME    VOLUME             TOUT  DEVICE  SERVER
    vdisk0  vol2@primary-vds0        disk@0  primary

VCONS
    NAME    SERVICE              PORT
    guest2  primary-vc0@primary  5001
The MAC address you'll need to use is 00:14:4f:fa:2b:e4, shown in the NETWORK section of the output above. This is the MAC address for the server that appears on the virtual network vnet0. Note that this is not the MAC address listed under the domain's top-level MAC heading.
The list-bindings command indicates that the domain's console is located on port 5001. You can reach the console on that port via telnet localhost 5001, and begin the network install. The network install proceeds just as it does on a physical server; examples of JumpStart software configuration files begin with Rules File Setup.
Chapter 6
Configuring ZFS In The I/O Domain
1. Copy the boot disk image for a working configuration into a ZFS filesystem.
[Figure: the boot image in the ZFS filesystem zfspool/source is snapshotted as zfspool/source@initial and cloned to zfspool/source-unconfig; after the image is unconfigured, the snapshot zfspool/source-unconfig@initial is cloned (a clone of a clone) to create zfspool/guest3 and zfspool/guest4.]
2. Create a snapshot and a clone that you then unconfigure. This step preserves the original boot disk image so that you can use it again for other purposes or roll back to it in case you make a mistake.

3. Create a snapshot of the filesystem containing the unconfigured image and make multiple clones of it, illustrating that you can make clones of clones.
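In ZFS terms, steps 2 and 3 each boil down to a snapshot followed by one or more clones. Using the filesystem names from the figure, step 2 looks like this (the full sequence is developed in the sections that follow):

fraser# zfs snapshot zfspool/source@initial
fraser# zfs clone zfspool/source@initial zfspool/source-unconfig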
In our case, since we want to save some physical disk partitions for future use, we allocate two disk partitions to a ZFS pool. We first use the format command to print the partition table on drive 3, c1t2d0:
Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 -  1030       10.01GB    (1031/0/0)    20982912
  1       swap    wu    1031 -  1237        2.01GB    (207/0/0)      4212864
  2     backup    wu       0 - 14086      136.71GB    (14087/0/0)  286698624
  3 unassigned    wm       0                  0       (0/0/0)              0
  4 unassigned    wm       0                  0       (0/0/0)              0
  5 unassigned    wm       0                  0       (0/0/0)              0
  6        usr    wm    1238 - 14086      124.69GB    (12849/0/0)  261502848
  7 unassigned    wm       0                  0       (0/0/0)              0
Just to make sure that the fourth drive is partitioned exactly the same, we use a powerful (but potentially dangerous) set of commands to copy drive 3's partitioning to drive 4. This set of commands makes the task easy, but be sure that your arguments are correct, as it's a very fast way to make existing data inaccessible if you accidentally partition the wrong drive:
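A likely form of these commands, using the standard Solaris idiom of prtvtoc piped into fmthard:

fraser# prtvtoc /dev/rdsk/c1t2d0s2 | fmthard -s - /dev/rdsk/c1t3d0s2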
Now we create a ZFS pool called zfspool that uses the usr slice of each drive:
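A plausible form of the pool creation, assuming a mirrored pool (consistent with the roughly 114 GB of available space reported by zfs list below):

fraser# zpool create zfspool mirror c1t2d0s6 c1t3d0s6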
Now we create a ZFS filesystem named source in which we can store files that are
then provided to guest domains as virtual disks:
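This is presumably just:

fraser# zfs create zfspool/source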
We copy a flat file that contains the boot disk for another guest domain into the source filesystem. The source file we'll work with from here on is source.img. Using the zfs list command, note that we are currently using 8 GB for the image, which is the only file stored in the ZFS pool:
NAME            USED   AVAIL  REFER  MOUNTPOINT
zfspool         8.00G   114G  25.5K  /zfspool
zfspool/source  8.00G   114G  8.00G  /zfspool/source
Note that the clone of the 8 GB disk image uses 22.5 KB of storage space since all of its
blocks simply reference the existing zfspool/source filesystem:
NAME                     USED   AVAIL  REFER  MOUNTPOINT
zfspool                  8.00G   114G  27.5K  /zfspool
zfspool/source           8.00G   114G  8.00G  /zfspool/source
zfspool/source@initial       0      -  8.00G  -
zfspool/source-unconfig  22.5K   114G  8.00G  /zfspool/source-unconfig
The clone of source.img is the disk image that we'll unconfigure so that it no longer has a system identity as defined by host name, IP address, time zone, and so on. Since we'll only be using this file temporarily, let's hijack an existing Logical Domain by attaching this file as its boot disk just while we do the unconfigure step. In this example, we'll first detach the guest's existing virtual disk device named vol2@primary-vds0, and add a new virtual disk device, the source.img file from our cloned filesystem, under the same name, vol2@primary-vds0. We start with guest2 in the stopped state and then start it once we've switched its disks:
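A sketch of the disk swap, assuming the clone is mounted at /zfspool/source-unconfig and the guest's virtual disk is named vdisk0 as in the earlier example:

fraser# ldm unbind guest2
fraser# ldm remove-vdisk vdisk0 guest2
fraser# ldm remove-vdsdev vol2@primary-vds0
fraser# ldm add-vdsdev /zfspool/source-unconfig/source.img vol2@primary-vds0
fraser# ldm add-vdisk vdisk0 vol2@primary-vds0 guest2
fraser# ldm bind guest2
fraser# ldm start guest2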
Now we telnet to the correct port number on localhost for reaching guest2's console, boot the domain if necessary, and run the sys-unconfig command. If we do anything unintended, we can always delete the clone and revert to the copy of source.img in the zfspool/source filesystem.
# sys-unconfig
WARNING
This program will unconfigure your system. It will cause it
to revert to a "blank" system - it will not have a name or know
about other systems or networks.
This program will also halt the system.
Do you want to continue (y/n) ? y
The sys-unconfig command shuts down the system, and now we can restore guest2's original disk image.

Here are the steps to create guest3, a Logical Domain with two virtual CPUs, 2 GB of main memory, and a boot disk whose image is source.img contained in the ZFS filesystem zfspool/guest3. Since we aren't running a production environment, we set the domain not to boot automatically:
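A sketch of those steps, mirroring the guest2 example (the snapshot and clone names come from the figure; vol3 is an illustrative volume name):

fraser# zfs snapshot zfspool/source-unconfig@initial
fraser# zfs clone zfspool/source-unconfig@initial zfspool/guest3
fraser# ldm create guest3
fraser# ldm set-vcpu 2 guest3
fraser# ldm set-mem 2g guest3
fraser# ldm add-vnet vnet0 primary-vsw0 guest3
fraser# ldm add-vdsdev /zfspool/guest3/source.img vol3@primary-vds0
fraser# ldm add-vdisk vdisk0 vol3@primary-vds0 guest3
fraser# ldm set-variable auto-boot\?=false guest3
fraser# ldm bind guest3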
Now you can use ldm start to start the logical domain. If guest3 is the third logical domain you've created, you can telnet to localhost port 5002 to connect to the console and boot the domain. Once booted, the system configuration process starts, and you will be asked to provide system identity information just as if you had opened up a new server out of the box. In a real environment, you would have a new instance of a golden master application environment that you had configured in the source Logical Domain.

The space savings from using clones are enormous. The zfs list command shows that, after creating two new domains and configuring them, we've used only approximately 50 MB per domain:
NAME                              USED   AVAIL  REFER  MOUNTPOINT
zfspool                           8.13G   114G  29.5K  /zfspool
zfspool/guest3                    50.4M   114G  8.00G  /zfspool/guest3
zfspool/guest4                    51.7M   114G  8.00G  /zfspool/guest4
zfspool/source                    8.00G   114G  8.00G  /zfspool/source
zfspool/source@initial                0      -  8.00G  -
zfspool/source-unconfig           31.3M   114G  8.00G  /zfspool/source-unconfig
zfspool/source-unconfig@initial       0      -  8.00G  -
Chapter 7
Configuring Volume Management In Guest Domains
The third architectural level at which I/O redundancy can be configured is within the guest domain itself. Volume management software, such as Solaris Volume Manager software or ZFS, manages multiple virtual disks and performs the necessary replication from the guest domain. Of the two choices, Solaris Volume Manager software is the one that supports mirrored boot disks, and it is the one discussed here.
In cases where two I/O domains are used to support multiple paths to external storage,
volume management in the guest domain is the preferred approach because there is
no single point of failure along the communication path other than the server itself. In
the case of a single I/O domain accessing internal disk storage, the topic of this article,
replicating storage within the guest domain is less efficient because it imposes the
overhead of dual virtual-to-physical paths in the guest. Every disk write must traverse
the path to the virtual device twice, whereas configuring the replication in the I/O
domain puts the I/O closer to the physical device, which is more efficient.
Nevertheless, if your existing best practices dictate using volume management in the guest OS, or if you need to create a development environment that re-creates a real one, you can use Solaris Volume Manager software and ZFS in guest domains just as you do on a physical system. One datacenter best practice is to establish disk mirroring during an install with JumpStart software, so this section illustrates how to set up root disk mirroring with Solaris Volume Manager software as part of the network install process. This section assumes that you're familiar with the process of setting up JumpStart software, so the process we describe is reduced to the essentials.
In this example, we mirror virtual disks whose backing files reside in a ZFS pool that is already mirrored. This is an inefficient thing to do; however, for the purposes of this example, the commands are correct.
We use the Logical Domain guest4, which is already set up using a cloned disk from the previous section. We create a second flat file as a mirror, and a third flat file as the quorum disk for Solaris Volume Manager software. Then we save the configuration so that it will survive a power cycle of the server:
fraser# mkfile 8g /zfspool/guest4/mirror.img
fraser# mkfile 16M /domains/guest4quorum.img
fraser# ldm stop guest4
fraser# ldm unbind guest4
fraser# ldm add-vdsdev /zfspool/guest4/mirror.img vol5@primary-vds0
fraser# ldm add-vdisk vdisk1 vol5@primary-vds0 guest4
fraser# ldm add-vdsdev /domains/guest4quorum.img vol6@primary-vds0
fraser# ldm add-vdisk vdisk2 vol6@primary-vds0 guest4
fraser# ldm bind guest4
fraser# ldm remove-spconfig fourguests
fraser# ldm add-spconfig fourguests
Connect to the domain's console and verify that it has three disks attached:
{0} ok show-disks
a) /virtual-devices@100/channel-devices@200/disk@2
b) /virtual-devices@100/channel-devices@200/disk@1
c) /virtual-devices@100/channel-devices@200/disk@0
q) NO SELECTION
Enter Selection, q to quit: q
Profile File Setup

#
# JumpStart profile to install a Solaris Flash archive in an LDom
# Use a third, small virtual disk as a quorum disk for SVM
#
install_type            flash_install
archive_location        nfs js:/jumpstart/Archives/s4mirror.flar
#
partitioning            explicit
#
# Mirror root and swap
#
filesys mirror:d10 c0d0s0 c0d1s0 free /
filesys mirror:d20 c0d0s1 c0d1s1 1024 swap
#
# Create three sets of state databases so there should
# always be a quorum of databases if a single disk fails.
# Two databases per volume for media redundancy.
#
metadb c0d0s7 size 8192 count 2
metadb c0d1s7 size 8192 count 2
metadb c0d2s2 size 8192 count 2
Sysidcfg File Setup

timezone=US/Mountain
timeserver=localhost
root_password=678B3hGfBa01U
name_service=DNS {domain_name=loneagle.com search=loneagle.com
name_server=ins(192.168.0.134)}
network_interface=vnet0 {protocol_ipv6=no hostname=s4
ip_address=192.168.0.58 default_route=192.168.0.1
netmask=255.255.255.192}
security_policy=NONE
system_locale=en_US
terminal=xterms
keyboard=US-English
service_profile=open
nfs4_domain=loneagle.com
Install

Install the Solaris OS, or in this case the Solaris Flash archive, by attaching to the logical domain's console and issuing the command:
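The command itself is not reproduced above; from the guest's OpenBoot prompt, a network install is typically started with (assuming the net alias points at the domain's vnet device):

{0} ok boot net - install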
The installation process will set up the mirrored root and swap filesystems. During the install, you may see an error message while the state replicas are created, and you may also see multiple errors at boot time. These relate to known issues and do not interfere with the operation of Solaris Volume Manager software.
Housekeeping
You will need to do the normal housekeeping required when setting up a mirrored root partition with Solaris Volume Manager software, including setting up a boot block on the second disk in the mirror, and setting up alternate boot devices in OpenBoot software.
Install a boot block in the second disk of the mirror from within the guest domain so
that you can boot from the second disk in the event that the first drive fails:
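A likely form of the command, using the standard installboot(1M) procedure and the second disk's root slice from the profile above (the s4# prompt reflects the guest's host name):

s4# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0d1s0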
Then, at the OpenBoot software prompt, create device aliases for your primary and
secondary boot disks and set your boot-device to boot from them in order:
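A sketch using the device paths from the show-disks output above (the alias names are illustrative):

{0} ok nvalias rootdisk /virtual-devices@100/channel-devices@200/disk@0
{0} ok nvalias rootmirror /virtual-devices@100/channel-devices@200/disk@1
{0} ok setenv boot-device rootdisk rootmirror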
Test
You can test your configuration by removing disks virtually; shut down the guest
domain and remove one of the disks. In this example, we remove disk1:
fraser# ldm unbind guest4
fraser# ldm remove-vdisk vdisk1 guest4
fraser# ldm bind guest4
fraser# ldm start guest4
You can verify from the OpenBoot software that disk1 is missing:
{0} ok show-disks
a) /virtual-devices@100/channel-devices@200/disk@2
b) /virtual-devices@100/channel-devices@200/disk@0
The domain will boot normally because it has a quorum of state databases. After
testing, you can replace the failed disk and reboot with both disks available:
fraser# ldm unbind guest4
fraser# ldm add-vdisk vdisk1 vol5@primary-vds0 guest4
fraser# ldm bind guest4
fraser# ldm start guest4
LDom guest4 started
Now that you've restored the drive whose failure was simulated, the metastat command executed in the guest domain indicates that you need to resync the mirrors. You can do so with the metareplace command:
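Assuming the d10 and d20 mirrors from the profile above and that the second disk (c0d1) was the one removed, the resync would look like:

s4# metareplace -e d10 c0d1s0
s4# metareplace -e d20 c0d1s1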
Chapter 8
Conclusion
The virtual world enabled by Logical Domains on Sun servers with CoolThreads
technology opens up a broad range of possibilities for organizations needing to
implement data reliability and availability mechanisms to match those on physical
servers. Data availability can be increased by configuring multiple paths to network or
disk I/O resources. Data reliability can be increased by configuring redundant storage
media through mirroring and RAID techniques.
The purpose of this Sun BluePrints series article is to address reliability techniques
using Logical Domains and the internal disk resources of the Sun servers on which they
run. Because every Sun server with CoolThreads technology is configured with a single
controller for on-board disk storage, the techniques available boil down to
implementing mirroring and RAID at one of three architectural levels:
• In the hardware, using the disk controller itself,
• In the I/O domain, using Solaris Volume Manager or ZFS software, or
• In the guest domain, using Solaris Volume Manager or ZFS software.
Although our recommendations are to implement redundancy at as low a level as
possible given the limitation of using a single I/O domain, there are plenty of reasons to
use higher architectural levels, including the following:
• The Zettabyte File System integrates volume management and filesystems that support mirroring and RAID, as well as snapshots and clones. These mechanisms can be used to create new Logical Domains for development and testing, or for implementing a number of identical Logical Domains for horizontal scaling. Snapshots can be used for backups, and they can be used to create point-in-time copies that can be moved to remote locations for disaster-recovery purposes. What's best about ZFS is that snapshots and clones are created almost instantly, and their copy-on-write semantics make efficient use of storage resources.

• Using Solaris Volume Manager or ZFS software in guest domains allows existing physical configurations to be re-created almost exactly in a virtual world. Instead of these volume management mechanisms accessing physical disks, they access virtual disks that are supported underneath with flat files, physical drives, or disk partitions. The ability to re-create the world of physical servers so faithfully in the virtual world allows IT organizations to continue using their existing best practices. It also provides a low-cost, low-impact test bed for creating disk failure scenarios and validating the procedures for handling them.
This article is the first in a series designed to address the issues of translating I/O reliability and availability practices from the physical world into the world of Logical Domains. Future articles may address high availability network and disk I/O configurations using multiple PCI buses and the trade-offs of implementing the techniques at various architectural levels.
Acknowledgments
The author would like to thank Steve Gaede, an independent technical writer and
engineer, for working through the issues and configurations described in this paper and
developing this article based on them. Thanks also to Maciej Browarski, Alexandre
Chartre, James MacFarlane, and Narayan Venkat for their help in understanding some
of the nuances of using Solaris Volume Manager software.
References
References to relevant Sun publications are provided in sidebars throughout the article.
Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2008 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, BluePrints, CoolThreads, JumpStart, Netra, OpenBoot, Solaris, and SunDocs are trademarks or registered trademarks of Sun
Microsystems, Inc. or its subsidiaries in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc. Information subject to change without notice.
Printed in USA 6/2008