
Eserver pSeries

Section 2: The Technology

"Any sufficiently advanced technology will


have the appearance of magic."
…Arthur C. Clarke

© 2003 IBM Corporation


^Eserver pSeries

Section Objectives

 On completion of this unit you should be able to:


– Describe the relationship between technology and solutions.
– List key IBM technologies that are part of the POWER5 products.
– Describe the functional benefits that these technologies provide.
– Discuss the appropriate use of these technologies.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

IBM and Technology

[Figure: a layered stack — Science at the base, then Technology, then Products, with Solutions at the top.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Technology and innovation

 Having technology available is a necessary first step.
 Finding creative new ways to use the technology for the benefit of our clients is what innovation is about.
 Solution design is an opportunity for innovative application of technology.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

When technology won’t ‘fix’ the problem

 When the technology is not related to the problem.


 When the client has unreasonable expectations.

Concepts of Solution Design © 2003 IBM Corporation


Eserver pSeries

POWER5 Technology

© 2003 IBM Corporation


^Eserver pSeries

POWER4 and POWER5 Cores


[Figure: die photographs of the POWER4 core and the POWER5 core, side by side.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER5

 Designed for entry and high-end servers
 Enhanced memory subsystem
 Improved performance
 Simultaneous Multi-Threading
 Hardware support for Shared Processor Partitions (Micro-Partitioning)
 Dynamic power management
 Compatibility with existing POWER4 systems
 Enhanced reliability, availability, serviceability

[Figure: POWER5 chip layout — two SMT cores, 1.9 MB L2 cache, L3 directory, memory controller, enhanced distributed switch, GX+ bus, and chip-chip/MCM-MCM/SMP links.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Enhanced memory subsystem


 Improved L1 cache design
– 2-way set associative i-cache
– 4-way set associative d-cache
– New replacement algorithm (LRU vs. FIFO)
 Larger L2 cache
– 1.9 MB, 10-way set associative
 Improved L3 cache design
– 36 MB, 12-way set associative
– L3 on the processor side of the fabric
– Satisfies L2 cache misses more frequently
– Avoids traffic on the interchip fabric
 On-chip L3 directory and memory controller
– L3 directory on the chip reduces off-chip delays after an L2 miss
– Reduced memory latencies
 Improved pre-fetch algorithms

[Figure: POWER5 chip layout — two SMT cores, 1.9 MB L2 cache, L3 directory, memory controller, and chip-chip/MCM-MCM/SMP links.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Enhanced memory subsystem


[Figure: POWER4 versus POWER5 system structure. In POWER5 the L3 cache moves to the processor side of the fabric and the memory controller moves on chip, giving reduced L3 latency, faster access to memory, larger SMPs (up to 64-way), and half the number of chips per system.]
Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Simultaneous Multi-Threading (SMT)

 What is it?
 Why would I want it?

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER4 pipeline

[Figure: POWER4 instruction pipeline — branch, load/store, fixed-point, and floating-point pipelines with out-of-order processing, branch redirects, and interrupt/flush paths. Legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit. Compare with the POWER5 pipeline later in this section.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Multi-threading evolution

 Execution unit utilization is low in today's microprocessors
 Roughly 25% average execution unit utilization across a broad spectrum of environments

[Figure: single-threaded execution — one instruction stream feeding the FX0, FX1, LS0, LS1, FP0, FP1, BFX, and CRL execution units, with many idle slots across processor cycles.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Coarse-grained multi-threading

 Two instruction streams, but only one thread executes at any instant
 Hardware swaps in the second thread when a long-latency event occurs
 A swap requires several cycles

[Figure: coarse-grained multi-threading — two instruction streams alternate on the execution units, with several-cycle swap gaps at each thread switch.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Coarse-grained multi-threading (Cont.)

 Processor (for example, RS64-IV) is able to store context for two threads
– Rapid switching between threads minimizes lost cycles due
to I/O waits and cache misses.
– Can yield ~20% improvement for OLTP workloads.
 Coarse-grained multi-threading only beneficial where
number of active threads exceeds 2x number of CPUs
– AIX must create a “dummy” thread if there are insufficient
numbers of real threads.
• Unnecessary switches to “dummy” threads can degrade
performance ~20%
• Does not work with dynamic CPU deallocation

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Fine-grained multi-threading

 Variant of coarse-grained multi-threading
 Threads execute in round-robin fashion
 A cycle remains unused when a thread encounters a long-latency event

[Figure: fine-grained multi-threading — the two instruction streams alternate cycle by cycle on the execution units; slots still go idle during long-latency events.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER5 pipeline

[Figure: POWER5 instruction pipeline — branch, load/store, fixed-point, and floating-point pipelines with out-of-order processing, branch redirects, and interrupt/flush paths; the structure matches the POWER4 pipeline shown earlier. Legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Simultaneous multi-threading (SMT)

 Reduction in unused execution units results in a 25-40% performance boost, and sometimes more

[Figure: simultaneous multi-threading — instructions from two threads share the FX0, FX1, LS0, LS1, FP0, FP1, BFX, and CRL execution units in the same cycle, leaving far fewer idle slots.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Simultaneous multi-threading (SMT) (Cont.)

 Each chip appears as a 4-way SMP to software


– Allows instructions from two threads to execute simultaneously
 Processor resources optimized for enhanced SMT
performance
– No context switching, no dummy threads
 Hardware, POWER Hypervisor, or OS controlled thread
priority
– Dynamic feedback of shared resources allows for balanced
thread execution
 Dynamic switching between single and multithreaded mode

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Dynamic resource balancing

 Threads share many resources
– Global Completion Table, Branch History Table, Translation Lookaside Buffer, and so on
 Higher performance is realized when resources are balanced across threads
– A tendency to drift toward extremes is accompanied by reduced performance

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Adjustable thread priority


 Instances when unbalanced execution is desirable
– No work for the opposite thread
– Thread waiting on a lock
– Software-determined non-uniform balance
– Power management (power save mode)
 Control of the instruction decode rate
– Software/hardware controls eight priority levels for each thread

[Figure: instructions per cycle for thread 0 and thread 1 as the hardware thread priority pair varies from (0,7) through (7,7) to (7,0), including single-threaded operation and power save mode (1,1).]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Single-threaded operation

 Advantageous for execution-unit-limited applications
– Floating-point or fixed-point intensive workloads
 Execution-unit-limited applications provide minimal performance leverage for SMT
– The extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread
 Determined dynamically on a per-processor basis

[Figure: thread states — a thread moves between Null, Dormant, and Active; software controls the Null and Dormant transitions, while hardware or software can move a thread between Dormant and Active.]
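As a point of reference (not from the original slides), AIX 5L V5.3 exposes the single-threaded versus SMT choice through the smtctl command; a minimal sketch, assuming an AIX 5.3 partition:

  smtctl                 # report the current SMT mode and logical processor count
  smtctl -m off -w now   # switch the partition to single-threaded mode immediately
  smtctl -m on -w boot   # re-enable SMT at the next reboot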

Concepts of Solution Design © 2003 IBM Corporation


Eserver pSeries

Micro-Partitioning

© 2003 IBM Corporation


^Eserver pSeries

Micro-Partitioning overview

 Mainframe-inspired technology
 Virtualized resources shared by multiple partitions
 Benefits
– Finer grained resource allocation
– More partitions (Up to 254)
– Higher resource utilization
 New partitioning model
– POWER Hypervisor
– Virtual processors
– Fractional processor capacity partitions
– Operating system optimized for Micro-Partitioning exploitation
– Virtual I/O

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Processor terminology
[Figure: processor terminology — installed physical processors divide into deconfigured, inactive (CUoD), dedicated, and shared processors; the shared processors form the shared processor pool. Shared processor partitions draw entitled capacity from the pool and present virtual processors to the operating system; with SMT on, each virtual or dedicated processor appears as two logical processors. The example shows two shared processor partitions (SMT off and SMT on) and one dedicated processor partition (SMT off).]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Shared processor partitions

 Micro-Partitioning allows multiple partitions to share one physical processor
 Up to 10 partitions per physical processor
 Up to 254 partitions active at the same time
 Partition's resource definition
– Minimum, desired, and maximum values for each resource
– Processor capacity
– Virtual processors
– Capped or uncapped
• Capacity weight
– Dedicated memory
• Minimum of 128 MB, allocated in 16 MB increments
– Physical or virtual I/O resources

[Figure: six LPARs (LPAR 1 through LPAR 6) sharing the physical processors of the shared pool.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Understanding min/max/desired resource values

 The desired value for a resource is given to a partition if enough resource is available.
 If there is not enough resource to meet the desired
value, then a lower amount is allocated.
 If there is not enough resource to meet the min
value, the partition will not start.
 The maximum value is only used as an upper limit
for dynamic partitioning operations.
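A small worked example of these rules (the numbers are illustrative, not from the course material): suppose a partition is defined with min = 1.0, desired = 2.0, and max = 4.0 processing units. With 3.0 units free it starts with 2.0; with 1.5 units free it starts with 1.5; with 0.5 units free it does not start at all; and once running, dynamic partitioning operations can grow it only up to 4.0.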

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Partition capacity entitlement

 Processing units
– 1.0 processing unit represents one physical processor
 Entitled processor capacity
– Commitment of capacity that is reserved for the partition
– Sets the upper limit of processor utilization for capped partitions
– Each virtual processor must be granted at least 1/10 of a processing unit of entitlement
 Shared processor capacity is always delivered in terms of whole physical processors

[Figure: one physical processor (1.0 processing units) divided into entitlements of 0.5 and 0.4 processing units, plus the 0.1 processing unit minimum requirement.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Capped and uncapped partitions

 Capped partition
– Not allowed to exceed its entitlement
 Uncapped partition
– Is allowed to exceed its entitlement
 Capacity weight
– Used for prioritizing uncapped partitions
– Value 0-255
– Value of 0 referred to as a “soft cap”
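To illustrate the weight (an illustrative example, not from the slides): if two uncapped partitions with weights 100 and 200 are both demanding spare capacity from the pool, the spare cycles are divided between them in roughly a 1:2 ratio; a weight of 0 means the partition receives no capacity beyond its entitlement, which is why it is described as a soft cap.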

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Partition capacity entitlement example

 Shared pool has 2.0 processing units available
 LPARs activated in sequence
 Partition 1 activated
– Min = 1.0, max = 2.0, desired = 1.5
– Starts with 1.5 allocated processing units
 Partition 2 activated
– Min = 1.0, max = 2.0, desired = 1.0
– Does not start
 Partition 3 activated
– Min = 0.1, max = 1.0, desired = 0.8
– Starts with 0.5 allocated processing units

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Understanding capacity allocation – An example

 A workload is run under different configurations.
 The size of the shared pool (number of physical processors) is fixed at 16.
 The capacity entitlement for the partition is fixed at 9.5.
 No other partitions are active.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Uncapped – 16 virtual processors

[Chart: Uncapped (16 PPs / 16 VPs / 9.5 CE) — processing units consumed versus elapsed time (minutes 1-30).]

 16 virtual processors.
 Uncapped.
 Can use all available resource.
 The workload requires 26 minutes to complete.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Uncapped – 12 virtual processors

[Chart: Uncapped (16 PPs / 12 VPs / 9.5 CE) — processing units consumed versus elapsed time (minutes 1-30).]

 12 virtual processors.
 Even though the partition is uncapped, it can only use 12
processing units.
 The workload now requires 27 minutes to complete.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Capped

[Chart: Capped (16 PPs / 12 VPs / 9.5 CE) — processing units consumed versus elapsed time (minutes 1-30).]

 The partition is now capped and resource utilization is limited to the capacity entitlement of 9.5.
– Capping limits the amount of time each virtual processor is
scheduled.
– The workload now requires 28 minutes to complete.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Dynamic partitioning operations

 Add, move, or remove processor capacity


– Remove, move, or add entitled shared processor capacity
– Change between capped and uncapped processing
– Change the weight of an uncapped partition
– Add and remove virtual processors
• Provided CE / VPs ≥ 0.1 (each virtual processor keeps at least 0.1 processing units)
 Add, move, or remove memory
– 16 MB logical memory block
 Add, move, or remove physical I/O adapter slots
 Add or remove virtual I/O adapter slots
 Min/max values defined for LPARs set the bounds within which
DLPAR can work
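For orientation, these operations are typically driven from the HMC command line; a brief sketch, assuming a POWER5 HMC, a managed system named sys1, and a partition named lpar1 (all names and quantities are illustrative):

  chhwres -m sys1 -r proc -o a -p lpar1 --procunits 0.5   # add 0.5 processing units to lpar1
  chhwres -m sys1 -r mem -o a -p lpar1 -q 256             # add 256 MB (sixteen 16 MB logical memory blocks)

The same chhwres command with -o r removes resources and with -o m moves them between partitions.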

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Dynamic LPAR

Standard on all new systems

[Figure: a system divided into four partitions — Part#1 Production, Part#2 Legacy Apps, Part#3 File/Print, Part#4 Test/Dev — running AIX 5L and Linux on top of the Hypervisor and managed from an HMC; resources can be moved between live AIX partitions.]

Concepts of Solution Design © 2003 IBM Corporation


Eserver pSeries

Firmware

POWER Hypervisor

© 2003 IBM Corporation


^Eserver pSeries

POWER Hypervisor strategy

 New Hypervisor for POWER5 systems


– Further convergence with iSeries
– But brands will retain unique value propositions
– Reduced development effort
– Faster time to market
 New capabilities on pSeries servers
– Shared processor partitions
– Virtual I/O
 New capability on iSeries servers
– Can run AIX 5L

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER Hypervisor component sourcing


[Figure: POWER Hypervisor component sourcing — pSeries contributes the H-Call interface; iSeries contributes location codes, the nucleus (SLIC), and virtual I/O; shared components include load from flash, bus recovery, dump, drawer and slot/tower concurrent maintenance, message passing, 255 partitions, shared processor LPAR, NVRAM, partition on demand, Capacity on Demand, I/O configuration, and virtual Ethernet. The FSP, the SCSI, LAN, and VLAN IOAs, and the HMC connect over the service VLAN.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER Hypervisor functions


 Same functions as the POWER4 Hypervisor
– Dynamic LPAR
– Capacity Upgrade on Demand
 New, active functions
– Dynamic Micro-Partitioning
– Shared processor pool
– Virtual I/O
– Virtual LAN
 The machine is always in LPAR mode
– Even with all resources dedicated to one OS

[Figure: a POWER5 system running dynamic Micro-Partitioning over a shared processor pool, with virtual I/O to shared disk and LAN adapters, dynamic LPAR, and Capacity Upgrade on Demand tracking planned versus actual client capacity growth.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER Hypervisor implementation

 Design enhancements to the previous POWER4 implementation enable the sharing of processors by multiple partitions
– Hypervisor decrementer (HDECR)
– New Processor Utilization Resource Register (PURR)
– Refined virtual processor objects
• Do not include the physical characteristics of the processor
– New Hypervisor calls

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER Hypervisor processor dispatch


 Manages the set of processors in the machine's shared processor pool.
 POWER5 generates a 10 ms dispatch window.
– Minimum allocation is 1 ms per physical processor.
 Each virtual processor is guaranteed its entitled share of processor cycles during each 10 ms dispatch window.
– ms per VP = CE * 10 / number of VPs
 The partition entitlement is evenly distributed among the online virtual processors.
 Once a capped partition has received its CE within a dispatch interval, it becomes not-runnable.
 A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval.

[Figure: the POWER Hypervisor's processor dispatch maps the virtual processor capacity entitlements of six shared processor partitions onto a four-processor (two-chip POWER5) shared processor pool.]
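A quick worked example of the ms-per-VP relationship above (figures are illustrative): a partition with a capacity entitlement of 0.8 processing units spread over 2 virtual processors is guaranteed 0.8 × 10 ms = 8 ms of physical processor time per 10 ms dispatch window, or 4 ms per virtual processor; spreading the same entitlement over 4 virtual processors would guarantee 2 ms per virtual processor per window.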

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Dispatching and interrupt latencies

 Virtual processors have dispatch latency.


 Dispatch latency is the time between a virtual
processor becoming runnable and being actually
dispatched.
 Timers have latency issues also.
 External interrupts have latency issues also.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Shared processor pool


 Processors not associated with dedicated processor partitions.
 No fixed relationship between virtual processors and physical processors.
 The POWER Hypervisor attempts to use the same physical processor.
– Affinity scheduling
– Home node

[Figure: the POWER Hypervisor's processor dispatch spreads the virtual processor capacity entitlements of six shared processor partitions across a four-processor shared processor pool (CPU 0 through CPU 3).]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Affinity scheduling

 When dispatching a VP, the POWER Hypervisor attempts to


preserve affinity by using:
– Same physical processor as before, or
– Same chip, or
– Same MCM
 When a physical processor becomes idle, the POWER
Hypervisor looks for a runnable VP that:
– Has affinity for it, or
– Has no affinity, or
– Is uncapped
 Similar to AIX affinity scheduling

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Operating system support

 Micro-Partitioning capable operating systems need to be modified


to cede a virtual processor when they have no runnable work
– Failure to do this results in wasted CPU resources
• For example, a partition spends its CE waiting for I/O
– Results in better utilization of the pool
 May confer the remainder of their timeslice to another VP
– For example, a VP holding a lock
 Can be redispatched if they become runnable again during the
same dispatch interval

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Example
[Figure: two physical processors dispatching the virtual processors of LPAR 1, LPAR 2, and LPAR 3 across two consecutive 10 ms POWER Hypervisor dispatch intervals, with idle gaps once a partition's entitlement is consumed.]

LPAR1
Capacity entitlement = 0.8 processing units; virtual processors = 2 (capped)
LPAR2
Capacity entitlement = 0.2 processing units; virtual processors = 1 (capped)
LPAR3
Capacity entitlement = 0.6 processing units; virtual processors = 3 (capped)

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

POWER Hypervisor and virtual I/O

 I/O operations without dedicating resources to an individual


partition
 POWER Hypervisor’s virtual I/O related operations
– Provide control and configuration structures for virtual adapter images
required by the logical partitions
– Operations that allow partitions controlled and secure access to
physical I/O adapters in a different partition
– The POWER Hypervisor does not own any physical I/O devices; they
are owned by an I/O hosting partition
 I/O types supported
– SCSI
– Ethernet
– Serial console

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Performance monitoring and accounting

 CPU utilization is measured against CE.


– An uncapped partition receiving more than its CE will record 100% but
will be using more.
 SMT
– Thread priorities compound the variable speed rate.
– Twice as many logical CPUs.
 For accounting, time intervals may be allocated incorrectly.
– New hardware support is required.
 Processor utilization register (PURR) records actual clock ticks spent
executing a partition.
– Used by performance commands (for example, new flags) and
accounting modules.
– Third party tools will need to be modified.
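As an illustration (not part of the original material), the PURR-based statistics surface in AIX 5L V5.3 through commands such as lparstat; a minimal sketch:

  lparstat -i    # report the partition's entitlement, capped/uncapped mode, and virtual processor configuration
  lparstat 1 5   # five 1-second samples, including physc (physical processors consumed) and %entc (percent of entitlement consumed)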

Concepts of Solution Design © 2003 IBM Corporation


Eserver pSeries

Virtual I/O Server

© 2003 IBM Corporation


^Eserver pSeries

Virtual I/O Server


 Provides an operating environment for virtual I/O administration
– Virtual I/O server administration
– Restricted scriptable command line user interface (CLI)
 Minimum hardware requirements
– POWER5 VIO capable machine
– Hardware management console
– Storage adapter
– Physical disk
– Ethernet adapter
– At least 128 MB of memory
 Capabilities of the Virtual I/O Server
– Ethernet Adapter Sharing
– Virtual SCSI disk
• Virtual I/O Server Version 1.1 supports selected configurations, including specific models of EMC, HDS, and STK disk subsystems attached using Fibre Channel
– Interacts with AIX and Linux partitions

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual I/O Server (Cont.)

 Delivered on an installation CD when the Advanced POWER Virtualization feature is ordered
 Configuration approaches for high availability
– Virtual I/O Server
• LVM mirroring
• Multipath I/O
• EtherChannel
– Second virtual I/O server instance in another partition

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual SCSI

 Allows sharing of storage devices


 Vital for shared processor partitions
– Overcomes potential limit of adapter slots due to Micro-
Partitioning
– Allows the creation of logical partitions without the need for
additional physical resources
 Allows attachment of previously unsupported storage
solutions

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

VSCSI server and client architecture overview


 Virtual SCSI is based on a client/server relationship.
 The virtual I/O resources are assigned using an HMC.
 Virtual SCSI enables sharing of adapters as well as disk devices.
 Dynamic LPAR operations are allowed.
 Dynamic mapping between physical and virtual resources on the Virtual I/O Server.

[Figure: a Virtual I/O Server partition exports logical volumes 1 and 2 through VSCSI server adapters; AIX and Linux client partitions see them as hdisks through VSCSI client adapters; the POWER Hypervisor carries the traffic, and the Virtual I/O Server owns the physical adapter and the physical disk (SCSI, FC).]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual devices
 Are defined as LVs in the I/O Server partition
– Normal LV rules apply
 Appear as real devices (hdisks) in the hosted partition
 Can be manipulated using the Logical Volume Manager just like an ordinary physical disk
 Can be used as a boot device and as a NIM target
 Can be shared by multiple clients

[Figure: a logical volume built on an hdisk in the Virtual I/O Server partition is exported through a VSCSI server adapter and, via the POWER Hypervisor, appears as an hdisk behind the VSCSI client adapter and LVM in the client partition.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

SCSI RDMA and Logical Remote Direct Memory Access

 SCSI transport protocols define the rules for exchanging information between SCSI initiators and targets.
 Virtual SCSI uses the SCSI RDMA Protocol (SRP).
– SCSI initiators and targets have the ability to directly transfer information between their respective address spaces.
 SCSI requests and responses are sent using the virtual SCSI adapters.
 The actual data transfer, however, is done using the Logical Redirected DMA protocol.

[Figure: the VSCSI device driver (target) and physical adapter device driver in the Virtual I/O Server partition exchange commands with the VSCSI device driver (initiator) in the AIX client partition over the POWER Hypervisor's Reliable Command/Response Transport, while data buffers move by Logical Remote Direct Memory Access through the physical adapter.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual SCSI security

 Only the owning partition has access to its data.


 Data is copied directly from the PCI adapter to the client's memory.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Performance considerations

 Virtual SCSI I/O takes roughly twice as many processor cycles as a locally attached disk I/O (evenly distributed between the client partition and the Virtual I/O Server)
– The path of each virtual I/O request involves several sources of
overhead that are not present in a non-virtual I/O request.
– For a virtual disk backed by the LVM, there is also the performance
impact of going through the LVM and disk device drivers twice.
 If multiple partitions are competing for resources from a VSCSI
server, care must be taken to ensure enough server resources
(CPU, memory, and disk) are allocated to do the job.
 If not constrained by CPU performance, dedicated partition
throughput is comparable to doing local I/O.
 Because there is no caching in memory on the server I/O partition, its memory requirements should be modest.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Limitations

 The hosting partition must be available before the hosted partition boots.
 Virtual SCSI supports FC, parallel SCSI, and SCSI
RAID.
 Maximum of 65535 virtual slots in the I/O server
partition.
 Maximum of 256 virtual slots on a single partition.
 Support for all mandatory SCSI commands.
 Not all optional SCSI commands are supported.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Implementation guideline

 Partitions with high performance and disk I/O requirements are not recommended candidates for VSCSI.
 Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a portion of a logical volume.
 Good candidates include boot disks for the operating system.
 Good candidates also include Web servers that will typically cache a lot of data.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

LVM mirroring

 This configuration protects virtual disks in a client partition against failure of:
– One physical disk
– One physical adapter
– One Virtual I/O Server
 Many possibilities exist to exploit this great function!

[Figure: two Virtual I/O Server partitions, each owning a physical SCSI adapter and a physical disk, export virtual disks through VSCSI server adapters; the client partition mirrors with LVM across its two VSCSI client adapters.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Multipath I/O
 This configuration protects virtual disks in a client partition against failure of:
– One physical FC adapter in one I/O server
– One Virtual I/O Server
 The physical disk is assigned as a whole to the client partition
 Many possibilities exist to exploit this great function!

[Figure: two Virtual I/O Server partitions, each with a physical FC adapter connected through a SAN switch to an ESS disk, export the same physical disk through VSCSI server adapters; the client partition uses multipath I/O across its two VSCSI client adapters.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual LAN overview


 Virtual network segments on top of physical switch devices.
 All nodes in a VLAN can communicate without any L3 routing or inter-VLAN bridging.
 VLANs provide:
– Increased LAN security
– Flexible network deployment over traditional network devices
 VLAN support in AIX is based on the IEEE 802.1Q VLAN implementation.
– VLAN ID tagging of Ethernet frames
– VLAN ID restricted switch ports

[Figure: two VLANs (VLAN 1 and VLAN 2) spanning switches A, B, and C, with nodes A-1, A-2, B-1 through B-3, C-1, and C-2 attached to VLAN-restricted ports.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual Ethernet

 Enables inter-partition communication.


– In-memory point to point connections
 Physical network adapters are not needed.
 Similar to high-bandwidth Ethernet connections.
 Supports multiple protocols (IPv4, IPv6, and ICMP).
 No Advanced POWER Virtualization feature required.
– POWER5 Systems
– AIX 5L V5.3 or appropriate Linux level
– Hardware management console (HMC)

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual Ethernet connections

 VLAN technology implementation
– Partitions can only access data directed to them.
 Virtual Ethernet switch provided by the POWER Hypervisor
 Virtual LAN adapters appear to the OS as physical adapters
– The MAC address is generated by the HMC.
 1-3 Gb/s transmission speed
– Support for large MTUs (~64 K) on AIX.
 Up to 256 virtual Ethernet adapters
– Up to 18 VLANs.
 Bootable device support for NIM OS installations

[Figure: two AIX partitions and a Linux partition, each with a virtual Ethernet adapter attached to the virtual Ethernet switch provided by the POWER Hypervisor.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual Ethernet switch

 Based on IEEE 802.1Q VLAN standard


– OSI-Layer 2
– Optional Virtual LAN ID (VID)
– 4094 virtual LANs supported
– Up to 18 VIDs per virtual LAN port
 Switch configuration through HMC

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

How it works
[Figure: flowchart of virtual Ethernet frame handling. A frame leaves the virtual Ethernet adapter for its virtual VLAN switch port; the POWER Hypervisor caches the source MAC, checks for an IEEE VLAN header (inserting one if absent), and verifies the port is allowed to use that VLAN ID. If the destination MAC is in the table and the associated switch port is configured for that VLAN number, the frame is delivered; otherwise it is passed to a trunk adapter if one is defined, or dropped.]
Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Performance considerations
 Virtual Ethernet performance
– Throughput scales nearly linearly with the allocated capacity entitlement

[Chart: TCP_STREAM throughput (Mb/s) per 0.1 processing units of entitlement, for CPU entitlements from 0.1 to 1.0 at MTU sizes 1500, 9000, and 65394.]

 Virtual LAN vs. Gigabit Ethernet throughput
– The virtual Ethernet adapter has higher raw throughput at all MTU sizes
– The in-memory copy is more efficient at larger MTU sizes

[Chart: TCP_STREAM throughput (Mb/s) for virtual LAN versus Gigabit Ethernet at MTU sizes 1500, 9000, and 65394, simplex and duplex.]
Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Limitations

 Virtual Ethernet can be used in both shared and


dedicated processor partitions provided with the
appropriate OS levels.
 A mixture of Virtual Ethernet connections, real network adapters, or both is permitted within a partition.
 Virtual Ethernet can only connect partitions within a
single system.
 A system’s processor load is increased when using
virtual Ethernet.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Implementation guideline

 Know your environment and the network traffic.


 Choose a high MTU size, as it makes sense for the
network traffic in the Virtual LAN.
 Use the MTU size 65394 if you expect a large amount of
data to be copied inside your Virtual LAN.
 Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394 (see the example below).
 Do not turn off SMT.
 No dedicated CPUs are required for virtual Ethernet
performance.
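A minimal sketch of the MTU and path MTU discovery settings mentioned above, assuming the virtual Ethernet interface is en0 (the interface name is illustrative):

  chdev -l en0 -a mtu=65394        # raise the interface MTU for the virtual LAN
  no -p -o tcp_pmtu_discover=1     # enable TCP path MTU discovery (persists across reboots)
  no -p -o udp_pmtu_discover=1     # enable UDP path MTU discovery (persists across reboots)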

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Connecting Virtual Ethernet to external networks


 Routing
– The partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server.

[Figure: two systems, each with an AIX partition that owns a physical adapter and routes between an internal virtual Ethernet subnet (3.1.1.x or 4.1.1.x) and an external subnet (1.1.1.x or 2.1.1.x); an external IP router connects the two external subnets, which also host an AIX server (1.1.1.10) and a Linux server (2.1.1.10).]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Shared Ethernet Adapter

 Connects internal and external VLANs using one physical


adapter.
 SEA is a new service that acts as a layer 2 network switch.
– Securely bridges network traffic from a virtual Ethernet
adapter to a real network adapter
 SEA service runs in the Virtual I/O Server partition.
– Advanced POWER Virtualization feature required
– At least one physical Ethernet adapter required
 No physical I/O slot and network adapter required in the
client partition.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Shared Ethernet Adapter (Cont.)

 Virtual Ethernet MAC are visible to outside systems.


 Broadcast/multicast is supported.
 ARP (Address Resolution Protocol) and NDP (Neighbor Discovery
Protocol) can work across a shared Ethernet.
 One SEA can be shared by multiple VLANs and multiple subnets can
connect using a single adapter on the Virtual I/O Server.
 Virtual Ethernet adapter configured in the Shared Ethernet Adapter
must have the trunk flag set.
– The trunk Virtual Ethernet adapter enables a layer-2 bridge to a
physical adapter
 IP fragmentation is performed or an ICMP packet too big message is
sent when the shared Ethernet adapter receives IP (or IPv6) packets
that are larger than the MTU of the adapter that the packet is
forwarded through.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual Ethernet and Shared Ethernet Adapter security

 The VLAN (virtual local area network) tagging follows the description in the IEEE 802.1Q standard.
 The implementation of this VLAN standard ensures that the
partitions have no access to foreign data.
 Only the network adapters (virtual or physical) that are
connected to a port (virtual or physical) that belongs to the
same VLAN can receive frames with that specific VLAN ID.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Performance considerations

 Virtual I/O Server performance
– Adapters stream data at media speed if the Virtual I/O Server has enough capacity entitlement.
– CPU utilization per Gigabit of throughput is higher with a Shared Ethernet Adapter.

[Charts: Virtual I/O Server TCP_STREAM throughput (Mb/s) and normalized CPU utilization (%CPU per Gb) at MTU sizes 1500 and 9000, simplex and duplex.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Limitations

 System processors are used for all communication functions,


leading to a significant amount of system processor load.
 One of the virtual adapters in the SEA on the Virtual I/O
server must be defined as a default adapter with a default
PVID.
 Up to 16 Virtual Ethernet adapters with 18 VLANs on each
can be shared on a single physical network adapter.
 Shared Ethernet Adapter requires:
– POWER Hypervisor component of POWER5
systems
– AIX 5L Version 5.3 or appropriate Linux level

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Implementation guideline

 Know your environment and the network traffic.


 Use a dedicated network adapter if you expect heavy
network traffic between Virtual Ethernet and local
networks.
 If possible, use dedicated CPUs for the Virtual I/O
Server.
 Choose 9000 for MTU size, if this makes sense for your
network traffic.
 Don’t use Shared Ethernet Adapter functionality for
latency critical applications.
 With MTU size 1500, you need about 1 CPU per gigabit
Ethernet adapter streaming at media speed.
 With MTU size 9000, 2 Gigabit Ethernet adapters can
stream at media speed per CPU.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Shared Ethernet Adapter configuration


 The Virtual I/O Server is configured with at least one physical Ethernet adapter.
 One Shared Ethernet Adapter can be shared by multiple VLANs.
 Multiple subnets can connect using a single adapter on the Virtual I/O Server.

[Figure: a Shared Ethernet Adapter in the Virtual I/O Server bridges physical adapter ent0 to VLAN 1 and VLAN 2 on the Hypervisor's virtual Ethernet switch; an AIX partition (VLAN 1, 10.1.1.11) and a Linux partition (VLAN 2, 10.1.2.11) reach the external AIX server (10.1.1.14) and Linux server (10.1.2.15) through it.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Multiple Shared Ethernet Adapter configuration


 Maximizing throughput
– Using several Shared Ethernet Adapters
– More queues
– More performance

[Figure: the Virtual I/O Server hosts two Shared Ethernet Adapters, each bridging its own physical adapter (ent0 and ent1) to a separate VLAN (VLAN 1 at 10.1.1.11 and VLAN 2 at 10.1.2.11), serving the AIX and Linux partitions and the external servers 10.1.1.14 and 10.1.2.15.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Multipath routing with dead gateway detection


 This configuration protects your access to the external network against:
– Failure of one physical network adapter in one I/O server
– Failure of one Virtual I/O Server
– Failure of one gateway

[Figure: an AIX partition uses multipath routing with dead gateway detection (default route to gateway 9.3.5.10 via 9.3.5.12 and to gateway 9.3.5.20 via 9.3.5.22); each route runs through a Shared Ethernet Adapter in a different Virtual I/O Server, each with its own physical adapter to the external network.]

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Shared Ethernet Adapter commands

 Virtual I/O Server commands


– lsdev -type adapter: Lists all the virtual and physical adapters.
– Choose the virtual Ethernet adapter we want to map to the physical
Ethernet adapter.
– Make sure the physical and virtual interfaces are unconfigured (down or
detached).
– mkvdev: Maps the physical adapter to the virtual adapter, creates a layer 2
bridge, and defines the default virtual adapter with its default VLAN ID. It
creates a new Ethernet interface (for example, ent5).
– The mktcpip command is used for TCP/IP configuration on the new
Ethernet interface (for example, ent5).
 Client partition commands
– No new commands are needed; the typical TCP/IP configuration is done
on the virtual Ethernet interface that it is defined in the client partition
profile on the HMC.
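For illustration, a typical Shared Ethernet Adapter setup on the Virtual I/O Server command line might look like the following (adapter names and addresses are examples, not prescriptions):

  $ lsdev -type adapter                                          # identify the physical (ent0) and trunk virtual (ent2) adapters
  $ mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1   # bridges ent0 and ent2, creating, for example, ent5
  $ mktcpip -hostname vios1 -inetaddr 10.1.1.11 -netmask 255.255.255.0 -interface en5 -gateway 10.1.1.1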

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Virtual SCSI commands

 Virtual I/O Server commands


– To map a LV:
• mkvg: Creates the volume group, where a new LV will be created using the mklv
command.
• lsdev: Shows the virtual SCSI server adapters that could be used for mapping
with the LV.
• mkvdev: Maps the virtual SCSI server adapter to the LV.
• lsmap -all: Shows the mapping information.
– To map a physical disk:
• lsdev: Shows the virtual SCSI server adapters that could be used for mapping
with a physical disk.
• mkvdev: Maps the virtual SCSI server adapter to a physical disk.
• lsmap -all: Shows the mapping information.
 Client partition commands
– No new commands needed; the typical device configuration uses the
cfgmgr command.
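For illustration, mapping a logical volume to a client might look like this on the Virtual I/O Server command line (volume group, logical volume, disk, and adapter names are examples):

  $ mkvg -f -vg rootvg_clients hdisk2                          # create a volume group on a spare physical disk
  $ mklv -lv rootvg_lpar1 rootvg_clients 10G                   # create the logical volume to export
  $ lsdev -virtual                                             # locate the virtual SCSI server adapter, for example vhost0
  $ mkvdev -vdev rootvg_lpar1 -vadapter vhost0 -dev vtscsi0    # map the LV to the adapter
  $ lsmap -all                                                 # verify the mapping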

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Section Review Questions

1. Any technology improvement will boost performance


of any client solution.
a. True
b. False
2. The application of technology in a creative way to
solve client’s business problems is one definition of
innovation.
a. True
b. False

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Section Review Questions

3. Client’s satisfaction with your solution can be


enhanced by which of the following?
a. Setting expectations appropriately.
b. Applying technology appropriately.
c. Communicating the benefits of the technology to the
client.
d. All of the above.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Section Review Questions

4. Which of the following are available with


POWER5 architecture?
a. Simultaneous Multi-Threading.
b. Micro-Partitioning.
c. Dynamic power management.
d. All of the above.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Section Review Questions

5. Simultaneous Multi-Threading is the same as


hyperthreading, IBM just gave it a different
name.
a. True.
b. False.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Section Review Questions

6. In order to bridge network traffic between the


Virtual Ethernet and external networks, the
Virtual I/O Server has to be configured with at
least one physical Ethernet adapter.
a. True.
b. False.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Review Question Answers

1. b
2. a
3. d
4. d
5. b
6. a

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Unit Summary

 You should now be able to:


– Describe the relationship between technology and solutions.
– List key IBM technologies that are part of the POWER5 products.
– Describe the functional benefits that these technologies provide.
– Discuss the appropriate use of these technologies.

Concepts of Solution Design © 2003 IBM Corporation


^Eserver pSeries

Reference

 You may find more information here:


– IBM eServer pSeries AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading white paper
– Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940
– IBM eServer p5 Virtualization – Performance Considerations, SG24-5768

Concepts of Solution Design © 2003 IBM Corporation
