Section Objectives
Upon completion of this section, you will be able to:
- Describe business continuity
- Describe the solutions and supporting technologies that enable business continuity and uninterrupted data availability:
  - Backup and Recovery
  - Local Replication
  - Remote Replication
The objectives for this section are shown here. Please take a moment to read them.
In this Section
This section contains the following modules:
1. Business Continuity Overview
2. Backup and Recovery
3. Local Replication
4. Remote Replication
Information has become a critical asset for businesses. The survival of a business depends on
uninterrupted availability of the data. Steps should be taken to ensure continuous availability of data
in the event of a disaster.
The objectives for this module are shown here. Please take a moment to review them.
Before we can talk about business continuity and solutions for business continuity, we must first define
the terms.
Business Continuity is the preparation for, response to, and recovery from an application outage that adversely affects business operations.
Business Continuity solutions address systems unavailability, degraded application performance, and unacceptable recovery strategies.
Impacts of downtime include:
- Damaged reputation: with customers, suppliers, financial markets, banks, and business partners
- Lost revenue: direct loss, compensatory payments, lost future revenue, billing losses, and investment losses
- Financial performance: revenue recognition, cash flow, lost discounts (A/P), payment guarantees, credit rating, and stock price
- Other expenses: temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses...
There are many factors to consider when calculating the cost of downtime. A formula to calculate the cost of an outage should capture both the cost of lost employee productivity and the cost of lost income from missed sales.
- Estimated average cost of 1 hour of downtime = (employee costs per hour × number of employees affected by the outage) + average income per hour
- Employee costs per hour is simply the total salaries and benefits of all employees per week, divided by the average number of working hours per week.
- Average income per hour is the total income of the institution per week, divided by the average number of hours per week that the institution is open for business.
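As a worked illustration, the sketch below implements this formula exactly as stated; the function and parameter names are ours, not from the course:

```python
def estimated_hourly_downtime_cost(weekly_payroll, weekly_work_hours,
                                   employees_affected, weekly_income,
                                   weekly_open_hours):
    """Estimated average cost of 1 hour of downtime, per the formula above.

    weekly_payroll     -- total salaries and benefits of all employees per week
    weekly_work_hours  -- average number of working hours per week
    employees_affected -- number of employees affected by the outage
    weekly_income      -- total income of the institution per week
    weekly_open_hours  -- hours per week the institution is open for business
    """
    employee_cost_per_hour = weekly_payroll / weekly_work_hours
    average_income_per_hour = weekly_income / weekly_open_hours
    return employee_cost_per_hour * employees_affected + average_income_per_hour

# Example: $500,000 weekly payroll over 40 working hours, 25 employees
# affected, $1,000,000 weekly income over 60 open hours.
print(estimated_hourly_downtime_cost(500_000, 40, 25, 1_000_000, 60))
```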
Information Availability

% Uptime    % Downtime    Downtime per year    Downtime per week
98%         2%            7.3 days             3 hrs 22 min
99%         1%            3.65 days            1 hr 41 min
99.8%       0.2%          17 hrs 31 min        20 min 10 sec
99.9%       0.1%          8 hrs 45 min         10 min 5 sec
99.99%      0.01%         52.5 min             1 min
99.999%     0.001%        5.25 min             6 sec
99.9999%    0.0001%       31.5 sec             0.6 sec
Information Availability ensures that applications and business units have access to information
whenever it is needed. The primary components of information availability are:
- Protection from data loss
- Ensuring data access
- Appropriate data security
Since information is a major business asset, high information availability increases productivity and efficiency. It is therefore necessary to make this information reliable, available whenever it is required, and sharable across different platforms, anywhere, at any time. Ensuring access to this information and appropriate data security are also very important, and all of this must be done in a cost-effective manner.
Most availability limits are defined in terms of "nines." The chart above translates the percentage of downtime into amounts of downtime per year and per week. Downtime translates to lost revenue. In healthcare, the Gartner Group considers 99.5% system availability outstanding; this is the equivalent of 43 hours of unplanned downtime and 50 hours of planned downtime per year.
The availability requirement for some critical applications has moved to 99.999% uptime.
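The arithmetic behind the chart is simple; a short sketch (ours, for illustration) converts any availability percentage into downtime per year and per week:

```python
HOURS_PER_YEAR = 365 * 24   # 8,760
HOURS_PER_WEEK = 7 * 24     # 168

def downtime(availability_pct):
    """Return (hours down per year, minutes down per week) for a given
    availability percentage."""
    down_fraction = 1 - availability_pct / 100
    return down_fraction * HOURS_PER_YEAR, down_fraction * HOURS_PER_WEEK * 60

for nines in (98, 99, 99.8, 99.9, 99.99, 99.999, 99.9999):
    per_year, per_week = downtime(nines)
    print(f"{nines}%: {per_year:9.3f} h/year  {per_week:8.2f} min/week")
```

For example, 99.9% availability yields 0.001 × 8,760 = 8.76 hours per year, matching the "8 hrs 45 min" row above.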
[Chart: estimated revenue lost per hour of downtime by industry, in millions of dollars, ranging from roughly $1.1M to $6.5M per hour. Industries shown include manufacturing, telecommunications, energy, point of sale, credit card sales authorization, call location services, and retail brokerage.]
This chart shows how much money each industry loses for each hour of downtime. As you can see,
downtime is expensive!
[Diagram: a timeline running from weeks through days, hours, and minutes to seconds on either side of an outage; the left side is the recovery point and the right side the recovery time. Technologies plotted against the recovery point include tape backup (largest potential data loss), periodic replication, asynchronous replication, and synchronous replication (smallest potential data loss).]
Recovery Point Objective (RPO) is the point in time to which systems and data must be recovered after
an outage. This defines the amount of data loss a business can endure. Different business units within
an organization may have varying RPOs.
[Diagram: the same timeline, now read against the recovery time, which spans fault detection and recovering data. Technologies plotted include global cluster (shortest recovery time), manual migration, and tape restore (longest).]
Recovery Time Objective (RTO) is the period of time within which systems, applications, or functions must be recovered after an outage. This defines the amount of downtime that a business can endure and survive. Recovery time includes fault detection, data recovery, and bringing applications back online.
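Because RTO is simply the sum of these phases, a recovery plan can be sanity-checked with trivial arithmetic; the phase durations below are invented for illustration:

```python
# Hypothetical phase durations, in minutes; real values depend on the
# technology chosen (global cluster vs. manual migration vs. tape restore).
recovery_phases = {
    "fault detection": 5,
    "data recovery": 45,
    "bringing applications back online": 10,
}

achieved_rto = sum(recovery_phases.values())   # 60 minutes in this sketch
target_rto = 120                               # minutes, as set by the business

print(f"achieved RTO: {achieved_rto} min (target {target_rto} min)")
assert achieved_rto <= target_rto, "recovery plan does not meet the RTO"
```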
- Disaster restart
  - The process of restarting mirrored, consistent copies of data and applications
  - Allows restart of all participating DBMSs to a common point of consistency, utilizing automated application of recovery logs during DBMS initialization
  - The restart time is comparable to the length of time required for the application to restart after a power failure
Disaster recovery is the process of restoring a previous copy of the data and applying logs or other necessary processes to that copy to bring it to a known point of consistency. Disaster restart is the restarting of dependent-write-consistent copies of data and applications, utilizing the automated application of DBMS recovery logs during DBMS initialization to bring the data and application to a transactional point of consistency.
There is a fundamental difference between the two. Disaster recovery generally implies the use of backup technology in which data is copied to tape and then shipped off-site. When a disaster is declared, the remote-site copies are restored and logs are applied to bring the data to a point of consistency. Once all recoveries are completed, the data is validated to ensure it is correct.
While it might seem like semantics, there is an important difference between recovery and restart, and the key difference is the RTO. In a recovery situation, one might have to restore data from tape or disk, roll forward committed transactions, roll back uncommitted transactions, and restore to a point of application (or database) consistency. These processes lengthen the RTO. In a restart situation, the application or the database "self-heals," so to speak. As mentioned in the slide, this is very much like starting back up after a power failure.
Elevated demand for increased application availability confirms the need to ensure business continuity
practices are consistent with business needs.
Interruptions are classified as either planned or unplanned. Failure to address these specific outage categories seriously compromises a company's ability to meet business goals.
Planned downtime is expected and scheduled, but it is still downtime during which data is unavailable. Causes of planned downtime include:
- New hardware installation/integration/maintenance
- Software upgrades/patches
- Backups
- Application and data restores
- Data center disruptions from facility operations (renovations, construction, other)
- Refreshing a testing or development environment with production data
- Porting a testing/development environment over to the production environment
Causes of Downtime
- Human error
- System failure
- Infrastructure failure
- Disaster
Today, the most critical component of an organization is its information. Any disaster will affect the availability of the information needed to run normal business operations.
In our definition of disaster, the organization's primary systems, data, and applications are damaged or destroyed. Not all unplanned disruptions constitute a disaster.
Business continuity and disaster recovery are not the same. Business Continuity is a holistic approach to planning for, preparing for, and recovering from an adverse event. The focus is on prevention, identifying risks, and developing procedures to ensure the continuity of business functions. Disaster recovery planning should be included as part of business continuity.
Objectives of Business Continuity:
- Facilitate uninterrupted business support despite the occurrence of problems.
- Create plans that identify risks and mitigate them wherever possible.
- Provide a road map to recover from any event.
Disaster Recovery is more about specific cures: restoring service and damaged assets after an adverse event. In our context, Disaster Recovery is the coordinated process of restoring the systems, data, and infrastructure required to support key ongoing business operations.
Business Continuity Planning (BCP) is a risk management discipline. It involves the entire business, not just IT. BCP proactively identifies vulnerabilities and risks, planning in advance how to prepare for and respond to a business disruption. A business with strong BC practices in place is better able to keep running through the disruption and to return to business as usual.
BCP actually reduces the risk and cost of an adverse event, because the process often uncovers and mitigates potential problems.
[Diagram: the Business Continuity Planning lifecycle — establish objectives; perform analysis; design and develop the plan; implement; and assess, maintain, and document on an ongoing cycle.]
Impacted area       Cost per event   Probability   Loss p/y   Est. cost of mitigation
Entire Company      $279,056         0.25          $69,517    $5,800
Entire Company      $279,066         0.2           $55,768    $66,456
Entire Company      $279,098         0.2           $55,619    $10,000
IT - All            $16,000          1.0           $18,000    $80,000
Entire Company      $16,000          0.5           $8,000     $122,000
IT - Intranet/B2B   $400             1.0           $1,800     $5,000
The Business Impact Analysis quantifies the impact an outage will have on the business and the potential costs associated with the interruption. It helps businesses channel their resources based on the probability of failure and the associated costs. In the example shown, the dollar values are arbitrary and are used just for illustration.
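The underlying arithmetic is expected-loss analysis; here is a minimal sketch (our function names, using one row of the illustrative table above):

```python
def annualized_loss(cost_per_event, annual_probability):
    """Expected loss per year = cost of one outage x its annual probability."""
    return cost_per_event * annual_probability

def mitigation_pays(cost_per_event, annual_probability, mitigation_cost):
    """Crude test: mitigation is attractive when the expected annual loss
    exceeds the estimated cost of mitigation."""
    return annualized_loss(cost_per_event, annual_probability) > mitigation_cost

# First row of the table: a ~$279K outage with probability 0.25 per year,
# against $5,800 of mitigation.
print(mitigation_pays(279_056, 0.25, 5_800))   # True -- mitigation is worthwhile
```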
[Diagram: user and application clients connecting over an IP network to a primary node and its storage.]
Earlier, we discussed the importance of mitigating potential problems. Now, let's walk through a data storage infrastructure example to identify the single points of failure and the solutions that eliminate them.
HBA Failures
- Configure multiple HBAs and use multi-pathing software
  - Protects against HBA failure
  - Can provide improved performance (vendor dependent)

[Diagram: a host with two HBAs connected through a switch to a storage array port.]
One component that could fail is an HBA on the server. Configuring multiple HBAs and using multi-pathing software provides path redundancy. Upon detecting a failed HBA, the software can re-drive the I/O through another available path. This eliminates the HBA as a single point of failure.
[Diagram: a host with two HBAs connected through two switches to two storage array ports.]
A switch or a storage array port could also fail. As shown in this example, configuring multiple
switches, and making the devices available via multiple storage array ports, provides protection against
switch or storage array port failures.
Disk Failures
- Use some level of RAID

[Diagram: the same dual-path configuration, with RAID-protected disks in the storage array.]
As seen earlier, using some level of RAID, such as RAID-1 or RAID-5, ensures continuous operation
in the event of disk failures.
Host Failures
- Clustering protects against production host failures

[Diagram: two clustered hosts, each with redundant HBAs, connected through switches to the storage.]
[Diagram: the clustered hosts and primary storage array, with data replicated to a second storage array.]
It is also possible for the site or the storage array to fail. Remote replication of data to a secondary
array at a secondary site will protect against these failures.
[Diagram: the fully protected configuration — user and application clients on redundant IP network switches; a primary node and a failover node running clustering software with a keep-alive connection; redundant paths to the storage array; redundant disks (RAID 1/RAID 5); and remote replication to a storage array at a redundant site.]
This slide summarizes what we have seen in the previous few slides. The configuration uses clustering, redundant paths, RAID-protected disks, remote replication of data to a secondary site, and a redundant Local Area Network.
Business Continuity technology solutions include local replication, remote replication, and
backup/restore. This module provides a very high level overview of some of these solutions. They are
covered in more detail in later modules.
Local Replication
- Data from the production devices is copied over to a set of target (replica) devices within the same array
- After some time, the replica devices contain data identical to that on the production devices
- Copying of data can subsequently be halted; at that point in time, the replica devices can be used independently of the production devices
- The replicas can then be used for restore operations in the event of data corruption or other events
- Alternatively, the data from the replica devices can be copied to tape, off-loading the burden of backup from the production devices
Local replication technologies offer fast and convenient methods for ensuring data availability. The
different technologies and the uses of replicas for BC/DR operations will be discussed in a later
module in this section. Typically, local replication uses replica disk devices. This greatly speeds up the
restore process, thus minimizing the RTO. Frequent point-in-time replicas also help in minimizing
RPO.
Remote Replication
- Data from the production devices is copied over to a set of target (replica) devices on a different array some distance away
- Target devices can be kept continuously synchronized with the production devices
- In the event of a failure of the production devices, applications can continue to run from the target devices
Remote replication typically involves a pair of arrays separated by some distance. To achieve near-zero RPO and a very small RTO, production and target devices are kept synchronized at all times.
Periodic local replicas of the target devices may also be taken, to protect against data corruption on the
production devices. The various alternatives for remote replication are discussed later in this section.
Backup/Restore
- Backup to tape has been the predominant method for ensuring data availability and business continuity
- Low-cost, high-capacity disk drives are now being used for backup to disk, which considerably speeds up the backup and restore processes
- The frequency of backup is dictated by defined RPO/RTO requirements as well as the rate of change of the data
Far from being antiquated, periodic backup is still a widely used method for preserving copies of data.
In the event of data loss due to corruption or other events, data can be restored up to the last backup.
Evolving technologies now permit faster backups to disks. Magnetic tape drive speeds and capacities
are also continually being enhanced. The various backup paradigms and the role of backup in BC/DR
planning are discussed in detail later in this section.
Module Summary
Key points covered in this module:
- Importance of Business Continuity
- Types of outages and their impact on businesses
- Business Continuity Planning and Disaster Recovery
- Definitions of RPO and RTO
- Difference between Disaster Recovery and Disaster Restart
- Identifying and eliminating Single Points of Failure
These are the key points covered in this module. Please take a moment to review them.
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
At this point, let's apply what we've learned to some real-world examples. In this case, we look at how EMC PowerPath improves Business Continuity in storage environments.
PowerPath
- Automatic detection of, and recovery from, host-to-array path failures
- Transparent to the application

[Diagram: on the server, the file system, DBMS, and management utilities sit above PowerPath, which distributes I/O across multiple SCSI drivers; each driver connects through its own SCSI controller and the interconnect topology to the storage.]
PowerPath is host-based software that resides between the application and the disk device layers. Every I/O from the host to the array must pass through the PowerPath driver software. This allows PowerPath to work in conjunction with the array and the connectivity environment to provide intelligent I/O path management, including path failover and dynamic load balancing, while remaining transparent to application I/O requests as it automatically detects and recovers from host-to-array path failures.
PowerPath is supported on various hosts and operating systems, such as Sun Solaris, IBM AIX, HP-UX, Microsoft Windows, Linux, and Novell. Storage arrays from EMC, Hitachi, HP, and IBM are supported. The level of OS and array-model support varies between PowerPath software versions.
PowerPath Features
- Multiple paths, for higher availability and performance
PowerPath maximizes application availability, optimizes performance, and automates online storage management while reducing complexity and cost, all from one powerful data path management solution. PowerPath supports the following features:
- Multiple path support: PowerPath supports multiple paths between a logical device and a host. Multiple paths enable the host to access a logical device even if a specific path is unavailable, and allow the I/O workload to a given logical device to be shared.
- Dynamic load balancing: PowerPath is designed to use all paths at all times. It distributes I/O requests to a logical device across all available paths, rather than requiring a single path to bear the entire I/O burden.
- Proactive path testing and automatic path recovery: PowerPath uses a path test to ascertain the viability of a path. After a path fails, PowerPath continues testing it periodically to determine whether it is fixed. If the path passes the test, PowerPath restores it to service and resumes sending I/O to it.
- Automatic path failover: if a path fails, PowerPath redistributes I/O traffic from that path to functioning paths.
- Online configuration and management: PowerPath management interfaces include a command line interface and, on Windows, a GUI.
- High-availability cluster support: PowerPath is particularly beneficial in cluster environments, as it can prevent operational interruptions and costly downtime.
PowerPath Configuration

[Diagram: host applications on the server pass through PowerPath to multiple SCSI drivers and HBAs (Host Bus Adapters), then across the SAN/interconnect topology to the storage.]
Without PowerPath, if a host needed access to 40 devices and there were four Host Bus Adapters, you would most likely configure the host to present 10 unique devices to each HBA. With PowerPath, the host is given access to all 40 devices via all four HBAs.
PowerPath supports up to 32 paths to a logical volume. The host can be connected to the array using a number of interconnect topologies, such as SAN, SCSI, or iSCSI.
- Platform-independent base driver
- Applications direct I/O to PowerPath
- PowerPath directs I/O to the optimal path based on current workload and path availability
- When a path fails, PowerPath chooses another path in the set

[Diagram: host applications direct I/O to PowerPath, which selects among the SCSI drivers and HBAs on the way to the storage.]
The PowerPath filter driver is a platform-independent driver that resides between the application and the HBA driver.
The driver identifies all paths that read and write to the same device and builds a routing table for the device, called a volume path set. A volume path set is created for each shared device in the array. PowerPath can use any path in the set to service an I/O request. If a path fails, PowerPath can redirect an I/O request from that path to any other available path in the set. This redirection is transparent to the application, which does not receive an error.
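In rough terms, a volume path set behaves like the toy model below (our sketch; PowerPath's real path-selection policies also weigh current workload and device priority):

```python
import itertools

class VolumePathSet:
    """Toy routing table for one logical device: a set of paths, any of
    which can service an I/O, with transparent failover."""

    def __init__(self, paths):
        self.paths = list(paths)              # e.g. ["hba0->sp_a0", "hba1->sp_b1"]
        self.alive = set(self.paths)
        self._cycle = itertools.cycle(self.paths)

    def mark_failed(self, path):
        self.alive.discard(path)

    def mark_restored(self, path):            # after a periodic path test succeeds
        self.alive.add(path)

    def next_path(self):
        """Round-robin over live paths; skip failed paths without surfacing
        an error to the application."""
        for _ in range(len(self.paths)):
            path = next(self._cycle)
            if path in self.alive:
                return path
        raise IOError("all paths to the device have failed")

pathset = VolumePathSet(["hba0->sp_a0", "hba0->sp_b0", "hba1->sp_a1", "hba1->sp_b1"])
pathset.mark_failed("hba0->sp_a0")
print(pathset.next_path())   # I/O is re-driven down a surviving path
```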
- In most environments, a host has multiple paths to the storage system
- Volumes are spread across all available paths

[Diagram: a failed path between host applications and the storage, indicated by a red dotted line.]
Without PowerPath, the loss of a channel (as indicated in the diagram by a red dotted line) means one
or more applications may stop functioning. This can be caused by the loss of a Host Bus Adapter,
Storage Array Front-end connectivity, Switch port, or a failed cable. In a standard non-PowerPath
environment, these are all single points of failure. In this case, all I/O that was heading down the path
highlighted in red is now lost, resulting in an application failure and the potential for data loss or
corruption.
- PowerPath responds by taking the path offline and re-driving I/O through an alternate path

[Diagram: after the path failure, PowerPath redirects I/O across the remaining paths to the storage.]
This example depicts how PowerPath failover works. When a failure occurs, PowerPath transparently
redirects the I/O down the most suitable alternate path. The PowerPath filter driver looks at the volume
path set for the device, considers current workload, load balancing, and device priority settings, and
chooses the best path to send the I/O down. In the example, PowerPath has three remaining paths to
redirect the failed I/O, and to load balance.
Summary
Key points covered in this module:
- PowerPath is server-based software that provides multiple paths between the host bus adapter and the storage subsystem
  - Redundant paths eliminate host adapters, cable connections, and channel adapters as single points of failure and increase availability
  - Improves performance by dynamically balancing the workload across all available paths
  - Transparent to applications
These are the key points covered in this module. Please take a moment to review them.
This module looks at backup and recovery. Backup and recovery are a major part of the planning
process for Business Continuity.
The objectives for the module are shown here. Please take a moment to read them.
In this Module
This module contains the following lessons:
- Planning for Backup and Recovery
- Backup and Recovery Methods
- Backup Architecture Topologies
- Managing the Backup Process
This module contains the four lessons shown here. These lessons provide an overview of backup and
recovery, including the business and technical aspects.
This lesson provides an overview of the business drivers for backup and recovery and introduces some
of the common terms used when developing a backup and recovery plan.
The objectives for this lesson are shown here. Please take a moment to read them.
What is a Backup?
- A backup is an additional copy of data that can be used for restore and recovery purposes
- The backup copy is used when the primary copy is lost or corrupted
- The backup copy can be created as a:
  - Simple copy (there can be one or more copies)
  - Mirrored copy (the copy is always updated with whatever is written to the primary copy)
A Backup is a copy of the online data that resides on primary storage. The backup copy is created and
retained for the sole purpose of recovering deleted, broken, or corrupted data on the primary disk.
The backup copy is usually retained over a period of time, depending on the type of the data, and on
the type of backup. There are three derivatives for backup: disaster recovery, archival, and operational
backup.
The data that is backed up may be on such media as disk or tape, depending on the backup derivative
the customer is targeting. For example, backing up to disk may be more efficient than tape in
operational backup environments.
Several choices are available to get the data written to the backup media.
- You can simply copy the data from the primary storage to the secondary storage (disk or tape), onsite. This is a simple strategy, easily implemented, but it impacts the production server where the data is located, since it uses the server's resources. This may be tolerable for some applications, but not for high-demand ones.
- To avoid an impact on the production application, and to perform serverless backups, you can mirror (or snap) a production volume. For example, you can mount it on a separate server and then copy it to the backup media (disk or tape). This option completely frees up the production server, at the infrastructure cost associated with the additional resources.
- Remote backup can be used to comply with offsite requirements. A copy from the primary storage is made directly to backup media sitting at another site. The backup media can be a real library, a virtual library, or even a remote filesystem.
- You can copy to a first set of backup media, kept onsite for operational restore requirements, and then duplicate it to another set of media for offsite purposes. To simplify the procedure, replicate it to an offsite location to remove the manual steps associated with moving the backup media to another site.
Disaster Recovery addresses the requirement to be able to restore all, or a large part of, an IT
infrastructure in the event of a major disaster.
Archival is a common requirement used to preserve transaction records, email, and other business
work products for regulatory compliance. The regulations could be internal, governmental, or perhaps
derived from specific industry requirements.
Operational is typically the collection of data for the eventual purpose of restoring, at some point in the
future, data that has become lost or corrupted.
- Client: gathers the data for backup (a backup client sends backup data to a backup server or storage node)
- Storage Node: writes the data set to the backup device
Backup products vary, but they do have some common characteristics. The basic architecture of a
backup system is client-server, with a backup server and some number of backup clients or agents. The
backup server directs the operations and owns the backup catalog (the information about the backup).
The catalog contains the table-of-contents for the data set. It also contains information about the
backup session itself.
The backup server depends on the backup client to gather the data to be backed up. The backup client can be local, or it can reside on another system, presumably to back up the data visible to that system. A backup server receives backup metadata from backup clients to perform its activities.
There is another component called a storage node. The storage node is the entity responsible for writing the data set to the backup device. Typically, a storage node is packaged with the backup server, and the backup device is attached directly to the backup server's host platform. Storage nodes play an important role in backup planning, as they can be used to consolidate backup servers.
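To make the roles concrete, here is a minimal sketch (our field names, not any product's schema) of what the backup server's catalog records per session:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CatalogEntry:
    """One backup session as the catalog might record it: the table of
    contents of the data set plus metadata about the session itself."""
    client: str                    # which backup client gathered the data
    storage_node: str              # which node wrote it to the backup device
    level: str                     # "full", "incremental", or "cumulative"
    started: datetime
    files: list[str] = field(default_factory=list)   # table of contents

# The catalog is owned by the backup server.
catalog: list[CatalogEntry] = []
catalog.append(CatalogEntry("mailserver01", "node-a", "full",
                            datetime(2007, 3, 5, 22, 0),
                            files=["/var/mail/alice", "/var/mail/bob"]))
```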
[Diagram: backup clients on application servers send the data set and metadata over the network to a backup server and storage node; the metadata is recorded in the catalog, and the data set is written to disk storage or tape.]
Business Considerations
- Customer business needs determine:
  - What are the restore requirements (RPO and RTO)?
  - Where and when will the restores occur?
  - What are the most frequent restore requests?
  - Which data needs to be backed up?
  - How frequently should data be backed up (hourly, daily, weekly, monthly)?
Some important decisions that need consideration before implementing a backup/restore solution are shown here. Some examples include:
- The Recovery Point Objective (RPO)
- The Recovery Time Objective (RTO)
- The media type to be used (disk or tape)
- Where and when the restore operations will occur, especially if an alternative host is used to receive the restored data
- When to perform backups
- The granularity of backups: full, incremental, or cumulative
- How long to keep the backups; for example, some backups need to be retained for 4 years, others for just 1 month
- Whether it is necessary to make copies of the backups
- Location: many organizations have dozens of heterogeneous platforms supporting a complex application. Consider a data warehouse where data from many sources is fed into the warehouse. When this scenario is viewed as "the data warehouse application," it easily fits this model. Some of the issues are:
  - How the backups for subsets of the data are synchronized
  - How these applications are restored
- Size: backing up a large amount of data that consists of a few big files may have less system overhead than backing up a large number of small files. If a file system contains millions of small files, merely searching the file system structures for changed files can take hours, since the entire file structure is searched.
- Number: a file system containing one million files with a ten-percent daily change rate will potentially create 100,000 entries in the backup catalog. This raises other issues, such as:
  - How a massive file system search impacts the system
  - Search time and media impact
  - Whether there is an impact on tape start/stop processing
Many backup devices, such as tape drives, have built-in hardware compression technologies. To
effectively use these technologies, it is important to understand the characteristics of the data. Some
data, such as application binaries, do not compress well. Text data can compress very well, while other
data, such as JPEG and ZIP files, are already compressed.
- Disaster Recovery
  - Driven by the organization's disaster recovery policy
  - Portable media (tapes) sent to an offsite location/vault
  - Replicated over to an offsite location (disk)
  - Backed up directly to the offsite location (disk, tape, or emulated tape)
- Archiving
  - Driven by the organization's policy
  - Dictated by regulatory requirements
Retention periods are the length of time that a particular version of a dataset remains available to be restored. Retention periods are driven by the type of recovery the business is trying to achieve:
- For operational restores, data sets could be maintained on a primary disk backup target for the period in which most restore requests are likely, and then moved to a secondary backup target, such as tape, for long-term offsite storage.
- For disaster recovery, backups must be taken and moved to an offsite location.
- For archiving, requirements are usually driven by the organization's policy and regulatory conformance requirements. Tapes can be used for some applications, but for others a more robust and reliable solution, such as disk, may be more appropriate.
Lesson: Summary
Topics in this lesson included:
- Backup and Recovery definitions and examples
- Common reasons for Backup and Recovery
- The business considerations for Backup and Recovery
- Recovery Point Objectives and Recovery Time Objectives
- The data considerations for Backup and Recovery
- The planning for Backup and Recovery
In this lesson we reviewed the business and data considerations when planning for backup and recovery, including:
- What backup and recovery are
- What the backup and recovery process is
- Business recovery needs: RPO (recovery point objectives) and RTO (recovery time objectives)
- Data characteristics: files, compression, retention
We've discussed the importance and considerations of a backup plan. This lesson provides an overview of the different methods for creating a backup set.
The objectives for this lesson are shown here. Please take a moment to read them.
[Diagram: levels of backup granularity — full, incremental, and cumulative (differential).]
The three different types of backups include: Full Backup; Incremental Backup; and Cumulative
Backup.
y A full backup is a backup of all data on the target volumes, regardless of any changes made to the
data itself.
y An incremental backup contains the changes since the last backup, of any type, whichever was
most recent.
y A cumulative backup, also known as a differential backup, is a type of incremental that contains
changes made to a file since the last full backup.
The granularity and levels for backups depend on business needs, and, to some extent, technological
limitations. Some backup strategies define as many as ten levels of backup. IT organizations use a
combination of these to fulfill their requirements. Most use some combination of full, cumulative, and
incremental backups.
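The restore-time trade-off between these levels follows directly from which backups a restore must replay; here is a small sketch (ours) of the rule, matching the Tuesday-through-Thursday examples that follow:

```python
def restore_chain(backups):
    """Given a time-ordered list of backup levels ('full', 'cumulative',
    'incremental'), return the indices a restore must replay: the last
    full, then the last cumulative after it (if any), then every
    incremental after that."""
    last_full = max(i for i, level in enumerate(backups) if level == "full")
    chain, start = [last_full], last_full
    cumulatives = [i for i, level in enumerate(backups)
                   if level == "cumulative" and i > last_full]
    if cumulatives:
        chain.append(cumulatives[-1])
        start = cumulatives[-1]
    chain += [i for i, level in enumerate(backups)
              if level == "incremental" and i > start]
    return chain

# Monday full, then three incrementals: the restore needs all four.
print(restore_chain(["full", "incremental", "incremental", "incremental"]))  # [0, 1, 2, 3]
# Monday full, then three cumulatives: the restore needs only two.
print(restore_chain(["full", "cumulative", "cumulative", "cumulative"]))     # [0, 3]
```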
[Diagram: incremental backup example — a full backup of files 1, 2, and 3 on Monday; incrementals then capture file 4 on Tuesday, file 3 (changed) on Wednesday, and file 5 on Thursday; restoring production (files 1, 2, 3, 4, 5) requires the full backup plus every subsequent incremental.]

- Key features:
  - Files that have changed since the last full or incremental backup are backed up
  - Fewest files to back up, therefore faster backups and less storage space
  - Longer restores, because the last full and all subsequent incremental backups must be applied
[Diagram: cumulative backup example — a full backup of files 1, 2, and 3 on Monday; cumulatives then capture file 4 on Tuesday, files 4 and 5 on Wednesday, and files 4, 5, and 6 on Thursday; restoring production (files 1, 2, 3, 4, 5, 6) requires only the full backup plus the most recent cumulative.]

- Key features:
  - More files to back up, therefore backups take more time and use more storage space
  - Much faster restores, because only the last full and the last cumulative backup must be applied
Lesson: Summary
Topics in this lesson included:
- Hot and cold backups
- The levels of backup granularity
This lesson provided an introduction to backup methods and levels of backup granularity.
So far, we have discussed the importance of the backup plan and the different methods used when
creating a backup set. This lesson provides an overview of the different topologies and media types
that are used to support creating a backup set.
The objectives for this lesson are shown here. Please take a moment to read them.
There are three basic topologies used in a backup environment: Direct Attached Based Backup, LAN Based Backup, and SAN Based Backup. There is also a fourth topology, called Mixed, formed by combining two or more of these topologies in a given situation.
[Diagram: direct-attached backup — the backup client sends the data directly to the backup media through a local storage node; only metadata crosses the LAN to the backup server and catalog.]
Here, the backup data flows directly from the host to be backed up to the tape, without utilizing the
LAN. In this model, there is no centralized management and it is difficult to grow the environment.
[Diagram: LAN-based backup — a mail server acting as a backup client sends both data and metadata over the LAN; the backup server records the metadata while a storage node writes the data to the backup device.]
In this model, the backup data flows from the host to be backed up to the tape through the LAN. There
is centralized management, but there may be an issue with the LAN utilization since all data goes
through it.
[Diagram: SAN-based (LAN-free) backup — backup data travels over the SAN to the backup device, while metadata still crosses the LAN to the backup server.]
A SAN based backup, also known as LAN free backup, is achieved when there is no backup data
movement over the LAN. In this case, all backup data travels through a SAN to the destination backup
device.
This type of backup still requires network connectivity from the Storage Node to the Backup Server,
since metadata always has to travel through the LAN.
[Diagram: mixed topology — a mail server backup client sends its data over the LAN while other servers send data over the SAN to the backup device; metadata goes to the backup server in both cases.]
A SAN/LAN mixed based backup environment is achieved by using two or more of the topologies
described in the previous slides. In this example, some servers are SAN based while others are LAN
based.
Backup Media
- Tape
  - Traditional destination for backups
  - Sequential access
  - No protection
- Disk
  - Random access
  - Protected by the storage array (RAID, hot spares, etc.)
There are two common types of Backup media: Tape and Disk.
[Diagram: data from multiple backup streams interleaved onto a single tape.]
Tape drive streaming is recommended by all vendors in order to keep the drive busy; if the drive is not kept busy during the backup (write) process, performance suffers. Multiplexing several streams improves backup performance drastically, but it introduces one issue as well: the backup data becomes interleaved, so recovery times increase.
Backup to Disk
- Backup to disk minimizes tape in backup environments by using disk as the primary destination device
  - Cost benefits
  - No process changes needed
  - Better service levels
Backup to disk replaces tape and its associated devices with disk as the primary target for backup. Backup-to-disk systems offer major advantages over equivalent-scale tape systems in terms of capital costs, operating costs, support costs, and quality of service. It can be implemented fully on day one or through a phased approach.
[Chart: backup and restore times in minutes, disk versus tape, for a typical scenario of 800 users with 75 MB mailboxes and a 60 GB database; the tape restore takes on the order of 108 minutes, roughly five times longer than the disk restore.]
This example shows a typical recovery scenario using tape and disk. As you can see, recovery from disk is much faster than recovery from tape. It is important to keep in mind that this example involves data recovery only; the time it takes to bring the application online is a separate matter. Even so, you can see that the benefit was a restore roughly five times faster than it would have been with tape.
[Chart: recovery time (database restore plus log playback) compared for backup on ATA disk and backup on tape, for a typical scenario of 800 users with 75 MB mailboxes, a 60 GB database restore, and 500 MB of log playback; total recovery times shown for the disk-based path are roughly 17 to 19 minutes.]
The diagram shows typical recovery scenarios using different technical solutions. As seen in the slide, recovery from a local replica, or clones, provides the quickest recovery method.
It is important to note that using clones on disk enables you to make more copies of your data more often. This improves RPO (the point from which you can recover). It also improves RTO, because the log files are smaller, reducing the log playback time.
In a traditional approach to backup and archive, businesses take a backup of production. Typically, backup jobs use weekly full backups and nightly incremental backups. Based on business requirements, they then copy the backup jobs and eject the tapes to have them sent offsite, where they are stored for a specified amount of time.
The problem with this approach is simple: as the production environment grows, so does the backup environment.
Backup/recovery and archiving support different business needs and goals. This slide compares and contrasts some of the significant differences.
The recovery process is much more important than the backup process. It is based on the appropriate
recovery-point objectives (RPOs) and recovery-time objectives (RTOs). The process usually drives a
decision to have a combination of technologies in place, from online local replicas, to backup to disk,
to backup to tape for long-term, passive RPOs.
Archive processes are determined not only by the required retention times, but also by retrieval-time
service levels and the availability requirements of the information in the archive.
For both processes, a combination of hardware and software is needed to deliver the appropriate
service level. The best way to discover the appropriate service level is to classify the data and align the
business applications with it.
Lesson: Summary
Topics in this lesson included:
- The DAS, LAN, SAN, and Mixed topologies
- Backup media considerations
This lesson provided an overview of the different topologies and media types that support creating a backup set.
We have discussed the planning and operations of creating a backup. This lesson provides an overview
of management activities and applications that help manage the backup and recovery process.
The objectives for this lesson are shown here. Please take a moment to read them.
There are two types of user interfaces: the Command Line Interface (CLI) and the Graphical User Interface (GUI).
Command Line Interface (CLI)
- Backup administrators usually write scripts to automate common tasks, such as sending reports via email
Graphical User Interface (GUI)
- Controls the backup and restore process
- Multiple backup servers
- Multiple storage nodes
- Multiple platforms/operating systems
- A single, easy-to-use interface that provides the most common (if not all) administrative tasks
Shown here are the common tasks associated with managing backup or restore activity using the backup/restore application.
Backup:
- Configuring backups to start automatically, most (if not all) of the time
- Enabling the backup client to initiate its own backups
Restore:
- There is usually a separate GUI to manage the restore process
- Information is pulled from the backup catalog while the user selects the files to be restored
- Once the selection is finished, the backup server starts reading from the required backup media, and the files are sent to the backup client
Backup Reports
- Backup products also offer reporting features
- These features rely on the backup catalog and log files
- Reports are meant to be easy to read and to provide important information, such as:
  - Amount of data backed up
  - Number of completed backups
  - Number of incomplete (failed) backups
  - Types of errors that may have occurred
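Conceptually, such a report is just an aggregation over the per-session records in the catalog and logs; a minimal sketch follows (the field names are illustrative, not any product's log format):

```python
from collections import Counter

def summarize_sessions(sessions):
    """Roll per-session records up into a simple backup report."""
    statuses = Counter(s["status"] for s in sessions)
    completed_bytes = sum(s["bytes"] for s in sessions if s["status"] == "completed")
    errors = Counter(s["error"] for s in sessions if s.get("error"))
    return {
        "completed": statuses["completed"],
        "failed": statuses["failed"],
        "data backed up (GB)": round(completed_bytes / 2**30, 2),
        "error types": dict(errors),
    }

sessions = [
    {"status": "completed", "bytes": 42 * 2**30},
    {"status": "completed", "bytes": 7 * 2**30},
    {"status": "failed", "bytes": 0, "error": "media full"},
]
print(summarize_sessions(sessions))
```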
Lesson: Summary
Topics in this lesson included:
- The features and functions of common backup/recovery applications
- Backup/recovery process management considerations
- The importance of the information found in backup reports and in the backup catalog
This lesson provided an overview of Backup and Recovery management activities and tools including:
The Backup Application process and user interface; Reports; and the Backup Catalog.
Module Summary
Key points covered in this module:
- Best practices for planning Backup and Recovery
- The common media and types of data that are part of a Backup and Recovery strategy
- The common Backup and Recovery topologies
- The Backup and Recovery process
- Management considerations for Backup and Recovery
These are the key points covered in this module. Please take a moment to review them.
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
At this point, let's apply what we've learned to some real-world examples. In this case, we will describe EMC's product implementation of a Backup and Recovery solution.
EMC NetWorker
Tiered Protection and Recovery Management
- Remove risk: faster and more consistent data backup
- Improve reliability: keep recovery copies fresh and reduce process errors

[Diagram: protection tiers mapped against service-level requirements, from basic tape backup and recovery (low) through the disk-backup option to advanced backup with snapshot management (high).]
NetWorker's installed base of more than 20,000 customers worldwide is a testament to the product's market leadership.
Data-growth rates are accelerating, and the spectrum of data and systems living in these environments runs the gamut from key applications that are central to the business to other types of information that may be less important.
What is interesting is that the industry has been somewhat stuck for several years on a one-size-fits-all strategy for backup and recovery. We're referring to a basic backup scenario, or traditional tape backup.
Tape backup serves a noble purpose and is working very well for some companies; it's been EMC's core business for some time, so EMC knows it well. But shifting market dynamics, as well as more demanding business environments, have led to other important choices for backup.
Today, traditional tape faces the challenge of meeting service-level requirements for the protection and availability of an ever-increasing quantity of enterprise data. This is why EMC has built into NetWorker key options to meet the needs of a wide range of environments, including the ability to use disk for backup and to take advantage of advanced-backup capabilities that connect backup with array-based snapshot and replication management. These provide essentially the highest possible performance levels for backup and recovery. As the value of information changes over time, you may choose any one of these, or a combination thereof, to meet your needs.
- Complete coverage
  - Critical applications
  - Heterogeneous platforms and storage
  - Scalable architecture
  - 256-bit AES encryption and secure authentication
- Centralized management
  - Graphical user interface
  - Customizable reporting
  - Wizard-driven configuration
- Performance
  - Data multiplexing
  - Advanced indexing
  - Efficient media management

[Diagram: a backup server protecting key applications over the LAN and NAS devices via NDMP, with a storage node on the SAN writing to a tape library.]
The first key focus is on providing complete coverage. Enterprise protection means the ability to cover all the components in the environment. NetWorker provides data protection with the widest heterogeneous operating system support, and is integrated with leading databases and applications for complete data protection.
A single NetWorker server can be used to protect all clients and servers in the environment, or secondary servers, which EMC calls Storage Nodes, can be employed as a conduit for additional processing power or to protect large, critical servers directly across a SAN without having to take data back over the network. Such LAN-free backup is standard with NetWorker.
NetWorker can easily back up LAN, SAN, or WAN environments, with coverage for key storage such as NAS. As a matter of fact, NetWorker's NAS-protection capabilities, leveraging the Network Data Management Protocol (NDMP), are unequaled.
The key here is that NetWorker can easily grow and scale as needed in the environment and provide advanced functionality, including clustering technologies, open-file protection, and compatibility with tape hardware and the new class of virtual-tape and virtual-disk libraries.
While NetWorker encompasses all these pieces of the environment, EMC has made sure there is a common set of management tools.
With NetWorker, EMC has focused on what it takes, in environments both large and small, to get the best performance possible in terms of both speed and reliability. This means capabilities such as multiplexing, to protect data as quickly as possible while using the backup storage's maximum bandwidth. It also means ensuring that the way EMC indexes and manages the saving of data is designed to provide not only the best performance, but also stability and reliability.
[Diagram: offline (cold) backup versus 24x7 operations — in the cold model the application is shut down, backed up, and restarted, incurring downtime; with a NetWorker Module, the save runs against the application while it remains online.]
Applications can be backed up either offline or online. NetWorker by itself can back up closed applications as flat files. During an offline, or cold, backup, the application is shut down, backed up, and restarted after the backup is finished.
This is fine, but during the shutdown and backup period, the application is unavailable, which is not acceptable in today's business environments. This is why EMC has worked to integrate NetWorker with applications to provide online backup, specifically with the use of NetWorker in conjunction with NetWorker Modules.
During an online, or hot, backup, the application is open and is backed up while open. The NetWorker Module extracts data for backup with an API; the application need not be shut down, and remains open while the backup finishes.
NetWorker supports a wide range of applications for online backup with granular-level recovery, including:
- Oracle
- Microsoft Exchange
- Microsoft SQL Server
- Lotus Notes
- Sybase
- Informix
- IBM DB2
- EMC Documentum
Media-Management Advantages
- Open Tape Format
  - Datastream multiplexing
  - Self-contained indexing
  - Cross-platform format (UNIX, Windows, Linux)
- High performance
  - Simultaneous-access operations
  - No penalty on restore versus tape
- Superior capability

[Diagram: NetWorker backup servers on UNIX/Linux and Windows sharing a SAN with key applications, NAS, a storage node, a tape library, and a disk-backup target.]
The focus here is resolving the top pain points of traditional tape-based backup.
Performance: NetWorker backup to disk allows simultaneous-access operations on a volume, both reads (restore, staging, cloning) and writes (backups). With NetWorker, as opposed to traditional tape-only backup, you don't "pay a penalty on restore."
Also, cloning from disk to tape is up to 50% faster. Why? As soon as a Save Set (backup job) is complete, the cloning process can begin without the administrator having to wait for all the backup jobs to complete. NetWorker can back up to disk and clone to tape at the same time. You don't have to spend 12-16 hours a day running clone operations (tape-to-tape copies); in fact, you might actually be able to eliminate the clone jobs. Some NetWorker customers have seen cloning times reduced from 12-16 hours daily to three to four hours daily.
Cloning from disk to tape also augments the disaster-recovery strategy for tape. As data grows, more copies must be sent offsite. Because NetWorker backup to disk improves cloning performance, you can continue to meet the daily service-level agreements for getting tapes offsite to a vaulting provider.
Taking the idea of leveraging disk even further leads us into a discussion of NetWorker's advanced backup capability, which also leverages disk-based technologies.
- Instant restore
- Off-host backups
- Achieve stringent recovery-time objectives (RTOs) and recovery-point objectives (RPOs)

[Diagram: snapshots of production information taken at 11:00 a.m. and 5:00 p.m. are available for instant recovery; a backup snap at 10:00 p.m. feeds the backup server without touching the application host.]
Disk-solution providers, like EMC, provide array-based abilities to perform snapshots and replication.
These point-in-time copies of data allow for instant recovery of disk and data volumes. Many are
likely familiar with array-based replication or snapshot capabilities.
NetWorker is engineered to take advantage of these capabilities by providing direct tie-ins with EMC
offerings such as CLARiiON with SnapView, or Symmetrix with TimeFinder/Snap. This enables you
to begin to meet the most stringent recovery requirements.
In a study done in the spring of 2004, the Taneja Group identified that the market intends to rely on
snapshots for ensuring application-data availability and rapid recoveries. The figures represent a scale
of one to five, with one as the low point, five as the high point:
y Rapid application recovery (4.34)
y Ability to automate backup to tape (4.13)
y Instant backup (3.98)
y Roll back to point in time (3.88)
y Integration with backup strategy (3.87)
y Flexibility to leverage hardware (3.61)
y Multiple fulls throughout day (3.49)
Advanced Backup
- Administer snapshots in NetWorker
  - Schedule, create, retain, and delete snapshots by policy
- Third-party integration
  - Software-based (RecoverPoint)
- Application recovery

[Diagram: heterogeneous clients and key applications on the LAN; the backup server coordinates a storage node, a tape library, and a CLARiiON with SnapView over the SAN.]
In addition to traditional backup-and-recovery application modules for disk and tape, the snapshot management capability called NetWorker PowerSnap enables you to meet demanding service-level agreement requirements in both tape and disk environments by seamlessly integrating snapshot technology and applications. NetWorker PowerSnap software works with NetWorker Modules to enable snapshot backups of applications, with consistency.
PowerSnap performs snapshot management by policy, just like standard backup policies to tape or disk. It uses these policies to determine how many snapshots to create, how long to retain them, and when to back specified snapshots up to tape, all based on business needs that you define.
For example, snapshots might be taken every few hours, with the three most recent retained. You can easily use any of those snapshots to back up to tape in an off-host fashion, that is, with no impact on the application servers.
PowerSnap manages the full life cycle of snapshots, including creation, scheduling, backups, and expiration. This, along with its orchestration with applications, provides a comprehensive solution for complete application-data protection that helps you meet the most stringent RTOs and RPOs.
- Block-level backups
  - Host-based snapshot
  - Targeted at high-density file systems (10,000,000+ files, 1,000,000+ directories)
  - Single-file restore
  - Sparse backups
- High performance
  - Significant backup-and-restore performance improvement, up to 10 times faster
  - Drives tape at rated speeds
  - Optional network-accelerated serverless backup with Cisco intelligent switch
On servers with huge numbers of files and directories, what we refer to as high-density file systems, backup and recovery are particularly challenging. With so many files, traditional backup struggles to keep up with backup windows.
NetWorker SnapImage enables block-level backup of these file systems while maintaining the ability to restore a single file. SnapImage is intelligent enough to also support sparse backups:
- Sparse files contain data with portions of empty blocks, or zeroes.
- NetWorker backs up only the non-zero blocks, thereby reducing both the time for backup and the amount of backup-media space consumed.
- Sparse-file examples: large database files with deleted data or unused database fields, and files from imaging applications.
- With the NetWorker SnapImage Module, backup and recovery performance on servers with high-density file systems improves significantly:
  - The time required to back up 18.8 million 1 KB files in a 100 GB file system with a 4 KB block size can be reduced from 31 hours to seven.
  - The time required to perform a Save Set restore of one million 4 KB files on a 5.36 GB internal disk can be reduced from 72 minutes to seven.
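The core idea of a sparse, block-level backup is easy to state in code; here is a sketch (ours, using the 4 KB block size from the example above):

```python
BLOCK_SIZE = 4096  # bytes; matches the 4 KB block size in the example above

def nonzero_blocks(path):
    """Yield (offset, block) for every block that contains any non-zero
    byte; all-zero blocks are simply skipped, which is the essence of a
    sparse backup."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            if block.rstrip(b"\x00"):      # anything left after stripping zeroes?
                yield offset, block
            offset += len(block)

# A restore would write each saved block back at its recorded offset and
# leave the unrecorded ranges as holes (zeroes).
```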
Business Challenge:
- Complex application environment
- No backup window
- Recovery-time objective: restore 24 TB in two hours

Value Proposition:
- Zero backup window for applications (server-free backup)
- Eliminated data-loss risk
- Reduced management overhead

[Diagram: at the production site, an application host with PowerSnap and a NetWorker storage node back up Symmetrix DMX and CLARiiON CX arrays to a tape library over the SAN; SRDF/S replicates the Symmetrix DMX to the disaster-recovery site, where a disaster-recovery host and storage node provide offsite tape protection.]
Summary
Key points covered in this topic:
- EMC's product implementation of a Backup and Recovery solution
In this topic, we described EMC's product implementation of a Backup and Recovery solution.
This concludes the module.
Local Replication
Upon completion of this module, you will be able to:
- Discuss replicas and the possible uses of replicas
- Explain consistency considerations when replicating file systems and databases
- Discuss host- and array-based replication technologies:
  - Functionality
  - Differences
  - Considerations
  - Selecting the appropriate technology
In this module, we will look at what replication is, technologies used for creating local replicas, and
things that need to be considered when creating replicas.
The objectives for the module are shown here. Please take a moment to read them.
What is Replication?
- Replica: an exact copy (in all details)
- Replication: the process of reproducing data

[Diagram: replication produces a replica from the original.]
Local replication is a technique for ensuring Business Continuity by making exact copies of data. With
replication, data on the replica is identical to the data on the original at the point-in-time that the
replica was created.
Examples:
- Copying a specific file
- Copying all the data used by a database application
- Copying all the data in a UNIX Volume Group (including the underlying logical volumes, file systems, etc.)
- Copying data on a storage array to a remote storage array
Considerations
- What makes a replica good:
  - Recoverability: considerations for resuming operations with the primary
  - Consistency/restartability: how this is achieved by the various technologies
- Kinds of replicas:
  - Point-in-Time (PIT) = finite RPO
  - Continuous = zero RPO
[Diagram: the host I/O stack — DBMS and file system buffers sit above volume management, multi-pathing software, device drivers, and HBAs, which connect to the physical volume.]
Most OS file systems buffer data in the host before the data is written to the disk on which the file system resides.
- For data consistency on the replica, the host buffers must be flushed prior to the creation of the PIT. If the host buffers are not flushed, the data on the replica will not contain the information that was buffered on the host.
- Otherwise, some level of recovery will be necessary when accessing the replica.
Note: If the file system is unmounted prior to the creation of the PIT, no recovery is needed when accessing data on the replica.
-5
Data
Logs
-6
All logging database management systems use the concept of dependent write I/Os to maintain
integrity. This is the definition of dependent write consistency. Dependent write consistency is
required for protection against local power outages, loss of local channel connectivity, or loss of
storage devices. The logical dependency between I/Os is built into database management systems, certain
applications, and operating systems.
-7
Database
Application
Data
Log
Database applications require that for a transaction to be deemed complete, a series of writes have to
occur in a particular order (Dependent Write I/O). These writes would be recorded on the various
devices/file systems.
y In this example, steps 1-4 must complete for the transaction to be deemed complete
Step 4 is dependent on Step 3 and will occur only if Step 3 is complete
Step 3 is dependent on Step 2 and will occur only if Step 2 is complete
Step 2 is dependent on Step 1 and will occur only if Step 1 is complete
y Steps 1-4 are written to the database's buffer and then to the physical disks
-8
Data
Replica
Log
Data
Log
Consistent
Note: In this example, the database is online.
At the point in time when the replica is created, all the writes to the source devices must be captured
on the replica devices to ensure data consistency on the replica.
y In this example, steps 1-4 on the source devices must be captured on the replica devices for the
data on the replicas to be consistent.
-9
Replica
Data
Log
Inconsistent
Note: In this example, the database is online.
Creating a PIT for multiple devices happens quickly, but not instantaneously.
y Steps 1-4 which are dependent write I/Os have occurred and have been recorded successfully on
the source devices
y It is possible that steps 3 and 4 were copied to the replica devices, while steps 1 and 2 were not
copied
y In this case, the data on the replica is inconsistent with the data on the source. If a restart were to be
performed on the replica devices, Step 4 which is available on the replica might indicate that a
particular transaction is complete, but all the data associated with the transaction will be
unavailable on the replica making the replica inconsistent.
- 10
Source
Replica
Data
Database
Application
(Offline)
Log
Consistent
Database replication can be performed with the application offline (i.e., the application is shut down, no
I/O activity) or online (i.e., while the application is up and running). If the application is offline, the
replica will be consistent because there is no activity. However, consistency is an issue if the database
application is replicated while it is up and running.
- 11
Source
Replica
Data
Log
Inconsistent
In the situation shown, Steps 1-4 are dependent write I/Os. The replica is inconsistent because Steps 1
and 2 never made it to the replica. To make the database consistent, some level of recovery would have
to be performed. In this example, it could be done by simply discarding the transaction that was
represented by Steps 1-4. Many databases are capable of performing such recovery tasks.
- 12
Replica
Consistent
- 13
At PIT:    Source = Target
Later:     Source ≠ Target
Resynch:   Source = Target
Changes occur on the production volume after the creation of a PIT. Changes could also occur on the
target. Typically the target device re-synchronizes with the source device at some future time in order
to obtain a more recent PIT.
Note: The replication technology employed should have a mechanism to keep track of changes. This
makes the re-synchronization process much faster. If the replication technology does not track changes
between the source and target, every resynchronization operation has to be a full operation.
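As an illustration of such a mechanism, here is a sketch of a change tracker using a bitmap with one bit per fixed-size extent. The 32 KB granularity and in-memory structures are illustrative assumptions; real arrays implement this in microcode.

EXTENT = 32 * 1024  # 32 KB tracking granularity, one of the sizes vendors use

class ChangeTracker:
    def __init__(self, device_size):
        # one bit (here, one int) per extent of the device
        self.bits = [0] * ((device_size + EXTENT - 1) // EXTENT)

    def record_write(self, offset, length):
        # flag every extent touched by the write
        first = offset // EXTENT
        last = (offset + length - 1) // EXTENT
        for i in range(first, last + 1):
            self.bits[i] = 1

    def extents_to_copy(self):
        # at re-synchronization time, only flagged extents are copied
        return [i for i, changed in enumerate(self.bits) if changed]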
- 14
- 15
Logical Storage
LVM
Physical Storage
Logical Volume Managers (LVMs) introduce a logical layer between the operating system and the
physical storage. LVMs have the ability to define logical storage structures that can span multiple
physical devices. The logical storage structures appear contiguous to the operating system and
applications.
The fact that logical storage structures can span multiple physical devices provides flexibility and
additional functionality:
y Dynamic extension of file systems
y Host based mirroring
y Host based striping
The Logical Volume Manager provides a set of operating system commands, library subroutines, and
other tools that enable the creation and control of logical storage.
- 16
Volume Groups
y One or more Physical Volumes
form a Volume Group
y LVM manages Volume Groups
as a single entity
y Physical Volumes can be added
and removed from a Volume
Group as necessary
y Physical Volumes are typically
divided into contiguous equal-sized disk blocks
Physical
Volume 1
Physical
Volume 2
Physical
Volume 3
Volume Group
Physical
Disk
Block
A Volume Group is created by grouping together one or more Physical Volumes. Physical Volumes:
y Can be added or removed from a Volume Group dynamically
y Cannot be shared between Volume Groups; the entire Physical Volume becomes part of a Volume
Group
Each Physical Volume is partitioned into equal-sized data blocks. The size of a Logical Volume is
based on a multiple of the equal-sized data block.
The Volume Group is handled as a single unit by the LVM.
y A Volume Group as a whole can be activated or deactivated
y A Volume Group would typically contain related information. For example, each host would have
a Volume Group which holds all the OS data, while applications would be on separate Volume
Groups.
Logical Volumes are created within a given Volume Group. A Logical Volume can be thought of as a
virtual disk partition, while the Volume Group itself can be thought of as a disk. A Volume Group can
have a number of Logical Volumes.
- 17
Logical Volumes
Logical Volume
Logical Volume
Physical Volume 1
Physical Volume 2
Volume Group
Logical Disk
Block
Physical Volume 3
Physical Disk
Block
Logical Volumes (LV) form the basis of logical storage. They contain logically contiguous data blocks
(or logical partitions) within the volume group. Each logical partition is mapped to at least one
physical partition on a physical volume within the Volume Group. The OS treats an LV like a physical
device and accesses it via device special files (character or block). A Logical Volume:
y Can only belong to one Volume Group. However, a Volume Group can have multiple LVs
y Can span multiple physical volumes
y Can be made up of physical disk blocks that are not physically contiguous
y Appears as a series of contiguous data blocks to the OS
y Can contain a file system or be used directly. Note: There is a one-to-one relationship between an LV
and a File System
Note: Under normal circumstances, there is a one-to-one mapping between a logical and a physical
partition. A one-to-many mapping between a logical and a physical partition leads to mirroring of
Logical Volumes.
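A minimal sketch of that mapping follows; the volume and partition names are invented and do not reflect any particular vendor's LVM.

# Each logical partition maps to one physical partition on some PV,
# or to several physical partitions when the Logical Volume is mirrored.

lv_map = {
    0: [("pv1", 10)],               # logical partition 0 -> PV1, partition 10
    1: [("pv1", 11)],
    2: [("pv2", 3)],                # an LV may span physical volumes
}

mirrored_lv_map = {
    0: [("pv1", 10), ("pv2", 40)],  # one-to-many mapping = mirrored LV
}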
- 18
PVID1
Host
Logical Volume
VGDA
Physical
Volume 1
VGDA
Physical
Volume 2
Logical Volume
PVID2
Logical Volumes may be mirrored to improve data availability. In mirrored logical volumes, every
logical partition maps to 2 or more physical partitions on different physical volumes.
y Logical volume mirrors may be added and removed dynamically
y A mirror can be split and the data it contains used independently
The advantages of mirroring a Logical Volume are high availability and load balancing during reads if
the parallel policy is used. The cost of mirroring is the additional CPU cycles necessary to perform two
writes for every host write, and the longer time needed to complete both writes.
- 19
Many Logical Volume Manager vendors will allow the creation of File System snapshots while a File
System is mounted. File System snapshots are typically easier to manage than creating mirrored logical
volumes and then splitting them.
- 20
Host based replicas can usually be presented back to the same server:
y Using the replica from the same host for any BC operation adds an additional CPU burden on the
server
y Replica is useful for fast recovery if there is any logical corruption on the source at the File System
level
y Replica itself may become unavailable if there is a problem at the Volume Group level
y If the Server fails, then the replica and the source would be unavailable until the server is brought
online or another server is given access to the Volume group
y Presenting an LVM based local replica to a second host is usually not possible because the replica
will still be part of the volume group which is usually accessed by one host at any given time
Keeping track of changes after the replica has been created:
y If changes are not tracked, all future resynchronization will be a full operation
y Some LVMs may offer incremental resynchronization
- 21
Array
Source
Production
Server
Replica
Business
Continuity Server
- 22
Logical Volume 1
c12t1d1
c12t1d2
File System 1
Source
Vol 1
Replica
Vol 1
Source
Vol 2
Replica
Vol 2
Volume Group 1
- 23
Attached
Read/Write
Not Ready
Target
Source
Array
Full volume mirroring is achieved by attaching the target device to the source device and then copying
all the data from the source to the target. The target is unavailable to its host while it is attached to the
source, and the synchronization occurs.
y Target (Replica) device is attached to the Source device and the entire data from the source device
is copied over to the target device
y During this attachment and synchronization period, the Target device is unavailable
- 24
Detached - PIT
Read/Write
Read/Write
Target
Source
Array
After the synchronization is complete, the target can be detached from the source and be made
available for Business Continuity operations. The point-in-time (PIT) is determined by the time of
detachment or separation of the Source and Target. For example, if the detachment time is 4:00 PM,
the PIT of the replica is 4:00 PM.
- 25
For future re-synchronization to be incremental, most vendors have the ability to track changes at some
level of granularity, such as 512 byte block, 32 KB, etc. Tracking is typically done with some kind of
bitmap.
The target device must be at least as large as the source device. For full volume copies, the minimum
amount of storage required is the same as the size of the source.
- 26
Copy on First Access (COFA) provides an alternate method to create full volume copies. Unlike Full
Volume mirrors, the replica is immediately available when the session is started (no waiting for full
synchronization).
y The PIT is determined by the time of activation of the session. Just like the full volume mirror
technology, this method requires the Target devices to be at least as large as the source devices.
y A protection map is created for all the data on the Source device at some level of granularity (e.g.,
512 byte block, 32 KB, etc.). Then the data is copied from the source to the target in the
background based on the mode with which the replication session was invoked.
- 27
Read/Write
Target
Source
Write to Target
Read/Write
Read/Write
Source
Target
Read/Write
Source
Target
In the Copy on First Access mode (or the deferred mode), data is copied from the source to the target
only when:
y A write is issued for the first time after the PIT to a specific address on the source
y A read or write is issued for the first time after the PIT to a specific address on the target.
Since data is only copied when required, if the replication session is terminated, the target device only
has data that was copied (not the entire contents of the source at the PIT). In this scenario, the data on
the target cannot be used as it is incomplete.
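The deferred-mode rules above can be summarized in a short sketch. The protection map and block operations are illustrative assumptions, not any vendor's implementation.

copied = set()   # protection map: blocks already copied to the target

def on_source_write(block, source, target):
    # first write to this address on the source copies the original data first
    if block not in copied:
        target[block] = source[block]
        copied.add(block)
    # the new write is then applied to source[block] by the caller

def on_target_access(block, source, target):
    # first read or write to this address on the target pulls the data across
    if block not in copied:
        target[block] = source[block]
        copied.add(block)
    return target[block]

If the session terminates, only the blocks in `copied` exist on the target, which is why the target data is unusable until a full copy has completed.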
- 28
In Full Copy Mode, the target is made available immediately and all the data from the source is copied
over to the target in the background.
y During this process, if a data block that has not yet been copied to the target is accessed, the
replication process jumps ahead and moves the required data block first.
y When a full copy mode session is terminated (after full synchronization), the data on the target is
still usable as it is a full copy of the original data.
- 29
Unlike full volume replicas, the target devices for pointer based replicas only hold pointers to the
location of the data but not the data itself. When the copy session is started, the target device holds
pointers to the data on the source device. The primary advantage of pointer based copies is the
reduction in storage requirement for the replicas.
- 30
Source
Save Location
The original data block from the Source is copied to the save location when a data block is first
written to after the PIT.
y Prior to a new write to the source or target device:
Data is copied from the source to a save location
The pointer for that specific address on the Target then points to the save location
Writes to the Target result in writes to the save location and the updating of the pointer to the
save location
y If a write is issued to the source for the first time after the PIT, the original data block is copied to
the save location and the pointer is updated from the Source to the save location.
y If a write is issued to the Target for the first time after the PIT, the original data is copied from the
Source to the Save location, the pointer is updated and then the new data is written to the save
location.
y Reads from the Target are serviced by the Source device or from the save location, based on where
the pointer directs the read:
Source - when data has not changed since the PIT
Save Location - when data has changed since the PIT
Data on the replica is a combined view of unchanged data on the Source and the save location. Hence,
if the Source device becomes unavailable, the replica no longer has valid data.
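A sketch of that pointer mechanism follows, with illustrative structures; writes to the Target follow the same first-access rule but are omitted for brevity.

pointers = {}          # block -> index in save_location; absent = still on source
save_location = []

def source_write(block, new_data, source):
    if block not in pointers:                 # first write after the PIT
        save_location.append(source[block])   # preserve the original data
        pointers[block] = len(save_location) - 1
    source[block] = new_data                  # then apply the new write

def replica_read(block, source):
    if block in pointers:                     # data changed since the PIT
        return save_location[pointers[block]]
    return source[block]                      # unchanged: serviced by the source

Note how replica_read falls through to the source device, which is exactly why the replica is useless if the source becomes unavailable.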
- 31
It is too expensive to track changes at a bit by bit level because it would require an equivalent amount
of storage to keep track of which bit changed for both the Source and the Target.
Some level of granularity is chosen and a bit map is created -- one for the Source and one for the
Target. The level of granularity is vendor specific.
- 32
Target
Source
Target
Target
At PIT
After PIT
Re-synch
(Source to
Target)
0 = unchanged
2007 EMC Corporation. All rights reserved.
1 = changed
Business Continuity Local Replication - 33
- 33
[Diagram: Point-in-Time copies created from the Source every six hours - 12:00 P.M., 06:00 P.M., 12:00 A.M. - along a 24-hour timeline]
Most array based replication technologies allow the Source devices to maintain replication
relationships with multiple Targets.
y This can also reduce RTO because the restore can be a differential restore
y Each PIT could be used for a different BC activity and also as restore points
In this example, a PIT is created every six hours from the same source. If any logical or physical
corruption occurs on the Source, the data can be recovered from the latest PIT and at worst, the RPO
will be 6 hours.
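A small sketch of rotating PITs across multiple targets, assuming four targets and the six-hour interval from the example (structures are illustrative):

from collections import deque

targets = deque(maxlen=4)        # four replicas cover a 24-hour window

def create_pit(source_state, timestamp):
    # newest PIT replaces the oldest once all four targets are in use
    targets.append((timestamp, dict(source_state)))

def latest_restore_point():
    # worst-case RPO equals the creation interval: 6 hours here
    return targets[-1] if targets else None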
- 34
[Diagram: a Source with two Replicas - one Consistent, one Inconsistent]
Most array based replication technologies allow the creation of consistent replicas by holding I/O to all
devices simultaneously when the PIT is created.
y Typically, applications are spread out over multiple devices
Could be on the same array or multiple arrays
y Replication technology must ensure that the PIT is consistent for the whole application
Need mechanism to ensure that updates do not occur while PIT is created
y Hold I/O to all devices simultaneously for an instant, create PIT and release I/O
Cannot hold I/O for too long, or applications will time out
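The hold-and-release sequence can be sketched as follows. The Device class and its lock are illustrative stand-ins for what array microcode does internally.

import threading

class Device:
    def __init__(self, data):
        self.data = dict(data)
        self.io_lock = threading.Lock()   # writers must hold this lock

    def snapshot(self):
        return dict(self.data)

def consistent_pit(devices):
    for d in devices:                     # hold I/O on every device briefly
        d.io_lock.acquire()
    try:
        return [d.snapshot() for d in devices]   # one consistent instant
    finally:
        for d in devices:                 # release quickly; apps must not time out
            d.io_lock.release()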
- 35
- 36
y Solution
Restore data from replica to production
The restore would typically be done in an incremental manner, and the
applications would be restarted even before the synchronization is
complete, leading to a very small RTO
- 37
Perform Restore
Based on the type of failure, choose to either perform a restore to the production devices or shift
production operations to the replica devices. In either case, the recommendation would be to stop
access to the production and replica devices, then identify the replica to be used for the restore or
restart operations.
The choice of replica depends on the consistency of the data on the replica and the desired RPO (e.g., a
business may create PIT replicas every 2 hours; if a failure occurs, then at most only 2 hours of data
would have been lost). If a replica has been written to (for application testing, for example) after the
creation of the PIT, then this replica may not be a viable candidate for the restore or restart.
Note: RTO is a key driver in the choice of replication technology. The ability to restore or restart
almost instantaneously after any failure is very important.
- 38
With Full Volume replicas, all the data that was on the source device when the PIT was created is
available on the Replica (either with Full Volume Mirroring or Full Volume Copies). With Pointer
Based Replicas and Full Volume Copies in deferred mode (COFA), access to all the data on the
Replica is dependent on the health (accessibility) of the original source volumes. If the original source
volume is inaccessible for any reason, pointer based or Full Volume Copy on First Access replicas are
of no use in either a restore or a restart scenario.
- 39
Full Volume replicas have a number of advantages over Pointer based (COFW) and Copy On First
Access technologies.
y The replica has the entire contents of the original source device from the PIT and any activity to
the replica has no performance impact on the source device (there is no COFA or COFW penalty)
y Full Volume replicas can be used for any BC activity
y The only disadvantage is that the storage requirements for the replica are at least equal to that of
the source devices
- 40
The main benefit of Pointer based copies is the lower storage requirement for the replicas. This
technology is also very useful when the changes to the Source are expected to be less than 30% after
the PIT has been created. Heavy activity on the Target devices may cause performance impact on the
Source because any first writes to the Target require data to be copied from the source to the Save
location. Also, any reads which are not in the save area have to be read from the Source device. The
Source device needs to be accessible for any restart or restore operations from the Target.
- 41
Listed here are some considerations for using Full Volume Copy on First Access (COFA).
The COFA technology requires at least the same amount of storage as the Source. The disadvantages
of the COFA penalty, and the fact that the replica would be of no use if the source volume were
inaccessible, make this technology less desirable. If a Full Copy mode is available, then always use the
Full Copy mode. The advantages are identical to that discussed for Full Volume replicas.
- 42
                      Full Volume           Pointer Based
Required Storage      100% of Source        Fraction of Source
Performance Impact    None                  Some
RTO                   Very small            Very small
Restore               No dependency on      Requires a healthy
                      the source device     source device
Data change           No limits             < 30%
This table summarizes the differences between Full Volume and Pointer Based replication technologies.
- 43
Module Summary
Key points covered in this module:
y Replicas and the possible use of Replicas
y Consistency considerations when replicating File
Systems and Databases
y Host and Array based Replication Technologies
Advantages/Disadvantages
Differences
Considerations
Selecting the appropriate technology
These are the key points covered in this module. Please take a moment to review them.
- 44
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
- 45
At this point, let's apply what we've learned to some real world examples. Upon completion of this
topic you will be able to:
List EMC's Local Replication Solutions for the Symmetrix and CLARiiON arrays;
Describe EMC's TimeFinder/Mirror Replication Solution; and
Describe EMC's SnapView - Snapshot Replication Solution
- 46
EMC TimeFinder/Clone
Full volume replication
EMC TimeFinder/Snap
Pointer based replication
All the Local Replication solutions that were discussed in this module are available on EMC
Symmetrix and CLARiiON arrays.
y EMC TimeFinder/Mirror and EMC TimeFinder/Clone are full volume replication solutions on the
Symmetrix arrays, while EMC TimeFinder/Snap is a pointer based replication solution on the
Symmetrix. EMC SnapView on the CLARiiON arrays allows full volume replication via
SnapView Clone and pointer based replication via SnapView Snapshot.
y EMC TimeFinder/Mirror: Highly available, ultra-performance mirror images of Symmetrix
volumes that can be non-disruptively split off and used as point-in-time copies for backups,
restores, decision support, or contingency uses.
y EMC TimeFinder/Clone: Highly functional, high-performance, full volume copies of Symmetrix
volumes that can be used as point-in-time copies for data warehouse refreshes, backups, online
restores, and volume migrations.
y EMC SnapView Clone: Highly functional, high-performance, full volume copies of CLARiiON
volumes that can be used as point-in-time copies for data warehouse refreshes, backups, online
restores, and volume migrations.
y EMC TimeFinder/Snap: High function, space-saving, pointer-based copies (logical images) of
Symmetrix volumes that can be used for fast and efficient disk-based restores.
y EMC SnapView Snapshot: High function, space-saving, pointer-based copies (logical images) of
CLARiiON volumes that can be used for fast and efficient disk-based restores.
EMC TimeFinder/Mirror and EMC SnapView Snapshot are discussed in more detail on the next few
slides.
- 47
EMC TimeFinder/Mirror is an array based local replication technology for Full Volume Mirroring on
EMC Symmetrix Storage Arrays.
y TimeFinder/Mirror Business Continuance Volumes (BCV) are devices dedicated to local
replication.
y The BCVs are typically established with a standard Symmetrix device to create a Full Volume
Mirror.
y After the data has been synchronized, the BCV can be split from its source device and used for
any BC task. TimeFinder controls are available in Open Systems and Mainframe environments.
- 48
Re-synchronization is incremental
STD
BCV
Establish
Incremental Establish
The TimeFinder Establish operation is the first step in creating a TimeFinder/Mirror replica. The
purpose of the establish operation is to synchronize the contents from the Standard device to the BCV.
The first time a BCV is established with a standard device, a full synchronization has to be performed.
Any future re-synchronization can be incremental in nature. The Symmetrix microcode can keep track
of changes made to either the Standard or the BCV.
y The Establish is a non-disruptive operation to the Standard device. I/O to Standard devices can
proceed during establish. Applications need not be quiesced during the establish operation.
y The Establish operation sets a Not Ready status on the BCV device. Hence, all I/O to the BCV
device must be stopped before the Establish operation is performed. Since BCVs are dedicated
replication devices, a BCV cannot be established with another BCV.
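The establish/split life cycle can be sketched as a simple state machine. The class, states, and method names are illustrative, not the actual TimeFinder interface.

class BCVPair:
    def __init__(self):
        self.state = "never_established"

    def establish(self):
        # first establish is a full copy; later establishes are incremental
        full = (self.state == "never_established")
        self.state = "synchronized"       # BCV is Not Ready to its host now
        return "full" if full else "incremental"

    def split(self):
        # the PIT is fixed at split time; changes are tracked on both sides
        assert self.state == "synchronized"
        self.state = "split"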
- 49
Changes tracked
STD
BCV
Split
The Point-in-Time of the replica is tied to the time when the Split operation is executed.
The Split operation separates the BCV from the Standard Symmetrix device and makes the BCV
device available for host access through its own device address. After the split operation, changes
made to the Standard or BCV devices are tracked by the Symmetrix Microcode. EMC
TimeFinder/Mirror ensures consistency of data on the BCV devices via the Consistent Split option.
- 50
STD
BCV
STD
BCV
The TimeFinder/Mirror Consistent Split option ensures that the data on the BCVs is consistent with the
data on the Standard devices. Consistent Split holds I/O across a group of devices using a single
Consistent Split command, thus all the BCVs in the group are consistent point-in-time copies. It is used
to create a consistent point-in-time copy of an entire system, database, or any associated set of
volumes.
The holding of I/Os can be either done by the EMC PowerPath multi-pathing software or the
Symmetrix Microcode (Enginuity Consistency Assist). With PowerPath-based consistent split
executed by the host doing the I/O, I/O is held at the host before the split.
Enginuity Consistency Assist (ECA) based consistent split can be executed by the host doing the I/O or
by a control host in an environment where there are distributed and/or related databases. I/O is held at
the Symmetrix until the split operation is completed. Since I/O is held at the Symmetrix, ECA can be
used to perform consistent splits on BCV pairs across multiple, heterogeneous hosts.
- 51
BCV
Incremental Restore
y Query
Provide current status of BCV/Standard volume
pairs
The purpose of the restore operation is to synchronize the data on the BCVs from a prior Point in Time
to the Standard devices. Restore is a recovery operation, so all I/Os to the Standard device should be
stopped and the device must be taken offline prior to a restore operation. The restore sets the BCV
device to a Not-Ready state, thus all I/Os to the BCV devices must be stopped and the devices must be
offline before issuing the restore command.
Operations on the Standard volumes can resume as soon as the restore operation is initiated, while the
synchronization of the Standards from the BCV is still in progress.
The query operation is used to provide current status of Standard/BCV volume pairs.
- 52
[Diagram: a Standard volume maintaining incremental relationships with multiple BCVs - establish at 2:00 a.m., split at 4:00 a.m., then establish or incremental restore and split again at 6:00 a.m.]
TimeFinder/Mirror allows a given Standard device to maintain incremental relationships with multiple
BCVs.
This means that different BCVs can be established and then split incrementally from a standard
volume at different times of the day. For example, a BCV that was split at 4:00 a.m. can be
re-established incrementally, even though another BCV was established and split at 5:00 a.m. In this
way, a user can split and incrementally re-establish volumes throughout the day or night and still keep
re-establish times to a minimum.
Incremental information can be retained between a STD device and multiple BCV devices, provided
the BCV devices have not been paired with different STD devices.
The incremental relationship is maintained between each STD/BCV pairing by the Symmetrix
Microcode.
- 53
BCV1
Standard
BCV2
Concurrent BCVs is a TimeFinder/Mirror feature that allows two BCVs to be simultaneously attached
to a standard volume. The BCV pair can be split, providing customers with two copies of their
data. Each BCV can be mounted online and made available for processing.
- 54
SnapView is software that runs on the CLARiiON Storage Processors and is part of the CLARiiON Replication
Software suite of products, which includes SnapView, MirrorView and SAN Copy.
SnapView can be used to make point in time (PIT) copies in 2 different ways: Clones, also called BCVs or
Business Continuity Volumes, are full copies, whereas Snapshots use a pointer-based mechanism. Full copies are
covered later, when we look at Symmetrix TimeFinder. SnapView Snapshots are covered here.
The generic pointer-based mechanism has been discussed in a previous section, so we'll concentrate on
SnapView.
Snapshots require a save area, called the Reserved LUN Pool. The Reserved part of the name implies that the
LUNs are reserved for use by CLARiiON software, and therefore cannot be assigned to a host. LUNs which
cannot be assigned to a host are known as private LUNs in the CLARiiON environment.
To keep the number of pointers, and therefore the pointer map, at a reasonable size, SnapView divides the LUN
to be snapped, called a Source LUN, into areas of 64 kB in size. Each of these areas is known as a chunk. Any
change to data inside a chunk causes that chunk to be written to the Reserved LUN Pool, if it is being modified
for the first time. The 64 kB copied from the Source LUN must fit into a 64 kB area in the Reserved LUN, so
Reserved LUNs are also divided into chunks for tracking purposes.
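The chunk arithmetic is simple enough to sketch in two lines; the 64 kB size comes from the text above, while the function name is invented.

CHUNK = 64 * 1024

def chunk_of(offset):
    return offset // CHUNK

# A 1-byte write at offset 200,000 falls in chunk 3 (bytes 196,608-262,143),
# so the whole 64 kB chunk is copied to the Reserved LUN Pool on first write.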
The next 2 slides show more detail on the Reserved LUN Pool, and allocation of Reserved LUNs to a Source
LUN.
- 55
FLARE LUN 5
Private LUN 5
FLARE LUN 6
Private LUN 6
FLARE LUN 7
Private LUN 7
FLARE LUN 8
Private LUN 8
The CLARiiON storage system must be configured with a Reserved LUN Pool in order to use
SnapView Snapshot features. The Reserved LUN Pool consists of 2 parts: LUNs for use by SPA and
LUNs for use by SPB. Each of those parts is made up of one or more Reserved LUNs. The LUNs used
are bound in the normal manner. However, they are not placed in storage groups and allocated to hosts;
they are used internally by the storage system software. These are known as private LUNs because
they cannot be used, or seen, by attached hosts.
Like any LUN, a Reserved LUN is owned by only one SP at any time and may be trespassed if the
need arises (i.e., if an SP fails).
Just as each storage system model has a maximum number of LUNs it supports, each also has a
maximum number of LUNs which may be added to the Reserved LUN Pool.
The first step in SnapView configuration is usually the assignment of LUNs to the Reserved LUN
Pool. Only then are SnapView Sessions allowed to start. Remember that as snapable LUNs are added
to the storage system, the LUN Pool size has to be reviewed. Changes may be made online.
LUNs used in the Reserved LUN Pool are not host-visible, though they do count towards the
maximum number of LUNs allowed on a storage system.
Note: FLARE is the operating environment of the EMC CLARiiON Arrays.
- 56
Reserved LUN
Pool
Source LUNs
Snapshot 1a
Session 1a
Snapshot 1b
Session 1b
Private LUN 5
LUN 1
Private LUN 6
Private LUN 7
Private LUN 8
LUN 2
Snapshot 2a
Session 2a
In this example, LUN 1 and LUN 2 have been changed to Source LUNs by the creation of one or more
Snapshots on each. Three Sessions are started on those Source LUNs. Once a Session starts, the
SnapView mechanism tracks changes to the LUN and Reserved LUN Pool space is required. In this
example, the following occurs:
y Session 1a is started on Snapshot 1a
y Private LUN 5 in the Reserved LUN Pool is immediately allocated to Source LUN 1, and changes
made to that Source LUN are placed in Private LUN 5
y A second Session, Session 1b, is started on Snapshot 1b, and changes to the Source LUN are still
saved in Private LUN 5
y When PL 5 fills up, SnapView allocates the next available LUN, Private LUN 6, to Source LUN 1,
and the process continues
y Sessions 1a and 1b are now storing information in PL 6
y A Session is then started on Source LUN 2, and Private LUN 7 - a new LUN, since Source LUNs
cannot share a Private LUN - is allocated to it
y Once that LUN fills, Private LUN 8 will be allocated
y If all private LUNs have been allocated, and Session 1b causes Private LUN 6 to become full, then
Session 1b is terminated by SnapView without warning. SnapView does, however, notify the user - in
the SP Event Log and, if Event Monitor is active, in other ways - that the Reserved LUN Pool is
filling up. This notification allows ample time to correct the condition. Notification takes place when
the Reserved LUN Pool is 50% full, then again at 75%, and every 5% thereafter.
- 57
SnapView Terms
y Snapshot
The virtual LUN seen by a secondary host
Made up of data on the Source LUN and data in the RLP
Visible to the host (online) if associated with a Session
y Session
The mechanism that tracks the changes
Maintains the pointers and the map
Represents the point in time
y Roll back
Copy data from a (typically earlier) Session to the Source LUN
Let's use an analogy to make the distinction easier to understand. We'll compare this technology to
CD technology.
You can own a CD player but have no CDs. Similarly, you can own CDs but not have a player. CDs
are only useful if you can listen to them; also, you can only listen to one at a time on a player, no
matter how many CDs you own.
In the same way, a Session (the CD) is a Point-in-Time copy of data on a LUN. The exact time is
determined by the time at which the session starts.
The Snapshot (the CD player in our analogy) allows us to view the Session data (listen to the CD). The
sequence of slides that follows demonstrates the COFW process and the rollback process.
- 58
[Diagram: Primary Host writing to a Source LUN divided into chunks; a Secondary Host accessing the Snapshot; SnapView Map, Reserved LUN, and the Map in SP memory]
At the start of the animation the SnapView Map, Reserved LUN, and the Map in SP Memory
should all be empty. Solid arrows point from Snapshot chunks to the Source LUN chunks. Dotted
arrows to Source LUN and to Snapshot go from the Map area in SP Memory. Source LUN
chunks are labeled Chunk 0, Chunk 1, Chunk 2, Chunk 3, and Chunk 4.
Primary Host issues a write to Chunk 3 on the Source LUN. This is indicated by a dotted arrow to
Chunk 3 on the Source LUN from the Primary Host. A block travels out of the Primary Host. The
block waits between the Primary Host and the Source LUN. Chunk 3 is copied to the first Chunk
on the Reserved LUN, and this is now labeled Chunk 3. SnapView Map and SP Memory Map are
updated. The solid arrow from Snapshot to the Source LUN Chunk 3 disappears and new Solid
arrow from Snapshot to the Reserved LUN Chunk 3 appears. The dotted arrow to Chunk 3 on
Source LUN disappears. A dotted arrow to Reserved LUN chunk 3 appears.
Next the block travels to Chunk 3 on the Source LUN, and the chunk's contents are updated.
Another block travels from the Primary Host to Chunk 3 on the Source LUN. It is placed there,
updating Chunk 3 again; no further copy is needed, because the original chunk is already preserved
on the Reserved LUN.
- 59
[Diagram: the same layout, with Chunk 3 now preserved on the Reserved LUN]
y Next Primary Host issues a write to Chunk 0 on the Source LUN. This is indicated by a dotted
arrow to Chunk 0 on the Source LUN from the Primary Host. A block travels out of the Primary
Host. The block waits between the Primary Host and the Source LUN. Chunk 0 is copied to the
second Chunk on the Reserved LUN, and this is now labeled Chunk 0. SnapView Map and SP
Memory Map are updated. The solid arrow from Snapshot to the Source LUN Chunk 0 disappears
and new Solid arrow from Snapshot to the Reserved LUN Chunk 0 appears. The dotted arrow to
Chunk 0 on Source LUN disappears. A dotted arrow to Reserved LUN chunk 0 appears.
y Next the block travels to Chunk 0 on the Source LUN, and the chunk's contents are updated.
y Next the Secondary Host issues a read to Chunk 4 on the Snapshot. This is indicated by a dotted
arrow from Chunk 4 on the Snapshot to the Secondary Host.
y A block travels from Chunk 4 of the Source LUN to Chunk 4 of the Snapshot. Then the block
travels on to the Secondary Host.
y Next the Secondary Host issues a read to Chunk 0 on the Snapshot. This is indicated by a dotted
arrow from Chunk 0 on the Snapshot to the Secondary Host.
y A block travels from Chunk 0 of the Reserved LUN to Chunk 0 of the Snapshot. Then the block
travels on to the Secondary Host.
- 60
[Diagram: COFW animation continued - reads from the Snapshot are serviced by the Source LUN for unchanged chunks]
- 61
[Diagram: COFW animation continued - the Reserved LUN now holds Chunks 3 and 0; Snapshot reads of changed chunks are serviced from the Reserved LUN]
- 62
Writes to Snapshot
[Diagram: the Secondary Host writes to the Snapshot; the Reserved LUN holds the preserved chunks, and further chunks are allocated as writes arrive]
Secondary host issues a write to Chunk 0 of the Snapshot. This is indicated by a dotted arrow
from Secondary host to Chunk 0 of the Snapshot.
A block starts from Secondary host and waits between the Secondary host and the Snapshot.
Chunk 0 on the Reserved LUN is copied over to the next Chunk in the Reserved LUN.
Block travels to Chunk 0 of Snapshot and then to the original Chunk 0 on the Reserved LUN. 0
changes to 0* in the Reserved LUN.
Next the Secondary host issues a write to Chunk 2 of the Snapshot. This is indicated by a dotted
arrow from Secondary host to Chunk 2 of the Snapshot.
A block travels from Secondary host and waits between the Secondary host and the Snapshot.
Chunk 2 is copied from Source LUN to the next available Chunk in the Reserved LUN. The solid
arrow from Chunk 2 of Snapshot to Chunk 2 of Source LUN disappears. Solid arrow from Chunk
2 of Snapshot to Chunk 2 in the Reserved LUN appears. Dotted arrow to Chunk 2 on the Source
LUN disappears and a dotted arrow to Chunk 2 on the Reserved LUN appears. SnapView Map
and the Map in SP memory are updated
Chunk 2 on the Reserved LUN is copied to the next Chunk on the Reserved LUN. A dotted
arrow appears.
The block travels to Chunk 2 on the Snapshot and then on to the original location of Chunk 2 on
the Reserved LUN. 2 on the Reserved LUN is changed to 2*.
- 63
Writes to Snapshot
[Diagram: after the Snapshot write to Chunk 0, the Reserved LUN holds Chunk 3, Chunk 0* and Chunk 0]
- 64
Writes to Snapshot
[Diagram: after the Snapshot write to Chunk 2, the Reserved LUN holds Chunk 3, Chunk 0*, Chunk 0, Chunk 2* and Chunk 2]
- 65
[Diagram: rollback with changes preserved - the chunks in the Reserved LUN are copied over the corresponding chunks on the Source LUN]
SnapView rollback allows a Source LUN to be returned to its state at a previously defined point in
time. When performing the rollback, you can choose to preserve or discard any changes made by the
secondary host. In this first example, changes are preserved, meaning that the state of the Source LUN
at the end of the rollback process is identical to the Snapshot as it appears now.
All chunks that are in the Reserved LUN Pool are copied over the corresponding chunks on the Source
LUN. Before this process starts, it is necessary to take the Source LUN offline (we are changing the
data structure without the knowledge of the host operating system, and it needs to refresh its view of
that structure). If this step is not performed, data corruption could occur on the Source LUN.
Note: No changes are made to the Snapshot or to the Reserved LUN Pool when this process takes
place.
- 66
[Diagram: rollback continued - the Source LUN now contains Chunk 2* copied back from the Reserved LUN]
- 67
[Diagram: rollback with changes discarded - the Snapshot is deactivated before the rollback starts]
In this example, all changes that have been made to the Snapshot by the secondary host are discarded,
returning the Source LUN to the state it was in when the session was started (the original PIT view).
To do this, the Snapshot needs to be deactivated. Deactivating the Snapshot discards all changes made
by the secondary host, and frees up areas of the Reserved LUN Pool which were holding those
changes. It also makes the Snapshot unavailable to the secondary host.
Once the deactivation has completed, the rollback process can be started. At this point, the Source
LUN needs to be taken offline. The Source LUN is then returned to its original state at the time the
session was started.
- 68
[Diagram: after deactivation, the Reserved LUN holds only the original Chunks 3, 0 and 2]
- 69
[Diagram: the Source LUN returned to its original state at the time the session was started]
- 70
Summary
Key points covered in this topic:
y EMC's Local Replication Solutions for the Symmetrix and
CLARiiON Arrays
y EMC's TimeFinder/Mirror Replication Solution
y EMC's SnapView - Snapshot Replication Solution
In this topic, we listed EMC's Local Replication Solutions for the Symmetrix and CLARiiON arrays;
described EMC's TimeFinder/Mirror Replication Solution; and described EMC's SnapView Snapshot Replication Solution.
This concludes the module.
- 71
- 72
Local Replication
Case Study 1
Business Profile:
A Manufacturing Corporation maintains the storage of their mission critical applications
on high end Storage Arrays on RAID 1 volumes.
Current Situation/Issue:
A full backup of their key application is run once a week. The database application
takes up 1 TB of storage and has to be shut down during the course of the full backup.
The shutdown of the database is a requirement that cannot be changed.
The main concerns facing the corporation are:
1) The Backup window is too long and is negatively impacting the business (2 hours)
2) A disaster recovery test with the full backup tapes took an extremely long time (many
hours).
The company would like to:
1) Reduce the backup window during which the database application is shut down to as
small a time window as possible (less than half an hour)
2) Ensure that the RTO from the full backup is reduced to under an hour
The company's IT group is very interested in leveraging some of the Local Replication
technologies that are available on their high end array.
Proposal:
Propose a local replication solution to address the company's concerns. Justify how your
solution will ensure that the company's needs are met.
Local Replication
Case Study 2
Business Profile:
A Manufacturing Corporation maintains the storage of their mission critical applications
on high end Storage Arrays on RAID 1 volumes.
Current Situation/Issue:
The Company's key database application takes up 1 TB of storage and has to be up 24x7.
The main concerns facing the corporation are:
1) Logical corruption of the database (e.g., accidental deletion of a table or tablespace)
2) Guaranteed restore operations with a minimum RPO of 1 hour and with an RTO of
less than half an hour
3) On occasion may have to restore to a point in time that is up to 8 hours old
Additional information:
The company would like to minimize the amount of storage used by the solution that will
address their concerns. On average, 240 GB of data changes in a 24-hour period.
The customer is not concerned about physical failure of the database devices; other
solutions already in place address this issue.
The company's IT group is very interested in leveraging some of the Local Replication
technologies that are available on their high end array.
Proposal:
Propose a local replication solution to address the company's concerns. Justify how your
solution will ensure that the company's needs are met. How much physical storage will
this replication actually need?
Remote Replication
After completing this module, you will be able to:
y Explain Remote Replication Concepts
Synchronous/Asynchronous
Connectivity Options
This module introduces the challenges and solutions for remote replication and describes possible
implementations.
The objectives for the module are shown here. Please take a moment to read them.
-1
y Synchronous Replication
Replica is identical to source at all times - zero RPO
y Asynchronous Replication
Replica is behind the source by a finite margin - small RPO
y Connectivity
Network infrastructure over which data is transported from source
site to remote site
The Replication concepts/considerations that were discussed for Local Replication apply to Remote
Replication as well. We explore the concepts that are unique to Remote replication.
Synchronous and Asynchronous replication concepts and considerations are explained in more detail in
the next few slides.
Data has to be transferred from the source site to a remote site over some network. This can be done
over IP networks, over the SAN, using DWDM (Dense Wave Division Multiplexing) or SONET
(Synchronous Optical Network), etc. We will discuss the various options later in the module.
The fundamental difference between local and remote replication is that remote replicas can be at a
geographically different location. For example, applications at a data center in Boston could be
replicated to a data center in London. Though remote replicas can be used for various Business
Continuity operations, just like local replicas, the primary driver of remote replication is disaster
recovery. Because data has to be replicated over a distance, a network infrastructure is a necessity for
remote replication.
-2
Synchronous Replication
y A write has to be secured on the
remote replica and the source before
it is acknowledged to the host
Disk
4
Server
Data Write
Data Acknowledgement
Disk
Synchronous - Data is committed at both the source site and the remote site before the write is
acknowledged to the host. Any write to the source must be transmitted to and acknowledged by the
remote before signaling a write complete to the host. Additional writes cannot occur until each
preceding write has been completed and acknowledged. This ensures that the data at both sites is
identical at all times.
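The ordering rule can be sketched as follows; the arrays are modeled as plain dictionaries, purely for illustration.

def sync_write(block, data, source_array, remote_array):
    source_array[block] = data                   # 1. write received by the source array
    remote_array[block] = data                   # 2. write transmitted to the remote array
    remote_ack = (remote_array[block] == data)   # 3. remote acknowledges
    assert remote_ack
    return "write complete"                      # 4. only now signaled to the host

Because step 4 waits on steps 2 and 3, every write pays the round-trip to the remote site, which is the source of the response time extension discussed next.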
-3
Synchronous Replication
y Response Time Extension
Application response time will be
extended due to synchronous
replication
Max
y Bandwidth
Writes
MB/s
Average
Time
Application response times are extended with any kind of Synchronous replication. This is due to the
fact that any write to the source must be transmitted to and acknowledged by the remote before
signaling write complete to the host. The response time depends on the distance between sites,
available bandwidth, and the network connectivity infrastructure.
The longer the distance, the longer the response time. The speed of light is finite; every 200 km (125 miles)
adds 1 ms to the response time.
Insufficient bandwidth also causes response time elongation. With Synchronous replication, there must
be sufficient bandwidth at all times. The picture on the slide shows the amount of data that has to be
replicated as a function of time. To minimize the response time elongation, ensure that the Max
bandwidth is provided by the network at all times. If we assume that only the average bandwidth is
provided for, then there are times during the day (the shaded section) when response times may be
unduly elongated, causing applications to time out.
The distance over which Synchronous replication can be deployed really depends on an application's
ability to tolerate the extension in response time. It is rarely deployed for distances greater than 200
km (125 miles).
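As a quick worked example of that rule of thumb (the distance is hypothetical):

distance_km = 800                            # hypothetical site separation
added_latency_ms = distance_km / 200 * 1     # ~4 ms added to every write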
-4
Asynchronous Replication
y Write is acknowledged to host
as soon as it is received by the
source
Disk
y Finite RPO
Replica will be behind the Source by
a finite amount
Typically configurable
2007 EMC Corporation. All rights reserved.
1
2
Server
Data Write
Data Acknowledgement
Disk
Asynchronous - Data is committed at the source site and the acknowledgement is sent to the host. The
data is buffered and then forwarded to the remote site as the network capabilities permit. The data at
the remote site is behind the source by a finite RPO; typically the RPO would be a configurable value.
The primary benefit of Asynchronous replication is that there is no response time elongation.
Asynchronous replications are typically deployed over extended distances. The response time benefit
is offset by the finite RPO.
-5
Asynchronous Replication
y Response Time unaffected
y Bandwidth
Need sufficient bandwidth on average
y Buffers
Need sufficient buffers
Max
Writes
MB/s
Time
Extended distances can be achieved with Asynchronous replication because there is no impact on the
application response time. Data is buffered and then sent to the remote site. The available bandwidth
should be at least equal to the average write workload. Data is buffered during times when the
bandwidth is not enough, thus sufficient buffers should be designed into the solution.
Understanding the workload of the application and the bandwidth required for the replication is as
important for Asynchronous replication as for Synchronous. While it is true that Asynchronous replication
requires less bandwidth than Synchronous, one still has to provide bandwidth at least equal to the
average write workload. Data will be buffered when the bandwidth is not enough. This buffering of
data causes the RPO to become larger. Insufficient bandwidth will lead to large RPOs, which may not
be acceptable.
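A back-of-the-envelope sketch of that buffering behavior, with invented numbers:

bandwidth = 50                                # MB/s provisioned (about the average write rate)
writes = [30, 40, 90, 120, 60, 20]            # MB written in successive seconds (bursty)

backlog = 0.0
for w in writes:
    backlog = max(0.0, backlog + (w - bandwidth))   # MB waiting in the buffer

rpo_seconds = backlog / bandwidth             # rough time for the remote to catch up

Here the burst leaves 90 MB buffered, about 1.8 seconds behind; a slower link or longer burst grows both the buffer and the RPO.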
-6
Log Shipping
In the context of our discussion, Remote Replication refers to replication that is done between data
centers if it is host based, and between Storage arrays if it is array based. In the latter case, the two
arrays may be adjacent to each other in the same data center, or might be geographically separated.
Host based implies that all the replication is done by using the CPU resources of the host, using
software that is running on the host. Array based implies that all replication is done between Storage
Arrays and is handled by the Array Operating Environment.
-7
Log
Log
Physical
Volume 1
Physical
Volume 2
Physical
Volume 3
Volume Group
Local Site
Physical
Volume 1
Network
Physical
Volume 2
Physical
Volume 3
Volume Group
Remote Site
Some LVM vendors provide remote replication at the Volume Group level.
Duplicate Volume Groups need to exist at both the local and remote sites before replication starts. This
can be achieved in a number of ways:
y Over IP
y Tape backup/restore etc.
All writes to the source Volume Group are replicated to the remote Volume Group by the LVM.
Typically the writes are queued in a log file and sent to the remote site in the order received over a
standard IP network. It can be done synchronously or asynchronously.
y Synchronous - Write must be received by the remote site before the write is acknowledged locally to the
host
y Asynchronous - Write is acknowledged immediately to the local host, then queued and sent in order
-8
Production work can continue at the source site if there is a network failure. The writes that need to be
replicated are queued in the log file and sent over to the remote site when the network issue is
resolved. If the log files fill up before the network outage is resolved, a complete resynchronization of
the remote site would have to be performed. Thus, the size of the log file determines the length of
network outage that can be tolerated.
In the event of a failure at the source site (e.g. server crash, site wide disaster), production operations
can be resumed at the remote site with the remote replica. The exact steps that need to be performed to
achieve this depends on the LVM that is in use.
-9
y Disadvantages
Extended network outages require large log files
CPU overhead on host
For maintaining and shipping log files
A significant advantage of using LVM based remote replication is the fact that storage arrays from
different vendors can be used at the two sites. For example, at the production site, a high-end array
could be used while at the remote site, a second tier array could be used. In a similar manner, the
RAID protection at the two sites could be different as well.
Most of the LVM based remote replication technologies allow the use of standard IP networks that are
already in place, eliminating the need for a dedicated network. Asynchronous mode supported by
many LVMs eliminates the response time issue of synchronous mode while extending the RPO.
Log files need to be configured appropriately to support extended network outages. Host based
replication technologies use host CPU cycles.
- 10
IP Network
Original
Logs
Log Shipping is a host based replication technology for databases offered by most DB Vendors
y Initial State - All the relevant storage components that make up the database are replicated to a
standby server (done over IP or other means) while the database is shutdown
y Database is started on the production server; as and when log switches occur, the log file that was
closed is sent over IP to the standby server
y Database is started in standby mode on the standby server; when log files arrive, they are applied to
the standby database
y Standby database is consistent up to the last log file that was applied
Advantages
y Minimal CPU overhead on production server
y Low bandwidth (IP) requirement
y Standby Database consistent to last applied log
RPO can be reduced by controlling log switching
Disadvantages
y Need host based mechanism on production server to periodically ship logs
y Need host based mechanism on standby server to periodically apply logs and check for consistency
y IP network outage could lead to standby database falling further behind
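A minimal sketch of the shipping loop described above; the paths and the standby database's apply method are invented for illustration.

import shutil

def on_log_switch(closed_log_path, standby_inbox):
    # in practice the file is shipped over IP; a local copy stands in for that here
    shutil.copy(closed_log_path, standby_inbox)

def apply_arrived_logs(standby_db, arrived_logs):
    for log in sorted(arrived_logs):   # apply logs in order
        standby_db.apply(log)          # standby is consistent up to this log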
- 11
Remote Array
Network
Production
Server
Source
Distance
Replica
DR Server
Replication Process
y A Write is initiated by an application/server
y Received by the source array
y Source array transmits the write to the remote array via dedicated channels (ESCON, Fibre
Channel or Gigabit Ethernet) over a dedicated or shared network infrastructure
y Write received by the remote array
Only Writes are forwarded to the remote array
y Reads are from the source devices
- 12
Network links
Source
Target
Synchronous Replication ensures that the replica and source have identical data at all times. The
source array issues the write complete to the host/server only when the write has been received both at
the remote array and the source array. When the write complete is sent, the replica and source are
identical.
The sequence of operations is:
y Write is received by the source array from host/server
y Write is transmitted by source array to the remote array
y Remote array sends acknowledgement to the source array
y Source array signals write complete to host/server
- 13
Network links
Source
Target
Applications do not suffer any response time elongation with Asynchronous replication because any
write is acknowledged to the host as soon as the write is received by the source array. Asynchronous
replication can be used for extended distances. Bandwidth requirements for Asynchronous will be
lower than Synchronous for the same workload. Vendors ensure data consistency in different ways.
The sequence of operations is shown here:
A Write is received by the source array from the host;
The Source array signals write complete to the host;
The Write is transmitted by source array to the remote array; and then
The Remote array sends acknowledgement to the source array.
- 14
The data on the remote replicas will be behind the source by a finite amount in Asynchronous
replication, thus steps must be taken to ensure consistency. Some vendors achieve consistency by
maintaining write ordering, wherein the remote array applies writes to the replica devices in the exact
order that they were received at the source. Other vendors leverage the dependent write I/O logic that
is built into most databases and applications.
Cache buffered Asynchronous replication technologies buffer writes in cache for a period of time, and
then close the buffer in a consistent manner and receive new writes in a new buffer. When the buffer is
open, if a particular location is written to more than once (locality of reference), only the final write is
sent to the remote array. Thus, if a particular location is written to 10 times, only the last I/O is sent to
the remote array when the buffer is closed. This method is different from the asynchronous technique
which maintains write ordering. With write ordering, 10 I/Os will be sent to the remote array as
compared to the 1 I/O in the cache buffered method. Data consistency is maintained with both
techniques, but the cache buffered technique would require less bandwidth if the workload has a high
locality of reference (the same data location written to multiple times).
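The write-folding idea can be sketched in a few lines; the structures are illustrative.

open_buffer = {}                       # location -> latest data while the buffer is open

def buffered_write(location, data):
    open_buffer[location] = data       # repeated writes to one location fold into one entry

def close_and_send():
    batch = dict(open_buffer)          # close the buffer as one consistent batch
    open_buffer.clear()                # new writes start a fresh buffer
    return batch                       # shipped to the remote array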
- 15
Disk buffered consistent PITs are a combination of Local and Remote replication technologies. The idea is to make a Local PIT replica and then create a Remote replica of the Local PIT. The advantages of disk buffered PITs are lower bandwidth requirements and the ability to replicate over extended distances. Disk buffered replication is typically used when the RPO requirements are on the order of hours, so a lower bandwidth network can be used to transfer data from the Local PIT copy to the remote site. The data transfer may take a while, but the solution would be designed to meet the RPO.
Let's take a look at two disk buffered PIT solutions.
- 16
[Diagram: the Source and its Local Replica at the production site; network links carry the Local Replica's data to a Remote Replica, from which a further Local Replica is made at the REMOTE site.]
Disk buffered replication allows for incremental resynchronization between the source and a Local Replica, which in turn acts as the source for a Remote Replica.
Benefits include:
y Reduction in communication link cost and improved resynchronization time for long-distance
replication implementations
y The ability to use the various replicas to provide disaster recovery testing, point-in-time backups,
decision support operations, third-party software testing, and application upgrade testing or the
testing of new applications.
- 17
[Diagram: a multi-hop configuration. The Source is synchronously replicated to a Remote Replica at a BUNKER site; a Local Replica taken at the bunker is sent over network links to a Remote Replica at the REMOTE site, where a further Local Replica can be made.]
- 18
Tracking changes to facilitate incremental re-synchronization between the source devices and remote
replicas is done via the use of bitmaps in a manner very similar to that discussed in the Local
Replication lecture. Two bitmaps, one for the source and one for the replica, would be created. Some
vendors may keep the information of both bitmaps at both the source and remote sites, while others
may simply keep the source bitmap at the source site and the remote bitmap at the remote site. When a
re-synchronization (source to replica or replica to source) is required, the source and replica bitmaps
are compared and only data that was changed is synchronized.
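A sketch of the comparison step, with made-up extent-level bitmaps (one bit per extent; not a vendor format):

    source_bitmap  = [0, 1, 0, 0, 1, 0]      # extents changed on the source
    replica_bitmap = [0, 0, 1, 0, 0, 0]      # extents changed on the replica

    # An extent must be copied if it changed on either side
    to_copy = [i for i, (s, r) in enumerate(zip(source_bitmap, replica_bitmap))
               if s or r]                    # -> [1, 2, 4]

Only the flagged extents move over the link during resynchronization; everything else is already identical on both sides.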
- 19
While remote replication is in progress, the remote devices will typically not be available for use. This ensures that no changes are made to the remote replicas. The purpose of the remote replica is to provide a good starting point for any recovery operation.
Prior to any recovery efforts with the remote replicas, it is always a good idea to create a local replica of the remote devices. The local replica can be used as a fallback if the recovery process somehow corrupts the remote replicas.
Restarting operations at the remote site and subsequently restoring operations back to the primary site requires a tremendous amount of upfront planning. The simple statement, "Start operations at the remote site," would have to be planned well ahead of time to account for various failure scenarios.
- 20
y Synchronous
y Asynchronous
y Disk buffered
The choice of the appropriate array-based remote replication depends on specific needs:
What are the RPO requirements? What is the distance between sites? What is the primary reason for remote replication? And so on.
- 21
A dedicated or a shared network must be in place for remote replication. Storage arrays have dedicated ESCON, Fibre Channel or Gigabit Ethernet adapters, which are used for remote replication. The network between the two arrays could be ESCON or Fibre Channel for the entire distance. Such networks would typically be used for shorter distances. For extended distances, an optical or IP network must be used. Examples of optical networks are DWDM and SONET (discussed later). Protocol converters may have to be used to connect the ESCON or Fibre Channel adapters from the arrays to these networks. Gigabit Ethernet adapters can be connected directly to the IP network.
A network is required for remote replication. Because this topic is complex, the next three slides are
meant to give you an overview of the network options that are available.
- 22
[Diagram: DWDM. Electrical signals such as Gigabit Ethernet are converted to optical signals, each assigned its own wavelength (lambda) and multiplexed onto a shared fiber.]
- 23
[Diagram: SONET links (OC3, OC48) alongside the equivalent SDH hierarchy (STM-1, STM-16).]
- 24
Link                 Rated Bandwidth (Mb/s)
ESCON                200
Fibre Channel        1024 or 2048
Gigabit Ethernet     1024
T1                   1.5
T3                   45
E1                   2
E3                   34
OC1                  51.8
OC3/STM1             155.5
OC12/STM4            622.08
OC48/STM16           2488.0
The slide lists the rated bandwidth in Mb/s for standard WAN (T1, T3, E1, E3), SONET (OC1, OC3,
OC12, OC48) and SDH (STM1, STM4, STM16) Links. The rated bandwidth of ESCON, Fibre
Channel, and Gigabit Ethernet is also listed.
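As a rough worked example of what these rates imply for replication planning (the 100 GB change set is an illustrative assumption, not from the slide):

    # Time to ship 100 GB of changed data over two of the links above
    changed_bytes = 100 * 10**9
    for name, mbps in [('T3', 45), ('OC3/STM1', 155.5)]:
        seconds = changed_bytes * 8 / (mbps * 10**6)
        print(f'{name}: {seconds / 3600:.1f} hours')
    # T3: ~4.9 hours; OC3/STM1: ~1.4 hours

Sizing the link against the change rate and the RPO is exactly the trade-off behind the synchronous, asynchronous, and disk buffered choices discussed earlier.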
- 25
Module Summary
Key points covered in this module:
y Remote Replication Concepts
Synchronous/Asynchronous
Connectivity Options
These are the key points covered in this module. Please take a moment to review them.
- 26
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
- 27
At this point, let's apply what we've learned to some real-world examples. Upon completion of this topic you will be able to:
Enumerate EMC's Remote Replication Solutions for the Symmetrix and CLARiiON arrays;
Describe EMC's SRDF/Synchronous Replication Solution; and
Describe EMC's MirrorView/A Replication Solution
- 28
All remote replication solutions that were discussed in this module are available on EMC Symmetrix
and CLARiiON Arrays.
The SRDF (Symmetrix Remote Data Facility) family of products provides Synchronous,
Asynchronous and Disk Buffered remote replication solutions on the EMC Symmetrix Arrays.
The MirrorView family of products provides Synchronous and Asynchronous remote replication
solutions on the EMC CLARiiON Arrays.
SRDF/Synchronous (SRDF/S): High-performance, host-independent, real-time synchronous remote
replication from one Symmetrix to one or more Symmetrix systems.
MirrorView/Synchronous (MirrorView/S): Host-independent, real-time synchronous remote
replication from one CLARiiON to one or more CLARiiON systems.
SRDF/Asynchronous (SRDF/A): High-performance extended distance asynchronous replication for
Symmetrix arrays using a Delta Set architecture for reduced bandwidth requirements and no host
performance impact. Ideal for Recovery Point Objectives of the order of minutes.
MirrorView/Asynchronous (MirrorView/A): Asynchronous remote replication on CLARiiON arrays.
Designed with low-bandwidth requirements, delivers a cost-effective remote replication solution ideal
for Recovery Point Objectives (RPOs) of 30 minutes or greater.
SRDF/Automated Replication: Rapid business restart over any distance with no data exposure through
advanced single-hop and multi-hop configurations using combinations of TimeFinder/Mirror and
SRDF on Symmetrix Arrays.
- 29
EMC SRDF/Synchronous is an Array based Synchronous Remote Replication technology for EMC
Symmetrix Storage Arrays. SRDF R1 and R2 volumes are devices dedicated for Remote replication.
R2 volumes are on the Target arrays, while R1 volumes are on the Source arrays. Data written to R1
volumes is replicated to R2 volumes.
- 30
SRDF R1 and R2 volumes can have any local RAID protection. SRDF R2 volumes are in a Read Only state while remote replication is in effect; they are accessed only under certain circumstances, such as a failover, DR testing, or maintenance at the source site.
- 31
SRDF/Synchronous
1. Write received by the Symmetrix containing the Source volume
2. Source Symmetrix sends the write data to the Target
[Diagram: the Source Host attaches to the source Symmetrix through Channel Directors (CD); the write passes through a Remote Link Director (RLD) and across the remote link to an RLD on the target Symmetrix, where Disk Directors (DD) destage it to the Target volume. The Target Host attaches through its own Channel Directors.]
- 32
[Diagram: Failover. Before: Source Volume Read Write (RW), Target Volume Read Only (RO). After: Source Volume RO, Target Volume RW.]
Failover operations are performed if the SRDF R1 Volumes become unavailable and the decision is
made to start operations on the R2 Devices. Failover could also be performed when DR processes are
being tested or for any maintenance tasks that have to be performed at the source site.
If failing over for a maintenance operation:
y For a clean, consistent, coherent point-in-time copy which can be used with minimal recovery on the target side, some or all of the following steps may have to be taken on the source side:
Stop all applications (DB or whatever else is running)
Unmount the file systems
Deactivate the Volume Group
A failover leads to a RO state on the source side. If a device suddenly becomes RO from a RW state, the reaction of the host can be unpredictable if the device is in use; hence the suggestion to stop applications, unmount file systems, and deactivate Volume Groups, as in the sketch below.
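A hedged sketch of that quiesce sequence as an orchestration script. stop_application is a hypothetical placeholder for whatever shuts down the database, while umount and vgchange are standard Linux/LVM commands; paths and names are assumptions for illustration:

    import subprocess

    def quiesce_source():
        # stop_application is hypothetical; substitute the real shutdown step
        subprocess.run(['stop_application', 'mydb'], check=True)
        subprocess.run(['umount', '/mnt/appdata'], check=True)        # unmount file system
        subprocess.run(['vgchange', '-a', 'n', 'appvg'], check=True)  # deactivate volume group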
- 33
[Diagram: Failback. Before: Source Volume Read Only (RO), Target Volume Read Write (RW). After: Source Volume RW, Target Volume RO, with the changes made at the target synchronized back to the source.]
The main purpose of the Failback operation is to allow the resumption of operations at the primary site
on the source devices. Failback is typically invoked after a failover has been performed and production
tasks are being performed on the Target site on the R2 devices. Once operations can be resumed at the
Primary site, the Failback operation can be invoked. Ensure that applications are properly quiesced
and volume groups deactivated before failback is invoked.
When failback is invoked, the Target Volumes become Read Only, the source volumes become Read
Write, and any changes that were made at the Target site while in the failed over state are propagated
back to the source site.
- 34
[Diagram: Split. Before: Source Volume Read Write (RW), Target Volume Read Only (RO). After: both Source and Target Volumes RW.]
The SRDF Split operation is used to allow concurrent access to both the Source and Target volumes.
Target volumes are made Read Write and the SRDF replication between the Source and Target is
suspended.
- 35
[Diagrams: Establish. Source Volume Read Write (RW), Target Volume Read Only (RO); replication resumes from source to target. Restore: Source Volume RW, Target Volume RO; the target's data is copied back to the source.]
During current operations while in an SRDF Split state, changes could occur on both the Source and Target volumes. Normal SRDF replication can be resumed by performing an establish or a restore operation.
With either establish or restore, the status of the Target volume goes to Read Only. Prior to an establish or restore, all access to the Target volumes must be stopped.
The Establish operation is used when changes to the Target volumes should be discarded while preserving changes that were made to the Source volumes.
The Restore operation is used when changes to the Source volumes should be discarded while preserving changes that were made to the Target volumes. Prior to a restore operation, all access to the source and target volumes must be stopped. The Target volumes go to a Read Only state, while the data on the Source volumes is overwritten with the data on the Target volumes.
- 36
- 37
MirrorView/A Terms
y Primary storage system
Holds the local image for a given mirror
y Bidirectional mirroring
A storage system can hold local and remote images
y Mirror Synchronization
Process that copies data from local image to remote image
The terms primary storage system and secondary storage system are terms relative to each mirror.
Because MirrorView/A supports bidirectional mirroring, a storage system which hosts local images for
one or more mirrors may also host remote images for one or more other mirrors.
The process of updating a remote image with data from the local image is called synchronization.
When mirrors are operating normally, they are either in the synchronized state or synchronizing. If a
failure occurs, and the remote image cannot be updated, perhaps because the link between the
CLARiiONs has failed, then the mirror is in a fractured state. Once the error condition is corrected,
synchronization restarts automatically.
- 38
MirrorView/A Configuration
y MirrorView/A Setup
MirrorView/A software must be loaded on both Primary and
Secondary storage system
Remote LUN must be exactly the same size as local LUN
Secondary LUN does not need to be the same RAID type as Primary
Reserved LUN Pool space must be configured
MirrorView/A software must be loaded on both CLARiiONs, regardless of whether or not the
customer wants to implement bi-directional mirroring.
The remote LUN must be the same size as the local LUN, though not necessarily the same RAID type.
This allows flexibility in DR environments, where the backup site need not match the performance of
the primary site.
Because MirrorView/A uses SnapView Snapshots as part of its internal operation, space must be
configured in the Reserved LUN Pool for data chunks copied as part of a COFW operation. SnapView
Snapshots, the Reserved LUN Pool, and COFW activity were discussed in an earlier module.
MirrorView/A, like other CLARiiON software, is managed by using either Navisphere Manager if a
graphical interface is desired, or Navisphere CLI for command-line management.
Hosts cannot attach to a remote LUN while it is configured as a secondary (remote) mirror image. If you promote the remote image to be the primary mirror image (in other words, exchange the roles of the local and remote images), as is done in a disaster recovery scenario, or if you remove the secondary LUN from the mirror, thereby turning it into an ordinary CLARiiON LUN, then it may be accessed by a host.
- 39
[Diagram: the primary image, extents A through F, and the empty Secondary Image. The Tracking DeltaMap is all zeros and the Transfer DeltaMap all ones; the Reserved LUN Pool (RLP) holds the map for the SnapView Session started on the primary.]
MirrorView/A makes use of bitmaps, called DeltaMaps because they track changes, to log where data
has changed, and needs to be copied to the remote image. As with SnapView Snapshots, the
MirrorView image is seen as consisting of 64 kB areas of data, called chunks or extents.
This animated sequence shows the initial synchronization of a MirrorView/A mirror. The Transfer
DeltaMap has all its bits set, to indicate that all extents need to be copied across to the secondary. At
the time the synchronization starts, a SnapView Session is started on the primary, and it will track all
changes in a similar manner to that used by Incremental SAN Copy. At the end of the initial
synchronization, the secondary image is a copy of what the primary looked like when the
synchronization started. Any changes made to the primary since then are flagged by the Tracking
DeltaMap.
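The two-map mechanism can be sketched as follows. Extent size is 64 kB on the real product, but here an extent is just an index, and the function names are illustrative assumptions:

    EXTENTS = 6
    tracking = [0] * EXTENTS                 # host writes since the last cycle
    transfer = [1] * EXTENTS                 # initial sync: copy everything

    def host_write(extent):
        tracking[extent] = 1                 # flagged for the *next* cycle

    def start_update_cycle():
        global tracking, transfer
        tracking, transfer = [0] * EXTENTS, tracking   # the maps swap roles
        return [i for i, bit in enumerate(transfer) if bit]  # extents to send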
- 40 through - 48
[Animation frames: the initial synchronization proceeds extent by extent. As each extent (A through F) is copied to the Secondary Image, its bit in the Transfer DeltaMap is cleared; host writes arriving during the synchronization set bits in the Tracking DeltaMap. By the final frame the Transfer DeltaMap is all zeros and the secondary holds a copy of the primary as it was when the synchronization started.]
- 49
MirrorView/A Update
[Diagram: primary and secondary images in step; the Tracking DeltaMap (0 1 0 0 1 0) records the host writes made since the last cycle, and the Transfer DeltaMap is clear.]
An update cycle starts, either automatically at the prescribed time, or initiated by the user. Prior to the
start of data movement to the secondary, MirrorView/A starts a SnapView Session on the secondary, to
protect the original data if anything goes wrong during the update cycle.
After the update cycle completes successfully, the SnapView Session and Snapshot on the secondary
side are no longer needed, and are destroyed.
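The shape of one update cycle, as a hedged sketch; snapshot, rollback, and destroy are hypothetical method names standing in for the SnapView operations described above:

    def update_cycle(primary, secondary, extents_to_send):
        safety = secondary.snapshot()        # protect the pre-update image
        try:
            for ext in extents_to_send:
                secondary.write(ext, primary.read(ext))
            safety.destroy()                 # success: safety copy no longer needed
        except IOError:
            safety.rollback()                # failure: return the secondary to a
            raise                            # consistent pre-update state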
- 50 through - 55
[Animation frames: at the start of the update the DeltaMaps swap roles: the old Tracking DeltaMap (0 1 0 0 1 0) becomes the Transfer DeltaMap and a cleared map takes over tracking. A safety SnapView Session is started on the secondary, the flagged extents (here B and E) are copied across, and new host writes are recorded in the fresh Tracking DeltaMap. When the transfer completes, the Transfer DeltaMap is clear and the safety Session is destroyed.]
- 56
[Diagram: the update cycle fails partway through. The Transfer DeltaMap (0 0 0 0 1 0) still shows an extent waiting to be sent, the Tracking DeltaMap has accumulated new writes, and the safety Session on the secondary still protects the pre-update image.]
Should the update cycle fail for any reason (here, a primary storage system failure) and it become necessary to promote the secondary, the safety Session is rolled back and the secondary image is returned to the state it was in prior to the start of the update cycle.
- 57 through - 61
[Animation frames: the primary storage system fails and the secondary is promoted. The safety SnapView Session is rolled back, returning the Secondary Image to the consistent state it held before the interrupted update cycle, and the host resumes operations against the promoted image.]
- 62
Consistency Groups
y Group of secondary images treated as a unit
y Local LUNs must all be on the same CLARiiON
y Remote LUNs must all be on the same CLARiiON
y Operations happen on all LUNs at the same time
Ensures a restartable image group
Consistency Groups allow all LUNs belonging to a given application, usually a database, to be treated
as a single entity and managed as a whole. This helps to ensure that the remote images are consistent,
i.e. all made at the same point in time. As a result, the remote images are always restartable copies of
the local images, though they may contain data which is not as new as that on the primary images.
It is a requirement that all the local images of a Consistency Group be on the same CLARiiON, and
that all the remote images for a Consistency Group be on the same remote CLARiiON. All information
related to the Consistency Group is sent to the remote CLARiiON from the local CLARiiON.
The operations which can be performed on a Consistency Group match those which may be performed
on a single mirror, and affect all mirrors in the Consistency Group. If for some reason an operation
cannot be performed on one or more mirrors in the Consistency Group, then that operation fails and the
images remain unchanged.
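The all-or-nothing rule can be expressed in a short sketch; can_perform and perform are hypothetical method names, not the MirrorView API:

    def group_operation(mirrors, op):
        # Verify first: if any mirror cannot take the operation, do nothing
        if not all(m.can_perform(op) for m in mirrors):
            raise RuntimeError('operation rejected; all images left unchanged')
        for m in mirrors:
            m.perform(op)                    # applied to every mirror as a unit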
- 63
In this topic, we enumerated EMC's Remote Replication solutions for the Symmetrix and CLARiiON arrays, and described EMC's SRDF/Synchronous Replication and MirrorView/A Replication solutions.
This concludes the module.
- 64
Section Summary
Key Points covered in this section:
y Overview of Business Continuity
y The solutions and the supporting technologies that
enable business continuity and uninterrupted data
availability
Backup and Recovery
Local Replication
Remote Replication
These are the key points covered in this section. Please take a moment to review them.
If you have not already done so, please review the Case Studies prior to taking the assessment.
This concludes the training. Please proceed to the Course Completion slide to take the Assessment.
- 65
Remote Replication
Case Study
Business Profile:
A Manufacturing Corporation maintains the storage of its mission-critical applications on high-end Storage Arrays on RAID 1 volumes. The corporation has two data centers which are 50 miles apart.
Current Situation/Issue:
The corporation's mission-critical Database application takes up 1 TB of storage on a high-end Storage Array. In the past year, top management has become extremely concerned because they do not have DR plans which will allow for zero RPO recovery if there is a site failure. The primary DR Site is the 2nd Data Center 50 miles away.
The company would like to explore remote replication scenarios which will allow for near zero RPO and a minimal RTO. The company is aware of the large costs associated with network bandwidth and would like to explore other remote replication technologies in addition to the zero RPO solution.
Proposal:
Propose a remote replication solution to address the company's concern. Justify how your solution will ensure that the company's needs are met.