
Welcome to Cloud Infrastructure and Services version 2.

Copyright © 1996, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014 EMC Corporation. All Rights
Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change
without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
EMC2, EMC, Data Domain, RSA, EMC Centera, EMC ControlCenter, EMC LifeLine, EMC OnCourse, EMC Proven, EMC Snap, EMC SourceOne,
EMC Storage Administrator, Acartus, Access Logix, AdvantEdge, AlphaStor, ApplicationXtender, ArchiveXtender, Atmos, Authentica,
Authentic Problems, Automated Resource Manager, AutoStart, AutoSwap, AVALONidm, Avamar, Captiva, Catalog Solution, C-Clip, Celerra,
Celerra Replicator, Centera, CenterStage, CentraStar, ClaimPack, ClaimsEditor, CLARiiON, ClientPak, Codebook Correlation Technology,
Common Information Model, Configuration Intelligence, Configuresoft, Connectrix, CopyCross, CopyPoint, Dantz, DatabaseXtender, Direct
Matrix Architecture, DiskXtender, DiskXtender 2000, Document Sciences, Documentum, eInput, E-Lab, EmailXaminer, EmailXtender,
Enginuity, eRoom, Event Explorer, FarPoint, FirstPass, FLARE, FormWare, Geosynchrony, Global File Virtualization, Graphic Visualization,
Greenplum, HighRoad, HomeBase, InfoMover, Infoscape, Infra, InputAccel, InputAccel Express, Invista, Ionix, ISIS, Max Retriever,
MediaStor, MirrorView, Navisphere, NetWorker, nLayers, OnAlert, OpenScale, PixTools, Powerlink, PowerPath, PowerSnap, QuickScan,
Rainfinity, RepliCare, RepliStor, ResourcePak, Retrospect, RSA, the RSA logo, SafeLine, SAN Advisor, SAN Copy, SAN Manager, Smarts,
SnapImage, SnapSure, SnapView, SRDF, StorageScope, SupportMate, SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix
VMAX, TimeFinder, UltraFlex, UltraPoint, UltraScale, Unisphere, VMAX, Vblock, Viewlets, Virtual Matrix, Virtual Matrix Architecture,
Virtual Provisioning, VisualSAN, VisualSRM, Voyence, VPLEX, VSAM-Assist, WebXtender, xPression, xPresso, YottaYotta, the EMC logo, and
where information lives, are registered trademarks or trademarks of EMC Corporation in the United States and other countries.
All other trademarks used herein are the property of their respective owners.
© Copyright 2014 EMC Corporation. All rights reserved. Published in the USA.
Revision Date: <Published DATE /MM/YYYY>
Revision Number: <Course Number>.<Product Version>.<Edit Version> (Ex: MR-1CP-SYMCFG.5874.2.0)

This course educates participants on building a cloud infrastructure based on the cloud computing reference
model. The reference model includes five fundamental layers (physical, virtual, control, orchestration,
and service) and three cross-layer functions (business continuity, security, and service management)
for building a cloud infrastructure. For each layer and function, this course covers the constituent
technologies, components, processes, and mechanisms. The course takes an open approach to
describing concepts and technologies; however, EMC product examples are included to
reinforce the concepts and technologies learned in the course.

The course uses the definitions published by the U.S. National Institute of Standards and Technology
(NIST) as its guide for all cloud computing definitions. Upon completing this course, participants will
have the knowledge to make informed decisions about the technologies, processes, and mechanisms
required to build a cloud infrastructure.

Module 1 provides an introduction to cloud computing. It gives the definition of cloud
computing, describes its essential characteristics, and covers its key benefits. Further, the
module describes the primary cloud service models and cloud deployment models.



This lesson provides the definition of cloud computing and describes its essential
characteristics. This lesson also describes the key benefits that cloud computing provides.



Cloud computing is a popular subject of discussion, and both individuals and organizations show keen
interest in it. As cloud adoption rapidly becomes a strategic business decision in many organizations,
cloud computing is no longer the catchphrase it once was. Organizations now view the cloud as
essential to their businesses and operations. As cloud computing evolves and spreads globally, many
organizations, including enterprises, government departments, research organizations, financial
institutions, and universities, are either adopting cloud computing or earnestly planning their move to
it. In surveys conducted by groups such as Gartner, International Data Group (IDG), and North Bridge,
a majority of the organizations surveyed responded that they are either identifying or have identified
the IT operations that are candidates for cloud computing. The organizations also responded that they
either have a dedicated budget or assign a significant percentage of their IT budget to cloud computing.
Estimates and forecasts indicate that cloud adoption will rise considerably in the coming years.
Cloud computing is foreseen as one of the major “disruptive” technologies of the coming decade, in
the sense that it will significantly transform businesses, economies, and lives globally. Also, the
emergence of technology trends, such as mobility, Big Data analytics, social media, and the growth of
the BYOD (bring your own device) practice, is driving organizations to optimize and innovate their
business models through investment in cloud computing. According to Gartner, “the adoption of the
cloud is rising rapidly and there is no sign that it is going back.”

The U.S. National Institute of Standards and Technology (NIST) in its Special Publication (SP)
800-145 defines cloud computing as “a model for enabling convenient, on-demand network
access to a shared pool of configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly provisioned and released with
minimal management effort or service provider interaction.”
A cloud is a collection of network-accessible computing resources. To a user, a cloud
represents an abstraction of a computing infrastructure consisting of hardware and software
resources that the user can access over a network. A cloud infrastructure is built, operated,
and managed by a cloud service provider. Cloud computing is a model that enables users to
hire and provision computing resources as a service from the cloud infrastructure of a cloud
service provider. A cloud service is any combination of computing resources, such as
hardware resources, platform software, and application software that are offered by a cloud
provider. The provider maintains shared pools of computing resources such as compute,
storage, and software. The resources are made available to consumers as services over a
network, such as the Internet. Consumers can provision the resources from the pools as and
when required. The consumers can themselves carry out the provisioning process without
the need to interact with the provider during the process. The resources are returned to the
pool when they are released.
The computing resources that make up the cloud infrastructure are deployed in data
centers. A data center is a facility that houses centralized IT systems and components
including compute systems, storage systems, network equipment, and platform and
application software. A data center also has supporting infrastructure such as power supply
and heating, ventilation, and air conditioning (HVAC) systems. (Contd.)

The operations staff of a data center monitors operations and maintains the IT and
infrastructure equipment around the clock. A cloud data center may reside at a single
physical location or comprise multiple data centers distributed across geographic
locations and connected to each other over a network.
The cloud model is similar to using a utility service, such as electricity, wherein a consumer
simply plugs in an electrical appliance to a socket and turns it on. The consumer is typically
unaware of how the electricity is generated or distributed and only pays for the amount of
electricity used. Similarly, in cloud computing, consumers pay only for the services that they
use, without the risks and costs associated with owning the computing resources.
Consumers can pay for cloud services based on a subscription or based on resource usage.
Many organizations now see the cloud as being an extension of their IT resources
procurement strategy. It may well become the predominant way in which organizations
acquire and use computing technology in the future. Through cloud computing, even
smaller companies can obtain required computing resources and compete in ways that were
previously expensive and often cost-prohibitive.
The figure on the slide depicts a generic cloud computing environment. The term “cloud”
originates from the cloud-like bubble that is generally used to represent a system, such as a
network or a compute cluster, in technical architecture diagrams. However, not every such
network-accessible system qualifies as a cloud. A computing infrastructure can be classified as a cloud only if it has
some specific essential characteristics. These characteristics are discussed next.

A cloud infrastructure has some specific characteristics. In SP 800-145, NIST specifies that
cloud infrastructure should have the five essential characteristics listed below –
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service
Note: This course uses the following terminology –
• “Cloud service provider” or “cloud provider” or “service provider” or “provider” is an
organization that provides cloud services. The provider may be an external provider or
internal to the organization, for example, the IT department.
• “Cloud consumer” or “consumer” is a person or an organization that is a customer of a
cloud. Also, a cloud itself may be a customer of another cloud.
• “Compute” or “server” or “host” is a physical compute system that executes various
platform and application software.
• “Cloud infrastructure” or “cloud” is the collection of hardware and software that are
provided as services to consumers. It also includes hardware and software to manage the
cloud itself. The cloud infrastructure has the five essential characteristics as specified by
NIST.

“A consumer can unilaterally provision computing capabilities, such as server time or
networked storage, as needed, automatically, without requiring human interaction with
each service provider.” – NIST
In cloud computing, consumers have the ability to provision any computing resource
that they require from a cloud on demand, i.e., at any time they want. Self-service means
that the consumers themselves carry out all the activities required to provision the cloud
resource.
To enable on-demand self-service, a cloud provider makes available a simple and user-
friendly self-service portal, which is a web site that allows consumers to view and order
cloud services. The cloud provider publishes a service catalog on the self-service portal. The
service catalog lists items, such as service offerings, service prices, service functions,
request processes, and so on. A potential consumer can use the self-service portal via a
browser to view the cloud services listed in the service catalog. The consumer can then
place a request for the required service(s) through the self-service portal. The request gets
processed automatically by the cloud infrastructure, without human intervention from the
cloud provider’s side. On-demand self-service enables the consumers to provision cloud
services in a simple and flexible manner. For example, if a consumer requires compute
systems to host applications and databases, the resources can be quickly and easily
provisioned from the cloud. This eliminates several time-consuming resource acquisition
and configuration processes and also the dependency on internal IT. This considerably
reduces the time needed to provision new or additional computing resources. Module 6 of
this course covers in detail the topics of self-service portal and service catalog.
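The flow described above can be pictured in a few lines of code. The sketch below is illustrative only; the catalog entries, the SelfServicePortal class, and the automatic fulfillment are hypothetical and are not tied to any specific cloud product.

```python
# Illustrative sketch of on-demand self-service: a consumer browses a service
# catalog and places a request that is fulfilled automatically, with no manual
# intervention from the provider. All names and entries here are hypothetical.

SERVICE_CATALOG = {
    "small-vm": {"description": "1 vCPU, 2 GB RAM compute instance", "price_per_hour": 0.05},
    "db-service": {"description": "Managed relational database", "price_per_hour": 0.20},
}

class SelfServicePortal:
    def __init__(self, catalog):
        self.catalog = catalog
        self.requests = []

    def list_services(self):
        # The portal displays the published service catalog to the consumer.
        return self.catalog

    def order(self, consumer, service_id):
        # The request is validated against the catalog and then fulfilled
        # automatically by the cloud infrastructure (simulated here).
        if service_id not in self.catalog:
            raise ValueError(f"{service_id} is not in the service catalog")
        request = {"consumer": consumer, "service": service_id, "status": "provisioned"}
        self.requests.append(request)
        return request

portal = SelfServicePortal(SERVICE_CATALOG)
print(portal.list_services())
print(portal.order("consumer-a", "small-vm"))
```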

“Capabilities are available over the network and accessed through standard mechanisms
that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones,
tablets, laptops, and workstations).” – NIST
Consumers access cloud services on any end-point device and from any location over a
network, such as the Internet or an organization’s private network. For instance, a cloud
application, such as a web-based document creator and editor, can be accessed and used at
any time over the Internet. Users can access and edit documents from any Internet-
connected device, eliminating the need to install the application or any specialized client
software on the device. In cloud computing, network-accessible capabilities go beyond
applications. Cloud solutions provide access to data, to compute, to storage, to facilities
such as data backup and recovery, and to essentially any data center capability, from any
place and any device. Cloud services are accessed via the network, from a broad range of
end-point devices, such as desktops, laptops, tablets, mobile phones, and thin clients. The
devices may have heterogeneous underlying hardware and software platforms.
Any network communication involves the use of the standard network specifications, the
protocols and the mechanisms that are detailed in the Open Systems Interconnection (OSI)
conceptual model and the TCP/IP protocol suite. Each of the two networking models
specifies a set of abstraction layers, wherein each layer is a set of network-related entities,
functions, and protocols, and provides services to the layer above it. The top-most layer in
each model is the Application Layer, which is the layer that applications interact with to
exchange data with other applications over a network connection. (Contd.)
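To make the point concrete, the short sketch below accesses a cloud-hosted service over HTTP, an Application Layer protocol, using only the Python standard library; the same request could be issued from a laptop, tablet, or phone. The URL is a placeholder, not a real service endpoint.

```python
# Minimal illustration of broad network access: a client on any device reaches a
# cloud service over a standard Application Layer protocol (HTTP). The endpoint
# below is a hypothetical placeholder.
import urllib.request

SERVICE_URL = "https://cloud.example.com/api/documents/42"  # placeholder endpoint

def fetch_document(url: str) -> bytes:
    # HTTP rides on top of TCP/IP; the client needs no knowledge of how the
    # provider's infrastructure behind the URL is built or where it is located.
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()

if __name__ == "__main__":
    print(fetch_document(SERVICE_URL)[:200])
```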

“The provider’s computing resources are pooled to serve multiple consumers using a multi-
tenant model, with different physical and virtual resources dynamically assigned and
reassigned according to consumer demand. There is a sense of location independence in
that the customer generally has no control or knowledge over the exact location of the
provided resources but may be able to specify location at a higher level of abstraction (e.g.,
country, state, or datacenter). Examples of resources include storage, processing, memory,
and network bandwidth.” – NIST
In cloud computing, resources such as storage, processor, memory, and network bandwidth
are pooled to serve multiple consumers. Resource pooling enables computing resources to
be dynamically assigned, released, and reassigned according to consumer demand. This, in
turn, enables cloud providers to achieve high levels of resource utilization and to flexibly
provision or reclaim resources. Consumers can provision resources from the pool, as
required and can release a resource when it is no longer required. Upon release, the
resource is returned to the pool and made available for reallocation. For example, the
storage capacities of multiple storage systems can be combined to obtain a single large
storage pool from which storage can be provisioned to multiple consumers. The same can
be done with compute system processors and with network bandwidth. This is known as a
multi-tenant model.
Multi-tenancy refers to an architecture in which multiple independent consumers (or
tenants) are serviced using a single set of resources. A tenant could be an individual user, a
user group, or an organization. The multi-tenant model enables a provider to offer services
at a lower cost through economy of scale. This is similar to tenants sharing a physical
building, such as a hotel. Just as the building may be occupied by multiple residents or
tenants, each with their own private space, a multi-tenant cloud infrastructure contains
pools of different resource types that serve multiple independent consumers (or tenants).
(Contd.)
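A simplified way to picture resource pooling and multi-tenancy is shown below: capacity from several storage systems is combined into one pool, and independent tenants draw from and return capacity to that shared pool. This is a conceptual sketch, not a representation of any particular product.

```python
# Conceptual sketch of resource pooling: capacity from multiple storage systems
# is aggregated into a single pool that serves multiple independent tenants.

class StoragePool:
    def __init__(self, system_capacities_gb):
        # Combine the capacity of several storage systems into one pool.
        self.total_gb = sum(system_capacities_gb)
        self.allocations = {}  # tenant -> allocated GB

    def allocated_gb(self):
        return sum(self.allocations.values())

    def provision(self, tenant, size_gb):
        # Dynamically assign capacity to a tenant on demand.
        if size_gb > self.total_gb - self.allocated_gb():
            raise RuntimeError("insufficient free capacity in the pool")
        self.allocations[tenant] = self.allocations.get(tenant, 0) + size_gb

    def release(self, tenant):
        # Released capacity returns to the pool for reallocation.
        self.allocations.pop(tenant, None)

pool = StoragePool([500, 500, 1000])   # three storage systems -> 2000 GB pool
pool.provision("tenant-a", 300)
pool.provision("tenant-b", 700)
pool.release("tenant-a")               # tenant-a's 300 GB returns to the pool
print(pool.total_gb - pool.allocated_gb())  # prints 1300
```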
“Capabilities can be rapidly and elastically provisioned, in some cases automatically, to scale
rapidly outward and inward commensurate with demand. To the consumer, the capabilities
available for provisioning often appear to be unlimited and can be appropriated in any
quantity at any time.” – NIST
Rapid elasticity refers to the ability of a cloud infrastructure to adapt to the variations in
workload by quickly and dynamically expanding (scaling outward) or reducing (scaling
inward) computing resources, and to proportionately maintain the required performance
level. For example, an organization might require double the processing capacity for a
specific duration to enable the deployed application to handle increased workload. For the
remaining period, the organization might want to release the idle computing resources to
save costs. The workload variations may be seasonal, exponential, transient, and so on.
Consumers can leverage the rapid elasticity characteristic of a cloud infrastructure when
they have such variations in workloads and IT resource requirements. A cloud infrastructure
makes this possible by enabling consumers to provision resources dynamically, at any time
that they want. Dynamic resource provisioning can be manual or automated. It requires
monitoring resource usage and provisioning additional resources, as and when required, to
meet the demand. In cloud systems, elastic provisioning is typically done through automation,
since carrying out the tasks manually can be time-consuming, cumbersome, and error-prone. The characteristic
of rapid elasticity gives consumers a sense of availability of unlimited computing resources
that can be provisioned at any time. (Contd.)
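The automation mentioned above is often implemented as a simple monitor-and-scale loop. The sketch below shows the idea under assumed thresholds; the utilization figures and scaling rules are hypothetical.

```python
# Illustrative auto-scaling logic for rapid elasticity: monitor utilization and
# scale the number of instances outward or inward. Thresholds are assumptions.

SCALE_OUT_THRESHOLD = 0.80   # add capacity above 80% utilization
SCALE_IN_THRESHOLD = 0.30    # remove capacity below 30% utilization
MIN_INSTANCES, MAX_INSTANCES = 1, 10

def adjust_capacity(current_instances: int, utilization: float) -> int:
    if utilization > SCALE_OUT_THRESHOLD and current_instances < MAX_INSTANCES:
        return current_instances + 1          # scale outward
    if utilization < SCALE_IN_THRESHOLD and current_instances > MIN_INSTANCES:
        return current_instances - 1          # scale inward, releasing idle resources
    return current_instances                  # demand is within the target band

# Simulated utilization samples for a workload spike and the subsequent lull.
instances = 2
for sample in [0.85, 0.90, 0.75, 0.40, 0.20, 0.15]:
    instances = adjust_capacity(instances, sample)
    print(f"utilization={sample:.2f} -> instances={instances}")
```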

“Cloud systems automatically control and optimize resource use by leveraging a metering
capability at some level of abstraction appropriate to the type of service (e.g., storage,
processing, bandwidth, and active user accounts). Resource usage can be monitored,
controlled, and reported, providing transparency for both the provider and consumer of the
utilized service.” – NIST
A cloud infrastructure has a metering capability that continuously monitors resource usage
(for example: processor time, network bandwidth, and storage capacity) and provides
reports on resource utilization and information about the current demand on the cloud.
Metering helps cloud providers with capacity and service planning and enables them to
automatically control and optimize resource use for delivering cloud services according to
agreed service levels. The monitoring of resource usage supports the cloud characteristic of
rapid elasticity as it helps in identifying when additional resources need to be dynamically
provisioned (or released) to meet workloads. Metering also provides consumers with a
better sense of resource consumption and provides transparency in billing and in verifying
that service levels were met.
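As a small illustration of measured service, the sketch below aggregates hypothetical usage records into a per-consumer report that could feed both capacity planning and billing. The record format and rates are assumptions made for the example.

```python
# Illustration of measured service: usage records are metered per consumer and
# summarized for reporting and billing. Rates and records are hypothetical.

RATES = {"cpu_hours": 0.04, "storage_gb_days": 0.002, "network_gb": 0.01}

usage_records = [
    {"consumer": "tenant-a", "cpu_hours": 120, "storage_gb_days": 500, "network_gb": 40},
    {"consumer": "tenant-a", "cpu_hours": 60,  "storage_gb_days": 500, "network_gb": 10},
    {"consumer": "tenant-b", "cpu_hours": 300, "storage_gb_days": 2000, "network_gb": 250},
]

def summarize(records):
    # Aggregate metered usage per consumer.
    report = {}
    for record in records:
        summary = report.setdefault(record["consumer"], {k: 0 for k in RATES})
        for metric in RATES:
            summary[metric] += record[metric]
    return report

def bill(summary):
    # Charge = metered usage multiplied by the unit rate for each metric.
    return sum(summary[metric] * rate for metric, rate in RATES.items())

for consumer, summary in summarize(usage_records).items():
    print(consumer, summary, f"charge = ${bill(summary):.2f}")
```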

The key benefits of cloud computing are as follows –
Business agility: In a traditional environment, the process of acquiring new or additional
computing resources might comprise rigid procedures and approvals. This may result in the
resource acquisition process taking up a considerable amount of time, which in turn can
delay operations and increase the time-to-market. Cloud computing provides the capability
to provision computing resources quickly and at any time, thereby considerably reducing the
time required to deploy new applications and services. This enables businesses to reduce
the time-to-market and to respond more quickly to market changes. Agility also enables
rapid development and experimentation, which in turn facilitates innovation, which is
essential in research and development, the discovery of new markets and revenue
opportunities, creating new customer segments, and the development of new products.
Reduced IT costs: In a traditional environment, resources are often acquired and dedicated
to specific business applications. Also, to the extent allowed by budget, resources are
provisioned to accommodate the maximum estimated or peak usage requirements of the
application. These practices frequently result in higher up-front costs, the creation of IT
silos, the underutilization of resources, and increased energy consumption. Cloud
computing enables consumers to hire any required computing resources based on pay-per-
use or subscription pricing. This reduces a consumer’s IT capital expenditure (CAPEX) as
investment is required only for the resources needed to access the cloud services. Also, the
consumer hires only those resources from the cloud that are required, thereby eliminating
silos and underutilized resources. Additionally, the expenses associated with IT
infrastructure configuration, management, floor space, power, and cooling are reduced.
Thus, cloud adoption has the potential of lowering the total cost of ownership (TCO) for a
consumer. (Contd.)
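A back-of-the-envelope comparison helps make the CAPEX and TCO point. All figures below are invented purely for illustration and do not represent real prices or a real sizing exercise.

```python
# Hypothetical cost comparison: buying dedicated infrastructure sized for peak
# load versus paying for cloud resources as they are used. All numbers are invented.

# Traditional approach: capital purchase sized for peak demand, plus yearly
# operating costs (power, cooling, floor space, administration).
capex = 200_000
yearly_opex = 40_000
years = 3
traditional_tco = capex + yearly_opex * years

# Cloud approach: no up-front purchase; pay per hour only for resources used.
avg_instances = 20
hourly_rate = 0.50
hours_per_year = 24 * 365
cloud_tco = avg_instances * hourly_rate * hours_per_year * years

print(f"Traditional 3-year TCO: ${traditional_tco:,.0f}")
print(f"Cloud 3-year TCO:       ${cloud_tco:,.0f}")
```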

Business continuity: It is possible for IT services to be rendered unavailable due to causes
such as natural disasters, human error, unplanned events, and planned maintenance. The
unavailability of IT services can lead to significant financial losses to organizations and may
also affect their reputations. Through the use of cloud business continuity solutions an
organization can mitigate the impact of downtime and recover from outages that adversely
affect business operations. For example, an organization may use cloud-based backup for
maintaining additional copies of their data, which can be retrieved in the event of an outage.
Also, an organization can save on the capital expenses required for implementing a backup
solution for their IT infrastructure.
Flexible scaling: Organizations may have the need for additional computing resources at
times when workloads are greater. However, they would not want to incur the capital
expense of purchasing the additional compute systems and then having idle compute
systems on the floor when not required, which could be the case most of the time. They
would also want to release the compute resources after the task is completed. In cloud
computing, consumers can unilaterally and automatically scale computing resources to
meet workload demand. This is significantly more cost-effective than buying new computing
resources that are only used for a short time or only during specific periods.
Flexibility of access: In a traditional environment, computing resources are accessed from
dedicated devices, such as a desktop or a laptop. An application has to be installed on the
device in order to be used. In this environment, it is usually not possible to access the
application if the user is away from the device where it is installed. In cloud computing,
applications and data reside centrally and can be accessed over a network from any device
(desktop, mobile, thin client, and so on) and from any location. This eliminates a consumer’s
dependency on a specific end-point device. This also enables Bring Your Own Device (BYOD),
which is a recent trend in computing, whereby employees are allowed to use non-company
devices as business machines. BYOD and thin clients create an opportunity to reduce
acquisition and operational costs.

Application development and testing: Organizations have to invest in procuring IT resources
to support application development and testing. Typically, the developed applications are
tested on a wide range of platforms, which requires organizations to invest in and
maintain multiple platforms for development and testing. Also, the developed applications
may have to be tested under heavy workload, which might require a large amount of
computing resources for a short period of time. Cloud computing enables organizations to
develop and test their applications at a greater scale. Also, organizations can create
compute systems of different hardware and software configurations to test applications
in different environments. Organizations can avoid upfront capital costs and pay only for
the resources that they consume. Organizations can also speed up application delivery,
while meeting the budget and time-to-market requirements.
Simplified Infrastructure Management: In a traditional environment, an organization’s IT
department has to manage a wide range of hardware and software resources. The tasks
involve configuration, applying the latest patches and updates, and carrying out upgrades
and replacements. Furthermore, workloads and manpower requirements increase with the
size of the IT infrastructure. When an organization uses cloud services, their infrastructure
management tasks are reduced to managing only those resources that are required to
access the cloud services. The cloud infrastructure is managed by the cloud service provider
and tasks such as software updates and renewals are handled by the cloud provider. The
provider ensures that the cloud infrastructure remains modern and up-to-date with
consumer requirements. (Contd.)

This lesson covers the three primary cloud service models – Infrastructure as a Service,
Platform as a Service, and Software as a Service.



A cloud service model specifies the cloud services and the capabilities that are provided to
consumers. In Special Publication 800-145, the U.S. National Institute of Standards and
Technology (NIST) classifies cloud service offerings into the three primary models listed
below –
• Infrastructure as a Service (IaaS)
• Platform as a Service (PaaS)
• Software as a Service (SaaS)
The different service models provide different capabilities and are suitable for different
customers and business objectives. The drivers and considerations for each cloud service
model are covered in Module 2 of this course.
Note: Many alternative cloud service models based on IaaS, PaaS, and SaaS have been
defined in various publications and by different industry groups to indicate certain
specialized cloud services and capabilities. Some such cloud service models are Network as
a Service (NaaS), Case as a Service (CaaS), Desktop as a Service (DaaS), Business Process as
a Service (BPaaS), Test Environment as a Service (TEaaS), Mobile Backend as a Service
(MBaaS), and so on. However, these models ultimately map to one of the three primary
cloud service models.

“The capability provided to the consumer is to provision processing, storage, networks, and
other fundamental computing resources where the consumer is able to deploy and run
arbitrary software, which can include operating systems and applications. The consumer
does not manage or control the underlying cloud infrastructure but has control over
operating systems, storage, and deployed applications; and possibly limited control of select
networking components (for example, host firewalls).” – NIST
In IaaS, consumers hire computing resources (such as, compute, storage, and network) from
a cloud service provider. The underlying cloud infrastructure is deployed and managed by
the cloud service provider. Consumers can deploy and configure software, such as operating
system (OS), database, and applications on the cloud resources. Typically the users of IaaS
are IT system administrators. IaaS can even be implemented internally by an organization,
with internal IT managing the resources and services. IaaS pricing can be subscription-based
or usage-based. Some resources that are charged on the basis of usage include processor
time, storage space, and network bandwidth (for data exchange). Keeping in line with the
cloud characteristics, the provider pools the underlying resources and they are shared by
multiple consumers through a multi-tenant model.
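As an illustration of how an IaaS consumer might provision fundamental computing resources programmatically, the sketch below posts a request to a hypothetical provider API. The endpoint, payload fields, and authentication scheme are all assumptions and do not describe the interface of any real provider.

```python
# Hypothetical IaaS provisioning call: the consumer asks the provider's API for
# a compute instance with chosen processing, memory, storage, and OS image. The
# endpoint and payload schema are invented for illustration.
import json
import urllib.request

API_ENDPOINT = "https://iaas.example.com/v1/instances"   # placeholder
API_TOKEN = "REPLACE_WITH_TOKEN"                          # placeholder credential

def provision_instance(vcpus, memory_gb, storage_gb, image):
    payload = json.dumps({
        "vcpus": vcpus,
        "memory_gb": memory_gb,
        "storage_gb": storage_gb,
        "image": image,          # consumer controls the OS and deployed software
    }).encode("utf-8")
    request = urllib.request.Request(
        API_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_TOKEN}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

# Example usage: request a 2 vCPU / 4 GB instance with a base Linux image.
# print(provision_instance(2, 4, 100, "linux-base"))
```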

“The capability provided to the consumer is to deploy onto the cloud infrastructure
consumer-created or acquired applications created using programming languages, libraries,
services, and tools supported by the provider. The consumer does not manage or control
the underlying cloud infrastructure including network, servers, operating systems, or
storage, but has control over the deployed applications and possibly configuration settings
for the application-hosting environment.” – NIST
In PaaS, a cloud service typically includes compute, storage, and network resources along
with an OS, a database, a software development framework, middleware, and tools to
develop, test, deploy, and manage applications. PaaS enables application developers to
design and develop cloud-based applications using the programming languages, the class
libraries, and the tools supported by the provider. PaaS offerings typically enable consumers
to build highly scalable cloud applications that can support a large number of end users. The
elasticity and scalability are facilitated transparently by the cloud infrastructure. Moreover,
PaaS helps application testers to test the applications in various cloud-based environments.
PaaS also enables application deployers to publish or update the applications on the
underlying cloud infrastructure. Further, PaaS enables application administrators to
configure, monitor, and tune the cloud applications.
Most PaaS offerings are “polyglot” in nature, which means that they support multiple
operating systems, programming languages and frameworks for application development
and deployment. PaaS usage fees are typically calculated based on factors such as the
number of consumers, the types of consumers (developer, tester, and so on), the time for
which the platform is in use, and the storage, processing, or network resources consumed
by the platform. WISA (Windows, Internet Information Services, SQL Server, and ASP.NET)
and LAMP (Linux, Apache, MySQL, and PHP/Python/Perl) are examples of solution stacks
provided through PaaS for developing and deploying cloud applications.
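The sketch below illustrates, in a provider-neutral way, what a PaaS consumer typically supplies: application code plus a small deployment descriptor, while the platform takes care of the underlying servers, OS, and scaling. The descriptor fields and the deploy call are hypothetical.

```python
# Provider-neutral illustration of a PaaS deployment: the developer supplies the
# application and a descriptor; the platform provisions and scales the runtime.
# The descriptor fields and client call below are hypothetical.

app_descriptor = {
    "name": "orders-web",
    "runtime": "python",         # language/framework supported by the provider
    "instances": 3,              # the platform handles placement and scaling
    "memory_mb": 512,
    "services": ["mysql-small"]  # managed backing service bound to the app
}

def deploy(descriptor, source_path):
    # In a real platform this would upload the code and trigger staging; here we
    # only validate and echo the intent to keep the sketch self-contained.
    required = {"name", "runtime", "instances"}
    missing = required - descriptor.keys()
    if missing:
        raise ValueError(f"descriptor missing fields: {missing}")
    print(f"Deploying {descriptor['name']} ({descriptor['runtime']}) "
          f"x{descriptor['instances']} from {source_path}")

deploy(app_descriptor, "./orders-web/")
```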

“The capability provided to the consumer is to use the provider’s applications running on a
cloud infrastructure. The applications are accessible from various client devices through
either a thin client interface, such as a web browser (for example, web-based email), or a
program interface. The consumer does not manage or control the underlying cloud
infrastructure including network, servers, operating systems, storage, or even individual
application capabilities, with the possible exception of limited user-specific application
configuration settings.” – NIST
In SaaS, the entire cloud infrastructure and applications are owned and managed by the
cloud provider. In the SaaS model, a provider hosts an application centrally in the cloud and
offers it to multiple consumers for use as a service. In SaaS, a given version of an
application, with a specific configuration (hardware and software) typically provides service
to multiple consumers by partitioning their individual sessions and data. SaaS applications
execute in the cloud and usually do not need installation on end-point devices. This enables
a consumer to access the application on demand from any location and use it through a web
browser on a variety of end-point devices. Some SaaS applications may require a client
interface to be locally installed on an end-point device. Customer Relationship Management
(CRM), email, Enterprise Resource Planning (ERP), and office suites are examples of
applications delivered through SaaS.
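The multi-tenant behavior described above can be pictured with a few lines of code: one centrally hosted application serves several consumers while keeping each tenant's sessions and data partitioned. This is a conceptual sketch only, with illustrative names.

```python
# Conceptual sketch of a SaaS application serving multiple consumers from one
# deployment while partitioning each tenant's data. Names are illustrative.

class SaaSApplication:
    def __init__(self):
        # One shared application; data is partitioned per tenant.
        self._data = {}

    def save_record(self, tenant_id, key, value):
        self._data.setdefault(tenant_id, {})[key] = value

    def get_records(self, tenant_id):
        # A tenant sees only its own partition, never another tenant's data.
        return dict(self._data.get(tenant_id, {}))

crm = SaaSApplication()
crm.save_record("acme", "lead-1", {"name": "Alice"})
crm.save_record("globex", "lead-1", {"name": "Bob"})
print(crm.get_records("acme"))    # {'lead-1': {'name': 'Alice'}}
print(crm.get_records("globex"))  # {'lead-1': {'name': 'Bob'}}
```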

This lesson covers the four primary cloud deployment models – public cloud, private cloud,
community cloud, and hybrid cloud.



A cloud deployment model provides a basis for how cloud infrastructure is built, managed,
and accessed. In Special Publication 800-145, NIST specifies the four primary cloud
deployment models listed below –
• Public cloud
• Private cloud
• Hybrid cloud
• Community cloud
The cloud deployment models may be used for any of the cloud service models – IaaS, PaaS,
and SaaS.
The different deployment models present a number of tradeoffs in terms of the control, the
scale, the cost, and the availability of resources. The drivers and the considerations for each
cloud deployment model are covered in Module 2 of this course.

“The cloud infrastructure is provisioned for open use by the general public. It may be
owned, managed, and operated by a business, academic, or government organization, or
some combination of them. It exists on the premises of the cloud provider.” – NIST
A public cloud is a cloud infrastructure deployed by a provider to offer cloud services to the
general public and/or organizations over the Internet. In the public cloud model, there may
be multiple tenants (consumers) who share common cloud resources. A provider typically
has default service levels for all consumers of the public cloud. The provider may migrate a
consumer’s workload at any time and to any location. Some providers may optionally
provide features that enable a consumer to configure their account with specific location
restrictions. Public cloud services may be free, subscription-based or provided on a pay-per-
use model.
Public cloud provides the benefits of low up-front expenditure on IT resources and
enormous scalability. However, some concerns for the consumers include network
dependency, risks associated with multi-tenancy, limited or no visibility and control over the
cloud resources and data, and restrictive default service levels.
The figure on the slide depicts a generic public cloud that is used by enterprises and by
individuals. The figure includes some virtual components for relevance and accuracy. The
virtual components are described later in the course in Module 4.

“The cloud infrastructure is provisioned for exclusive use by a single organization comprising
multiple consumers (for example, business units). It may be owned, managed, and operated
by the organization, a third party, or some combination of them, and it may exist on or off
premises.” – NIST
A private cloud is a cloud infrastructure that is set up for the sole use of a particular
organization. The cloud services implemented on the private cloud are dedicated to
consumers, such as the departments and business units within the organization. Many
organizations may not wish to adopt public clouds as they are accessed over the open
Internet and used by the general public. With a public cloud, an organization may have
concerns related to privacy, external threats, and lack of control over the computing
resources and data. When compared to a public cloud, a private cloud offers an
organization a greater degree of privacy, and control over the cloud infrastructure,
applications and data. The private cloud model is typically adopted by larger-sized
organizations that have the resources to deploy and operate private clouds.
There are two variants of a private cloud – on-premise and externally-hosted. These are
discussed next.

The on-premise private cloud, also known as an internal cloud, is hosted by an organization
on its data centers within its own premises. The on-premise private cloud model enables an
organization to have complete control over the infrastructure and data. In this model, the
organization’s IT department is typically the cloud service provider. In some cases, a private
cloud may also span across multiple sites of an organization, with the sites interconnected
via a secure network connection.
The on-premise private cloud model enables an organization to standardize IT resources,
management processes, and cloud services. Standardization simplifies the private cloud
environment and the infrastructure management process and creates an opportunity to
save operational costs. It reduces the variation in the hardware and software components
used for the private cloud deployment. Standardization is typically achieved by using
compatible products for technology components such as compute, storage, networking or
management. Standardization also helps in automation of resource and service
management. Automation eliminates the need for IT to perform repetitive manual
processes and tasks associated with activities, such as configuration and provisioning.
However, not all automation products are fully compatible with all hardware. In such cases,
a standardized environment may reduce the amount of customization and integration
required to implement automation.
Organizations choosing the on-premise private cloud approach would incur significant
CAPEX for the IT resources as compared to the public cloud approach. This may give rise to
challenges regarding infrastructure size and resource scalability. The on-premise private
cloud model is best suited for organizations that require complete control over their
infrastructure, resource configurations, applications, data, and security mechanisms.

In the externally-hosted private cloud model, an organization outsources the
implementation of the private cloud to an external cloud service provider. The cloud
infrastructure is hosted on the premises of the external provider and not within the
consumer organization’s premises. The provider manages the cloud infrastructure and
facilitates an exclusive private cloud environment for the organization.
The organization’s IT infrastructure connects to the externally-hosted private cloud over a
secure network. The provider enforces security mechanisms in the private cloud per the
consumer organization’s security requirements. In this model, the cloud infrastructure may
be shared by multiple tenants. However, the provider has a security perimeter around the
private cloud resources of the consumer organization. The organization’s private cloud
resources are separated from other cloud tenants by access policies implemented by the
provider’s software. A number of possible mechanisms can be used to maintain this
separation and protect against threats. These are discussed later in the course in Module 8.
Organizations choosing the externally-hosted private cloud model can save on the CAPEX
associated with IT resources, such as compute systems, storage systems, and other
supporting infrastructure. Also, an organization can hire cloud resources in any quantity
from the provider, unlike the on-premise private cloud model, in which the resources must
be provisioned by the organization up front.

“The cloud infrastructure is provisioned for exclusive use by a specific community of
consumers from organizations that have shared concerns (for example, mission, security
requirements, policy, and compliance considerations). It may be owned, managed, and
operated by one or more of the organizations in the community, a third party, or some
combination of them, and it may exist on or off premises.” – NIST
A community cloud is a cloud infrastructure that is set up for the sole use by a group of
organizations with common goals or requirements. The organizations participating in the
community typically share the cost of the community cloud service. If various organizations
operate under common guidelines and have similar requirements, they could all share the
same cloud infrastructure and lower their individual investments. Since the costs are shared
by fewer consumers than in a public cloud, this option may be more expensive than a public cloud. However, a
community cloud may offer a higher level of control and protection against external threats
than a public cloud.
There are two variants of a community cloud – on-premise and externally-hosted. These are
discussed next.

In an on-premise community cloud, one or more participant organizations provide cloud
services that are consumed by the community. Each participant organization may provide
cloud services, consume services, or both. At least one community member must provide
cloud services for the community cloud to be functional. The cloud infrastructure is
deployed on the premises of the participant organization(s) providing the cloud services.
The organizations consuming the cloud services connect to the clouds of the provider
organizations over a secure network. The organizations providing cloud services require IT
personnel to manage the community cloud infrastructure. Participant organizations that
provide cloud services may implement a security perimeter around their cloud resources to
separate them from their other non-cloud IT resources. Additionally, the organizations that
consume community cloud services may also implement a security perimeter around their
IT resources that access the community cloud.
Many network configurations are possible in a community cloud. The figure on the slide
depicts an on-premise community cloud, the services of which are consumed by enterprises
P, Q, and R. The community cloud comprises two cloud infrastructures that are deployed on
the premises of Enterprise P and Enterprise Q, and combined to form a community cloud.

In the externally-hosted community cloud model, the participant organizations of the
community outsource the implementation of the community cloud to an external cloud
service provider. The cloud infrastructure is hosted on the premises of the external cloud
service provider and not within the premises of any of the participant organizations. The
provider manages the cloud infrastructure and facilitates an exclusive community cloud
environment for the participant organizations.
The IT infrastructure of each of the participant organizations connects to the externally-
hosted community cloud over a secure network. The provider enforces security mechanisms
in the community cloud as per the requirements of the participant organizations. In this
model, the cloud infrastructure may be shared by multiple tenants. However, the provider
has a security perimeter around the community cloud resources and they are separated
from other cloud tenants by access policies implemented by the provider’s software.
Unlike an on-premise community cloud, the participant organizations can save on the up-
front costs of IT resources in case of an externally-hosted community cloud. Also, using an
external provider’s cloud infrastructure for the community cloud may offer access to a larger
pool of resources as compared to an on-premise community cloud.

“The cloud infrastructure is a composition of two or more distinct cloud infrastructures
(private, community, or public) that remain unique entities, but are bound by standardized
or proprietary technology that enables data and application portability (for example, cloud
bursting for load balancing between clouds).” – NIST
A hybrid cloud is composed of two or more individual clouds, each of which can be private,
community, or public clouds. There can be several possible compositions of a hybrid cloud
as each constituent cloud may be of one of the five variants discussed previously. As a
result, each hybrid cloud has different properties in terms of parameters such as
performance, cost, security, and so on. A hybrid cloud may change over time as component
clouds join and leave. In a hybrid cloud environment, the component clouds are combined
through the use of open or proprietary technology, such as interoperable standards,
architectures, protocols, data formats, application programming interfaces (APIs), and so on.
The use of such technology enables data and application portability.
The figure in the slide depicts a hybrid cloud that is composed of an on-premise private
cloud (deployed by enterprise Q) and a public cloud (serving enterprise and individual
consumers in addition to enterprise Q). A usage scenario of a hybrid cloud involves hosting
mission-critical applications on a private cloud, while less critical applications are hosted on
a public cloud. By deploying less critical applications in the public cloud, an organization can
leverage the scalability and cost benefits of the public cloud. Another common usage
scenario of a hybrid cloud is “cloud bursting”, in which an organization uses a private cloud
for normal workloads, but optionally accesses a public cloud to meet higher workload
requirements. Cloud bursting allows a consumer to enjoy greater elasticity than their own
infrastructure would permit.
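Cloud bursting can be summarized as a simple placement decision: run workloads on the private cloud while it has headroom, and overflow to the public cloud when it does not. The sketch below uses assumed capacity figures and is not tied to any specific tooling.

```python
# Illustrative cloud-bursting decision: workloads run on the private cloud until
# its capacity is exhausted, then overflow ("burst") to the public cloud.
# Capacities and workload sizes are hypothetical.

PRIVATE_CAPACITY_UNITS = 100

def place_workload(required_units, private_used):
    if private_used + required_units <= PRIVATE_CAPACITY_UNITS:
        return "private-cloud", private_used + required_units
    return "public-cloud", private_used    # burst: private usage is unchanged

private_used = 0
for workload in [40, 35, 30]:              # the last workload bursts to the public cloud
    target, private_used = place_workload(workload, private_used)
    print(f"workload of {workload} units -> {target}")
```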

The Concepts in Practice section covers two product examples of PaaS and SaaS cloud
service models. The products covered are EMC Mozy and Pivotal Cloud Foundry.



Cloud Foundry is an open source Platform as a Service project by Pivotal. Cloud Foundry is
written primarily in the Ruby language and its source is available under Apache License 2.0.
It allows developers to develop and deploy applications without being concerned about
issues related to configuring and managing the underlying cloud infrastructure. It supports
multiple programming languages and frameworks including Java, Ruby, Node.js, and Scala. It
also supports multiple database systems including MySQL, MongoDB, and Redis. The Cloud
Foundry open-source community allows members to contribute to the project. Cloud
Foundry includes a self-service application execution engine, an automation engine for
application deployment and lifecycle management, a scriptable command line interface
(CLI), and integration with development tools for application deployment. Its open
architecture enables addition of frameworks, an application services interface, and a cloud
provider interface.
Mozy is a solution by EMC Corporation that provides automated and secure cloud-based
online backup and recovery through Software as a Service. Mozy provides protection
against risks such as file corruption, unintended deletion, and hardware failure for compute and
mobile systems. It is built on a highly scalable and available back-end storage architecture.
Mozy’s web-based management console enables consumers to specify the data to be
backed up and when to perform backups. Backups are encrypted and may be automatic or
scheduled periodically. Mozy has three main products – MozyHome, MozyPro, and
MozyEnterprise. MozyHome is for the individual consumer, MozyPro is for small businesses
and MozyEnterprise is for larger organizations. Mozy services are available at a monthly
subscription fee. Mozy does not require consumers to purchase any new hardware and
requires minimal IT resources to manage.

This module provided an introduction to cloud computing. It covered the definition of cloud
computing, the essential cloud characteristics, and the key cloud computing benefits. The
module also described the primary cloud service models and cloud deployment models.

This module focuses on the cloud computing reference model, deployment options, and
solutions for building a cloud infrastructure. The module also focuses on various factors that
should be considered by a cloud service provider while deploying a cloud infrastructure.

This lesson introduces the cloud computing reference model. It describes the entities and
functions of the five layers of the model. It also describes the three cross-layer functions of
the cloud computing reference model.

According to OASIS (Organization for the Advancement of Structured Information
Standards), a reference model is an abstract framework for understanding significant
relationships among the entities of some environment, and for the development of
consistent standards or specifications supporting that environment. A reference model is
based on a small number of unifying concepts and may be used as a basis for education and
explaining standards. A reference model is not directly tied to any standards, technologies or
other concrete implementation details, but it does seek to provide a common semantics
that can be used unambiguously across and between different implementations.
Key goals of a reference model are:
• Convey fundamental principles and basic functionality of a system it represents
• Facilitate efficient communication of system details between stakeholders
• Provide a point of reference for system designers to extract system specifications
• Enhance an individual’s understanding of the representative system
• Document the system for future reference and provide a means for collaboration

The cloud computing reference model is an abstract model that characterizes and standardizes
the functions of a cloud computing environment by partitioning it into abstraction layers and
cross-layer functions. This reference model groups the cloud computing functions and
activities into five logical layers and three cross-layer functions.
The five layers are the physical layer, virtual layer, control layer, service orchestration layer, and
service layer. Each of these layers specifies the various types of entities that may exist in a cloud
computing environment, such as compute systems, network devices, storage devices,
virtualization software, security mechanisms, control software, orchestration software,
management software, and so on. The model also describes the relationships among these entities.
The three cross-layer functions are business continuity, security, and service management.
The business continuity and security functions specify the various activities, tasks, and processes
that are required to offer reliable and secure cloud services to consumers. The service
management function specifies the various activities, tasks, and processes that enable the
administration of the cloud infrastructure and services, so that the provider's business
requirements and the consumers' expectations are met.

The physical layer is the foundation layer of the cloud infrastructure. It specifies the
physical entities that operate at this layer, such as compute systems, networking devices,
and storage devices. This layer also specifies entities, such as the operating environment,
protocols, tools, and processes, that enable the physical entities of this layer to perform
their functions and serve the other layers of the cloud infrastructure. A key function of this
layer is to execute the requests generated by the virtual layer or the control layer.
Examples of requests from those layers include storing data on storage devices,
performing communication among compute systems, executing programs on a compute
system, creating a backup copy of data, and executing a security policy to block an unauthorized
activity.

The virtual layer is deployed on the physical layer. It specifies the entities that operate at this
layer, such as virtualization software, resource pools, and virtual resources. A key function of
this layer is to abstract physical resources, such as compute, storage, and network, and
make them appear as virtual resources. Virtualization software deployed on compute
systems, network devices, and storage devices performs the abstraction of the physical
resources on which it is deployed. Abstracting the physical resources enables a
multi-tenant environment, thereby improving utilization of the physical resources. Improved
utilization of physical resources results in an increased return on investment (ROI) on the
infrastructure entities.
Virtualization software is also responsible for pooling physical resources, from which virtual
resources are created. Examples of virtual resources include virtual machines, virtual
volumes, and virtual networks. The requests to create resource pools and virtual resources are
generated by the control layer, and the virtual layer executes these requests. Apart from
creating resource pools and virtual resources, virtualization software also supports features
that enable optimized resource utilization, which further increases the return on investment.
Other key functions of this layer include executing the requests generated by the control layer
and forwarding requests to the physical layer to get them executed. Examples of requests
generated by the control layer include creating pools of resources and creating virtual
resources.
Note: While deploying a cloud infrastructure, an organization may choose not to deploy the
virtual layer. In such an environment, the control layer is deployed over the physical layer and can
directly request the physical layer to perform an operation. It is also possible that part of
the infrastructure is virtualized and the rest is not.

The control layer can be deployed either on the virtual layer or on the physical layer. It specifies
the entities that operate at this layer, such as control software. A key function of this layer
is executing the requests generated by the service layer in collaboration with the
orchestration layer. Another key function is forwarding requests to the
virtual and/or physical layer to get them executed. Examples of requests generated by the
service layer include creating a service instance, such as a compute system instance for IaaS
or an application instance for SaaS.
The other key functions performed by control software are resource configuration,
resource pool configuration, and resource provisioning. The control software, in
collaboration with virtualization software, enables resource pooling, dynamic allocation of
resources, creation of virtual resources, and optimization of resource utilization. The control
software initiates all such requests, including resource configuration, resource pooling, and resource
provisioning, and passes them on to the virtual layer or the physical layer. In
the absence of a virtual layer, requests generated by the control layer are passed on to the
physical layer, where they are fulfilled by the operating environment in
collaboration with the control software.
This layer also exposes resources (physical and/or virtual) to and supports the service layer,
where cloud service interfaces are exposed to consumers.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 50
The service orchestration layer specifies the entities that operate at this layer, such as
orchestration software. A key function of this layer is to provide workflows for executing
automated tasks to accomplish a desired outcome. A workflow is a series of inter-related
tasks that perform a business operation. The orchestration software enables the automated
arrangement, coordination, and management of these tasks, which helps group and sequence
tasks with dependencies among them into a single, automated workflow.
An orchestration workflow is defined for each service listed in the service catalog. When a
consumer selects a service from the service catalog, the associated workflow in the
orchestration layer is triggered. Based on this workflow, the orchestration software interacts
with various entities (from the control layer, business continuity function, security function,
and service management function) to invoke the provisioning tasks to be executed by those
entities.
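To make the idea of a workflow concrete, the sketch below groups dependent provisioning tasks into a single automated sequence that could be triggered when a consumer selects a service. It is purely illustrative; the task names and functions are assumptions, not part of any orchestration product.

```python
# Illustrative workflow: an ordered series of inter-related tasks executed
# automatically when a service is requested. Task names are hypothetical.

def create_virtual_machine(ctx):
    ctx["vm"] = f"vm-for-{ctx['consumer']}"

def attach_storage(ctx):
    ctx["volume"] = "vol-100GB"

def configure_network(ctx):
    ctx["network"] = "tenant-vlan-42"

def apply_security_policy(ctx):
    ctx["firewall"] = "default-tenant-policy"

# The workflow groups and sequences tasks with dependencies among them.
PROVISION_IAAS_WORKFLOW = [
    create_virtual_machine,
    attach_storage,
    configure_network,
    apply_security_policy,
]

def run_workflow(workflow, context):
    """Execute each task in order, passing a shared context between them."""
    for task in workflow:
        task(context)
    return context

# Triggered when a consumer selects the corresponding service from the catalog.
result = run_workflow(PROVISION_IAAS_WORKFLOW, {"consumer": "consumer-01"})
print(result)
```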

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 51
The service layer is accessible to cloud consumers. This layer specifies the entities that
operate at this layer, such as the service catalog and the self-service portal. A key function of
this layer is to store and present, in a service catalog, information about all the services
offered to cloud consumers. A service catalog is a database of information about the cloud
services offered by a service provider. It includes a variety of information about the services,
such as a description of each service, the service type, cost, supported SLAs, security
mechanisms, and so on.
Another key function of this layer is to enable cloud consumers to access and manage cloud
services via a self-service portal. A self-service portal displays the service catalog to
consumers, who can use this web portal to request cloud services. In addition to the service
catalog, the portal also provides an interface to access and manage rented service instances.
The provisioning and management requests are passed on to the orchestration layer, where
the orchestration workflows to fulfill the requests are defined.
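A service catalog entry can be thought of as a structured record. The sketch below is a hypothetical illustration of the kind of information listed above (description, type, cost, SLA, security mechanisms); the field names and values are assumptions, not a standard schema.

```python
# Hypothetical structure of service catalog entries; field names and values
# are illustrative only, not a standard schema or real pricing.

service_catalog = [
    {
        "name": "Standard Compute Instance",
        "type": "IaaS",
        "description": "2 vCPU, 8 GB RAM virtual machine with 100 GB storage",
        "cost": {"amount": 0.12, "unit": "per hour"},
        "sla": {"availability": "99.9%", "support_response": "4 hours"},
        "security": ["firewall", "encrypted storage"],
    },
    {
        "name": "Managed Database",
        "type": "PaaS",
        "description": "Managed relational database platform",
        "cost": {"amount": 0.25, "unit": "per hour"},
        "sla": {"availability": "99.95%", "support_response": "1 hour"},
        "security": ["firewall", "encryption at rest", "audit logging"],
    },
]

# A self-service portal would display these entries and, on selection,
# pass a provisioning request to the orchestration layer.
for entry in service_catalog:
    print(f"{entry['name']} ({entry['type']}): {entry['description']}")
```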

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 52
The business continuity (BC) function specifies the adoption of proactive and reactive
measures that enable a business to mitigate the impact of planned and unplanned downtime.
Proactive measures include activities, tasks, and processes such as business impact analysis,
risk assessment, and deployment of technology solutions (such as backup and replication).
Reactive measures include activities, tasks, and processes such as disaster recovery and
disaster restart, which are invoked in the event of a service failure. This function supports all
the layers – physical, virtual, control, orchestration, and service – to provide uninterrupted
services to consumers. The BC function in a cloud environment enables a business to ensure
availability of services in line with the Service Level Agreement (SLA).

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 53
The security function specifies the adoption of administrative and technical mechanisms that
mitigate or minimize security threats and provide a secure cloud environment. Administrative
mechanisms include security and personnel policies and standard procedures to direct the
safe execution of various operations. Technical mechanisms are usually implemented through
tools or devices deployed on the IT infrastructure; examples include firewalls, intrusion
detection and prevention systems, antivirus software, and so on.
Further, this function also specifies the adoption of governance, risk, and compliance (GRC),
which encompasses processes that help an organization ensure that its actions are ethically
correct and in accordance with its risk appetite (the risk level the organization chooses to
accept), internal policies, and external regulations.
This function supports all the layers – physical, virtual, control, orchestration, and service –
to provide secure services to consumers.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 54
The service management function specifies the adoption of activities related to service
portfolio management and service operation management. Adopting these activities enables
an organization to align the creation and delivery of cloud services with its business
objectives and with the expectations of cloud service consumers.
Service portfolio management encompasses the set of business-related services that:
• Define the service roadmap, service features, and service levels
• Assess and prioritize where investments across the service portfolio are most needed
• Establish budgeting and pricing
• Deal with consumers in supporting activities such as taking orders, processing bills, and
collecting payments
Service portfolio management also performs market research, measures service adoption,
collects information about competitors, and analyzes feedback from consumers in order to
quickly modify and align services according to consumer needs and market conditions.
Service operation management enables cloud administrators to manage the cloud
infrastructure and services. Service operation management tasks include the handling of
infrastructure configuration, resource provisioning, problem resolution, capacity, availability,
and compliance conformance. All of these tasks help ensure that services and service levels
are delivered as committed. Service operation management also includes monitoring cloud
services and their constituent elements, which enables the provider to gather information
related to resource consumption and generate bills. This function supports all the layers to
perform monitoring, management, and reporting for the entities of the infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 55
This lesson covered the cloud computing reference model. It covered the entities and
functions of the five layers. Finally, it covered the activities performed in the three cross-layer
functions.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 56
This lesson describes the greenfield and brownfield deployment options for building a cloud
infrastructure. Further, the lesson describes the two technology solutions for building a cloud
infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 57
Before building a cloud infrastructure, organizations must identify which deployment option
is appropriate for them. There are two deployment options for building a cloud
infrastructure: greenfield deployment and brownfield deployment. A greenfield deployment is
typically used when an infrastructure does not exist and an organization has to build the
cloud infrastructure starting from the physical layer. A brownfield deployment, on the other
hand, is used when some of the infrastructure entities already exist and can be transformed
into a cloud infrastructure by deploying the remaining required entities. For example,
consider an organization that wants to use a brownfield deployment to transform its existing
data center, which has the physical, virtual, and control layers deployed. In such a case, the
data center also has business continuity, security, and service management in place; however,
these three cross-layer functions are limited to a non-cloud environment. While transforming
the existing data center into a cloud infrastructure, the organization will have to deploy the
orchestration layer and the service layer. Further, the BC, security, and service management
functions will have to be transformed to support a cloud environment.
In both deployment options, apart from deploying the five layers and three cross-layer
functions, organizations have to consider several factors that will enable them to deploy
cloud services that meet consumers’ expectations. These factors are covered in lesson 3 of
this module.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 58
There are two solutions for building a cloud infrastructure: integrating best-of-breed cloud
infrastructure components, or acquiring and implementing a cloud-ready converged
infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 59
In the integrate-best-of-breed-components solution, organizations have the flexibility to use
and integrate infrastructure components from different vendors. This solution allows
organizations to design their cloud infrastructure by repurposing their existing infrastructure
components (in a brownfield deployment), which gives this solution a cost advantage.
When this method is used to build a cloud infrastructure, an organization may have to spend
a significant amount of IT staff time evaluating individual, disparate hardware components,
installing hardware, and integrating compute, storage, and network components. The IT
staff may also have to spend effort integrating and testing hardware, middleware, and
software. They also need to check compatibility of all the components to ensure that the
combined components interoperate and function as expected. This may delay deployment
of cloud services. Further, scaling of such an infrastructure takes longer because each
component that is scaled requires integration with the existing infrastructure and testing for
compatibility. Finally, this solution requires acquiring cloud infrastructure management tools
and deploying them on the infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 60
A cloud-ready converged infrastructure solution provides a modular design that combines
compute, storage, network, virtualization, and management components into a single
package. This package is a self-contained unit that can be used to deploy cloud services, or
can be aggregated with additional packages to support the demand for more capacity and
performance. The package is pre-configured, reducing the time to deploy cloud services.
Further, in addition to integrating various components into a package, this solution offers a
single management software suite capable of managing all hardware and software within
the package.
A cloud-ready converged infrastructure solution has built-in capabilities that provide secured
multi-tenancy. However, additional security mechanisms should be deployed to prevent
external attacks. The solution is capable of managing and mitigating failure scenarios in
hardware, software, and cloud services.
A potential area of concern regarding cloud-ready converged infrastructure solutions is lack
of flexibility to use infrastructure components from different vendors. Some vendors may
provide organizations with the flexibility to choose multi-vendor infrastructure components
such as network devices, compute systems, and hypervisors for this solution.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 61
This lesson covered the greenfield and brownfield deployment options for building a cloud
infrastructure. The lesson also covered the two solutions for building a cloud infrastructure:
best-of-breed cloud infrastructure components and cloud-ready converged infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 62
This lesson describes various factors that should be considered while building a cloud
infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 63
After deciding on the deployment option and solution to build the cloud infrastructure, a
cloud service provider has to consider several factors to deliver cloud services that meet its
business objectives and consumers’ expectations. These slides list the key factors a service
provider must consider while building a cloud infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 64
Governance is the active distribution of decision-making rights and accountability among
different stakeholders in an organization. It also describes the rules and procedures for
making and monitoring those decisions to determine and achieve desired behaviors and
results. The role of governance in IT is to implement, maintain, and continuously improve
the controls on the use of IT resources. IT governance enables a service provider to:
1. Ensure that IT resources are implemented and used according to agreed-upon policies
and procedures
2. Ensure that these resources are properly controlled and maintained
3. Ensure that these resources are providing value to the organization
Instituting IT governance usually involves establishing a review board, which is a team of
members from across business units and IT. This review board is responsible for creating
rules and processes that the organization must follow to ensure that policies are being met.
These rules and processes might include the following:
1. Understanding business issues, such as regulatory requirements or funding
2. Establishing best practices and monitoring these processes
3. Assigning responsibility for things such as standards, design, review, and certifications
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 65
Depending on the size, structure, geographic presence, and culture of an organization, one
of these fundamental governance models can be implemented:
1. A centralized model provides one governance body for the entire organization. This fits
best with a smaller or a strongly centralized organization where governance policies are,
for the most part, consistent throughout the organization.
2. A federated model proposes separate governance bodies, one for each business unit. A
business unit can be a functional organization, a product group, or a geographic location.
Each business unit has its own set of governance policies. Even though the services for a
given business unit can be independently standardized, managed and owned, a single,
enterprise-wide governance body can still subject all services to a common governance
system.
3. A distributed model proposes separate governance bodies for each business unit. These
governance bodies function autonomously and are not controlled by any common
governance system.
The organization can choose a governance model that best meets its requirements. After a
governance model is chosen, the organization then needs to take steps to establish or
transform to the chosen governance model.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 67
A cloud service provider needs to institute or transform the organization into a proactive,
services-based model. This requires developing several new functions that perform tasks
related to cloud services, such as service definition and creation, service administration and
management, service governance and policy formulation, and service consumer
management. Some of these tasks can be combined to become the responsibility of an
individual or organizational role. The new roles can be categorized as front or back office
functions. The front office supports IT business functions to enable IT to build services that
align with business needs and can be used generally across all the business units. The front
office works closely with business consumers to design, market and promote the available
services. The new service-based roles of the front office include service manager and
account manager.
• A service manager is responsible for understanding consumers’ needs and industry trends
to drive an effective product strategy. The service manager ensures that IT delivers cost-
competitive services that have the features that clients need. The service manager is also
responsible for managing consumers’ expectations of product offerings and serves as the key
interface between clients and IT staff.
• An account manager supports service managers in service planning, development and
deployment. The account manager maintains day-to-day contact to ensure consumers’
needs are met.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 68
A service provider needs to institute or transform the financial/payback/showback/pricing
model that will enable it to manage its budgeting, accounting, and chargeback requirements.
The model helps the service provider plan investments to offer cloud services and determine
the IT budget for cloud infrastructure and operations over the lifecycle of the services.
The service provider should perform service valuation. Service valuation determines the
price (or chargeback) a consumer is expected to pay for a service, which helps recover the
cost of providing the service, ensure profitability, and meet the provider’s ROI and
reinvestment goals. The service provider aggregates all types of costs (both CAPEX and
OPEX) down to the service-element level of granularity by mapping elements to the relevant
cloud services. It then calculates the service cost on a per-unit basis by dividing the
aggregated cost for a service by some logical unit of demand, such as a GB of storage or an
hour of usage for that service.
However, the per-unit service costs may vary over time, depending on demand for, or
utilization of, the services and service elements. Thus, the service provider should track
demand and utilization to establish a stable per-unit cost baseline. Finally, the service
provider may add a margin over the per-unit service cost to define the service price, or may
set the price at the true cost of the service, depending on the provider’s business goal. The
service provider then defines chargeback or showback model(s) based on the pricing strategy
for cloud services.
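The per-unit costing and pricing steps described above reduce to simple arithmetic. The sketch below is only an illustration; the cost, demand, and margin figures are assumed values, not guidance on actual pricing.

```python
# Illustrative service valuation; all figures are assumed for the example.

aggregated_cost = 120_000.0   # total CAPEX + OPEX mapped to a storage service ($/month)
units_of_demand = 400_000.0   # e.g., GB-months of storage consumed

per_unit_cost = aggregated_cost / units_of_demand        # $0.30 per GB-month
margin = 0.20                                            # assumed 20% margin over cost
service_price = per_unit_cost * (1 + margin)             # $0.36 per GB-month

print(f"Per-unit cost: ${per_unit_cost:.2f}/GB-month")
print(f"Service price: ${service_price:.2f}/GB-month")
```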

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 70
A chargeback model defines how consumers need to pay for the consumed services. A list of
common chargeback models, along with their descriptions, is provided below.
• Pay-as-you-go: Metering and pricing are based on the consumption of cloud resources by
the consumers. Consumers do not pay for unused resources.
• Subscription by time: Consumers are billed for a subscription period. The cost of
providing a cloud service for the subscription period is divided among a predefined
number of consumers. For example, in a private cloud, if three business units are
subscribing to a service that costs $60,000 a month to provide, then the chargeback per
business unit is $20,000 for the month.
• Subscription by peak usage: Consumers are billed according to their peak usage of IT
resources for a subscription period. For example, a provider may charge a consumer for
their share of peak usage of network bandwidth.
• Fixed cost or pre-pay: Consumers commit up-front to the required cloud resources for a
committed period, such as one year or three years. They pay a fixed charge periodically
through a billing cycle for the service they use, regardless of resource utilization.
• User-based: Pricing is based on the identity of a user (a person) of cloud service. In this
model, the number of users logged in is tracked and billing is based on that number.
The service provider deploys chargeback tools in the cloud infrastructure. These tools enable
the service provider to define a chargeback model. Based on the model, the tools
automatically collect billing data, store billing records in a billing system, and generate a
billing report per consumer.
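As a worked illustration of two of the models above, the sketch below reproduces the subscription-by-time example from the list and adds a simple pay-as-you-go calculation from metered usage. The unit rates and usage figures are assumptions for the example only.

```python
# Illustrative chargeback calculations; all figures are assumed.

# Subscription by time: the monthly service cost is divided among subscribers.
monthly_service_cost = 60_000.0
subscribers = 3
charge_per_subscriber = monthly_service_cost / subscribers   # $20,000 per business unit

# Pay-as-you-go: billing is based on metered consumption only.
metered_usage = {"vm_hours": 720, "storage_gb_months": 500}
rates = {"vm_hours": 0.12, "storage_gb_months": 0.30}        # assumed unit prices
pay_as_you_go_bill = sum(metered_usage[k] * rates[k] for k in metered_usage)

print(f"Subscription charge per business unit: ${charge_per_subscriber:,.2f}")
print(f"Pay-as-you-go bill: ${pay_as_you_go_bill:,.2f}")
```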

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 71
Tools play an important role in building a cloud infrastructure; therefore, an early step in
building the infrastructure is to deploy the necessary technologies using these tools. Examples
of key tools used for building a cloud infrastructure include virtualization software,
orchestration software, security software, business continuity software, self-service portal
software, and so on. These tools enable the service provider to build and offer cloud services
to consumers. Apart from considering tools that enable providers to build a cloud
infrastructure, providers should also consider tools that enable them to connect multiple
clouds or applications. Examples of such tools include cloud integration tools, APIs, and
specialized connection, transformation, and business logic programs. These types of tools are
especially useful when deploying a hybrid or community cloud, and they are also important
to consider when a service provider offers brokerage services.
Cloud integration tools enable connecting cloud applications with other cloud and non-cloud
applications to leverage the capabilities of multiple applications. Cloud integration
technology integrates multiple cloud applications using application programming interface
(API) support. These APIs enable secure access to the data of integrated applications.
However, integration cannot be accomplished with APIs alone, because they do not perform
functions such as transformation of data formats, data mapping, data validation, and error
processing. These functions are typically handled by specialized connection, transformation,
and business logic programs, which gather data with the help of APIs, transform formats as
required, and validate the accuracy of the transformation.
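The division of labor described above (APIs for data access, separate programs for transformation and validation) might look like the hedged sketch below. The endpoint URL, field names, and mapping are hypothetical assumptions, not a real cloud provider's API.

```python
# Hypothetical integration flow: gather data via an API, transform the format,
# and validate the result. The endpoint and field mapping are illustrative
# assumptions, not any real cloud application's interface.
import json
import urllib.request

def gather(url):
    """Use the (hypothetical) API to retrieve records as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

def transform(record):
    """Map the source application's field names to the target's schema."""
    return {
        "customer_id": record["custId"],
        "order_total": float(record["amount"]),
    }

def validate(record):
    """Check the accuracy of the transformation before the data is used."""
    return bool(record["customer_id"]) and record["order_total"] >= 0

def integrate(url):
    transformed = [transform(r) for r in gather(url)]
    return [r for r in transformed if validate(r)]

# Example (would require a reachable endpoint):
# clean_records = integrate("https://api.example.com/orders")
```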
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 72
A service-level agreement (SLA) is a contract negotiated between a provider and a consumer
that specifies various parameters and metrics such as cost, service availability, maintenance
schedules, performance levels, service desk response time, and consumer’s and provider’s
responsibilities.
SLAs must be carefully written before being offered to a consumer. SLAs are part of a service
contract: an agreement between the cloud service provider and the cloud service consumer
that states the terms of service usage and that must be legally established with the consumer
before a service can be used. When writing such a contract, key considerations include
business-level policies such as: data privacy, data ownership, data retention, secure deletion,
security, confidentiality, auditing, regulatory requirements, redundancy, jurisdiction,
disruption resolution, compensation for data loss and misuse, excess usage, availability and
performance metrics, payment and penalty methods, contracted services, a list of services not
covered, licensed software, and service termination.
Finally, a disaster recovery plan, penalties, and an exit clause should be included. An SLA
should indicate how unexpected incidents will be handled and what actions will be taken in
case of a prolonged service outage. It should cover penalties for not meeting the SLA. The
SLA should also include clauses related to termination of the service by both the consumer
and the provider.
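Availability metrics committed in an SLA translate directly into permissible downtime. The quick calculation below (assuming a 30-day month; the percentages are arbitrary examples) shows why the committed figure matters when writing the SLA.

```python
# Permitted downtime implied by an availability commitment,
# assuming a 30-day (43,200-minute) month.

minutes_per_month = 30 * 24 * 60

for availability in (99.0, 99.9, 99.99):
    allowed_downtime = minutes_per_month * (1 - availability / 100)
    print(f"{availability}% availability -> {allowed_downtime:.1f} minutes/month of downtime")

# 99.0%  -> 432.0 minutes/month
# 99.9%  -> 43.2 minutes/month
# 99.99% -> 4.3 minutes/month
```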

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 74
Cloud vendor lock-in refers to a situation where a consumer is unable to move readily from
the current provider to another. This condition may result from various causes, such as a high
cost of migration, a significant re-engineering effort required to migrate an application, a lack
of open standards, or restrictions imposed by the current provider.
When building a cloud infrastructure, providers must avoid using proprietary tools, APIs, or
file formats, which may cause vendor lock-in. Using widely accepted open-standard tools,
APIs, and file formats not only prevents vendor lock-in, but also makes the services offered
more acceptable to consumers. Open standards provide interoperability and portability
among providers, which consumers typically prefer. For example, the provider may use APIs
based on open standards that enable an application’s data to migrate to another provider
with minimal or no change to its format. Likewise, if the provider supports the Open
Virtualization Format (OVF), an open standard for packaging virtual machines, then a virtual
machine created in one provider’s environment can be migrated to another provider with
minimal or no changes.
Sometimes providers may impose restrictions or burdensome penalties for migrating to
another provider, causing lock-in. Including an appropriate exit clause in the SLA can prevent
vendor lock-in due to restrictions and penalties.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 75
While building a cloud infrastructure, providers must consider the challenges associated with
software (application and operating system) licenses. It is important to assess these
challenges at an early stage. Software licensing challenges are relevant to the infrastructure
as a service (IaaS) and platform as a service (PaaS) models.
Consumers can use their existing software licenses in the cloud only if they are cloud enabled.
Therefore, providers must identify whether a consumer’s existing software license is cloud
enabled. If not, the consumer can pay additional fees to get the license cloud enabled.
Alternatively, consumers can use the software provided by the service provider and pay a fee
for its usage.
Further, the service provider, in collaboration with software vendors and consumers, must
work to understand the software license rights and their usage. This is important because the
cloud service provider may have to create redundant systems through replication to protect
against unplanned outages or disasters. Understanding the license rights and their usage
enables service providers to prevent any non-compliance or violation of the license
agreement.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 76
The slide lists key factors that must be considered while deploying SaaS.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 77
The slide lists key factors that must be considered while deploying PaaS and IaaS.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 78
Migration strategy and considerations depend on whether a consumer plans to migrate
their application (in the case of IaaS) or only their data (in the case of SaaS).
For application migration, service providers must work with consumers to develop a
migration strategy for their application. They must also identify the various dependencies of
the application. For example, if an application depends on an authentication service that is
on-premise, then appropriate configuration changes are required to make the application
work after migrating to the cloud. Based on these dependencies, a consumer may choose
one of two migration strategies: forklift migration or hybrid migration.
1. In the forklift migration strategy, the application and all of its related components are
migrated to the cloud at once. This strategy is typically used for tightly coupled
applications or self-contained applications. Tightly coupled applications are multiple
applications that are dependent on each other and cannot be separated. Self-contained
applications are applications that can be treated as a single entity.
2. In a hybrid migration strategy, an application and its components are moved to the
cloud in parts. This strategy is a lower-risk approach to migrate applications to the cloud.
This is because parts of an application can be moved to the cloud and optimized before
moving other parts. This reduces the risk of unexpected behavior of the application
when it is moved to the cloud. This strategy is typically good for applications with many
loosely coupled components.
In some cases, consumers may only require migration of data. The data can be migrated to
the cloud by deploying replication technology to copy the data from the consumer’s data
center to the cloud. While migrating data to the cloud, the provider must consider factors
such as network bandwidth, data security, data integrity, data consistency, jurisdiction, and so on.
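Network bandwidth is one of the factors listed above; a rough transfer-time estimate shows why it needs to be considered early. The figures below are assumptions, and the calculation ignores protocol overhead, compression, deduplication, and link contention.

```python
# Rough estimate of data migration time over a network link.
# Figures are assumptions; real migrations also depend on protocol
# overhead, compression, deduplication, and link utilization.

data_to_migrate_tb = 10
bandwidth_mbps = 500            # dedicated bandwidth assumed available for migration

data_bits = data_to_migrate_tb * 10**12 * 8
seconds = data_bits / (bandwidth_mbps * 10**6)
print(f"~{seconds / 3600:.1f} hours to migrate {data_to_migrate_tb} TB at {bandwidth_mbps} Mb/s")
# ~44.4 hours
```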

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 79
After the application or data is migrated to the cloud, the provider must work with the
consumer to test their application to ensure that it is working as expected. The degree of
testing may vary depending on the scope and magnitude of the consumer’s requirements.
While developing a test strategy, providers, in collaboration with consumers, must consider
the following:
• Define the roles and responsibilities of the personnel involved in the test and quality
assurance process
• Identify tools required to perform test management and automation
• Design tests for data migration to the cloud
• Design test cases to perform various testing modes such as stress, performance,
functional, interoperability, and compatibility
Apart from testing the application, the provider must also test other cloud capabilities such
as fault tolerance, disaster recovery, security controls, and any other capabilities to ensure
that the migrated application has successfully been configured with the capabilities that are
committed by the provider.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 80
This lesson covered several factors that must be considered while building cloud
infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 81
The Concepts in Practice section covers two product examples: Vblock and EMC VSPEX.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 82
Vblock is a completely integrated cloud infrastructure offering that includes compute,
storage, network, and virtualization products. These products are provided by EMC (storage
solution provider), VMware (virtualization solution provider), and Cisco (networking and
compute solution provider), who have formed a coalition to deliver Vblocks.
Vblock is an integrated IT infrastructure solution that combines compute, storage, network,
virtualization, security, and management software into a single package. This solution is a
self-contained unit that accelerates deployment of a cloud infrastructure. Vblocks are pre-
architected, preconfigured, pretested and have defined performance and availability
attributes. Rather than customers buying and assembling individual IT infrastructure
components, Vblock provides a validated solution and is factory-ready for deployment and
production. This saves significant cost and deployment time associated with building a cloud
infrastructure.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 83
This module covered the cloud computing reference model. It also covered the greenfield
and brownfield deployment options. Further, it covered the two technology solutions —
best-of-breed cloud infrastructure components and cloud-ready converged infrastructure —
that can be used to build the cloud infrastructure. Finally, it covered various factors to
consider while building a cloud infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 85
Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 86
Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 87
Copyright © 2014 EMC Corporation. All rights reserved Module 2: Cloud Infrastructure Building Blocks 88
Module 3 covers the physical layer of a cloud infrastructure. This module describes the
physical compute system, its components, and its types. This module also covers storage
system architectures. Further, this module describes network connectivity and the types of
network communication.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 89
The physical layer – highlighted in the figure on the slide – is the foundation layer of the
cloud reference model. The process of building a cloud infrastructure is typically initiated
with the cloud service provider setting up the physical hardware resources of the cloud
infrastructure. Based on requirements such as performance, scalability, cost, and so on, the
provider has to make a number of decisions while building the physical layer, including
choosing suitable compute, storage, and network products and components, and the
architecture and design of each system.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 90
The physical layer comprises compute, storage and network resources, which are the
fundamental physical computing resources that make up a cloud infrastructure. As discussed
in Module 1, the physical resources are typically pooled to serve multiple consumers.
Physical compute systems host the applications that a provider offers as services to
consumers and also execute the software used by the provider to manage the cloud
infrastructure and deliver services. A cloud provider also offers compute systems to
consumers for hosting their applications in the cloud. Storage systems – typically offered
along with compute – store business data and the data generated or processed by the
applications deployed on the compute systems. Networks, such as a local area network
(LAN), connect physical compute systems to each other, which enables the applications
running on the compute systems to exchange information. A storage network connects
compute systems to storage systems, which enables the applications to access data from
the storage systems. If a cloud provider uses physical computing resources from multiple
cloud data centers to provide services, networks connect the distributed computing
resources enabling the data centers to work as a single large data center. Networks also
connect multiple clouds to one another – as in case of the hybrid cloud model – to enable
them to share cloud resources and services.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 91
This lesson provides an introduction to compute systems and describes the key software
deployed on compute systems in a cloud environment. This lesson also describes the key
components and the types of physical compute systems.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 92
The primary role of a compute system (or compute) is to host and execute different types of
software programs. For example, physical compute systems run a provider’s software that is
used for managing the cloud infrastructure and also the provider’s SaaS applications.
Compute systems also execute the business applications that consumers deploy on them.
Consumers may deploy their applications entirely on cloud compute systems or may
leverage the cloud during peak workload periods.
In a cloud environment, two or more compute systems are typically combined into a cluster
– a group of compute systems that function together, sharing certain network and storage
resources, and are viewed as a single system. Compute clusters are typically
implemented to provide high availability and load balancing of compute resources. Compute
clustering is covered in detail in Module 7.
A cloud provider typically offers compute systems to consumers in two ways – shared
hosting and dedicated hosting. In shared hosting, the compute systems are shared among
multiple consumers. For example, a provider hosts a consumer’s website on the same
compute system as the websites of other consumers. In dedicated hosting, a provider offers
to a consumer dedicated compute systems that are not shared with any other consumer.
A cloud provider may install an operating system (OS) on the compute hardware or enable
consumers to install the OS of their choice. The OS manages the physical components and
application execution, and provides a user interface (UI) for users to operate and use the
compute system. The key tasks of the OS include physical device management, memory
management, processor scheduling, and data storage management.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 93
The provider needs to purchase the required software and install it on compute systems.
Installing and managing the cloud software involves up-front costs for the provider. If the
cloud is intended to support process-intensive or data-intensive workloads, the software must
be installed on multiple compute systems in a cluster. A cloud provider deploys software on
compute systems, such as the cloud self-service portal, the application software and platform
software that are offered as services (PaaS and SaaS) to consumers, virtualization software,
cloud infrastructure management software, and so on. The provider also enables consumers
to deploy their platform software and business applications on the compute systems. The
slide provides a list and a brief description of the software deployed on compute systems in a
cloud infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 95
A compute system typically comprises the following key physical hardware components
assembled inside an enclosure:
Processor: A processor, also known as a Central Processing Unit (CPU), is an integrated
circuit (IC) that executes the instructions of a software program by performing fundamental
arithmetical, logical, and input/output operations. Common processor architectures are x86
(32-bit) and x64 (64-bit). Modern processors have multiple cores (independent processing
units), each capable of functioning as an individual processor.
Random-Access Memory (RAM): The RAM or main memory is a volatile data storage device
internal to a compute system. The processor can execute only those software programs that
are loaded into the RAM and can only read data from and write data to the RAM.
Read-Only Memory (ROM): A ROM is a type of semiconductor memory that contains the
boot firmware (that enables a compute system to start), power management firmware, and
other device-specific firmware.
Motherboard: A motherboard is a printed circuit board (PCB) to which all compute
components connect. It has sockets to hold components such as the microprocessor chip,
RAM, and ROM. It also has network ports, I/O ports to connect devices such as keyboard,
mouse, and printers, and essential circuitry to carry out computing operations. A
motherboard may additionally have integrated components, such as a graphics processing
unit (GPU), a network interface card (NIC), and adapters to connect to external storage
devices.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 96
When building a cloud infrastructure, care should be taken in selecting the type of compute
system that meets the requirements for delivering the cloud services. The compute systems
used in data centers are typically classified into three categories:
• Tower compute system
• Rack-mounted compute system
• Blade compute system

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 98
A tower compute system, also known as a tower server, is a compute system built in an
upright enclosure called a “tower”, which is similar to a desktop cabinet. Tower servers have
a robust build, and have integrated power supply and cooling. They typically have individual
monitors, keyboards, and mice. Tower servers occupy significant floor space and require
complex cabling when deployed in a data center. They are also bulky and a group of tower
servers generates considerable noise from their cooling units. Tower servers are typically
used in smaller environments. Deploying a large number of tower servers in large
environments may involve substantial expenditure.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 99
A rack-mounted compute system, also known as a rack server or an industrial server, is a
compute system designed to be fixed on a frame called a “rack”. A rack is a standardized
enclosure containing multiple mounting slots called “bays”, each of which holds a server in
place with the help of screws. A single rack contains multiple servers stacked vertically in
bays, thereby simplifying network cabling, consolidating network equipment, and reducing
floor space use. Each rack server has its own power supply and cooling unit. A “rack unit”
(denoted by U or RU) is a unit of measure of the height of a server designed to be mounted
on a rack. One rack unit is 1.75 inches. A rack server is typically 19 inches (482.6 mm) in
width and 1.75 inches (44.45 mm) in height. This is called a 1U rack server. Other common
sizes of rack servers are 2U and 4U. Some common rack cabinet sizes are 27U, 37U, and
42U.
Typically, a console with a video screen, keyboard, and mouse is mounted on a rack to
enable administrators to manage the servers in the rack. A KVM (keyboard, video, and
mouse) switch connects the servers in the rack to the console and enables the servers to be
controlled from the console. An administrator can switch between servers using keyboard
commands, mouse commands, or touchscreen selection. Using a KVM switch eliminates the
need for a dedicated keyboard, monitor, and mouse for each server and saves space and
reduces cable clutter. Some concerns with rack servers are that they are cumbersome to
work with and that they generate a lot of heat, which requires more cooling and in turn
increases power costs.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 100
A blade compute system, also known as a blade server, is an electronic circuit board
containing only core processing components, such as processor(s), memory, integrated
network controllers, storage drive, and essential I/O cards and ports. Each blade server is a
self-contained compute system and is typically dedicated to a single application. A blade
server is housed in a slot inside a blade enclosure (or chassis), which holds multiple blades
and provides integrated power supply, cooling, networking, and management functions. The
blade enclosure enables interconnection of the blades through a high speed bus and also
provides connectivity to external storage systems.
The modular design of blade servers makes them smaller, which minimizes floor space
requirements, increases compute system density and scalability, and provides better energy
efficiency as compared to tower and rack servers. It also reduces the complexity of the
compute infrastructure and simplifies compute infrastructure management. It provides
these benefits without compromising on any capability that a non-blade compute system
provides. Some concerns with blade servers include the high cost of a blade system (blade
servers and chassis), and the proprietary architecture of most blade systems due to which a
blade server can typically be plugged only into a chassis from the same vendor.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 101
This lesson described the key software that are deployed on a compute system in a cloud
environment. This lesson also covered the key components of a compute system, such as
the processor, RAM, ROM, motherboard, and the chipset. Finally, this lesson described the
three common types of physical compute systems – tower, rack-mounted, and blade – that
are used in building a cloud infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 102
This lesson describes the common types of persistent storage devices. This lesson also
describes Redundant Array of Independent Disks (RAID) and its use in data protection and
storage performance improvement. Further, this lesson describes the types of storage
system architectures, namely block-based, file-based, object-based, and unified storage
systems.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 103
Data created by individuals, businesses, and applications needs to be persistently stored so
that it can be retrieved when required for further processing or analysis. A storage system is
the repository for saving and retrieving digital data and is integral to any cloud
infrastructure. A storage system has devices that enable persistent storage and retrieval of
data, which are called storage devices, or simply, storage. Apart from providing storage
along with compute for use in processing, a provider may also offer Storage as a Service,
which enables consumers to store their data on the provider’s storage systems in the cloud.
This enables them to leverage cloud resources for the purpose of data backup and long-
term data retention.
A cloud storage infrastructure is typically created by logically aggregating and pooling the
storage resources from one or more data centers to provide virtual storage resources. Cloud
storage provides massive scalability and rapid elasticity of storage resources. A cloud storage
infrastructure is typically shared by multiple tenants or consumers which improves the
utilization of storage resources.
Storage system architecture is a critical design consideration for building cloud
infrastructure. A cloud provider must choose the appropriate storage, and ensure adequate
capacity to maintain the overall performance of the environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 104
A magnetic disk is a circular storage medium made of non-magnetic material (typically an
alloy) and coated with a ferromagnetic material. Data is stored on both surfaces (top and
bottom) of a magnetic disk by polarizing a portion of the disk surface. A disk drive is a
secondary storage device that comprises multiple rotating magnetic disks, called platters,
stacked vertically. Data is read from and written to the platters by read/write heads mounted
on rapidly moving arms. Disk drives are currently the most popular storage medium for storing and
accessing data for performance-intensive applications. Disks support rapid access to random
data locations and data can be written or retrieved quickly for a number of simultaneous
users or applications. Disk drives use pre-defined protocols, such as Advanced Technology
Attachment (ATA), Serial ATA (SATA), Small Computer System Interface (SCSI), Serial Attached
SCSI (SAS), and Fibre Channel (FC). These protocols reside on the disk interface controllers
that are typically integrated with the disk drives. Each protocol has its unique performance,
cost, and capacity characteristics.
A solid-state drive (SSD) uses semiconductor-based memory, such as NAND and NOR chips,
to store and retrieve data. SSDs, also known as “flash drives”, deliver the ultra-high
performance required by performance-sensitive applications. These devices, unlike
conventional mechanical disk drives, contain no moving parts and therefore do not exhibit
the latencies associated with read/write head movement and disk rotation. Compared to
other available storage devices, SSDs deliver a relatively high number of input/output
operations per second (IOPS) with very low response times. They also consume less power
and typically have a longer lifetime as compared to mechanical drives. However, flash drives
do have the highest cost per gigabyte ($/GB) ratio.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 105
RAID (Redundant Array of Independent Disks) is a storage technology in which multiple disk
drives are combined into a logical unit called a RAID group and data is written in blocks
across the disks in the RAID group. RAID protects against data loss when a drive fails,
through the use of redundant drives and parity. RAID also helps in improving storage system
performance as read and write operations are served simultaneously from multiple disk
drives. For example, if a RAID group has four disk drives, data is written across all four of
them simultaneously, which can provide up to four times better write performance compared
to using a single drive. Similarly, during a read operation, data is retrieved simultaneously
from each drive.
RAID is typically implemented by using a specialized hardware controller present either on
the host or on the array. The key functions of a RAID controller are management and control
of drive aggregations, translation of I/O requests between logical and physical drives, and
data regeneration in the event of drive failures.
The three different RAID techniques that form the basis for defining various RAID levels are
striping, mirroring, and parity. These techniques determine the data availability and
performance of a RAID group as well as the relative cost of deploying the storage solution. A
cloud provider must select the appropriate RAID levels to meet the requirements of cloud
service delivery.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 107
Striping is a technique to spread data across multiple drives in order to use the drives in
parallel and increase performance as compared to the use of a single drive. Each drive in a
RAID group has a predefined number of contiguously addressable blocks (the smallest
individually addressable unit of storage) called a “strip”. A set of aligned strips that span
across all the drives within the RAID group is called a “stripe”. All strips in a stripe have the
same number of blocks. Although striped RAID provides improved read-write performance,
it does not provide any data protection in case of disk failure.
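A minimal sketch of how striping maps a logical block to a particular drive and an offset within that drive is shown below. The strip size and drive count are assumed values for illustration; real RAID controllers implement this mapping in hardware or firmware.

```python
# Illustrative mapping of a logical block address onto a striped RAID group.
# Strip size and number of drives are assumed values.

STRIP_SIZE_BLOCKS = 128   # blocks per strip
NUM_DRIVES = 4            # drives in the RAID group

def locate_block(logical_block):
    """Return (drive index, block offset within that drive) for a logical block."""
    strip_number = logical_block // STRIP_SIZE_BLOCKS
    drive = strip_number % NUM_DRIVES                   # strips rotate across drives
    strip_on_drive = strip_number // NUM_DRIVES         # which strip of that drive
    offset = strip_on_drive * STRIP_SIZE_BLOCKS + logical_block % STRIP_SIZE_BLOCKS
    return drive, offset

print(locate_block(0))     # (0, 0)  -> first strip, drive 0
print(locate_block(130))   # (1, 2)  -> second strip lands on drive 1
```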

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 108
Mirroring is a technique in which the same data is stored simultaneously on two different
drives, resulting in two copies of the data. This is called a “mirrored pair”. Even if one drive
fails, the data is still intact on the surviving drive and the RAID controller continues to
service data requests using the surviving drive of the mirrored pair. When the failed disk is
replaced with a new disk, the controller copies the data from the surviving disk of the
mirrored pair to the new disk. This activity is transparent to the host. In addition to
providing data redundancy, mirroring enables fast recovery from disk failure. Since mirroring
involves duplication of data, the amount of storage capacity needed is twice the amount of
data being stored. This increases costs because of which mirroring is typically preferred for
mission-critical applications that cannot afford the risk of any data loss. Mirroring improves
read performance because read requests can be serviced by both disks. However, compared
to a single disk and striping, write performance is slightly lower in mirroring because each
write request manifests as two writes on the disk drives.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 109
Parity is a value derived by performing a mathematical operation on individual strips of data
and stored on a portion of a RAID group. It enables the recreation of missing data in case of
a drive failure. Parity is a redundancy technique that ensures data protection without
maintaining a full set of duplicate data. The RAID controller calculates the parity using
techniques such as “bitwise exclusive or” (XOR). Parity information can be stored on
separate, dedicated disk drives or distributed across the drives in a RAID group. Compared
to mirroring, parity implementation considerably reduces the cost associated with data
protection. However, a limitation of parity implementation is that parity is recalculated
every time there is a change in data, which may affect the performance of the RAID array.
In the figure on the slide, the first four disks, labeled D1 to D4, contain data. The data
elements are 4, 6, 1, and 7. The fifth disk, labeled P, stores the parity information i.e. 18,
which is the sum of the data elements. If one of the drives fails, the missing value can be
calculated by subtracting the sum of the remaining elements from the parity value.
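The slide's example uses an arithmetic sum to keep the idea simple; actual RAID controllers compute parity with bitwise XOR, which supports reconstruction in the same way. The sketch below illustrates this with the same arbitrary data values.

```python
# Parity with bitwise XOR: the parity strip is the XOR of the data strips,
# and any single missing strip can be recreated by XOR-ing the rest.
from functools import reduce

data_strips = [4, 6, 1, 7]                        # arbitrary example values
parity = reduce(lambda a, b: a ^ b, data_strips)  # stored on the parity drive

# Simulate failure of the drive holding the value 6.
surviving = [4, 1, 7]
recovered = reduce(lambda a, b: a ^ b, surviving + [parity])
print(parity, recovered)   # recovered == 6
```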

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 110
RAID levels are implementations of the striping, mirroring, and parity techniques. Some
RAID levels use a single technique, while others use a combination of the techniques. The
commonly used RAID levels are RAID 0 – that uses striping, RAID 1 – that uses mirroring,
RAID 1+0 – which is a combination of RAID 1 and RAID 0, and RAID 3, 5, and 6 – that use a
combination of striping and parity.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 111
Data can be accessed from a compute system (or a compute cluster) through block-level,
file-level or object-level schemes. External storage systems can be connected to the
compute system directly or over a network. An application on the compute system stores
and accesses data using the underlying infrastructure comprising an OS, a file system,
network connectivity and storage. In general, an application requests data by specifying the
file name and the location. The file system maps the file attributes to the logical block
address (LBA) of the data and sends it to the storage system. The LBA simplifies addressing
by using a linear address to access the block of data. The storage system converts the LBA to
a physical address called the cylinder-head-sector (CHS) address and fetches the data.
In block-level access, a storage volume (a logical unit of storage composed of multiple
blocks, typically created from a RAID set) is created and assigned to the compute system to
house created file systems. In this case, an application data request is sent to the file system
and converted into a block-level (logical block address) request. This block level request is
sent over the network to the storage system. The storage system then converts the logical
block address to a CHS address and fetches the data in block-sized units.
In file-level access, the file system is created on a separate file server, which is connected to
storage. A file-level request from the application is sent over the network to the file server
hosting the file system. The file system then converts the file-level request into block-level
addressing and sends the request to the storage to access the data.
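The mapping between a linear LBA and a CHS address discussed above follows a widely used formula. The sketch below assumes a particular drive geometry (heads per cylinder and sectors per track are illustrative values); modern drives handle this translation internally.

```python
# Conversion between a logical block address (LBA) and a cylinder-head-sector
# (CHS) address. Geometry values are assumed for illustration.

HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK = 63

def lba_to_chs(lba):
    cylinder = lba // (HEADS_PER_CYLINDER * SECTORS_PER_TRACK)
    head = (lba // SECTORS_PER_TRACK) % HEADS_PER_CYLINDER
    sector = (lba % SECTORS_PER_TRACK) + 1          # sectors are numbered from 1
    return cylinder, head, sector

def chs_to_lba(cylinder, head, sector):
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)

chs = lba_to_chs(2048)
print(chs, chs_to_lba(*chs))   # round-trips back to 2048
```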
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 112
Storage system architectures are based on the data access methods. Common variants are
block-based, file-based, object-based, and unified storage systems. A unified storage system
architecture uses all three data access methods. A cloud provider may deploy one or
more types of these storage systems to meet the requirements of different applications.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 114
A block-based storage system enables the creation and assignment of storage volumes to
compute systems. The compute OS (or hypervisor) discovers these storage volumes as local
drives. These storage volumes can then be formatted with a file system, for example NTFS in
a Windows environment, and used by applications.
A block-based storage system typically comprises four key components:
• Front-end Controller(s)
• Cache Memory
• Back-end Controller(s)
• Physical disks
The front-end controller provides the interface between the storage system and the
compute systems. Typically, there are redundant controllers in the front-end for high
availability, and each controller contains multiple ports. Each front-end controller has
processing logic that executes the appropriate transport protocol, such as Fibre Channel,
iSCSI, or FCoE (discussed later in this module) for storage connections. Front-end controllers
route data to and from a cache memory via an internal data bus.
The cache is a semiconductor memory where data is placed temporarily to reduce the time
required to service I/O requests from the compute system. The cache improves storage
system performance by isolating compute systems from the mechanical delays associated
with disk drives. Accessing data from the cache typically takes less than a millisecond.
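The benefit of cache can be approximated with a simple weighted-average calculation. The hit ratio and latency figures below are assumed, illustrative values, not measurements of any particular system.

```python
# Effective I/O service time as a weighted average of cache hits and disk reads.
# All figures are assumed, illustrative values.
cache_hit_ratio = 0.90          # fraction of reads served from cache
cache_time_ms = 0.5             # sub-millisecond cache access
disk_time_ms = 8.0              # typical mechanical disk access

effective_ms = cache_hit_ratio * cache_time_ms + (1 - cache_hit_ratio) * disk_time_ms
print(f"Effective service time: {effective_ms:.2f} ms")   # 0.9*0.5 + 0.1*8.0 = 1.25 ms
```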
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 115
A file-based storage system, also known as Network-Attached Storage (NAS), is a dedicated,
high-performance file server that either has integrated storage or is connected to external
storage. NAS enables clients to share files over an IP network. NAS supports NFS and CIFS
protocols to give both UNIX and Windows clients the ability to share the same files using
appropriate access and locking mechanisms. NAS systems have integrated hardware and
software components, including a processor, memory, NICs, ports to connect and manage
physical disk resources, an OS optimized for file serving, and file sharing protocols. A NAS
system consolidates distributed data into a large, centralized data pool accessible to, and
shared by, heterogeneous clients and application servers across the network. Consolidating
data from numerous and dispersed general purpose servers onto NAS results in more
efficient management and improved storage utilization. Consolidation also offers lower
operating and maintenance costs.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 117
There are two common NAS deployment options: traditional NAS systems (scale-up NAS)
and scale-out NAS systems.
A traditional NAS solution provides the capability to scale the capacity and performance of a
single NAS system. Scaling up a NAS system involves upgrading or adding NAS components
and storage to the NAS system. These NAS systems have a fixed capacity ceiling, and
performance is impacted as the capacity limit is approached.
Scale-out NAS is designed to address the rapidly growing area of unstructured data (data
that does not fit in tables and rows), especially Big Data (data sets whose size or scale break
traditional tools). Scale-out NAS enables the creation of a clustered NAS system by pooling
multiple processing and storage nodes together. The cluster works as a single NAS system
and is managed centrally. The capacity of the cluster can be increased by simply adding
nodes to it. A node contains common server components and may or may not have
disks. As each node is added to the cluster, it increases the aggregated disk, cache,
processor, and network capacity of the cluster as a whole. Nodes can be non-disruptively
added to the cluster when more performance and capacity is needed. Scale-out NAS creates
a single file system that runs on all nodes in the cluster. As nodes are added, the file system
grows dynamically and data is evenly distributed (or redistributed) to all nodes in the cluster.
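The sketch below is a minimal model of how adding nodes to a scale-out NAS cluster grows its aggregate disk, cache, and processing capacity. The class, node specifications, and numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    disk_tb: float      # usable disk capacity contributed by the node
    cache_gb: float     # cache memory contributed by the node
    cores: int          # processor cores contributed by the node

class ScaleOutNasCluster:
    """Toy model: cluster capacity is the sum of its nodes' resources."""
    def __init__(self) -> None:
        self.nodes: list[Node] = []

    def add_node(self, node: Node) -> None:
        self.nodes.append(node)   # non-disruptive in a real scale-out system

    @property
    def totals(self) -> dict:
        return {
            "disk_tb": sum(n.disk_tb for n in self.nodes),
            "cache_gb": sum(n.cache_gb for n in self.nodes),
            "cores": sum(n.cores for n in self.nodes),
        }

cluster = ScaleOutNasCluster()
for _ in range(3):
    cluster.add_node(Node(disk_tb=36.0, cache_gb=64.0, cores=8))
print(cluster.totals)   # {'disk_tb': 108.0, 'cache_gb': 192.0, 'cores': 24}
```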

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 118
Object-based storage is a way to store file data in the form of objects based on the content
and other attributes of the data rather than the name and location of the file. An object
contains user data, related metadata (size, date, ownership, etc.), and user defined
attributes of data (retention, access pattern, and other business-relevant attributes). The
additional metadata or attributes enable optimized search, retention and deletion of
objects. For example, when an MRI scan of a patient is stored as a file in a NAS system, the
metadata is basic and may include information such as file name, date of creation, owner,
and file type. When stored as an object, the metadata component of the object may include
additional information such as patient name, ID, attending physician’s name, and so on,
apart from the basic metadata.
Each object stored in the object-based storage system is identified by a unique identifier
called the object ID. The object ID allows easy access to objects without having to specify
the storage location. The object ID is generated using specialized algorithms (such as a hash
function) on the data and guarantees that every object is uniquely identified. Any changes in
the object, like user-based edits to the file, result in a new object ID. This makes object-
based storage a preferred option for long-term data archiving to meet regulatory or
compliance requirements. The object-based storage system uses a flat, non-hierarchical
address space to store data, providing the flexibility to scale massively. Cloud service
providers leverage object-based storage systems to offer Storage as a Service because of its
inherent security, scalability, and automated data management capabilities. Object-based
storage systems support web service access via REST and SOAP.
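A minimal sketch of content-derived object IDs is shown below, assuming a SHA-256 hash of the object's data. Actual object storage platforms differ in how they compute and format IDs, so this is only illustrative.

```python
import hashlib

def object_id(data: bytes) -> str:
    """Derive an object ID from the object's content (illustrative only)."""
    return hashlib.sha256(data).hexdigest()

original = b"MRI scan, patient 12345"
edited   = b"MRI scan, patient 12345 (updated)"

# Any change to the content yields a different object ID,
# so the original object remains addressable and unmodified.
assert object_id(original) != object_id(edited)
print(object_id(original)[:16], "...")
```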
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 119
The object-based storage system has three key components: nodes, internal (private)
network, and storage. The object-based storage system is composed of one or more nodes.
In this context, a node is a server that runs the object-based storage operating environment
and provides services to store, retrieve, and manage data in the system. The object-based
storage system node has two key services: metadata service and storage service. The
metadata service is responsible for generating the object ID from the contents of a file. It
also maintains the mapping between the object IDs and the file system namespace. The
storage service manages a set of drives on which the data is stored. The nodes connect to
the storage via an internal network. The internal network provides both node-to-node
connectivity and node-to-storage connectivity. The application server accesses the object-
based storage node to store and retrieve data over an external network. In some
implementations, the metadata service might reside on the application server or on a
separate server.
Object-based storage provides the capability to automatically detect and repair corrupted
objects, and to alert the administrator of any potential problem. It also provides on-demand
reporting and event notification. Some object-based storage systems support storage
optimization techniques such as single instance storage, where only one instance of an
object is stored, thereby optimizing usable capacity.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 120
Unified storage or multiprotocol storage has emerged as a solution that consolidates block,
file, and object-based access within one storage platform. It supports multiple protocols
such as CIFS, NFS, iSCSI, FC, FCoE, REST, and SOAP for data access. Such a unified storage
system is managed using a single interface. A unified storage system consists of the
following key components: storage controller, NAS head, OSD node, and storage.
The storage controller provides block-level access to compute systems through various
protocols. It contains front-end ports for direct block access. The storage controller is also
responsible for managing the back-end storage pool in the storage system. The controller
configures storage volumes and presents them to NAS heads and OSD nodes, as well as to
the compute systems.
A NAS head is a dedicated file server that provides file access to NAS clients. The NAS head
connects to the storage via the storage controller. The system usually has two or more NAS
heads for redundancy. The NAS head configures the file systems on assigned volumes,
creates NFS, CIFS, or mixed shares, and exports the shares to the NAS clients.
The OSD node also accesses the storage through the storage controller. The volumes
assigned to the OSD node appear as physical disks. These disks are configured by the OSD
nodes, enabling them to store object data.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 121
This lesson described the common types of persistent data storage devices. This lesson also
described the different RAID techniques (striping, mirroring, and parity) used for data
protection and for improving storage performance. Finally, this lesson described the
different data access methods and the storage system architectures based on them,
including block-based, file-based, object-based, and unified storage systems.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 122
This lesson covers the types of network communication and describes compute-to-compute
communication. This lesson also covers compute-to-storage communication via a storage
area network (SAN) and the classification of SAN. Further, this lesson describes inter-cloud
communication.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 123
A network establishes communication paths between the devices in an IT infrastructure.
Devices that are networked together are typically called “nodes”. A network enables
information exchange and resource sharing among a large number of nodes spread across
geographic regions and over long distances. A network may also be connected to other
networks to enable data transfer between nodes.
Cloud providers typically leverage different types of networks supporting different network
protocols and transporting different classes of network traffic. As established in the
discussion of fundamental cloud characteristics, cloud consumers require reliable and
secure network connectivity to access cloud services. A provider connects the cloud
infrastructure to a network enabling clients (consumers) to connect to the cloud over the
network and use cloud services. For example, in an on-premise private cloud the clients
typically connect to the cloud infrastructure over an internal network, such as a LAN. In case
of a public cloud, the cloud infrastructure connects to an external network, typically the
Internet, over which consumers access cloud services.
Cloud service providers may also use IT resources at one or more data centers to provide
cloud services. If multiple data centers are deployed, the IT resources from these data
centers may be logically aggregated by connecting them over a wide area network (WAN).
This enables both migration of cloud services across data centers and provisioning cloud
services using resources from multiple data centers. Also, multiple clouds may be inter-
connected over a WAN to enable workloads to be moved or distributed across clouds. This
scenario was covered in Module 1 as part of the discussion on cloud bursting in a hybrid
cloud environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 124
Networks in a cloud environment may be classified into various types based on attributes
such as communication protocol, topology, transport medium, and so on. Generally, network
communication may be categorized into: compute-to-compute communication, compute-
to-storage communication, and inter-cloud communication.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 125
Compute-to-compute communication typically uses protocols based on the Internet
Protocol (IP). Each physical compute system (running an OS or a hypervisor) is connected to
the network through one or more physical network cards. Physical switches and routers are
the commonly-used interconnecting devices. A switch enables different compute systems in
the network to communicate with each other. A router or an OSI Layer-3 switching device
allows different networks to communicate with each other. Commonly-used network cables
are copper cables and optical fiber cables. The figure on the slide shows a network (Local
Area Network - LAN or Wide Area Network - WAN) that provides interconnections among
the physical compute systems. The cloud provider has to ensure that appropriate switches
and routers, with adequate bandwidth and ports, are in place to ensure the required
network performance.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 126
A network of compute systems and storage systems is called a storage area network (SAN).
A SAN enables the compute systems to access and share storage systems. Sharing improves
utilization of the storage systems. Using a SAN facilitates centralizing storage management,
which in turn simplifies and potentially standardizes the management effort.
SANs are classified based on protocols they support. Common SAN deployments are Fibre
Channel SAN (FC SAN), Internet Protocol SAN (IP SAN), and Fibre Channel over Ethernet SAN
(FCoE SAN).

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 127
An FC SAN is a high speed, dedicated network of compute systems and shared storage
systems that uses Fibre Channel (FC) protocol to transport data, commands, and status
information between the compute and storage systems. FC protocol primarily implements
the Small Computer System Interface (SCSI) command set over FC, although it also supports
other protocols such as Asynchronous Transfer Mode (ATM), Fibre Connection (FICON), and
IP. SCSI over FC overcomes the distance and accessibility limitations associated with
traditional, direct-attach SCSI protocol systems. FC protocol provides block-level access to
the storage systems. It also provides a serial data transfer interface that operates over both
copper and optical fiber cables. Technical committee T11, a committee within International
Committee for Information Technology Standards (INCITS), is responsible for FC interface
standards. The latest FC implementation, 16 Gigabit Fibre Channel (16 GFC), offers data
transfer speeds of up to 16 Gbps. The FC architecture is highly scalable, and theoretically a
single FC SAN can accommodate approximately 15 million nodes.
Note: The term “Fibre” refers to the protocol, whereas the term “fiber” refers to the medium.
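The approximate 15 million figure follows from the 24-bit FC address structure. The sketch below reproduces the commonly cited calculation (239 usable domains, 256 areas per domain, 256 ports per area); treat it as an approximation rather than a limit of any particular product.

```python
# 24-bit FC addressing allows 2**24 = 16,777,216 raw addresses.
raw_addresses = 2 ** 24

# Commonly cited usable figure: 239 domains x 256 areas x 256 ports per area,
# after reserving addresses for fabric services (approximation for illustration).
usable_nodes = 239 * 256 * 256

print(f"Raw 24-bit addresses : {raw_addresses:,}")   # 16,777,216
print(f"Approx. usable nodes : {usable_nodes:,}")    # 15,663,104 (~15 million)
```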

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 128
The key FC SAN components include network adapters, cables and connectors, and
interconnecting devices.
Each node requires one or more network adapters to provide a physical interface for
communicating with other nodes. Examples of network adapters are FC host bus adapters
(HBAs), and storage system front-end adapters. An FC HBA has SCSI-to-FC processing
capability. It encapsulates OS (or hypervisor) storage I/Os (usually SCSI I/O) into FC frames
before sending the frames to FC storage systems over an FC SAN.
FC SAN predominantly uses optical fiber to provide physical connectivity between nodes.
Copper cables might be used for shorter distances. A connector is attached at the end of a
cable to enable swift connection and disconnection of the cable to and from a port.
FC switches and directors are the interconnecting devices commonly used in an FC SAN to
forward data from one physical switch port to another. Directors are high-end switches with
a higher port count and better fault-tolerance capabilities than smaller switches (also known
as “departmental” switches). Switches are available with a fixed port count or with a
modular design. In a modular switch, the port count is increased by installing additional port
cards into empty slots. Modular switches enable online installation of port cards. The
architecture of a director is usually modular, and its port count is increased by inserting line
cards or blades into the director’s chassis.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 129
A fabric is created with an FC switch (or an FC director) or a network of switches that enables
all nodes to connect to each other and communicate. Each switch in a fabric contains a
unique domain identifier (ID), which is part of the fabric’s addressing scheme. Each node
port in a fabric has a unique 24-bit FC address for communication. Further, each network
adapter and node port (network adapter port) in the FC environment is assigned a 64-bit
unique identifier called the World Wide Name (WWN). Unlike an FC address, which is
assigned dynamically, a WWN is a static name. WWNs are burned into the hardware or
assigned through software. There are two types of WWNs: the World Wide Node Name (WWNN),
which identifies an FC network adapter, and the World Wide Port Name (WWPN), which
identifies an FC adapter port (node port). For example, a dual-port FC HBA
has one WWNN and two WWPNs.
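To show the difference in scope, the sketch below formats 64-bit values as WWN-style strings for a hypothetical dual-port HBA. The specific values are made up and do not follow any vendor's assignment scheme.

```python
def format_wwn(value: int) -> str:
    """Render a 64-bit identifier as the familiar colon-separated WWN string."""
    raw = f"{value:016x}"                      # 16 hex digits = 64 bits
    return ":".join(raw[i:i + 2] for i in range(0, 16, 2))

# Hypothetical dual-port FC HBA: one node name, plus one port name per port.
wwnn  = format_wwn(0x20000025B500A001)
wwpns = [format_wwn(0x21000025B500A001), format_wwn(0x21010025B500A001)]

print("WWNN :", wwnn)
print("WWPNs:", ", ".join(wwpns))
```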

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 130
A port in a switched fabric can be one of the following types:
• N_Port is an end-point in the fabric. This port is also known as the node port. Typically, it
is a compute system port (FC HBA) or a storage system port connected to a switch in a
fabric.
• E_Port is a switch port that forms a connection between two FC switches. This port is
also known as an expansion port. The E_Port on an FC switch connects to the E_Port of
another FC switch in the fabric through inter-switch links (ISLs).
• F_Port is a port on a switch that connects to an N_Port. It is also known as a fabric port.
• G_Port is a generic port on some vendors’ switches. It can operate as an E_Port or an
F_Port and determines its functionality automatically during initialization.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 131
IP SAN uses the Internet Protocol (IP) for the transport of storage traffic. It transports block
I/O over an IP-based network. IP is a mature technology, and using IP SAN as a storage
networking option provides several advantages. Cloud providers may have an existing IP-
based network infrastructure, which could be used for storage networking. Use of an
existing IP-based network therefore may be a more economical option than investing in
building a new FC SAN infrastructure. In addition, many robust and mature security options
are available for IP networks. Many long-distance, disaster recovery (DR) solutions already
leverage IP-based networks. Therefore, with IP SAN, providers can extend the geographical
reach of their storage infrastructure.
Two primary protocols that leverage IP as the transport mechanism for block-level data
transmission are Internet SCSI (iSCSI) and Fibre Channel over IP (FCIP).

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 132
iSCSI encapsulates SCSI commands and data into IP packets. These IP packets are
transported over an IP-based network. iSCSI network components include:
• iSCSI initiators such as a software iSCSI adapter and an iSCSI HBA
• iSCSI targets such as a storage system with an iSCSI port or an iSCSI gateway
• IP-based network
An iSCSI initiator sends commands and associated data to a target and the target returns
data and responses to the initiator. The software iSCSI adapter is an OS (or hypervisor)
kernel-resident software that uses an existing NIC of the compute system to emulate an
iSCSI initiator. An iSCSI HBA has a built-in iSCSI initiator and is capable of providing
performance benefits over software iSCSI adapters by offloading the entire iSCSI and TCP/IP
processing from the processor of the compute system. If an iSCSI-capable storage system is
deployed, then an iSCSI initiator can directly communicate with the storage system over an
IP-based network. This type of iSCSI implementation is called native iSCSI. Otherwise, in an
iSCSI implementation that uses a storage system with only FC ports, an iSCSI gateway is
used. This gateway device performs the translation of IP packets to FC frames and vice
versa, thereby bridging the connectivity between the IP and FC environments. This type of
iSCSI implementation is called bridged iSCSI. The figure on the slide shows both native and
bridged iSCSI implementations.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 133
FCIP is an encapsulation of FC frames into IP packets. FCIP is a tunneling protocol that
enables distributed FC SAN islands to be interconnected over the existing IP-based
networks. This enables transporting FC data between disparate FC SANs that may be
separated by a long distance. In an FCIP environment, an FCIP entity such as an FCIP gateway
is deployed at either end of the tunnel between two FC SAN islands, as shown in the figure
on the slide. An FCIP gateway encapsulates FC frames into IP packets and transfers them to
the remote gateway through the tunnel. The remote FCIP gateway de-encapsulates the FC
frames from the IP packets and sends the frames to the remote FC SAN. FCIP is extensively
used in disaster recovery implementations in which data is replicated to storage located at a
remote site.
An FCIP implementation is capable of merging interconnected fabrics into a single fabric. In
a merged fabric, the fabric service related traffic travels between interconnected FC SANs
through the FCIP tunnel. However, only a small subset of nodes at either end of the FCIP
tunnel requires connectivity across the tunnel. Thus, the majority of FCIP implementations
today use some switch-specific features to prevent the fabrics from merging and also
restrict which nodes are allowed to communicate across the fabrics.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 134
FCoE SAN is a converged enhanced Ethernet (CEE) network that is capable of transporting FC
data along with regular Ethernet traffic over high speed (such as 10 Gbps or higher) Ethernet
links. It uses the FCoE protocol that encapsulates FC frames into Ethernet frames. FCoE is
based on an enhanced Ethernet standard that supports Data Center Bridging (DCB)
functionalities. DCB ensures lossless transmission of FC traffic over Ethernet.
FCoE SAN provides the flexibility to deploy the same network components for transferring
both compute-to-compute traffic and FC storage traffic. This helps in reducing the
complexity of managing multiple discrete network infrastructures. FCoE SAN uses multi-
function network adapters and switches. Therefore, FCoE reduces the number of adapters,
cables, and switches, along with power and space consumption required in a data center.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 135
An FCoE SAN consists of converged network adapters (CNAs), FCoE switches, cables, and
FCoE storage ports.
A CNA is a physical adapter that provides the functionality of both NIC and FC HBA in a
single device. It consolidates both FC traffic and regular Ethernet traffic on a common
Ethernet infrastructure. CNAs connect compute systems to FCoE switches. They are
responsible for encapsulating FC traffic onto Ethernet frames and forwarding them to FCoE
switches over CEE links.
Instead of CNA, a software FCoE adapter may also be used. A software FCoE adapter is
software on the compute system that performs FCoE processing. FCoE processing consumes
compute system processor cycles. With software FCoE adapters, the compute system
implements FC protocol in software that handles SCSI to FC processing. The software FCoE
adapter performs FC to Ethernet encapsulation. Both FCoE traffic (Ethernet traffic that
carries FC data) and regular Ethernet traffic are transferred through supported NICs on the
compute system.
The figure on the slide shows an FCoE implementation that consolidates both FC SAN traffic
and LAN (Ethernet) traffic on a common Ethernet infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 136
An FCoE switch has the functionalities of both an Ethernet switch and an FC switch. It has a
Fibre Channel Forwarder (FCF), an Ethernet Bridge, and a set of ports that can be used for
FC, Ethernet or FCoE connectivity. The function of the FCF is to encapsulate the FC frames
received from an existing FC SAN into Ethernet frames, and also to de-encapsulate the
Ethernet frames received from the Ethernet Bridge into FC frames.
Some vendors offer FCoE ports in their storage systems. These storage systems connect
directly to FCoE switches. The FCoE switches form FCoE fabrics between compute and
storage systems and provide end-to-end FCoE support. The figure on the slide shows an
FCoE implementation with an FCoE-capable storage system.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 137
The cloud tenets of rapid elasticity, resource pooling, and broad network access create a sense of
availability of limitless resources in a cloud infrastructure that can be accessed from any
location over a network. However, a single cloud does not have an infinite number of
resources. A cloud that does not have adequate resources to satisfy service requests from
clients may be able to fulfill the requests if it can access resources from another cloud.
For example, in a hybrid cloud scenario, a private cloud may access resources from a public
cloud during peak workload periods. There may be several combinations of inter-cloud
connectivity as depicted in the figure on the slide. Inter-cloud connectivity enables clouds to
balance workloads by accessing and using computing resources, such as processing power
and storage resources from other cloud infrastructures. The cloud provider has to ensure
network connectivity of the cloud infrastructure over a WAN to the other clouds for
resource access and workload distribution.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 138
This lesson covered the types of network communication, compute-to-compute
communication, and compute-to-storage communication over a storage area network
(SAN). This lesson also covered the classification of SAN – FC SAN, IP SAN, and FCoE SAN –
and described the components and architecture of each. Finally, this lesson covered inter-
cloud communication.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 139
The Concepts in Practice section covers EMC Symmetrix VMAX Cloud Edition, EMC VNX,
EMC Isilon, EMC Atmos, EMC XtremIO, and EMC Connectrix.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 140
EMC Symmetrix Virtual Matrix (VMAX) Cloud Edition is a self-service, enterprise-class
cloud storage delivery platform. It enables private companies and service providers building
private, hybrid, or public clouds to efficiently deliver mission-critical storage services. It
provides high performance, reliability, availability, and scalability for mission-critical
applications. VMAX Cloud Edition is a Tier 1 storage platform that is easily accessible to non-storage
experts through web-based, self-service access. It uses a building-block approach to create a
multi-tenant, “as-a-Service” delivery platform, and provides tenants with a choice of
performance levels.
The EMC VNX family is a group of products that provide a unified storage platform that
consolidates block, file, and object access into one solution. The VNX series is built for
medium-sized and enterprise-class businesses. It enables organizations to dynamically grow,
share, and manage multi-protocol file systems and multi-protocol block storage access. The
VNX operating environment enables Windows and UNIX/Linux users to share files using NFS
and CIFS. It also supports FC, iSCSI, and FCoE access.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 141
EMC Isilon is a scale-out NAS storage product family powered by the OneFS operating
environment. Isilon enables pooling multiple nodes together to construct a clustered NAS
system. OneFS is the operating environment that creates a single file system that spans
across all nodes in an Isilon cluster. EMC Isilon provides the capability to manage and store
large (petabyte-scale), high-growth data in a single system with the flexibility to meet a
broad range of performance requirements.
EMC Atmos is a cloud storage platform for enterprises and service providers to deploy
public, private, or hybrid cloud storage. It enables organizations to store, manage, and protect globally
distributed, unstructured content at scale. Atmos is a scale-out object architecture that
stores data as objects with associated metadata. It enables storage to be scaled out without
the need to rewrite applications. Some of the key cloud features of Atmos include a global
namespace, REST API-driven storage, multi-tenancy, self-service, and metering and
chargeback.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 142
EMC XtremIO is an all-flash, block-based, scale-out enterprise storage array that provides
substantial improvements to I/O performance. It is purpose-built to leverage flash media
and delivers new levels of real-world performance, administrative ease, and advanced data
services for applications. It uses a scale-out clustered design that grows capacity and
performance linearly to meet any requirement. XtremIO arrays are created from building
blocks called "X-Bricks" that are each a high-availability, high-performance, fully
active/active storage system with no single point of failure. XtremIO's powerful operating
system, XIOS, manages the XtremIO storage cluster. XIOS ensures that the system remains
balanced and always delivers the highest levels of performance with no administrator
intervention. XtremIO helps administrators become more efficient by enabling system
configuration in a few clicks, provisioning storage in seconds and monitoring the
environment with real-time metrics.
The EMC Connectrix family is a group of networked storage connectivity products. EMC offers
the following connectivity products under the Connectrix brand:
• Enterprise directors – Ideal for large enterprise connectivity. Offer high port density and
high component redundancy. Deployed in high-availability or large-scale environments
• Departmental switches – Designed to meet workgroup, department-level, and
enterprise-level requirements. Provide high availability through features such as non-
disruptive software and port upgrade, and redundant and hot-swappable components
• Multi-purpose switches – Support various protocols such as FC, iSCSI, FCIP, FCoE, and
FICON. Include FCoE switches, FCIP gateways, and iSCSI gateways. Multiprotocol
capabilities offer many benefits, including long-distance SAN extension, greater resource
sharing, and simplified management.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 143
This module described the key components of a compute system and the common types of
physical compute systems – tower, rack-mounted, and blade – that are deployed in cloud
data centers. This module also described the common types of persistent storage devices,
the different RAID techniques (striping, mirroring, and parity), and the types of storage
system architectures – block-based, file-based, object-based, and unified storage systems.
This module also covered compute-to-compute communication, compute-to-storage
communication (SAN), and SAN classification – FC SAN, IP SAN, and FCoE SAN. Finally, this
module covered inter-cloud communication.

Copyright © 2014 EMC Corporation. All rights reserved Module 3: Physical Layer 144
This module focuses on the entities of the virtual layer. The module describes virtualization
software, resource pools, and virtual resources.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
The virtual layer is deployed on the physical layer. It specifies the entities that operate at this
layer, such as virtualization software, resource pools, and virtual resources. A key function of
this layer is to abstract physical resources, such as compute, storage, and network, and
make them appear as virtual resources. Other key functions of this layer include executing
the requests generated by the control layer and forwarding requests to the physical layer
for execution. Examples of requests generated by the control layer include creating
resource pools and creating virtual resources.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covers an overview of the virtual layer, virtualization software, resource pools, and
virtual resources.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Virtualization refers to the logical abstraction of physical resources (such as compute,
network, and storage) that enables a single hardware resource to support multiple
concurrent instances of systems, or multiple hardware resources to support a single instance
of a system. This involves making physical resources appear as logical resources that are
able to transcend their physical constraints. For example, multiple disk drives can be
concatenated and presented as a single disk drive to a compute system. Similarly, a single
disk drive can be partitioned and presented as multiple disk drives to a compute system.

With virtualization, it is also possible to make a resource appear larger than it actually is.
Further, the abstraction of physical resources due to virtualization enables a multitenant
environment, which improves the utilization of the physical resources.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Virtualization enables the consolidation of IT resources, which helps organizations optimize
their infrastructure resource utilization. Improving the utilization of IT assets can reduce the costs
associated with purchasing new hardware. It also reduces the space and energy costs associated
with maintaining the resources. Moreover, fewer people are required to administer these
resources, which further lowers the cost. Virtual resources are created using software, which
enables faster deployment compared to deploying physical resources. Virtualization
increases flexibility by allowing logical resources to be created and reclaimed based on
business requirements.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
While building a cloud infrastructure, the virtual layer is deployed on the physical layer. This layer
enables two key characteristics of cloud infrastructure: resource pooling and rapid
elasticity.
The virtual layer specifies the entities that operate at this layer, such as virtualization software,
resource pools, and virtual resources. The virtual layer is built by deploying virtualization
software on compute systems, network devices, and storage devices.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Virtualization software performs the abstraction of the physical resources on which it is
deployed. The key functions of virtualization software are to pool resources and to create
virtual resources. Virtualization software is deployed on compute systems, network devices,
and storage devices.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
The software used for compute virtualization is known as the hypervisor. The hypervisor is
software that is installed on an x86-based compute system and enables multiple operating
systems to run concurrently on a physical compute system. The hypervisor, along with
hypervisor management software (control software, which is discussed in module 5), is the
fundamental component for deploying a software-defined compute environment. The
hypervisor abstracts the physical compute hardware to create multiple virtual machines,
which look and behave like physical compute systems to their operating systems. The hypervisor
provides standardized hardware resources such as processor, memory, network, and disk to all the virtual
machines.
A hypervisor has two key components: kernel and virtual machine manager (VMM). A hypervisor kernel
provides the same functionality as does the kernel of any other operating system, including process creation,
file system management, and process scheduling. It is designed and optimized to run multiple virtual machines
concurrently. A VMM abstracts hardware and appears as a physical compute system with processor,
memory, I/O devices, and other components essential for operating systems and applications to run. Each
virtual machine is assigned a VMM that gets a share of the processor, memory, I/O devices, and storage
from the physical compute system to successfully run the virtual machine.
Hypervisors can be categorized into two types: bare-metal hypervisor and hosted hypervisor. A bare-metal
hypervisor is directly installed on the hardware. It has direct access to the hardware resources of the compute
system. Therefore, it is more efficient than a hosted hypervisor. However, this type of hypervisor may have
limited device drivers built in. Therefore, hardware certified by the hypervisor vendor is usually required to run
bare-metal hypervisors. A bare-metal hypervisor is designed for enterprise data centers and cloud
infrastructure, and supports advanced capabilities such as resource management, high availability, security,
and so on. In contrast to a bare-metal hypervisor, a hosted hypervisor is installed as an application on an
operating system. In this approach, the hypervisor does not have direct access to the hardware and all
requests must pass through the operating system running on the physical compute system. Hosted
hypervisors are compatible with all the devices that are supported by the operating system on which they are
installed. Using this type of hypervisor adds overhead compared to a bare-metal hypervisor, because typically
there are many services and processes running on an operating system that are consuming compute system
resources. A hosted hypervisor is therefore most suitable for development, testing, and training purposes.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Network virtualization software is either built into the operating environment of a
network device, installed on an independent compute system (discussed in module 5), or
available as a hypervisor capability. The network virtualization software abstracts physical
network resources to create virtual resources such as virtual LANs or virtual SANs.
The network virtualization software built into the network device operating environment
has the ability to abstract the physical network. It can divide a physical network
into multiple virtual networks such as virtual LANs and virtual SANs.
The network virtualization software installed on an independent compute system is the
fundamental component for deploying a software-defined network environment. This
software provides a single control point for the entire network infrastructure, enabling
automated and policy-based network management.
Network virtualization may also be available as a hypervisor capability that emulates
network connectivity among VMs on a physical compute system. This software enables the
creation of virtual switches, which appear to the VMs as physical switches.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Storage virtualization software is either built into the operating environment of a
storage device, installed on an independent compute system (discussed in module 5), or
available as a hypervisor capability. The storage virtualization software abstracts physical
storage resources to create virtual resources such as virtual volumes or virtual arrays.
The storage virtualization software built into the array operating environment has the ability
to pool and abstract the physical storage devices and present them as logical storage.
The storage virtualization software installed on an independent compute system is the
fundamental component for deploying a software-defined storage environment. The software
has the ability to pool and abstract the existing physical storage devices and present them as an
open storage platform. With the help of control software (discussed in module 5), the
storage virtualization software can perform tasks such as virtual volume creation, apart
from creating virtual arrays. This software provides a single control point for the entire
storage infrastructure, enabling automated and policy-based management.
Storage virtualization may also be available as a hypervisor capability that enables the
creation of virtual disks, which appear to the operating systems as physical disk drives.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A resource pool is an aggregation of computing resources (processing power, memory, storage,
and network bandwidth), which provides an aggregated view of these resources to the
control layer. Virtualization software, in collaboration with the control software, pools the
resources. For example, storage virtualization software pools the capacity of multiple storage
devices so that they appear as a single large storage capacity. Similarly, by using compute
virtualization software, the processor capacity of the pooled physical compute systems can
be viewed as an aggregation of the power of all processors (in megahertz). Resources in a pool
can be added or removed dynamically. Resource pools are detailed in the next lesson.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Virtual resources are created by allocating physical resources from resource pools. These
virtual resources share the pooled physical resources. Examples of virtual resources include
virtual machines, virtual volumes, and virtual networks. Virtualization software also enables
capacity to be added to or removed from the virtual resources without any disruption to
applications or users. Virtual resources are detailed in lessons 3, 4, and 5.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covered the virtual layer, virtualization software (compute, network, and storage),
resource pools, and virtual resources.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covers resource pools, examples of resource pooling, identity pools, and the
classification of pools.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A resource pool is a logical abstraction of the aggregated computing resources, such as
processing power, memory capacity, storage, and network bandwidth, that is managed
centrally. Cloud services obtain computing resources from resource pools. Resources from
the resource pools are dynamically allocated according to consumer demand up to a limit
defined for each cloud service. The allocated resources are returned to the pool when they
are released by consumers, making them available for reallocation. The figure in the slide
shows the allocation of resources from a resource pool to service A and service B that are
assigned to consumer A and consumer B respectively.

Resource pools are designed and sized according to the service requirements. A cloud
administrator can create, remove, expand, or contract a resource pool as needed. In a cloud
infrastructure, multiple pools of the same or different resource types may be configured to
provide various cloud services. For example, two independent storage pools in a cloud
having different performance characteristics can provide resources to a high-end and a mid-
range storage service. Also, an application service, for example, can obtain processing power
from a processor pool and network bandwidth from a network bandwidth pool.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Cloud services comprising virtual machines (VMs) consume processing power and memory
capacity respectively from the processor and memory pools from which they are created.
The figure on the slide illustrates an example of pooling processing power and memory capacity,
and allocating resources to VMs that are elements of service A and service B. These cloud
services are assigned to consumer A and consumer B.
In the figure, a processor pool aggregates the processing power of three physical compute
systems running a hypervisor; likewise, a memory pool aggregates the memory capacity of
these compute systems. Therefore, the processor pool has 12000 MHz of processing power
and the memory pool possesses 18 GB of memory capacity. Each VM is allocated 1500 MHz
of processing power and 2 GB of memory capacity at the time it is created. After
allocation of resources to the VMs, the processor pool has 4500 MHz processing power and
the memory pool has 8 GB memory capacity remaining, which can be allocated to new or
existing VMs according to service demand.
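A minimal sketch of this allocation logic is shown below; the class and method names are hypothetical, and the same pattern applies to the storage and network bandwidth pool examples that follow.

```python
class ResourcePool:
    """Toy pool that hands out capacity and reclaims it on release."""
    def __init__(self, name: str, capacity: float, unit: str) -> None:
        self.name, self.capacity, self.unit = name, capacity, unit
        self.allocated = 0.0

    @property
    def available(self) -> float:
        return self.capacity - self.allocated

    def allocate(self, amount: float) -> None:
        if amount > self.available:
            raise ValueError(f"{self.name}: only {self.available} {self.unit} left")
        self.allocated += amount

    def release(self, amount: float) -> None:
        self.allocated = max(0.0, self.allocated - amount)

# Match the slide's numbers: 12000 MHz and 18 GB pooled, VMs of 1500 MHz / 2 GB each.
# Five VMs is the count implied by the remaining capacity in the example.
cpu = ResourcePool("processor", 12000, "MHz")
mem = ResourcePool("memory", 18, "GB")
for _ in range(5):
    cpu.allocate(1500)
    mem.allocate(2)

print(cpu.available, "MHz remaining")   # 4500.0 MHz remaining
print(mem.available, "GB remaining")    # 8.0 GB remaining
```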

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A storage pool in a block-based storage system comprises the aggregated physical storage
space of a set of physical drives. Storage space is allocated from the storage pool to virtual
volumes (also called logical unit numbers, or LUNs) that are created from the pool. These virtual
volumes are provisioned to consumers upon receiving their storage requests. The figure on the
slide illustrates an example where the storage space of a set of physical drives is pooled and
the required storage space is allocated to virtual volumes from the pool.
In the figure, a storage pool in a block-based storage system aggregates the storage space of
four physical drives. Combining the usable storage space of these drives, the storage pool
has 4000 GB of storage space. Three virtual volumes are provisioned from this pool, which
are elements of service A, service B, and service C. These services are assigned to three
consumers – consumer A, consumer B, and consumer C. These virtual volumes are allocated
200 GB, 400 GB, and 800 GB of storage space as per storage requirement of consumers.
After allocation of storage resources to the virtual volumes, the storage pool has 2600 GB
storage space remaining, which can be allocated to new or existing virtual volumes
according to service demand.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
The figure on the slide illustrates a more complex storage pooling example, where a higher-level
storage pool is created by aggregating the storage space of four storage pools configured
within four block-based storage systems. Storage space from the higher-level storage pool is
allocated to virtual volumes that are elements of service A, service B, and service C. These
services are assigned to consumer A, consumer B, and consumer C.
Pooling across multiple storage systems provides a unified platform for provisioning storage
services that can store data at massive scale. Multiple such pools, with different performance
and availability levels, can be created in a cloud environment to cater to the needs of
various storage service offerings.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Cloud services comprising VMs obtain network bandwidth from network bandwidth pools.
VMs are allocated appropriate network bandwidth to meet the required service level. The figure on
the slide illustrates an example where a network bandwidth pool is created by aggregating
the network bandwidth of three physical network interface cards (NICs). These NICs are
installed on a physical compute system running hypervisor.

As shown in the figure, the network bandwidth pool has 3000 Mbps of network bandwidth.
Service A and service B are allocated 600 Mbps and 300 Mbps network bandwidth
respectively as per data transfer requirement of consumers. Service A and service B are
assigned to consumer A and consumer B respectively. After allocation of bandwidth to the
services, the network bandwidth pool has 2100 Mbps network bandwidth remaining, which
can be allocated to new or existing services as needed.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
An identity pool, unlike a resource pool, specifies a range of network identifiers (IDs) such as
virtual network IDs and MAC addresses. These IDs are allocated from identity pools to the
elements of cloud services. In a service, for example, constituent virtual networks obtain IDs
from a virtual network ID pool; likewise, VMs in a service get MAC addresses from a MAC
address pool.
An identity pool may map or allocate IDs to a particular service or to a group of services. For
example, service A is mapped to pool A containing IDs 1 to 10 and service B is mapped to
pool B containing IDs 11 to 100. If an identity pool runs out of IDs, administrators may
create an additional pool or add more IDs to the existing pool. The 1-to-1 mapping between
an identity pool and a service makes it easier to track the use of IDs by a particular service.
However, it increases management complexity because many identity pools must be created
and managed, depending on the number of services.
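The sketch below models an identity pool as a simple range of IDs reserved per service. The class is hypothetical and only illustrates the allocation and exhaustion behavior described above.

```python
class IdentityPool:
    """Hands out network identifiers (e.g., virtual network IDs) from a fixed range."""
    def __init__(self, first_id: int, last_id: int) -> None:
        self.free = list(range(first_id, last_id + 1))

    def acquire(self) -> int:
        if not self.free:
            raise RuntimeError("identity pool exhausted; add IDs or create a new pool")
        return self.free.pop(0)

    def release(self, identifier: int) -> None:
        self.free.append(identifier)

# 1-to-1 mapping: pool A serves service A, pool B serves service B.
pool_a = IdentityPool(1, 10)     # IDs 1-10
pool_b = IdentityPool(11, 100)   # IDs 11-100

service_a_ids = [pool_a.acquire() for _ in range(3)]
service_b_ids = [pool_b.acquire() for _ in range(2)]
print(service_a_ids, service_b_ids)   # [1, 2, 3] [11, 12]
```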

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Both resource and identity pools are classified based on various criteria, such as
performance, capacity, type of application, resource location, protection level, regulatory
compliance, and suitability or availability to specific organizations, departments or
consumer roles. This helps standardize pools based on predefined criteria. Pools of
different classes are used to create a variety of service offerings, providing choices to the
cloud service consumers. Pool classifications are guided by the consumer requirements.
Before configuring the resource pools, a provider needs to define classification criteria that
will be used to create the pools.
Multiple classes may be defined for each type of pool. Each class is marked with a name
such as ‘Gold’, ‘Silver’, and ‘Bronze’. Typically, one would expect ‘Gold’ class to cost more
than ‘Bronze’. The figure in the slide provides an example of classifying storage pools. In this
example, three names are used to distinguish three different storage pool classes.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covered resource pools and resource pooling examples, such as pooling
processing power, memory capacity, storage, and network bandwidth. It also covered
identity pools and the classification of pools based on predefined criteria.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covers the virtual machine (VM), VM hardware, VM files, and the file systems used to
manage VM files. This lesson also covers the VM console, VM templates, and virtual appliances.
Further, this lesson covers the VM network and its components.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A virtual machine (VM) is a logical compute system that, like a physical compute system,
runs an operating system (OS) and applications. A VM is created on a physical compute
system by a hosted or a bare-metal hypervisor. A VM has a self-contained operating
environment, comprising an OS, applications, and virtual hardware, such as a virtual
processor, memory, storage, and network resources. An OS – called a “guest” OS – is
installed on a VM in the same way as an OS is installed on a physical compute system. From
the perspective of the guest OS, the VM appears as a physical compute system. Each VM has
its own configuration for hardware, software, network, security, and so on. The VM behaves
like a physical compute system, but does not have direct access either to the underlying
host OS (when a hosted hypervisor is used) or to the hardware of the physical compute
system on which it is created. The hypervisor translates the VM’s resource requests and
maps the virtual hardware of the VM to the hardware of the physical compute system. For
example, a VM’s I/O requests to a virtual disk drive are translated by the hypervisor and
mapped to a file on the physical compute system’s disk drive.
Compute virtualization software enables creating and managing several VMs – each with a
different OS of its own – on a physical compute system or a compute cluster. In a cloud
environment, a provider typically provisions VMs to consumers to deploy their applications.
The VM hardware and software are configured to meet the application’s requirements. The
VMs of consumers are isolated from each other so that the applications and services
running on one VM do not interfere with those running on other VMs. The isolation also
provides fault tolerance, so that if one VM crashes, the other VMs remain unaffected.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
When a VM is created, it is presented with virtual hardware components that appear as
physical hardware components to the guest OS. Within a given vendor’s environment, each
VM has standardized hardware components that make it portable across physical
compute systems. Based on the requirements, the virtual components can be added or
removed from a VM. However, not all components are available for addition and
configuration. Some hardware devices are part of the virtual motherboard and cannot be
modified or removed. For example, the video card and the PCI controllers are available by
default and cannot be removed. As shown in the figure on the slide, the typical hardware
components of a VM include virtual processor(s), virtual motherboard, virtual RAM, virtual
disk, virtual network adapter, optical drives, serial/parallel ports, peripheral devices, and so
on.
A VM can be configured with one or more virtual processors. The number of virtual
processors in a VM can be increased or reduced, based on the requirements. When a VM is
started, its virtual processors are scheduled by the hypervisor kernel to run on the physical
processors. Each VM is assigned a virtual motherboard with the standardized devices
essential for a compute system to function. Virtual RAM is the amount of physical memory
allocated to a VM and it can be configured based on the requirements. The virtual disk is a
large physical file, or a set of files that stores the VM’s OS, program files, application data,
and other data associated with the VM. A virtual network adapter functions like a physical
network adapter. It provides connectivity between VMs running on the same or different
compute systems, and between a VM and physical compute systems. Virtual optical drives
and floppy drives can be configured to connect to either physical devices or to image files,
such as ISO and floppy images (.flp), on the storage. SCSI/IDE virtual controllers provide a
way for the VMs to connect to the storage devices. The virtual USB controller is used to
connect to a physical USB controller and to access connected USB devices. Serial/parallel
ports provide an interface for connecting peripherals to the VM.
Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
From a hypervisor’s perspective, a VM is a discrete set of files on a storage device. Some of
the key files that make up a VM are the configuration file, the virtual disk file, the memory
file, and the log file. The configuration file stores the VM’s configuration information,
including VM name, location, BIOS information, guest OS type, virtual disk parameters,
number of processors, memory size, number of adapters and associated MAC addresses,
SCSI controller type, and disk drive type. The virtual disk file stores the contents of a VM’s
disk drive. A VM can have multiple virtual disk files, each of which appears as a separate disk
drive to the VM. The memory state file stores the memory contents of a VM and is used to
resume a VM that is in a suspended state. The snapshot file stores the running state of the
VM including its settings and the virtual disk, and may optionally include the memory state
of the VM. It is typically used to revert the VM to a previous state. Log files are used to keep
a record about the VM’s activity and are often used for troubleshooting purposes.
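The snippet below sketches, as a Python dictionary, the kind of settings a VM configuration file typically records. The keys and values are hypothetical and do not follow any specific hypervisor's file format.

```python
# Hypothetical, simplified view of the settings a VM configuration file records.
vm_config = {
    "name": "app-server-01",
    "guest_os": "linux-64",
    "num_processors": 2,
    "memory_mb": 4096,
    "virtual_disks": ["app-server-01-disk0.vdisk"],   # one file per virtual disk
    "network_adapters": [
        {"type": "virtual-nic", "mac_address": "00:50:56:xx:xx:xx"},  # placeholder MAC
    ],
    "scsi_controller": "paravirtual",
}

for key, value in vm_config.items():
    print(f"{key:18}: {value}")
```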

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
When a VM is created, the associated VM files are created and placed on the storage that is
presented to the hypervisor. A file system is configured on the storage to manage the VM
files. Most hypervisors support two types of file systems: the hypervisor’s native file system,
and a shared file system, such as NFS or CIFS.
A hypervisor’s native file system is usually a clustered file system, and the storage presented
to it is typically optimized to store the VM files. The file system can be deployed on the
storage provisioned either from a local storage, or from external storage devices connected
through Fibre Channel, iSCSI, or FCoE. The file system allows multiple hypervisors, running
on different physical compute systems, to read from and write to the same shared storage
resources concurrently. This enables high availability capabilities, such as the migration of
VMs between clustered hypervisors in the event of failure of one of the hypervisors or
compute systems. A locking mechanism ensures that a VM is not powered on by multiple
hypervisors at the same time. When a hypervisor fails, the locking mechanism for each VM
running on the physical compute system is released. It is then possible for the VMs to be
restarted on other hypervisors.
A shared file system enables VM files to be stored on remote file servers or NAS devices that
are accessed over an IP-based network. The file systems are accessed using file sharing
protocols such as NFS and CIFS. Hypervisors have built-in NFS or CIFS clients that enable
communication with the file servers and NAS devices.
The capacity of the file system can be dynamically increased without disrupting the VMs
running on a physical compute system. If the volumes on which the file system resides have
additional configurable capacity, then the file system can be extended to increase its
capacity. However, if there is no configurable capacity available on the volumes, then
additional capacity must be assigned before the file system can be extended.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
An administrator connects to a VM using its console, which is an interface to view and
manage the VMs on a compute system or a cluster. The console may be installed locally on a
compute system, web-based, or accessed over a remote desktop connection. An
administrator uses the console to perform activities such as installing a guest OS, accessing
the BIOS of the VM, powering a VM on or off, editing startup and shutdown settings,
configuring virtual hardware, removing VMs, troubleshooting, and so on.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A VM template is a master copy of a virtual machine with a standardized virtual hardware
and software configuration, that can be used to create and provision new VMs. The VM
template typically includes a guest OS, a set of applications, and the hardware and software
configurations required to deploy a VM. Templates can be created in two ways – either by
converting a VM to a template or by cloning a VM to a template. When a VM is converted to a
template, the original VM is replaced by the template. When a VM is cloned to a template,
the original VM is retained. A VM template provides preinstalled and preconfigured
software, which makes provisioning VMs faster and eliminates installation, configuration,
and maintenance overheads. It also ensures consistency and standardization across
VMs, which makes it easier to diagnose and troubleshoot problems.
A VM template can be updated with new software and with OS and software patches.
Updating the VM template involves the conversion of the template back to a VM and then
the installation of the new software or patches. After the update is complete, the VM is
converted back into a template. While updating the template, the relevant VM must be
isolated to prevent user access.
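The update workflow described above can be summarized in a short sketch. The classes and functions below are hypothetical stand-ins for whatever the management software actually provides; they only illustrate the convert, patch, and convert-back sequence along with the isolation step.

# Illustrative sketch of the template update workflow described above.
# The classes and functions are hypothetical, not a real management API.

class VM:
    def __init__(self, name, software):
        self.name, self.software, self.isolated = name, list(software), False
    def isolate(self):
        self.isolated = True            # prevent user access during the update
    def install(self, patches):
        self.software.extend(patches)   # apply OS and software patches

class Template:
    def __init__(self, name, software):
        self.name, self.software = name, list(software)

def convert_to_vm(template):
    return VM(template.name, template.software)    # template becomes a working VM

def convert_to_template(vm):
    return Template(vm.name, vm.software)          # VM becomes the updated template

def update_template(template, patches):
    vm = convert_to_vm(template)
    vm.isolate()
    vm.install(patches)
    return convert_to_template(vm)

base = Template("linux-web-template", ["guest OS", "web server 2.2"])
updated = update_template(base, ["web server security patch"])
print(updated.software)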

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A virtual appliance is a preconfigured virtual machine preinstalled with a guest operating system and an
application, and dedicated to a specific function. In a cloud environment, virtual appliances are used
for different functions, such as to provide Software as a Service, to run cloud management
software, to route packets, and for providing security features such as a firewall or network
intrusion detection.
Using a virtual appliance simplifies the delivery and operation of an application. Traditional
application deployment is typically time-consuming and error-prone, because it involves setting up a
new VM, installing the guest OS, and then installing the application. In contrast, a virtual appliance deployment is faster
because the VM is preconfigured and has preinstalled software. This simplifies installation
and eliminates configuration issues, such as software or driver compatibility problems. Also,
the application runs in isolation within the virtual appliance, and it is protected against
crashes and security issues of the other virtual appliances. Virtual appliances are typically
created using the Open Virtualization Format (OVF) – an open, hypervisor-independent
packaging and distribution format.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A VM network is a logical network that provides Ethernet connectivity and enables communication between
the VMs running on a hypervisor within a compute system. A VM network includes logical switches, called
virtual switches. Virtual switches function similarly to physical Ethernet switches, but may not have all the
functionalities of a physical Ethernet switch.
Consider the example of a web application that is running on a VM and needs to communicate with a database
server. The database server could be running on another VM on the same compute system. The two VMs can
be connected via a VM network to enable them to communicate with each other. Because the traffic between
the VMs does not travel over a network external to the compute system, the data transfer speed between the
VMs is increased.
In some cases, the VMs residing on different compute systems may need to communicate either with each
other, or with other physical compute systems, such as client machines. To transfer these types of network
traffic, the VM network must be connected to the network of physical compute systems. In this case, the VM
traffic travels over both the VM network and the network of physical compute systems. The figure on the slide
shows two physical compute systems, each with a VM network and both the VM networks connected to a
network of physical compute systems.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
VM networks comprise virtual switches, virtual NICs, and uplink NICs that are created on a
physical compute system running a hypervisor.
A virtual switch is a logical OSI Layer 2 Ethernet switch created within a compute system. A
virtual switch is either internal or external. An internal virtual switch connects only the VMs
on a compute system. It has no connection to any physical NIC, and cannot forward traffic to
a physical network. An external virtual switch connects the VMs on a compute system to
each other and also to one or more physical NICs. It enables the VMs to communicate
internally and also to send traffic to external networks. A physical NIC already connected to
a virtual switch cannot be attached to any other virtual switch. A virtual switch also provides
traffic management for the VMs and maintains a MAC address table for forwarding frames
to a virtual switch port based on the destination address. A single virtual switch, called a
distributed virtual switch, can also function across multiple physical compute systems. It is
created and configured from a centralized management server. Once created, instances of
the distributed virtual switch with identical networking configurations appear on each
physical compute system managed by the management server. Configuration changes to the
distributed virtual switch are applied to all its instances.
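The MAC-address-table behavior mentioned above can be illustrated with a minimal sketch of a learning Layer 2 switch. Real virtual switches add VLAN tagging, uplink handling, and security policies; the sketch below shows only learning and forwarding, and the MAC addresses are placeholders.

# Minimal sketch of MAC learning and forwarding in a Layer 2 (virtual) switch.
class VirtualSwitch:
    def __init__(self):
        self.mac_table = {}          # MAC address -> switch port

    def receive(self, in_port, src_mac, dst_mac):
        # Learn the source MAC on the ingress port.
        self.mac_table[src_mac] = in_port
        # Forward to the known port, or flood if the destination is unknown.
        out_port = self.mac_table.get(dst_mac)
        return out_port if out_port is not None else "flood"

vswitch = VirtualSwitch()
print(vswitch.receive(1, "00:50:56:aa:00:01", "00:50:56:aa:00:02"))  # "flood" - destination not yet learned
print(vswitch.receive(2, "00:50:56:aa:00:02", "00:50:56:aa:00:01"))  # 1 - destination learned on port 1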
A virtual NIC connects a VM to a virtual switch and functions like a physical NIC.
Virtual NICs send and receive VM traffic to and from the VM network. A VM can have one or
more virtual NICs. Each virtual NIC has unique MAC and IP addresses and uses the Ethernet
protocol exactly as a physical NIC does. The hypervisor generates the MAC addresses and
allocates them to virtual NICs. The guest OS installed on a VM sends network I/O to the
virtual NIC using a device driver similar to that of a physical NIC. A virtual NIC forwards the I/Os
in the form of Ethernet frames to the virtual switch for transmission to the destination. It
adds its MAC and IP addresses as source addresses to the Ethernet frames it forwards.
An uplink NIC is a physical NIC connected to the uplink port of a virtual switch, and functions
as an ISL between the virtual switch and a physical Ethernet switch. It is called an uplink
because it only provides a physical interface to connect a compute system to the network.
Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson described virtual machines and VM hardware. It also described the files
associated with a VM and the file system to store and manage the VM files. This lesson also
covered VM console, VM template, and virtual appliance. Further this lesson covered VM
network and its components.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covers virtual volume and different ways to create virtual volumes.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A virtual volume, or LUN (logical unit number), is created by abstracting the identity and
internal function of storage system(s), and appears as physical storage to the compute
system. The virtual-to-physical storage mapping is performed by the virtualization layer.
Virtual volumes are assigned to the compute system, which creates a file system on them to store and
manage files. In a shared environment, there is a chance that a LUN may be accessed
by an unauthorized compute system. LUN masking is a process that provides data access
control by defining which LUNs a compute system can access. This ensures that volume
access by compute systems is controlled appropriately, preventing unauthorized or
accidental access. In a cloud environment, virtual volumes are created and assigned to
different services based on the requirements. For example, if a consumer requires 500 GB of
storage for archival purposes, the service provider creates a 500 GB virtual volume and
assigns it to the consumer. The storage capacity of a virtual volume can be dynamically
expanded or reduced based on the requirements. A virtual volume can be created from a
RAID set (traditional approach) or from a storage pool. The following slides discuss these approaches in
detail.
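Before looking at those approaches, the LUN masking check described above can be illustrated with a brief, hypothetical sketch; the LUN and compute-system names are placeholders.

# Hypothetical sketch of LUN masking: only compute systems listed in the
# masking view for a LUN are allowed to access it.
masking_views = {
    "LUN_0": {"compute-01", "compute-02"},
    "LUN_1": {"compute-03"},
}

def can_access(compute_system, lun):
    return compute_system in masking_views.get(lun, set())

print(can_access("compute-01", "LUN_0"))   # True  - authorized
print(can_access("compute-01", "LUN_1"))   # False - masked from this compute system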

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
In the traditional approach, a virtual volume is created from a RAID set by partitioning the
available capacity into smaller units. A RAID set includes physical disks that are logically
grouped together and the required RAID level is applied. Virtual volumes are spread across
all the physical disks that belong to a RAID set. The figure on the slide shows a RAID set
consisting of four disks that has been partitioned into two volumes: virtual volume (LUN 0)
and virtual volume (LUN 1). Traditional virtual volumes are suited for applications that
require predictable performance. Traditional virtual volumes provide full control for precise
data placement and allow an administrator to create virtual volumes on different RAID
groups if there is any workload contention. Organizations that are not highly concerned
about storage space efficiency may still use traditional virtual volumes.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Virtual volumes can be created from the storage pool that comprises a set of physical drives
that provide the actual physical storage used by the volumes. A storage pool can contain a
few drives or hundreds of drives. Two types of LUNs can be created from a storage pool:
thin and thick virtual volumes.
Thin virtual volumes do not require physical storage to be completely allocated to them at
the time they are created and presented to a compute system. From the operating system’s
perspective, a thin virtual volume appears as a traditional virtual volume. Thin virtual
volumes consume storage as needed from the underlying storage pool in increments called
thin virtual volume extents. The thin virtual volume extent defines the minimum amount of
physical storage that is consumed from a storage pool at a time by a thin virtual volume.
When a thin virtual volume is destroyed, its allocated capacity is reclaimed to the pool.
A thick virtual volume is one whose space is fully allocated upon creation: when a
thick virtual volume is created, all of its capacity is reserved and allocated in the pool for
use by that virtual volume.
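The difference between the two volume types can be sketched as follows. The extent size and capacities are arbitrary illustrative values; real systems use their own allocation units.

# Illustrative sketch: thick volumes reserve all capacity at creation,
# thin volumes consume pool capacity in fixed-size extents only as data is written.
class StoragePool:
    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.allocated_gb = 0

    def allocate(self, gb):
        if self.allocated_gb + gb > self.capacity_gb:
            raise RuntimeError("pool exhausted")
        self.allocated_gb += gb

class ThickVolume:
    def __init__(self, pool, size_gb):
        pool.allocate(size_gb)               # fully allocated up front

class ThinVolume:
    EXTENT_GB = 8                            # minimum allocation unit (illustrative)

    def __init__(self, pool, size_gb):
        self.pool, self.size_gb, self.consumed_gb = pool, size_gb, 0

    def write(self, gb):
        # Allocate whole extents from the pool only as data is written.
        while self.consumed_gb < min(gb, self.size_gb):
            self.pool.allocate(self.EXTENT_GB)
            self.consumed_gb += self.EXTENT_GB

pool = StoragePool(capacity_gb=1000)
ThickVolume(pool, 200)                       # pool allocation jumps to 200 GB immediately
thin = ThinVolume(pool, 500)                 # nothing allocated yet
thin.write(20)                               # allocates three 8 GB extents (24 GB)
print(pool.allocated_gb)                     # 224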

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Thin virtual volumes offer many benefits, although a thick virtual volume created from a storage pool provides
better performance than a thin virtual volume created from the same storage pool. Thin virtual volumes are
appropriate for applications that can tolerate performance variations. In some
cases, performance improvement is seen when using a thin virtual volume, due to striping
across a large number of drives in the pool. However, when multiple thin virtual volumes
contend for shared storage resources in a given pool, and when utilization reaches higher
levels, the performance can degrade. Thin virtual volumes provide the best storage space
efficiency and are particularly suitable for applications where space consumption is difficult
to forecast. Using thin virtual volumes, cloud service providers can reduce storage costs and
simplify their storage management.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covered virtual volume and different ways to create virtual volumes.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covers virtual network and the types of virtual network including VLAN and
VSAN. This lesson also covers the mapping between VLANs and VSANs in an FCoE SAN.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
Virtual networks are software-based logical networks that are created from a unified pool of
network resources. A virtual network can be created by segmenting a single physical
network into multiple logical networks. For example, multiple virtual networks may be
created on a common network infrastructure for the use of the different departments in an
organization. Also multiple physical networks can be consolidated into a single virtual
network. A virtual network utilizes the underlying physical network only for simple packet
forwarding. It appears as a physical network to the nodes connected to it, because existing
network services are reproduced in the logical space. Nodes with a common set of
requirements can be functionally grouped in a virtual network, regardless of the geographic
location of the nodes.
Two nodes connected to a virtual network can communicate with each other without the
routing of frames – even if they are in different physical locations. Network traffic must be
routed when two nodes in different virtual networks communicate – even if they are
connected to the same physical network. Virtual networks are isolated and independent of
each other. Each virtual network has unique attributes such as routing, switching,
independent polices, quality of service, bandwidth, security, and so on. The network
management traffic and the broadcasts within a virtual network generally do not propagate
to the nodes in another virtual network.
All types of networks can be virtualized, including networks of physical compute systems,
SANs, and VM networks. Virtual networks are programmatically created, provisioned, and
managed from a network management workstation. Network and security services become
part of individual VMs in accordance with networking and security policies defined for each
connected application. When a VM is moved to a hypervisor on another compute system, its
networking and security services move with it. And when new VMs are created to scale an
application, the necessary policies are dynamically applied to those VMs as well.
Virtual networks enable a cloud provider to support the complex requirements of multi-
tenancy. Also, the provider can create logical networks that span physical boundaries.
Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
The figure on the slide shows two virtual networks that are created on both virtual and
physical switches. Virtual network 1 connects VM1 and VM3 and enables communication
between them without routing of frames. Similarly, VM2 and VM4 are connected by virtual
network 2. Communication between VM1 and VM2 or between VM3 and VM4 must be
routed. The network traffic movement between virtual networks may be controlled by
deploying access control at the router.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
The slide lists the common types of virtual network.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A virtual LAN (VLAN) is a virtual network consisting of virtual and/or physical switches, which
divides a LAN into smaller logical segments. A VLAN groups the nodes with a common set of
functional requirements, independent of the physical location of the nodes. In a multi-tenant cloud
environment, the provider typically creates and assigns a separate VLAN to each consumer.
This provides a private network and IP address space to a consumer, and ensures isolation
from the network traffic of other consumers.
Traditionally in a physical network, a router is typically used to create a LAN and the LAN is
further segmented by using switches and hubs. In a physical LAN, the nodes, switches, and
routers are physically connected to each other and must be located in the same area. VLANs
enable a network administrator to logically segment a LAN, and the nodes do not have to be
physically located on the same LAN. For example, a cloud provider may place the VMs of a
consumer in the same VLAN, and the VMs may be on the same compute system or different
ones. Also, if a node is moved to another location, depending on the VLAN configuration it
may still stay on the same VLAN without requiring any reconfiguration. This simplifies
network configuration and administration. A node (VM, physical compute system, or storage
system) may be a member of multiple VLANs, provided the OS, hypervisor, and storage array
OS support such configurations.
To configure VLANs, an administrator first defines the VLANs on the physical and virtual
switches. Each VLAN is identified by a unique 12-bit VLAN ID (as per IEEE specification
802.1Q). The next step is to configure the VLAN membership, based on different techniques
such as port-based, MAC-based, protocol-based, IP subnet address-based, or application-
based. In the port-based technique, membership in a VLAN is defined by assigning a VLAN ID to
a physical or virtual switch port or port group. In the MAC-based technique, the membership
in a VLAN is defined on the basis of the MAC address of the node. In the protocol-based technique,
different VLANs are assigned to different protocols based on the protocol type field found in the Layer
2 header. In the IP subnet address-based technique, membership is based on the network IP subnet
address of the Layer 3 header. In the application-based technique, a specific application, for example,
a file transfer protocol (FTP) application, can be configured to execute on one VLAN. A detailed
discussion of these techniques is beyond the scope of this course.
Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A private VLAN (PVLAN) is an extension to the VLAN standard and further segregates the
nodes within a VLAN into sub-VLANs. A PVLAN is made up of a primary VLAN and one or
more secondary (or private) VLANs. The primary VLAN is the original VLAN that is being
segregated into smaller groups. Each secondary PVLAN exists only inside the primary VLAN. It has a
unique VLAN ID and isolates the Layer 2 traffic from other PVLANs. Primary VLANs are
promiscuous, meaning that ports on the PVLANs can communicate with ports configured as the
primary VLAN. Routers are typically attached to promiscuous ports.
There are two types of secondary PVLANs within a primary VLAN: isolated and community.
• Isolated: A node attached to a port in an isolated secondary PVLAN can only communicate with
the promiscuous PVLAN.
• Community: A node attached to a port in a community secondary PVLAN can
communicate with the other ports in the same community PVLAN, as well as with the
promiscuous PVLAN. Nodes in different community PVLANs cannot communicate with
each other.
To configure PVLANs, the PVLAN feature must be supported and enabled on a physical
switch or a distributed virtual switch. To create PVLANs, the administrator first creates
standard VLANs on a switch, and then configures the VLANs as primary and secondary. The
figure on the slide illustrates how different types of PVLANs enable and restrict
communications between VMs (nodes) that are connected to a distributed virtual switch.
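The communication rules for isolated and community secondary PVLANs described above can be expressed as a small sketch; port roles and community names are illustrative.

# Sketch of the PVLAN communication rules described above. Port roles are
# 'promiscuous' (primary VLAN), 'isolated', or 'community' with a community id.
def pvlan_can_communicate(a, b):
    # a and b are (role, community_id) tuples; community_id is None unless role == 'community'
    if "promiscuous" in (a[0], b[0]):
        return True                    # everything can talk to the promiscuous (primary) port
    if a[0] == "isolated" or b[0] == "isolated":
        return False                   # isolated ports talk only to the promiscuous port
    return a[1] == b[1]                # community ports talk only within the same community

router      = ("promiscuous", None)
vm_isolated = ("isolated", None)
vm_comm_a1  = ("community", "A")
vm_comm_a2  = ("community", "A")
vm_comm_b1  = ("community", "B")

print(pvlan_can_communicate(vm_isolated, router))      # True
print(pvlan_can_communicate(vm_isolated, vm_comm_a1))  # False
print(pvlan_can_communicate(vm_comm_a1, vm_comm_a2))   # True
print(pvlan_can_communicate(vm_comm_a1, vm_comm_b1))   # False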
PVLANs enable a cloud provider to support a larger number of consumers and address the
scalability issues encountered with VLANs. If a service provider assigns one VLAN per
customer, it limits the number of consumers that can be supported. Also, a block of addresses
is assigned to each consumer VLAN, which can result in unused IP addresses. Additionally, if
the number of nodes in the VLAN increases, the number of assigned addresses may not be
large enough to accommodate them. In a PVLAN, all members share a common address
space, which is allocated to the primary VLAN. When nodes are connected to secondary
VLANs, they are assigned IP addresses from the block of addresses allocated to the primary VLAN.
Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A stretched VLAN is a VLAN that spans across multiple sites over a WAN connection. In a typical
multi-site environment, two sites are connected over an OSI Layer 3 WAN and all network traffic
between them is routed. Because of the routing, it is not possible to transmit Layer 2 traffic
between the nodes in the two sites. A stretched VLAN extends a VLAN across the sites and enables
nodes in the two different sites to communicate over a WAN as if they are connected to the same
network.

Stretched VLANs may be created by simply connecting two sites using long distance fiber and can be
configured using different methods, depending upon the underlying WAN technology. For example, in
a packet switched network, protocols such as Layer 2 Tunneling Protocol (L2TP) and Multiprotocol
Label Switching (MPLS) are used. L2TP encapsulates Layer 2 PPP (point-to-point) frames into User
Datagram Protocol (UDP) datagrams, which, in turn, are encapsulated in IP packets and transmitted
over the IP network. MPLS supports multiple protocols and, for example, can encapsulate Layer 2
Frame Relay (FR) frames into IP packets. Similarly, in circuit switched networks, protocols such as
Dense Wavelength Division Multiplexing (DWDM) and Coarse Wavelength Division Multiplexing (CWDM) are
used. An elaboration of these methods is beyond the scope of this course.

Stretched VLANs also allow the movement of VMs between sites without having to change their
network configurations. This enables the creation of high-availability clusters, VM migration, and
application and workload mobility across sites. For example, in the event of a disaster, or during the
maintenance of one site, a provider typically moves VMs to an alternate site. Without a stretched
VLAN, the IP addresses of the VMs must be changed to match the addressing scheme at the other
site.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A VXLAN is a Layer 2 overlay network built on a Layer 3 network. An overlay network is a
virtual network that is built on top of an existing network. VXLANs, unlike stretched VLANs, are
based on LAN technology. VXLANs use the MAC Address-in-User Datagram Protocol (MAC-
in-UDP) encapsulation technique. In this scheme, a VXLAN header is added to the original
Layer 2 (MAC) frame, which is then placed in a UDP-IP packet and tunneled over a Layer 3
network. Communication is established between two tunnel end points called Virtual Tunnel
Endpoints (VTEPs). At the transmitting node, a VTEP encapsulates the network traffic into a VXLAN
header and at the destination node, a VTEP removes the encapsulation before presenting the original
Layer 2 packet to the node. VXLANs enable the creation of a logical network of nodes across different
networks. In case of VM communication, the VTEP is built into the hypervisor on the compute system
hosting the VMs. VXLANs enable the separation of nodes, such as VMs, from physical
networks. They allow the VMs to communicate with each other using the transparent
overlay scheme over physical networks that could span Layer 3 boundaries. This provides a
means to extend a Layer 2 network across sites. The VMs are unaware of the physical
network constraints and only see the virtual Layer 2 adjacency.

Nodes are identified uniquely by the combination of their MAC addresses and a VXLAN ID. VXLANs
use a 24-bit VXLAN ID, which makes it theoretically possible to have up to 16 million Layer 2
VXLANs co-existing on a common Layer 3 infrastructure. VXLANs make it easier for
administrators to scale a cloud infrastructure while logically isolating the applications and
resources of multiple consumers from each other. VXLANs also enable VM migration across
sites and over long distances.
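A simplified sketch of the MAC-in-UDP encapsulation performed by a sending VTEP is shown below. Only the 24-bit VNI is modeled; a real VXLAN header also carries flags and reserved fields, and the resulting datagram is commonly carried over UDP/IP (UDP port 4789) between VTEPs. The frame contents are placeholders.

# Simplified sketch of VXLAN (MAC-in-UDP) encapsulation at a sending VTEP.
import struct

def vxlan_encapsulate(inner_ethernet_frame: bytes, vni: int) -> bytes:
    assert 0 <= vni < 2**24, "VNI is a 24-bit identifier (up to about 16 million segments)"
    flags = 0x08                                          # 'VNI present' flag
    vxlan_header = struct.pack("!B3xI", flags, vni << 8)  # 8-byte VXLAN header
    # The VXLAN header plus the inner frame becomes the payload of a UDP datagram,
    # which the VTEP then sends inside an IP packet across the Layer 3 network.
    return vxlan_header + inner_ethernet_frame

frame = b"\x00" * 14 + b"original L2 payload"             # placeholder Ethernet frame
packet_payload = vxlan_encapsulate(frame, vni=5001)
print(len(packet_payload))                                # 8-byte header + inner frame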

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
A virtual SAN (VSAN) or virtual fabric is a logical fabric created on a physical FC or FCoE SAN.
A VSAN enables communication between a group of nodes with a common set of
requirements, independent of their physical location in the fabric. A VSAN conceptually
functions in the same way as a VLAN. Each VSAN behaves and is managed as an
independent fabric. Each VSAN has its own fabric services, configuration, and set of FC
addresses. Fabric-related configurations in one VSAN do not affect the traffic in another
VSAN. Also, the events causing traffic disruptions in one VSAN are contained within that
VSAN and are not propagated to other VSANs. Similar to a stretched VLAN, a VSAN may be
extended across sites by using long distance fiber, DWDM, CWDM, or FCIP links to carry the
FC frames.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
To configure VSANs on a fabric switch, the VSANs are first defined with specific VSAN IDs. Then the F_Ports on
the switch are assigned the VSAN IDs to include them in the respective VSANs. If an N_Port connects to an
F_Port that belongs to a VSAN, it becomes a member of that VSAN.
Note: VSAN vs. Zone
Both VSANs and zones enable node ports within a fabric to be logically segmented into
groups. But they are not the same, and their purposes are different. There is a hierarchical
relationship between them. An administrator first assigns physical ports to VSANs and then
configures independent zones for each VSAN. A VSAN has its own independent fabric
services, but fabric services are not available on a per-zone basis.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
The FCoE protocol enables transmission of FC SAN traffic through a LAN that supports Data
Center Bridging (DCB) functionalities. The FC frames remain encapsulated into Ethernet
frames during transmission through the LAN. If VLANs and VSANs are created on the LAN
and FC SAN respectively, a mapping is required between the VLANs and VSANs. The
mapping determines which VLAN will carry FC traffic that belongs to a VSAN. The VSAN to
VLAN mapping is performed at the FCoE switch. Multiple VSANs are not allowed to share a
VLAN; hence a dedicated VLAN must be configured at the FCoE switch for each VSAN. Also,
it is recommended that VLANs that carry regular LAN traffic should not be used for VSAN
traffic.
The figure on the slide provides an example of a mapping between VLANs and VSANs. In the
example, the FCoE switch is configured with four VLANs – VLAN 100, VLAN 200, VLAN 300,
and VLAN 400. The Ethernet switch is configured with two VLANs – VLAN 100 and VLAN
200. Both VLAN 100 and VLAN 200 transfer regular Ethernet traffic to enable compute-to-
compute communication. The fabric switch has VSAN 100 and VSAN 200 configured. To
allow data transfer between the compute system and the FC fabric through the FCoE switch,
VSAN 100 and VSAN 200 must be mapped to VLANs configured on the FCoE switch. Since
VLAN 100 and VLAN 200 are already being used for LAN traffic, VSAN 100 and VSAN 200
should be mapped to VLAN 300 and VLAN 400, respectively.
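The two mapping rules in this example can be captured in a short validation sketch; the VLAN and VSAN numbers are taken from the example above.

# Sketch of the VSAN-to-VLAN mapping constraints described above: each VSAN
# needs a dedicated VLAN, and VLANs carrying regular LAN traffic should not
# be reused for VSAN traffic.
lan_vlans = {100, 200}                    # VLANs already carrying Ethernet traffic
vsan_to_vlan = {100: 300, 200: 400}       # VSAN ID -> dedicated FCoE VLAN ID

def validate_mapping(mapping, lan_vlans):
    vlans = list(mapping.values())
    if len(vlans) != len(set(vlans)):
        raise ValueError("multiple VSANs are not allowed to share a VLAN")
    if set(vlans) & lan_vlans:
        raise ValueError("VLANs used for LAN traffic should not carry VSAN traffic")
    return True

print(validate_mapping(vsan_to_vlan, lan_vlans))   # True: the mapping follows both rules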

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This lesson covered virtual network and the common types of virtual networks including
VLAN, private VLAN, stretched VLAN, VXLAN, and VSAN. This lesson also covered the
mapping between VLANs and VSANs in an FCoE SAN.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
The Concepts in Practice section covers VMware ESXi and EMC VPLEX.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
VMware ESXi is a bare-metal hypervisor with a compact architecture that is designed for
integration directly into virtualization-optimized compute system hardware, enabling rapid
installation, configuration, and deployment. ESXi abstracts processor, memory, storage, and
network resources into multiple VMs that run unmodified operating systems and
applications. The ESXi architecture comprises an underlying operating system called
VMkernel, which provides a means to run management applications and VMs. VMkernel
controls all hardware resources on the compute system, and manages resources for the
applications. It provides core OS functionality, such as process management, file system,
resource scheduling, and device drivers.
The EMC VPLEX family is a storage virtualization platform that enables federation across
heterogeneous storage systems – both EMC and non-EMC block storage arrays. VPLEX
enables data mobility and information access over distance, within and between data
centers. VPLEX is built as a cluster, and each cluster consists of one, two, or four highly
available and fully redundant enterprise engines. A VPLEX engine is an enclosure that
contains two directors, management modules, and redundant power. The engine is installed
in between the compute systems and storage to enable data mobility and management of
multiple heterogeneous arrays from a single interface within a data center. To enable data
access between two sites, VPLEX clusters are connected together across the sites, and the
use of advanced caching techniques enables the same data to exist at more than one
location simultaneously. A VPLEX engine pools distributed block storage resources and
enables the creation of virtual volumes from the pool. These virtual volumes are then
allocated to the compute systems. The virtual-to-physical-storage mapping remains hidden
to the compute systems. All products in the VPLEX family have several common key
features, including cache coherency across engines in a cluster and between clusters, N+1
scaling for performance, N-1 resiliency, and an architecture that can support globalization of
applications over time.
Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This module covered the functions of virtual layer, virtualization software, resource pool,
and virtual resources.

Copyright © 2014 EMC Corporation. All rights reserved Module 4: Virtual Layer
This module focuses on the control layer and its key functions. It also focuses on
control software and on the software-defined approach for managing IT resources. It further
covers key resource management techniques that enable IT resources to be optimized and
utilized effectively to meet the required service levels.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 200
This module describes the control layer, highlighted in the figure, of the cloud infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 201
This lesson covers the control layer and its key functions. This lesson also covers control
software and key phases for provisioning resources using unified manager.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 202
The control layer includes control software (management software) that is responsible for
managing the underlying cloud infrastructure resources and enabling the provisioning of IT
resources for creating cloud services. The control layer can be deployed either on top of the virtual
layer or on the physical layer. This layer receives requests from the service and orchestration
layers, and interacts with the underlying virtual and physical resources for service
provisioning. For example, when a consumer initiates a service request (a VM instance with
4 GB RAM and 500 GB storage), based on the workflow defined by the orchestration layer
for this service, the control layer provisions the required resources from the pool to fulfill
the service request. This layer also exposes resources (physical and/or virtual) to and
supports the service layer, where cloud service interfaces are exposed to consumers. The
key functions of the control layer include resource configuration, resource provisioning, and
resource monitoring.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 203
The control software ties together the underlying physical resources and their software
abstraction to enable resource pooling and dynamic allocation of resources. It provisions
resources for services and provides information about the resources provisioned for and consumed
by service instances to the cloud portal and billing system. Before configuring the cloud
resources, the control software should discover all the underlying resources in order to
know the total resources available in the environment for service provisioning. Discovery also
provides a complete view of all the resources in the cloud environment, which can then be centrally
configured and monitored. The two types of control software are the element manager and the
unified manager. The following slides discuss the key tasks performed by
element managers and unified managers.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 204
Infrastructure component vendors may provide element managers as built-in or external
software to configure those components or elements. For example, storage vendors offer an
element manager along with the storage system to configure and make the storage
resources available for applications or services. Similarly, network and compute systems
are managed using network and compute management software, respectively. The figure on
the slide depicts how various element managers are involved in managing the
infrastructure components independently. Typically, the underlying infrastructure is
managed from an element manager through either a graphical user interface (GUI) or a
command-line interface (CLI).

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 205
Element managers are required to support initial component configuration such as zoning,
RAID levels, LUN masking, and firmware updates. They are also required when resource
capacity needs to be expanded to meet demand. For example, a storage element
manager can be used by an administrator to detect newly added drives and add them to
an existing storage pool. Troubleshooting and monitoring capabilities may also be supported
by element managers. Additionally, many security settings and policy configurations, such as
role-based access control (RBAC) settings, are managed through the element managers.
For a cloud infrastructure of significant size, especially where a variety of physical and virtual
components exist, using these element managers alone to perform routine management
tasks can become complex.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 206
Most vendors today include native APIs with their management software packages so that
administrators can integrate them with other supporting tools. In a cloud environment
where compute, storage, network and other infrastructure elements work together, having
a software solution that permits unified management and configuration provides
simplification and improved efficiencies.
A unified manager provides a single management interface to configure cloud infrastructure
resources and provision resources for services. The unified manager interacts with all
standalone infrastructure elements through the elements’ native APIs. It discovers and
collects information on configurations, connectivity, and utilization of cloud infrastructure
elements. Unified manager compiles this information and provides a consolidated view of
infrastructure resources wherever they reside. In addition, unified manager identifies the
relationships between virtual and physical elements for easy management. It provides a
topology or a map view of infrastructure, which enables an administrator to quickly locate
and understand interconnections of infrastructure components and services.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 207
The unified manager exposes APIs that enable integration with the orchestration layer to
automate resource provisioning for cloud services. After a service is provisioned, there may be a
need to add or remove resources (compute, storage, and network) to meet changing
business requirements. The unified manager allows resources to be dynamically added to or removed
from a service without impacting availability. It provides a dashboard that shows how
the infrastructure is configured and how the resources are used. This enables an
administrator to monitor the configuration and utilization of the infrastructure resources
and to plan for capacity requirements. Unified manager also provides a topology or a map
view of the infrastructure, which enables an administrator to quickly locate and understand
the interconnections of the infrastructure components and services. It enables compliance
enforcement by creating configuration policies, which are applied to the resources consumed
by a service instance. It helps to track configuration changes and performs compliance
checking to ensure that the resources are in a known configuration. It also prevents
conflicting resource identity assignments, for example, accidentally assigning a MAC address
to more than one virtual NIC. It provides an alerts console which allows an administrator to
see alerts against the infrastructure resources and the associated services affected by
problems. The alert console facilitates identifying the services affected due to problems and
the root causes of the problems. Knowing the root cause of a problem early enough helps to
resolve problems faster.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 208
The key phases for provisioning infrastructure resources to cloud services using a unified
manager are resource discovery, resource pool configuration, and service provisioning.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 209
Discovery operations create an inventory of the infrastructure resources, so that unified
manager can learn what resources are available for cloud service deployment. Discovery
also provides information about the assets including their configuration, connectivity,
capacity, availability, utilization, and physical-to-virtual dependencies. Discovery provides
cloud administrators with visibility into each resource and enables monitoring cloud
infrastructure resources centrally. Typically, the unified manager interacts with element
managers through APIs to discover resources in the environment. Discovery is typically
scheduled by setting an interval for its periodic occurrence. Depending on the provider's
business requirements, a cloud administrator can change how often discovery runs.
Discovery may also be initiated by a cloud administrator when a change occurs in the cloud
infrastructure. During discovery, the unified manager captures information about the
infrastructure resources such as:
• Compute systems (number of blade servers, CPU speed, memory capacity, CPU and
memory pools, mapping between virtual and physical compute systems)
• Network components (switch model, network adapters, VLAN IDs, VSAN IDs, physical-to-
virtual network mapping, QoS, network topology, zones)
• Storage systems (type of storage system, drive type, total capacity, free capacity, used
capacity, RAID level, storage pools, physical-to-virtual storage mapping)

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 210
Once the discovery phase is completed, the unified manager enables the configuration of resource
pools. A resource pool is a collection of available resources that are used to dynamically
provision resources. The unified manager interacts with the underlying virtual and physical layers
and enables the creation of resource pools. A cloud environment usually consists of two types of
pools: resource pools and identity pools. Examples of resource pools are CPU pools and
storage pools, whereas WWN pools, MAC address pools, VLAN ID pools, and VSAN ID pools are
examples of identity pools. Virtual resources such as VMs, virtual volumes, and virtual
networks are created from these pools and provisioned for the services.
Unified manager allows an administrator to grade pools. Resource grading is a process that
categorizes these pools based on various criteria, such as performance, capacity, and
availability. Pools of different grades are used to create a variety of service offerings,
providing choices to the cloud service consumers. Multiple grade levels may be defined for
compute (CPU and memory), storage, and network (WWN, VLAN ID, and so on) pools. Each
grade level is marked with a grade name such as ‘Gold’, ‘Silver’, and ‘Bronze’. The number of
grade levels for a given type of pool depends on business requirements. The slide provides
an example of grading storage pools. In this example, three grade names are used to
distinguish three different grade levels. Resource grading helps the unified manager to
allocate resources from the appropriate pool during the service provisioning process.
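As a simple illustration of how grading guides allocation, the sketch below selects a storage pool by grade and free capacity; the pool names, grade names, and capacities are hypothetical.

# Sketch of resource grading: storage pools are tagged with a grade, and the
# unified manager picks a pool of the requested grade with enough free capacity.
storage_pools = [
    {"name": "pool-ssd",  "grade": "Gold",   "free_gb": 2000},
    {"name": "pool-fc",   "grade": "Silver", "free_gb": 8000},
    {"name": "pool-sata", "grade": "Bronze", "free_gb": 20000},
]

def select_pool(grade, required_gb):
    for pool in storage_pools:
        if pool["grade"] == grade and pool["free_gb"] >= required_gb:
            return pool["name"]
    raise LookupError(f"no {grade} pool with {required_gb} GB free")

print(select_pool("Gold", 500))     # pool-ssd
print(select_pool("Bronze", 5000))  # pool-sata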

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 211
Service provisioning involves creating service instances and allocating resources from graded
resource pools to the service instances. Service provisioning starts when consumers select
cloud services from the service catalog. A service template (discussed in detail in
Module 6) defined in a service catalog helps consumers understand service
capabilities and provides guidelines to create workflows for service orchestration. When the
unified manager receives a service provisioning request, the resources are allocated
and integrated as per the service template to create an instance of the service.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 212
This lesson covered the control layer which includes the key functions of the layer. This
lesson also covered control software and key phases for provisioning resources using unified
manager.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 213
This lesson covers the functionalities of software-defined approach, and key functions of
software-defined controller. This lesson also covers the key benefits of software-defined
approach.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 214
For any organization, it is becoming important to support its growth through virtualization,
so it can quickly and efficiently deliver cloud, big data and analytics, mobile, and social
business services. The software-defined approach is a mechanism that helps in creating and
implementing an optimized IT infrastructure, which can help organizations achieve competitive
advantage and higher value through speed and efficiency in delivering services.
The software-defined approach virtualizes all the infrastructure components (compute, storage,
and network) and pools them into aggregated capacity. It separates the control or management
functions from the underlying components and moves them to external software, which takes over the
control operations to manage the multi-vendor infrastructure components centrally.
Principally, a physical infrastructure component (compute, network, or storage) has a control
path and a data path. In simple terms, the control path sets and manages the policies for the
resources, and the data path performs the actual transmission of data. The software-defined
approach decouples the control path from the data path. By abstracting the control path,
resource management operates at the control layer, which gives the ability to partition the
resource pools and manage them uniquely by policy. This decoupling of the control path and
data path enables all provisioning and management tasks to be centralized in
software external to the infrastructure components.
The software runs on a centralized compute system or a standalone device, called the
software-defined controller. The figure on the slide shows an illustration of software-defined
approach, where the management function is abstracted from the underlying infrastructure
components using controller. From a data center aspect of software-defined approach,
there is a software-defined compute controller for compute systems, software-defined
storage controller for storage systems, and software-defined network controller for network
systems.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 215
The software-defined controller has built-in intelligence that automates provisioning and
configuration based on defined policies. It enables organizations to dynamically, uniformly,
and easily modify and manage their infrastructure. The controller discovers the available
underlying resources and provides an aggregated view of resources. It abstracts the
underlying hardware resources (compute, storage, and network) and pools them. This
enables the rapid provisioning of resources from the pool based on pre-defined policies that
align to service level agreements for different consumers. The controller enables a cloud
administrator to manage the resources, node connectivity, traffic flow, control behavior of
underlying components, apply policies uniformly across the infrastructure components, and
enforce security, all from a software interface. The controller also provides interfaces that
enable applications external to the controller to request resources and access these
resources as services.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 216
The software-defined approach enables resources to be provisioned, based on policies, to services and
applications as needed by the business in a very short time. It enables infrastructure
resources to be delivered to consumers via a service catalog and provides on-demand self-
service access to consumers. This in turn dramatically improves business agility. A
software-defined approach increases flexibility by abstracting the underlying IT resources to
enable service providers to use low-cost, non-proprietary standard hardware and, in many
cases, leverage existing infrastructure investments to dramatically lower capital
expenditure (CAPEX). Virtualizing the entire infrastructure through a software-defined
approach further saves CAPEX and operating expenditure by improving the
utilization of resources. By abstracting physical resources into a virtual pool, a service provider
can achieve massive scale by aggregating existing and new heterogeneous hardware
components into limitless capacity available to consumers. The software-defined approach
provides centralized management of heterogeneous resources to ensure service levels and
monitor resource utilization. It allows new, innovative services to be created that span underlying
resources. For example, a new service called “object data service” can be created that
provides the ability to store, access, and manipulate unstructured data (for example, images, video,
audio, and online documents) as objects on a file-based storage system while exploiting the
performance capabilities of that system.
The software-defined approach helps cloud service providers deliver the most efficient and
scalable cloud solutions. Combining the transformational power of cloud computing with
the benefits of the software-defined approach enables cloud-based workloads to achieve their
highest levels of performance, reliability, and scalability.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 217
This lesson covered the functions of software-defined controller and key benefits of
software-defined approach.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 218
This lesson covers the introduction to resource management aspect of cloud infrastructure
and resource allocation model. This lesson also covers various key compute resource
management techniques: hyper-threading, memory page sharing, dynamic memory
allocation, VM load balancing across hypervisors, and server flash-cache.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 219
Typically in a cloud environment, multiple consumers share the same underlying hardware
resources. So it is important to effectively manage these resources to meet the required
service level. Resource management includes allocation of resources effectively to a service
instance from a pool of resources. The key goals of resource management are controlling
utilization of resources and preventing service instances from monopolizing the resources.
Monopolizing the resources can be avoided by controlling the allocation of resources to the
service instance. The resources are managed from a centralized management server, which
enables administrators to define policies and configure resources. This management server also provides
the ability to pool the resources, allocate them, and optimize their utilization.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 220
Resources that are allocated to a service instance can be controlled based on one of two
models: relative resource allocation and absolute resource allocation.
In a relative resource allocation model, resource allocation for a service instance is not
defined quantitatively. Instead, resource allocation to a service instance is defined
proportionally relative to the resource allocated to other service instances. For example,
consider two service instances with two different service levels: service instance 1 has
“Platinum” service level with 2X priority and service instance 2 has “Silver” service level
with X priority. When a resource contention occurs, the service instance 1 is allocated twice
as much resource as service instance 2.
An absolute resource allocation model is based on defining a quantitative bound for the
resources for each service instance. In this model, a lower and upper bound is defined. A
lower bound guarantees minimum amount of resources to a service instance. The upper
bound limits a service instance from consuming resources beyond the defined maximum
level. For example, consider a service instance (VM) that is configured with 2 GB memory
capacity and 1200 MHz processing power as its lower bounds. The same VM is configured
with 4 GB memory capacity and 2400 MHz processing power as its upper bounds. In this
case, if the available memory capacity or processing power is less than 2 GB and 1200 MHz
respectively, then the VM will not power on. On the other hand, even if capacity greater
than its specified upper bounds is available, the maximum amount of memory capacity and
processing power this VM can consume is 4 GB and 2400 MHz, respectively.
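The two models can be illustrated with a short sketch that uses the numbers from the examples above (2X versus X priority, and the 2 GB/1200 MHz lower and 4 GB/2400 MHz upper bounds, with memory expressed in MB); the function names are illustrative.

# Relative model: divide a contended resource in proportion to each instance's weight.
def relative_shares(available, weights):
    total = sum(weights.values())
    return {name: available * w / total for name, w in weights.items()}

print(relative_shares(6000, {"platinum": 2, "silver": 1}))  # {'platinum': 4000.0, 'silver': 2000.0}

# Absolute model: refuse to power on below the lower bound; never exceed the upper bound.
def absolute_allocation(available, lower, upper):
    if available < lower:
        return None                       # VM does not power on
    return min(available, upper)

print(absolute_allocation(available=1000, lower=2048, upper=4096))  # None (below 2 GB lower bound)
print(absolute_allocation(available=8192, lower=2048, upper=4096))  # 4096 (capped at 4 GB upper bound)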

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 221
The slide lists some of the key resource (compute, storage, and network) management
techniques that enable optimization of resource utilization, improved performance, and
adherence to the required service levels. Most of these techniques allow cloud administrators to set
policies for managing resources effectively based on the requirements. Some of the
techniques provide the capability to overcommit (allocate more capacity than is actually
available) CPU, memory, and storage resources to avoid frequent provisioning of resources,
or to reduce disruption to application availability when adding new resources. This
overcommitment enables more service instances to be created but requires proper monitoring to be
in place to avoid any downtime. The following slides discuss these techniques in detail.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 222
Hyper-threading makes a single processor core appear as two logical processor cores,
allowing the hyper-threading enabled operating system (or hypervisor) to schedule two
threads simultaneously to avoid idle time on the processor. However, the two threads cannot be
executed at the same time because the two logical cores share the resources of a single
physical core. When core resources are not in use by the current thread, especially when the
processor is stalled (for example due to data dependency), resources of the core are used to
execute the next scheduled thread.
When a service provider builds compute infrastructure using hyper-threading enabled
processors and operating systems, the overall performance of the services running on this
infrastructure can be improved.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 223
In a cloud environment, it is common that multiple VMs are running on a compute system,
and this increases the probability of having identical content in multiple memory pages.
For example, VMs may run the same guest OS and have the same applications. This may
create redundant copies of memory pages and result in increased consumption of memory
resources. Memory page sharing is a technique by which the hypervisor scans the memory
pages to identify redundant pages, regardless of which VM created them. After a candidate
memory page is identified, the VM memory pointer is updated to point to the shared
location, and the redundant memory pages are reclaimed. Removing these redundant
memory pages enables the memory resources to be utilized effectively, and at the same time
more memory can be assigned to the VM instances that require it to meet the demand.
The figure on the slide illustrates the process of reclaiming memory pages using the memory
page sharing technique. Consider a physical compute system running three VMs, namely VM
1, VM 2, and VM 3. In this scenario, the hypervisor scans the physical memory locations and
identifies that the contents of memory page 1 of VM 1, VM 2, and VM 3 are identical. In this
situation, the hypervisor updates the memory map for virtual memory page 1 of the
three VMs to point to memory page 1 of physical memory. In the absence of this technique,
three physical memory pages would be consumed to store these three redundant memory
pages. Similarly, virtual memory page 2 of VM 2 and VM 3 have identical contents and point to
physical memory page 5. When VM 3 updates its virtual memory page 2, which points to the
shared physical memory page 5, Copy-on-Write (CoW) is invoked to handle the update. In the
CoW mechanism, when a shared page is updated, the hypervisor transparently copies the contents
of physical memory page 5 to a private page (page 6) for that VM. The pointer map is
updated to point to this private copy of the virtual memory page. In this way, shared virtual
memory pages can be modified non-disruptively.
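The scanning, sharing, and copy-on-write behavior can be sketched as follows; content hashing is used here as a stand-in for however a given hypervisor actually detects identical pages, and all identifiers are illustrative.

# Sketch of memory page sharing: pages with identical content are mapped to one
# physical copy; a write to a shared page triggers copy-on-write (CoW).
import hashlib

class PageSharingHypervisor:
    def __init__(self):
        self.physical_pages = {}      # physical page id -> content
        self.content_index = {}       # content hash -> physical page id
        self.page_map = {}            # (vm, virtual page) -> physical page id
        self.next_id = 0

    def _store(self, content):
        pid, self.next_id = self.next_id, self.next_id + 1
        self.physical_pages[pid] = content
        return pid

    def map_page(self, vm, vpage, content):
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.content_index:               # first copy: allocate it
            self.content_index[digest] = self._store(content)
        self.page_map[(vm, vpage)] = self.content_index[digest]   # share the physical page

    def write_page(self, vm, vpage, new_content):
        pid = self.page_map[(vm, vpage)]
        if list(self.page_map.values()).count(pid) > 1:     # page is shared
            self.page_map[(vm, vpage)] = self._store(new_content)  # copy-on-write
        else:
            self.physical_pages[pid] = new_content

hv = PageSharingHypervisor()
for vm in ("VM1", "VM2", "VM3"):
    hv.map_page(vm, 1, b"identical guest OS page")          # one physical page for all three
print(len(hv.physical_pages))                               # 1
hv.write_page("VM3", 1, b"VM3 private update")              # CoW: VM3 gets its own copy
print(len(hv.physical_pages))                               # 2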

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 224
Dynamic memory allocation is a technique to reclaim memory pages. When a VM must free
up memory, it is best to let the guest OS of the VM select the memory pages to give up.
The guest OS of the VM knows which pages have been least recently used and can be freed
up. In this technique, each VM has an agent installed in the guest OS that communicates
with the hypervisor. The agent’s function is to demand memory from the guest OS and to
relinquish it to the control of the hypervisor. When a compute system is not under memory
pressure, no action is taken by the agent running within each VM. However, when memory
becomes scarce, the hypervisor chooses VMs and instructs the agents in the VMs to
demand memory from their guest OS. After the memory is reclaimed by the agent, the
agent reserves the memory and puts it back into the memory pool. The hypervisor then
assigns the relinquished memory pages from the pool to other VMs that require more
memory. Consider a scenario in which an application running on a VM instance faces a
sudden increase in the load. The VM would require additional processing power and
memory to handle this increase in workload, without impacting the service level. The
dynamic memory allocation technique enables more memory resources to be assigned to
the VM by reclaiming memory from other VMs in the environment.
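The following minimal Python sketch illustrates the agent/hypervisor interaction described above. The GuestOS, BalloonAgent, and Hypervisor classes, the page identifiers, and the method names are illustrative assumptions, not a real implementation.

```python
class GuestOS:
    """Toy guest OS that tracks its memory pages; least recently used pages sit at the front."""
    def __init__(self, pages):
        self.pages = list(range(pages))

    def release_lru_pages(self, count):
        released, self.pages = self.pages[:count], self.pages[count:]
        return released


class BalloonAgent:
    """Runs inside the guest OS; 'inflates' by demanding pages from the guest."""
    def __init__(self, guest):
        self.guest = guest

    def inflate(self, count):
        return self.guest.release_lru_pages(count)   # pages handed over to the hypervisor


class Hypervisor:
    def __init__(self, agents):
        self.agents = agents      # one balloon agent per VM
        self.free_pool = []

    def reclaim(self, pages_needed):
        """Under memory pressure, instruct agents to demand memory from their guest OS."""
        for agent in self.agents:
            if len(self.free_pool) >= pages_needed:
                break
            self.free_pool += agent.inflate(pages_needed - len(self.free_pool))

    def grant(self, guest, count):
        """Assign relinquished pages from the pool to a VM that requires more memory."""
        granted, self.free_pool = self.free_pool[:count], self.free_pool[count:]
        guest.pages += granted


vm1, vm2, vm3 = GuestOS(100), GuestOS(100), GuestOS(100)
hv = Hypervisor([BalloonAgent(vm1), BalloonAgent(vm2)])
hv.reclaim(50)      # reclaim 50 pages from VM 1/VM 2 via their balloon agents
hv.grant(vm3, 50)   # assign them to VM 3, which faces a sudden load increase
```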

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 225
To provide redundancy and load balancing, many hypervisors are clustered. In such an
environment, when a new VM is powered on, the management server checks the availability of
resources on all the hypervisors. It places the VM on a hypervisor where sufficient resources
are available and ensures that the load is balanced across hypervisors. Although the
management server performs initial placement so that load is balanced across the clustered
hypervisors, changes in VM load and resource availability may cause the cluster to become
imbalanced. To overcome such an imbalance, the management server monitors resource usage on
all the hypervisors and makes balancing decisions. It executes a load balancing decision by
migrating VMs from over-utilized hypervisors to underutilized hypervisors to avoid performance
bottlenecks. The management server makes its load balancing decisions based on configured
threshold values. A threshold value is a measure of how much imbalance in a hypervisor's
resources (processor cycles and memory) is acceptable.
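A minimal, hedged Python sketch of such a threshold-based balancing decision follows. The utilization figures, the single-resource model, and the migration selection heuristic are simplifying assumptions; actual management servers consider multiple resources and migration costs.

```python
def rebalance(hypervisors, threshold=0.10):
    """Migrate VMs from over- to under-utilized hypervisors until the utilization
    imbalance falls within the configured threshold (or no move improves it).

    `hypervisors` maps a hypervisor name to {vm_name: utilization}, where
    utilization is the fraction of host capacity consumed by that VM.
    """
    migrations = []
    while True:
        load = {h: sum(vms.values()) for h, vms in hypervisors.items()}
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        imbalance = load[busiest] - load[idlest]
        if imbalance <= threshold or not hypervisors[busiest]:
            break
        # Candidate VM whose migration leaves the smallest remaining imbalance.
        vm = min(hypervisors[busiest],
                 key=lambda v: abs((load[busiest] - hypervisors[busiest][v]) -
                                   (load[idlest] + hypervisors[busiest][v])))
        new_imbalance = abs((load[busiest] - hypervisors[busiest][vm]) -
                            (load[idlest] + hypervisors[busiest][vm]))
        if new_imbalance >= imbalance:   # no migration improves the balance; stop
            break
        hypervisors[idlest][vm] = hypervisors[busiest].pop(vm)
        migrations.append((vm, busiest, idlest))
    return migrations


cluster = {
    "hyp1": {"vm-a": 0.40, "vm-b": 0.35},   # over-utilized hypervisor
    "hyp2": {"vm-c": 0.20},                 # underutilized hypervisor
}
print(rebalance(cluster))                   # [('vm-b', 'hyp1', 'hyp2')]
```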

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 226
In server flash-caching, a flash memory cache card is installed in the compute system to
enhance application performance. Server flash-caching technology uses intelligent caching
software and a flash card on the compute system. The caching software places the most
frequently referenced data on the flash card, thereby putting the data closer to the
application. This dramatically improves application performance and avoids the latencies
associated with I/O access over the network to the storage system. Server flash-caching
technology provides performance acceleration for read-intensive workloads. As a result of a
server flash-caching implementation, a copy of the hottest data automatically resides on the
flash card in the compute system, improving performance.
Server flash-caching needs a "warm-up" time before a significant performance improvement is
realized. Warm-up time is the time required to move a significant amount of data into the
server flash-cache. Typically, this happens just after the server flash-cache has been
installed, while it is still empty.
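The following Python sketch illustrates the general idea of a server-side read cache in front of networked storage. The least-recently-used eviction policy, the block-level interface, and the capacity figures are assumptions for illustration; real caching software uses its own placement and eviction logic.

```python
from collections import OrderedDict

class ServerFlashCache:
    """Toy read cache: hot blocks are kept on the local flash card; misses are
    served from networked storage and then cached (this is the warm-up effect)."""

    def __init__(self, backend_read, capacity_blocks):
        self.backend_read = backend_read   # function that reads a block from networked storage
        self.capacity = capacity_blocks
        self.cache = OrderedDict()         # block id -> data, kept in LRU order

    def read(self, block_id):
        if block_id in self.cache:              # cache hit: served from flash, low latency
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        data = self.backend_read(block_id)      # cache miss: I/O over the network
        self.cache[block_id] = data             # hot data accumulates on flash over time
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the least recently used block
        return data


cache = ServerFlashCache(backend_read=lambda b: f"data-{b}", capacity_blocks=2)
cache.read(1); cache.read(2); cache.read(1)     # blocks 1 and 2 are now cached
cache.read(3)                                   # evicts block 2 (least recently used)
```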

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 227
This lesson covered the resource management aspect of cloud infrastructure and the resource
allocation model. It also covered key compute resource management techniques, including
hyper-threading, memory page sharing, dynamic memory allocation, VM load balancing across
hypervisors, and server flash-caching.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 228
This lesson covers the key storage resource management techniques: virtual storage
provisioning, storage pool rebalancing, thin LUN storage space reclamation, automated
storage tiering, cache tiering, and dynamic VM load balancing across storage volumes.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 229
One of the biggest challenges for administrators is allocating the storage space required by
the various applications (or services) running in their IT infrastructure. With traditional
storage provisioning, administrators typically allocate storage capacity to applications based
on anticipated storage requirements. Administrators often over-provision storage to an
application, either to avoid frequent provisioning of storage if the LUN capacity is exhausted,
or to reduce disruption to application availability when adding new storage. Over-allocation
results in unused storage space and lower capacity utilization. It also results in the
acquisition of excess storage capacity, which leads to higher costs and increased power,
cooling, and floor space requirements.
Virtual storage provisioning enables presenting a LUN to an application with more capacity
than is physically allocated to it on the storage system. Physical storage is allocated to the
application "on demand" from a shared storage pool of physical capacity. This provides more
efficient utilization of storage by reducing the amount of allocated-but-unused physical
storage. The shared storage pool enables rapid elasticity of storage resources, quickly and
dynamically expanding (scaling outward) or reducing (scaling inward) to adapt to variations in
workload and to maintain the required service level.
Virtual storage provisioning enables service providers to reduce storage costs by increasing
capacity utilization and simplifying storage management. It also helps to reduce power,
cooling, and floor space requirements. As a result, virtual storage provisioning has become a
part of green computing.
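As an illustration, the following Python sketch models on-demand allocation from a shared pool to a thin LUN. The extent-level granularity, class names, and capacity figures are simplifying assumptions rather than a storage-system implementation.

```python
class SharedStoragePool:
    """Toy model of a shared pool: physical extents are handed out only when used."""

    def __init__(self, physical_extents):
        self.free_extents = list(range(physical_extents))

    def allocate_extent(self):
        if not self.free_extents:
            raise RuntimeError("shared pool exhausted; expand the pool")
        return self.free_extents.pop()

    def release_extent(self, extent):
        self.free_extents.append(extent)


class ThinLUN:
    def __init__(self, pool, reported_capacity_extents):
        self.pool = pool
        self.reported_capacity = reported_capacity_extents   # capacity the application sees
        self.mapping = {}                                     # logical extent -> physical extent

    def write(self, logical_extent, data=None):
        """Physical capacity is consumed on demand, at first write to an extent."""
        if logical_extent not in self.mapping:
            self.mapping[logical_extent] = self.pool.allocate_extent()
        # ... data would be written to the mapped physical extent here ...

    @property
    def consumed(self):
        return len(self.mapping)   # only extents that were actually written


pool = SharedStoragePool(physical_extents=100)
lun = ThinLUN(pool, reported_capacity_extents=500)   # presented capacity exceeds physical capacity
lun.write(0); lun.write(7)
print(lun.consumed, len(pool.free_extents))          # 2 extents consumed, 98 still shared
```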

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 230
When a storage pool is expanded, the sudden introduction of new, empty drives alongside
relatively full existing drives causes a data imbalance. It may also impact performance,
because new data would be written mostly to the newly added drives. Storage pool rebalancing
is a technique that automatically rebalances allocated extents across the physical disk drives
of the entire pool when new disk drives are added. Rebalancing restripes data across all the
disk drives (both existing and new) in the shared storage pool. This spreads the data evenly
over all the physical disk drives within the shared pool, ensures that the used capacity of
each disk drive is uniform across the pool, and helps achieve higher overall pool performance.
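A simplified Python sketch of the restriping idea follows; the per-drive extent lists and the even-target heuristic are assumptions made for illustration, not the actual rebalancing algorithm of any storage system.

```python
def rebalance_pool(drives):
    """Restripe allocated extents evenly across all drives (existing and newly added).

    `drives` maps a drive id to the list of extent ids currently placed on it.
    Returns a list of (extent, source_drive, destination_drive) moves.
    """
    total = sum(len(extents) for extents in drives.values())
    target = total // len(drives)            # uniform used capacity per drive
    donors = {d: e[target:] for d, e in drives.items() if len(e) > target}
    moves = []
    for dest, extents in drives.items():
        while len(extents) < target:
            src = next(d for d, surplus in donors.items() if surplus)
            extent = donors[src].pop()
            drives[src].remove(extent)
            extents.append(extent)
            moves.append((extent, src, dest))
    return moves


pool = {
    "drive1": list(range(0, 40)),    # existing, relatively full drives
    "drive2": list(range(40, 80)),
    "drive3": [],                    # newly added, empty drives
    "drive4": [],
}
rebalance_pool(pool)                 # each drive ends up with about 20 extents
```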

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 231
Virtual storage provisioning gives the cloud administrator a way to substantially reduce
over-allocation of storage to applications and to manage capacity based on usage instead of
allocated size. This allows the administrator to maintain a common, unallocated storage space
that is readily available to other applications or services on an as-needed basis. But even
with virtual storage provisioning, there can be unused storage in a thin LUN if the space from
deleted blocks is not reclaimed. Consider a scenario where large file deletions are common and
significant numbers of unused storage blocks are not returned to the storage pool from the
thin LUN. In such cases, reclaiming the unused space from thin LUNs benefits the overall
physical storage space availability. The thin LUN storage space reclamation technique
identifies unused space in thin LUNs and reassigns it to the storage pool. There are multiple
options available to reclaim the unused space on a thin LUN, such as zero extent reclamation
and API-based reclamation. Zero extent reclamation is a method commonly implemented at the
storage system that frees, or de-allocates, storage extents found to contain all zeros in a
thin LUN. These de-allocated extents are added back to the pool, making them available to
other applications. Another approach is to use a thin LUN storage space reclamation API,
implemented at the compute system. Compute systems running such an API can efficiently
communicate the location of all the identified unused space on the LUN to the storage system.
This enables the storage system to reclaim all unused physical storage to the pool, making it
available to other thin LUNs.
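A minimal Python sketch of zero extent reclamation follows; the tiny extent size and the dictionary-based thin LUN are illustrative assumptions only.

```python
EXTENT_SIZE = 8  # bytes per extent in this toy example (real extents are far larger)

def zero_extent_reclaim(thin_lun, pool_free_extents):
    """Free thin-LUN extents whose contents are all zeros and return them to the pool.

    `thin_lun` maps a logical extent id to a bytes object of length EXTENT_SIZE.
    """
    reclaimed = []
    for extent_id, data in list(thin_lun.items()):
        if data == bytes(EXTENT_SIZE):           # extent contains only zeros
            del thin_lun[extent_id]              # de-allocate it from the thin LUN
            pool_free_extents.append(extent_id)  # make it available to other LUNs
            reclaimed.append(extent_id)
    return reclaimed


lun = {0: b"\x01" * 8, 1: bytes(8), 2: bytes(8)}   # extents 1 and 2 were zeroed after deletions
pool = []
print(zero_extent_reclaim(lun, pool))              # [1, 2]
```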

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 232
Automated storage tiering is a technique of establishing a hierarchy of different storage
types for different categories of data, so that the correct data is automatically stored on
the correct tier to meet service level requirements. Many applications have predictable spikes
in activity, with much lower activity at other times. Ideally, an automated storage tiering
solution addresses these cyclical fluctuations as well as the unpredictable spikes that can
also occur. Automated storage tiering has the potential to replace tedious manual storage
management and to significantly benefit cloud environments. Tiers are differentiated on the
basis of protection, performance, and cost. For example, high-performance solid-state drives
(SSDs) can be configured as tier 0 storage to keep frequently accessed data (hot data), and
low-cost SATA drives as tier 1 storage to keep less frequently accessed data (cold data).
Keeping frequently used data on SSDs improves application performance. Moving less frequently
accessed data to more economical, higher-capacity SATA drives frees up capacity on the
high-performance drives and reduces the overall cost of storage. For example, if the focus of
a cloud service is providing low- or no-cost capabilities that do not need high performance,
then high-capacity, energy-efficient HDDs are an option. On the other hand, if the service
requires low response times while supporting a large number of active users, SSDs are a good
fit. Different tiers of storage media are thus aligned to meet different service requirements.
Data is moved between tiers based on defined tiering policies. A tiering policy is usually
based on parameters such as file type, frequency of access, and so on. For example, if a
policy states, "Move the files that have not been accessed for the last 30 days to the lower
tier," then all the files matching this condition are moved to the lower tier. Data movement
between tiers can happen within a storage array (intra-array) or between storage arrays
(inter-array).
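The following Python sketch applies the 30-day example policy quoted above; the file catalog structure and tier names are hypothetical and used only to illustrate policy-driven data movement.

```python
import time

def apply_tiering_policy(files, max_idle_days=30):
    """Move files not accessed for `max_idle_days` to the lower (e.g. SATA) tier,
    and recently accessed files to the higher (e.g. SSD) tier.

    `files` maps a file name to {"last_access": epoch seconds, "tier": tier name}.
    Returns the list of (file, old_tier, new_tier) moves.
    """
    now = time.time()
    idle_limit = max_idle_days * 24 * 3600
    moves = []
    for name, meta in files.items():
        target = "tier1-sata" if now - meta["last_access"] > idle_limit else "tier0-ssd"
        if meta["tier"] != target:
            moves.append((name, meta["tier"], target))
            meta["tier"] = target
    return moves


catalog = {
    "orders.db":   {"last_access": time.time() - 2 * 24 * 3600,  "tier": "tier1-sata"},
    "archive.log": {"last_access": time.time() - 90 * 24 * 3600, "tier": "tier0-ssd"},
}
print(apply_tiering_policy(catalog))
# [('orders.db', 'tier1-sata', 'tier0-ssd'), ('archive.log', 'tier0-ssd', 'tier1-sata')]
```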

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 233
Tiering may also be implemented at the cache level. A large cache in a storage system improves
performance by retaining large amounts of frequently accessed data, so that a high proportion
of reads is served directly from the cache. However, configuring a large cache in the storage
system can be costly. An alternative way to increase the size of the cache is to use the SSDs
on the storage system to create a large-capacity secondary cache positioned between the
storage processor's DRAM primary cache and the storage system's disk drives. This enables
tiering between the DRAM primary cache and the SSD secondary cache. Cache tiering also enables
the storage system to store large amounts of frequently accessed data on the cache tier. Most
reads can then be served directly from the cache tier, which provides excellent performance
benefits during peak workloads.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 234
During the storage provisioning process for VMs, volumes are often selected at random, and
virtual disks for the VMs are created on these volumes. This may lead to over- or
underutilized volumes. Dynamic VM load balancing across storage volumes enables intelligent
placement of VMs during creation, based on the I/O load and the available storage capacity on
the hypervisor's native FS volumes or NAS FS volumes. This technique is implemented in a
centralized management server that manages the virtualized environment. The management server
performs ongoing, dynamic VM load balancing within a cluster of volumes. A cluster of volumes
is a collection, or pool, of a hypervisor's native FS or NAS FS volumes that are aggregated as
a single volume to enable efficient and rapid placement of new virtual machines and load
balancing of existing workloads. User-configurable space-utilization and I/O-latency
thresholds are defined to ensure space efficiency and to avoid I/O bottlenecks. These
thresholds are typically defined during the configuration of the clustered volumes to avoid
resource bottlenecks and to meet application service levels.
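A hedged Python sketch of threshold-based initial placement follows; the volume attributes, threshold values, and selection order are assumptions chosen for illustration.

```python
def place_vm(volumes, vm_size_gb, space_threshold=0.80, latency_threshold_ms=20):
    """Pick a volume from the cluster for a new VM's virtual disk, honoring the
    user-configurable space-utilization and I/O-latency thresholds.

    `volumes` maps a volume name to
    {"capacity_gb": ..., "used_gb": ..., "latency_ms": ...}.
    """
    candidates = []
    for name, v in volumes.items():
        projected_util = (v["used_gb"] + vm_size_gb) / v["capacity_gb"]
        if projected_util <= space_threshold and v["latency_ms"] <= latency_threshold_ms:
            candidates.append((v["latency_ms"], projected_util, name))
    if not candidates:
        raise RuntimeError("no volume satisfies the space/latency thresholds")
    # Prefer the least loaded volume: lowest latency, then lowest projected utilization.
    return min(candidates)[2]


cluster = {
    "vol1": {"capacity_gb": 1000, "used_gb": 900, "latency_ms": 5},    # too full
    "vol2": {"capacity_gb": 1000, "used_gb": 300, "latency_ms": 8},    # meets both thresholds
    "vol3": {"capacity_gb": 1000, "used_gb": 200, "latency_ms": 30},   # latency too high
}
print(place_vm(cluster, vm_size_gb=100))    # 'vol2'
```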

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 235
This lesson covered the key storage resource management techniques, including virtual
storage provisioning, storage pool rebalancing, thin LUN storage space reclamation,
automated storage tiering, cache tiering, and dynamic VM load balancing across storage
volumes.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 236
This lesson covers the key network resource management techniques including balancing
client workload across nodes, network storm control, Quality of Service, traffic shaping, link
aggregation, NIC teaming, and multipathing.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 237
Network traffic flow in a cloud network infrastructure is controlled to optimize both
performance and availability of cloud services. Administrators may use several traffic
management techniques supported by different vendors of network resources. Some of
these techniques enable distribution of traffic load across nodes or parallel network links to
prevent overutilization or underutilization of these resources. Other techniques enable
automatic failover of network traffic from a failed network component to another available
component. Some techniques also ensure guaranteed service levels for a class of traffic
contending with other classes for network bandwidth. The key network traffic management
techniques, listed in this slide, are described in the subsequent slides.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 238
Client (consumer) connections are typically balanced across a group of nodes, such as a
cluster of application servers that process clients' requests simultaneously. The client
workload balancing service is usually provided by a purpose-built device called a load
balancer. The load balancer splits client traffic across multiple nodes. The working principle
of a load balancer may vary based on vendor implementation. A common load balancing method is
to place the load balancer between the node cluster and the Internet. This allows all client
traffic to pass through the load balancer. Clients use the IP address of the load balancer to
send requests. This IP address is called a public IP address because it is accessible for
general use. The public IP address abstracts the real (private) IP addresses of all nodes in
the cluster. The private IP addresses of the nodes are known only to the load balancer, which
decides where to forward each request. The slide shows an example of load balancing. In this
example, there are three application servers, each with a private IP address. A load balancer
is placed in the network, in front of the application servers, and provides a publicly
accessible IP address and a domain name, webapp.sample.com. When cloud consumers access the
application located at webapp.sample.com, they are directed to the load balancer and then
redirected to one of the application servers.
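The Python sketch below illustrates one possible forwarding method, simple round-robin; as noted above, real load balancers vary by vendor, and the private IP addresses used here are hypothetical.

```python
import itertools

class LoadBalancer:
    """Toy round-robin load balancer: clients connect to one public name/address,
    and requests are forwarded to private node addresses hidden behind it."""

    def __init__(self, public_name, private_nodes):
        self.public_name = public_name                # e.g. webapp.sample.com
        self._nodes = itertools.cycle(private_nodes)  # private IPs known only to the balancer

    def forward(self, client_request):
        node = next(self._nodes)                      # pick the next node in rotation
        return f"{client_request} -> forwarded to {node}"


lb = LoadBalancer("webapp.sample.com", ["10.0.0.11", "10.0.0.12", "10.0.0.13"])
for i in range(4):
    print(lb.forward(f"GET / (client request {i})"))
# Requests are spread across the three application servers in turn.
```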

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 239
A network storm occurs when frames flood a LAN or a VLAN, creating excessive traffic and
resulting in degraded network performance. A storm can be caused by errors in the network
configuration or by a denial-of-service (DoS) attack.
Storm control is a technique to prevent regular network traffic on a LAN or VLAN from being
disrupted by a network storm, thereby avoiding degraded network performance. If storm control
is enabled on a supported LAN switch, it monitors all frames arriving at switch ports over a
specific time interval. The switch counts the total number of frames of a specific type
(unicast, multicast, or broadcast) that arrive at a switch port during the interval. It then
compares this count with a pre-configured storm control threshold. The switch port blocks the
traffic when the threshold is reached and filters out subsequent frames until the interval
ends.
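The following Python sketch models the counting-and-blocking behavior described above; the interval length, threshold value, and per-port bookkeeping are illustrative assumptions, not a switch implementation.

```python
import time
from collections import defaultdict

class StormControl:
    """Per-port frame counting over a fixed interval; traffic of a given type is
    filtered for the rest of the interval once the configured threshold is reached."""

    def __init__(self, threshold, interval_seconds=1.0):
        self.threshold = threshold
        self.interval = interval_seconds
        self.window_start = time.monotonic()
        self.counts = defaultdict(int)   # (port, frame type) -> frames seen in this interval

    def allow(self, port, frame_type):
        """Return True if the frame may pass, False if it is filtered out."""
        now = time.monotonic()
        if now - self.window_start >= self.interval:   # interval ended: start a new one
            self.window_start = now
            self.counts.clear()
        self.counts[(port, frame_type)] += 1
        return self.counts[(port, frame_type)] <= self.threshold


sc = StormControl(threshold=3)
results = [sc.allow("port1", "broadcast") for _ in range(5)]
print(results)   # [True, True, True, False, False] -- the broadcast storm is suppressed
```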

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 240
Quality of service (QoS) refers to the capability of a network to prioritize business-critical
and latency-sensitive network traffic and to provide better service to such traffic than to
less critical traffic. QoS uses a collection of technologies that allow applications to obtain
consistent service levels in terms of network bandwidth, latency variation, and delay. This is
achieved by raising the priority of critical classes of network traffic over other classes.
The Internet Engineering Task Force (IETF) defines two approaches to QoS: integrated services
and differentiated services. With integrated services, an application signals the network to
inform the network components about the required QoS. The signal carries a request for the
network bandwidth and permissible delay for the application's network traffic. If every
network component along the data path can reserve the necessary bandwidth, the originating
application can begin transmitting. The application can transmit data through the network only
after receiving confirmation from the network.
With differentiated services, different network traffic gets different QoS based on the
priority specified in each packet. The network uses the priority specification to classify
traffic and then manages network bandwidth based on the traffic class. The priority
specification can be inserted into the packets by applications or by switches and routers.
There are different ways to specify the priority. For example, three precedence bits in the
type of service (ToS) field of the IP packet header can be used as a priority specification.
In an Ethernet network, the class of service (CoS) field specifies the priority.
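As a small, hedged example of application-side marking for differentiated services, the Python snippet below sets the ToS byte of outgoing packets. Whether the option is honored is platform-dependent, and whether the marking has any effect depends entirely on the network's QoS policy; the destination address is a documentation-only example.

```python
import socket

# Mark outgoing IP packets with a DSCP value so that QoS-aware switches and routers
# can classify and prioritize this traffic (differentiated services).
DSCP_EF = 46   # "Expedited Forwarding", commonly used for latency-sensitive traffic

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# The ToS byte carries the DSCP value in its upper six bits.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF << 2)
sock.sendto(b"latency-sensitive payload", ("192.0.2.10", 5000))
sock.close()
```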

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 241
Traffic shaping limits the traffic rate at a network interface, such as a node port or a
router port. This helps limit the rate of low-priority network traffic, which improves latency
and increases the available network bandwidth for higher-priority traffic. In addition to
ensuring the required service level for business-critical applications, it helps control the
traffic rate per client or tenant, avoiding network congestion. Traffic shaping can be
performed by a node or by an interconnecting device.
Traffic shaping allows an administrator to set a limit on the traffic rate at a network
interface. In the event of a traffic burst that exceeds the limit, traffic shaping retains the
excess packets in a queue and schedules them for later transmission. In this way, it ensures a
consistent traffic rate at the network interface and meets the required service level.
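A common way to implement this behavior is a token bucket, sketched below in Python; the rate and burst parameters, and the packet-count (rather than byte-count) accounting, are simplifying assumptions.

```python
import time
from collections import deque

class TrafficShaper:
    """Token-bucket shaper: packets exceeding the configured rate are queued
    and released later, keeping the output rate consistent."""

    def __init__(self, rate_pps, burst):
        self.rate = rate_pps            # sustained packets per second
        self.tokens = burst             # bucket starts full
        self.burst = burst
        self.queue = deque()
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def send(self, packet, transmit):
        """Transmit immediately if a token is available, otherwise queue the packet."""
        self._refill()
        if self.tokens >= 1 and not self.queue:
            self.tokens -= 1
            transmit(packet)
        else:
            self.queue.append(packet)   # excess traffic is retained for later transmission

    def drain(self, transmit):
        """Called periodically to release queued packets as tokens become available."""
        self._refill()
        while self.queue and self.tokens >= 1:
            self.tokens -= 1
            transmit(self.queue.popleft())


shaper = TrafficShaper(rate_pps=100, burst=10)
for i in range(25):
    shaper.send(f"pkt{i}", transmit=print)   # roughly the first 10 go out at once, the rest queue
shaper.drain(transmit=print)                 # later calls release queued packets at about 100 pps
```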

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 242
Link aggregation combines two or more parallel network links into a single logical link,
called a port-channel, yielding higher bandwidth than a single link could provide. Link
aggregation enables distribution of network traffic across the links and traffic failover in
the event of a link failure. If a link in the aggregation is lost, all network traffic on that
link is redistributed across the remaining links. Link aggregation can be performed for links
between two switches and between a switch and a node.
NIC teaming is a link aggregation technique that logically groups NICs (to create a NIC team)
so that they appear as a single logical NIC. It distributes network traffic across the NICs
and provides traffic failover in the event of a NIC or link failure.
Multipathing can perform load balancing by distributing I/O across all active paths; standby
paths become active if one or more active paths fail. If an active path fails, the
multipathing process detects the failed path and redirects the I/Os of the failed path to
another active path.
Link aggregation, NIC teaming, and multipathing techniques are detailed in Module 7.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 243
This lesson covered the key network resource management techniques: balancing client
workload across nodes, network storm control, Quality of Service, traffic shaping, link
aggregation, NIC teaming, and multipathing.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 244
The concepts in practice section covers EMC products such as Unisphere, Unified
Infrastructure Manager, ViPR, FAST VP, XtremSF, and PowerPath/VE.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 245
EMC Unisphere is a unified storage management platform that provides intuitive user
interfaces for managing EMC VNX and VNXe storage arrays and EMC RecoverPoint. Unisphere is
web-enabled and supports remote management of storage arrays. Unisphere allows VNX
administrators to monitor the health, alerts, and performance of large numbers of VNX storage
systems from a central location. It provides an easy-to-use, customizable dashboard that
aggregates system information such as capacity, CPU utilization, health, and alerts. Some of
the key capabilities offered by Unisphere follow:
• Provides unified management for file, block, and object storage
• Provides single sign-on for all devices in a management domain
• Supports automated storage tiering and ensures that data is stored in the correct tier to
meet performance and cost requirements
• Provides management of both physical and virtual components

Contd.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 246
EMC ViPR is a software-defined storage solution that abstracts storage with all its unique
capabilities from physical arrays into a single pool of virtual storage. Storage administrators
then create virtual storage arrays that they can manage at the virtualization layer according
to automated policies. ViPR decouples the storage control path from the data path, which
enables it to centralize all the provisioning and management tasks. ViPR automates
provisioning by creating pre-defined, policy-driven, virtual storage arrays and delivering self-
service access. It centralizes storage management and monitors utilization, performance
and health through a single management interface across physical and virtual storage. ViPR
provides a hyper-scale, cloud architecture built with standard APIs, so any customer or
service provider can extend it to support non-EMC storage and integrate with cloud stacks
such as VMware and OpenStack.
EMC FAST VP performs storage tiering at a sub-LUN level in a virtual provisioned
environment. FAST VP automatically moves more active data (data that is more frequently
accessed) to the best performing storage tier, and it moves less active data to a lower
performance and less expensive tier. Data movement between the tiers is based on user-
defined policies, and is executed automatically and non-disruptively by FAST VP.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 248
EMC XtremSF is a PCIe flash card deployed in the compute system to dynamically improve
application performance by reducing latency and accelerating throughput. It can also be used
as a caching device in conjunction with the server flash caching software EMC XtremSW Cache.
XtremSW Cache accelerates reads and protects data by using a write-through cache to the
networked storage. It extends EMC FAST VP into the server, adding another tier of intelligence
and performance to the I/O stack.
EMC PowerPath/VE software provides multipathing solutions for physical compute systems
running VMware ESX/ESXi and Microsoft Hyper-V hypervisors. It provides path failover and load
balancing across FC, iSCSI, or FCoE I/O paths. All I/Os to storage systems run through
PowerPath/VE, which distributes I/O requests to a LUN across all the available paths.
PowerPath/VE may be added to the hypervisor and used in place of the hypervisor's native
multipathing functionality to deliver advanced multipathing capabilities.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 249
This module covered the control layer and its key functions. It also covered control software
and the software-defined approach to managing IT resources. It further covered key resource
management techniques to utilize IT resources effectively and meet the required service
levels.

Copyright © 2014 EMC Corporation. All rights reserved Module 5: Control Layer 250
This module focuses on the service layer and the service orchestration layer of a cloud
infrastructure. It covers service layer functions, cloud interfaces, the cloud portal, cloud
interface standards, and common protocols for accessing cloud services. It also includes
service orchestration and the cloud service lifecycle.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 251
This module describes the service layer and the service orchestration layer, highlighted in
the figure, of the cloud infrastructure.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 252
This lesson covers the service layer functions, service catalog and its elements, service
ordering, and cloud interfaces.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 253
Cloud services are IT resources that are packaged by service providers and offered to service
consumers. A service is instantiated once its constituent IT resources are provisioned and
configured. The instantiated service is called a service instance.
The slide provides an example of a cloud service (SaaS type). The service comprises two Web
servers, an application server, a database server, and a load balancer, all hosted on virtual
machines (VMs). Three VLANs are configured to interconnect these VMs appropriately. The load
balancer distributes client connections across both Web servers. The Web servers receive
Hypertext Transfer Protocol (HTTP) requests from clients and return HTTP responses to the
clients. The Web servers delegate the requests to the application server, which executes the
requests and passes back the generated responses. The application server accesses the database
server to store and retrieve data while processing clients' requests.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 254
The service layer of the cloud infrastructure enables a provider to define services and a
consumer to self-provision services. Additionally, it presents cloud interfaces to the
consumers, enabling them to consume deployed services.
The service layer has three key functions:
• Enables defining services in a service catalog: Cloud service providers should ensure that
consumers are able to view what services are available, what the service level options
are, and what the service cost is to help them effectively make the right choice of
services. Cloud services are defined in a service catalog, which is the menu of service
offerings from a service provider. The catalog provides a central source of information on
the service offerings delivered to the consumers by the provider, ensuring that consumers
can view a standardized, accurate, and consistent picture of the services available to
them.
• Enables on-demand, self-provisioning of services: A service catalog also allows a
consumer to request or order a service from the catalog that best matches the consumer's
needs, without manual interaction with the service provider. While placing a service request,
a consumer commonly submits service demands such as required resources, needed configurations,
and the location of data. Once a service request is approved by the provider, appropriate
resources are provisioned for the requested service.
• Presents cloud interfaces to consume services: Cloud interfaces are the functional and
management interfaces of the deployed service instances. Using these interfaces, consumers
perform computing activities, such as executing a transaction, and administer their use of
rented service instances, for example by modifying, scaling, stopping, or restarting a service
instance.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 255
A service catalog is the menu of services that lists the services, their attributes, service
level commitments, terms and conditions for service provisioning, and prices. Cloud service
providers define, publish, and manage the list of service offerings in a service catalog. A
consumer may view the service catalog to learn what cloud services are available, their
features and prices, and consumer-specific values of the services. Additionally, a service
catalog allows a consumer to request or order a service from the catalog in a self-service
way.
The service catalog is the central source of information about the services offered by a
provider to a consumer and is fundamental to the successful transformation of an IT
organization or service provider from the traditional method of delivering IT capability to
delivering IT as a service. It does not offer consumers an unlimited number of service
offerings, nor does it allow them to arbitrarily choose the resources for a service. Rather,
it offers customers a limited and standardized set of service offerings that have been
pre-defined based on the provider's expertise, technology, skill of personnel, and market
demand. By presenting a carefully selected and standardized set of services in its service
catalog, a provider can optimize its infrastructure and personnel requirements, and ensure
consistent service quality, service provisioning time, and consumer satisfaction.
The slide shows the service catalog for EMC's private cloud database service, named Xpress
database. The database is offered on Oracle or Microsoft SQL Server.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 256
A service catalog includes various elements to define services. A list of common service
catalog elements with explanations is provided below.
• Service category: Services are defined using a hierarchy within the service catalog that
qualifies each service under a specific service category, such as e-commerce, business
intelligence, or database. This gives consumers a clear view of the type of service offered.
• Service name: An appropriate name that enables consumers to understand what service
offering is available.
• Service description: The description tells consumers what the service is, what business
process it supports, and what value the service provides to the consumers.
• Features and options: This is a list of features including constraints, policies, and rules to
characterize a cloud service. It may provide a list of options for selection, such as a list of
operating systems (OSs) for a PaaS type of service. It may also provide technical
descriptions of each option, such as the software version. Further, it clarifies billing and
decommissioning policies, including impact on billing if service instances are
decommissioned before the subscription period is completed.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 257
A service catalog typically provides a link (hypertext or a hyperlink button) to request or
order a service. After clicking the designated link, a consumer is commonly asked to submit a
Web form with a few drop-downs, check boxes, radio buttons, and text boxes to describe the
required resources, their configurations, the intended usage of the service, and so on.
Providers usually make an effort to simplify the form for consumer use, abstracting the
underlying resource allocation details. For example, a PaaS consumer may specify the
application requirements and database usage while requesting a database service from a
provider. This high-level service request is translated into its constituent resource
requests, such as the number of VMs, amount of memory, OS, and database configuration.
Further, a consumer must agree to the contract terms associated with the selected service
before submitting the form to complete the service ordering.
The slide shows a partial view of the Web form for ordering EMC's private cloud database
service.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 259
Cloud services are consumed through cloud interfaces, which enable computing activities and
administration of rented service instances. Based on their primary purpose, these interfaces
can be categorized into two types (source: NIST SP 500-291):
• Management interface: Enables a consumer to control his or her use of a rented service.
• Functional interface: Enables a consumer to use service functions.
Typically, each cloud service presents an interface of each category. However, both types of
interface can be combined to create a single interface for a service, or can be further
divided to create more interfaces. The following slides detail both types of interfaces.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 260
The management interface is a self-service interface that enables a consumer to monitor,
modify, start, and stop rented service instances without manual interaction with the service
provider. It also helps consumers prepare the desired functional interface.
Based on the service model (IaaS, PaaS, or SaaS), the management interface presents different
management functions. For IaaS, the management interface enables consumers to manage their use
of infrastructure resources. For PaaS, it enables managing the use of platform resources. The
SaaS management interface enables consumers to manage their use of business applications. The
slide shows examples of IaaS, PaaS, and SaaS management interface functions.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 261
The functional interface presents the functional content of a service instance, enabling
consumers to perform computing activities. As with the management interface, the functional
content is presented differently depending on the service model:
• IaaS: The functional interface exposes the specifics of hardware resources, such as
processors, memory, network adapters, and storage volumes, typically used by an OS.
• PaaS: The functional interface may provide an integrated development environment (IDE)
that consists of a programming interface, libraries, and tools to develop and deploy
applications. This could be offered in proprietary or commercially available coding languages.
The environment enables consumers to code, compile, and run their applications on the cloud
platform. The development environment may also be offered as a software development kit (SDK)
that a consumer can download and use for application development. After development, the
consumer can upload the application onto the cloud platform.
• SaaS: The graphical user interface (GUI) of a business application offered as a service is
an example of the functional interface.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 262
This lesson covered service layer functions, service catalog and its elements, and the
process to order a service from a service catalog. It also covered the management and
functional interfaces of cloud services.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 263
This lesson covers the cloud portal functions and the personalization of cloud portal.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 264
A cloud portal is a Web portal (a web page at a website) that presents the service catalog and
cloud interfaces, enabling consumers to order and manage cloud services in an on-demand,
self-service way. A cloud portal is also accessed by cloud administrators to manage the cloud
infrastructure and the lifecycle of cloud services. The service lifecycle includes the various
phases of a service from its initiation to its termination. The cloud service lifecycle is
described later in this module.
Cloud portals are hosted on one or more (for redundancy and workload balancing) compute
systems, called portal servers. Cloud portals are created using specialized development tools,
called portal software. The portal software enables providers to design and publish cloud
portals. A user (cloud administrator or consumer) may use the uniform resource locator (URL)
of a cloud portal to log on to the portal.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 265
A cloud portal has two key functions:
• Presentation: The organization and display of service information and management functions
to the users.
• Interaction with the orchestration layer: Sending service requests to the orchestration
layer and receiving responses back from it.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 266
A cloud portal commonly presents the following elements:
• Service catalog: Lists and describes service offerings along with their attributes, service
levels, terms and conditions for provisioning, and prices. It also includes action buttons or
links that allow a consumer to request or order a service from the catalog.
• Management interface: Presents information about all services ordered by a consumer,
including order status (such as on order, suspended, and available), consumption of resources
by services, notification of events, and billing information. This helps with the monitoring
and management of service instances. The management interface also includes action buttons or
links that enable a consumer to modify, stop, and start rented service instances in a
self-service way.
• Link to functional interface: Once a service instance is created based on a service request,
its consumer is notified in the cloud portal that the service is ready to use and is provided
a link to access it. By using the link, the consumer gains access to the functional interface
of the service. As described in the cloud interfaces section, the functional interface exposes
the functional content of a service instance, enabling a consumer to perform computing
operations.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 267
The management interface is commonly designed like a digital dashboard. Similar to a car's
dashboard, a digital dashboard provides a real-time summary of information pertaining to the
cloud services and a central point to manage rented service instances. The management
interface is presented across one or more portlets, or windows. Each portlet exposes a group
of management functions and service information. Service information is presented in a tabular
and/or graphical view. This helps present information clearly and coherently, which improves
content readability and simplifies management of rented service instances.
The slide shows the portal view of EMC VMAX Cloud Edition, which can be deployed by service
providers to provision storage for cloud services. In the portal view, the management
interface is highlighted with a rectangle.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 268
Once a service provisioning or management request is placed in the cloud portal, the portal
routes the request to the orchestration layer, where appropriate workflows are triggered to
fulfill the request. The orchestration layer is the automation engine of the cloud
infrastructure, which defines standardized workflows for process automation. The workflows
help orchestrate the execution of various system functions across the cloud infrastructure to
fulfill the request. Service orchestration is detailed later in this module.
A request typically carries a request identifier (ID), the user's subscription ID, and the
user's choices. The request ID is used as a reference during interaction between the portal
and the orchestration layer. The subscription ID is used to authenticate and authorize an
operation. Users submit their choices, such as the service name, required resources, and the
location to host a service, while ordering or modifying services. The cloud portal also
receives responses from the orchestration layer about the service operation status (such as
completed, failed, or still in progress) and updated service information. The portal presents
this information to the intended users.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 269
A cloud portal personalizes the portal content for each user. Personalization of portal
content helps display information for each user based on user identity, role, organization,
and other access rights, so that each user has his or her own view of the portal content.
Personalization of the cloud portal is necessary for meeting the security and compliance
requirements associated with ordering and managing services.
A cloud portal commonly has the capability to personalize portal content automatically. This
is accomplished by setting policies that govern who can see particular portal content and who
can order or modify a particular service. Based on these policies, personalized content is
populated in the portlets.
The slide provides a sample set of roles and their profiles that can be used to set a
personalization policy.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 270
This lesson covered the cloud portal functions – presentation and interaction with
orchestration layer, and personalization of cloud portal.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 271
This lesson covers various cloud interface standards for portability and interoperability
across multiple cloud service providers. It also covers common protocols for service access.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 272
Standardization of cloud interfaces is the process of formulating a norm or specification for
common and repeated use in the development and implementation of cloud interfaces, in order to
achieve uniformity across service providers. Standardization establishes conformity to a
specific feature set or quality level and enhances portability and interoperability across
clouds.
Portability in the cloud means the ability to migrate data and applications from one cloud to
another without the need to recreate data or significantly modify applications. Interface
standardization helps port applications and data from one service provider to another without
vendor lock-in issues and at an acceptable cost.
Interoperability in the cloud means the ability to communicate, run software, and transfer
data among multiple clouds through a uniform cloud interface. Interface standardization allows
consumers to use their data and applications across multiple clouds with a common cloud
interface.
Although we are still in the early days of cloud interface standards, some new interface
standards have emerged that provide these portability and interoperability advantages. These
standards are described in the subsequent slides.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 273
Topology and Orchestration Specification for Cloud Applications (TOSCA), developed by the
Organization for the Advancement of Structured Information Standards (OASIS), standardizes the
language used to define a cloud service. The standard defines both the service structure
(service components and their relationships) and the operational behavior of the service,
independent of any particular cloud service provider or hosting technology. The structure of a
service is modeled in a topology graph (shown in the slide), which includes nodes (service
components) and their relationships. For example, a business application is hosted on a Web
server, the Web server is hosted on an OS, which in turn is hosted on a VM. The operational
behavior of a service is specified as plans, which are workflows for orchestrating management
operations such as deployment, modification, patching, and termination of services. Both the
topology and the plans are portable and can be interpreted by TOSCA-compliant cloud
environments. This facilitates portable deployment of services to any compliant cloud.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 274
Open Virtualization Format (OVF) from the Distributed Management Task Force (DMTF) is an open
standard for the packaging and distribution of virtual appliances (preconfigured VMs that are
ready to run on a hypervisor and typically include a preinstalled guest OS and the application
software to be run in the VMs). OVF enables the packaging and deployment of services as
virtual appliances and facilitates portability between various cloud platforms. An OVF package
includes metadata about the VMs, such as the number of processors, the memory required to run
the applications, and network configuration information. This metadata can be used by a cloud
platform to deploy a service. An OVF package may also contain digital signatures to ensure the
integrity of the VMs being deployed, along with licensing information in the form of a EULA
(End User License Agreement).

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 275
Cloud interface standards for interoperability are described below:
• Open Cloud Computing Interface (OCCI) from the Open Grid Forum is a set of specifications
for an IaaS management interface. The specifications can be applied to implement a
vendor-neutral interface for managing compute, network, and storage resources provided as a
service. The OCCI specification can also be extended to support PaaS and SaaS management
interfaces.
• Cloud Infrastructure Management Interface (CIMI) from the Distributed Management Task Force
(DMTF) specifies a standard management interface for IaaS offerings that allows consumers to
manage their resource usage. CIMI enables interoperability between consumers and multiple
providers that all offer the standard CIMI interface for managing cloud infrastructure,
thereby avoiding vendor lock-in.
• Cloud Data Management Interface (CDMI) from the Storage Networking Industry Association
(SNIA) provides a standard for both the management interface and the functional interface of a
storage service. The functional interface enables an application to create, retrieve, update,
and delete data in the cloud. The management interface can be used for managing containers of
data, user accounts, access control, and billing.
• Cloud Application Management for Platforms (CAMP) from the Organization for the Advancement
of Structured Information Standards (OASIS) defines a management interface standard for PaaS
that can be used to package, deploy, and manage applications on a cloud platform. CAMP is
still under development; however, with support from several industry participants, the OASIS
technical committee is making rapid progress toward producing a workable CAMP specification.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 276
Cloud services are usually accessed using Web services, which allow a client application
(service requestor) to request data and computations from a Web server (service provider) and
the service provider to return responses. The client application could be a Web browser, an
onsite Web service application, or a Web service application deployed on a cloud platform. Web
services enable these client applications to communicate with Web servers in a cloud
infrastructure that present cloud interfaces through the use of standard Web protocols,
commonly HTTP.
Web services provide a standard means of interoperating between different software
applications running on a variety of platforms. This allows different applications to
communicate through a Web service and eliminates the dependency on a specific programming
language. For example, an ASP.NET application running on a Windows server can use a Web
service to communicate and exchange data with a Java application running on a Linux server.
The Web services in cloud computing environments are primarily based on:
• Simple Object Access Protocol (SOAP)
• Representational State Transfer (REST)
These are described next.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 277
Simple Object Access Protocol (SOAP) is a protocol for exchanging structured information
between applications (service requestor and service provider). It provides a messaging
framework that allows applications to pass messages back and forth in the implementation of
Web services over a network.
SOAP uses Extensible Markup Language (XML) to format messages, which are commonly transferred
using HTTP. SOAP follows the HTTP request and response message model, providing a SOAP request
in an HTTP request and a SOAP response in an HTTP response. SOAP specifies the binding of the
HTTP header and the XML file so that an application on one compute system can call an
application on another compute system and pass it information over HTTP. It also specifies how
the called application can return a response over HTTP. Being platform and language
independent, SOAP enables applications running on different OSs and developed with different
technologies and programming languages to communicate with each other.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 278
A SOAP message is an XML document containing the following elements (a sketch of building and
sending such a message follows the list):
• Envelope element: The required (shown using a solid box) Envelope element is the root
element of a SOAP message and defines the XML document as a SOAP message.
• Header element: The optional (shown using a dotted box) Header element contains information
that might be of use to the SOAP receiver of the message. For example, the Header element can
be used to specify authentication information for a service; likewise, it can be used to
specify an account number for pay-per-use services. If the Header element is present, it must
be the first child element of the Envelope element.
• Body element: The required (shown using a solid box) Body element contains the information,
a request or a response, intended for the ultimate receiver of the message. The ultimate
receiver parses the request or response to determine which operation to invoke (by the service
provider) or to obtain the requested information (by the service requestor), respectively.
• Fault element: The optional (shown using a dotted box) Fault element contains error codes
and error messages. If a Fault element is present, it must appear as a child element of the
Body element.
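The Python sketch below builds a SOAP envelope with a Header and a Body and sends it in an HTTP request. The endpoint URL, the operation name (GetServiceStatus), and the account/service identifiers are hypothetical values used only to illustrate the message structure.

```python
import urllib.request

soap_body = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Header>
    <AccountNumber>0005-1543-9754</AccountNumber>  <!-- e.g. a pay-per-use account -->
  </soap:Header>
  <soap:Body>
    <GetServiceStatus>                             <!-- hypothetical operation -->
      <ServiceName>myservice-109</ServiceName>
    </GetServiceStatus>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "http://www.serviceprovider.org/soap",          # hypothetical SOAP endpoint
    data=soap_body.encode("utf-8"),
    headers={"Content-Type": "application/soap+xml; charset=utf-8"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))          # the SOAP response envelope (XML)
```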

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 279
Representational State Transfer (REST) is a client-server software architectural style for
distributed hypermedia systems such as hypertext, audio, video, and image. The
architectural principles defined in REST are used for developing Web services. A Web service
based on REST is called a RESTful Web service. REST makes use of existing Web protocols,
commonly HTTP.
The implementation of RESTful Web services follows four basic design principles. These are
listed in the slide.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 280
In the REST architectural style, all objects, such as functions and data, exposed by Web
services are treated as resources and are uniquely identified by their Uniform Resource
Identifiers (URIs). A URI is a string of characters that commonly provides a Web address to
identify a resource. For simplicity and usability, URIs typically have a directory-like
structure. For example, a list of cloud services (resources) subscribed to by a consumer
organization may have a structure such as
http://www.serviceprovider.org/{subscription-id}/services/{service-name}. In this example,
"subscription-id" and "service-name" are the parameters that define the search criteria used
by a Web server to find a set of matching resources. There is an almost limitless set of valid
URIs that can be used to access resources to the finest level of granularity.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 281
The REST style emphasizes that the interactions between a service requestor application
(client) and a service provider application (server) should use a limited number of methods
provided by the HTTP protocol. RESTful Web services allow manipulation (create, read, modify,
and delete) of resources by using a set of simple, well-defined HTTP methods, commonly PUT,
GET, POST, and DELETE. PUT creates a new resource, which can then be deleted by using DELETE.
GET retrieves the current state of a resource in some representation. POST transfers a new
state (such as an append or a modification) onto a resource. For example, a user, through a
service requestor, can issue a GET or DELETE method on the URI
http://www.serviceprovider.org/0005-1543-9754/services/myservice-109 to read the attributes
of, or delete the subscription to, the specified (myservice-109) cloud service, respectively.
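The following Python sketch issues the GET and DELETE requests from the example above using the standard library; the service provider, URI, and returned representation are hypothetical, and the JSON format shown is only one of the representations a real service might offer.

```python
import json
import urllib.request

BASE = "http://www.serviceprovider.org/0005-1543-9754/services"   # URI structure from the example

# GET retrieves the current representation of the resource (here, JSON is requested).
get_req = urllib.request.Request(
    f"{BASE}/myservice-109",
    headers={"Accept": "application/json"},
    method="GET",
)
with urllib.request.urlopen(get_req) as resp:
    service = json.loads(resp.read())   # attributes of the myservice-109 subscription
    print(service)

# DELETE removes the subscription for the specified service instance.
del_req = urllib.request.Request(f"{BASE}/myservice-109", method="DELETE")
with urllib.request.urlopen(del_req) as resp:
    print(resp.status)                  # e.g. 204 No Content on success
```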

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 282
A resource representation typically reflects the current state of a resource at the time an
application requests it. For example, when a Web browser gets and displays the resource
content that constitutes a HyperText Markup Language (HTML) Web page, it is getting a
representation of the current state of that resource. Resource representations in this sense
are just snapshots in time. A service requestor performs actions on a resource by using a
representation of that resource. A representation has enough information to manipulate the
resource.
Resources are decoupled from their representations so that their content can be accessed in a
variety of formats, such as HTML, XML, plain text, JavaScript Object Notation (JSON), and so
on. For example, GET http://www.serviceprovider.org/0005-1543-9754/services/myservice-109
could return XML content, as shown on the slide. An application can negotiate the
representation format that is appropriate for it. The use of these standard formats allows
RESTful Web services to be used by applications written in different languages and running on
different computing platforms.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 283
The REST architecture is based on stateless interaction with resources; that is,
self-contained request messages. A RESTful Web service application (service requestor)
includes within the HTTP request all of the data needed by the server-side component (service
provider) to generate a response. This eliminates the need to store application state at the
service provider between requests and to retrieve that state while processing a request.
Statelessness improves Web service performance because it offloads the responsibility of
maintaining application state to the requesting application, saving server-side resource
utilization.
The slide illustrates both stateful and stateless Web services. In both cases, an application
needs to retrieve the content of the next Web page (page 2) in a multipage presentation of
cloud service information. In the case of a stateful service, the application requests the
next page, assuming that the Web service keeps track of where the application left off while
navigating the presentation. The Web server stores a variable for the previous page in order
to be able to respond to requests for the next one. In the case of a stateless service, the
application request must include the actual number of the page to retrieve.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 284
This lesson covered the interface standards for portability – TOSCA and OVF, and the
interface standards for interoperability – OCCI, CIMI, CDMI, and CAMP. It also covered SOAP
and REST.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 285
This lesson covers the definition of service orchestration, system integration using
orchestration software, application programming interface (API), and use cases of
orchestration.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 286
Service orchestration refers to the automated arrangement, coordination, and management
of various system or component functions in a cloud infrastructure to provide and manage
cloud services. Service orchestration, unlike a single automated activity, is not associated with a specific system. Instead, it may span multiple systems located at different sites, depending on the size of the cloud infrastructure.
Cloud service providers typically deploy a purpose-designed orchestration software or
orchestrator that orchestrates the execution of various system functions. The orchestrator
programmatically integrates and sequences various system functions into automated
workflows for executing higher-level service provisioning and management functions
provided by the cloud portal. The orchestration workflows are meant not only for fulfilling requests from consumers but also for administering the cloud infrastructure, for example adding resources to a resource pool, handling service-related issues, scheduling backups for a service, billing, and reporting.
Orchestration of system functions saves service provisioning time, eliminates the possibility
of manual errors, reduces operating expenses, and simplifies cloud infrastructure
management. Although some manual steps (performed by cloud administrators) may be
required while processing the service provisioning and management functions, service
providers are looking to automate these functions as much as possible.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 287
System integration is the connection of multiple system or component functions, which are
essential for carrying out self-service provisioning and management of services, into a
workflow. The orchestrator (see figure in the slide) provides system integration capability.
The orchestrator enables defining workflows that logically integrate various system
functions to automate provisioning and management of cloud services. A cloud portal interacts with the orchestrator and transfers service requests to it. The orchestrator, in turn, interacts with the appropriate systems based on pre-defined workflows, coordinates and sequences the execution of functions by these systems, and responds to the cloud portal with updated service information.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 288
An orchestrator commonly interacts with other software components and devices (through
device’s native operating system and management software) in a cloud infrastructure using
application programming interfaces (APIs) defined for these components. An API is a source-code-based specification intended to be used by software components as an interface to communicate with each other. It specifies a set of system or component functions that can be called from a software component to interact with other software components.
An orchestrator programmatically integrates and calls various API functions for deploying and managing cloud services based on pre-defined workflows. For example, when a consumer orders a cloud service through a cloud portal, an appropriate workflow is triggered by the orchestrator. According to the workflow, the orchestrator uses a specific API function (for example, API_Function_Appr) to send the request to an approval system that validates the request based on a pre-defined policy. If the consumer needs to enter credit card information, the orchestrator uses another API function (for example, API_Function_Credit) to send the consumer's credit card information to a banking application that verifies whether this information is correct. Once payment is confirmed, the banking application sends a response back to the orchestrator acknowledging the success of the payment. Based on the validation and payment, the orchestrator calls an API function (for example, API_Function_Control) of the unified manager to send provisioning requests. The slide shows these API function calls from the orchestrator for order-to-delivery automation.
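A minimal sketch of that order-to-delivery sequence in Python, reusing the hypothetical API function names from the example above; the signatures, return values, and stub behavior are assumptions, not a real orchestrator API:

```python
# Stub API functions standing in for calls to the approval system, the
# banking application, and the unified manager described in the text.
def api_function_appr(request):
    # Stand-in: assume the approval system accepts the request.
    return True

def api_function_credit(card_info):
    # Stand-in: assume the banking application confirms the payment.
    return True

def api_function_control(request):
    # Stand-in: ask the unified manager to provision resources.
    print(f"Provisioning resources for {request}")

def order_to_delivery(request, card_info=None):
    if not api_function_appr(request):
        return "rejected"
    if card_info is not None and not api_function_credit(card_info):
        return "payment failed"
    api_function_control(request)
    return "provisioned"

print(order_to_delivery("db2-platform-order", card_info="card-details"))
```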

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 290
The orchestrator enables cloud administrators and architects to launch a pre-defined workflow from the orchestrator's built-in library or to define a new workflow and then launch
the workflow. The workflow serves as the reference point during service provisioning and
management operations. The orchestrator automatically sequences and triggers system
functions included in the workflow.
Orchestrators commonly provide interfaces to model workflows similar to a flow diagram. A
workflow includes multiple related API functions and logical connections between them.
Common elements in a workflow, illustrated in the sketch after this list, are:
• Start: It is the starting point of a workflow. A workflow can have only one start element.
• Condition: It consists of conditional branches. It takes one or more input parameters and returns either true or false. It allows a workflow to branch in different directions, depending on the input parameters.
• Action: It is an activity that is executed by calling an API function. It takes one or more input parameters and returns a value when it completes its run. Multiple activities can be executed in sequence or simultaneously.
• Waiting time: The workflow waits for a given time period or until a certain date/time has passed, at which point it resumes running.
• Waiting event: The workflow waits for a specific event before resuming.
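A minimal sketch, in Python, of how these elements might be composed; this is a plain-code illustration under assumed names, not the graphical model of any particular orchestrator:

```python
import time

def condition_capacity_available(requested_gb, free_gb):
    # Condition element: takes input parameters and returns true or false.
    return free_gb >= requested_gb

def action_allocate_storage(requested_gb):
    # Action element: an activity executed by calling an API function (stubbed here).
    print(f"Allocating {requested_gb} GB")
    return "allocated"

def workflow(requested_gb, free_gb):
    # Start element: the single entry point of the workflow.
    if condition_capacity_available(requested_gb, free_gb):
        return action_allocate_storage(requested_gb)
    # Waiting-time element: pause briefly, then the workflow resumes.
    time.sleep(1)
    return "waiting for capacity"

print(workflow(100, free_gb=500))
```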
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 291
In this use case, a consumer logs on to the cloud portal and orders a DB2 database platform
(PaaS) from the service catalog. The database platform is ordered to support the consumer's application. The request is routed to the orchestrator, which triggers a workflow to fulfill this
request. After this request is fulfilled, the consumer’s application can access the deployed
database as needed. The diagram in the slide shows a sample workflow defined in the
orchestrator to deploy DB2 database.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 293
In this use case, a consumer logs on to the cloud portal and requests customer relationship management (CRM) software as a service from the service catalog. The CRM software has dependencies on database and PHP modules. In this case, the provider uses MySQL to deploy the database. The CRM application and MySQL are hosted on different VMs that are interconnected via a VLAN. The orchestrator sequences the appropriate system functions in a workflow to fulfill this request. The diagram on the slide shows a sample workflow implemented in the orchestrator to automatically deploy the CRM application as a service.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 294
In this use case, the tenant administrator of a consumer organization requests that the
organization be removed from the provider's database. After the provider approves this request, it is routed to the orchestrator, which triggers an automated workflow to fulfill it. Once fulfilled, the consumer organization is removed from the provider's database and the rented resources are released.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 295
This lesson covered the definition of service orchestration, integration of system functions
by orchestrator, orchestrator APIs, and use cases of orchestration.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 296
This lesson covers the overview of cloud service lifecycle and service planning phase of
cloud service lifecycle.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 297
A cloud service lifecycle describes the life of a cloud service, from planning and optimizing
the service to align with the business strategy, through the design and deployment of the
service, to its ongoing operation and support. From inception to retirement, a cloud service
goes through four key phases: service planning, service creation, service operation, and
service termination. Each phase includes various activities as shown in this slide. These
phases are described in the subsequent slides.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 298
During service planning, making business case decisions for the cloud service offering
portfolio is key – regardless of whether those services will be supported by an internal IT
organization (generally referred to as IT as a Service) or by existing or new service providers
who are looking to add to their portfolio of offerings. Service planning considers business
requirements, market conditions and trends, competitive factors, required service quality
and attributes, and the addition of new capabilities when required.
Common activities during the service planning are listed in the slide. These activities are
described next.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 299
Assessment of service requirements should take into consideration both the business requirements and the IT capability of the provider organization as inputs. These inputs are described below:
Business requirements: These include both current and future business needs. The business requirements are commonly composed of:
• Line of business (LOB) requirements for implementing specific business functionalities. An LOB refers to a set of related products and services through which a business generates revenue. Some examples of LOB requirements are listed in the slide.
• Market demands driven by various factors as listed in the slide.
IT capability: This includes the current state of the provider's IT infrastructure, management processes, technology, and the expertise of the IT operational staff.
Cloud administrators or architects assess and identify potential service offerings that would
support the business requirements. This includes evaluating the services to be created and
upgraded, the necessary feature set for each service, and the service level objectives (SLOs)
of each service aligned to consumer needs and market conditions. SLOs are specific, measurable characteristics such as availability, throughput, frequency, and response time. They provide a measurement of the service provider's performance. SLOs are key elements of a service level agreement (SLA), which describes what service level will be provided, how it will be supported, the service location, and the responsibilities of the consumer and the provider.
Cloud administrators or architects also assess the IT readiness to provide new or upgraded
services. The assessment helps evaluate the current state of IT capability relative to the
capability required to deploy required services and to meet SLOs.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 300
Some key questions addressed during service assessment are listed in this slide. By
answering these questions, administrators or architects can determine service
requirements.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 301
Service requirements may or may not be commensurate with the capability of the provider's IT infrastructure and IT operations management team. Unless IT is ready to fulfill the service requirements, a service provider may not be able to implement the services required to meet its business goals. Hence, it is essential to assess IT readiness while assessing service requirements.
A common method to evaluate IT readiness is to create a balanced scorecard. A balanced
scorecard translates an organization's overall business strategy into specific, measurable
goals and helps monitor the organization's performance in terms of achieving these goals.
The balanced scorecard aims at:
• Viewing IT capability from four separate perspectives – internal process, learning and growth, financial, and consumer
• Developing metrics under each perspective
• Recording the service provider's strategic goal against each metric
• Recording the organization's performance against each metric, using data collected through interviews and surveys
• Identifying the gap between the strategic goals and the measured performance, and the areas where IT capability needs to improve
The slide shows the key areas covered by each perspective and an example of a balanced scorecard.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 302
A service enablement roadmap guides the strategic direction for fulfilling service requirements in the provider's organization. The roadmap is the framework for the needed transformation of IT infrastructure, operations, and management; for the necessary cloud solutions; for eliminating skill gaps among operational staff; and for the timeline to implement services – all contributing to the service requirements while safeguarding against deviation from the provider's business goals.
The service roadmap, shown in the slide, follows a five-step service framework. These steps
are described below:
1. Understand: This step involves understanding what business values the services can
bring and the challenges that need to be overcome. Providers may involve internal or
external cloud advisors that help accelerate understanding, promote consensus and
alignment among technical and business stakeholders, and set realistic up-front
expectations on services to be created or upgraded.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 303
Cloud billing is the process of generating bills from the resource usage data for cloud services using predefined billing policies. An example of a billing policy could be charging consumers $3 per hour for the use of one CPU, with a 25 percent discount on CPU-hours beyond 30 hours. Usage data for cloud services needs to be collected by a billing system that generates a bill for each consumer. Usage data is commonly collected from the control layer, which manages
the provisioning of resources to service instances. The billing system is responsible for
accurate measurement of the number of units of service usage per consumer and reports
price for the consumed units. In this way, it enables metering for cloud services, providing
transparency for both the provider and consumer of the utilized service.
In a private cloud, chargeback and showback are commonly used for billing. Chargeback is the ability to measure resource consumption per business unit and charge the units back accordingly. Showback is reporting, but not applying, the charge to the business units. With showback, business units are shown the charges, usually for business accounting purposes.
In either case, accurate measurement and reporting of resource usage are essential from
both consumer and provider perspectives.
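A minimal sketch of the example policy above ($3 per CPU-hour, with the 25 percent discount read as applying to CPU-hours beyond 30); the function and sample figures are illustrative assumptions:

```python
def compute_cpu_bill(cpu_hours, rate=3.0, discount=0.25, threshold=30):
    # Hours up to the threshold are billed at the full rate; hours beyond it
    # are billed at the discounted rate.
    base_hours = min(cpu_hours, threshold)
    extra_hours = max(cpu_hours - threshold, 0)
    return base_hours * rate + extra_hours * rate * (1 - discount)

print(compute_cpu_bill(20))  # 60.0  (no discount applied)
print(compute_cpu_bill(40))  # 112.5 (30 h at $3.00 + 10 h at $2.25)
```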

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 305
Establishing a billing policy requires consideration of the following factors:
• Provider’s business goals: Service providers make billing decisions based on their
business goals such as making a profit, justifying new capital spending, influencing
consumption behaviors by the business units, making IT more service aware, cost
conscious and accountable, and comparing IT costs to public service provider charges. For
instance, an organization managing an on-premise private cloud may issue showback
reports to its business units, with the intent to optimize their IT spending habits. Other
organizations may feel that spending habits will not be modified with just reports and
may prefer chargeback. Showback may also be used as an intermediate step towards an
eventual chargeback. Public cloud service providers generally charge consumers
according to resource usage for the service. However, they might opt to deliver a service
without any charge for promotional purposes or as a supplement to another service.
• Service-specific pricing strategy: Pricing strategy for each cloud service identifies the
costs related to a cloud service over the life of a service and plans for: recovery of the
costs, profit, meeting organization’s return-on-investment (ROI) and reinvestment goals,
and adjustments driven by competitive externally-provided services. Once all costs are
evaluated, providers should determine an appropriate chargeable unit for the service,
such as duration of use, amount of bandwidth, or amount of storage capacity. Ideally the
chargeable unit is readily associated with the business value the consumer perceives or
receives from the service. The provider determines the price per unit of usage to meet its
business goals. The provider may add some margin amount over per-unit cost to define
price per unit usage, or may establish price at the true cost of service, depending on the
provider’s business goals.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 306
This lesson covered the overview of cloud service lifecycle and its service planning phase
that includes assessing service requirements, developing service enablement roadmap, and
establishing billing policy.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 307
This lesson covers the service creation phase of cloud service lifecycle.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 308
During service creation, providers aim at defining services in the service catalog and creating
workflows for service orchestration. This allows a consumer to self-provision cloud services
from the cloud portal and the orchestrator to automate process execution according to the
workflows.
Common activities during service creation are listed in the slide. These activities are
described next.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 309
A service template is a collection of interrelated hardware and/or software components that
constitute a service and work together upon deployment of a service. The collection of
service components and their relationships are based on decisions that emanate from cloud
service planning. On receiving a service provisioning request, hardware and/or software
resources are allocated, configured, and integrated as per the service template to create an
instance of the service. In this way, a service template provides a standard to repeatedly
generate predictable instances of a specific service.
Service templates are generally defined in the service catalog, which is presented in the
cloud portal. The portal software provides standard interface to make entries about service
templates along with other information described later in the service offering section. From
a consumer's perspective, the service template specifics help them understand the hardware configuration, software, and protection mechanisms for a service. From a provider's perspective, the specifics provide guidelines for creating workflows for service orchestration.
The service template specifics may include fixed and/or customizable options. Customizable
options could include a choice of OS, the approach for data protection, or a mix of storage
and compute parameters. The options for service components are based on service
requirements assessed during service planning.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 310
Service template specifics commonly cover the following entities (a simple sketch combining them follows this list):
• Service structure: It specifies the structure of a service, including the service components and their relationships. While planning a service structure, an architect needs to map the service requirements into component requirements of the service. For example, a SaaS-type service requirement may be mapped into running a specific business application on a guest OS, deploying a database to support the application, and hosting the database and the application on VMs. A more complex example of a service structure is provided in the next slide.
• Service attribute: It specifies the configurations of the service components. For example, the attributes of a VM provided as a service component are the number of processors and their processing power, the memory size, and the number and size of attached disks.
• Service operation: It specifies the management operations, such as add, modify, start, and stop, that can be performed through the management interface of a service.
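A minimal sketch of how these three entities might be captured together, as a plain Python dictionary; the field names and values are illustrative assumptions, not a catalog or template standard:

```python
# Hypothetical service template combining structure, attributes, and
# operations for a simple two-VM service.
service_template = {
    "structure": {
        "components": ["app_vm", "db_vm"],
        "relationships": [("app_vm", "uses", "db_vm")],
    },
    "attributes": {
        "app_vm": {"vcpus": 2, "memory_gb": 4, "disks_gb": [40]},
        "db_vm": {"vcpus": 4, "memory_gb": 8, "disks_gb": [100, 100]},
    },
    "operations": ["add", "modify", "start", "stop"],
}
```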

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 311
This slide provides an example of service structure of a SaaS type of service. The service
requirements are:
• Deployment of accounting application
• Accounting application must scale based on workload variation, default two
application instances
• Accounting application must run on Apache Tomcat server
• Supporting services: database service and data backup service
The diagram in the slide illustrates a service structure based on these service requirements.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 312
Once a service template is defined, the workflows for service orchestration are created in
the orchestrator based on the template specification. These workflows enable automated
allocation, configuration, and integration of hardware and/or software resources for a
service as per the service template. As noted earlier in the service orchestration section, an
orchestration workflow consists of a series of inter-related API functions to fulfill service
requests invoked at the cloud portal. The cloud portal passes on the requests to the
orchestrator, which interacts with appropriate cloud infrastructure components according to
the workflow. This enables automated arrangement, coordination, and management of the
service provisioning and management functions. Service providers want to automate these
functions as much as possible. Reaching the point where a service is automated from
creation to termination is one of the ultimate goals of providers.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 313
Service offerings are available to consumers in the form of a service catalog presented in the
cloud portal. They are defined in the service catalog using portal software that provides an
administration interface to define service offerings. A service offering is the result of
combining service template specifics with constraints, policies, rules, pricing, and SLA.
• Constraints may restrict access to a service to certain individuals or groups.
• Policies are principles for treating service components on a per-service or per-consumer
basis. Policies could affect network and security areas such as bandwidth quotas, firewall
configuration, Quality of Service (QoS), and network traffic inspection for intrusion
detection and prevention.
• Rules are guidelines for deploying or scaling service instances, limiting resource allocation to the service instances, and setting the subscription period. Rules can also be imposed for compliance requirements, the disposition of consumer data upon service termination, and situations in which a consumer requests resources beyond a specified or authorized limit.
• Pricing, if applicable, specifies the ongoing service charge, charges associated with the
initial data transfer to the provider’s premises, and any additional fees for support and
recovery of data. These charges are determined during the service planning phase.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 314
Some key questions that should be addressed when creating an SLA are listed in this slide.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 316
The slide shows an example of an IaaS offering that combines service template specifics,
constraints, policies, rules, price, and SLA. This service offering includes compute and
storage components and is supported by a data backup service. It is defined in the service
catalog.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 317
The service contract is an agreement between the cloud service provider and the cloud
service consumer stating the terms of service usage. The two parties enter into a contract
describing agreements on pricing, SLAs, termination of service, and any specific
configuration options.
The service contract specifics are commonly presented on the cloud portal, which describes the terms and conditions for using a service. When a consumer wants to order or instantiate a service, the service contract must be established with the provider. By agreeing to the terms and conditions, the consumer establishes a contract for the requested service.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 318
This lesson covered the service creation phase of the cloud service lifecycle, which includes
defining service template, creating orchestration workflow, defining service offering, and
creating service contract.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 319
This lesson covers the service operation and service termination phases of cloud service
lifecycle.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 320
The service operation phase of the cloud service lifecycle comprises ongoing management
operations by cloud administrators to maintain cloud infrastructure and deployed services,
meeting or exceeding SLA commitments. Common activities in this phase are listed in the
slide. These activities are described next.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 321
Service assets include cloud infrastructure resources that contribute to the provisioning of a
service. Examples of service assets are: compute, storage, network, security components
(both physical and virtual), and business applications. Discovery of service assets provides
cloud administrators with visibility into each asset, including physical-to-virtual
dependencies and information about its configuration, connectivity, functions, performance,
capacity, availability, and utilization.
Discovery is key to monitoring cloud infrastructure resources. In order to manage service operations, a cloud administrator needs to monitor infrastructure resources; for that, the administrator must know what resources are deployed and be able to monitor them. The discovery activity automatically collects information on various service assets,
enabling administrators to monitor these assets centrally. Depending on the size of the
cloud environment, discovery may involve gathering information about a large number of
service assets and mapping many physical-to-virtual relationships.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 322
Discovery is performed using a specialized tool. A discovery tool is either bundled with management software as a feature or is independent software that passes discovered information to management software. The discovery tool commonly interacts with service assets and collects the necessary data from them through native APIs defined for these assets. An
alternative discovery method is to install software agents on the service assets to expose
information to the discovery tool.
Discovery is typically scheduled by setting an interval for its periodic occurrence. Depending
on provider’s business requirements, a cloud administrator can change how often discovery
runs. Discovery may also be initiated by a cloud administrator or be triggered by an
orchestrator when a change occurs in the cloud infrastructure. Although a discovery tool discovers all service assets automatically by default, an administrator may also manually add service assets for discovery. Manual addition is useful if administrators want to monitor only a subset of service assets.
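A minimal sketch of scheduled, API-based discovery under assumed names; real tools query each asset's native API or an installed agent, as described above, and do far more than this:

```python
import time

def collect_via_api(asset):
    # Stand-in for a native-API query returning configuration/status data.
    return {"asset": asset, "status": "ok"}

def run_discovery(assets):
    return [collect_via_api(a) for a in assets]

def discovery_loop(assets, interval_seconds=3600, cycles=1):
    # Discovery is typically scheduled to recur at a set interval.
    for _ in range(cycles):
        inventory = run_discovery(assets)
        print(f"Discovered {len(inventory)} service assets")
        time.sleep(interval_seconds)

discovery_loop(["compute-01", "array-01", "switch-01"], interval_seconds=1)
```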

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 323
Before performing the discovery activity, a few questions need to be addressed by the
provider. These are listed in the slide.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 324
Service operation management involves all service-related operations across cloud
infrastructure. It has a service-based focus, meaning that all operations performed on a
cloud infrastructure have a goal to meet service requirements and SLA. Service operation
management aims at ensuring and restoring service levels while continuously optimizing
management operations to increase efficiency and reduce cost. It ensures that management
operations are efficient by using as few resources as needed without deviating from SLA
commitments.
The key management operations include the following:
• Monitoring and reporting: Monitoring is a continuous operation for tracking the status of
services and service assets. The parameters that are monitored include configuration,
security, performance, availability, and capacity. Reports are formulated from data
gathered through monitoring. Reporting helps to organize and present current status,
historical trends, operation log, and bills associated with services.
• Provisioning: It configures and allocates the hardware, software, and other resources
required to deliver services. Provisioning primarily involves resource management to meet the capacity, availability, performance, and security requirements of cloud services.
• Troubleshooting: It resolves service-related issues in the cloud infrastructure so that
services can maintain their operational state.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 325
The final phase of the cloud service lifecycle is service termination. Terminating a service
removes a service instance previously assigned to a consumer. When a given service
instance is terminated, resources allocated to the service instance should be automatically
returned to the resource pool for redeployment to other service instances. Service
termination must be achieved without impacting other instances of the service and without
compromising consumers’ data, resources, and confidential information.
A cloud service may be terminated in the following scenarios:
• Natural termination by contract agreement
• Provider or consumer initiated termination

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 327
Some common reasons for a provider to terminate the service include: business
circumstances where the provider can no longer offer the service, a disaster, or significant
consumer violation of contract terms. New hardware deployment or software upgrade by
the provider could also be a reason for service termination. Typically, service providers do
not want to maintain multiple versions of software and varying hardware platforms. Instead,
they may offer options such as migration to the new hardware and software, or a refund, or
limited use of end-of-life service.
Some common reasons for a consumer to terminate the cloud service are: business
circumstances where the consumer no longer needs the service, the service requirement
was planned to be temporary, or service performance or support is not acceptable. Typically,
termination by the consumer is easily and quickly performed through the cloud portal.
However, termination requests may be processed through a validation system and only
those tenant administrators/users who have the required authority are allowed to terminate
a service. There could be billing implications too (e.g., pro-rated charges and early
termination fee), depending on the terms in the service contract.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 328
The service termination phase usually consists of the following activities:
• De-provisioning service instance: De-provisioning halts a service instance and releases
resources from it. Automating the termination process shortens de-provisioning time.
The de-provisioning activity ensures that no resources are orphaned and thus unavailable
for use by other service instances.
• Handling consumer’s data as per contract terms: The service contract typically includes a
termination agreement that defines the terms and conditions for terminating a service.
The termination agreement describes what happens to the consumer data if a service is
terminated or if the provider goes out of business. Based on the agreement, providers
may return data to the consumers upon termination within a specified time duration.
Some providers may also offer a data shredding service to permanently destroy all traces – including backup or secondary copies – of consumer data from the provider's premises. Such disposal of consumer data should be verifiable by the consumer in order to guard against its malicious or unintended misuse.
• Providing final billing report: The final billing report includes service charges up to the time of service termination. Depending on the terms in the service contract, a premature
termination may incur early termination fees and non-refund of any subscription fee
intended to cover the unused portion of the subscription period.
• Asking for feedback to improve service: Seeking feedback provides an opportunity to understand the consumers' viewpoint about the services, which can help a provider improve further. Consumer feedback is considered a key input when planning for a new service or making a decision to upgrade a service.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 329
This lesson covered the service operation phase of cloud service lifecycle, which includes
discovering service assets and managing service operations. It also covered the service
termination phase of cloud service lifecycle, which includes de-provisioning of service
instance.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 330
The Concepts in Practice section covers VMware vCenter Orchestrator.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 331
VMware vCenter Orchestrator is orchestration software that helps automate provisioning and operational functions in a cloud infrastructure. It comes with a built-in library of pre-
defined workflows as well as a drag-and-drop feature for linking actions together to create
customized workflows. These workflows can be launched from VMware vSphere client, from
VMware vCloud Automation Center or through various triggering mechanisms. vCenter
Orchestrator can execute hundreds or thousands of workflows concurrently.
vCenter Orchestrator can be installed as a virtual appliance or on a Windows server. The
vCenter Orchestrator virtual appliance significantly reduces the time and skill required to
deploy vCenter Orchestrator and provides a low-cost alternative to the traditional Windows-
based installation.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 332
This module covered service catalog, service ordering process, management and functional
interfaces of services, and cloud portal and its functions. It also covered various cloud
interface standards along with SOAP and REST. Additionally, this module covered service
orchestration including system integration and workflow modeling and cloud service
lifecycle comprising four phases – service planning, service creation, service operation, and
service termination.

Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 333
Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 334
Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 335
Copyright © 2014 EMC Corporation. All rights reserved Module 6: Service and Orchestration Layers 336
This module focuses on the importance of business continuity (BC) and how it helps achieve the required cloud service availability. This module also focuses on various fault tolerant
mechanisms to eliminate single points of failure in the cloud infrastructure. This module
further discusses various data protection solutions such as backup and replication. Finally,
this module covers the key design strategies for cloud application resiliency.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 337
As the business environment changes rapidly, consumers expect cloud services to be available whenever they need them. Cloud service providers usually publish a service level agreement (SLA) for each of their service offerings, which includes service availability, performance, and downtime compensation. Therefore, it is critical for service providers to deliver services to consumers in accordance with these SLAs. Preparation and planning are essential to minimize and cope with cloud service outages. When building a cloud infrastructure, a business continuity (BC) process must be defined to meet the availability requirements of the services.
In a cloud environment, it is important that the BC process supports all the layers – physical, virtual, control, orchestration, and service – to provide uninterrupted services to the consumers. BC processes are automated through orchestration to reduce manual intervention; for example, if a service requires a VM backup every 6 hours, the VM backup is scheduled automatically every 6 hours.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 338
This lesson covers introduction to business continuity (BC) and cloud service availability. This
lesson also covers causes of service unavailability and impact due to service unavailability.
Further, this lesson covers various ways to achieve required cloud service availability.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 339
Business continuity is a set of processes that includes all activities that a business must
perform to mitigate the impact of service outage. BC entails preparing for, responding to,
and recovering from a system outage that adversely affects business operations. It describes
the processes and procedures a service provider establishes to ensure that essential
functions can continue during and after a disaster. It enables continuous availability of
services in the event of outages. Business continuity prevents interruption of mission-critical
services, and reestablishes impacted services as swiftly and smoothly as possible by using an
automated process. BC involves proactive measures, such as business impact analysis, risk assessment, building a resilient IT infrastructure, and deploying data protection solutions (backup and replication). It also involves reactive countermeasures, such as disaster recovery, to be invoked in the event of a service failure. Disaster recovery (DR) is the coordinated process of restoring the IT infrastructure, including the data required to support ongoing cloud services, after a natural or human-induced disaster occurs. The basic underlying concept of DR is to have a secondary data center or site (DR site) at a pre-planned level of operational readiness, to be used when an outage happens at the primary data center.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 340
Cloud service availability refers to the ability of a cloud service to perform its agreed
function according to business requirements and customer expectations during its
operation. Cloud service providers need to design and build their infrastructure to maximize
the availability of the service, while minimizing the impact of an outage on consumers.
Cloud service availability depends primarily on the reliability of the cloud infrastructure
(compute, storage, and network) components, business applications that are used to create
cloud services, and the availability of data. The time between two outages, whether scheduled or unscheduled, is commonly referred to as uptime, because the service is available during this time. Conversely, the time elapsed during an outage (from the moment a service becomes unavailable to the moment it is restored) is referred to as downtime.
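As a small worked example of the common availability calculation implied by these terms, availability = uptime / (uptime + downtime); the figures below are illustrative only:

```python
def availability(uptime_hours, downtime_hours):
    # Fraction of elapsed time during which the service was available.
    return uptime_hours / (uptime_hours + downtime_hours)

# Example: 8,756 hours up and 4 hours down over a year is roughly 99.95%.
print(f"{availability(8756, 4):.4%}")
```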

(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 341
The slide lists some of the key causes of service unavailability. Data center failure is not the only cause of service failure. Poor application design or a resource configuration error can also lead to a service outage. For example, if the web portal is down for some reason, the services are inaccessible to the consumers, which leads to service unavailability. Unavailability of data, due to factors such as data corruption and human error, also leads to service unavailability. A cloud service might also cease to function due to an outage of the services it depends on.
Perhaps even more impactful on availability are the outages that are required as a part of the normal course of doing business. The IT department is routinely required to take on activities such as refreshing the data center infrastructure, migration, running routine maintenance, or even relocating to a new site. Any of these activities can have its own significant and negative impact on service availability.

Note:
In general, the outages can be broadly categorized into planned and unplanned. Planned
outages may include installation and maintenance of new hardware, software upgrades or
patches, performing application and data restores, facility operations (renovation and
construction), and migration. Unplanned outages include failure caused by human errors,
database corruption, failure of physical and virtual components, and natural or human-made
disasters.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 343
Cloud service unavailability or a service outage results in loss of productivity, loss of revenue, poor financial performance, and damage to reputation. Loss of revenue includes direct loss, compensatory payments, future revenue loss, billing loss, and investment loss. Poor financial performance affects revenue recognition, cash flow, discounts, credit rating, and stock price. Damage to reputation may result in a loss of confidence or credibility with customers, suppliers, financial markets, banks, and business partners. Other possible
consequences of service outage include the cost of additional equipment rental, overtime,
and extra shipping.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 344
With the aim of meeting the required service availability, the service provider should build a resilient cloud infrastructure. Building a resilient cloud infrastructure requires the following high availability solutions:
• Deploying redundancy at both cloud infrastructure component level and site (data
center) level to avoid single points of failure
• Deploying data protection solutions such as backup and replication
• Implementing automation in cloud service failover
• Architecting resilient cloud applications
For example, when a disaster occurs at one of the service provider's data centers, BC triggers the DR process. This process typically involves both operational personnel and automated procedures in order to reactivate the service (application) at a functioning data center. This requires the transfer of application users, data, and services to the new data center, and involves the use of redundant infrastructure across different geographic locations, live migration, backup, and replication solutions.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 345
This lesson covered the need for business continuity and cloud service availability. This
lesson also covered causes of service unavailability and impact due to service unavailability
to service providers. Further, this lesson covered various ways to achieve required cloud
service availability including implementing fault tolerant mechanisms, deploying data
protection solutions such as backup and replication, implementing automation in cloud
service failover, and architecting resilient cloud applications.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 346
This lesson covers identifying and avoiding single points of failure. This lesson also covers
key fault tolerance mechanisms at cloud infrastructure component level.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 347
A single point of failure refers to any individual component or aspect of an infrastructure whose failure can make the entire system or service unavailable. Single points of failure may occur at the infrastructure component level and at the site (data center) level. The figure on the slide
illustrates an example where various cloud infrastructure components including the
compute system, VM instance, network devices, storage, and site itself become a single
point of failure. The compute systems may run cloud services including consumer
applications, the web portal, and so on. Assume that the consumer's service (a web application) runs on a VM instance and uses a database server running on another VM to store and retrieve application data. If the database server is down, the application would not be able to access the data, which in turn impacts the availability of the service. Consider
another example where consumers are connected to the provider’s location through
the Internet. These consumers are generally connected to a web portal, which then forwards the connection to a secure client gateway. If the web portal is down due to an underlying compute failure, the consumer will not be able to access the required services. This results in service unavailability. Therefore, it is important for the service provider to build a fault
tolerant cloud infrastructure that avoids single points of failure in the environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 348
Highly available infrastructures are typically configured without single points of failure to
ensure that individual component failures do not result in service outage. The general
method to avoid single points of failure is to provide redundant components for each
necessary resource, so that a service can continue with the available resource even if a
component fails. A service provider may also create multiple service availability zones (discussed later in the module) to avoid single points of failure at the data center level. Usually, each zone is isolated from the others, so that the failure of one zone does not impact the other zones. It is also important to have high availability mechanisms that enable automated service failover within and across the zones in the event of a component failure, data loss, or disaster.

Note:
N+1 redundancy is a common form of fault tolerance mechanism that ensures service
availability in the event of a component failure. A set of N components has at least one
standby component. This is typically implemented as an active/passive arrangement, as the
additional component does not actively participate in the service operations. The standby
component is active only if any one of the active components fails. N+1 redundancy with
active/active component configuration is also available. In such cases the standby
component remains active in the service operation even if all other components are fully
functional. For example, if active/active configuration is implemented at site level, then a
cloud service is fully deployed in both the sites. The load for this cloud service is balanced
between the sites. If one of the sites is down, the available site handles the service operations and the workload.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 349
The underlying cloud infrastructure components (compute, storage, and network) should be
highly available and single points of failure at component level should be avoided. The
example shown on the slide represents an infrastructure designed to mitigate the single
points of failure at component level. Single points of failure at compute level can be avoided
by implementing redundant compute systems in a clustered configuration. Single points of
failure at network level can be avoided via path and node redundancy and various fault
tolerance protocols. Multiple independent paths can be configured between nodes so that if
a component along the main path fails, traffic is rerouted along another. The key techniques
for protecting storage from single points of failure are RAID, erasure coding techniques,
dynamic disk sparing, and configuring redundant storage system components. Many storage systems also support a RAIN (redundant array of independent nodes) architecture to improve fault tolerance. The following slides discuss the various fault tolerant mechanisms as
listed in the slide to avoid single points of failure at component level.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 350
Compute clustering is one of the key fault tolerance mechanisms that provides continuous
availability of service even when a VM instance, physical compute systems, OS or hypervisor
fails. Clustering is a technique where at least two compute systems (or nodes) work
together and are viewed as a single compute system to provide high availability and load
balancing. If one of the compute systems fails, service running in the compute system can
failover to another compute system in the cluster to minimize or avoid any service outage.
The two common cluster implementations are active/active and active/passive. In
active/active clustering, the nodes in a cluster are all active participants and run the same service for their clients. The active/active cluster balances requests for the service among the
nodes. If one of the nodes fails, the surviving nodes take the load of the failed one. This
method enhances both performance and availability of a service. The nodes in the cluster
have access to shared storage volumes. In active/active clustering only one node can write
or update the data in a shared file system or database at a given time. In active/passive
clustering the service runs on one or more nodes, and the passive node just waits for a
failover. If and when the active node fails, the service that had been running on the active
node is failed over to the passive node. Active/passive clustering does not provide
performance improvement like active/active clustering. Clustering uses a heartbeat
mechanism to determine the health of each node in the cluster. The exchange of heartbeat
signals, which usually happens over a private network, allows participating cluster members to
monitor one another’s status. Clustering can be implemented between multiple physical
compute systems, or between multiple VMs, or between VM and physical compute system,
or between multiple hypervisors. Hypervisor clustering is the common clustering
implementation in a cloud environment.
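A minimal sketch of the heartbeat idea under assumed names (a timestamp of the last heartbeat per node and a fixed timeout); real clusters exchange heartbeats over a private network and apply more elaborate membership policies:

```python
import time

HEARTBEAT_TIMEOUT = 15  # seconds without a heartbeat before a node is presumed failed

def is_healthy(last_heartbeat, now=None):
    now = time.time() if now is None else now
    return (now - last_heartbeat) <= HEARTBEAT_TIMEOUT

def surviving_nodes(last_heartbeats):
    # Nodes still considered alive; services on any missing node are failed over.
    return [node for node, ts in last_heartbeats.items() if is_healthy(ts)]

now = time.time()
print(surviving_nodes({"node-a": now, "node-b": now - 60}))  # ['node-a']
```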

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 351
Hypervisor clustering is an implementation where multiple hypervisors running on different
compute systems are clustered. This method provides high availability for services running
on VMs by pooling the virtual machines and the compute systems they reside on into a cluster.
If a physical compute system running a VM fails, the VM will be restarted on another
compute system in the cluster. This method provides rapid recovery of services running on
VMs in the event of compute system failure.
In some hypervisor clustering implementations, the hypervisor uses its native technique to
provide continuous availability of services running on VMs even if a physical compute
system or a hypervisor fails. In this implementation, a live instance (i.e., a secondary VM) of
a primary VM is created on another compute system. The primary and secondary VMs
exchange heartbeats. If the primary VM fails due to hardware failure, the hypervisor
clustering enables failover to the secondary VM immediately. After a transparent failover
occurs, a new secondary VM is created and redundancy is reestablished. The hypervisor
running the primary VM as shown in the figure on the slide captures the sequence of events
for the primary VM, including instructions from the virtual I/O devices, virtual NICs, etc.
Then it transfers these sequences to the hypervisor running on another compute system.
The hypervisor running the secondary VM receives these event sequences and sends them
to the secondary VM for execution. The primary and the secondary VMs share the same
storage, but all output operations are performed only by the primary VM. A locking
mechanism ensures that the secondary VM does not perform write operations on the
shared storage. The hypervisor posts all events to the secondary VM at the same execution
point as they occurred on the primary VM. This way, these VMs “play” exactly the same set
of events and their states are synchronized with each other.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 352
Sometimes it is necessary to move the services running on VMs from one hypervisor to another without impacting the services. For example, a cloud administrator may want to perform scheduled maintenance of a compute system without impacting the VMs running on it. This can be achieved by implementing VM live migration. In a VM live migration,
as shown in figure on the slide, the entire active state of a VM is moved from one hypervisor
to another. The state information includes memory contents and all other information that
identifies the VM. This method involves copying the contents of virtual machine memory
from the source hypervisor to the target and then transferring the control of the VM’s disk
files to the target hypervisor. Next, the VM is suspended on the source hypervisor, and the
VM is resumed on the target hypervisor. Because the virtual disks of the VMs are not
migrated, this technique requires that both source and target hypervisors have access to the
same storage. Performing VM live migration requires a high speed network connection. It is
important to ensure that even after the migration, the virtual machine network identity and
network connections are preserved. In addition, live migration facilitates VM load balancing across physical compute systems, which improves performance and optimizes resource usage.
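A minimal sketch of the migration sequence described above, with hypothetical stand-in objects; this is not any vendor's migration API:

```python
class Hypervisor:
    def __init__(self, name):
        self.name = name

    def suspend(self, vm):
        print(f"{vm} suspended on {self.name}")

    def resume(self, vm):
        print(f"{vm} resumed on {self.name}")

def copy_memory(vm, source, target):
    print(f"Copying memory of {vm}: {source.name} -> {target.name}")

def transfer_disk_control(vm, source, target):
    print(f"Handing control of {vm}'s disk files (on shared storage) to {target.name}")

def live_migrate(vm, source, target):
    copy_memory(vm, source, target)            # 1. copy memory contents
    transfer_disk_control(vm, source, target)  # 2. transfer disk-file control
    source.suspend(vm)                         # 3. suspend on the source hypervisor
    target.resume(vm)                          # 4. resume on the target hypervisor

live_migrate("vm-01", Hypervisor("hv-a"), Hypervisor("hv-b"))
```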

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 353
Even a short network interruption could impact many services running in a cloud environment. So, the network infrastructure must be fully redundant and highly available with no single points of failure. Link aggregation, switch aggregation, NIC teaming, and multipathing provide fault tolerance against link failures.
Link aggregation combines two or more parallel network links into a single logical link, called
port-channel, yielding higher bandwidth than a single link could provide. Link aggregation
enables distribution of network traffic across the links and traffic failover in the event of a
link failure. If a link in the aggregation is lost, all network traffic on that link is redistributed
across the remaining links. Link aggregation can be performed for links between two
switches and between a switch and a node.
Switch aggregation combines two physical switches and makes them appear as a single
logical switch. All network links from these physical switches appear as a single logical link.
This enables a single node to use a port-channel across two switches and network traffic is
distributed across all the links in the port-channel. Switch aggregation also provides fault
tolerance against link failures. Without switch aggregation, a failure of an active switch
causes the Ethernet or FCoE network to take a finite amount of time (a few seconds) to select an
alternate link through another switch. With switch aggregation, if one switch in the
aggregation fails, network traffic continues to flow through the other switch. The figure on
the slide shows an example of switch aggregation. In this example, four physical links to the
aggregated switches (directors) appear as a single logical link to the third switch.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 354
NIC teaming logically groups NICs (to create a NIC team) so that they appear as a logical NIC
to the OS or hypervisor. NIC teaming provides network traffic failover to prevent
connectivity loss in the event of a NIC failure or a network link outage.
NICs within a team can be configured as active and standby. Active NICs are used to send
data packets, whereas the standby NICs remain idle. A standby NIC is not used for
forwarding traffic unless an active NIC fails. In the event of an active NIC failure, the node
begins using another NIC in the team to send data. Data that is in flight may have to be retransmitted. In
some cases, NIC teaming enables aggregation of network bandwidth of individual NICs. The
bandwidth aggregation facilitates distribution of network traffic across NICs in the team.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 355
Multipathing enables service providers to meet aggressive availability and performance
service levels. It enables a compute system to use multiple paths for transferring data to a
LUN on a storage system. Multipathing enables automated path failover that eliminates the
possibility of disrupting an application or service due to the failure of an adapter, cable, port,
and so on. In the event of a path failover, all outstanding and subsequent I/O requests are
automatically directed to alternative paths.
Typically, a single path from a compute system to a LUN consists of a NIC or HBA port, switch
ports, connecting cables, and a storage controller (SC) port. To use multipathing, multiple
paths must exist between the compute and the storage systems. Each path can be
configured as either active or standby. Multipathing can perform load balancing by
distributing I/O across all active paths. Standby paths become active if one or more active
paths fail. If an active path fails, the multipathing process detects the failed path and then
redirects I/Os from the failed path to another active path. Multipathing can be a built-in OS or
hypervisor function, or it can be provided by a third party as a software module
installed on the OS or hypervisor. The figure on the slide shows a configuration where four
paths between the physical compute system (with dual-port HBAs) and the LUN enable
multipathing.
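A minimal sketch of the path selection and failover behavior described above is shown below, assuming a simple round-robin policy across active paths. The path names and health check are illustrative assumptions, not a real multipathing driver.

```python
# Sketch of multipathing: round-robin load balancing across active paths and
# failover to a standby path when an active path fails. Names are illustrative.
import itertools

FAILED = {"hba1-spA"}                      # simulate a failed adapter/cable/port

def path_is_healthy(path):
    return path not in FAILED

class Multipath:
    def __init__(self, active_paths, standby_paths):
        self.active = list(active_paths)
        self.standby = list(standby_paths)
        self._rr = itertools.cycle(self.active)

    def send_io(self, request):
        # Load balancing: distribute I/O across all active paths (round robin).
        for _ in range(len(self.active)):
            path = next(self._rr)
            if path_is_healthy(path):
                return f"I/O {request} sent via {path}"
            self.fail_path(path)           # redirect I/O away from the failed path
        raise RuntimeError("no active paths available for the LUN")

    def fail_path(self, path):
        # Path failover: remove the failed path and promote a standby path.
        self.active.remove(path)
        if self.standby:
            self.active.append(self.standby.pop(0))
        self._rr = itertools.cycle(self.active)

mp = Multipath(active_paths=["hba1-spA", "hba2-spB"], standby_paths=["hba1-spB"])
print(mp.send_io("read-lun0"))             # I/O is redirected to a surviving path
```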

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 356
An in-service software upgrade (ISSU) is a technique where the software (firmware) on a
network device (switch and router) can be patched or upgraded without impacting the
network availability. ISSU is mainly used to maintain network availability during device
maintenance and upgrade processes. It eliminates the need to stop a device's ongoing processes
and reboot it, which would degrade overall network service availability. Usually an
ISSU requires a network device with redundant control plane elements, such as supervisor
engines or routing engines. This redundancy allows a network administrator to update the
software image on one engine while the other maintains network availability. For example,
some routers and switches are integrated with active and standby route processors. When
the upgrade process starts, the active routing engine’s operations are switched to the
standby routing engine until the upgrade process is complete.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 357
Cloud infrastructure comprises a very large number of disk drives and solid state drives to
support the various services running in the environment. Failure of these drives could result
in data loss and service unavailability. The greater the number of drives in use, the greater is
the probability of a drive failure. RAID is a technique that combines multiple drives into a
logical unit called a RAID set. Nearly all RAID implementation models provide data
protection against drive failures. The figure on the slide illustrates an example of RAID 6
(dual distributed parity), where data is protected against two concurrent disk failures.
Dynamic disk sparing is a fault-tolerance mechanism in which a spare drive
automatically replaces a failed disk drive by taking over its identity. A spare drive should be
large enough to accommodate the data from a failed drive. Some systems implement multiple
spare drives to improve data availability. In dynamic drive sparing, when the recoverable
error rates for a disk exceed a predetermined threshold, the disk subsystem tries to copy
data from the failing disk to the spare drive automatically. If this task is completed before
the damaged disk fails, the subsystem switches to the spare disk and marks the failing disk
as unusable. Otherwise, it uses parity or the mirrored disk to recover the data.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 358
Erasure coding provides space-optimal data redundancy to protect against data loss due to
multiple drive failures. In a typical erasure-coded storage system, a set of n disks is divided into m
disks that hold data and k disks that hold coding information, where n, m, and k are integers.
The coding information is calculated from the data. If up to k of the n disks fail, their
contents can be recomputed from the surviving disks. For example, when erasure coding is
implemented and data is written to the storage system, the data is divided into m data
fragments and k coding fragments, and each fragment is stored on a different drive. In the
event of drive failures, the data can be reconstructed from the surviving data and coding fragments.
The figure on the slide illustrates an example of dividing data into nine data fragments (m = 9)
and three coding fragments (k = 3). The maximum number of drive failures supported in this
example is three.
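As a simplified illustration of this fragment-and-code idea, the sketch below implements the special case of k = 1 using XOR parity; a real m + k scheme (for example, Reed-Solomon) tolerates up to k failures. The helper functions are illustrative assumptions, not production erasure coding.

```python
# Simplified erasure-coding sketch: m data fragments plus k = 1 coding fragment
# (XOR parity). A lost fragment is recomputed from the surviving fragments.

def encode(data: bytes, m: int):
    frag_len = -(-len(data) // m)                    # ceiling division
    frags = [data[i * frag_len:(i + 1) * frag_len].ljust(frag_len, b"\0")
             for i in range(m)]
    parity = bytearray(frag_len)
    for frag in frags:                               # coding fragment = XOR of data fragments
        parity = bytearray(a ^ b for a, b in zip(parity, frag))
    return frags + [bytes(parity)]                   # n = m + 1 fragments, one per drive

def recover(fragments, lost_index):
    # Recompute the lost fragment from the surviving fragments.
    frag_len = len(next(f for f in fragments if f is not None))
    rebuilt = bytearray(frag_len)
    for i, frag in enumerate(fragments):
        if i != lost_index:
            rebuilt = bytearray(a ^ b for a, b in zip(rebuilt, frag))
    return bytes(rebuilt)

data = b"business continuity in the cloud"
fragments = encode(data, m=4)            # 4 data fragments + 1 parity fragment
lost_index = 2                           # simulate the failure of one drive
surviving = list(fragments)
surviving[lost_index] = None
print(recover(surviving, lost_index) == fragments[lost_index])   # True
```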

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 359
Storage resiliency can also be achieved by using storage virtualization. A virtualization layer,
created using a virtualization appliance that resides in the SAN, abstracts the identity of
physical storage devices and creates a storage pool from heterogeneous storage systems.
Virtual volumes are created from the storage pool and assigned to the compute systems.
Instead of being directed to the LUNs on the individual storage systems, the compute
systems are directed to the virtual volumes provided by the virtualization layer.
The figure on the slide provides an illustration of a virtual volume that is mirrored between LUNs
of two different storage systems. Each I/O to the virtual volume is mirrored to the
underlying LUNs on the storage systems. If one of the storage systems incurs an outage due
to failure or maintenance, the virtualization appliance will be able to continue processing
I/O on the surviving mirror leg. Upon restoration of the failed storage system, the data from
the surviving LUN is resynchronized to the recovered leg. This method provides protection
and high availability for critical cloud services in the event of a storage system failure.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 360
This lesson covered an overview of single points of failure. It also covered various
fault-tolerance mechanisms used to avoid single points of failure, including clustering, VM live
migration, link and switch aggregation, in-service software upgrade, RAID, erasure coding,
dynamic drive sparing, and storage resiliency using mirrored virtual volumes.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 361
This lesson covers an introduction to service availability zones. It also covers
automated service failover across zones, along with zone configurations such as active/active
and active/passive. Further, this lesson covers live migration across zones using a stretched
cluster.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 362
An important high availability design best practice in a cloud environment is to create
service availability zones. A service availability zone may be viewed as a group of cloud
resources that is physically isolated from other zones (each zone has its own distinct cloud
resources). Typically, the resources within a zone are redundant and there are no single points
of failure. A zone can be part of a data center or even an entire data center. This
provides redundant cloud computing facilities on which applications or services can be
deployed. Service providers typically deploy multiple zones within a data center (to run
multiple instances of a service), so that if one of the zones incurs an outage for some
reason, the service can be failed over to another zone. They also deploy multiple
zones across geographically dispersed data centers (again running multiple instances of a
service), so that the service can survive even a data center-level failure.
It is also important that there is a mechanism that allows seamless (automated)
failover of services running in one zone to another.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 363
To ensure robust and consistent failover in case of a failure, automated service failover
capabilities are highly desirable to meet stringent service levels. This is because manual
steps are often error prone and may take considerable time to implement. Automated
failover also provides a reduced RTO when compared to the manual process. A failover
process also depends on other capabilities, including VM replication and migration
capabilities, and a reliable network infrastructure between the zones. The following slides will
demonstrate the active/passive and active/active zone configurations, where the zones are
in different remote locations.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 364
The figure on the slide shows an example of active/passive zone configuration. In this
scenario, all the traffic goes to the active zone (primary zone) only and the storage is
replicated from the primary zone to the secondary zone. Typically in an active/passive
deployment, only the primary zone has the cloud service applications deployed. When a
disaster occurs, the service is failed over to the secondary zone: the application instances are
started in the secondary zone and traffic is rerouted to that location.
In some active/passive implementations, both the primary and secondary zones have services
running; however, only the primary zone actively handles consumer requests. If
the primary zone goes down, the service is failed over to the secondary zone and all
requests are rerouted. This implementation provides faster restoration of the service (a very low
RTO).

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 365
The figure on the slide shows an example of implementing an active/active configuration across
data centers. In this case, both zones are active, running simultaneously and handling
consumer requests, and the storage is replicated between the zones. There should be a
mechanism in place to synchronize the data between the two zones. If one of the zones fails,
the service fails over to the other active zone. The key point to note is that until the failed
zone is restored, the surviving zone may experience a sudden increase in workload, so it
is important to start additional instances to handle the workload in that zone. The
active/active design gives the fastest recovery time.
The following slide details an underlying technique, live migration of VMs using a
stretched cluster, which enables continuous availability of a service in the event of a compute,
storage, or zone (site) failure.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 366
Moving services across zones located in different locations without user interruption is
critical to achieving high availability. A stretched cluster enables live VM migration
across zones. A stretched cluster is a cluster whose compute systems are in different, remote
locations. The figure on the slide provides an example of VM live migration across zones
in different locations using a stretched cluster. Stretching the cluster across the zones (data
centers) provides disaster recovery capability in the event of a disaster at one of the data
centers. Stretched clusters are typically built as a way to create active/active zones in order
to provide high availability and enable dynamic workload balancing across zones. Also, in a
solution where consumers of a given application are spread across the globe and work in
different time zones, productivity is greatly enhanced if the application is closer
to the consumers. Live migration with a stretched cluster makes this possible
by providing the ability to move VMs and applications within and across zones so that they
are closest to the consumers for faster and more reliable access. Stretched cluster configurations are typically
deployed using stretched VLANs (discussed in Module 4). The stretched VLANs allow
movement of VMs from a compute system at one zone to a compute system at another
zone without having to change the network configuration of the VMs. Referring to the figure on the
slide, and assuming there is planned downtime at zone A, the stretched cluster
enables moving the running VMs to a hypervisor running at zone B before the outage.
Once the outage has ended, the VMs can be moved back to zone A. In the case of a disaster,
the VMs can be automatically restarted at the surviving zone.

(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 367
This lesson covered an overview of service availability zones along with their implementations,
such as active/passive and active/active configurations. It also covered automated
service failover across zones. Further, this lesson covered VM live migration across zones
using a stretched cluster.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 369
This lesson covers an introduction to backup and recovery, along with backup requirements in a
cloud environment. It also covers the guest-level and image-level backup methods.
Further, this lesson covers backup service deployment options and deduplication in a backup
environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 370
Extended downtime of critical services can be disastrous to any business. In addition to protecting the
infrastructure components (compute, storage, and network), it is critical for a business to
protect its data by making copies of it, so that the data is available for restoring a service even if
the original data is no longer available. Typically, businesses implement data protection
solutions to protect data against accidental file deletion, application crashes,
data corruption, and disasters. Data should be protected both locally and at a
remote location to ensure service availability. For example, when a service is failed over
to another zone (data center), the data must already be available at the destination for the
failover to succeed and to minimize the impact on the service.
One challenge to data protection that remains unchanged is determining the “right” amount
of protection required for each data set. A “tiered approach” to data protection takes into
account the importance of the data. Individual applications or services and their associated data
sets have different business values and require different data protection strategies. As a result, a
service provider should implement a well-executed data protection infrastructure
that offers a choice of cost-effective options to meet the various tiers of protection needed. In
a tiered approach, data and applications (services) are allocated to categories (tiers)
depending on their importance. For example, mission-critical services are tier 1, important
but less time-critical services are tier 2, and non-critical services are tier 3. Using tiers,
resources and data protection techniques can be applied more cost effectively to meet the
more stringent requirements of critical services while less expensive approaches are used
for the other tiers. The two key data protection solutions widely implemented are backup
and replication.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 371
A backup is an additional copy of production data, created and retained for the sole purpose
of recovering lost or corrupted data. With growing business and regulatory demands for
data storage, retention, and availability, cloud service providers face the task of backing up
an ever-increasing amount of data. This task becomes more challenging with the growth of
data, reduced IT budgets, and less time available for taking backups. Moreover, service
providers need fast backup and recovery of data to meet their service level agreements. The
amount of data loss and downtime that a business can endure in terms of RPO and RTO are
the primary considerations in selecting and implementing a specific backup strategy. RPO
specifies the time interval between two backups. For example, if a service requires an RPO
of one hour, the data need to be backed up every hour. RTO relates to the time taken by the
recovery process. To meet the defined RTO, the service provider should choose the
appropriate backup media or backup target to minimize recovery time. For example, a
restore from tapes takes longer to complete than a restore from disks. Service providers
need to evaluate the various backup methods along with their recovery considerations and
retention requirements to implement a successful backup and recovery solution in a cloud
environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 372
In a cloud environment, applications typically run on virtual machines (VMs). Multiple VMs
are hosted on a single physical compute system or on clustered compute systems. The virtualized
compute environment is typically managed from a management server, which provides a
centralized management console for managing the environment. Integration of the backup
application with the management server of the virtualized environment is required. Advanced
backup methods require the backup application to obtain a view of the virtualized
environment and to send backup-related configuration commands to the management
server. The backup may be performed either file by file or as an image. Similarly, recovery
may require file-level recovery, full VM recovery from the image, or both. Cloud
services have different availability requirements, which affect the backup strategy.
For example, if a consumer chooses a higher backup service level (such as Platinum) for their VM
instances, the backups occur more frequently, provide a lower RTO, and have longer-term
retention compared with lower service tiers. A cloud environment typically contains a
large volume of redundant data. Backing up redundant data significantly increases the
backup window and the operating expenditure. Service providers need to consider
deduplication techniques to overcome these challenges. It is also important to ensure that
most backup and recovery operations are automated.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 373
In a backup environment, the common backup components are backup client, backup
server, storage node, and backup target. The role of a backup client is to gather the data that
is to be backed up and send it to the storage node. It also sends the tracking information to
the backup server. The backup server manages the backup operations and maintains the
backup catalog, which contains information about the backup configuration and backup
metadata. The backup configuration contains information about when to run backups,
which client data is to be backed up, and so on. The backup metadata contains information
about the backed up data. The storage node is responsible for organizing the client’s data
and writing the data to a backup device. A storage node controls one or more backup
devices. Backup devices may be attached directly or through a network to the storage node.
The storage node sends tracking information to the backup server about the data written to
the backup device. Typically this information is used for recoveries.

(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 374
The two key methods of backup in a cloud environment are guest-level and image-level
backup. The following slides discuss these methods.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 376
In a guest-level backup, a VM is treated as if it is a physical compute system. A backup agent
or client is installed on the VM, and it streams the backup data to the storage node as
shown in the figure on the slide. Management of the backup operation is similar to that of a
physical compute system backup. Guest-level backup performs file-level backup of the data.
The VM’s configuration files are not backed up.
In a cloud environment, it is common to run multiple VMs on a compute system. If multiple
VMs on a compute system are backed up simultaneously then the combined I/O and
bandwidth demands placed on the compute system by the various guest-level backup
operations can deplete the compute system resources. This may impact the performance of
the services running on the VMs. To overcome these challenges, the backup process can be
offloaded from the VMs or hypervisors to a proxy server. This can be achieved by using the
image-level backup method.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 377
Image-level backup makes a copy of the virtual disk and configuration associated with a
particular VM. The backup is saved as a single entity called a VM image. This type of backup
is suitable for restoring an entire VM in the event of a hardware failure or a human error such
as accidental deletion of the VM. Image-level backup also supports file-level recovery. In
an image-level backup, the backup software can back up VMs without installing backup
agents inside the VMs or at the hypervisor level. The backup processing is performed by a
proxy server that acts as the backup client, thereby offloading the backup processing from
the VMs. The proxy server communicates to the management server responsible for
managing the virtualized compute environment. It sends commands to create a snapshot of
the VM to be backed up and to mount the snapshot to the proxy server. A snapshot captures
the configuration and virtual disk data of the target VM and provides a point in time view of
the VM. The proxy server then performs backup by using the snapshot. The figure on the
slide illustrates image-level backup.
Some vendors support incremental backup through changed block tracking. This feature
identifies and tags any blocks that have changed since the last VM snapshot, enabling
the backup application to back up only the blocks that have changed rather than backing up
every block. This considerably reduces the amount of data to be backed up, allowing more
VMs to be backed up within a backup window. Changed block tracking also
makes it possible to restore only those changed blocks needed to roll back an existing VM
image to a particular RPO.
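A minimal sketch of the incremental idea is shown below, assuming a simple block-fingerprinting approach; real changed block tracking is done by the hypervisor tagging dirty blocks rather than by hashing, so the hashing here merely simulates detecting which blocks changed. Block size and data structures are illustrative assumptions.

```python
# Sketch of incremental image backup: the first run copies every block, later
# runs copy only blocks whose content changed since the previous snapshot.
import hashlib

BLOCK_SIZE = 4

def blocks(image: bytes):
    return [image[i:i + BLOCK_SIZE] for i in range(0, len(image), BLOCK_SIZE)]

def incremental_backup(image: bytes, last_fingerprints: dict):
    """Return only the blocks that changed since the previous snapshot."""
    changed, fingerprints = {}, {}
    for index, block in enumerate(blocks(image)):
        digest = hashlib.sha256(block).hexdigest()
        fingerprints[index] = digest
        if last_fingerprints.get(index) != digest:    # block is new or modified
            changed[index] = block
    return changed, fingerprints

vm_disk = b"AAAABBBBCCCCDDDD"
full, catalog = incremental_backup(vm_disk, {})       # first run backs up every block
vm_disk = b"AAAABBBBXXXXDDDD"                          # one block changes after the snapshot
delta, catalog = incremental_backup(vm_disk, catalog)
print(len(full), len(delta))                           # 4 blocks first, then only 1 block
```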

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 378
There are three common backup service deployment options that cloud service providers
offer to their consumers. These deployment options are:
1. Local backup service (Managed backup service): This option is suitable when a cloud
service provider is already providing some form of cloud services (example: compute
services) to the consumers. The service provider may choose to offer backup services to
the consumers, helping protect consumer’s data that is being hosted in the cloud.
2. Replicated backup service: This is an option where a consumer performs backups at their
local site but does not want to own, manage, or incur the expense of a remote
site for disaster recovery purposes. For such consumers, a cloud service provider offers
a replicated backup service that replicates backup data to a remote disaster recovery site.
3. Remote backup service: In this option, consumers do not perform any backup at their
local site. Instead, their data is transferred over a network to a backup infrastructure
managed by the cloud service provider.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 379
The slide shows an example of creating a backup service level in an EMC private cloud
environment. The backup service levels (Platinum, Gold, and Silver) are used as a means of
differentiating between the levels of protection required by different VMs and applications.
The main properties that define these backup service levels are the backup target, the backup
schedule (daily, weekly, monthly), and the retention period for the backup.
The cloud administrator follows a series of steps to define a backup service level:
1. Enter a new, unique service level name. Typically, the service level name is Platinum,
Gold, Silver, or Bronze.
2. Select one of the available backup targets that matches the selected service level. For
example, if the Platinum service level is chosen, the backup target should offer more
features, including deduplication and faster VM restore.
3. Enter the schedule, which determines how often the backup is taken. A schedule of
Daily, Weekly, or Monthly can be selected for backups.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 380
4. Create a retention policy appropriate for this backup service level. Several types of
retention policies can be created: retain the backups forever, retain the backups for a
certain number of days/weeks/months/years, retain the backups until a certain date, or
define a custom retention.
5. Finally, click Submit. The workflow triggers several API functions in sequence to create
the service level.
Once the service level is created, consumers can use it to protect their
VMs and applications. The following slide shows an example of how a consumer deploys a VM
with automatic data protection.
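To make the collected attributes concrete, here is a minimal sketch that models a backup service level as a simple record with the properties listed above (name, backup target, schedule, retention). The field values, validation, and in-memory catalog are illustrative assumptions, not the actual EMC workflow or API.

```python
# Illustrative model of a backup service level record; not a real product API.
from dataclasses import dataclass

VALID_SCHEDULES = {"Daily", "Weekly", "Monthly"}

@dataclass
class BackupServiceLevel:
    name: str              # unique, e.g. Platinum, Gold, Silver, Bronze
    backup_target: str     # target matching the tier, e.g. a deduplication appliance
    schedule: str          # how often the backup is taken
    retention: str         # e.g. "90 days", "until 2025-12-31", or "forever"

    def __post_init__(self):
        if self.schedule not in VALID_SCHEDULES:
            raise ValueError(f"schedule must be one of {VALID_SCHEDULES}")

catalog = {}

def create_service_level(level: BackupServiceLevel):
    if level.name in catalog:
        raise ValueError("service level name must be unique")
    catalog[level.name] = level      # in the real workflow, Submit triggers API calls
    return level

create_service_level(BackupServiceLevel(
    name="Platinum", backup_target="dedupe-appliance-01",
    schedule="Daily", retention="forever"))
print(catalog["Platinum"])
```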

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 381
This slide shows an example of how a consumer deploys a VM with a predefined backup service
level. This requires that all added backup service levels are visible to consumers, so that they can
choose a service level to protect the new VMs they are deploying. The consumer follows a
series of steps to deploy VMs with the required backup service level:
1. The consumer logs into their account on the self-service portal and clicks New Request, as
shown in the slide.
2. To deploy a VM, the consumer selects a service (VM instance) from the list of available
services. In this example, the consumer selected DSLinuxGold.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 382
3. Select the number of VMs to deploy.
4. Select the required backup service level. In this example, the Silver backup service level
is selected, as shown in the slide.
Once the Submit button is pressed, the VM is provisioned and inherits the backup attributes
defined by the service level.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 383
With the growth of data and 24x7 service availability requirements, service providers face
challenges in protecting their consumers' data. Backup processes typically back up a lot
of duplicate data. Backing up duplicate data significantly increases the backup window
and results in unnecessary consumption of resources, such as storage space and
network bandwidth. There are also requirements to preserve data for longer periods,
whether driven by consumer needs or by legal and regulatory concerns. Backing up large
amounts of duplicate data to a remote site for DR purposes is also cumbersome and
requires considerable bandwidth. Data deduplication provides a solution for service providers to
overcome these challenges in a backup environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 384
Deduplication is the process of detecting and identifying the unique data segments (chunk)
within a given set of data to eliminate redundancy. The use of deduplication techniques
significantly reduces the amount of data to be backed up in a cloud environment, where
typically a large number of VMs are deployed. Data deduplication operates by segmenting a
dataset into blocks, identifying redundant blocks, and writing only the unique blocks to the
backup target. To identify redundant blocks in the backup data, the data deduplication
system creates a hash value or digital signature—like a fingerprint—for each data block and
an index of the signatures for a given repository. The index provides the reference list to
determine whether blocks already exist in a repository. When the data deduplication system
sees a block it has processed before, instead of storing the block again, it inserts a pointer to
the original block in the repository.
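A minimal sketch of this hash-index mechanism is shown below: each block is fingerprinted, the fingerprint is looked up in an index, and only previously unseen blocks are written to the repository. The fixed block size and in-memory structures are simplifying assumptions.

```python
# Sketch of block-level deduplication using a hash index (fingerprints).
import hashlib

BLOCK_SIZE = 8
repository = {}          # fingerprint -> unique block (the deduplicated store)

def dedupe_write(stream: bytes):
    """Store a backup stream; return the list of fingerprints (pointers)."""
    pointers = []
    for i in range(0, len(stream), BLOCK_SIZE):
        block = stream[i:i + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in repository:     # unseen block: write it once
            repository[fingerprint] = block
        pointers.append(fingerprint)          # duplicate block: keep only a pointer
    return pointers

def restore(pointers):
    return b"".join(repository[p] for p in pointers)

backup1 = dedupe_write(b"vm-image-vm-image-vm-image-vm-os")
backup2 = dedupe_write(b"vm-image-vm-image-vm-image-vm-db")   # mostly duplicate content
print(len(repository), restore(backup2))     # far fewer unique blocks than blocks written
```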
The level at which data is identified as duplicate affects the amount of redundancy or
commonality. Operational levels of deduplication include file-level deduplication and sub-
file deduplication. File-level deduplication (also called single instance storage) detects and
removes redundant copies of identical files in a backup environment. Only one copy of the
file is stored; the subsequent copies are replaced with a pointer to the original file. By
removing all of the subsequent copies of a file, a significant amount of space savings can be
achieved. File-level deduplication is simple but does not address the problem of duplicate
content inside the files. Also, a change in any part of a file results in classifying that as a new
file and saving it as a separate copy. For example, two 10-MB presentations with a
difference in just the title page are not considered as duplicate files, and each file is stored
separately.

(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 385
Source-based data deduplication eliminates redundant data at the source (backup client)
before transmission to the backup device. The deduplication software or agent on the
clients checks each file or block for duplicate content. Source-based deduplication reduces
the amount of data that is transmitted over a LAN or WAN from the source to the backup
device, thus requiring less network bandwidth. There is also a substantial reduction in the
capacity required to store the backup data. However, a deduplication agent running on the
client may impact backup performance, especially when a large amount of data needs to be
backed up. When image-level backup is implemented, the backup workload is moved to a
proxy server. The deduplication agent is installed on the proxy server to perform
deduplication without impacting the VMs running applications. Service providers can
implement source-based deduplication when performing backups (backup as a service) from
a consumer's location to the provider's location.
Target-based data deduplication occurs at the backup device, which offloads the
deduplication process and its performance impact from the backup client. In target-based
deduplication, the backup application sends data to the target backup storage device, where
the data is deduplicated either immediately (inline) or at a scheduled time (post-process).
With inline data deduplication, the incoming backup stream is divided into small chunks,
and then compared to data that has already been deduplicated. The inline deduplication
method requires less storage space than the post-process approach. However, inline
deduplication may slow down the overall data backup process. Some vendors’ inline
deduplication systems leverage the continued advancement of CPU technology to increase
the performance of the inline deduplication by minimizing disk accesses required to
deduplicate data. Such inline deduplication systems identify duplicate data segments in
memory, which minimizes disk usage.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 387
This lesson covered various backup requirements in a cloud environment. It also
covered the guest-level and image-level backup methods. This lesson further covered the
backup service deployment options: local backup service (managed backup service),
replicated backup service, and remote backup service. Finally, this lesson covered the
importance of deduplication in a backup environment, along with source-based and
target-based deduplication.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 389
This lesson covers an introduction to replication and its types. This lesson also covers local
replication methods such as snapshot and mirroring. This lesson further covers remote
replication methods such as synchronous and asynchronous remote replications along with
continuous data protection (CDP). Finally, this lesson covers a replication use case, Disaster
Recovery as a Service (DRaaS).

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 390
It is absolutely necessary for cloud service providers to protect mission-critical data and
minimize the risk of service disruption. If a local outage or disaster occurs, faster data and
VM restore and restart is essential to ensure business continuity. One of the ways to ensure
BC is replication, which is the process of creating an exact copy (replica) of the data. These
replicas are used to restore data and restart services if data loss occurs. Based on the
availability requirements of the service being offered to the consumer, the data can be
replicated to one or more locations. Service providers should give consumers the option of
choosing the location to which their data is replicated, in order to comply
with regulatory requirements. Replication can be classified into two major categories: local
and remote. Local replication refers to replicating data within the same location. Local
replicas help restore data in the event of data loss and enable an application to be restarted
immediately to ensure BC. Snapshot and mirroring are the widely deployed local
replication techniques. Remote replication refers to replicating data across multiple
locations (the locations can be geographically dispersed). Remote replication helps organizations
mitigate the risks associated with regional outages resulting from natural or human-made
disasters. During a disaster, the services can be moved to a remote location to ensure
continuous business operation. In remote replication, data can be replicated synchronously or
asynchronously.

Note:
Replicas are immediately accessible by the application, but a backup copy must be restored
by backup software to make it accessible to applications. Backup is always a point-in-time
copy, but a replica can be a point-in-time copy or continuous. Backup is typically used for
operational or disaster recovery but replicas can be used for recovery and restart. Replicas
typically provide faster RTO compared to recovery from backup.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 391
A snapshot is a virtual copy of a set of files or a volume as they appeared at a specific point in
time (PIT). A snapshot can be created by using the compute operating environment (hypervisor) or the
storage system operating environment. Typically, the storage system operating environment takes
snapshots at the volume level; a volume may contain the data and configuration files of multiple
VMs, and this option does not allow restoring an individual VM in the volume. The most common
snapshot technique implemented in a cloud environment is the virtual machine snapshot. A
virtual machine snapshot preserves the state and data of a virtual machine at a specific
point in time. The VM state includes VM files, such as the BIOS and VM configuration, and its
power state (powered on, powered off, or suspended). A VM snapshot is useful for quick
restoration of a VM. For example, a cloud administrator can snapshot a VM and then make changes
such as applying patches and software upgrades. If anything goes wrong, the administrator can
simply restore the VM to its previous state using the previously saved VM snapshot.
The hypervisor provides an option to create and manage multiple snapshots. When a VM
snapshot is created, a child virtual disk (delta disk file) is created from the base image or
parent virtual disk. The snapshot mechanism prevents the guest operating system from
writing to the base image or parent virtual disk and instead directs all writes to the delta
disk file. Successive snapshots generate a new child virtual disk from the previous child
virtual disk in the chain. Snapshots hold only changed blocks. This VM snapshot can be used
for creating image-level backups (discussed earlier) to offload the backup load from the
hypervisor.
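The redirect-on-write behavior of a snapshot chain can be sketched as follows: writes go to the newest delta disk, reads fall back through the chain to the base image, and reverting simply discards the delta. This is a conceptual illustration under simple assumptions, not any specific hypervisor's snapshot format.

```python
# Conceptual sketch of a VM snapshot chain of delta disks.

class VirtualDisk:
    def __init__(self, parent=None):
        self.blocks = {}              # a delta disk holds only changed blocks
        self.parent = parent

    def read(self, block_id):
        if block_id in self.blocks:
            return self.blocks[block_id]
        return self.parent.read(block_id) if self.parent else None

class VM:
    def __init__(self, base_blocks):
        base = VirtualDisk()
        base.blocks = dict(base_blocks)
        self.current = base           # guest writes go to the newest disk in the chain

    def take_snapshot(self):
        # The current disk becomes read-only; a new child delta disk receives writes.
        self.current = VirtualDisk(parent=self.current)

    def write(self, block_id, data):
        self.current.blocks[block_id] = data

    def read(self, block_id):
        return self.current.read(block_id)

    def revert_to_snapshot(self):
        # Discard the delta disk and return to the parent (the snapshotted state).
        self.current = self.current.parent

vm = VM({0: "bootloader", 1: "kernel-v1"})
vm.take_snapshot()                    # preserve the pre-patch state
vm.write(1, "kernel-v2")              # e.g. applying a patch after the snapshot
print(vm.read(1))                     # kernel-v2
vm.revert_to_snapshot()               # restore the VM if the patch goes wrong
print(vm.read(1))                     # kernel-v1
```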

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 392
Mirroring can be implemented within a storage system between volumes and also between
the storage systems. The example shown on the slide illustrates mirroring between volumes
within a storage system. The replica is attached to the source and established as a mirror of
the source. The data on the source is copied to the replica. New updates to the source are
also updated on the replica. After all the data is copied and both the source and the replica
contain identical data, the replica can be considered a mirror of the source. While the
replica is attached to the source it remains unavailable to any other compute system.
However, the compute system continues to access the source. After the synchronization is
complete, the replica can be detached from the source and made available for other
business operations such as backup. If the source volume becomes unavailable for any
reason, the replica can be used to restart the service instance or to restore data to the
source volume and make it available for operations.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 393
Synchronous replication provides a near-zero RPO, where the replica is identical to the source
at all times. In synchronous replication, writes must be committed to the source and the
remote replica (or target) prior to acknowledging “write complete” to the compute system.
Additional writes on the source cannot occur until each preceding write has been completed
and acknowledged. This ensures that data is identical on the source and replica at all times.
Further, writes are transmitted to the remote zones exactly in the order in which they are
received at the source; therefore, write ordering is maintained. The figure on the slide
illustrates an example of synchronous remote replication. Data can be replicated
synchronously across multiple zones. If the primary zone is unavailable due to a disaster,
the service can be restarted immediately in another zone to meet the required SLA.

Note:
Application response time is increased with synchronous remote replication because writes
must be committed on both the source and target before sending the “write complete”
acknowledgment to the compute system. The degree of impact on response time depends
primarily on the distance and network bandwidth between sites. If the bandwidth provided
for synchronous remote replication is less than the maximum write workload, there will be
times during the day when the response time might be excessively elongated, causing
applications to time out. The distances over which synchronous replication can be deployed
depend on the application’s capability to tolerate extensions in response time. Typically
synchronous remote replication is deployed for distances of less than 200 km (125 miles)
between the two sites.
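A minimal sketch of the synchronous write path described above: the write is committed to both the source and the remote replica, in order, before the compute system receives "write complete". The in-memory volumes and the simulated latency are assumptions for illustration.

```python
# Sketch of synchronous remote replication: acknowledge only after both copies
# are committed, preserving write ordering; the remote round trip adds latency.
import time

source_volume, replica_volume = {}, {}
NETWORK_LATENCY = 0.005        # round trip to the remote zone adds to response time

def synchronous_write(block_id, data):
    source_volume[block_id] = data              # commit on the source
    time.sleep(NETWORK_LATENCY)                 # transmit to the remote zone, in order
    replica_volume[block_id] = data             # commit on the remote replica
    return "write complete"                     # only now acknowledge the host

start = time.time()
for i, payload in enumerate(["txn-1", "txn-2", "txn-3"]):
    synchronous_write(i, payload)
print(source_volume == replica_volume)          # True: replica identical at all times
print(f"response time includes replication: {time.time() - start:.3f}s")
```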

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 394
It is important for a service provider to replicate data across geographical locations in order
to mitigate the risk posed by a disaster. If data is replicated (synchronously) only
between nearby zones and a disaster strikes, there is a chance that both zones
are impacted, which leads to data loss and a service outage. Replicating data across zones that
are thousands of kilometers apart helps a service provider withstand such disasters. If a disaster
strikes one of the regions, the data is still available in another region and the service can be
moved to that location.
Asynchronous replication enables data to be replicated over distances of up to several thousand
kilometers between the primary zone and secondary zones (remote locations).
Asynchronous replication also mitigates the impact to the application’s response time
because the writes are acknowledged immediately to the compute system. In this case, the
required bandwidth can be provisioned equal to or greater than the average write workload.
Data can be buffered during times when the bandwidth is insufficient and moved later to
the remote zones. Therefore, adequate buffer capacity should be provisioned. RPO depends
on the size of the buffer, the available network bandwidth, and the write workload to the
source. Asynchronous replication implementations can take advantage of locality of
reference (repeated writes to the same location). If the same location is written multiple
times in the buffer prior to transmission to the remote zones, only the final version of the
data is transmitted. This feature conserves link bandwidth.
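The asynchronous path can be sketched similarly: the write is acknowledged as soon as it is committed to the source, changes are held in a buffer, and repeated writes to the same location are folded so that only the final version is transmitted later. This is a conceptual sketch under simple in-memory assumptions.

```python
# Sketch of asynchronous remote replication with write folding in a buffer.

source_volume, remote_volume = {}, {}
replication_buffer = {}            # keyed by location -> only the latest data is kept

def asynchronous_write(block_id, data):
    source_volume[block_id] = data
    replication_buffer[block_id] = data   # locality of reference: later writes overwrite
    return "write complete"               # acknowledged before remote transmission

def drain_buffer():
    """Runs periodically, or when bandwidth is available, to update the remote zone."""
    while replication_buffer:
        block_id, data = replication_buffer.popitem()
        remote_volume[block_id] = data    # RPO depends on how often this runs

asynchronous_write(7, "balance=100")
asynchronous_write(7, "balance=80")       # same location rewritten before transmission
asynchronous_write(9, "order=42")
drain_buffer()
print(remote_volume)                      # only the final version of block 7 was sent
```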

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 395
Mission-critical applications running on compute systems often require instant and
unlimited data recovery points. Traditional data protection technologies offer a limited
number of recovery points. If data loss occurs, the system can be rolled back only to the last
available recovery point. Mirroring offers continuous replication; however, if logical
corruption occurs to the production data, the error might propagate to the mirror, which
makes the replica unusable. Ideally, CDP provides the ability to restore data to any previous
PIT. It enables this capability by tracking all the changes to the production devices and
maintaining consistent point-in-time images. CDP enables operational recovery
(protection against human errors, data corruption, and virus attacks) through local
replication, and disaster recovery through remote replication. CDP minimizes both RPO and
RTO.
In CDP, data changes are continuously captured and stored in a separate location from the
primary storage. With CDP, recovery from data corruption poses no problem because it
allows going back to a PIT image prior to the data corruption incident. CDP uses a journal
volume to store all data changes on the primary storage. The journal volume contains all the
data that has changed from the time the replication session started. This enables rolling
back to a desired point in time and rapidly recovering data to avoid downtime. The amount
of space that is configured for the journal determines how far back the recovery points can
go. CDP uses an appliance and a write splitter. A CDP appliance is an intelligent hardware
platform that runs the CDP software and manages local and remote data replications. In
some implementations, instead of having a dedicated CDP appliance, the CDP software is
installed on a VM that manages replication. Write splitters intercept writes to the
production volume from the compute system and split each write into two copies. Write
splitting can be performed at the compute, fabric, or storage system.
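A minimal sketch of the write-splitter and journal behavior: each production write is split into two copies, one to the production volume and one to the CDP journal, so that data can be rolled back to any previous point in time. The names and the sequence-number journal are illustrative assumptions.

```python
# Sketch of a CDP write splitter and journal; sequence numbers stand in for time.
import itertools

production_volume = {}
journal = []                        # (sequence, block_id, data) entries
_sequence = itertools.count()

def split_write(block_id, data):
    production_volume[block_id] = data                    # copy 1: production volume
    journal.append((next(_sequence), block_id, data))     # copy 2: CDP appliance journal

def recover_to(point):
    """Replay the journal up to the chosen recovery point to rebuild the volume."""
    image = {}
    for seq, block_id, data in journal:
        if seq <= point:
            image[block_id] = data
    return image

split_write("fileA", "v1")
recovery_point = journal[-1][0]             # a PIT just before the corruption
split_write("fileA", "corrupted")           # e.g. logical corruption or accidental change
print(recover_to(recovery_point))           # {'fileA': 'v1'}
```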

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 396
The figure on the slide portrays CDP local and remote replication operations. Typically, the
replica is first synchronized with the source, and then the replication process starts. After
replication starts, all writes from the compute system to the source are split into two
copies. One copy is sent to the local CDP appliance at the primary zone (source site), and
the other copy is sent to the source volume. The local appliance then writes the data to the
journal at the source site, and the data is in turn written to the local replica. If a file is
accidentally deleted or corrupted, the local journal enables recovery of the
application data to any PIT.
In case of remote replication, after receiving the write, the local appliance at the source site
sends it to the appliance at the remote (DR) site. Then, the write is applied to the journal
volume at the remote site. As a next step, data from the journal volume is sent to the
remote replica at predefined intervals. CDP operates in either synchronous or asynchronous
mode. In the asynchronous mode, the local CDP appliance acknowledges a write as soon as
it is received. In the synchronous replication mode, the application waits for an
acknowledgment from the CDP appliance at the remote site before initiating the next write.
In the case of a disaster at the primary zone, data can be recovered to the required PIT and the
service can be restarted at the DR site.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 397
Facing an increased reliance on IT and the ever-present threat of natural or man-made
disasters, organizations need to rely on business continuity processes to mitigate the impact
of service disruptions. Traditional disaster recovery methods often require buying and
maintaining a complete set of IT resources at secondary data centers that matches the
business-critical systems at the primary data center. This includes sufficient storage to house
a complete copy of all of the enterprise’s business data by regularly replicating production
data on the mirror systems at the secondary site. This can be a complex and expensive
solution for a significant number of organizations.
Disaster Recovery-as-a-Service (DRaaS) has emerged as a solution to strengthen the
portfolio of a cloud service provider, while offering a viable DR solution to organizations. The
cloud service provider assumes responsibility for providing resources to enable
organizations to continue running their IT services in the event of disaster. From a consumer
perspective, having a DR site in the cloud reduces the need for data center space and IT
infrastructure, which leads to significant cost reductions, and eliminates the need for
upfront capital expenditure. Resources at the service provider can be dedicated to the
consumer or they can be shared. The service provider should design, implement, and
document a DRaaS solution specific to the consumer's infrastructure. They must conduct an
initial recovery test with the consumer to validate complete understanding of the
requirements and documentation of the correct, expected recovery procedures.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 398
During normal production operations, IT services run at the consumer’s production data
center. Data is replicated from the consumer's production environment to the service
provider's location over the network, as shown in the figure on the slide. Typically, the data is
encrypted and compressed at the production environment during replication to
improve data security and reduce network bandwidth requirements. In most cases, under
normal operating conditions a DRaaS implementation needs only a small
share of resources to synchronize application data and VM configurations from the
consumer's site to the cloud. The full set of resources required to run the application in the
cloud is consumed only if a disaster occurs.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 399
In the event of a business disruption or disaster, business operations fail over to the
provider's infrastructure, as shown in the figure on the slide. In such a case, users at the
consumer organization are redirected to the cloud. For applications or groups of
applications that require restart in a specific order, a sequence is worked out during initial
cloud setup for the consumer and recorded in the disaster recovery plan. Typically virtual
machines are allocated from a pool of compute resources located in the provider’s location.
Once the consumer's production data center is up and running, business operations are
returned to the consumer's environment; this is referred to as failback. Failback requires
replicating the updated data from the cloud repository back to the in-house production systems
before normal business operations resume at the consumer's location. After business
operations restart on the consumer's infrastructure, replication to the cloud is
re-established. To offer DRaaS, a service provider must have all the necessary resources and
technologies to meet the required service levels.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 400
This lesson covered local replication technologies such as snapshot and mirroring. This
lesson also covered synchronous and asynchronous remote replications. This lesson further
covered local and remote continuous data protection. Finally this lesson covered disaster
recovery as a service.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 401
This lesson covers an overview of resilient cloud applications. It also covers the key
design strategies for application resiliency and monitoring applications for availability.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 402
The application itself is one of the biggest factors affecting application or service availability. Cloud
infrastructures are typically built on a large number of commodity systems to achieve
scalability and keep hardware costs down. In this environment, it is assumed that some
components will fail. Therefore, the design of a cloud application often has to anticipate the
failure of individual resources to ensure an acceptable availability of the application.
For existing applications, the code has to be rewritten to make them "cloud-ready," i.e., to give the
application the required scalability and resiliency. A reliable application is able
to properly manage the failure of one or more modules and continue operating properly. If a
failed operation is retried a few milliseconds later, the operation may succeed. These types
of error conditions are called transient faults. Fault-resilient applications have logic to
detect and handle transient fault conditions to avoid application downtime. The next few
slides discuss the key design strategies for improving application availability.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 403
Graceful degradation of application functionality refers to the ability of an application to
maintain limited functionality even when some of the components, modules or supporting
services are not available. A well designed application or service typically uses a collection of
loosely coupled modules that communicate with each other. A cloud application in particular
requires separation of concerns at the module level, so that an outage of a dependent
service or module does not bring down the entire application. The purpose of graceful
degradation of application functionality is to prevent the complete failure of a business
application or service. For example, consider an e-commerce application that consists of
modules such as product catalog, shopping cart, order status, order submission, and order
processing. Assume that the payment gateway is unavailable due to some problem; the
order processing module of the application then cannot continue. If the application
or service is not designed to handle this scenario, the entire application might go offline.
However, in this same scenario, the product catalog module can still be
available to consumers for browsing products, and the application can still allow consumers to
place orders into the shopping cart. This provides the ability to process the
orders when the payment gateway becomes available again or after failing over to a secondary payment
gateway.
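The e-commerce scenario can be sketched as a set of loosely coupled modules in which an unavailable payment gateway degrades, rather than disables, the application: browsing keeps working and orders are held for later processing. All module names and the failure flag below are illustrative assumptions.

```python
# Sketch of graceful degradation: the catalog and cart keep working while the
# payment gateway is down, and orders are queued for later processing.

class PaymentGatewayUnavailable(Exception):
    pass

PAYMENT_GATEWAY_UP = False          # simulate an outage of a dependent service
catalog = {"sku-1": "laptop", "sku-2": "monitor"}
pending_orders = []                 # orders held until the gateway (or a secondary) is back

def view_catalog():
    return catalog                  # unaffected module: still available to consumers

def charge(order):
    if not PAYMENT_GATEWAY_UP:
        raise PaymentGatewayUnavailable()
    return f"charged {order}"

def submit_order(order):
    try:
        return charge(order)
    except PaymentGatewayUnavailable:
        pending_orders.append(order)              # degrade instead of failing outright
        return "order accepted; payment will be processed when the gateway recovers"

print(view_catalog())
print(submit_order({"sku-1": 1}))
print(pending_orders)
```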

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 404
A key mechanism in a highly available application design is to implement retry logic within the
application code to handle services that are temporarily unavailable. When applications use other
cloud-based services, errors can occur because of temporary conditions such as intermittent service,
infrastructure-level faults, or network issues. Very often, this type of problem can be resolved
by retrying the operation a few milliseconds later, and the operation may then succeed. The
simplest form of transient fault handling is to implement this retry logic in the application
itself. To implement retry logic in an application, it is important to detect and identify
whether a particular exception is likely to be caused by a transient fault condition. Also, a retry
strategy must be defined that states how many retries can be attempted before deciding that
the fault is not transient, and what the intervals between retries should be. The
logic typically attempts to execute the action a certain number of times, registering an
error and utilizing a secondary service if the fault continues.
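A minimal sketch of such a retry strategy is shown below: the operation is retried a fixed number of times, with an interval between attempts, only for exceptions classified as transient; after the attempts are exhausted, the error is raised so it can be registered or a secondary service used. The decorator and exception names are illustrative assumptions.

```python
# Sketch of transient-fault retry logic with a bounded number of attempts.
import time
import functools

class TransientError(Exception):
    """An error expected to clear if the operation is retried shortly."""

def retry(max_attempts=3, interval=0.05):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_attempts:
                        raise              # not transient after all: register the error
                    time.sleep(interval)   # wait before the next attempt
        return wrapper
    return decorator

calls = {"count": 0}

@retry(max_attempts=3, interval=0.01)
def call_cloud_service():
    calls["count"] += 1
    if calls["count"] < 3:                 # simulate an intermittent service
        raise TransientError("service temporarily unavailable")
    return "ok"

print(call_cloud_service(), "after", calls["count"], "attempts")
```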

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 405
In a stateful application model, the session state information of an application (for example
user ID, selected products in a shopping cart, and so on) is usually stored in compute system
memory. However, information stored in the memory can be lost if there is an outage with
the compute system where the application runs. In a stateless application model, the state
information is stored outside of memory, usually in a repository (database). If a
VM running the application instance fails, the state information is still available in the
repository. A new application instance is created on another VM, which can access the state
information from the database and resume processing.
An event-driven application is a program written to respond to actions generated by the
user or system. In a tightly integrated application environment, user requests are processed
by a particular application instance running on a server through synchronous calls. If that
particular application instance is down, the user request will not be processed. For cloud
applications, an important strategy for high availability design is to insert user requests into
a queue and code applications to read requests from the queue (asynchronously) instead of
synchronous calls. This allows multiple applications instances to process requests from the
queue. This also enables adding multiple application instances to process the workload
much faster to improve performance. Further, if an application instance is lost, the impact is
minimal, which could be a single request or transaction. The remaining requests in the
queue continue to be distributed to other available instances. For example, in an e-commerce
application, simultaneous requests from multiple users for placing orders are loaded into a
queue and the application instances running on multiple servers process the orders
(asynchronously).
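A minimal sketch of this queue-based, event-driven pattern, using Python's standard queue and threading modules: requests are placed on a queue and several stateless worker instances consume them asynchronously, keeping any state in a shared repository. The names and the in-memory "repository" are illustrative assumptions.

```python
# Sketch of stateless, event-driven processing: multiple worker instances read
# order requests from a shared queue and store state in a repository.
import queue
import threading

order_queue = queue.Queue()
state_store = {}                    # stateless workers keep session state in a repository
lock = threading.Lock()

def worker(instance_id):
    while True:
        order = order_queue.get()
        if order is None:           # shutdown signal
            break
        with lock:
            state_store[order["id"]] = f"processed by instance {instance_id}"
        order_queue.task_done()

# Multiple application instances read from the same queue.
workers = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for w in workers:
    w.start()

for order_id in range(6):           # simultaneous order requests from consumers
    order_queue.put({"id": order_id})

order_queue.join()
for _ in workers:
    order_queue.put(None)
for w in workers:
    w.join()
print(state_store)
```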

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 406
A specialized monitoring tool can be implemented to monitor the availability of application
instances that run on VMs. This tool adds a layer of application awareness to the core high
availability functionality offered by compute virtualization technology. The monitoring tool
communicates directly with the VM management software and conveys the application health
status in the form of an application heartbeat. This allows the high availability functionality of
the VM management software to automatically restart a virtual machine instance if the
application heartbeat is not received within a specified interval. Under normal circumstances,
the resources that comprise an application are continuously monitored at a given interval to
ensure proper operation. If the monitoring of a resource detects a failure, the tool attempts to
restart the application within the virtual machine. The number of attempts that will be
made to restart an application is configurable by the administrator. If the application does
not restart successfully, the tool communicates with the high availability functionality of the VM
management software through an API in order to trigger a reboot of the virtual machine. The
application is restarted as part of this reboot process. This integration between the
application monitoring tool and VM high availability solutions protects VMs, as well as the
applications that run inside them.
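The sketch below (not from the original course) illustrates this restart-then-escalate behavior in Python. The function names, restart limit, heartbeat check, and VM record are hypothetical stand-ins for a real monitoring tool and VM management API.

```python
MAX_RESTART_ATTEMPTS = 3        # configurable by the administrator
HEARTBEAT_INTERVAL = 5          # seconds between heartbeat checks (assumed)


def heartbeat_received(app):
    """Hypothetical check for the application heartbeat inside the VM."""
    return app.get("healthy", False)


def restart_application(app):
    """Hypothetical in-guest restart attempt; returns True on success."""
    print(f"Attempting to restart application on {app['vm']}")
    return app.get("restartable", False)


def request_vm_reboot(app):
    """Hypothetical call to the VM management software's HA API."""
    print(f"Requesting HA reboot of {app['vm']}; application restarts with it")


def monitor(app):
    if heartbeat_received(app):
        return                                  # normal operation
    for attempt in range(1, MAX_RESTART_ATTEMPTS + 1):
        if restart_application(app):
            print(f"Application recovered after {attempt} attempt(s)")
            return
    request_vm_reboot(app)                      # escalate to VM-level HA


# Example: an unhealthy application that cannot be restarted in-guest.
monitor({"vm": "VM-7", "healthy": False, "restartable": False})
```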

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 407
This lesson covered an overview of resilient cloud applications. This lesson also covered the
key design strategies for application resiliency, such as graceful degradation of application
functionality, retry logic in application code, the stateless application model, and event-driven
processing. This lesson finally covered monitoring applications for availability.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 408
The Concepts in Practice section covers EMC backup and deduplication products such as
NetWorker, Avamar, and Data Domain. This section also covers key EMC replication products
such as VNX Snapshot, SnapView, TimeFinder, SRDF, and RecoverPoint. Further, this section
covers VMware BC solutions including vCenter Site Recovery Manager, HA, FT, vMotion, and
Storage vMotion.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 409
EMC NetWorker is backup and recovery software that centralizes, automates, and
accelerates data backup and recovery operations. Following are the key features of EMC
NetWorker:
• Supports heterogeneous platforms such as Windows, UNIX, Linux, and also virtual
environments.
• Supports different backup targets – tapes, disks, Data Domain purpose-built backup
appliance and virtual tapes.
• Supports multiplexing (or multi-streaming) of data.
• Provides both source-based and target-based deduplication capabilities by integrating
with EMC Avamar and EMC Data Domain, respectively.
• The cloud-backup option in NetWorker enables backing up data to public cloud
configurations.
EMC Avamar is a disk-based backup and recovery solution that provides inherent source-
based data deduplication. With its unique global data deduplication feature, Avamar differs
from traditional backup and recovery solutions by identifying and storing only unique sub-
file data. EMC Avamar provides a variety of options for backup, including guest OS-level
backup and image-level backup. The three major components of an Avamar system include
Avamar server, Avamar backup clients, and Avamar administrator. Avamar server provides
the essential processes and services required for client access and remote system
administration. The Avamar client software runs on each compute system that is being
backed up. Avamar administrator is a user management console application that is used to
remotely administer an Avamar system. The two Avamar server editions include Avamar
Data Store and Avamar Virtual Edition.

Copyright © 2014 EMC Corporation. All rights reserved (Contd.)


Module 7: Business Continuity 410
The EMC array-based local replication solutions include VNX Snapshot, SnapView and
TimeFinder. The array-based remote replication solutions include Symmetrix SRDF. EMC
RecoverPoint is the network-based local and remote continuous data protection solution.
VNX Snapshot is a point-in-time copy of a source LUN using redirect on first write
methodology. This functionality differs significantly from copy on first write used by
SnapView. Redirect on first write technology increases write performance. VNX Snapshot
provides point-in-time data copies for backups, testing, decision support, and data recovery.
SnapView is EMC VNX array-based local replication software that creates a pointer-based
virtual copy and a full-volume mirror of the source using SnapView snapshot and SnapView
clone, respectively.
TimeFinder is EMC Symmetrix array-based local replication software that creates a
pointer-based virtual copy and a pointer-based full-volume replica of the source using
TimeFinder/Snap and TimeFinder/Clone, respectively. TimeFinder VP Snap creates point-in-
time replicas that are conceptually similar to those created by TimeFinder/Snap, but both the
source and target devices must be thin.
SRDF (Symmetrix Remote Data Facility) offers a family of technology solutions to
implement storage array-based remote replication. The key SRDF family of software includes
SRDF/Synchronous, SRDF/Asynchronous, and SRDF/Star.
EMC RecoverPoint is a high-performance, single product that provides both local and
remote continuous data protection. It provides fast recovery of data and enables users to
access the data for any previous point in time. RecoverPoint uses lightweight splitting
technology to mirror a write. RecoverPoint-integrated WAN bandwidth reduction
technology utilizes compression to optimize network resource utilization during remote
replication.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 412
VMware vCenter Site Recovery Manager is a VMware tool that makes disaster recovery
rapid, reliable, and manageable so that organizations can meet their recovery objectives.
Site Recovery Manager provides a simple interface for setting up recovery plans that are
coordinated across all infrastructure layers, replacing traditional, error-prone, manual
recovery processes. Recovery plans can be tested non-disruptively as frequently as required
to ensure that they meet business objectives. At the time of a site failover or migration, Site
Recovery Manager automates both failover and failback processes, ensuring fast and highly
predictable RPO and RTO. Site Recovery Manager integrates tightly with an underlying
replication product, vSphere and vCenter Server to automate end-to-end recovery
processes.
VMware FT provides continuous availability for applications in the event of server failures by
creating a live shadow instance of a virtual machine that is in virtual lockstep with the
primary instance. VMware FT is used to prevent application disruption due to hardware
failures. Downtime associated with mission-critical applications can be very expensive and
disruptive to businesses. By allowing instantaneous failover between the two instances in
the event of hardware failure, FT eliminates even the smallest chance of data loss or
disruption.
VMware HA provides high availability for applications running in virtual machines. In the
event of the physical compute system failure, affected VMs are automatically restarted on
other compute systems. VMware HA minimizes unplanned downtime and IT service
disruption while eliminating the need for dedicated standby hardware and installation of
additional software.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 413
VMware vMotion enables the live migration of running virtual machines from one physical
server to another without any downtime. vMotion is capable of migrating VMs running any
OS across any type of hardware and storage supported by VMware ESXi. It supports VM
migration across different versions and generations of hardware, thus helping users to
migrate VMs from older servers to newer servers without disruption or downtime.
VMware Storage vMotion enables live migration of virtual machine disk files within and
across storage arrays with no downtime or disruption in service. Storage vMotion relocates
virtual machine disk files from one shared storage location to another shared storage
location with zero downtime. Storage vMotion enables organizations to perform proactive
storage migrations, simplify array migrations, improve virtual machine storage performance
and free up valuable storage capacity.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 414
This module covered the importance of business continuity in a cloud environment and how
business continuity enables organizations to achieve the required service availability. This module
also covered various fault tolerance mechanisms for cloud infrastructure that eliminate single points
of failure. This module further covered backup, deduplication, and replication. Finally, this
module covered the key design strategies for cloud application resiliency, including graceful
degradation of application functionality, retry logic in application code, the stateless application
model, and event-driven processing.

Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 415
Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 416
Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 417
Copyright © 2014 EMC Corporation. All rights reserved Module 7: Business Continuity 418
This module focuses on key security threats to the cloud infrastructure. The module further
describes various security mechanisms that enable cloud service providers to mitigate
these threats. Finally, the module describes governance, risk, and compliance (GRC) aspects of the cloud environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 419
Although cloud computing offers several benefits, such as flexibility, scalability, and agility,
security is one of the key concerns that consumers have while adopting cloud services.
Therefore, it is important for a service provider to deploy the required security tools and
mechanisms to offer a secure cloud environment to their consumers. The fundamental
requirements of information security and compliance pertain to both non-cloud and cloud
infrastructure management. In both environments there are some common security
requirements. However, in a cloud environment there are important additional factors,
which a service provider must consider, that arise from information ownership,
responsibility and accountability for information security, and the cloud infrastructure’s
multi-tenancy characteristic. Therefore, providing secure multi-tenancy is a key requirement
for building a cloud infrastructure.
Apart from multi-tenancy, cloud infrastructure provides rapid elasticity, a feature rarely
found in traditional data centers. Therefore, tools used to provide information security must
have the ability to detect newly provisioned resources and integrate with these scaled
resources to provide security. Without these capabilities, it is difficult to monitor and
manage the security of such an environment.
The security mechanisms deployed in the cloud environment provide protection not only to the
five layers (physical, virtual, control, service orchestration, and service layers), but also to
the two cross-layer functions (service management and business
continuity).
Apart from security mechanisms, service providers also need to adopt governance, risk,
and compliance (GRC) processes that enable them to ensure that their actions
are ethically correct and in accordance with their risk appetite (the risk level a service
provider chooses to accept), internal policies, and external regulations. This also enables the
service provider to meet the GRC requirements of their consumers.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 420
This lesson covers the key information security terminologies.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 421
Today, many organizations (consumers) run their applications in a cloud environment to take
advantage of the capabilities discussed throughout this course. As organizations
(consumers) adopt cloud, one of the key concerns they have is trust. Trust depends on the
degree of control and visibility available to the information’s owner.
Information is an organization’s most valuable asset. Traditionally, organizations deploy
various tools within their infrastructure to protect the asset. These tools must be deployed
on various infrastructure assets, such as compute (processes information), storage (stores
information), and network (carries information), to protect the information. In a cloud
environment, security management enables cloud service providers to protect vital
information of the consumers stored in the cloud. Managing the security of the
infrastructure, which is an on-going effort, has become increasingly important for cloud
service providers.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 422
Information security is a term that includes a set of practices that protect information and
information systems from unauthorized access, use, information disclosure, disruption,
modification, or destruction.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 423
This slide lists the key information security terminologies, which are described in the next
few slides.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 424
The goal of information security is to provide confidentiality, integrity, and availability,
commonly referred to as the security triad, or CIA. Confidentiality provides the required
secrecy of information and ensures that only authorized users have access to data. Integrity
ensures that unauthorized changes to information are not allowed. The objective of
ensuring integrity is to detect and protect against unauthorized alteration or deletion of
information. Availability ensures that authorized users have reliable and timely access to
compute, storage, network, application, and data resources.
Ensuring confidentiality, integrity, and availability is the primary objective of any IT security
implementation. This is supported through the use of authentication, authorization and
auditing processes.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 425
Authentication is a process to ensure that users or assets are who they claim to be. A user
may be authenticated by a single-factor or multi-factor method. Single-factor authentication
involves the use of only one factor, such as a password. Multi-factor authentication uses
more than one factor to authenticate a user (discussed later in this module).
Authorization refers to the process of determining whether and in what manner a user,
device, application, or process is allowed to access a particular service or resource. For
example, a user with administrator’s privileges is authorized to access more services or
resources compared to a user with non-administrator (for example, read-only) privileges.
Authorization should be performed only if authentication is successful.
The most common authentication and authorization mechanisms used in a data center and
cloud environment are Windows Access Control List (ACL), UNIX permissions, Kerberos, and
Challenge-Handshake Authentication Protocol (CHAP). It is essential to verify the
effectiveness of security mechanisms that are deployed with the help of auditing.
Auditing refers to the logging of all transactions for the purpose of assessing the
effectiveness of security mechanisms. It helps to validate the behavior of the infrastructure
components, and to perform forensics, debugging and monitoring activities.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 426
A cloud service provider should deploy multiple layers of defense throughout the
infrastructure to mitigate the risk of security threats in case one layer of the defense is
compromised. This strategy is referred to as defense-in-depth. This strategy may also be
thought of as a “layered approach to security” because there are multiple measures for
security at different levels. Defense-in-depth increases the barrier to exploitation – an
attacker must breach each layer of defenses to be successful – and thereby provides
additional time to detect and respond to an attack. This potentially reduces the scope of a
security breach. However, the overall cost of deploying defense-in-depth is often higher
compared to single-layered security mechanisms. An example of defense-in-depth could
be a virtual firewall installed on a hypervisor when there is already a network-
based firewall deployed within the same environment. This reduces the chance of
compromising security of the hypervisor’s environment due to a successful breach of the
network-level firewall.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 427
While developing a defense-in-depth strategy, the set of hardware and software
components that are critical to the security of the cloud infrastructure – termed the trusted
computing base (TCB) – must be considered. Vulnerabilities occurring inside the TCB might
jeopardize the security of the entire system. TCB essentially defines a boundary for security-
critical and non-critical parts of an information system. Understanding the security concerns
and threats associated with the cloud environment, and a specific environment’s TCB,
requires understanding the relationships and dependencies between the cloud computing
entities.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 428
As multi-tenancy enables multiple consumers to share the same set of resources, this
consequently increases the concern about security risk to data confidentiality, integrity, and
availability. This gives rise to a potential security concern about data leakage for the
consumers because it makes their private data vulnerable to theft, manipulation, or
destruction. Also, if a virtual machine in a multi-tenant environment is compromised, it
could increase the security risk to other virtual machines running on the same compute
system. Secure multi-tenancy must be deployed to mitigate these concerns.
Secure multi-tenancy requires mechanisms that prevent any tenant from accessing another
tenant’s information. It also requires mechanisms that prevent one tenant’s process from
affecting another tenant’s process. Cloud service providers need to understand and address
the security implications of multi-tenancy to ensure that the relevant security mechanisms
are in place. Some examples of such security mechanisms discussed later in this module are
LUN masking, VLAN, and VSAN. Among the many issues to be handled when planning for
security in a multi-tenant environment, four key areas must be considered: secure
separation, availability, service assurance, and management.
1. Secure separation enables isolation of resources and services across various consumers
in a multi-tenant environment. For example, at the storage layer, secure separation
provides basic requirements such as separation of data at-rest (data that is stored on a
storage device), address space separation, and separation of data access.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 429
Velocity-of-attack refers to a situation in which an existing security threat in a cloud may
spread rapidly and have a large impact. A cloud infrastructure typically has a large number of
compute, storage, and network infrastructure components spanning geographic boundaries.
The typical cloud environment also features homogeneity and standardization in the
platforms and components, such as hypervisors, virtual machines file formats, and guest
operating systems. These factors can amplify security threats and allow them to spread
quickly. Mitigating velocity-of-attack is difficult in a cloud environment. Due to potentially
high velocity-of-attack, providers must employ strong and robust security enforcement and
containment mechanisms.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 431
Information assurance ensures confidentiality, integrity, and availability of consumers’ data
in the cloud. In particular, cloud consumers need assurance that all the consumers operating
on the cloud do so legitimately, accessing only data to which they have rights and only to
the degree that policies and their roles permit. Although information assurance is primarily
a consumer's concern, cloud providers should deploy strong authentication and
authorization mechanisms to validate that all consumers operating in the cloud are genuine and
have the right level of access to resources. Further, providers should build a resilient cloud
infrastructure to ensure service availability: an infrastructure that has the ability to
withstand certain types of failure and yet remain fully functional. Resilient cloud
infrastructure is usually built by deploying business continuity.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 432
A provider needs to ensure that any sensitive data, which includes personally identifiable
information (PII) about its consumers (or their customers or users), is legally protected from
any unauthorized disclosure. Consumers must verify that providers have deployed robust
mechanisms to ensure privacy of their data. Some of the mechanisms that can be deployed
by providers to ensure data privacy include encryption of data at-rest and in-transit, and
data shredding.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 433
The ownership of data is influenced by factors such as copyright and contract, and may
depend on the jurisdiction in which the data is created and stored. There are two basic
scenarios. In the case of data that is created on-premise and then stored in the cloud, data
ownership clearly remains with the creator based on various factors such as contractual
ownership, data privacy, copyright law, trade secret, and intellectual property. However,
when data is actually created in the cloud environment, the determination of who owns the
data primarily depends on terms of services (defined in service contract). There are other
factors that may govern the ownership of data, such as the type of information and the
country in which it is generated and stored. This can become even more complex if a
provider has outsourced the infrastructure. In all cases, the cloud service provider must
ensure that consumers always own their data. Further, there needs to be complete assurance
that all regulated information and intellectual property (IP) generated by the consumer is
solely owned by the consumer, whether it is created in the cloud or migrated there later.
Cloud service providers also need to ensure that the contract specifies the country in which
data will be stored and that ownership is retained by consumers.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 434
This slide illustrates the relationship among various security concepts in a cloud environment.
Cloud service providers and consumers want to safeguard their assets from threat agents
(attackers) who seek to abuse them. Risk arises when a threat agent (an attacker) is likely
to exploit a vulnerability. Therefore, cloud service providers and
consumers deploy various countermeasures to minimize risk by reducing vulnerabilities.
Risk assessment is the first step to determine the extent of potential threats and risks in a
cloud infrastructure. The process assesses risk and helps to identify appropriate controls to
mitigate or eliminate risks. To the extent possible, cloud service providers must apply their
normal information security and risk-management policies and standards to their cloud
infrastructure. Some of the key security areas cloud service providers must focus on while
building the infrastructure are: authentication, identity and access management, data loss
prevention and data breach notification, governance, risk, and compliance (GRC), privacy,
network monitoring and analysis, security information and event logging, incident
management, and security management. These security areas are covered later in the
module.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 435
This lesson covered the key information security terminologies such as confidentiality,
integrity, availability, authentication, authorization, auditing, defense-in-depth, trusted
computing base, multi-tenancy, velocity of attack, information assurance, and data
ownership and privacy.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 436
This lesson covers the top threats in a cloud environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 437
According to the Cloud Security Alliance (CSA) and European Network and Information
Security Agency (ENISA), the top threats in a cloud environment are: data leakage, data loss,
account hijacking, insecure APIs, denial of service, malicious insiders, abuse of cloud
services, insufficient due diligence, shared technology vulnerabilities, and loss of governance
and compliance. Apart from describing these threats, this lesson also recommends various
control mechanisms that a cloud service provider must consider while building cloud
infrastructure to mitigate the threats.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 438
Data leakage occurs when an unauthorized entity (an attacker) gains access to a cloud
consumer’s confidential data stored on the cloud infrastructure. An attacker may gain
unauthorized access to consumers’ confidential data in a variety of ways such as a
compromised password database, poor application design, poor segregation of network
traffic, poor encryption implementation, or through a malicious insider. When an attacker
compromises the database that stores consumers’ passwords, it allows the attacker to
impersonate the consumer or other legitimate user, thus exposing the consumer’s data. To
mitigate the risk of such data leakage, providers may deploy a multi-factor authentication
technique. When multi-factor authentication is used, a compromised password database
will expose only part of the authorization credential set, thus reducing the likelihood of a
breach or attack.
Consider the example of a cross-VM side-channel attack used to gain access to a consumer's
information. This attack is carried out in two steps: VM placement and information
extraction.
1. In the VM placement step, a malicious VM is placed on a compute system on which the
target virtual machine is running. One of the ways to accomplish that is by creating
multiple virtual machines. Due to load balancing techniques employed by the hypervisor,
new virtual machines may be placed on different compute systems. The attacker then
probes each virtual machine created to identify its neighbors. The process is repeated
until the target virtual machine is identified, allowing for the second step to be carried
out.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 439
Data loss can occur in the cloud due to various reasons other than malicious attacks. Some
of the causes of data loss may include accidental deletion by the provider or destruction
resulting from natural disasters. The provider is often responsible for data loss resulting
from these causes and appropriate measures such as data backup can reduce the impact of
such events. Providers must publish the protection mechanisms deployed to protect the
data stored in cloud. Also, providers must ensure appropriate terms and conditions related
to data loss and the associated penalties as part of the service contract. The service contract
should also include various BC/DR options, such as backup and replication, offered to the
consumers.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 441
Account hijacking refers to a scenario in which an attacker gains access to consumers’
accounts using methods such as phishing or installing password-logging malware on
consumers’ virtual machines.
Phishing is an example of a social engineering attack that is used to deceive users. Phishing
attacks are typically carried out by spoofing email — an email with a fake but genuine-
appearing address, which provides a link to a website that masquerades as a legitimate
website. After opening the website, users are asked to enter details such as their login
credentials. These details are then captured by the attacker to take over the user’s account.
For example, a consumer may receive an email that is designed to appear as if the provider
had sent it. This email may ask the consumer to click the link provided in the email and
update their details. After clicking the link, the user is directed to a malicious website
where their details are captured.
Another way to gain access to a user’s credentials is by installing password-logging malware.
In this attack, the attacker installs malware in the consumer’s virtual machine which
captures user credentials and sends them to the attacker. After capturing the credentials an
attacker can use them to gain access to the consumer’s account. The attacker may then
eavesdrop on the user’s activities and transactions, manipulate data, return falsified
information, and redirect clients to illegitimate sites. The attacker may also use the identity
of the stolen account to launch subsequent attacks.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 442
Application programming interfaces (APIs) are used extensively in cloud infrastructures to
perform activities such as resource provisioning, configuration, monitoring, management,
and orchestration. These APIs may be open or proprietary. The security of cloud services
depends upon the security of these APIs. An attacker may exploit vulnerability in an API to
breach a cloud infrastructure’s perimeter and carry out an attack. Therefore, APIs must be
designed and developed following security best practices such as: requiring authentication
and authorization, and avoiding buffer overflows. Security review of the APIs must be
performed by providers. Also, access to the APIs must be restricted to authorized users.
These practices provide protection against both accidental and malicious attempts to bypass
security.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 444
A Denial of Service (DoS) attack prevents legitimate users from accessing resources or
services. DoS attacks can be targeted against compute systems, networks, or storage
resources. In all cases, the intent of DoS is to exhaust key resources, such as network
bandwidth or CPU cycles, thereby preventing production use. For example, an attacker may
send massive quantities of data to the target with the intention of consuming bandwidth.
This prevents legitimate consumers from using the bandwidth. Such an attack may also be
carried out by exploiting weaknesses of a communications protocol. For example, an
attacker may cause DoS to a legitimate consumer by resetting TCP sessions or corrupting the
DNS server cache.
Consider another example of DoS attack specific to cloud environments where consumers
are billed based on resource utilization. While conducting a network based DoS attack to
prevent the consumer from using cloud services, the attacker can consume CPU cycles and
storage capacity, resulting in non-productive expenses for the legitimate consumer. Apart
from DoS attack, an attacker may also carry out Distributed DoS attack.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 445
A Distributed DoS (DDoS) attack is a variant of a DoS attack in which several systems launch a
coordinated, simultaneous DoS attack on their target(s), thereby causing denial of service to
the users of the targeted system(s). In a DDoS attack, the attacker is able to multiply the
effectiveness of the DoS attack by harnessing the resources of multiple collaborating
systems which serve as attack platforms. Typically, a DDoS master program is installed on
one compute system. Then, at a designated time, the master program communicates to a
number of "agent" programs installed on compute systems. When the agents receive the
command, they initiate the attack.
The principal control mechanism that can minimize the impact of DoS and DDoS attack is to
impose restrictions and limits on network resource consumption. For example, when it is
identified that the amount of data being sent from a given IP address exceeds the
configured limits, the traffic from that IP address may be blocked. This provides a first line of
defense. Further, restrictions and limits may be imposed on resources consumed by each
compute system, providing an additional line of defense.
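The sketch below (not from the original course) shows one possible way of imposing per-source traffic limits as a first line of defense. The window size, byte limit, and IP addresses are assumptions made purely for illustration.

```python
import collections
import time

WINDOW_SECONDS = 1.0            # measurement window (assumed)
MAX_BYTES_PER_WINDOW = 10_000   # per-source traffic limit (assumed)

traffic = collections.defaultdict(list)   # source IP -> [(timestamp, bytes)]


def allow_packet(src_ip, size, now=None):
    """Block traffic from a source IP that exceeds the configured limit."""
    now = time.monotonic() if now is None else now
    history = traffic[src_ip]
    # Drop records that fall outside the current window.
    history[:] = [(t, b) for (t, b) in history if now - t <= WINDOW_SECONDS]
    if sum(b for _, b in history) + size > MAX_BYTES_PER_WINDOW:
        return False                       # exceeds the limit: block traffic
    history.append((now, size))
    return True


# A well-behaved source is allowed; a flooding source is blocked.
print(allow_packet("203.0.113.5", 2_000))    # True
print(allow_packet("203.0.113.9", 50_000))   # False
```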

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 446
Today, most cloud service providers are aware of the security threats posed by outsiders.
Countermeasures such as firewalls, malware protection software, and intrusion detection
systems can minimize the risk of attacks from outsiders. However, these measures do not
reduce the risk of attacks from malicious insiders. According to the Computer Emergency
Response Team (CERT), a malicious insider could be an organization's (cloud service
provider's) current or former employee, contractor, or other business partner who has or
had authorized access to a cloud service provider's compute systems, network, or storage.
These malicious insiders may intentionally misuse that access in ways that negatively impact
the confidentiality (data leakage), integrity, or availability of the cloud service provider’s
information or resources.
For example, consider a former employee of a cloud service provider who had access to the
cloud resources. This malicious insider may be aware of security weaknesses in that cloud
infrastructure. This is a serious threat because the malicious insider may exploit the security
weakness and can impact confidentiality, integrity, and availability. Control measures that
can minimize the risk due to malicious insiders include strict access control policies,
disabling employee accounts immediately after separation from the company, security
audit, encryption, and segregation of duties (role-based access control, which is discussed
later in this module). Also, a background investigation of a candidate before hire is another
key measure that can reduce the risk due to malicious insiders.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 447
As discussed throughout this course, cloud computing offers several advantages to service
providers and consumers. However, these cloud resources can be misused to perform
unauthorized or illicit activities. For example, an attacker might require years to crack an
encryption key using a personal compute system. However, the same key might be cracked
in minutes or hours by using the extensive computing resources available from a cloud-
based infrastructure. Alternately, an attacker may use cloud computing resources to perform
illegal activities such as distributing pirated software. This could place providers and
consumers in legal jeopardy. These types of threats are difficult to mitigate merely with the
help of tools. However, service providers can establish an agreement with consumers that
includes guidelines for acceptable use of cloud resources.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 448
Cloud service providers should exercise due diligence to understand the full scope of the
undertaking when offering cloud services. For example, in a hybrid cloud environment,
where a cloud service provider connects to one or more cloud infrastructures to leverage
their capabilities, a complete understanding of operational responsibilities is required. These
responsibilities include incident response, encryption, governance, compliance, and security
monitoring. Insufficient due diligence towards understanding these responsibilities may
increase risk levels.
Similarly, understanding operational responsibilities is essential when a service provider
acts as a broker, connecting to multiple cloud service providers to integrate their
capabilities and offer services to consumers. For example, a cloud service broker that offers
services using resources from multiple cloud service providers that do not meet the security
requirements of the consumers may expose the consumers' data to increased risk. This risk can
be minimized by thoroughly understanding and evaluating the cloud providers’ services and
their terms, and ensuring they provide security controls that can meet the consumers’
security requirements. Further, it is important to understand the consumers’ risk profile to
ensure that the risks involved are within acceptable levels.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 449
Technologies that are used to build a cloud infrastructure provide a multi-tenant
environment enabling the sharing of resources. Multi-tenancy is achieved by using
mechanisms that provide separation of computing resources such as memory and storage
for each consumer. Failure of these mechanisms may expose the consumer’s data to other
consumers, raising security risks. Compromising a hypervisor is a serious event because it
exposes the entire environment to potential attacks. Hyperjacking is an example of this
type of attack in which the attacker installs a rogue hypervisor that takes control of the
compute system. The attacker now can use this hypervisor to run unauthorized virtual
machines in the cloud and carry out further attacks. Detecting this attack is difficult and
involves examining components such as program memory and the processor core registers
for anomalies. It may be possible to prevent hyperjacking by securing components that are
part of the trusted computing base.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 450
Loss of compliance may occur when a cloud service provider or cloud broker does not
adhere to, and demonstrate adherence to, external laws and regulations as well as
corporate policies and procedures.
For example, there are regulations that mandate that businesses perform vulnerability
assessment when dealing with certain types of data. Vulnerability assessment is aimed at
discovering potential security vulnerabilities in the environment by “scanning” the compute,
storage, and network resources. Payment Card Industry (PCI) compliance is an example that
governs the guidelines for handling credit card data. A cloud broker that offers services using
resources from multiple cloud service providers may not be able to allow a consumer to perform
a vulnerability assessment. This is because performing a vulnerability assessment may be
prohibited by the cloud providers participating in the broker service through their contract
terms, since any scan employed has the potential to disrupt services to other consumers. In
such a scenario, the cloud broker and consumers have to rely on the results of the
vulnerability assessments performed by the cloud service providers.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 451
The effectiveness of governance processes (which determine the purpose, strategy, and
operational rules by which companies are directed and managed) may be diminished when
the cloud provider outsources some or all of its services to third parties, including cloud
brokers. In such a scenario, the provider may not have control over many aspects that have been
outsourced, which may impact the commitments given by the provider. Also, the
security controls of the third party may change, which may impact the terms and conditions
of the provider. Further, if a third party is unable to supply evidence of meeting the
provider's compliance requirements, this may put the provider at risk.
To better prepare for the threats discussed in this lesson, cloud service providers deploy
various security mechanisms. These security mechanisms are discussed next.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 452
This lesson covered key security threats in a cloud environment. These threats are data
leakage, data loss, account hijacking, insecure APIs, malicious insiders, denial of service,
abuse of cloud services, shared technology vulnerabilities, insufficient due diligence, loss of
compliance, and loss of governance.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 453
This lesson covers physical security and identity and access management deployed in a cloud
environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 454
Any security mechanism should account for three aspects: people, process, and technology,
and the relationships among them. Security mechanisms can be administrative or technical.
Administrative mechanisms include security and personnel policies or standard procedures
to direct the safe execution of various operations. Technical mechanisms are usually
implemented through tools or devices deployed on the IT infrastructure.
To protect cloud infrastructure, various technical security mechanisms must be deployed at
the compute, network, and storage levels.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 455
At the compute system level, security mechanisms are deployed to secure hypervisors and
hypervisor management systems, virtual machines, guest operating systems, and
applications. Security at the network level commonly includes firewalls, demilitarized zones,
intrusion detection systems, virtual private networks, zoning and iSNS discovery domains,
port binding and fabric binding configurations, VLAN, and VSAN. At the storage level, security
mechanisms include LUN masking, data shredding, and data encryption. Apart from these
security controls, the cloud also requires identity and access management, role-based access
control, and physical security arrangements.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 456
Physical security is the foundation of any overall IT security strategy. Strict enforcement of
policies, processes, and procedures by cloud service providers is a critical element of
successful physical security. To secure the cloud infrastructure, the following physical
security measures may be deployed:
• Disable all unused IT infrastructure devices and ports
• 24/7/365 onsite security
• Biometric or security badge-based authentication to grant access to the facilities
• Surveillance cameras [CCTV] to monitor activity throughout the facility
• Sensors and alarms to detect motion and fire

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 457
Identity and access management is the process of managing consumers’ identifiers, and
their authentication and authorization to access cloud resources. It also controls access to
resources by placing restrictions based on consumer identities. In today’s cloud
environment, an organization may collaborate with multiple cloud service providers to
access various cloud-based applications. This requires deploying multiple authentication
systems to enable the organization to authenticate employees and provide access to cloud-
based applications.
The cloud environment uses both traditional and new authentication and authorization
mechanisms to provide identity and access management. The key traditional authentication
and authorization mechanisms deployed in an environment are Windows ACLs, UNIX
permissions, Kerberos, and Challenge-Handshake Authentication Protocol (CHAP).
Alternatively, the organization can use Federated Identity Management (FIM) for
authentication. A federation is an association of organizations and cloud
service providers (referred to as trusted parties) that come together to exchange
information about their users and resources to enable collaboration. Federation includes the
process of managing the trust relationships among the trusted parties beyond internal
networks or administrative boundaries. FIM enables the service providers to offer services
without implementing their own authentication system. The organization can choose an
identity provider to authenticate their users. This involves exchanging identity attributes
between the cloud service provider and the identity provider in a secure way. New identity
and access management mechanisms used in cloud include OpenID and OAuth.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 458
Windows ACLs and UNIX Permissions form the first level of protection to compute resources
(application servers, file servers, and file sharing environment such as NAS) by restricting
accessibility and sharing. These permissions are deployed over and above the default
behaviors and attributes associated with files and folders. In addition, various other
authentication and authorization mechanisms, such as Kerberos and directory services, are
implemented to verify the identity of network users and define their privileges.
Windows supports two types of ACLs: discretionary access control lists (DACLs) and system
access control lists (SACLs). The DACL, commonly referred to as the ACL, is used to
determine access control. The SACL determines what accesses need to be audited if auditing
is enabled. In addition to these ACLs, Windows also supports the concept of object
ownership. The owner of an object has hard-coded rights to that object, and these rights do
not need to be explicitly granted in the SACL. The owner, SACL, and DACL are all statically
held as attributes of each object. Windows also offers the functionality to inherit
permissions, which allows the child objects existing within a parent object to automatically
inherit the ACLs of the parent object. ACLs are also applied to directory objects known as
security identifiers (SIDs). These are automatically generated by a Windows server or
domain when a user or group is created, and they are abstracted from the user. In this way,
though a user may identify his login ID as “User1,” it is simply a textual representation of the
true SID, which is used by the underlying operating system. Internal processes in Windows
refer to an account’s SID rather than the account’s username or group name while granting
access to an object. ACLs are set by using the standard Windows Explorer GUI but can also
be configured with CLI commands or other third-party tools.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 459
OAuth is an open authorization mechanism that allows a client to access protected
resources from a resource server on behalf of a resource owner. There are four entities
involved in the authorization mechanism: resource owner, resource server, client, and
authorization server. A resource owner is an entity capable of granting access to a protected
resource. A resource server is the compute system hosting the protected resources, capable
of accepting and responding to protected resource requests using access tokens. A client is
an application making protected resource requests on behalf of the resource owner with
the resource owner’s authorization. An authorization server is the compute system issuing
access tokens to the client after successfully authenticating the resource owner and
obtaining authorization. The authorization server may be the same server as the resource
server or a separate entity.
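As a hedged illustration (not part of the original course), the sketch below shows how a client might exchange an authorization code for an access token at an OAuth 2.0 token endpoint and then call the resource server. All URLs, paths, and client credentials are hypothetical, and the third-party requests library is assumed to be available.

```python
import requests  # third-party HTTP library, assumed to be installed

# Hypothetical endpoints and credentials for illustration only.
AUTHORIZATION_SERVER_TOKEN_URL = "https://auth.example.com/oauth/token"
RESOURCE_URL = "https://api.example.com/protected/resource"


def exchange_code_for_token(authorization_code):
    """Client exchanges an authorization code for an access token."""
    response = requests.post(
        AUTHORIZATION_SERVER_TOKEN_URL,
        data={
            "grant_type": "authorization_code",
            "code": authorization_code,
            "redirect_uri": "https://client.example.com/callback",
            "client_id": "example-client-id",
            "client_secret": "example-client-secret",
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]


def fetch_protected_resource(access_token):
    """Client presents the access token to the resource server."""
    response = requests.get(
        RESOURCE_URL,
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```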
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 461
Multi-factor authentication uses more than one factor to authenticate a user. A commonly
implemented two-factor authentication process requires the user to supply both something
he/she knows (such as a password) and also something he/she has (such as a device). The
second factor might also be a password generated by a physical device (known as token),
which is in the user’s possession. The password generated by the token is valid for a pre-
defined time. The token generates another password after the pre-defined time is over. To
further enhance the authentication process, additional factors may also be considered.
Examples of additional factors that may be used include unique ID number, and users’ past
activity. A multi-factor authentication technique may be deployed using any combination of
these factors. User access to the environment is granted only when all the required factors
are validated.
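The simplified sketch below (not from the original course) checks two factors: something the user knows (a password) and something the user has (a time-based code from a token). It is in the spirit of, but not a full implementation of, standard TOTP, and the user record, shared secret, and validity period are hypothetical.

```python
import hashlib
import hmac
import struct
import time

# Hypothetical user record: a password hash plus a secret shared with the
# user's hardware or software token.
USER_DB = {
    "alice": {
        "password_sha256": hashlib.sha256(b"correct horse").hexdigest(),
        "token_secret": b"shared-token-secret",
    }
}

TOKEN_PERIOD = 30  # seconds for which a generated code remains valid (assumed)


def token_code(secret, at_time=None):
    """Simplified time-based code, in the spirit of a hardware token."""
    counter = int((time.time() if at_time is None else at_time) // TOKEN_PERIOD)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    return "{:06d}".format(int.from_bytes(digest[-4:], "big") % 1_000_000)


def authenticate(user, password, submitted_code):
    """Grant access only when both factors are validated."""
    record = USER_DB.get(user)
    if record is None:
        return False
    knows = hmac.compare_digest(
        hashlib.sha256(password.encode()).hexdigest(), record["password_sha256"])
    has = hmac.compare_digest(submitted_code, token_code(record["token_secret"]))
    return knows and has


# The user supplies something they know (password) and something they have
# (the code currently shown on their token).
print(authenticate("alice", "correct horse", token_code(b"shared-token-secret")))
```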

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 463
Kerberos is a network authentication protocol, which is designed to provide strong
authentication for client/server applications by using secret-key cryptography. It uses
cryptography so that a client and server can prove their identity to each other across an
insecure network connection. After the client and server have proven their identities, they
can choose to encrypt all their communications to ensure privacy and data integrity. In
Kerberos, authentications occur between clients and servers. The client gets a ticket for a
service and the server decrypts this ticket by using its secret key. Any entity, user, or
compute system that gets a service ticket for a Kerberos service is called a Kerberos client.
The term Kerberos server generally refers to the Key Distribution Center (KDC). The KDC
implements the Authentication Service (AS) and the Ticket Granting Service (TGS). In
Kerberos, users and servers for which a secret key is stored in the KDC database are known
as principals. The KDC has a copy of every password associated with every principal, so it is
absolutely vital that the KDC remain secure.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 464
The Challenge-Handshake Authentication Protocol (CHAP) is a basic authentication
mechanism that has been widely adopted by network devices and compute systems. CHAP
provides a method for initiators and targets to authenticate each other by utilizing a secret
code or password. CHAP secrets are usually random secrets of 12 to 128 characters. The
secret is never exchanged directly over the communication channel; rather, a one-way hash
function converts it into a hash value, which is then exchanged. A hash function, such as one using
the MD5 algorithm, transforms data in such a way that the result is practically unique and cannot
be changed back to its original form.
If the initiator requires reverse CHAP authentication, the initiator authenticates the target by
using the same procedure. The CHAP secret must be configured on the initiator and the
target. A CHAP entry, composed of the name of a node and the secret associated with the
node, is maintained by the target and the initiator. The same steps are executed in a two-
way CHAP authentication scenario.
After these steps are completed, the initiator authenticates the target. If both
authentication steps succeed, then data access is allowed. CHAP is often used because it is a
fairly simple protocol to implement and can be implemented across a number of disparate
systems.
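The sketch below (not from the original course) illustrates the CHAP-style challenge/response computation described above: the target issues a random challenge, and both sides hash the identifier, shared secret, and challenge so the secret itself never crosses the wire. The secret value and identifier are purely illustrative.

```python
import hashlib
import os

# Shared CHAP secret configured on both initiator and target (illustrative).
CHAP_SECRET = b"example-chap-secret-1234"


def chap_response(identifier, challenge, secret):
    """One-way MD5 hash over identifier, secret, and challenge (CHAP-style)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


# Target issues a random challenge; the secret itself never crosses the wire.
identifier = 1
challenge = os.urandom(16)

# Initiator computes the response from its copy of the secret...
response = chap_response(identifier, challenge, CHAP_SECRET)

# ...and the target verifies it against its own computation.
expected = chap_response(identifier, challenge, CHAP_SECRET)
print("Authentication", "succeeded" if response == expected else "failed")
```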

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 466
OpenID is an open standard for authentication in which a service provider (known as the
relying party) uses authentication services from an OpenID provider (known as the identity
provider). An OpenID provider maintains consumers’ credentials on their authentication
system and enables relying parties to authenticate consumers requesting the use of the
relying party’s services (in this case, a cloud-based service). This eliminates the need for the
relying party to deploy its own authentication system. In the OpenID mechanism, a
consumer creates an ID with one of the OpenID providers. This OpenID can then be used to
sign on to any provider (relying party) that accepts OpenID authentication.
The figure illustrates the OpenID concept by considering a consumer who requires services from
the relying party. For the consumer to use the services provided by the relying party, an
identity (user ID and password) is required. The relying party does not provide its own
authentication mechanism; however, it supports OpenID from one or more OpenID
providers. The consumer can create an ID with the identity provider and then use this ID
with the relying party. The relying party, after receiving the login request, authenticates it
with the help of the identity provider and then grants access to the services.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 467
This lesson covered physical security, Windows ACLs, UNIX permissions, OAuth, multi-factor
authentication, Kerberos, CHAP, and OpenID.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 468
This lesson covers role-based access control, network monitoring and analysis, firewall,
intrusion detection and prevention system, adaptive security, port binding and fabric
binding, virtual private network, virtual LAN, virtual SAN, zoning, and iSNS discovery domain
mechanisms deployed in a cloud environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 469
Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 470
Role-based access control (RBAC) is an approach to restricting access to authorized users
based on their respective roles. A role may represent a job function, for example, a storage
administrator. The only privileges assigned to the role are those required to perform the
tasks associated with that role.
It is advisable to consider administrative controls, such as separation of duties, when
defining data center security procedures. Clear separation of duties ensures that no single
individual can both specify an action and carry it out. For example, the person who
authorizes the creation of administrative accounts should not be the person who uses those
accounts.
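The minimal sketch below (not from the original course) shows the RBAC idea: privileges attach to roles, and users acquire privileges only through the roles assigned to them. The role names, privileges, and users are hypothetical.

```python
# Minimal RBAC sketch: privileges are attached to roles, and users acquire
# privileges only through the roles assigned to them.
ROLE_PRIVILEGES = {
    "storage_administrator": {"create_lun", "expand_lun", "view_reports"},
    "read_only_operator": {"view_reports"},
}

USER_ROLES = {
    "asha": {"storage_administrator"},
    "bob": {"read_only_operator"},
}


def is_authorized(user, privilege):
    """A user is authorized only if one of their roles grants the privilege."""
    return any(privilege in ROLE_PRIVILEGES.get(role, set())
               for role in USER_ROLES.get(user, set()))


print(is_authorized("asha", "create_lun"))   # True
print(is_authorized("bob", "create_lun"))    # False
```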

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 471
Monitoring and analyzing the network is important for the smooth and continuous
operation of a cloud infrastructure. Network availability can be compromised by DoS attacks
and network device failures. Proactive network monitoring and analysis can detect and
prevent network failures or performance problems.
Network monitoring can be performed in two ways: active monitoring and passive
monitoring. In active monitoring, the monitoring tools transmit data between the two
endpoints that are monitored. The measurement includes parameters such as availability,
delay, loss, and bandwidth. In passive monitoring, instead of transmitting data and then
measuring, information about a link or device is collected by probing the link or device. As
the data passes through the link or device, information is captured. This information is then
used to analyze, detect, and troubleshoot any issues related to performance, availability, or
security. Some of the mechanisms used to monitor, detect, and prevent attacks are firewalls,
intrusion detection systems, intrusion prevention systems and network analysis/forensics
systems.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 472
A firewall is a security mechanism designed to examine data packets traversing a network
and compare them to a set of filtering rules. Packets that are not authorized by a filtering
rule are dropped and not allowed to continue to the requested destination. A rule may use
various filtering parameters such as source address, destination address, port numbers, and
protocols. Some firewalls may support filtering parameters that enable packet inspection for
content. These rules can be set for both incoming and outgoing traffic. The effectiveness of
a firewall depends on how robustly and extensively the security rules are defined. Firewalls
can be deployed at the network, compute system, and hypervisor levels.
A network-level firewall is typically used as the first line of defense for restricting certain types of
traffic from entering or leaving a network. This type of firewall is typically
deployed at the entry point of a cloud's network.
At the compute system level, a firewall application is installed as a second line of defense in a
defense-in-depth strategy. This type of firewall provides protection only to the compute
system on which it is installed.
In a virtualized environment, there is the added complexity of virtual machines running on a
smaller number of compute systems. When virtual machines on the same hypervisor
communicate with each other over a virtual switch, a network-level firewall cannot filter this
traffic. In such situations a virtual firewall can be used to filter virtual machine traffic. A
virtual firewall is a software appliance that runs on a hypervisor to provide traffic filtering
service. Virtual firewalls give visibility and control over virtual machine traffic and enforce
policies at the virtual machine level.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 473
Intrusion detection is the process of detecting events that can compromise the
confidentiality, integrity, or availability of IT resources. An intrusion detection system (IDS) is
a security tool that automates the detection process. An IDS generates alerts in case
anomalous activity is detected. An intrusion prevention system (IPS) is a tool that has the
capability to stop the events after they have been detected by the IDS. These two
mechanisms usually work together and are generally referred to as intrusion detection and
prevention system (IDPS). The key techniques used by an IDPS to identify intrusion in the
environment are signature-based and anomaly-based detection.
In the signature-based detection technique, the IDPS relies on a database that contains
known attack patterns, or signatures, and scans events against it. A signature can be an
email with a specific subject or an email attachment with a specific file name that is known
to contain a virus. This type of detection is effective only for known threats and is potentially
circumvented if an attacker changes the signature (the email subject or the file name in the
attachment, in this example). In the anomaly-based detection technique, the IDPS scans and
analyzes events to determine whether they are statistically different from events normally
occurring in the system. This technique can detect various events such as multiple login
failures, excessive process failure, excessive network bandwidth consumed by an activity, or
an unusual number of emails sent by a user, which could signify an attack is taking place.
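The sketch below shows the anomaly-based idea in its simplest statistical form: an observed event count is flagged when it deviates strongly from a learned baseline. The baseline values and the three-standard-deviation threshold are arbitrary assumptions for illustration only.

    # Toy anomaly-based detection: flag activity that is statistically unusual
    # compared to a learned baseline (e.g., login failures per hour).
    import statistics

    baseline = [2, 1, 3, 2, 0, 1, 2, 3, 1, 2]      # normal login failures per hour
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)

    def is_anomalous(observed, threshold=3.0):
        """Flag the observation if it lies more than `threshold` standard
        deviations above the normal baseline."""
        return observed > mean + threshold * stdev

    print(is_anomalous(2))    # False: within normal behavior
    print(is_anomalous(40))   # True: e.g., a sudden burst of login failures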

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 475
The IDPS can be deployed at the compute system, network, or hypervisor levels. A compute
system-based IDPS analyzes activity that includes system logs, running processes,
application activities, file access and modification, and system and application configuration
changes. Because IDPS software and malicious programs might be running on the same
compute system, compute system-based IDPS is less isolated from attacks. In this scenario,
the malicious program first attacks and disables IDPS and then carries out the main attack.
A network-based IDPS monitors network traffic and network devices. It also analyzes
network and application protocols behavior for unusual activities. Network-based IDPS
resides on the network in the form of an appliance or software installed on a dedicated
compute system, and is usually isolated from malicious applications on the compute
systems. After a network-based IDPS detects an anomaly, it stops the suspicious activity by
dropping the packets or blocking ports. Since a network-based IDPS operates at the network
level, it cannot monitor activities happening within the compute system.
A hypervisor-based IDPS is deployed at the hypervisor level. In this type of IDPS, detection
policies are typically kernel-specific. Any anomaly detected in the kernel is immediately
alerted or corrected depending on the specific IDPS configuration. Hypervisor-based IDPS is
typically deployed in a cloud environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 476
Security threats have evolved to the point that traditional security mechanisms, operating as
standalone mechanisms, cannot respond to them effectively. Sophisticated techniques such as
phishing, Man in the Middle, and others are used to gain unauthorized access to cloud
resources. To combat such sophisticated attacks, cloud service providers require the
use of adaptive security mechanisms. Adaptive security mechanisms integrate with the
cloud service providers’ standalone mechanisms such as IDPS and firewalls and use
heuristics to learn user behavior and detect fraudulent activity. Mechanisms such as
behavioral profile, device-related profile, type of web browser, and plug-ins are used to
establish the normal operating profile of the environment. The intelligence in the adaptive
security mechanism detects and identifies anomalies and blocks such anomalies –
capabilities that may not be possible with traditional mechanisms.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 477
Port binding is a mechanism used to limit the devices that can be attached to a specific
switch port, and is supported in both FC SAN and Ethernet environments. In an FC SAN, port
binding maps a WWPN to a switch port. If a host tries to login to a port with a WWPN that is
not allowed by the port binding, the WWPN login is rejected. In an Ethernet network, port
binding maps the MAC address and IP address of a compute system to a specific switch port.
A switch port will forward a packet only if the MAC and IP address in the packet are mapped
to that port. Port binding mitigates but does not eliminate WWPN or MAC spoofing.
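A simplified sketch of the Ethernet case follows: a frame is forwarded only if its source MAC and IP addresses match the binding configured for the ingress switch port. The port IDs, MAC addresses, and IP addresses below are made-up examples.

    # Illustrative port-binding check for Ethernet switch ports.
    # Port names, MAC addresses, and IP addresses are hypothetical examples.
    PORT_BINDINGS = {
        "eth1/1": {"mac": "00:50:56:aa:bb:01", "ip": "192.168.10.11"},
        "eth1/2": {"mac": "00:50:56:aa:bb:02", "ip": "192.168.10.12"},
    }

    def forward_allowed(port, src_mac, src_ip):
        """Forward a packet only if its source MAC and IP are bound to the port."""
        binding = PORT_BINDINGS.get(port)
        return binding is not None and binding["mac"] == src_mac and binding["ip"] == src_ip

    # A legitimate host is forwarded; the same addresses on the wrong port are not.
    print(forward_allowed("eth1/1", "00:50:56:aa:bb:01", "192.168.10.11"))  # True
    print(forward_allowed("eth1/2", "00:50:56:aa:bb:01", "192.168.10.11"))  # False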
Fabric binding is another security mechanism in the FC SAN environment that allows only
authorized switches to join an existing fabric. Inter-switch links are only enabled between
specified switches in the fabric. Each switch in the fabric obtains identical membership data
that includes a list of authorized switches in the fabric. Port security controls such as port
locking and port-type locking complement fabric binding by helping to prevent unauthorized
access to a switch. Port locking persistently (even after a switch reboot) prohibits an unused
switch port from being used. Port-type locking can be used to restrict how a specific switch
port is used, such as preventing it from being initialized as an inter-switch link.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 478
In the cloud environment, a virtual private network (VPN) can be used to provide a
consumer a secure connection to the cloud resources. VPN is also used in a hybrid cloud,
externally hosted private cloud, or community cloud environment to provide secure site-to-
site connection.
A virtual private network extends a consumer’s private network across a public network
such as the Internet. VPN establishes a point-to-point connection between two networks
over which encrypted data is transferred. VPN enables consumers to apply the same
security and management policies to the data transferred over the VPN connection as are
applied to the data transferred over the consumer’s internal network. When establishing a
VPN connection, a consumer (user or an organization) is authenticated before the security
and management policies are applied.
There are two methods in which a VPN connection can be established: remote access VPN
connection and site-to-site VPN connection. In a remote access VPN connection, a remote
client (typically client software installed on the consumer’s compute system) initiates a
remote VPN connection request. A VPN server authenticates and provides the user access
to the cloud network. In a site-to-site VPN connection, the remote site initiates a site-to-site
VPN connection. The VPN server authenticates and provides access to the internal network.
One typical usage scenario for this method is when deploying a hybrid cloud, in which a
cloud requires connection to another cloud’s network.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 479
In a cloud environment, VLAN and VSAN ensure security by providing isolation over the
shared infrastructure. Each consumer may be provided VLANs and VSANs to ensure their
data is separated from other consumers. A Virtual Local Area Network (VLAN) is a virtual
network created on a local area network (LAN) consisting of virtual and/or physical switches.
VLAN technology can divide a large LAN into smaller virtual LANs or combine separate LANs
into one or more virtual LANs. A VLAN enables communication among a group of nodes
based on the functional requirements of the group, independent of the nodes’ location in
the network. Similarly, Virtual Storage Area Networks (VSAN) enable the creation of multiple
logical SANs over a common physical SAN. They provide the capability to build larger
consolidated fabrics and still maintain the required security and isolation between them.
Zoning should be done for each VSAN to secure the entire physical SAN.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 480
Zoning is a Fibre Channel switch mechanism that enables node ports within a fabric to be
logically segmented into groups and to communicate with each other within the group.
There are three types of zoning. World Wide Port Name-based zoning is the most commonly
used to prevent unauthorized access when node ports are re-cabled to different fabric ports.
However, it is possible that a rogue compute system could join the fabric, then spoof a
legitimate WWPN and thereby gain access to resources in a zone. If WWPN spoofing is a key
concern, then port zoning and port binding could be used.
Internet Storage Name Service (iSNS) discovery domains function in the same way as FC
zones. Discovery domains provide functional groupings of devices in an IP-SAN. For devices
to communicate with one another, they must be configured in the same discovery domain.
State change notifications inform the iSNS server when devices are added to or removed
from a discovery domain. The figure on the slide depicts the discovery domains in iSNS.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 481
This lesson covered role-based access control, network monitoring and analysis, firewall,
intrusion detection and prevention system, adaptive security, port binding and fabric
binding, VPN, VLAN, VSAN, zoning, and iSNS discovery domain.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 482
This lesson covers securing the hypervisor and management server, virtual machine hardening,
securing operating systems and applications, LUN masking, data encryption, and data
shredding mechanisms deployed in a cloud environment.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 483
This lesson covers securing the hypervisor and management server, virtual machine hardening,
securing operating systems and applications, LUN masking, data encryption, and data
shredding mechanisms.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 484
The hypervisor and related management servers are critical components of the cloud
infrastructure because they control the operation and management of the virtualized
compute environment. Compromising a hypervisor or management server places all VMs at
a high risk of attack. Hypervisors may be compromised by hyperjacking or other forms of
attack. Further, the management server may be compromised by exploiting vulnerabilities in
the management software or by an insecure configuration. For example, an administrator
may have configured a non-secured or non-encrypted remote access mechanism. Also, a
malicious attacker may take control of the management server by exploiting a security
loophole of the system. This enables the attacker to perform unauthorized activities such as
controlling all the existing VMs, creating new VMs, deleting VMs, and modifying VM
resources.
To protect against such attacks, security-critical hypervisor updates should be installed when
they are released by the hypervisor vendor. Hypervisor hardening should be performed,
using specifications provided by organizations such as the Center for Internet Security (CIS)
and Defense Information Systems Agency (DISA). Access to the management server should
be restricted to authorized administrators. Access to core levels of functionality should be
restricted to selected administrators. Also, network traffic should be encrypted when
management is performed remotely. A separate firewall with strong filtering rules installed
between the management system and the rest of the network can enhance security.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 485
Virtual machine hardening is a key security mechanism to protect virtual machines from
various attacks. Typically, a virtual machine is created with several default virtual
components and configurations. Some of the configurations and components may not be
used by the operating system and application running on it. These default configurations
may be exploited by an attacker to carry out an attack. Therefore, a virtual machine
hardening process should be used in which the default configuration is changed to achieve
greater security. In this process, virtual machine’s devices that are not required (such as USB
ports or CD/DVD drives) are removed or disabled. Also, in this process the configuration of
VM features is tuned to operate in a secure manner such as changing default passwords,
setting permissions to VM files, and disallowing changes to the MAC address assigned to a
virtual NIC, mitigating spoofing attacks. Hardening is highly recommended when creating
virtual machine templates. This way, virtual machines created from the template start from
a known security baseline.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 487
Hardening, malware protection software, and sandboxing are three key mechanisms that
can increase the resistance of guest operating systems and applications to exploitation by
malicious attackers.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 488
Operating system hardening typically includes deleting unused files and applications,
installing current operating system updates (patches), and configuring system and network
components following a hardening checklist. These hardening checklists are typically
provided by operating systems vendors or organizations such as the Center for Internet
Security (CIS) and Defense Information Systems Agency (DISA), who also provide security
best practices. Further, vulnerability scanning and penetration testing can be performed to
identify existing vulnerabilities and to determine the feasibility of an attack. These
mechanisms assess the potential impact of an attack on the business. Consumer-driven
penetration tests and vulnerability scanning may not be allowed by the cloud service
providers; consumers may have to rely upon the cloud service provider to perform these
tests.
Application hardening is a process followed during application development, with the goal
of preventing the exploitation of vulnerabilities that are typically introduced during the
development cycle. Application architects and developers must focus on various factors
such as proper application architecture, threat modeling, and secure coding while designing
and developing an application. Installing current application updates or patches provided by
the application developers can reduce some of the vulnerabilities identified after the
application is released.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 489
Malware protection software is typically installed on a compute system or as a virtual
appliance to provide protection for the operating system and applications. The malware
protection software detects, prevents, and removes malware and malicious programs such
as viruses, worms, Trojan horses, key loggers, and spyware. Malware protection software
uses various techniques to detect malware. One of the most common techniques used is
signature-based detection. In this technique, the malware protection software scans the
files to identify a malware signature. A signature is a specific bit pattern in a file. These
signatures are catalogued by malware protection software vendors and are made available
to users as updates. The malware protection software must be configured to regularly
update these signatures to provide protection against new malware programs. Another
technique, called heuristics, can be used to detect malware by examining suspicious
characteristics of files. For example, malware protection software may scan a file to
determine the presence of rare instructions or code. Malware protection software may also
identify malware by examining the behavior of programs. For example, malware protection
software may observe program execution to identify inappropriate behavior such as
keystroke capture.
(Contd.)
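As a purely illustrative sketch (not how any specific anti-malware engine is implemented), signature-based detection can be thought of as searching a file's bytes for known patterns; the signature bytes below are made-up placeholders, not real malware patterns.

    # Toy signature-based scan: search a file's bytes for known malware signatures.
    # The signature bytes are fabricated placeholders for illustration only.
    SIGNATURES = {
        "example_worm": b"\xde\xad\xbe\xef",
        "example_keylogger": b"LOGKEYS",
    }

    def scan_file(path):
        """Return the names of any signatures found in the file."""
        with open(path, "rb") as f:
            data = f.read()
        return [name for name, pattern in SIGNATURES.items() if pattern in data]

    # Example usage with a temporary file that contains one of the patterns.
    with open("sample.bin", "wb") as f:
        f.write(b"harmless data" + b"\xde\xad\xbe\xef")
    print(scan_file("sample.bin"))  # ['example_worm']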

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 490
Sandboxing is another mechanism for guest operating system and application security.
Typically used for testing and verifying unproven or untrusted applications, sandboxing
involves isolating the execution of an application from other applications in order to restrict
the resources and privileges that the application has access to. The restrictions are enforced
via the operating system. When creating a sandbox environment an administrator defines
resources that an application can access while it is being tested. For example, network
access and the ability to inspect the system components or read from input devices are
either disallowed or restricted and closely monitored.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 492
LUN masking refers to the assignment of LUNs to specific host bus adapter world-wide
names. LUN masking is one of the basic SAN security mechanisms used to protect against
unauthorized access to storage. LUN masking can be implemented at the host, within the
switch, or at the storage system. The standard implementations of LUN masking on storage
arrays mask the LUNs presented to a front-end storage port based on the WWPNs of the
source HBAs. A stronger variant of LUN masking may sometimes be offered in which
masking is done on the basis of the source Fibre Channel address. The Fibre Channel
address typically changes if the HBA is relocated across ports in the fabric. To avoid this
problem, most switch vendors offer a mechanism to lock down the Fibre Channel address of
a given node port regardless of its location.
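A simplified illustration of the masking idea follows: the storage array presents to each initiator only the LUNs mapped to that initiator's WWPN. The WWPNs and LUN numbers are hypothetical examples, not real configuration data.

    # Illustrative LUN-masking table for a storage array front-end port.
    # WWPNs and LUN numbers are hypothetical examples.
    LUN_MASKING = {
        "50:01:43:80:12:34:56:78": {0, 1, 2},   # HBA of host A sees LUNs 0-2
        "50:01:43:80:87:65:43:21": {3},         # HBA of host B sees LUN 3 only
    }

    def visible_luns(initiator_wwpn):
        """Return the set of LUNs presented to the given initiator WWPN."""
        return LUN_MASKING.get(initiator_wwpn, set())

    print(visible_luns("50:01:43:80:12:34:56:78"))  # {0, 1, 2}
    print(visible_luns("50:01:43:80:99:99:99:99"))  # set(): unknown WWPN sees nothing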

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 493
Data encryption is a cryptographic technique in which data is encoded and made
indecipherable to eavesdroppers or hackers. Data encryption is one of the most important
mechanisms for securing data in-flight and at-rest. Data in-flight refers to data that is being
transferred over a network and data at-rest refers to data that is stored on a storage
medium. Data encryption provides protection from threats such as tampering with data,
which violates data integrity, media theft, which compromises data availability and
confidentiality, and sniffing attacks, which compromise confidentiality.
Data should be encrypted as close to its origin as possible. If it is not possible to perform
encryption on the compute system, an encryption appliance can be used for encrypting data
at the point of entry into the storage network. Encryption devices can be implemented on
the fabric to encrypt data between the compute system and the storage media. These
mechanisms can protect both the data at-rest on the destination device and data in-transit.
Encryption can also be deployed at the storage-level, which can encrypt data-at-rest.
Another way to encrypt network traffic is to use cryptographic protocols such as Transport
Layer Security (TLS), which is the successor to Secure Sockets Layer (SSL). They operate above
the transport layer and provide an encrypted connection for client-server communication.
These protocols are designed to prevent eavesdropping and tampering of data on the
connection over which it is being transmitted.
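For data in-flight, a minimal example of opening a TLS-protected client connection using Python's standard ssl module is shown below; "example.com" is only a placeholder host, and certificate validation relies on the system's trusted CAs.

    # Minimal TLS client connection sketch using Python's standard library.
    import socket
    import ssl

    context = ssl.create_default_context()   # enables certificate and hostname checks

    with socket.create_connection(("example.com", 443)) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname="example.com") as tls_sock:
            print(tls_sock.version())         # e.g., 'TLSv1.3'
            tls_sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
            print(tls_sock.recv(100))         # encrypted on the wire, plaintext here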
Some cloud providers may perform data mining and analytics on consumers’ data so that
they can offer new services to them based on their usage and nature of data. However, this
may cause consumers to be concerned about the loss of confidentiality of their data, even if
the cloud provider does not misuse the data. Therefore, a provider may offer data
encryption capability to protect consumers’ data.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 494
Data shredding is the process of deleting data or residual representations (sometimes called
remanence) of data and making it unrecoverable. Typically, when data is deleted it is not
made unrecoverable from the storage and an attacker may use specialized tools to recover
it. The threat of unauthorized data recovery is greater in the cloud environment as
consumers do not have control over cloud resources. After consumers discontinue the cloud
service, their data or residual representations may still reside in the cloud infrastructure. An
attacker may perform unauthorized recovery of consumers’ data to gain confidential
information.
Cloud service providers can deploy data shredding mechanisms in the cloud infrastructure to
protect from loss of confidentiality of consumers’ data. Data may be stored on disks or on
tapes. Techniques to shred data stored on tape include overwriting it with invalid data,
degaussing the media (a process of decreasing or eliminating the magnetic field), and
physically destroying the tape. Data stored on disk or flash drives can be shredded by using
algorithms that overwrite the disks several times with invalid data.
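A simplified sketch of the overwrite approach for a single file follows; real shredding tools must also deal with file-system metadata, journals, snapshots, flash wear leveling, and replicas, all of which this example ignores.

    # Toy multi-pass overwrite of a file's contents (illustration only).
    import os

    def overwrite_file(path, passes=3):
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))   # overwrite with random (invalid) data
                f.flush()
                os.fsync(f.fileno())        # push each pass to the storage device
        os.remove(path)                      # delete only after overwriting

    with open("secret.txt", "wb") as f:
        f.write(b"confidential consumer data")
    overwrite_file("secret.txt")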
Cloud service providers may create multiple copies (backups and replicas) of consumers’
data at multiple locations as part of business continuity and disaster recovery strategy.
Therefore, cloud service providers must deploy data shredding mechanisms at all locations
and ensure that all the copies are shredded.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 495
Today, cloud computing has enabled small and medium organizations to have robust
security mechanisms without requiring capital investment in security tools or staff expertise.
This is possible because some cloud service providers are providing Security as a Service.
Security as a Service (SecaaS) is a service that delivers various security mechanisms through
the cloud. Typically, Security as a Service offers mechanisms such as identity and access
management, data loss prevention, security information and event management, web
security, email security, security assessments, intrusion detection and prevention,
encryption, and business continuity and disaster recovery.
Security as a Service provides consumers the opportunity to reduce their capital investment
on their security deployments and still enjoy robust security mechanisms. It also reduces
the security management burden on the organization and enables them to focus on their
core competencies. Security as a Service provides several benefits such as agility, dynamic
scalability, and virtually unlimited resources. The challenge associated with Security as a
Service is that, as in every SaaS deployment, consumers do not have complete visibility and
control over the service. The consumer is responsible for setting the security policies but the
service is managed by the service provider.
Having robust security mechanisms and technology in place is very important to any cloud
environment. However, technologies alone are not enough to guarantee information
security. Cloud service providers need to deploy Governance, Risk and Compliance (GRC)
processes before offering cloud services to their consumers.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 496
This lesson covered securing the hypervisor and management server, virtual machine hardening,
securing operating systems and applications, LUN masking, data encryption, data shredding,
and Security as a Service.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 497
This lesson covers focus areas of cloud governance, key steps of risk management, types of
compliance that control IT operations in cloud, and key auditing activities in cloud.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 498
Governance, Risk and Compliance (GRC) is a term encompassing processes that help an
organization ensure that their acts are ethically correct and in accordance with their risk
appetite (the risk level an organization chooses to accept), internal policies and external
regulations. This process should be integrated, holistic, and organization-wide. All
operations of an organization should be managed and supported through GRC.
Governance, risk management, and compliance management work together to enforce
policies and minimize potential risks. To better understand how these three components
work together, consider an example of how GRC is implemented in an IT organization.
Governance is the authority for making policies, such as defining access rights to users
based on their roles and privileges. Risk management involves identifying resources that
should not be accessed by certain users in order to preserve confidentiality, integrity, and
availability. In this example, compliance management assures that the policies are being
enforced by implementing mechanisms such as firewalls and identity management systems.
GRC is an important component of cloud infrastructure. Therefore, while building a cloud
infrastructure, the service provider must ensure that all aspects of GRC are addressed. This
includes cloud-related aspects such as ensuring secured multi-tenancy, the jurisdictions
where data should be stored, data privacy, and ownership.
An organization that is transforming its data center to provide a cloud infrastructure is likely
to already have some form of GRC in effect. In such cases, the organization may only have to
focus on the cloud-related aspects of GRC.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 499
Governance determines the purpose, strategy, and operational rules by which companies are
directed and managed. Enterprise Governance is based on the company’s business strategy
and driven by the Board of Directors. It generally includes legal, HR, finance, and the office
of the CEO. Governance affects how the company addresses everything from long-term
strategies to day-to-day operations. This slide focuses on IT Governance, which is a subset
discipline of Enterprise Governance.
The objective of IT governance is to determine the desired behavior or results to achieve IT’s
strategic goals. Governance in IT is a system in which leaders monitor, evaluate, and direct IT
management to ensure IT effectiveness, accountability and compliance. For a governance
system to work, it has to distribute the workload and decision making process without
losing value or gaining bias in the process. Roles and responsibilities must be clearly defined,
providing details such as who is responsible for directing, controlling and executing
decisions, what information is required to make the decisions, and how exceptions will be
handled. The last step is measuring the outcome of the governance process, finding areas
for improvement, and instituting changes for improvement.
The basic principles of IT governance remain the same in cloud environments. Additionally,
cloud service provider must focus on policies related to managing and consuming cloud
services.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 500
Risk is the effect of uncertainty on business objectives. Risk management is a systematic
process of assessing an organization’s assets, placing a realistic valuation on each asset, and creating a risk
profile that is rationalized for each information asset across the business. Additionally, the
cloud service provider must establish a risk threshold to measure against each asset.
Cloud computing poses several new risks beyond those that exist in traditional data centers.
Risk management involves identification, assessment, and prioritization of risks and
institutes controls to minimize the impact of those risks. There are four key steps of risk
management that a cloud service provider must perform before offering services to the
consumers: risk identification, risk assessment, risk mitigation, and monitoring.
Step 1: Risk identification points to the various sources of threats that give rise to risk. After
identifying risks in a cloud, these risks and their sources need to be classified into
meaningful severity levels.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 501
Compliance is the act of adhering to, and demonstrating adherence to, external laws and
regulations, as well as to corporate policies and procedures. It also involves adhering to, and
demonstrating adherence to the service provider's own demands, consumers' demands,
and/or the demands of participating cloud providers (in case of hybrid cloud and cloud
brokers). For example security demands that are expressed in contracts, decisions on best
practice, or specific standards. While building and offering cloud services to consumers, it is
important to assess compliance against regulations and demands (discussed earlier). Also, it
is important to review the security and privacy controls that are in place to ensure that
appropriate controls are applied to the highest value, highest risk assets.
When transforming an existing data center into a cloud, compliance and audit standards,
processes, and practices likely already exist for the data center environment. However,
transforming to the cloud requires significant adjustments to the existing compliance framework.
Adhering to policies and regulations applies to both cloud service provider and consumers.
There are primarily two types of policies controlling IT operations in an enterprise that
require compliance even after moving operations to cloud: internal policy compliance and
external policy compliance.
Internal policy compliance controls the nature of IT operations within an organization. An
organization needs to maintain the same compliance when operating in cloud. This requires
clear assessment of the potential difficulties in maintaining the compliance in a cloud and
processes to ensure that this is effectively achieved.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 503
In order to meet a consumer’s compliance requirements, the cloud service provider must
have compliance management in place. Compliance management ensures that the cloud
services, service creation processes, and cloud infrastructure resources adhere to relevant
policies and legal requirements. Policies and regulations can be based on configuration best
practices and security rules. These include administrator roles and responsibilities, physical
infrastructure maintenance timelines, information backup schedules, and change control
processes. Compliance management activity includes periodically reviewing compliance
enforcement in infrastructure resources and services. If it identifies any deviation from
compliance requirements, it initiates corrective actions.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 505
In the compliance context, audit is a process that determines the validity and reliability of
information about the enforcement of controls presented by a provider. Audit also provides
an assessment of the cloud provider’s control mechanisms and their ability to provide the
consumers the logs required to verify the mechanisms. Auditing of the cloud infrastructure
can be performed by internal auditors (an auditing team within the organization) or external
auditors (from an external organization). The role that carries out this activity is referred to as
a cloud auditor.
A cloud auditor is any party that can conduct an independent assessment of cloud services,
information system operations, performance and security of the cloud implementation. A
cloud auditor can evaluate the services provided by a cloud provider in terms of security
controls and privacy.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 506
The cloud auditor makes an independent assessment of the security controls in the
information system to determine if they meet the requirements and are running as
originally intended. Key activities that provide the basis for a security audit of a cloud
infrastructure include:
• Determine how each consumer’s data is segregated from that of other consumers.
• Review and evaluate the security mechanisms to detect, prevent, and stop an attack in
accordance with provider’s internal policies. Also, review and evaluate physical security.
• Determine how identity management is performed for cloud-based services.
• Determine whether adequate disaster recovery processes are available to provide
consumers uninterrupted access to the cloud services.
• Review and evaluate whether appropriate governance processes are available to meet
consumers’ requirements.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 507
In a privacy audit, the cloud auditor performs an assessment of the processes and
mechanisms deployed in the cloud infrastructure to determine whether a cloud provider
meets relevant privacy regulations. The key activities based on which a privacy audit of a
cloud infrastructure is performed include:
• Review and evaluate the usage of encryption to protect consumers’ data that is stored on
the providers’ infrastructure or is transmitted over the network.
• Determine the level of access the provider’s employees have to consumers’ resources
and data. Also, determine how the access of the provider’s employees is controlled.
• Review and evaluate processes for controlling consumers’ access to the cloud resources.
• Review and evaluate whether data retention and destruction practices are in accordance
with privacy laws.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 508
This lesson covered governance in the cloud, risk management for the cloud, compliance
management for the cloud, and cloud auditing.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 509
This lesson covers RSA and VMware security products. These products are RSA SecurID, RSA
Adaptive Authentication, RSA Security Analytics, RSA Archer eGRC, and VMware vCloud
Networking and Security.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 510
RSA SecurID two-factor authentication provides an added layer of security to ensure that
only valid users have access to systems and data. RSA SecurID is based on something a user
knows (a password or PIN) and something a user has (an authenticator device). It provides a
much more reliable level of user authentication than reusable passwords. It generates a
new, one-time token code every 60 seconds, making it difficult for anyone other than the
genuine user to input the correct token code at any given time. To access their resources,
users combine their secret Personal Identification Number (PIN) with the token code that
appears on their SecurID authenticator display at that given time. The result is a unique,
one-time password used to assure a user’s identity.
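To illustrate the general concept only, a generic time-based one-time code can be derived from a shared secret and the current 60-second time window, as sketched below. RSA SecurID's actual token-code algorithm is proprietary and differs from this; the seed and PIN values are hypothetical.

    # Generic time-based one-time code sketch. This is NOT the RSA SecurID
    # algorithm; it only illustrates a code that changes every 60 seconds.
    import hashlib
    import hmac
    import time

    SHARED_SECRET = b"per-token-secret-seed"   # hypothetical seed shared with the server

    def one_time_code(secret=SHARED_SECRET, interval=60):
        window = int(time.time() // interval)              # current 60-second window
        digest = hmac.new(secret, str(window).encode(), hashlib.sha256).hexdigest()
        return str(int(digest, 16) % 1_000_000).zfill(6)   # 6-digit code

    # The user combines a secret PIN with the displayed token code to authenticate.
    print("PIN + token code:", "1234" + one_time_code())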
RSA Security Analytics helps security analysts detect and investigate threats often missed by
other security tools. Security Analytics provides converged network security monitoring and
centralized security information and event management (SIEM). Security Analytics combines
big data security collection, management, and analytics; full network and log-based
visibility; and automated threat intelligence – enabling security analysts to better detect,
investigate, and understand threats they often could not easily see or understand before. It
provides a single platform for capturing and analyzing large amounts of network, log, and
other data. Also, it accelerates security investigations by enabling analysts to pivot through
terabytes of metadata, log data, and recreated network sessions. It archives and analyzes
long-term security data through a distributed computing architecture and provides built-in
compliance reports covering a multitude of regulatory regimes.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 511
RSA Adaptive Authentication is a comprehensive authentication and fraud detection
platform. Adaptive Authentication is designed to measure the risk associated with a user’s
login and post-login activities by evaluating a variety of risk indicators. Using a risk and rules
based approach, the system then requires additional identity assurance, such as out-of-band
authentication, for scenarios that are high risk and violate a policy. This methodology
provides transparent authentication for organizations that want to protect users accessing
web sites and online portals, mobile applications and browsers, Automated Teller Machines
(ATMs), Secure Sockets Layer (SSL) virtual private network (VPN) applications, web access
management (WAM) applications, and application delivery solutions.
vCloud Networking and Security virtualizes networking and security to enable greater agility,
efficiency and extensibility in the data center. vCloud Networking and Security delivers
software-defined networks and security with a broad range of services, which includes a
virtual firewall, virtual private network, load balancing and VXLAN extended networks.
• Virtual firewall – Stateful inspection firewall that can be applied either at the perimeter of
the virtual data center or at the virtual network interface card (vNIC) level directly in front
of specific workloads. The firewall-rule table is designed for ease of use and automation
with VMware vCenter objects for simple and reliable policy creation. Stateful failover
enables high availability for business-critical applications.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 513
This module covered key security terminologies, key security threats in the cloud, security
mechanisms for the cloud, and governance, risk, and compliance.

Copyright © 2014 EMC Corporation. All rights reserved Module 8: Security 515
This module focuses on service portfolio management and service operation management
processes.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 516
This module describes service management, highlighted in the figure, which supports
service portfolio planning and administrative operations across all the layers of the cloud
infrastructure with the goal of meeting service requirements.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 517
This lesson covers an overview of cloud service management and its key functions.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 518
The U.S. National Institute of Standards and Technology (NIST) in its Special Publication (SP)
500-291, Version 2 describes cloud service management as: “cloud service management
includes all of the service-related functions that are necessary for the management and
operation of those services required by or proposed to cloud consumers.” These functions
align the creation and delivery of cloud services to the provider’s business objectives and to
the expectations of consumers. They are performed by the administrators of the provider’s
organization.
Cloud service management has a service-based focus, meaning that the management
functions are linked to the service requirements and service level agreement (SLA). For
example, when a service provider needs to make a change in a service, the service
management ensures that the cloud infrastructure resources are configured, modified or
provisioned to support the change; likewise, it ensures that cloud infrastructure resources
and management processes are appropriate so that services can maintain the required
service level. Additionally, cloud service management ensures that the cloud services are
delivered and operated optimally by using as few resources as needed.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 519
Traditional IT management: Traditionally, IT management is infrastructure element or asset
(such as compute, storage, and business application) specific. The management tools
provided by IT asset vendors only enable monitoring and management of specific asset(s). A
large environment composed of many multi-vendor IT assets residing in world-wide
locations raises management complexity and asset interoperability issues. Further,
traditional management processes and tools may not support a service oriented
infrastructure, especially if the requirement is to meet on-demand service provisioning,
rapid elasticity, workflow orchestration, and sustained service levels. Even if traditional
management processes enable service management, they are usually implemented
separately from service planning and design, which may lead to gaps in monitoring and
management.
Cloud service management: Management in the cloud has a service-based focus
commensurate with the requirements of each cloud service rather than an asset-based
focus. It is based on a holistic approach that must be able to span across all the IT assets in
a cloud infrastructure. Depending on the size of a cloud environment, service management
may encompass a massive IT infrastructure comprising multi-vendor assets, a variety of
technologies, and multiple data centers. Compared to traditional IT management, cloud
service management must be optimized to handle increased flexibility, increased
complexity, increased data access, increased change rates, and increased risk exposure that
may lead to service outages and SLA violations. In order to operate sustainably in a cloud
environment, service management must rely on automation and workflow orchestration.
Cloud service management may still follow the traditional IT service management processes
such as ITIL; however, the processes should support cloud rapid deployment, orchestration,
elasticity, and service mobility.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 520
Cloud service management performs two key functions:
• Service portfolio management: It defines the suite of service offerings, called service
portfolio, aligning it to the provider’s strategic business goals.
• Service operation management: It maintains cloud infrastructure and deployed services,
ensuring that services and service levels are delivered as committed.
Often service portfolio management and service operation management are performed
jointly, providing the capability to create viable cloud services that meet consumer
expectations and market demand, and compete appropriately on service price, quality, and
time-to-provision.
These functions are detailed in subsequent slides.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 521
This lesson covered an overview of cloud service management and a comparison between
traditional IT management and cloud service management. It also covered an overview of
service management functions – service portfolio management and service operation
management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 522
This lesson covers service portfolio management activities and key processes to support
portfolio management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 523
Service portfolio management makes decisions to deliver those services that provide value
and strategic advantage to the provider. It provides guidelines on how these services will be
designed, implemented, supported, and priced (where appropriate). It also makes
investment decisions on services across their entire lifecycle and ensures that services
are delivered in the most cost-effective manner and as quickly as possible.
A chart showing service portfolio management activities is provided in the slide.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 524
A list of key processes that support portfolio management activities is provided in the
slides. These processes are described next.

Note: Though the aim is not to promote any particular service management methodology as
ideal for cloud, this module follows ITIL service management practices that are widely known
across IT organizations.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 525
The goal of service catalog management is to ensure that a service catalog is created and
maintained with accurate information on all of the available services. As described before in
module 6, the service catalog, available through the cloud portal, is the menu of services
that lists services, attributes of services and associated prices. Key functions of service
catalog management are described below:
• Service catalog management team is responsible for the design and implementation of
the service catalog. It updates the service catalog to incorporate new service offerings or
changes to the existing service offerings. Changes to the service offerings are
communicated to service catalog management through an orchestrated workflow that
first routes change decisions through a change management process. Following
affirmative change decisions, the service catalog management updates the service
catalog to include the new services and/or changes to existing service offerings. Change
management is described later in this module.
• Service catalog management team ensures that the information in the service catalog is
up-to-date. It emphasizes clarity, completeness, and usefulness when describing service
offerings in the service catalog, ensuring that the features, intended use, comparisons,
prices, order process, and levels of services are unambiguous and valuable to consumers.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 526
The service catalog design and implementation process consists of a sequence of steps.
These are:
1. Create service definition: Creating a definition for each service offering is the first step
in designing and implementing the service catalog. A service definition comprises
service attributes such as service name, service description, features and options,
provisioning time, and price. The cloud portal software provides a standard user
interface to create service definitions. The interface commonly provides text boxes,
check boxes, radio-buttons, and drop-downs to make entries for the service attributes (a
simple example of a service definition is sketched after this list).
2. Define service request: After creating a service definition, the next step is to define the
Web form used to request the service. The portal software includes a form designer for
creating the service request form that consumers use to request the service.
3. Define fulfillment process: After defining the service request form, the next step is to
define the process that fulfills delivery of the service. Once the process is modeled,
approved, and validated, it is implemented using workflows in the orchestrator.
4. Publish service: The final step is to publish the service catalog to the consumers. Before
publishing, it is good practice to perform usability and performance testing. After the
service is published, it becomes available to consumers on the cloud portal.
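As a purely illustrative example of step 1, a service definition could be represented as a simple record of attributes; every name and value below is hypothetical, not taken from any particular portal software.

    # Hypothetical service definition, as might be captured through a cloud
    # portal's service-definition interface (all values are made-up examples).
    service_definition = {
        "service_name": "Standard Virtual Machine",
        "description": "General-purpose VM with OS image and network access",
        "features": {"vcpus": 2, "memory_gb": 8, "storage_gb": 100},
        "options": ["backup", "monitoring"],
        "provisioning_time_hours": 4,
        "price_per_month_usd": 55.00,
    }

    # Publishing could then expose selected attributes in the consumer view
    # of the service catalog.
    catalog_entry = {k: service_definition[k]
                     for k in ("service_name", "description", "price_per_month_usd")}
    print(catalog_entry)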

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 527
The service catalog management team usually creates different views of the cloud portal, or
personalizes it, to meet the different requirements of cloud administrators and consumers.
These are:
• Service management-specific view: It is only visible to the cloud administrators and
contains details of all service assets and support services to provision a cloud service. It is
useful to track order status, and check resource bundling issues and orchestration
failures.
• Consumer-specific view: It comprises a description of services including the business
processes they support, consumer-facing value of the services, and service policies and
rules. It also provides information on rented service instances and utilized resources,
incident status, and billing reports. It enables consumers to request any changes to
services, decommission service instances, and use technical support services.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 528
The goal of financial management is to manage the cloud service provider's budgeting,
accounting, and billing requirements. Key functions of financial management are described
below:
• The financial management team, in collaboration with the provider’s finance department,
plans for investments to provide cloud services and determines the IT budget for cloud
infrastructure and operations for the lifecycle of services. The financial management
team is responsible for providing any necessary business cases for investments, while the
finance department may help out with cost analysis, budget adjustment, and accounting.
The business case usually includes financial justification for a service-related initiative
including demand forecast of services, service stakeholders inputs, sources of initial and
long-term funding, and value proposition for the business. The business case provides
visibility into the financials and helps communicate the initiatives to the top executives.
• Financial management team is responsible for performing service valuation. Service
valuation determines the price a consumer is expected to pay for a service, which helps
recover the cost of providing the service, ensuring profitability, and meeting the
provider’s return on investment (ROI) and reinvestment goals. Financial management
team defines the billing policy based on the service-specific pricing strategy. It manages
the deployment of billing system or tool that enables administrators to define the billing
policy. Based on the policy, the billing system automatically collects billing data, stores
billing records in a database, and generates the billing report per consumer.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 529
Investment decisions are usually factored by total cost of ownership (TCO) and return on
investment (ROI). These are described below:
• TCO estimates the full lifecycle cost of owning service assets. The cost includes capital
expenditure (CAPEX), such as procurement and deployment costs of hardware and on-
going operational expenditure (OPEX), such as power, cooling, facility, and administration
cost. CAPEX is typically associated with one-time or fixed costs. Recurring costs or
variable costs are typically OPEX that may vary over the life of a service.
• ROI is a measurement of the expected financial benefit of an investment. It is calculated
as the gain from an investment minus the cost of the investment, and the whole is
divided by the cost of the investment.
The formulae to calculate TCO and ROI are shown in the slide.
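Since the slide formulas are not reproduced in these notes, a small worked example with made-up figures shows how the two measures described above are commonly computed.

    # Worked TCO and ROI example with hypothetical figures.
    capex = 500_000            # one-time costs: hardware procurement and deployment
    opex_per_year = 120_000    # recurring costs: power, cooling, facility, administration
    service_life_years = 5

    tco = capex + opex_per_year * service_life_years    # full lifecycle cost of ownership
    gain_from_investment = 1_400_000                    # revenue attributed to the service

    roi = (gain_from_investment - tco) / tco            # (gain - cost) / cost
    print(f"TCO = {tco:,}")                             # TCO = 1,100,000
    print(f"ROI = {roi:.0%}")                           # ROI = 27%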

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 530
Financial management team performs a series of steps to determine the price for a service.
These steps are:
1. It aggregates all types of costs (both CAPEX and OPEX) down to service asset level of
granularity by mapping assets to relevant cloud services.
2. It calculates service cost on per-unit basis by dividing the aggregated cost for a service
by some logical unit of demand such as GB of storage or an hour of usage for that
service.
3. The per-unit service costs may vary over time, depending on demand for or utilization of
the services and service assets. Thus, financial management through specialized
management tools should track demand and utilization to establish a stable per-unit
cost baseline.
4. Finally, financial management team may add some margin amount over per-unit service
cost to define service price, or may establish the price at the true cost of service
depending on the provider’s business goal (a simple numeric example follows).
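The sketch below walks through these steps with hypothetical numbers: a service's aggregated monthly cost is divided by a logical unit of demand, and a margin is then added to arrive at the price.

    # Hypothetical per-unit service costing and pricing.
    aggregated_monthly_cost = 44_000        # CAPEX amortization + OPEX mapped to this service
    units_of_demand = 8_000                 # e.g., VM-hours consumed in the month

    per_unit_cost = aggregated_monthly_cost / units_of_demand   # cost per VM-hour
    margin = 0.20                                               # 20% margin over cost
    service_price = per_unit_cost * (1 + margin)

    print(f"Per-unit cost: {per_unit_cost:.2f}")   # 5.50
    print(f"Service price: {service_price:.2f}")   # 6.60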

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 531
The goal of supplier management is to ensure that all contracts with the suppliers of cloud
products, technologies, and supporting services meet the business requirements of the
cloud service provider and that the suppliers adhere to contractual commitments. Cloud
service providers usually obtain IT products and services from multiple suppliers to build and
maintain cloud infrastructure and provide cloud services to their consumers. Examples of
suppliers to cloud service providers are hardware and software vendors, network and
telecom service providers, and public cloud service providers (IT department functions as
cloud service broker).
Key functions of supplier management are described below:
• The supplier management team gathers and evaluates information on different suppliers
that are enabling cloud technologies and making product offerings suitable for the
provider’s business and technical needs. The supplier management team creates a list of
preferred suppliers and builds criteria for the selection of suppliers and their products
and services. From the list of prospective suppliers, it selects the most suitable supplier.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 532
• A provider usually enters into a contract with each supplier, which legally underpins the
supply of products and supporting services. Supplier management team prepares terms
and conditions for the contracts and creates a framework for the procurement of
supplies. It negotiates and agrees on the terms in the contracts with the suppliers and
ensures that the value for money is obtained. During contract negotiations, emphasis is
placed on supply guarantees, price and payments, delivery timeline, responsibilities,
contract renewal and termination options, and dispute resolution. While negotiating the
termination option, the supplier management team should assess the terms and
conditions under which the relationships with suppliers will end, addressing potential
vendor lock-in issues. Further, it ensures that the procurement timeline aligns with the
demand for resources in the cloud infrastructure and relevant SLAs.
• Supplier management team periodically evaluates the quality and cost of the products
and services offered by the suppliers against changing business requirements of service
provider and SLAs. Based on this, it reassesses the existing suppliers and contracts and
plans for contract renewal, extension, and termination.
• Supplier management team manages the relationships with the suppliers and
communicates about the provider’s business strategy, risks associated with a delivery,
required changes in the contract, and improvements needed in products and services.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 533
A list of common criteria for selecting suppliers is provided in the slide.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 534
This use case explores the performance of existing suppliers of a cloud service provider
organization. The supplier management team reassesses suppliers’ performance and
accordingly plans for contract renewal, extension, and termination.
Currently, the provider organization has contracts with four suppliers (Supplier A, Supplier B,
Supplier C, and Supplier D). The percentage of IT spending by supplier is shown in the
slide. A report documenting contract renewal date, number of contracts, compliance with
contract terms, number of disputes with suppliers, total cost by suppliers, and overall
satisfaction rating is created for each supplier. The report provides visibility to each
supplier’s performance against the contracts. It helps the supplier management team to
determine what is going wrong and where action is needed.
The report reveals that supplier B is the worst performer and the contract with supplier B is
unlikely to be renewed by the supplier management team. Supplier management team may
also decide on early termination of the contract with supplier B depending on the terms in
the contract. The report also discloses that the performance of supplier D is not satisfactory.
The supplier management team should communicate to supplier D on improvements
needed, try to resolve all disputes, and periodically evaluate the supplies. It may propose a
new contract, extend the existing contract for a short period, or decide not to renew the
contract depending on improvement in the supplier’s performance over time.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 535
This lesson covered key processes that support service portfolio management – service
catalog management, financial management, and supplier management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 536
This lesson covers service operation management activities and monitoring process.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 537
Service operation management involves ongoing management activities to maintain the
cloud infrastructure and deployed services. All of these activities have the goal of ensuring
that services and service levels are delivered as committed. A chart showing service
operation management activities is provided in the slide.
Ideally, service operation management should be automated. Manual management
operations are prone to errors, difficult to audit, and require considerable time and effort,
making them ill-suited to the dynamic, abstracted nature of cloud environments. Manual
operations increase administration cost and consequently increase the cost of providing
cloud services. They also raise the risk of deviating from increasingly stringent compliance
requirements and of degrading service quality. To enable zero-touch service operation
management, organizations typically deploy cloud service management tools. These tools
automate many service operation management activities, and their functions are
programmatically integrated through orchestrated workflows.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 538
A list of key processes that support service operation management activities is provided in
the slide. Although these processes are distinct, they often interact with one another
through orchestrated workflows to produce valued services and to respond quickly to the
needs of the provider and consumers. These processes are described next.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 539
Managing a cloud requires visibility and control. Monitoring provides visibility, and forms the
basis for administering cloud infrastructure and deployed services. Some key benefits of
monitoring are described below:
• Monitoring provides information on the availability and performance of various services
and the infrastructure components or assets on which they are built. Cloud
administrators use this information to track the health of infrastructure components and
services.
• Monitoring helps to analyze the utilization and consumption of resources by service
instances. This analysis facilitates capacity planning, forecasting, and optimal use of these
resources.
• Monitoring events in the cloud environment, such as a change in the performance or
availability state of a component or a service, may be used to trigger automated routines
or recovery procedures. Such procedures can reduce downtime due to known
infrastructure errors and reduce the level of manual intervention needed to recover from
them.
• Monitoring is the foundation of metering, reporting, and alerting. Monitoring helps in
metering the resource consumption of service instances. It helps in generating reports for
billing and trend analysis. It also helps trigger alerts when thresholds are reached, security
policies are violated, or service performance deviates from the SLA. Alerting and reporting
are detailed later in this module.
• In addition to its use in management, monitored information may be made available to
consumers and presented as metrics of the cloud services.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 540
Monitoring is usually automated using specialized monitoring tools. These tools should have
the following capabilities:
• These tools should provide end-to-end visibility into the cloud infrastructure and
deployed cloud services throughout their lifecycle.
• They must be capable of collecting relevant data in a rapidly changing and varying
workload environment. This includes tracking the movement of data and services within
and between clouds (e.g., private to private, private to public, or multi-hop).
• Depending on the size of the cloud infrastructure and the number of services involved,
the monitoring tools may have to collect data about hundreds or thousands of
components.
• These tools should map the relationships between services and the associated physical
components, regardless of dynamic creation, termination, scaling, load balancing, and
migration of services in the cloud. This mapping can improve the mean time to restore
service (MTRS) by quickly and automatically isolating the root cause when an event
causes numerous systems or components to generate alerts.
• The data collection frequency should be consistent with the dynamic workload nature of
the cloud services, helping service management processes to obtain recent and accurate
information about services.
• Monitoring tools should have a minimal runtime footprint so as not to impact service
performance.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 541
This lesson covered service operation management activities, a list of key processes to
support operation management, and monitoring.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 542
This lesson covers monitoring parameters, alerting, and reporting.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 543
Cloud services are primarily monitored for configuration, availability, capacity, performance,
and security. These are described next.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 544
Configuration monitoring involves tracking configuration and deployment of cloud service
instances and associated infrastructure components. It detects configuration errors, non-
compliance with configuration policies, and unauthorized configuration changes.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 545
This use case illustrates the importance of monitoring the configuration of services and their
compliance with configuration policies.
The figure on the slide illustrates a cloud environment that provides services to four consumer
organizations. For each consumer organization, one or more VMs are deployed to provide the
service. These VMs are provisioned by allocating resources from a compute pool. The
service provider needs to ensure that the configuration of these VMs is compliant with a
configuration policy related to system security. The configuration policy includes 27
conditions. The VMs are compliant with the configuration policy if their existing configuration
satisfies all conditions in the policy.
The monitoring tool verifies the VM configurations against the configuration policy and provides
the compliance results. The result shows that 22 percent of the evaluated policy conditions
pass against the existing VM configurations and 78 percent fail. This result helps cloud
administrators assess compliance with the configuration policy and take the necessary
action to ensure compliance.
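The compliance calculation behind such a result is straightforward. The following Python sketch is illustrative only: the VM attributes and the three policy conditions are hypothetical stand-ins for the 27 conditions mentioned above, and a real tool would collect the configuration data automatically.

# Illustrative sketch: evaluate VM configurations against a security
# configuration policy and report the percentage of passed/failed conditions.
# The VM attributes and policy conditions below are hypothetical examples.

vms = [
    {"name": "vm-consumer-a1", "firewall_enabled": True,  "ssh_root_login": True,  "patch_level": 2},
    {"name": "vm-consumer-b1", "firewall_enabled": False, "ssh_root_login": True,  "patch_level": 1},
]

# Each policy condition is a (description, predicate) pair.
policy = [
    ("Host firewall must be enabled",        lambda vm: vm["firewall_enabled"]),
    ("Root login over SSH must be disabled", lambda vm: not vm["ssh_root_login"]),
    ("Patch level must be at least 2",       lambda vm: vm["patch_level"] >= 2),
]

def evaluate(vms, policy):
    results = [pred(vm) for vm in vms for _, pred in policy]
    passed = sum(results)
    total = len(results)
    return passed, total - passed, 100.0 * passed / total

passed, failed, pct = evaluate(vms, policy)
print(f"{passed} conditions passed, {failed} failed ({pct:.0f}% compliant)")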

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 546
Availability monitoring involves tracking availability of services and infrastructure
components during their intended operational time. It helps to identify a failure of any
component that may lead to service unavailability or degraded performance. Availability
monitoring also helps an administrator to proactively identify failing services and plan
corrective action to maintain SLA requirements.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 547
This use case illustrates the importance of monitoring the availability of the cloud
infrastructure components.
Consider a cloud infrastructure with two compute systems (compute system A and compute
system B) and a storage system interconnected through a VLAN. Both compute systems run
hypervisors and are members of the cluster formed by the hypervisors. Under normal
conditions, a total of six VMs are hosted on the hypervisor cluster – three on each compute
system. These VMs are components of a cloud service. Dynamic failover of VMs is enabled
for the hypervisor cluster.
If compute system A fails, the VMs running on it automatically fail over to compute system B
and consumers continue to access services. However, due to the absence of a redundant
compute system, a failure of compute system B could result in unavailability of services.
Further, because all VMs are now hosted on a single compute system, the utilization of
compute system B reaches 80 percent, which may degrade service performance.
A monitoring tool detects the compute system failure and notifies cloud administrators to
take corrective action before another failure occurs. In most cases, the monitoring tool
provides symptom notifications for a failing component to the administrators so that they
can initiate actions before the component fails.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 548
Capacity monitoring involves tracking the amount of cloud infrastructure resources used
and available. It ensures availability of sufficient resources and prevents service
unavailability or performance degradation. It examines the used and unused capacity
available in a resource pool, the utilization of capacity by service instances, the capacity limit
for each service, dynamic allocation of storage capacity to thin LUNs, over-commitment of
memory and processor cycles, and so on. Capacity monitoring ensures uninterrupted service
availability and scalability by averting outages before they occur. Further, capacity
monitoring helps to perform capacity trend analysis. This analysis helps in understanding
future resource requirements and planning for procuring resources on time.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 549
This use case illustrates the importance of monitoring the storage pool space (used and
available) in a cloud infrastructure.
The figure on the slide illustrates a storage pool created from a collection of physical disk
drives. Storage volumes are provisioned from the storage pool to each consumer of a storage
service. When the pool is full and no storage space is available to the consumers, a service
outage results.
Monitoring tools can be configured to issue a notification when thresholds on the storage
pool capacity are reached. For example, when the storage pool reaches 66 percent of its
capacity, a notification is issued; likewise, another notification is issued when the pool
reaches 80 percent of its capacity. This enables the administrators to extend the storage
pool by adding more disk drives before it runs out of capacity. In this way, proactive
monitoring can prevent a service outage caused by lack of space in a storage pool.
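A minimal sketch of such threshold-based notification logic, assuming the 66 and 80 percent thresholds from the example and a simple print in place of a real alerting mechanism, might look like this:

# Illustrative sketch: issue a notification when storage pool utilization
# crosses predefined thresholds (66% and 80%, as in the example above).
THRESHOLDS = [66, 80]  # percent of pool capacity

def check_pool(used_gb, capacity_gb, already_notified):
    utilization = 100.0 * used_gb / capacity_gb
    for threshold in THRESHOLDS:
        if utilization >= threshold and threshold not in already_notified:
            already_notified.add(threshold)
            # A real tool would raise an alert, send an email, or open a ticket here.
            print(f"NOTIFY: pool at {utilization:.0f}% (threshold {threshold}% reached); "
                  f"consider adding disk drives")

notified = set()
check_pool(used_gb=660, capacity_gb=1000, already_notified=notified)  # crosses 66%
check_pool(used_gb=810, capacity_gb=1000, already_notified=notified)  # crosses 80%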

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 550
Performance monitoring evaluates how efficiently different services and infrastructure
components are performing. It examines performance metrics such as response time,
throughput, I/O wait time, and processor utilization of services and infrastructure
components. This helps to identify performance bottlenecks. Performance monitoring also
helps in analyzing the performance trends of infrastructure components and finding
potential performance degradation or failure of a service.
Synthetic monitoring can provide an organization the ability to perform deep transactional
monitoring from a consumer’s perspective, looking beyond the availability and performance
of the constituent elements such as processor, memory, and storage of a cloud service. For
example, a monitoring agent external to the cloud might simulate or perform a set of
transactions against a cloud service, such as searching an online store for an item, adding it
to a shopping cart and then checking out, to ensure that the service is functioning within
desired operating parameters. This can provide end-to-end visibility into the availability and
performance of the cloud service and provide early insight into any issues, enabling
remediation before an issue impacts the service itself.
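Such a synthetic probe can be approximated by a short script that periodically runs the transaction and records its outcome and duration. The sketch below is a simplified assumption: the store URL and the three endpoints are placeholders, and a real probe would run outside the cloud and feed its results into the monitoring tool.

# Illustrative sketch: a synthetic transaction probe that times a
# search -> add-to-cart -> checkout sequence against a hypothetical store URL.
import time
import urllib.request

BASE_URL = "https://shop.example.com"            # placeholder endpoint
STEPS = ["/search?q=item", "/cart/add?item=42", "/checkout"]
SLA_SECONDS = 2.0                                # assumed acceptable end-to-end time

def run_probe():
    start = time.monotonic()
    try:
        for path in STEPS:
            urllib.request.urlopen(BASE_URL + path, timeout=5).read()
        elapsed = time.monotonic() - start
        status = "OK" if elapsed <= SLA_SECONDS else "SLOW"
        print(f"{status}: transaction completed in {elapsed:.2f}s")
    except Exception as exc:                     # any failed step breaks the transaction
        print(f"FAIL: synthetic transaction failed ({exc})")

if __name__ == "__main__":
    run_probe()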

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 551
This use case illustrates the importance of monitoring the performance of a service.
In this context, the service comprises the following elements:
• 2 Web servers
• 1 application server
• 1 database server
• 2 load balancers distributing client connections across the Web servers
• 1 IP router enabling network address translation (NAT)
• 1 firewall filtering client traffic from the internet to the Web servers
• VLANs 10, 20, 30, and 40 interconnecting the above elements as shown in the figure
Synthetic monitoring reveals that the throughput of the service has dropped drastically
since 8 AM. The monitoring tool deployed at the provider’s premises shows that at about the
same time the processor utilization at the database server started rising and is nearing 100
percent. However, the utilization of the other elements of the cloud service is normal. This
clearly indicates that the database server is the performance bottleneck. The monitored
information enables the cloud administrators to increase the processing power of the VM
providing the database service before consumers report performance issues.
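The reasoning in this use case, comparing per-element utilization and singling out the outlier, can be expressed in a few lines. The element names, utilization values, and threshold below are assumptions chosen to mirror the scenario.

# Illustrative sketch: flag the service element whose processor utilization
# is abnormally high while the other elements remain normal (values are examples).
utilization = {
    "web-server-1": 35, "web-server-2": 38,
    "app-server-1": 42, "db-server-1": 97,
    "load-balancer-1": 20, "load-balancer-2": 22,
}
HIGH_WATERMARK = 90  # percent; assumed threshold indicating a bottleneck

bottlenecks = [name for name, pct in utilization.items() if pct >= HIGH_WATERMARK]
print("Probable bottleneck(s):", bottlenecks or "none")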

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 552
Security monitoring tracks unauthorized access and configuration changes to the cloud
infrastructure and services. It detects all service-related operations and data movement that
deviate from the security policies implemented and included in the service contract by the
provider. Security monitoring also detects unavailability of services to authorized consumers
due to security breaches. Physical security within the cloud service provider’s premises
should also be continuously monitored using appropriate technologies such as badge
readers, biometric scans, and video cameras.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 553
This use case illustrates the importance of monitoring security in a cloud environment.
The figure shows a community cloud infrastructure at a provider’s premises. The community
cloud provides services to the provider’s organization as well as a partner site. All network
traffic between the cloud infrastructure and the partner site is transferred over the internet.
This environment poses a risk of malicious or unintended transfer of the provider’s sensitive
data, such as user account information, credit card data, key employee data, and the
organization’s intellectual property, to the partner site.
A network monitoring and analysis system deployed in the provider’s infrastructure can
mitigate the risk of sensitive data leakage. It monitors and analyzes the content of all
network traffic transferred through a network link and blocks traffic that is not compliant with
the applicable security policy. It also shows security events (shown in the table) by listing the
date of violation, severity of the event, name of the user who violated the policy, violated
policy name, and any action taken to prevent the data leakage. If the security events are not
monitored or recorded, it is difficult to track such violations of the security policy. Conversely,
if these events are monitored, notifications can be sent to prompt appropriate administrative
action, to educate users on policy violations, or at least to enable discovery as part of regular
auditing operations.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 554
An alert is a system-to-provider (and sometimes consumer) notification that provides
information about an event or impending threats or issues. Alerting is based on monitoring
and is triggered when a specific situation or condition is reached. These conditions may be
defined by cloud administrators through monitoring tools that generate alerts based on the
identified conditions. Alerting keeps administrators and consumers informed about the
status of various services, infrastructure components, and service-related processes. It also
helps administrators respond to service-related issues quickly and proactively. Consider an
example: a condition such as degraded performance of a data center or site (also called a
service availability zone) providing a service in a multi-zone cloud environment may trigger
an alert message, indicating a need for administrative intervention. Other examples, such as
a resource pool having reached a capacity threshold, a service operation having breached a
security policy, or a consumer having violated an SLA term, may also be indicated via alerts.
Alerts may be classified as informational, warning, or fatal based on their severity level.
Informational alerts provide useful information but may not require any intervention by the
administrators or consumers. Warning alerts require attention so that the alerted condition
is contained and does not affect service availability and performance in the future. Fatal
alerts require immediate attention because the condition might affect the performance and
availability of a cloud service. The table in the slide describes examples of the different types
of alerts.
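A monitoring tool typically maps monitored conditions to these severity levels with simple rules. The sketch below shows one possible mapping; the event fields and thresholds are assumptions, not a specific product's behavior.

# Illustrative sketch: classify monitored conditions into informational,
# warning, or fatal alerts. Event fields and thresholds are assumptions.
from enum import Enum

class Severity(Enum):
    INFORMATIONAL = 1
    WARNING = 2
    FATAL = 3

def classify(event):
    if event.get("service_down"):
        return Severity.FATAL              # condition already affects availability
    if event.get("pool_utilization", 0) >= 80:
        return Severity.WARNING            # needs attention before service is impacted
    return Severity.INFORMATIONAL          # useful information, no action required

print(classify({"service_down": True}))           # Severity.FATAL
print(classify({"pool_utilization": 85}))         # Severity.WARNING
print(classify({"new_vm_provisioned": "vm-42"}))  # Severity.INFORMATIONAL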
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 555
Reports are formulated from the configuration, availability, capacity, performance, and
security data gathered as a result of monitoring. They may provide the current status and
historical trends of configuration, connectivity, capacity, performance, resource usage,
service cost, or billing. They are usually visible through the cloud portal and the management
software interface.
Reports are commonly displayed as a digital dashboard, which provides real-time tabular or
graphical views of the monitored information. Dashboard reporting helps cloud service
providers make instantaneous and informed decisions on resource procurement, plans
for modifications to the existing infrastructure, and improvements in service management
processes. In other words, dashboard reporting is a critical component of both portfolio and
operation management. Reports may also be made available to consumers, enabling them
to view their service usage and bills. The table in the slide describes some common report types.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 557
This lesson covered configuration monitoring, availability monitoring, capacity monitoring,
performance monitoring, security monitoring, alerting, and reporting.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 558
This lesson covers service asset and configuration management, change management,
capacity management, and performance management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 559
The goal of service asset and configuration management is to maintain information on
configuration items (CIs) and their relationships. CIs are components such as services,
process documents, infrastructure components including hardware and software, people,
and SLAs that need to be managed in order to deliver services. Key functions of service asset
and configuration management are described below:
• Service asset and configuration management team maintains information about CI
attributes such as the CI’s name, manufacturer name, serial number, license status,
version, description of modification, location, and inventory status (e.g., on order,
available, allocated, or retired). Along with these attributes, it keeps information on used
and available capacities of CIs and any issues linked to the CIs. Further, it maps and
maintains information on the inter-relationships among CIs. Such relationships in a cloud
infrastructure commonly include service-to-consumer, VM-to-service, compute resource
pool-to-VM, physical compute system-to-compute resource pool, physical compute
system-to-network switch, and data center-to-geographic location. This mapping ensures
that CIs are viewed as integrated components. A consolidated view of CIs helps in the
process of identifying the root cause of issues in the cloud infrastructure and services and
in assessing the impact of any change in relationships. For example, when a switch fails,
administrators of the provider’s organization get an automated alert from the service
management tools about other CIs that are affected by that outage. Similarly, when
administrators decide to increase the capacity of a storage pool, they can easily identify
the CIs affected by the change.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 560
• Service asset and configuration management team is responsible for maintaining all
information about CIs in a single database or in multiple autonomous databases mapped
into a federated database called a configuration management system (CMS). The
federated database bridges information from multiple autonomous databases to provide
a single view about the CI attributes and their relationships across even very large
infrastructures. Other service management processes interact with the CMS to view the
relationships between CIs and to track their configurations. This interaction, for example,
helps another process to examine the deployment of a security patch, resolve an issue in
the cloud infrastructure, or to determine required changes in a service. The CMS is
usually populated by discovery tools that automatically gather information on CIs.
Discovery tools also update the CMS when new CIs are deployed or when attributes of
CIs change. Further, these tools periodically check the veracity and currency of
information about CIs to ensure that the information maintained in the CMS is an exact
representation of the CIs used to provide cloud services.

Note: Some readers may be more familiar with the term configuration management
database (CMDB) than with CMS in the context of configuration management. A CMDB is a
single repository for storing configuration records, as opposed to the federated database
approach that is more common in the cloud environment.
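Conceptually, the CMS can be viewed as a graph of CIs and their relationships, and impact analysis is then a traversal of that graph. The sketch below, with hypothetical CI names and dependencies, finds the CIs affected when a network switch fails.

# Illustrative sketch: model CIs and their relationships as a graph and
# find all CIs affected when one CI (for example, a switch) fails.
# The CI names and dependencies below are hypothetical.
from collections import defaultdict, deque

# "X depends on Y" edges: if Y fails, X is potentially affected.
depends_on = {
    "service-webshop": ["vm-web-1", "vm-db-1"],
    "vm-web-1": ["compute-pool-1"],
    "vm-db-1": ["compute-pool-1"],
    "compute-pool-1": ["compute-system-A", "switch-1"],
    "compute-system-A": ["switch-1"],
}

# Invert the edges to "Y failing affects X" for the traversal.
affects = defaultdict(list)
for ci, deps in depends_on.items():
    for dep in deps:
        affects[dep].append(ci)

def impacted(failed_ci):
    seen, queue = set(), deque([failed_ci])
    while queue:
        for nxt in affects[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(impacted("switch-1"))  # CIs affected by the switch outage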

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 561
The goal of change management is to standardize change-related procedures in a cloud
infrastructure for prompt handling of all changes with minimal impact on service quality. A
good change management process enables the cloud service provider to respond to
changing business requirements in an agile way and greatly minimizes risk to the
organization and its services.
Relevant changes could range from the introduction of a new service offering, to
modification of an existing service’s attributes, to retirement of a service; from a hardware
configuration change, to expansion of a resource pool, to a software upgrade, and even to a
change in process or procedural documentation. Change management oversees all changes
to CIs to minimize adverse impact of those changes to the business and the consumers of
cloud services.
Drivers and causes of changes in cloud infrastructure and services can be as varied as the
range of changes themselves. There may be a need to improve service, to mitigate the
impact of specific problems, to meet compliance or regulatory requirements, to address a
modification in an SLA, or to accommodate revised business objectives.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 562
The goal of capacity management is to ensure that a cloud infrastructure is able to meet the
required capacity demands for cloud services in a cost effective and timely manner. Capacity
management helps ensure that peak demands from consumers can be met without
compromising SLAs and at the same time optimizes capacity utilization by minimizing spare
and stranded capacity. Key functions of capacity management are described below:
• Capacity management team determines the optimal amount of resources required to
meet the needs of a service regardless of dynamic resource consumption and seasonal
spikes in resource demand. With too few resources, consumers may have to wait for
resources to free up or their requests may be rejected until additional resources are
available in the cloud. With too many resources, the service cost may rise unnecessarily
due to maintenance of many unused, spare resources. Effective capacity planning
maximizes the utilization of available capacity without impacting service levels. The slide
provides a list of common methods to maximize capacity utilization.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 564
• Capacity management team plans both current and future infrastructure capacity
requirements. Capacity management tools are usually capable of gathering historical
information on resource usage over a specified period of time, establishing trends on
capacity consumption, and performing predictive analysis of future demand. This analysis
serves as input to the capacity planning activities and enables the procurement and
provisioning of additional capacity in the most cost-effective and least disruptive manner. A
simplified trend-projection example is sketched below.
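As a simplified illustration of such predictive analysis, the sketch below fits a linear trend to a few months of historical capacity usage and estimates when the capacity limit will be reached. The usage figures and capacity limit are assumptions.

# Illustrative sketch: fit a linear trend to historical capacity usage and
# estimate when the capacity limit will be reached. Data points are examples.
usage_tb = [(0, 40.0), (1, 44.5), (2, 49.0), (3, 53.5)]  # (month, used TB)
CAPACITY_TB = 80.0                                        # assumed pool capacity

n = len(usage_tb)
mean_x = sum(x for x, _ in usage_tb) / n
mean_y = sum(y for _, y in usage_tb) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in usage_tb) / \
        sum((x - mean_x) ** 2 for x, _ in usage_tb)
intercept = mean_y - slope * mean_x

months_to_full = (CAPACITY_TB - intercept) / slope
print(f"Growth ~{slope:.1f} TB/month; capacity reached around month {months_to_full:.1f}")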

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 565
This use case illustrates the expansion of a storage pool, responding quickly to the needs of a
provider and consumers using an orchestrated workflow.
An administrator on the capacity management team requests additional storage for a storage
pool (see the figure on the slide) through the cloud portal. The request is transferred to the
orchestrator, which triggers a change approval and execution workflow. The orchestrator
determines whether the request for change needs to be reviewed by the change management
team (the change advisory board, or CAB). If the request is preapproved, it is exempted from
change management review. If not, the orchestrated workflow ensures that the change
management team reviews and approves or rejects the request.
If the request is approved by the change management team, a resource provisioning request
is sent to the unified manager. The unified manager interacts with the element manager to
provision the required storage to the storage pool. The orchestrated workflow also initiates the
discovery of the additional storage in the pool. Upon discovery, the CMS is updated with
information about the modified pool capacity. The orchestrator then sends appropriate
updates to the cloud portal following completion or rejection of the request to expand the
storage pool.
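The decision logic of this workflow can be outlined in a few lines. The function names below (change approval, unified-manager provisioning, discovery/CMS update, portal notification) are stand-ins for the actual tool integrations, not a real API.

# Illustrative sketch of the orchestrated storage-pool expansion workflow
# described above. All functions are placeholders for real tool integrations.

def is_preapproved(request):           return request.get("preapproved", False)
def change_board_approves(request):    return True  # placeholder for the CAB review
def provision_storage(request):        print(f"Provisioning {request['size_gb']} GB to {request['pool']}")
def discover_and_update_cms(request):  print(f"CMS updated with the new capacity of {request['pool']}")
def notify_portal(request, status):    print(f"Portal update: request {status}")

def expand_storage_pool(request):
    if not is_preapproved(request) and not change_board_approves(request):
        notify_portal(request, "rejected by change management")
        return
    provision_storage(request)         # unified manager -> element manager
    discover_and_update_cms(request)   # discovery refreshes the CMS
    notify_portal(request, "completed")

expand_storage_pool({"pool": "storage-pool-1", "size_gb": 500, "preapproved": False})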

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 566
The goals of performance management are to monitor, measure, analyze, and improve the
performance of cloud infrastructure and services. Key functions of performance
management are described below:
• Performance management team uses specialized management tools that monitor and
measure performance metrics of the cloud infrastructure components and service
instances, such as response time and data transfer rate. These tools analyze the
performance statistics and detect components and service instances that are performing
below expected levels. They proactively alert administrators about potential performance
issues and may prescribe a course of action to improve a situation.
• Administrators of performance management team implement changes before consumers
are impacted by performance issues in the cloud environment. Some of the examples of
changes that may be implemented in order to maintain needed performance levels are
listed in the slide.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 567
• Overutilization of resource pools is a key reason for performance degradation.
Performance management team determines the required capacity of the resource pools
to meet expected performance levels. It works together with capacity management team
to implement changes related to capacity and performance.
• Performance management team ensures that all infrastructure components are meeting
or exceeding required performance levels. To maintain the agreed performance level,
administrators may define strategies for policy-based placement of VMs across clustered
hypervisors, quality of service (QoS) networking, automated storage tiering, and cache
tiering, as several examples.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 568
This lesson covered service asset and configuration management, change management,
capacity management, and performance management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 569
This lesson covers incident management, problem management, availability management,
and information security management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 570
The goal of incident management is to return cloud services to consumers as quickly as
possible when unplanned events, called incidents, cause an interruption to services or
degrade service quality. An incident may not always cause service failure; for example, the
failure of a disk from a mirrored set of RAID protected storage does not cause data
unavailability. However, if not addressed, recurring incidents may cause service interruption
in the future. Key functions of incident management are described below:
• The incident management process detects and records all incidents in the cloud
infrastructure. Cloud service providers usually deploy incident management tools that
enable cloud administrators to manage and track the incidents from their initiation to
closure. These tools may proactively recognize specific event(s) in the cloud infrastructure
as incidents and automatically log them for resolution. They automatically respond to,
resolve, and escalate incidents. Incidents may also be registered by the consumers
through the cloud portal or by email. Additionally, incidents may be reported by
consumers through a service desk. The service desk may consist of a call center to handle
a large volume of telephone calls and a help desk as the first line of service support.
Upon the initial report of an incident, the service desk attempts to resolve it by consulting the
known error database and the CMS. Known errors are common issues or incidents
experienced by consumers for which solutions or workarounds have already been defined.
The known error database helps the service desk execute the necessary technical actions to
restore a failed or degraded service more quickly. If the service desk is unable to resolve an
incident, it is escalated to other incident management support groups or to problem
management (described next).
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 571
• The incident management team investigates the incidents escalated by the incident
management tool and the service desk. It identifies appropriate solutions to resolve the
incidents. The incident management team usually consists of multiple support groups,
such as first and second levels. Incident tracking, assignment to appropriate personnel,
status updates, and escalation across these groups are automated through the incident
management tool. First-level support specialists are capable of addressing most of the
common incidents. If they are unable to do so, they quickly escalate the incidents to the
second-level technical experts, and the process continues until the incident is resolved.
The exact number of support groups depends on the provider’s service management
structure, but irrespective of the number, incident management should provide solutions
to restore cloud services within the agreed timeframe specified in the SLA.
• If the incident management team is unable to determine and correct the root cause of an
incident, the error-correction activity is transferred to problem management. In this case,
the incident management team provides a temporary solution (workaround) for the
incident; for example, migration of a cloud service to a different resource pool in the same
data center or in a different data center. During the incident resolution process, the
consumer is kept apprised of the incident status.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 572
This use case illustrates the resolution of an incident detected by an incident management
tool deployed in the cloud infrastructure. The tool recognizes an event related to a cloud
service as an incident and automatically logs it for resolution. If an orchestrated workflow
exists to resolve the incident, the incident management tool triggers the workflow through
interaction with the orchestrator. The workflow runs the diagnostic and remedial procedures
that address the incident. If no such workflow exists, the incident management tool opens an
incident ticket and assigns the incident resolution activity to the incident management team.
Administrators of the incident management team investigate the reason(s) that caused the
incident and resolve it by providing an appropriate solution. Further, if the incident is chronic
or a recurring issue, the incident management team turns it over to problem management for
further analysis. Otherwise, the incident management team closes the incident.
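The branching described in this use case can be summarized as follows. The workflow registry, ticketing call, and hand-off to problem management are hypothetical placeholders for the provider's actual tools.

# Illustrative sketch of the incident-handling flow described above.
# The workflow registry and the ticketing/problem-management steps are placeholders.

REMEDIATION_WORKFLOWS = {"vm_unresponsive": "restart-vm-workflow"}  # assumed registry

def handle_incident(incident):
    workflow = REMEDIATION_WORKFLOWS.get(incident["type"])
    if workflow:
        print(f"Triggering orchestrated workflow '{workflow}' for {incident['id']}")
        return
    print(f"No workflow found; opening a ticket for {incident['id']} and assigning it to the team")
    if incident.get("recurring"):
        print(f"Recurring issue; transferring {incident['id']} to problem management")
    else:
        print(f"Incident {incident['id']} resolved and closed by the incident management team")

handle_incident({"id": "INC-1001", "type": "vm_unresponsive"})
handle_incident({"id": "INC-1002", "type": "storage_latency", "recurring": True})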

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 574
The goal of problem management is to prevent incidents that share common symptoms or –
more importantly – root causes from reoccurring, and to minimize the adverse impact of
incidents that cannot be prevented. A problem is recognized when multiple incidents exhibit
one or more common symptoms. Problems may also be identified from a single significant
incident that is indicative of a single error for which the cause is unknown, but the impact is
high. Key functions of problem management are described below:
• The problem management process detects problems and ensures that the underlying
root cause that creates a problem is identified. Incident and problem management,
although separate service management processes, require automated interaction
between them and use integrated incident and problem management tools. These tools
may help an administrator to track and mark specific incident(s) as a problem and
transfer the matter to problem management for further investigation. Alternatively, these
tools may automatically identify incidents that are most likely to require root cause
analysis. Further, these tools may have analytical ability to perform root cause analysis
based on various alerts. They search alerts that are indicative of problems and correlate
these alerts to find the root cause. This helps to resolve problems more quickly.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 575
• Problem management team minimizes the adverse impact of incidents and problems
caused by errors in the cloud infrastructure and initiates actions to prevent recurrence of
incidents related to those errors. Problem handling activities may occur both reactively
and proactively. These are described below:
• Reactive problem management: It involves a review of all incidents and their
history for problem identification. It prioritizes problems based on their impact to
business and consumers. It identifies and investigates the root cause that creates
a problem and initiates the most appropriate solution and/or preventive
remediation for the problem. It also reviews the known error database to find out
if the problem has occurred in the past and if a resolution is already in place. If
complete resolution is not available, problem management provides solutions to
reduce or eliminate the impact of a problem.
• Proactive problem management: It helps prevent problems. By proactively
analyzing errors and alerts in the cloud infrastructure and identifying impending
service failures or quality degradation, problem management team identifies and
solves errors before a problem occurs. By preventing problems, instead of just
reacting to them, the problem management function reduces the number of
incidents and potential SLA violations.
• Problem management is responsible for creating the known error database. After
problem resolution, the issue is analyzed and a determination is made whether to add it
to the known error database. Including resolved problems in the known error database
provides an opportunity to learn from them and to better handle future incidents and problems.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 576
This use case illustrates reactive problem management. It resolves a problem recorded by
an integrated incident and problem management tool deployed in the cloud infrastructure.
The incident and problem management tool automatically identifies incidents that are most
likely to require root cause analysis. It performs root cause analysis by correlating various
alerts across all components of the cloud infrastructure. This helps a cloud administrator
quickly identify the root cause of incidents, the symptoms of the issue, and the impacted
components and services. The incident and problem management tool also logs a problem
for administrative action.
Administrators of the problem management team can view the problem details, including the
root cause recorded by the incident and problem management tool. They determine the
remedial steps to correct the root cause. If the root cause can be prevented by a change in
the cloud infrastructure, they create a request for change for change management approval.
Upon obtaining the approval, they ensure that the change is implemented. Once an
appropriate solution is implemented to remediate the root cause, the problem management
team closes the problem.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 577
The goal of availability management is to ensure that stated availability commitments are
consistently met. The availability management process optimizes the capability of cloud
infrastructure, services, and the service management team to deliver to consumers a cost
effective and sustained level of service that meets SLA requirements. Key functions of
availability management are described below:
• The availability management team gathers information on the availability requirements
for upgraded and new services. Different types of cloud services may be subject to
different availability commitments and recovery objectives. A provider may also decide to
offer different availability levels for the same type of service, creating tiered services.
• The availability management team proactively monitors whether availability of existing
cloud services and infrastructure components is maintained within acceptable and
agreed levels. The monitoring tools identify differences between the committed
availability and the achieved availability of services and notify administrators through
alerts.
• Availability management team interacts with incident and problem management teams,
assisting them in resolving availability-related incidents and problems. Through this
interaction, incident and problem management teams provide key input to the availability
management team regarding the causes of service failures. Incident and problem
management also provide information about errors or faults in the infrastructure
components that may cause future service unavailability. With this information, the
availability management team can quickly identify new availability requirements and
areas where availability must be improved.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 578
• The availability management team analyzes, plans, designs, and manages the procedures
and technical features required to meet or exceed both current and future availability
needs of services at a justifiable cost. Based on the SLA requirements of enhanced and
new services, and areas found for improvement, the administrators of availability
management team may suggest changes in the existing business continuity (BC) solutions
or architect new solutions that provide additional tolerance and resilience against service
failures. The availability management team also ensures that the deployed BC solutions
are tested regularly. The slide provides examples of BC solutions that could be proposed by
the availability management team.
While the primary goal of availability management is to ensure that service availability
meets SLA commitments, it provides additional benefits, such as reducing the time
consumers spend with the service desk addressing outages, improving consumer
satisfaction and consequently lifting the cloud service provider’s reputation. With a
reduction in the frequency and duration of availability related incidents and problems, fewer
staff members are required to handle incident and problem management functions,
reducing the cloud service provider’s operational costs.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 579
The goal of information security management is to prevent the occurrence of incidents or
activities that adversely affect the confidentiality, integrity, and availability of information in
cloud services and in all service management processes. Protecting corporate and consumer
data to the extent required to meet regulatory or compliance concerns (both internal and
external), at a reasonable and acceptable cost, is an additional goal of information security.
The interests of all stakeholders of a cloud service, including consumers who rely on the
information and the IT infrastructure, are considered. Key functions of information security
management are described below:
• Information security management team implements the cloud service provider’s security
requirements. It develops information security policies that govern the provider’s
approach towards information security management. These policies may be specific to a
cloud service, an external service provider, an organizational unit, or they can be
uniformly applicable. Information security policies must be approved by top executive
management. These security assurances are often detailed in SLAs and contracts.
Information security management requires periodic reviews and, as necessary, revision of
these policies.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 580
The slide provides examples of information security policies developed by the information
security management team.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 581
• Information security management team establishes a security management framework
aligned with the security policies. The framework specifies security architecture,
processes, mechanisms, tools, responsibilities (for both consumers and cloud
administrators), and standards needed to ensure information security in a cost-effective
manner. The security architecture describes the structure and behavior of security
processes, methods of integrating security mechanisms with the existing IT
infrastructure, service availability zones, locations to store data, and security
management roles. Information security management team evaluates the risk exposure
to data and cloud services and potential security breaches while creating the framework.
• Information security management team ensures that the security processes and
mechanisms articulated in the framework are properly implemented. It collaborates with
change management to obtain approval on the security-related changes. The
implementation may be performed by an internal group of implementation engineers
and supported by the suppliers of security products and services. However, the
information security management function is responsible for monitoring, managing, and
providing guidelines for such implementations.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 582
The security management framework, if successfully implemented, should provide the
following capabilities:
• Safeguard data, services, and service management processes by automatically preventing
attacks on them; for example, scan applications and databases to identify vulnerabilities
and provide protection against any threats.
• Minimize the impact of an attack on a compromised system by promptly taking corrective
action such as isolating a compromised physical compute system from the production
environment or blocking a specific type of network traffic.
• Automatically and quickly detect security attacks, deviations from security policies,
anomalies in the pattern of operation and usage of a service, and violations of regulatory
compliance in real-time, and generate alerts to take preventative or corrective action.
• Control access to data and services at multiple levels (defense in depth, elaborated in
module 8) reducing the risk of a security breach if a protection mechanism at one level
gets compromised.
• Isolate and mask consumer’s data and activities from other consumers and unauthorized
personnel in the provider’s organization.
• Automatically log all activities on the data and services, including data movement into and
out of the cloud and within the cloud. This facilitates proving compliance with security
policies during an audit.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 583
This use case illustrates a security architecture aligned with a provider’s approach to
information security management. It is created by the information security management
team of the provider’s organization to conform to the security policy. It secures a cloud
service by filtering incoming network traffic.
In this context, a cloud service is divided into three tiers – Web, application, and
database. The Web tier consists of two VMs (Web servers), and all client traffic routed
through the internet is load balanced (not shown in this architecture) across these VMs. The
application tier comprises a single VM running the application software; likewise, the database
tier contains a single VM that provides the database service. Communication between these
tiers enables the cloud service to function.
The security architecture ensures that virtual firewalls are deployed when a service is
provisioned. These firewalls examine the network traffic traversing to and from the VMs and
only permit traffic that passes the firewall rules. In this use case, network traffic from the
internet to the Web tier is processed by a firewall, shown in the figure, that allows only HTTP
or HTTPS traffic according to the firewall rule. The traffic from the Web tier to the application
tier is processed by another firewall that permits only traffic specific to the deployed
application. Similarly, the network traffic from the application tier to the database tier is
filtered by another firewall that allows only Structured Query Language (SQL) traffic.
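The per-tier filtering in this use case can be expressed as a small rule table. The sketch below uses hypothetical tier names and port numbers and simply checks whether a given flow would be permitted.

# Illustrative sketch: per-tier firewall rules matching the use case above.
# Tier names and ports are assumptions; only the listed flows are permitted.
ALLOWED_FLOWS = {
    ("internet", "web"): {80, 443},  # HTTP/HTTPS only
    ("web", "app"): {8080},          # assumed application-specific port
    ("app", "db"): {1433},           # assumed SQL port
}

def permitted(src_tier, dst_tier, port):
    return port in ALLOWED_FLOWS.get((src_tier, dst_tier), set())

print(permitted("internet", "web", 443))  # True  - HTTPS to the Web tier
print(permitted("internet", "db", 1433))  # False - direct database access blocked
print(permitted("app", "db", 1433))       # True  - SQL from the application tier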

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 584
This lesson covered incident management, problem management, availability management,
and information security management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 585
The Concepts in Practice section covers four products that facilitate cloud service
management. These products are: EMC ProSphere, EMC Smarts, VMware vCenter
Operations Management Suite, and VMware IT Business Management Suite.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 586
EMC ProSphere enables cloud administrators to view and manage storage resources from a
common console. With ProSphere, administrators can quickly discover cloud infrastructure
components across one or multiple virtualized data centers.
ProSphere enables storage capacity monitoring and analysis in a cloud environment. It offers
end-to-end capacity views across the cloud while improving administrator productivity by
offering an easily navigated dashboard. The dashboard displays storage resource allocation,
usage, and trends. This helps in optimizing utilization of resources and service performance,
as well as in meeting SLAs.
ProSphere proactively identifies performance issues that may impact service levels. It also
displays an end-to-end view of performance across compute, storage, and network
components, and their relationships and dependencies. Further, it can analyze historical
performance data and quickly identify performance trends. This allows administrators to
optimize and troubleshoot performance more quickly.
ProSphere continuously tracks all the changes in the cloud infrastructure and validates them
against organization-specific policies defined in the software. This ensures consistent service
levels.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 587
The VMware vCenter Operations Management Suite includes a set of tools that automates
performance, capacity and configuration management and provides an integrated approach
to service management. It enables IT organizations to ensure service levels, optimum
resource usage, and configuration compliance in virtualized and cloud environments. The
vCenter Operations Management Suite includes four components. These components are
described below:
• vCenter Operations Manager provides operations dashboards to gain visibility into the
cloud infrastructure. It identifies potential performance bottlenecks automatically and
helps remediate them before consumers notice problems. Further, it enables optimizing
usage of capacity and performs capacity trend analysis.
• vCenter Configuration Manager automates configuration management tasks such as
configuration data collection, configuration change execution, configuration reporting,
change auditing, and compliance assessment. This automation enables organizations to
maintain configuration compliance and to enforce IT policies, regulatory requirements,
and security hardening guidelines.
• vCenter Hyperic monitors hardware resources, operating systems, middleware, and
applications. It provides immediate notification in the event of application performance
degradation or unavailability. This enables administrators to ensure availability and
reliability of business applications.
• vCenter Infrastructure Navigator automatically discovers application services running on
the VMs and maps their dependency on IT infrastructure components.
(Contd.)

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 589
This module covered key service management processes – service catalog management,
financial management, supplier management, monitoring, service asset and configuration
management, change management, capacity management, performance management,
incident management, problem management, availability management, and information
security management.

Copyright © 2014 EMC Corporation. All rights reserved Module 9: Service Management 591
