
CS-524 Distributed Computer Systems

M. Engg. (Computer Systems) Fall Semester 2009


Instructor: Shahab Tahzeeb (Assistant Professor)
Department of Computer & Information Systems Engineering

NED University of Engineering & Technology, Karachi

June 24, 2009

CS-524(NED) Lec 01

Today's Agenda
Getting to know each other
Describing our roles to make this course a
real success
Overview of the Course


My Role
Continuously strive to expose you to the
subject knowledge in a manner that helps
save your time in getting hold of details


Your Role
Continuously strive to be regular in every aspect
Schedule some time to review lectures before coming to class
Take sessional work seriously
Ask questions. There are NO stupid questions
Learning-centered approach
You learn as well as earn a good grade

Grading-centered approach
You may get a good grade but you never learn


Academic Calendar
9 weeks Teaching
22nd June, 2009 to 22nd August, 2009

5 weeks (Ramazan/Eid Break)


24th August, 2009 to 26th September, 2009

7 weeks Teaching
26th September, 2009 to 14th November, 2009

Final Examinations
1st December, 2009 to 15th December, 2009

Results Declaration
Last week of December, 2009


Books

Andrew S. Tanenbaum and Maarten van Steen


Distributed Systems: Principles and Paradigms
Prentice Hall

George Coulouris, Jean Dollimore and Tim Kindberg


Distributed Systems: Concepts and Design
Pearson Education


Topics

Introduction
Communication
Processes
Naming
Synchronization
Consistency and Replication
Fault Tolerance
Security
* We shall add topics to this list if time permits


Course Objectives

Describe fundamental concepts of and techniques in distributed systems

Analyze distributed systems according to desired qualities (such as performance, reliability, or availability)

Apply distributed systems techniques (such as Remote Procedure Call, event-based communication, or transactions) to implement distributed system designs

Compare and contrast concepts of and techniques in distributed systems with respect to their ability to fulfill desired qualities

Design distributed systems according to desired qualities by choosing among introduced concepts and techniques


Grading
Quizzes: 05%
3 announced quizzes (weeks 3, 6 and 12)
2 surprise quizzes
2 announced and 1 surprise quiz will be graded

Homework: 05%
Class Participation: 05%
Term Paper: 05%
Mid-Term (9th week): 10%
Final: 70%
No early or makeup exams please!


Web Group for Course Management


http://groups.yahoo.com/group/cs524-09B


Distributed Computer Systems


Fundamentals


What's a Distributed System?


Definition # 1
A collection of independent computers that
act as an integrated system and hence
appear to the end user as a single
computer (i.e. a virtual uniprocessor)
Two aspects
Hardware: autonomous machines
Software: users think they're dealing with a
single system


Definition # 1
User's view of a Distributed System:
Multiple computers that work together in a more or
less seamless fashion (single system image)

To support heterogeneous computers and networks and still present a single-system
image, systems may rely on middleware:
a software layer that provides a consistent interface to
the user, regardless of the underlying platform.


Definition # 1

A distributed system organized as middleware. The middleware layer runs on all
machines and offers each application the same interface. It provides a programming
abstraction and masks the heterogeneity of the underlying networks,
hardware, operating systems and programming languages.

CORBA: A Middleware Example


CORBA is the OMG's open, vendor-independent
architecture and infrastructure that computer
applications use to work together over networks.
Using the standard protocol IIOP, a CORBA-based
program from any vendor, on almost any computer,
operating system, programming language, and
network, can interoperate with a CORBA-based
program from the same or another vendor, on almost
any other computer, operating system, programming
language, and network.

Other Middleware Examples


DCOM
Distributed Component Object Model

RPC
Remote Procedure Call

RMI
Remote Method Invocation


ONC RPC
Open Network Computing (ONC) Remote Procedure Call is a widely deployed
remote procedure call system.
ONC was originally developed by Sun
Microsystems as part of their Network File
System project, and is sometimes referred
to as Sun ONC or Sun RPC


Definition # 2
Enslow:
A distributed system is one in which
hardware, control and data achieve some
degree of decentralization, and the distribution
of resources is transparent to the user


Definition # 2

H1. A single CPU with one control unit.


H2. A single CPU with multiple ALUs. There is only one
control unit.
H3. Separate specialized functional units, such as one
CPU with one floating-point coprocessor.
H4. Multiprocessor with single I/O system and a global
memory.
H5. Multicomputer with multiple I/O systems and local
memories.

C1. Single fixed control point. Note that physically the
system may or may not have multiple CPUs.
C2. Single dynamic control point. In multiple CPU cases
the controller changes from time to time among CPUs.
C3. A fixed master/slave structure. For example, in a
system with one CPU and one coprocessor, the CPU is a
fixed master and the coprocessor is a fixed slave.
C4. A dynamic master/slave structure. The role of
master/slave is modifiable by software.
C5. Multiple homogeneous control points where copies of
the same controller are used.
C6. Multiple heterogeneous control points where different
controllers are used.

D1. Centralized databases with a single copy of both files
and directory.
D2. Distributed files with a single centralized directory
and no local directory.
D3. Replicated database with a copy of files and a
directory at each site.
D4. Partitioned database with a master that keeps a
complete duplicate copy of all files.
D5. Partitioned database with a master that keeps only a
complete directory.
D6. Partitioned database with no master file or directory.

Extension to Enslow's Definition



Definition # 3
An Intimidating Definition
A distributed system is one in which the failure of
a computer you didn't even know existed can
render your own computer unusable
(Leslie Lamport)


Examples of Distributed Systems (1)

Internet
Mobile and Ubiquitous Computing
P2P Systems
Sensor Networks
Distributed Mobile Robots
Air Traffic Control (ATC) System
Banking, Stock Markets, Stock Brokerages
Health Care, Hospital Automation
Control of Power Plants, Electric Grid
Telecommunications Infrastructure


Examples of Distributed Systems (2)


Electronic Commerce and Electronic Cash on the Web
(very important emerging area)
Corporate Information Base: a company's memory of
decisions, technologies and strategies
Military Command, Control, and Intelligence Systems
Embedded Systems: automotive control systems
Mercedes S-Klasse automobiles these days are equipped with
50+ autonomous embedded processors
Connected through proprietary bus-like LANs


Distributed System vs. Network


There is little or no coordination among
networked machines
Users are aware of separate machines in
a network while a distributed system
operates in a seamless fashion.


Motivation (1)

Inherently Distributed Applications


Distributed systems have come into existence in some very natural
ways, e.g., in our society people are distributed and information should
also be distributed.
Applications which require sharing or dissemination of information
among distant entities are natural distributed systems
Distributed database system: information is generated at different branch
offices (sub-databases), so that local access can be done quickly.
The system also provides a global view to support various global
operations.
E.g. ATM, airline reservation systems, remote monitoring, etc.


Motivation (2)
Improved Performance/Cost Ratio (PCR)
The parallelism of distributed systems reduces
processing bottlenecks and provides improved all-around performance, at much lower cost.

Resource Sharing
Distributed systems can efficiently support information
and resource (hardware and software) sharing for
users at different locations.


Motivation (3)
Fault Tolerance
With the multiplicity of storage units and processing
elements, distributed systems have the potential ability to
continue operation in the presence of failures in the
system.

Scalability
Distributed systems are capable of incremental growth and
have the added advantage of facilitating modification or
extension of a system to adapt to a changing environment
without disrupting its operations.
Think of upgrading a mainframe or supercomputer!

Motivation (4)
Distribution as an Artifact
Distribution may be an artifact of an engineering solution to
satisfy some specific requirements such as
Fault-tolerance
Load-balancing
Minimum level of Quality of Service (QoS)

E.g. Replicated servers

Functional Distribution
Computers have different functional capabilities
Client / server
Host / terminal
Data gathering / data processing


Driving Forces
There are two main stimuli for the current
interest in distributed systems:
Technological Enhancement
microelectronics
fast and inexpensive processors

communication
highly efficient computer networks

User Needs
many enterprises are cooperative in nature


Classes of Distributed Systems


Distributed Computing Systems
Distributed Information Systems
Distributed Pervasive Systems


Distributed Computing Systems


High-Performance Computing Systems
Cluster computing
Grid computing


Cluster Computing
A collection of similar processors (PCs, workstations)
running the same (commodity) operating system,
connected by a high-speed network.
Runs parallel programs
Popular because they offer parallel computing
capabilities using inexpensive PC hardware; an
organization may be able to capitalize on machines it
already has.
Microsoft, Sun, and others sell clustering software and
you can also buy turnkey systems


Clusters: Beowulf Model


Linux-based
Structured according to master-slave paradigm
One processor is the master; allocates tasks to other
processors, maintains batch queue of submitted jobs,
handles interface to users
Libraries to handle message-based communication or
other features


Clusters: MOSIX Model

Provides a symmetric, rather than hierarchical, paradigm
High degree of distribution transparency
Processes can migrate between nodes
dynamically and preemptively


Grid Computing Systems

Modeled loosely on the electrical grid.


Unlike clusters, computers in grids are highly heterogeneous in their
hardware, software, networks, security policies, etc.
Grids support virtual organizations: a collaboration of users who pool
resources (servers, storage, databases) and share them
Grid software is concerned with managing sharing across
administrative domains
each part potentially under a different administrative domain,
hardware/software/network
Key issue: sharing resources across organizations
much pain goes into standards and interfaces


Grid Computing Systems

Figure: A layered architecture for grid computing systems



A Proposed Software Architecture

Fabric Layer
interfaces to local resources
Connectivity Layer
protocols to support usage of
multiple resources for a single
application; e.g., access a
remote resource or transfer
data between sites
Resource Layer
manages a single resource


A Proposed Software Architecture

Collective Layer
services for resource discovery,
resource allocation, resource
scheduling, etc.
Interacts with the connectivity
and resource layers
Application layer
applications within a virtual
organization (V.O.) which share
the grid computing resources.


OGSA: A Grid Architecture


Open Grid Services Architecture
a service-oriented architecture
sites that offer resources to share do so by
offering specific Web services.
The architecture of the OGSA model is more
complex than the previous layered model.


Other Grid Resources


The Globus Alliance
a community of organizations and individuals developing
fundamental technologies behind the Grid, which lets people
share computing power, databases, instruments, and other online tools securely across corporate, institutional, and geographic
boundaries without sacrificing local autonomy

Grid Computing Info Centre


aims to promote the development and advancement of
technologies that provide seamless and scalable access to wide-area distributed resources


Distributed Information Systems


Business-oriented
Systems to make a number of separate network
applications interoperable and build enterprise-wide information systems.
Two types are discussed here:
Transaction Processing Systems
Enterprise Application Integration


Transaction Processing Systems


Provide a highly structured client-server approach for
database applications
Transactions obey the ACID properties:
Atomic: all or nothing at all
Consistent: invariants are preserved (if consistent before, consistent after)
Isolated: concurrent transactions don't interfere with each other
Durable: committed operations can't be undone
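
The atomicity property can be sketched in a few lines. This is a minimal illustration, assuming Python's built-in sqlite3 module and a hypothetical `account` table with a non-negative balance rule; none of these names come from the lecture. The credit is applied first on purpose, so that a failing debit shows the whole transaction being rolled back.

```python
import sqlite3


def make_bank():
    """Create a hypothetical in-memory bank with two accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst: either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint if src lacks funds.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))


if __name__ == "__main__":
    bank = make_bank()
    print(transfer(bank, "alice", "bob", 30), balances(bank))
    print(transfer(bank, "alice", "bob", 500), balances(bank))
```

In the second transfer the credit to `bob` has already been applied when the debit fails, yet the rollback leaves both balances unchanged (all or nothing).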

Enterprise Application Integration


Supports a less-structured approach (as
compared to transaction-based systems)
Application components are allowed to
communicate directly
Communication mechanisms to support this
include
Remote Procedure Call (RPC)
Remote Method Invocation (RMI)


Enterprise Application Integration

Middleware as a communication facilitator in enterprise application integration



Distributed Pervasive Systems


The first two types of systems are characterized
by their stability: nodes and network connections
are more or less fixed
This type of system is likely to incorporate small,
battery-powered, mobile devices
Home systems
Electronic health care systems: patient monitoring
Sensor networks: data collection, surveillance


Electronic Health Care Systems

Monitoring a person in a pervasive electronic health care system, using (a) a local
hub or (b) a continuous wireless connection.


Sensor Networks

Organizing a sensor network database, while storing and processing data only at the
operator's site

Sensor Networks

Organizing a sensor network database, while storing and processing data only at the
sensors.

Distributed Systems vs. Parallel Systems

DS often refers to a system that is to be used by multiple (distributed) users.
e.g. e-commerce or business applications
generally refers to a cooperative work environment
Security is much more of a concern
Locking the system away is not an option, for example, in
the design of a distributed database for
e-commerce. By its very nature, this
system must be accessible to the real
world, and as a consequence must be
designed with security in mind.

PS often has the connotation of a system that is designed to have only a
single user or user process
e.g. scientific applications
typically refers to an environment
designed to provide the maximum
parallelization and speed-up for a
single task
If the only goal of a supercomputer is
to rapidly solve a complex task, it can
be locked in a secure facility,
physically and logically inaccessible: security problem solved.


Distributed System Challenges

Resource Accessibility
Security
Concurrency
Heterogeneity
Transparency
Openness
Scalability
Reliability
Lack of Global Clock and Global State


Resource Accessibility
Support user access to remote resources (printers, data
files, web pages, CPU cycles) and the fair sharing of the
resources
making it convenient to share resources


Security
Sharing, as always, introduces security issues
Confidentiality
avoiding the disclosure of the content of a message to a party
distinct from the intended receiver

Integrity
avoiding the corruption of the transmitted contents by a third
party

Availability
the capability of providing a service in all circumstances


Concurrency
Resources can be shared by clients in a
distributed system, therefore several clients may
access a shared resource at the same time
It is not acceptable to process each request
in turn; the system must be able to process requests
concurrently
For each object that represents a shared
resource, its operations must be synchronized in
such a way that its data remains consistent
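
The synchronization requirement above can be sketched as follows. This is a minimal illustration using Python threads; `SharedCounter` and `run_clients` are hypothetical names standing in for a shared resource and its concurrent clients, not something defined in the lecture.

```python
import threading


class SharedCounter:
    """Hypothetical shared resource: a counter updated by many clients."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock turns the read-modify-write into one critical section,
        # so concurrent clients cannot overwrite each other's updates.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_clients(counter, n_clients=8, requests_per_client=1000):
    """Let several client threads issue requests concurrently."""

    def client():
        for _ in range(requests_per_client):
            counter.increment()

    threads = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_clients(SharedCounter()))
```

Requests are served concurrently rather than one client at a time, while the lock keeps the shared data consistent.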

Heterogeneity - I

Heterogeneity (variety and difference) applies to:

Networks: differences are masked by the fact that all of the computers use the Internet
protocols to communicate.

Hardware: data types, such as integers, may be represented in different ways on different
sorts of hardware (byte ordering: big-endian, little-endian)

Operating Systems: do not provide the same application API to the Internet protocols.

Programming languages: use different representations for characters and data structures,
such as arrays and records.

Developers: representation of primitive data items and data structures needs to be agreed
upon (standards)

Middleware

Software layer that abstracts from the above providing a uniform computational model

All middleware deals with the differences in operating systems and hardware.
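
The byte-ordering point can be made concrete with a small sketch using Python's standard struct module. The helpers `to_wire` and `from_wire` are hypothetical names for illustration; they show the kind of agreed-upon representation (here, network byte order) that middleware typically applies on the application's behalf.

```python
import struct


def to_wire(value: int) -> bytes:
    """Encode an unsigned 32-bit integer in network (big-endian) byte order."""
    return struct.pack("!I", value)


def from_wire(data: bytes) -> int:
    """Decode an unsigned 32-bit integer that was sent in network byte order."""
    return struct.unpack("!I", data)[0]


value = 0x12345678
big_endian = struct.pack(">I", value)     # bytes 12 34 56 78
little_endian = struct.pack("<I", value)  # bytes 78 56 34 12

# Reading big-endian bytes as if they were little-endian gives a different number.
misread = struct.unpack("<I", big_endian)[0]

if __name__ == "__main__":
    print(big_endian.hex(), little_endian.hex(), hex(misread))
    print(from_wire(to_wire(value)) == value)
```

Both sides agreeing on one wire format is what lets machines with different native byte orders exchange the same integer.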


Heterogeneity - II
Mobile Code
Code that can be sent from one computer to another and run
at the destination (e.g. Java applets).
Machine code suitable for running on one type of computer
hardware is not suitable for running on another.

Virtual Machines Approach


provides a way of making code executable on any hardware: the
compiler for a particular language generates code for a virtual
machine instead of a particular hardware.


Transparency
A distributed system that appears to its users &
applications to be a single computer system is
said to be transparent.
Users & applications should be able to access
remote resources in the same way they
access local resources.
Aims to conceal the component-based structure
of the system, and facilitate a perception of the
system as a whole

Transparency Classes (1)

Access Transparency
Hides differences in data representation, different architectures and filename conventions of machines
Enables interoperability

Location Transparency
Hides location of resource i.e. the user can use the resource without
being aware of its location
The key is naming
E.g. URLs, email, etc.
(Access + Location) Transparency = Network Transparency


Transparency Classes (2)

Migration Transparency
Hides from the user that the resource being used has moved to another
location

Relocation Transparency
Hides from the user that the resource being used is being moved
Enables mobile computing

Persistence Transparency
Hides whether a resource is in memory or on disk


Transparency Classes (3)

Replication Transparency

Hides that multiple copies of the resource exist (for reliability and/or availability)

Concurrency Transparency

Hides that the resource may be shared concurrently

Failure Transparency

Hides failure and (possible) recovery of the resource

Email is eventually delivered, even when servers or communication links fail.

Scaling Transparency

Allows system and applications to expand without need to change structure or application
algorithms

Performance Transparency

Adaptation of the system to varying load situations without the user noticing it


Degrees of Transparency

Performance
e.g. multiple attempts to contact a remote server can slow down the
system: should you report failure and let the user cancel the request?

Convenience
e.g. direct the print request to my local printer, not one on the next floor

Too much emphasis on transparency may prevent the user from
understanding system behavior

Transparency is sometimes against application goals, e.g. pervasive
computing and location awareness


Openness - I
Services should follow agreed-upon rules on component
syntax & semantics for interoperability and portability
Using interfaces, any process that needs a service
should be able to communicate with a process that
provides the service.
Multiple implementations of the same service may be
provided, as long as the interface is maintained
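
As a sketch of the last point, the following assumes a hypothetical `FileService` interface with two interchangeable implementations. All names here are illustrative and not taken from any real standard. The client code depends only on the interface, so either implementation can serve it.

```python
from abc import ABC, abstractmethod


class FileService(ABC):
    """Hypothetical agreed-upon interface (syntax and semantics) of a service."""

    @abstractmethod
    def write(self, name: str, data: bytes) -> None:
        """Store data under name, replacing any earlier contents."""

    @abstractmethod
    def read(self, name: str) -> bytes:
        """Return the most recently written contents of name."""


class DictFileService(FileService):
    """Implementation 1: keeps only the latest contents of each file."""

    def __init__(self):
        self._files = {}

    def write(self, name, data):
        self._files[name] = data

    def read(self, name):
        return self._files[name]


class LogFileService(FileService):
    """Implementation 2: append-only log; the latest entry wins on read."""

    def __init__(self):
        self._log = []

    def write(self, name, data):
        self._log.append((name, data))

    def read(self, name):
        for entry_name, data in reversed(self._log):
            if entry_name == name:
                return data
        raise KeyError(name)


def client(service: FileService) -> bytes:
    """A client written against the interface only."""
    service.write("notes.txt", b"v1")
    service.write("notes.txt", b"v2")
    return service.read("notes.txt")


if __name__ == "__main__":
    print(client(DictFileService()), client(LogFileService()))
```

Swapping one implementation for the other requires no change to `client`, which is the property the interface is meant to guarantee.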


Openness - II

Interoperability
The ability of two different systems or applications to work together by relying on
each other's services as specified by a common standard

Portability
The ability of an application designed to run on distributed system A to run on
distributed system B which implements the same interface, without modification

Extensibility
If a distributed system is open (implements standard interfaces) it should be
possible to add and delete components without affecting the system as a whole.
e.g., replace the file system


Scalability - I

A system is scalable if it remains effective when there is a significant
increase in the number of resources and the number of users

The design of scalable distributed systems poses the following
challenges

Controlling Cost of Physical Resources
For a system with n users to be scalable, the quantity of physical resources required to
support them should be at most O(n), that is, proportional to n. E.g., if a single file
server can support 20 users, then two such servers should be able to support 40 users.

Controlling Performance Loss
Maximum performance loss should be no worse than O(log n), where n is the size of the data.

Preventing Software Resources Running Out
E.g. IP addresses: 32 bits in IPv4, extended to 128 bits in IPv6
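
The two growth bounds can be illustrated with simple arithmetic. This is a sketch under stated assumptions: the 20-users-per-server figure is the slide's example, while the hierarchical lookup with a fixed fan-out (and the function names) are hypothetical illustrations of a cost that grows as O(log n).

```python
def servers_needed(users, users_per_server=20):
    """O(n) resource cost: servers grow in proportion to the number of users."""
    return -(-users // users_per_server)  # ceiling division


def lookup_levels(entries, fanout=10):
    """O(log n) cost: levels of a hierarchy with the given fan-out
    needed to index `entries` items (assumed structure, for illustration)."""
    levels, capacity = 1, fanout
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for users in (20, 40, 400):
        print(users, "users ->", servers_needed(users), "servers")
    for entries in (1_000, 1_000_000):
        print(entries, "entries ->", lookup_levels(entries), "levels")
```

Growing the data a thousandfold, from a thousand to a million entries, only doubles the number of levels, which is why logarithmic performance loss is considered acceptable.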


Scalability - II
With respect to size
With respect to geographical distribution
With respect to the number of administrative
organizations it spans
Most systems account only, to a certain extent, for
size scalability.
Today, the challenge lies in geographical and
administrative scalability.

Size Scalability
The more users and resources a system has, the harder
it is to support a centralized model.
Scalability is affected when the system is based on
Centralized server
one for all users
Centralized data
a single database for all users
Centralized algorithms
e.g. for routing: one site collects all information,
processes it, distributes the results to all sites

Size Scalability
A single centralized server, running on a single machine,
can saturate if the workload becomes too heavy.
Communication links around the server can limit
performance, as well
Centralized data storage is impractical for large databases

If the Internet's Domain Name System (DNS) consisted of a
single table, it would be virtually impossible to resolve
a host name in reasonable time

Size Scalability
Centralized algorithms rely on a central coordinator that
collects data from all sites in the network and then
makes decisions.
Complete knowledge: good
Time and network traffic: bad

Wherever possible, distributed algorithms are desirable.


Size Scalability
Decentralized or Distributed Algorithms
No machine has complete information about the
system state
Machines make decisions based only on local
information
Failure of a single machine doesn't ruin the algorithm
There is no assumption that a global clock exists.


Geographic Scalability
Early distributed systems ran on LANs; relied on
synchronous communication
the requesting client blocks until it gets a response,
which makes it hard to scale


Administrative Scalability
Different domains may have different
policies about resource usage,
management, security, etc.
Trust often stops at administrative
boundaries


Scaling Techniques
Scalability affects performance more than
anything else.
Three techniques to improve scalability:
Hiding Communication Latencies
Distribution
Replication


Scalability: Amazon.com

Werner Vogels' talk "Order in the Chaos: Building the Amazon.com
Platform"

1995: Started out with a single web service on a single server

Today Amazon has about 150 web services on its homepage alone.

1 million merchant partners; 60 million customers

1999: A misstep during this exponential growth period was moving
from distributed servers to a mainframe.
It failed to meet scalability, reliability and performance needs; it was scrapped
in 2000.


Hiding Communication Delays

Key for geographic scalability


Structure applications to use asynchronous communication (no
blocking for replies)
While waiting for one answer, do something else; create one
thread to wait for the reply and let other threads continue to
process or schedule another task
Download part of the computation to the requesting platform to
speed up processing
E.g. filling in forms to access a DB:
either send a separate message for each field, or
download the form/code and submit the finished version. JavaScript and
Java applets support the latter approach.
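
The asynchronous style described above can be sketched with Python's standard concurrent.futures module. This is a minimal illustration: `remote_call` is a hypothetical stand-in that merely sleeps to simulate a network round trip, and the 0.05 s delay is an assumed value.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def remote_call(x, delay=0.05):
    """Hypothetical remote request: sleeps to simulate network latency."""
    time.sleep(delay)
    return x * 2


def fetch_blocking(items):
    """Synchronous style: each request blocks until its reply arrives."""
    return [remote_call(i) for i in items]


def fetch_async(items):
    """Asynchronous style: issue all requests, do local work, then collect."""
    with ThreadPoolExecutor(max_workers=max(1, len(items))) as pool:
        futures = [pool.submit(remote_call, i) for i in items]
        local_work = sum(items)  # useful work done while replies are pending
        replies = [f.result() for f in futures]
    return replies, local_work


if __name__ == "__main__":
    items = [1, 2, 3, 4, 5]
    start = time.perf_counter()
    fetch_blocking(items)
    print("blocking:", round(time.perf_counter() - start, 2), "s")
    start = time.perf_counter()
    fetch_async(items)
    print("async:   ", round(time.perf_counter() - start, 2), "s")
```

Both functions return the same replies. In the asynchronous version the waiting times overlap instead of adding up, so latency is hidden rather than removed.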


Distribution
Instead of one centralized service, divide into
parts and distribute them geographically
Example: DNS namespace is organized as a
tree of domains; each domain is divided into
zones; names in each zone are handled by a
different name server
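
The zone idea can be sketched as a toy resolver. Everything in this example is hypothetical: the server names, the zone contents, and the addresses (taken from the 192.0.2.0/24 documentation range). It models only the delegation between zones, not the real DNS protocol.

```python
# Hypothetical name servers: each holds the records of its own zone and
# delegates sub-domains it is not responsible for to another server.
SERVERS = {
    "root": {
        "records": {},
        "delegations": {"edu": "edu-server", "com": "com-server"},
    },
    "edu-server": {
        "records": {},
        "delegations": {"example.edu": "example-edu-server"},
    },
    "example-edu-server": {
        "records": {
            "www.example.edu": "192.0.2.10",
            "cs.example.edu": "192.0.2.20",
        },
        "delegations": {},
    },
    "com-server": {
        "records": {"www.example.com": "192.0.2.30"},
        "delegations": {},
    },
}


def resolve(name, servers=SERVERS, start="root"):
    """Follow delegations from the root until a server knows the name.

    Returns (address or None, list of servers visited).
    """
    server, visited = start, []
    while True:
        visited.append(server)
        zone = servers[server]
        if name in zone["records"]:
            return zone["records"][name], visited
        # Find delegated domains that the name belongs to.
        matches = [
            domain
            for domain in zone["delegations"]
            if name == domain or name.endswith("." + domain)
        ]
        if not matches:
            return None, visited
        # Prefer the most specific (longest) matching domain.
        server = zone["delegations"][max(matches, key=len)]


if __name__ == "__main__":
    print(resolve("www.example.edu"))
    print(resolve("www.example.com"))
```

No single server holds the whole name space: each lookup touches only the servers on the path from the root to the zone that owns the name.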


Distribution

An example of dividing the DNS name space into zones



Replication
Replication: multiple identical copies of
something
Replication
Increases availability
Improves performance through load balancing
May avoid latency by improving proximity of
resource


Replication - Caching
Caching is a form of replication
Normally creates a (temporary) replica of
something closer to the user
User decides to cache, system decides to
replicate
Replication is more permanent
Both lead to consistency problems


Replication - Caching
Having multiple copies (cached or replicated), leads to
inconsistencies:
modifying one copy makes that copy different from the rest.

Always keeping copies consistent and in a general way
requires global synchronization on each modification.
Global synchronization precludes large-scale solutions.

If we can tolerate inconsistencies, we may reduce the
need for global synchronization.
Tolerating inconsistencies is application dependent.
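
The inconsistency problem can be shown with a toy cache. This is a minimal sketch with hypothetical `Origin` and `CachingClient` classes. The explicit `invalidate` call stands in for whatever consistency mechanism an application chooses, and is an assumption of this example.

```python
class Origin:
    """Hypothetical authoritative copy of the data."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data[key]


class CachingClient:
    """Keeps a local (temporary) replica of values it has read."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def read(self, key):
        # Serve from the local copy when possible; contact the origin otherwise.
        if key not in self.cache:
            self.cache[key] = self.origin.read(key)
        return self.cache[key]

    def invalidate(self, key):
        # Drop the local copy so the next read fetches a fresh value.
        self.cache.pop(key, None)


if __name__ == "__main__":
    origin = Origin()
    client = CachingClient(origin)
    origin.write("price", 10)
    print(client.read("price"))  # fetched from the origin and cached
    origin.write("price", 12)
    print(client.read("price"))  # stale cached value
    client.invalidate("price")
    print(client.read("price"))  # fresh value again
```

Whether the stale read in the middle is acceptable depends on the application, which is the trade-off the slide describes.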

Reliability: Failure Handling

Techniques
Failure Detection
e.g. message checksums

Failure Masking
making a detected failure hidden or less severe
e.g. email retransmission

Tolerating Failures
e.g. Web pages (informing users about the failure)

Failure Recovery
e.g. permanent data rolled back

Redundancy (use of redundant components)
Duplication in routes and hardware
DNS: every name table is replicated in at least two different servers
Databases: replicated in several servers
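
Failure detection and masking can be combined in a short sketch. It assumes a CRC-32 checksum from Python's standard zlib module and a hypothetical channel that corrupts the first few frames. The frame layout and function names are illustrative, not a real protocol.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Failure detection: return the payload, or None if the checksum fails."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def make_flaky_channel(failures):
    """Hypothetical channel that flips one bit in the first `failures` frames."""
    remaining = [failures]

    def channel(frame: bytes) -> bytes:
        if remaining[0] > 0:
            remaining[0] -= 1
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame

    return channel


def send_with_retry(payload, channel, max_attempts=3):
    """Failure masking: retransmit until a frame arrives intact."""
    for attempt in range(1, max_attempts + 1):
        result = check_frame(channel(make_frame(payload)))
        if result is not None:
            return result, attempt
    raise IOError("message could not be delivered")


if __name__ == "__main__":
    print(send_with_retry(b"hello", make_flaky_channel(1)))
```

The receiver detects the corrupted frame through the checksum, and the retransmission hides the failure from the caller, who only sees a slightly slower delivery.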

Reliability: Failure Handling


Availability
Measure of the proportion of time a system is
available for use.
Distributed systems can provide a high degree of availability
in the face of hardware faults.
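
As a worked illustration of "proportion of time available", the sketch below computes availability from uptime and downtime. It also shows how replication can raise it, under the loudly stated assumption that replicas fail independently and any one replica suffices. That assumption and the function names are not from the lecture.

```python
def availability(uptime_hours, downtime_hours):
    """Proportion of time the system is available for use."""
    return uptime_hours / (uptime_hours + downtime_hours)


def combined_availability(single, replicas):
    """Availability of `replicas` copies when any one copy is enough.

    Assumes replicas fail independently (an idealization).
    """
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    one = availability(99, 1)
    print(one)
    print(combined_availability(one, 2))
```

With one machine that is up 99 hours out of every 100, availability is 0.99. Two such independent replicas give 1 - 0.01^2 = 0.9999.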


Lack of Global Clock & State


There are limits on the precision with which
processes in a distributed system can
synchronize their clocks
There is no single process in the distributed
system that has knowledge of the
current global state of the system


Fallacies of Distributed Computing


Source: Peter Deutsch (The following false assumptions
add to the challenges)

The network is reliable


Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous


Summary

Distributed computing brings transparent access to as much computer
power and data as the user needs to accomplish any given task and, at the
same time, achieves high performance and reliability objectives

Despite failures, uncertainty, and lack of specialized hardware support,
we can build and effectively use systems that are an order of magnitude
more powerful. In fact, we can do this while providing a more available, more
robust, more convenient solution.

Middleware is a key facility for building distributed systems

It's difficult to design a good distributed system: there are a lot of problems
in getting good characteristics, not the least of which is people.
