
GRID COMPUTING
Faisal N. Abu-Khzam & Michael A. Langston

University of Tennessee
1
Outline
Hour 1: Introduction
Break
Hour 2: Using the Grid
Break
Hour 3: Ongoing Research
Q&A Session
2
Hour 1: Introduction
What is Grid Computing?
Who Needs It?
An Illustrative Example
Grid Users
Current Grids
3
What is Grid Computing?
Computational Grids
– Homogeneous (e.g., Clusters)
– Heterogeneous (e.g., with one-of-a-kind
instruments)
Cousins of Grid Computing
Methods of Grid Computing
4
Computational Grids
A network of geographically distributed
resources including computers,
peripherals, switches, instruments, and
data.
Each user should have a single login
account to access all resources.
Resources may be owned by diverse
organizations.
5
Computational Grids
Grids are typically managed by
gridware.
Gridware can be viewed as a special
type of middleware that enables sharing
and manages grid components based on
user requirements and resource
attributes (e.g., capacity, performance,
availability).
6
Cousins of Grid Computing
Parallel Computing
Distributed Computing
Peer-to-Peer Computing
Many others: Cluster Computing,
Network Computing, Client/Server
Computing, Internet Computing, etc...
7
Distributed Computing
People often ask: Is Grid Computing a
fancy new name for the concept of
distributed computing?
In general, the answer is “no.”
Distributed Computing is most often
concerned with distributing the load of a
program across two or more processes.
8
Peer-to-Peer Computing
Sharing of computer resources and
services by direct exchange between
systems.
Computers can act as clients or servers
depending on what role is most efficient
for the network.
9
Methods of Grid Computing
Distributed Supercomputing
High-Throughput Computing
On-Demand Computing
Data-Intensive Computing
Collaborative Computing
Logistical Networking
10
Distributed Supercomputing
Combining multiple high-capacity
resources on a computational grid into a
single, virtual distributed
supercomputer.
Tackle problems that cannot be solved
on a single system.
11
High-Throughput Computing
Uses the grid to schedule large numbers
of loosely coupled or independent tasks,
with the goal of putting unused
processor cycles to work.
12
On-Demand Computing
Uses grid capabilities to meet short-term
requirements for resources that are not
locally accessible.
Models real-time computing demands.
13
Data-Intensive Computing
The focus is on synthesizing new
information from data that is maintained
in geographically distributed
repositories, digital libraries, and
databases.
Particularly useful for distributed data
mining.
14
Collaborative Computing
Concerned primarily with enabling and
enhancing human-to-human
interactions.
Applications are often structured in
terms of a virtual shared space.
15
Logistical Networking
Global scheduling and optimization of data
movement.
Contrasts with traditional networking, which
does not explicitly model storage resources
in the network.
Called "logistical" because of the analogy it
bears with the systems of warehouses,
depots, and distribution channels.
16
Who Needs Grid Computing?
A chemist may utilize hundreds of
processors to screen thousands of
compounds per hour.
Teams of engineers worldwide pool
resources to analyze terabytes of
structural data.
Meteorologists seek to visualize and
analyze petabytes of climate data with
enormous computational demands.
17
An Illustrative Example
Tiffany Moisan, a NASA research
scientist, collected microbiological
samples in the tidewaters around
Wallops Island, Virginia.
She needed the high-performance
microscope located at the National
Center for Microscopy and Imaging
Research (NCMIR), University of
California, San Diego.
18
Example (continued)
She sent the samples to San Diego and
used NPACI’s Telescience Grid and
NASA’s Information Power Grid (IPG)
to view and control the output of the
microscope from her desk on Wallops
Island. Thus, in addition to viewing the
samples, she could move the platform
holding them and make adjustments to
the microscope.
19
Example (continued)
The microscope produced a huge
dataset of images.
This dataset was stored using a storage
resource broker on NASA’s IPG.
Moisan was able to run algorithms on
this very dataset while watching the
results in real time.
20
Grid Users
Grid developers
Tool developers
Application developers
End Users
System Administrators
21
Grid Developers
Very small group.
Implementers of a grid “protocol” who
provide the basic services required to
construct a grid.
22
Tool Developers
Implement the programming models
used by application developers.
Implement basic services similar to
conventional computing services:
– User authentication/authorization
– Process management
– Data access and communication
23
Tool Developers
Also implement new (grid) services
such as:
– Resource locations
– Fault detection
– Security
– Electronic payment
24
Application Developers
Construct grid-enabled applications for
end-users who should be able to use
these applications without concern for
the underlying grid.
Provide programming models that are
appropriate for grid environments and
services that programmers can rely on
when developing (higher-level)
applications.
25
System Administrators
Balance local and global concerns.
Manage grid components and
infrastructure.
Some tasks still not well delineated due
to the high degree of sharing required.
26
Some Highly-Visible Grids
The NSF PACI/NCSA Alliance Grid.
The NSF PACI/SDSC NPACI Grid.
The NASA Information Power Grid
(IPG).
The Distributed Terascale Facility
(DTF) Project.
27
DTF
Currently being built by NSF’s
Partnerships for Advanced
Computational Infrastructure (PACI).
A collaboration: NCSA, SDSC,
Argonne, and Caltech will work in
conjunction with IBM, Intel, Qwest
Communications, Myricom, Sun
Microsystems, and Oracle.
28
DTF Expectations
A 40-billion-bits-per-second (40 Gbps)
optical network (called TeraGrid) is to
link computers, visualization systems,
and data at four sites.
Performs 11.6 trillion calculations per
second.
Stores more than 450 trillion bytes of
data.
29
GRID COMPUTING
BREAK
30
Hour 2: Using the Grid
Globus
Condor
Harness
Legion
IBP
NetSolve
Others
31
Globus
A collaboration of Argonne National
Laboratory’s Mathematics and
Computer Science Division, the
University of Southern California’s
Information Sciences Institute, and the
University of Chicago's Distributed
Systems Laboratory.
Started in 1996 and is gaining
popularity year after year.
32
Globus
A project to develop the underlying
technologies needed for the
construction of computational grids.
Focuses on execution environments
for integrating widely-distributed
computational platforms, data
resources, displays, special
instruments and so forth.
33
The Globus Toolkit
The Globus Resource Allocation
Manager (GRAM)
– Creates, monitors, and manages services.
– Maps requests to local schedulers and
computers.
The Grid Security Infrastructure (GSI)
– Provides authentication services.
34
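To make GRAM's role concrete, here is roughly how a job reaches it through the Globus Toolkit 2 command-line tools (a minimal sketch; the resource contact gridnode.example.edu is hypothetical):

  globus-job-run gridnode.example.edu /bin/date
  globusrun -o -r gridnode.example.edu "&(executable=/bin/date)(count=1)"

The first form is a convenience front end for the second: globusrun hands GRAM a request written in the Resource Specification Language (RSL), and GRAM maps it onto whatever local scheduler (e.g., PBS, LSF, or Condor) manages the target machine.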
The Globus Toolkit
The Monitoring and Discovery Service
(MDS)
– Provides information about system status,
including server configurations, network
status, and the locations of replicated
datasets.
Nexus and globus_io
– Provide communication services for
heterogeneous environments.
35
The Globus Toolkit
Global Access to Secondary Storage
(GASS)
– Provides data movement and access
mechanisms that enable remote programs
to manipulate local data.
Heartbeat Monitor (HBM)
– Used by both system administrators and
ordinary users to detect failure of system
components or processes.
36
Condor
The Condor project started in 1988 at
the University of Wisconsin-Madison.
The main goal is to develop tools to
support High Throughput Computing on
large collections of distributively owned
computing resources.
37
Condor
Runs on a cluster of workstations to glean
wasted CPU cycles.
A “Condor pool” consists of any number of
machines, of possibly different architectures
and operating systems, that are connected by
a network.
Condor pools can share resources through a
Condor feature called flocking.
38
The Condor Pool Software
Job management services:
– Supports queries about the job queue.
– Puts a job on hold.
– Enables the submission of new jobs.
– Provides information about jobs that are
already finished.
A machine with job management
installed is called a submit machine.
39
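To illustrate the job-management services (a hedged sketch; the executable and file names are hypothetical), work is described in a submit description file and handed to a submit machine with condor_submit:

  # sweep.sub -- hypothetical submit description file
  universe   = vanilla
  executable = analyze
  arguments  = input.$(Process)
  output     = out.$(Process)
  error      = err.$(Process)
  log        = sweep.log
  queue 100

  condor_submit sweep.sub
  condor_q

condor_q then queries the job queue, and condor_hold / condor_release correspond to the "puts a job on hold" service above.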
The Condor Pool Software
Resource management:
– Keeps track of available machines.
– Performs resource allocation and
scheduling.
Machines with resource management
installed are called execute machines.
A machine can be both a “submit” and an
“execute” machine simultaneously.
40
Condor-G
A version of Condor that uses Globus to
submit jobs to remote resources.
Allows users to monitor jobs submitted
through the Globus toolkit.
Can be installed on a single machine, so
there is no need to have a Condor pool
installed.
41
Legion
An object-based metasystems software
project designed at the University of
Virginia to support millions of hosts and
trillions of objects tied together by
high-speed links.
Allows groups of users to construct
shared virtual workspaces in which to
collaborate on research and exchange
information.
42
Legion
An open system designed to
encourage third party development
of new or updated applications, run-
time library implementations, and
core components.
The key feature of Legion is its
object-oriented approach.
43
Harness
A Heterogeneous Adaptable
Reconfigurable Networked System
A collaboration between Oak Ridge
National Lab, the University of
Tennessee, and Emory University.
Conceived as a natural successor of the
PVM project.
44
Harness
An experimental system based on a
highly customizable, distributed virtual
machine (DVM) that can run on
anything from a supercomputer to a
PDA.
Built on three key areas of research:
Parallel Plug-in Interface, Distributed
Peer-to-Peer Control, and Multiple
DVM Collaboration.
45
IBP
The Internet Backplane Protocol (IBP)
is a middleware for managing and using
remote storage.
It was devised at the University of
Tennessee to support Logistical
Networking in large scale, distributed
systems and applications.
46
IBP
Named because it was designed to
enable applications to treat the Internet
as if it were a processor backplane.
On a processor backplane, the user has
access to memory and peripherals, and
can direct communication between them
with DMA.
47
IBP
IBP gives the user access to remote
storage and standard Internet
resources (e.g. content servers
implemented with standard sockets)
and can direct communication
between them with the IBP API.
48
IBP
By providing a uniform, application-
independent interface to storage in the
network, IBP makes it possible for
applications of all kinds to use logistical
networking to exploit data locality and
more effectively manage buffer
resources.
49
NetSolve
A client-server-agent model.
Designed for solving complex scientific
problems in a loosely-coupled
heterogeneous environment.
50
The NetSolve Agent
A “resource broker” that serves as the
gateway to the NetSolve system.
Maintains an index of the available
computational resources and their
characteristics, in addition to usage
statistics.
51
The NetSolve Agent
Accepts requests for computational
services from the client API and
dispatches them to the best-suited
server.
Runs on Linux and UNIX.
52
The NetSolve Client
Provides access to remote resources
through simple and intuitive APIs.
Runs on a user’s local system.
Contacts the NetSolve system through
the agent, which in turn returns the
server that can best service the request.
Runs on Linux, UNIX, and Windows.
53
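For a feel of the client-side API (a hedged sketch: the problem name and argument list are hypothetical and must match a PDF registered on a server; consult the NetSolve documentation for the exact calling sequence), a blocking request from C looks roughly like this:

  #include <stdio.h>
  #include "netsolve.h"            /* NetSolve C client interface */

  int main(void)
  {
      double a[100], b[10], x[10]; /* hypothetical problem data */
      int n = 10, status;

      /* Blocking call on a hypothetical problem named "linear_solve()":
         the agent picks a server exporting that problem, the inputs are
         shipped out, and the outputs come back in place.  The actual
         argument order is dictated by the problem's PDF.
         netslnb() is the non-blocking variant. */
      status = netsl("linear_solve()", n, a, b, x);
      if (status < 0)
          netslerr(status);        /* print the NetSolve error message */
      return 0;
  }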
The NetSolve Server
The computational backbone of the
system.
A daemon process that awaits client
requests.
Runs on different platforms: a single
workstation, cluster of workstations,
symmetric multiprocessors (SMPs), or
massively parallel processors (MPPs).
54
The NetSolve Server
A key component of the server is the
Problem Description File (PDF).
With the PDF, routines local to a given
server are made available to clients
throughout the NetSolve system.
55
The PDF Template
PROBLEM Program Name
LIB Supporting Library Information
INPUT specifications
OUTPUT specifications
CODE
56
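Filled in, a PDF for a hypothetical server-local routine would follow the template on the previous slide along these lines (illustrative only; the literal keyword syntax differs across NetSolve versions):

  PROBLEM linear_solve
  LIB     /usr/local/lib/libsolve.a
  INPUT   matrix A, vector b
  OUTPUT  vector x
  CODE    wrapper code that calls the (hypothetical) solve(A, b, x) routine

Once such a PDF is installed on a server, the routine it describes becomes callable by any NetSolve client through the agent.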
Network Weather Service
Supports grid technologies.
Uses sensor processes to monitor CPU
load and network traffic.
Uses statistical models on the collected
data to generate a forecast of future
behavior.
NetSolve is currently integrating NWS
into its agent.
57
Gridware Collaborations
NetSolve is using Globus' "Heartbeat
Monitor" to detect failed servers.
A NetSolve client that allows access to
Globus is now in testing.
Legion has adopted NetSolve’s client-user
interface to leverage its metacomputing
resources.
The NetSolve client uses Legion’s data-flow
graphs to keep track of data dependencies.
58
Gridware Collaborations
NetSolve can access Condor pools
among its computational resources.
IBP-enabled clients and servers allow
NetSolve to allocate and schedule
storage resources as part of its resource
brokering. This improves fault
tolerance.
59
GRID COMPUTING
BREAK
60
Hour 3: Ongoing Research
Motivation.
Special Projects.
– Ongoing work at Tennessee
General Issues.
– Open questions of interest to the entire
research community
61
Motivation
Computer speed doubles every 18 months.
Network speed doubles every 9 months.
Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.
62
Special Projects
The SInRG Project.
– Grid Service Clusters (GSCs)
– Data Switches
Incorporating Hardware Acceleration.
Unbridled Parallelism
– SETI@home and Folding@home
– The Vertex Cover Solver
Security.
63
The SInRG Project
64
The Grid Service Cluster
The basic grid building block.
Each GSC will use the same software
infrastructure as is now being deployed
on the national Grid, but tuned to take
advantage of the highly structured and
controlled design of the cluster.
Some GSCs are general-purpose and
some are special-purpose.
65
The Grid Service Cluster
66
An advanced data switch
The components that make up a
GSC must be able to access each
other at very high speeds and with
guaranteed Quality of Service
(QoS).
Links of at least 1 Gbps assure QoS
in many circumstances simply by
overprovisioning.
67
Computational Ecology GSC
Collaboration between computer
science and mathematical ecology.
8-processor Symmetric Multi-
Processor (SMP).
Initial in-core memory (RAM) is
approximately 4 gigabytes.
Out-of-core data storage unit provides
a minimum of 450 gigabytes.
68
Medical Imaging GSC
Collaboration between computer
science and the medical school.
High-end graphics workstations.
Distinguished by the need to have
these workstations attached as directly
as possible to the switch to facilitate
interactive manipulation of the
reconstructed images.
69
Molecular Design GSC
Collaboration between computer
science and chemical engineering.
Data visualization laboratory
32 dual processors
High performance switch
70
Machine Design GSC
Collaboration between computer
science and electrical engineering.
12 Unix-based CAD workstations.
8 Linux boxes with Pilchard boards.
Investigating the potential of
reconfigurable computing in grid
environments.
71
Machine Design GSC
72
Types of Hardware
General purpose hardware – can
implement any function
ASICs – hardware that can implement
only a specific application
FPGAs – reconfigurable hardware that
can implement any function
73
The FPGA
FPGAs offer reprogrammability,
allowing an optimized logic design for
each function to be implemented.
Hardware implementations offer
acceleration over software
implementations run on
general-purpose processors.
74
The Pilchard Environment
Developed at the Chinese University of
Hong Kong.
Plugs into a 133 MHz RAM DIMM slot and
is an example of “programmable active
memory.”
Pilchard is accessed through memory
read/write operations.
Higher bandwidth and lower latency than
other environments.
75
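Because Pilchard lives in a DIMM slot, the host drives it with ordinary loads and stores. A C sketch of that access pattern follows (the device node /dev/pilchard and the register offsets are hypothetical; the real driver's interface may differ):

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define PILCHARD_WINDOW 4096     /* assumed size of the mapped region */

  int main(void)
  {
      int fd = open("/dev/pilchard", O_RDWR);  /* hypothetical device node */
      if (fd < 0) { perror("open"); return 1; }

      volatile uint64_t *regs = mmap(NULL, PILCHARD_WINDOW,
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);
      if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

      regs[0] = 0x1234;            /* a store delivers an operand to the FPGA */
      uint64_t result = regs[1];   /* a load reads the result back */
      printf("FPGA returned %llx\n", (unsigned long long)result);

      munmap((void *)regs, PILCHARD_WINDOW);
      close(fd);
      return 0;
  }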
Objectives
Evaluate utility of NetSolve gridware.
Determine effectiveness of hardware
acceleration in this environment.
Provide an interface for the remote use
of FPGAs.
Allow users to experiment and gauge
whether a given problem would benefit
from hardware acceleration.
76
Sample Implementations
Fast Fourier Transform (FFT)
Data Encryption Standard algorithm
(DES)
Image backprojection algorithm
A variety of combinatorial algorithms
77
Implementation Techniques
Two types of functions are implemented:
– Software version: runs on the PC’s processor.
– Hardware version: runs on the FPGA.
To implement the hardware version of
the function, VHDL code is needed.
78
The Hardware Function
Implemented in VHDL or some other
hardware description language.
The VHDL code is then mapped onto the
FPGA (synthesis).
CAD tools help make mapping decisions
based on constraints such as chip area,
I/O pin count, routing resources and
topologies, partitioning, and
resource-usage minimization.
79
The Hardware Function
Result of synthesis is a configuration
file (bit stream).
This file defines how the FPGA is to be
reprogrammed in order to implement the
new desired functionality.
To run, a copy of the configuration file
must be loaded on the FPGA.
80
Behind the Scenes
[Diagram: roles behind a hardware-accelerated NetSolve service. The VHDL programmer writes VHDL code, which synthesis turns into a configuration file; the software programmer supplies the software and hardware functions; the server administrator installs the PDFs and libraries on the NetSolve server; client requests then flow through NetSolve to the server, and results are returned.]
81
Conclusions
Hardware acceleration is offered to both
local and remote users.
Resources are available through an
efficient and easy-to-use interface.
A development environment is provided
for devising and testing a wide variety
of software, hardware and hybrid
solutions.
82
Unbridled Parallelism
Sometimes the overhead of gridware is
unneeded. Well-known examples
include SETI@home and
Folding@home.
We’re currently building a Vertex
Cover solver with multiple levels of
acceleration.
83
A Naked SSH Approach
A bit of blasphemy: the anti-gridware
paradigm.
Our work raises several questions.
– When does it make sense?
– How much efficiency are we gaining?
– What are its limitations?
84
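In its barest form, the approach is nothing more than fanning work out over ssh and gathering output files afterward (a sketch; the host names and the vc_solver binary are hypothetical):

  # hypothetical hosts and solver binary
  ssh node01.example.edu './vc_solver part01.graph > part01.out' &
  ssh node02.example.edu './vc_solver part02.graph > part02.out' &
  wait

Everything gridware normally provides (scheduling, authentication brokering, fault recovery) is left to the user, which is exactly what the questions above are probing.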
Grid Security
Algorithm complexity theory
– Verifiability
– Concealment
Cryptography and checkpointing
– Corroboration
– Scalability
Voting and spot-checking
– Fault tolerance
– Reliability
85
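As one concrete instance of the voting idea (a generic sketch in C, not code from any particular grid system), the same work unit can be dispatched to several untrusted hosts and the returned results compared:

  #include <stdio.h>
  #include <string.h>

  /* Majority vote over three independently computed results.
     Returns the agreed result, or NULL if all three differ,
     in which case the work unit would be rescheduled. */
  static const char *majority(const char *r1, const char *r2, const char *r3)
  {
      if (strcmp(r1, r2) == 0 || strcmp(r1, r3) == 0)
          return r1;
      if (strcmp(r2, r3) == 0)
          return r2;
      return NULL;
  }

  int main(void)
  {
      const char *agreed = majority("cover=412", "cover=412", "cover=415");
      printf("%s\n", agreed ? agreed : "no agreement: reschedule");
      return 0;
  }

Spot-checking works the same way, except that only a random fraction of work units is replicated against a trusted reference.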
Some General Issues
Grid architecture.
Resource management.
QoS mechanisms.
Performance monitoring.
Fault tolerance.
86
References
URL to these slides:
http://www.cs.utk.edu/~abukhzam/grid-tutorial.htm
Condor:
http://www.cs.wisc.edu/condor
Globus:
http://www.globus.org
87
References
NetSolve:
http://icl.cs.utk.edu/netsolve
Harness:
NWS:
http://nws.cs.ucsb.edu/
SInRG:
http://icl.cs.utk.edu/sinrg
88
GRID COMPUTING
END
89