
William Kerney

4/29/00

Clusters: Separating Myth from Fiction


I. Introduction

Over the last few years, clusters of commodity PCs have become ever more prevalent. Since the early 90s, computer scientists have been predicting the demise of Big Iron, that is, the custom-built supercomputers of the past such as the Cray X-MP or the CM-*, due to workstations' superior price/performance. Big Iron machines were able to stay viable for a long time since they were able to perform computations that were infeasible on even the fastest of personal computers. In the last few years, though, clusters of personal computers have nominally caught up to supercomputers in raw CPU power and interconnect speed, putting three self-made clusters in the top 500 of supercomputers.[1]

This has led to a lot of excitement in the field of clustered computing, and to inflated expectations as to what clusters can achieve. In this paper, we will survey three clustering systems, compare some common clusters with a modern supercomputer, and then discuss some of the myths that have sprung up about clusters in recent years.

II. The NOW Project


One of the most famous research efforts in clustered computing was the NOW project at UC Berkeley, which ran from 1994 to 1998. "A Case for NOW"[2] by Culler et al. is an authoritative statement of why clusters are a good idea: they have lower costs, greater performance, and can even be used as a general computer lab for students.

[1] http://www.netlib.org/benchmark/top500/top500.list.html
[2] http://now.cs.berkeley.edu/Case/case.html

The NOW cluster physically looked like any other undergraduate computer lab: it had (in 1998) 64 UltraSPARC I boxes with 64MB of main memory each, all of which could be logged into individually. For all intents and purposes they looked like individual workstations that could submit jobs to an abstract global pool of computational cycles. This global pool is provided by way of GLUnix, a distributed operating system that sits atop Solaris and provides load balancing, input redirection, job control and co-scheduling of applications that need to be run at the same time. GLUnix load balances by watching the CPU usage of all the nodes in the cluster; if a user sits down at one workstation and starts performing heavy computations, the OS will notice and migrate any global jobs to a less loaded node. In other words, GLUnix is transparent: it appears to a user that he has full access to his workstation's CPU at all times, with a batch submission system to access spare cycles on all the machines across the lab. The user does not decide which nodes to run on; he simply uses the resources of the whole lab.
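To make the load-balancing policy concrete, here is a minimal sketch of the decision a GLUnix-style daemon has to make. The structure and names are illustrative only (this is not the actual GLUnix code): gather per-node load, skip nodes with an interactive user at the console, and send global jobs to the least loaded of the rest.

    #include <stddef.h>

    /* Per-node state a GLUnix-style master might track (hypothetical). */
    struct node {
        int    id;
        double cpu_load;     /* recent CPU utilization, 0.0 to 1.0     */
        int    interactive;  /* nonzero if a user is at this console   */
    };

    /* Pick the least-loaded node with no interactive user, so that global
     * (batch) jobs are migrated away from workstations people are using.
     * Returns -1 if every node currently has an interactive user.       */
    int pick_target_node(const struct node *nodes, size_t n)
    {
        int best = -1;
        double best_load = 2.0;               /* above any possible load */

        for (size_t i = 0; i < n; i++) {
            if (nodes[i].interactive)
                continue;                     /* leave that desk alone   */
            if (nodes[i].cpu_load < best_load) {
                best_load = nodes[i].cpu_load;
                best = nodes[i].id;
            }
        }
        return best;
    }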

David Culler and the other developers of NOW also discovered one of the most important ideas to come out of clustered computing: Active Messages. Active Messages were devised to compensate for the slower networks that workstations typically use, usually 10BaseT or 100BaseT, which get nowhere near the performance of high-performance custom hardware like hypercubes of CrayLinks.[3] In the NOW cluster, when an active data packet arrives, the NIC writes the data directly into an application's memory. Since the application no longer has to poll the NIC or copy data out of the NIC's buffer, the overall end-to-end latency decreases by 50% for medium-sized (~1KB) messages and from 4ms to 12us for short (1 packet) messages, a 200x reduction in time. A network running Active Messages has a lower half-power point (the message size that achieves half the maximum bandwidth) than a network using TCP, since Active Messages have a much smaller latency, especially for short messages. A network with AM hits the half-power point at 176 bytes, as compared with 1352 bytes for TCP.[4] When 95% of packets are less than 192 bytes and the mean size is 382 bytes (according to a study performed on the Berkeley network), Active Messages will be far superior to TCP.
The downside to Active Messages is that programs must be rewritten to take advantage of the interface; by default programs poll the network with a select(3C) call and do not set up regions of memory for the NIC to write into. It is not a straightforward conversion from TCP sockets, since the application has to set up handlers that get called back when a message arrives for the process. The NOW group worked around this by implementing Fast Sockets[5], which presents the same API as UNIX sockets but has an Active Messages implementation beneath.
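The conversion the NOW group describes, from polling a socket to registering an Active Message handler, looks roughly like the sketch below. The select()/read() half uses the ordinary POSIX calls named above; the am_* half is a hypothetical stand-in for the AM interface (with toy stub definitions so the sketch is self-contained), included only to show the shape of the change that Fast Sockets hides from the programmer.

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/select.h>
    #include <unistd.h>

    /* Conventional sockets: the application discovers data by polling and
     * then copies it out of the kernel/NIC buffer a second time.          */
    ssize_t socket_receive(int sock, char *buf, size_t len)
    {
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(sock, &fds);
        select(sock + 1, &fds, NULL, NULL, NULL);   /* block until readable */
        return read(sock, buf, len);                /* extra copy           */
    }

    /* Active-Message style (hypothetical names, not the real NOW API): the
     * application registers a buffer and a handler once; the NIC deposits
     * each payload directly into the buffer and the handler runs on
     * arrival, with no polling and no extra copy.                          */
    typedef void (*am_handler)(void *buf, size_t len);

    static am_handler g_handler;                    /* toy registry         */
    static void *g_region;

    void am_register_handler(am_handler h, void *region)
    {
        g_handler = h;
        g_region  = region;
    }

    /* In a real system this would be driven by the NIC; here it only shows
     * where the registered handler would fire.                             */
    void am_deliver(size_t len)
    {
        if (g_handler)
            g_handler(g_region, len);
    }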


[3] http://www.sgi.com/origin/images/hypercube.pdf
[4] file://ftp.cs.berkeley.edu:/ucb/CASTLE/Active_Messages/hotipaper.ps
[5] http://www.usenix.org/publications/library/proceedings/ana97/full_papers/rodrigues/rodrigues.ps

The results that came out of the NOW project were quite promising. It broke the world record for the Datamation disk-to-disk sorting benchmark[6] in 1997, demonstrating that a large number of cheap workstation drives can have a higher aggregate bandwidth than a smaller number of high-performance drives in a centralized server. Also, the NOW project showed that for a fairly broad class of problems the cluster was scalable and could challenge the performance of traditional supercomputers with inexpensive components. Their Active Messages system, by lowering message delay, mitigated the slowdown caused by running on a cheap interconnect.[7]

[6] http://now.cs.berkeley.edu/NowSort/nowSort.ps
[7] http://www.cs.berkeley.edu/~rmartin/logp.ps

III. HPVM

HPVM, or High-Performance Virtual Machine, was a project by Andrew Chien et al. at the University of Illinois at Urbana-Champaign (1997-present) that built in part on the successes of the NOW project.[8] Their goal was similar to PVM's, in that they wanted to present an abstraction layer that looked like a generic supercomputer to its users but was actually composed of heterogeneous machines beneath.

[8] http://www-csag.ucsd.edu/papers/hpvmsiam97.ps

The important difference between HPVM and both PVM and NOW is that where PVM and NOW use their own custom APIs to access the parallel processing capabilities of their systems, requiring programmers to spend a moderate amount of effort porting their code, HPVM presents four different APIs which mimic common supercomputing interfaces. So, for example, if the programmer already has a program written using SHMEM, the one-sided memory transfer API used by Crays, then he will be able to quickly port his program to HPVM. The interfaces implemented by HPVM are: MPI, SHMEM, Global Arrays (similar to shared memory but allowing multi-dimensional arrays) and FM (described below).[9]

The layer beneath HPVM's multiple APIs is a messaging layer called Fast Messages. FM was developed in 1995 as an extension of Active Messages.[10] Since then, AM has been worked on as well, so the projects have diverged slightly over the years, though both have independently implemented new features such as allowing more than one active process per node. The improvements FM made to AM include the following[11]:

1) FM allows the user to send messages larger than fit in main memory; AM does not.

2) AM returns an automatic reply to every request sent in order to detect packet loss. FM implements a more sophisticated reliable-delivery protocol and guarantees that messages are delivered in order.

3) AM requires the user to specify the remote memory address the message will get written into; FM only requires that a handler be specified for the message.
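Taken together, those three improvements describe a streaming, handler-based send. The sketch below shows what that style looks like; the fm_* names are hypothetical stand-ins (with trivial stubs so the sketch compiles), not the real Fast Messages API.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for an FM-like messaging layer; the real Fast
     * Messages interface differs. The stubs exist only to make the sketch
     * self-contained.                                                      */
    typedef struct { int dest; int handler_id; } fm_stream;

    static fm_stream fm_begin(int dest, int handler_id)
    {
        fm_stream s = { dest, handler_id };
        return s;
    }
    static void fm_send_piece(fm_stream *s, const void *buf, size_t len)
    {
        /* A real layer would fragment, pace, and reliably deliver here.   */
        printf("to node %d, handler %d: %zu bytes\n",
               s->dest, s->handler_id, len);
        (void)buf;
    }
    static void fm_end(fm_stream *s) { (void)s; /* flush, wait for acks */ }

    enum { CHUNK = 8 * 1024 };

    /* Stream a buffer that may be larger than any one packet (improvement 1),
     * relying on the layer for ordered, reliable delivery (improvement 2),
     * and naming only a receive handler, never a remote address (improvement
     * 3).                                                                   */
    void send_large(int dest, int handler_id, const char *data, size_t total)
    {
        fm_stream s = fm_begin(dest, handler_id);
        for (size_t off = 0; off < total; off += CHUNK) {
            size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
            fm_send_piece(&s, data + off, n);
        }
        fm_end(&s);
    }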
In keeping with HPVM's goal of providing an abstract supercomputer, it theoretically allows its interface to sit above any combination of hardware that a system administrator can throw together. In other words, it would allow an administrator to put 10 Linux boxes, 20 NT workstations and a Cray T3D into a virtual supercomputer that could run MPI, FM or SHMEM programs quickly (via the FM layer underlying it all).

[9] http://www-csag.ucsd.edu/projects/hpvm/doc/hpvmdoc_7.html#SEC7
[10] http://www-csag.ucsd.edu/papers/myrinet-fmsc95.ps
[11] http://www-csag.ucsd.edu/papers/fm-pdt.ps

In reality, Chien's group only implemented the first version of HPVM on NT and Linux boxes, and their latest version only does NT clustering. A future release might add support for more platforms.

IV. Beowulf

Beowulf has been the big name in clustering recently. Every outlet of the high-tech press has run at least one story on Beowulf: Slashdot[12], ZDNet[13], Wired[14], CNN[15] and others. One of the more interesting things to note about Beowulf clusters is that there is no such thing as a definitive Beowulf cluster. Various managers have labeled their projects "Beowulf-style" (like the Stone Soupercomputer[16]), while others will say that a true Beowulf cluster is one that mimics the original cluster at NASA.[17] Still others claim that any group of boxes running an open-source operating system is a Beowulf. The definition we will use here is: any cluster of workstations which runs Linux with the packages available off the official Beowulf website.

[12] http://slashdot.org/articles/older/00000817.shtml
[13] http://www.zdnet.com/zdnn/stories/news/0,4586,2341316,00.html
[14] http://www.wired.com/news/technology/0,1282,14450,00.html
[15] http://www.cnn.com/2000/TECH/computing/04/13/cheap.super.idg/index.html
[16] http://stonesoup.esd.ornl.gov/
[17] http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html

The various packages include:

1) Ethernet bonding: this allows multiple Ethernet channels to be logically joined into one higher-bandwidth connection. In other words, if a machine had two 100Mb/s connections to a hub, it would be able to transmit data over the network at 200Mb/s, assuming that all other factors are negligible.

2) PVM or MPI: these standard toolkits are what allow HPC programs to actually be run on the cluster. Unless the user has an application whose granularity is so coarse that it can be handled merely with remote shells, he will want to have either PVM or MPI or the equivalent installed.

3) Global PID space: this patch allows any given process id to be in use on only one of the Linux boxes in the cluster. Thus, two nodes can always agree on what Process 15 is; this helps promote the illusion of the cluster being one large machine instead of a number of smaller ones. As a side effect, the Global PID space patch allows processes to be run on remote machines.

4) Virtual shared memory: this also contributes to the illusion of the Beowulf cluster being one large machine. Even though each machine, in hardware, has no concept of a remote memory address as an Origin 2000 does, with this kernel patch a process can use pages of memory that physically exist on a remote machine. When a process tries to access memory not in local RAM, it triggers a page fault, which invokes a handler that fetches the memory from the remote machine (see the sketch after this list).

5) Modified standard utilities: they have altered utilities like ps and top to give process information over all the nodes in the cluster instead of just the local machine. This can be thought of as a transparent way of dealing with things typically handled by a supercomputer's batch queue system. Where a user on the Origin 2000 would do a bps to examine the state of the processes in the queues, a Beowulf user would simply do a ps and look at the state of both local and remote jobs at the same time. It is up to a user's tastes to determine which way is preferable.
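The page-fault mechanism described in item 4 can be sketched at user level with standard POSIX calls: protect a region with mprotect(), catch the resulting SIGSEGV, fetch the page, then let the faulting instruction retry. This is only a simplified illustration of the idea (a real kernel-level patch works below the system-call layer and must also handle coherence and write ownership), and fetch_page_from_remote() is a hypothetical placeholder.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_SIZE (1 << 20)      /* 1MB of "distributed" memory       */

    static char *region;               /* mapped locally, initially no access */
    static long  page_size;

    /* Hypothetical placeholder: ask the owning node for this page's bytes. */
    static void fetch_page_from_remote(void *page_addr, size_t len)
    {
        memset(page_addr, 0, len);     /* stand-in for a network fetch      */
    }

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        char *page = (char *)((uintptr_t)si->si_addr &
                              ~(uintptr_t)(page_size - 1));

        /* Make the page accessible, fill it with the remote copy, and
         * return; the faulting instruction is re-executed and succeeds.    */
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
        fetch_page_from_remote(page, page_size);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        page_size = sysconf(_SC_PAGESIZE);

        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = fault_handler;
        sigaction(SIGSEGV, &sa, NULL);

        /* Map the shared region with no access so every first touch faults. */
        region = mmap(NULL, REGION_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        region[12345] = 'x';           /* faults, page is fetched, write lands */
        return 0;
    }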

A good case study of Beowulf is the Avalon project[18] at Los Alamos National Laboratory. They put together a 70-CPU Alpha cluster for $150,000 in 1998. In terms of peak performance, it scored twice as high as a multi-million-dollar Cray with 256 nodes. Peak rate, though, is a misleading performance metric: people will point to the high GFLOPS rate and ignore the fact that those benchmarks did not take communication into account. This leads to claims like the ones the authors make, that do-it-yourself supercomputing will make vendor-supplied supercomputers obsolete since their price/performance ratio is so horrible.

Interestingly enough, in the two years since that paper was published, the top 500 list of supercomputers is still overwhelmingly dominated by vendors. In fact, there are only three self-made systems on the list, with the Avalon cluster (number 265) being one of them.

Why is that the case? Although it gets a great peak performance, three times greater than the Origin 2000, a Beowulf cluster like Avalon does not work as well in the real world. Real applications communicate heavily, and a fast Ethernet switch cannot match the speed of the custom Origin interconnect. Even though Avalon used the same number of 533MHz Alpha 21164s as the Origin 2000 had 195MHz R10000s, the NAS Parallel Class B benchmarks rated the O2k at twice the performance. A 533MHz 21164 has a SPECint95 rating of 27.9[19] while the 195MHz R10000 only gets 10.4.[20] This means that, due to its custom hardware, the O2k was able to get roughly six times the computing power out of its processors.
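To make the arithmetic behind that "six times" figure explicit, using the author's numbers (a rough estimate that ignores memory systems and compilers):

    \[
    \frac{\mathrm{SPECint95}_{\text{Alpha 21164/533}}}{\mathrm{SPECint95}_{\text{R10000/195}}}
      = \frac{27.9}{10.4} \approx 2.7,
    \qquad
    \frac{\mathrm{NAS\ Class\ B}_{\text{O2k}}}{\mathrm{NAS\ Class\ B}_{\text{Avalon}}} \approx 2
    \]
    \[
    \text{per-processor advantage of the O2k} \approx 2 \times 2.7 \approx 5.4 \approx 6
    \]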

[18] http://cnls.lanl.gov/avalon/
[19] http://www.spec.org/osg/cpu95/results/res98q3/cpu95-980914-03070.html
[20] http://www.spec.org/osg/cpu95/results/res98q1/cpu95-980206-02411.html

Although the authors claim a win since their system was 20 times cheaper than the Origin, the opposite is true: the result justifies the cost of an Origin by saying, in effect, "If you want to make your system run six times faster, you can pay extra for some custom hardware." And given the moderate success of the Origin 2000 line, users seem to be agreeing with this philosophy.

One important thing to note about Beowulf clusters is that they are different from a NOW: instead of being a computer lab where students can sit down and use any of the workstations individually, a Beowulf is a dedicated supercomputer with one point of entry.

(This is actually something that the GRID book is wrong about: pages 440-441 say that NOWs are dedicated, but the cited papers for NOW repeatedly state that they have the ability to migrate jobs away from workstations being used interactively.)

Both NOWs and Beowulfs are made of machines which have independent local memory spaces, but they go about presenting a global machine in different ways. A Beowulf uses kernel patches to pretend to be a multi-CPU machine with a single address space, whereas the NOW project uses GLUnix, a layer that sits above the system kernel and loosely glues machines together by allowing MPI invocations to be scheduled and moved between nodes.

V. Myth

As the Avalon paper demonstrated, there are a lot of inflated expectations of what clusters can accomplish. Scanning through the forums of Slashdot[21], one can easily see that a negative attitude toward vendor-supplied supercomputers prevails. Quotes like "Everything can be done with a Beowulf cluster!" and "Supercomputers are dead" are quite common. This reflects a naiveté on the part of the technical public as a whole. There are three refutations to beliefs such as these:

1) The difference between buying a supercomputer and making a cluster is the difference between repairing a broken window yourself and having a professional do it for you. Building a Beowulf cluster is do-it-yourself supercomputing. It is a lot cheaper than paying professionals like IBM or Cray to do it for you, but as a trade-off your system will be less reliable, because it is being built by amateurs. The Avalon paper tried to refute this by saying that they had over 100 days of uptime, but reading the paper carefully, one can see that only 80% of their jobs completed successfully. Why did 20% fail? They didn't know.

Holly Dail mentioned that the people who built the Legion cluster at the University of Virginia suffered problems from having insufficient air conditioning in their machine room. A significant fraction of the cost of a supercomputer is in building the chassis, and the chassis is designed to properly ventilate multiple CPUs running heavy loads. Sure, the Virginia people had a supercomputer for less than a real one costs, but they paid for it in hardware problems.

[21] http://www.slashdot.org/search.pl, search for "Beowulf"

2) Businesses need high availability. 40% of the IT managers interviewed by ZDNet[13] said that the reason they were staying with mainframes and not moving to clusters of PCs is that large, expensive computers come with more stringent uptime guarantees. IBM, for example, makes a system with a guaranteed 99.999% uptime, which means that the system will be down for only about five minutes in an entire year. Businesses can't afford to rely on systems like ASCI Blue, which is basically 256 quad Pentium Pro boxes glued together with a custom interconnect. ASCI Blue has never been successfully rebooted.

A large part of the cost of vendor-supplied machines goes to testing. As a researcher, you might not care if you have to restart your simulation a few times, but a manager in charge of a mission-critical project definitely wants to know that his system has been verified to work. Do-it-yourself projects just can't provide this kind of guarantee. That's why whenever a business needs repairs done on its building, it hires a contractor instead of having its own employees do it for less.

3) Vendors are already doing it. It is a truism right now that commercial, off-the-shelf (COTS) technology should be used whenever possible. People use this to justify not buying custom-built supercomputers. The real irony is that the companies that build these supercomputers are not dumb, and they do use COTS technology whenever they can, with the notable exception of Tera/Cray, who believe in speed at any price. The only time most vendors build custom hardware is when they feel the added cost will be justified by a significant performance gain.

For example, Blue Horizon, the world's third most powerful computer, is built using components from IBM workstations: its CPUs, memory and operating system are all recycled from IBM's lower-end systems. The only significant parts that are custom are the high-performance file system (which holds 4TB and can write data in parallel very quickly), the chassis (which promotes reliability, as discussed above), the SP switch (which is being used for backwards compatibility), the monitoring software (the likes of which cannot be found on Beowulf clusters) and the memory crossbar, which replaces the bus-based memory system found on most machines these days. Replacing the bus with a crossbar greatly increases memory bandwidth and eliminates a bottleneck seen by many SMP programs: when multiple CPUs try to hit memory at once over a bus, only one at a time can be served, causing severe slowdown. Blue Horizon was sold to the Supercomputer Center for $20,000,000, which works out to roughly $20,000 a processor, an outrageously expensive price. But the fact that the center was willing to pay for it is testimony enough that the custom hardware gave it enough of an advantage over systems built entirely from COTS products to justify the price.
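A crude model of the bottleneck being removed (an idealized sketch that ignores caches and conflicts on individual banks): with p processors sharing a single memory bus of bandwidth B, each processor sees at most B/p under contention, while a crossbar connecting p processors to m memory banks can carry several transfers at once.

    \[
    \text{bus: } \; \mathrm{BW}_{\text{per CPU}} \le \frac{B_{\text{bus}}}{p}
    \qquad
    \text{crossbar: } \; \mathrm{BW}_{\text{aggregate}} \approx \min(p, m)\, B_{\text{link}}
    \]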

VI. Conclusion

Clustered computing is a very active field these days, with a number of good advances coming out of it, such as Active Messages, Fast Messages, NOW, HPVM, Beowulf, and so on. By building systems from powerful commodity processors, connecting them with high-speed commodity networks using Active Messages, and linking everything together with a free operating system like Linux, one can create a machine that looks, acts and feels like a supercomputer, except for the price tag. However, alongside the reduced price comes a greater risk of failure, a lack of technical support when things break (NCSA has a full service contract with SGI, for example), and the possibility that COTS products won't do as well as custom-built ones.

A few people have drawn a distinction between two different kinds of Beowulf clusters. The first, the Type I Beowulf, is built entirely with parts found at any computer store: standard Intel processors, 100BaseT Ethernet and PC100 RAM. These machines are the easiest and cheapest to buy, but they are also the slowest, due to the inefficiencies common in standard hardware. The so-called Type II Beowulf is an upgrade to the Type I: it adds more RAM than is commonly found in PCs, replaces the 100BaseT with more exotic networking like Myrinet, and upgrades the OS to use Active Messages. In other words, it replaces some of the COTS components with custom ones to achieve greater speed.

I hold the view that traditional supercomputers are the logical extension of this process: a Type III Beowulf, if you will. Blue Horizon, for example, can be thought of as 256 IBM RS/6000 workstations that have been upgraded with a custom chassis and a memory crossbar instead of a bus. Just like Type II Beowulfs, they replace some of the COTS components with custom ones to achieve greater speed. There's no reason to call for the death of supercomputers at the hands of clusters; in some sense, the vendors have done that already.
