
Introduction: Distributed High Performance Computing and Gigabit Wide Area Networks


G. Cooperman, E. Jessen and G. Michler
College of Computer Science
Northeastern University
Boston, MA 02115, USA
Technische Universität München
Institut für Informatik
Lehrstuhl VIII - Rechnerstruktur/architektur
D-80290 München
Institut für Experimentelle Mathematik
Universität Essen
Ellernstraße 29, D-45326 Essen
In many countries there are nationwide high performance wide area
computer networks, e.g. the German broadband science network B-WiN or
the very high speed backbone network service (vBNS) of the American
National Science Foundation (NSF). They form a part of the Internet, the
widest wide area network of them all. Experience with these networks
provides a wealth of information, in particular on hardware and software for
wide area networking and high performance computing.
The lectures and discussions at the meeting have inspired several fruitful
international research collaborations on distributed information technologies
and applications. They emphasize the interdisciplinary cooperation of
electrical engineers, mathematicians and computer scientists in this very
active area of research. In particular, this meeting surveys recent research
directions in large scale networking:
a) Technologies enabling optical, electrical and wireless communications;
b) Network engineering and network management systems;
c) System software and program development environments for distributed
   high performance computing;
d) Distributed applications such as electronic commerce, distance learning
   and digital libraries;
e) High confidence systems with secure access, high availability, and strong
   privacy guarantees.
Experimental results for such systems of the future can only be gained
through access to today's gigabit wide area networks. Therefore another
important section of this workshop was devoted to reports about existing
gigabit testbeds and planned high speed networks in the United States,
Europe, and in particular in Germany. These testbeds include: the American
initiatives for the next generation internet (NGI) and the Internet 2 of about
120 Universities in the United States (cf. K. Abdali); the Norwegian National
Research Network (cf. T. Plagemann); and the planned German
Gigabitwissenschaftsnetz (GWin) (cf. E. Jessen).
The NGI initiative is financially supported by the American federal agencies
DARPA, DOE, NASA, NIH, NIST and NSF. It aims:
a) to promote research, development and experimentation in advanced
   networking technologies;
b) to deploy an NGI testbed emphasizing end-to-end performance, end-to-end
   quality of service and security;
c) to develop ultra high speed switching and transmission technologies;
d) to develop demanding applications that make use of the advances in
   network technologies. Among the proposed application areas are: health
   care, crisis management and response, distance learning, and distributed
   high performance computations for biomedicine, climate modeling, and
   basic science.
The Internet 2 project is funded by 77 American universities and some
industrial partners. It is driven by education and research. Internet 2 will
include a gigabit network (Project Abilene), which will operate in 1999. Of
course it will benefit from the experiments and results of the NGI initiative.
In Germany, in spring 2000, GWin, the gigabit network of DFN (Deutsches
Forschungsnetz; German national scientific networking association) will start
its operation.
As a forerunner for its gigabit network, DFN is supporting two gigabit
testbeds in West and South Germany (with a link to Berlin) where
experiments are performed. Several articles of this volume report both on
planned and on completed experiments. The gigabit testbed West connects
the research centers GMD St. Augustin and FZ Jülich in North Rhine
Westphalia with a bandwidth of 2.5 Gbps. It has broadband connections to
the computer centers of the DLR (Deutsches Zentrum für Luft- und
Raumfahrt) in Cologne-Porz and the Universities of Cologne and Essen. The
gigabit testbed Süd connects the University of Erlangen and the Technical
University of Munich. It currently consists of a dark fiber connection
between the computer centers of these universities. The bandwidth of this
switched ATM network is initially 3 times 2.5 Gbps, and its capacity can be
made many times larger through the use of wavelength division multiplexing.
The gigabit testbed Süd will be extended to Berlin and Stuttgart, in order to
connect the supercomputers at the Konrad Zuse Institute in Berlin, Leibniz
Computing Center in Munich and the Computing Center of Stuttgart
University. This wide area network of supercomputers will be used for
demanding distributed high performance computations in applied
mathematics, physics, chemistry, engineering, economics and medicine.
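To make the bandwidth figures above concrete, the following minimal sketch computes the aggregate rate of three 2.5 Gbps channels and the idealized time to move a dataset across the link. The channel count and per-channel rate come from the text; the 100 GB dataset size and the function names are purely illustrative assumptions, and protocol overhead and latency are ignored.

```python
# Figures from the text: the southern testbed starts with 3 channels of 2.5 Gbps.
CHANNELS = 3
GBPS_PER_CHANNEL = 2.5

# Hypothetical dataset size, chosen only for illustration.
HYPOTHETICAL_DATASET_GB = 100.0


def aggregate_gbps(channels: int, gbps_per_channel: float) -> float:
    """Total raw bandwidth when all channels are used in parallel."""
    return channels * gbps_per_channel


def transfer_seconds(size_gigabytes: float, link_gbps: float) -> float:
    """Idealized transfer time: gigabytes -> gigabits, divided by the link rate.

    Ignores protocol overhead, latency, and contention.
    """
    return size_gigabytes * 8 / link_gbps


if __name__ == "__main__":
    total = aggregate_gbps(CHANNELS, GBPS_PER_CHANNEL)
    print(f"aggregate bandwidth: {total} Gbps")
    print(f"single channel:  {transfer_seconds(HYPOTHETICAL_DATASET_GB, GBPS_PER_CHANNEL):.1f} s")
    print(f"three channels:  {transfer_seconds(HYPOTHETICAL_DATASET_GB, total):.1f} s")
```

Under these assumptions the three channels together cut the ideal transfer time to one third of the single-channel time; real throughput would be lower.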
In 1997 the United States established the National Computational Science
Alliance. It is led by the Supercomputer Centers at the University of Illinois
at Urbana and the University of California at San Diego. Each alliance
consists of more than 60 partner institutions, including academic and
government research labs and industrial organizations. These cooperating
institutions all benefit from a metacomputing environment based on high
speed networks. K. Abdali of the NSF provides further details in the article,
"Advanced computing and communications research under NSF support".
The strictest requirements for high bandwidth applications can currently be
found in the areas of metacomputing and distributed high performance
computation. These applications serve a secondary purpose in stress testing
the network, to help the engineers and computer scientists design better ones.
However, the number of such large scale experiments is currently rather
small. As long as the cooperating institutions interconnected by a wide area
high speed network are not given extra resources for distributed computer
applications this situation is likely to continue.
Another set of lectures at the meeting was devoted to the interplay between
communication hardware and software for high speed computer networks
and mathematical algorithm development for distributed high performance
computations. In particular, the implementations of the parallel linear algebra
algorithms help to create experiments checking the technical communication
properties of a broadband computer network with 155 Mbit/s bandwidth and
higher. On the other hand, such benchmarks also help to analyze the
efficiency of a mathematical algorithm.
This volume also contains several contributions concerning the
organization of scientific knowledge itself. Many scientific publications
quoting high performance computer applications lack proper documentation
of the original computer programs and of the memory intensive output data.
Recently many mathematical and other scientific journals have begun
offering both paper and digital formats. The digital versions offer many
advantages for the future. They have the potential for being searched, and
they can be incorporated into a distributed library system. Over scientific
wide area networks such as the planned German Gigabitwissenschaftsnetz,
libraries of universities and research institutes in digital form can be
combined into a national distributed research library. The members of these
institutions can be allowed to search, read, print and even annotate the digital
texts (and their computational appendices containing the original programs
and the output data) at their personal computers. The wide area networks
offer not only distributed library applications but they also offer completely
new applications making use of multimedia, distance learning, and computer-
aided teaching. Therefore this volume also contains several contributions
describing such applications.
Advanced Computing and Communications Research under NSF Support
S. Kamal Abdali
National Science Foundation, Arlington, VA 22230, USA
Abstract. This paper discusses the research initiatives and programs
supported by the National Science Foundation to promote high-end computing
and large-scale networking. This work mainly falls under the US interagency
activity called High Performance Computing and Communications (HPCC). The
paper describes the Federal government context of HPCC, and the HPCC
programs and their main accomplishments. Finally, it describes the
recommendations of a recent high-level advisory committee on information
technology, as these are likely to have a major impact on the future of
government initiatives in high-end computing and networking.
1 Introduction
A previous paper [1] described the activities of the National Science
Foundation (NSF) in the U.S. High Performance Computing and Communications
(HPCC) program until 1996. The purpose of the present paper is to update
that description to cover the developments since then. While some
management changes have taken place during this period, and there is some
redirection of its thrusts, the HPCC program continues to flourish, to say
the least. The main new activities at the NSF are Partnerships for Advanced
Computing Infrastructures (PACIs), the Next Generation Internet (NGI), and
the Knowledge and Distributed Intelligence (KDI) initiative, and there are
renewed programs for Science and Technology Centers and Digital Libraries.
New initiatives that may replace the program or change its direction
substantially are also expected to result from the recommendations of the
Presidential Information Technology Advisory Committee (PITAC). The paper
is mainly concerned with these new issues. But to make it self-contained,
the entire HPCC context is briefly described also.
2 The HPCC Program
The US High Performance Computing and Communications (HPCC) program
was launched in 1991. It operated as a congressionally mandated initiative
from October 1991 through September 1996, following the enactment of the
High Performance Computing Act of 1991. Since October 1996, it has
continued as a program under the leadership of the Computing, Information,
and Communications (CIC) Subcommittee of the Committee on Technology
(CT), which is itself overseen by the National Science and Technology
Council, a US Cabinet-level organization. Instrumental in the establishment
of the program was a series of national-level studies of scientific and
technological trends in computing and networking [2-5]. These studies
concluded and persuasively argued that a federal-level initiative in
high-performance computing was needed to ensure the preeminence of American
science and technology. Solving the challenging scientific and engineering
problems that were already on the horizon required significantly more
computational power than was available. Another factor was the progress
made abroad, especially the Japanese advances in semiconductor chip
manufacture and supercomputer design, and the Western European advances in
supercomputing applications in science and engineering. It was also clear
that the advances in information technology would have a far reaching
impact beyond science and technology, and would affect society in general
in profound, unprecedented ways. The HPCC program was thus established to
stimulate, accelerate, and harness these advances for coping with
scientific and engineering challenges, solving societal and environmental
problems, meeting national security needs, and improving the nation's
economic productivity and competitiveness.
As late as 1996, the goals of the HPCC initiative were stated separately
(e.g., in [10]) from the CIC mission descriptions. Now that HPCC has become
a CIC research and development (R&D) program, its goals are subsumed in
the CIC goals, which are formally stated as follows ([13]):
* Assure continued US leadership in computing, information, and
  communications technologies to meet Federal goals and to support U.S.
  21st century academic, defense, and industrial interests
* Accelerate deployment of advanced and experimental information
  technologies to maintain world leadership in science, engineering, and
  mathematics; improve the quality of life; promote long term economic
  growth; increase lifelong learning; protect the environment; harness
  information technology; and enhance national security
* Advance U.S. productivity and industrial competitiveness through
  long-term scientific and engineering research in computing, information,
  and communications technologies
3 HPCC Participants and Components
The HPCC program at present involves 12 Federal agencies, each with its
specific responsibilities. In alphabetical order, the participating
agencies are: Agency for Health Care Policy and Research (AHCPR), Defense
Advanced Research Projects Agency (DARPA), Department of Energy (DOE),
Department of Education (ED), Environmental Protection Agency (EPA),
National Aeronautics and Space Administration (NASA), National Institutes
of Health (NIH), National Institute of Standards and Technology (NIST),
National Oceanic and Atmospheric Administration (NOAA), National Security
Agency (NSA), National Science Foundation (NSF), and Department of Veterans
Affairs (VA). The activities sponsored by these agencies have broad
participation by universities as well as industry. The program activities
of the participating organizations are coordinated by the National
Coordination Office for Computing, Information, and Communications (NCO),
which also serves as the liaison to the US Congress, state and local
governments, foreign governments, universities, industry, and the public.
The NCO disseminates information about HPCC program activities and
accomplishments in the form of announcements, technical reports, and the
annual reports that are popularly known as "blue books" [6-13]. The NCO
also maintains the web site http://www.ccic.gov to provide up-to-date,
online documentation about the HPCC program, as well as links to the
HPCC-related web pages of all participating organizations.
The program currently has five components: 1) High End Computing
and Computation, 2) Large Scale Networking, 3) High Confidence Systems,
4) Human Centered Systems, and 5) Education, Training, and Human
Resources. Together, these components are meant to foster, among other
things, scientific research, technological development, industrial and
commercial applications, growth in education and human resources, and
enhanced public access to information. In addition to these components,
there is a Federal Information Services and Applications Council to oversee
the application of CIC-developed technologies for federal information
systems, and to disseminate information about HPCC research to other
Federal agencies not formally participating in the program.
The goals of the HPCC components are as follows (see the "Blue Book"
99 [13] for an official description):
1. High End Computing and Computation: To assure US leadership
in computing through investment in leading-edge hardware, software,
and algorithmic innovations. Some representative research directions are:
computing devices and storage technologies for high-end computing
systems; advanced computing architectures; advanced software systems,
algorithms, and software for modeling and simulation. This component
also supports investigation of ideas such as optical, quantum, and
biomolecular computing that are quite speculative at present, but may lead
to feasible computing technologies in the future, and may radically change
the nature of computing.
2. Large Scale Networking: To assure US leadership in high-performance
communications. This component seeks to improve the state-of-the-art
in communications by investing in research on networking components,
systems, services, and management. The supported research directions
include: advanced technologies that enable wireless, optical, mobile, and
wireline communications; large-scale network engineering; system
software and program development environments for network-centric
computing; and software technology for distributed applications, such as
electronic commerce, digital libraries, and health care delivery.
3. High Confidence Systems: To develop technologies that provide users
with high levels of security, protection of privacy and data, reliability, and
restorability of information services. The supported research directions
include: system reliability issues, such as network management under
overload, component failure, and intrusion; survival of threatened systems
by adaptation and reconfiguration; technologies for security and privacy
assurance, such as access control, authentication, and encryption.
4. Human Centered Systems: To make computing and networking more
accessible and useful in the workplace, school, and home. The
technologies enabling this include: knowledge repositories and servers;
collaboratories that provide access to information repositories and that
facilitate sharing knowledge and control of instruments at remote labs;
systems that allow multi-modal human-system interactions; and virtual reality
environments and their applications in science, industry, health care, and
education.
5. Education, Training, and Human Resources: To support HPCC
research that enables modern education and training technologies. All
levels and modes of education are targeted, including elementary, secondary,
vocational, technical, undergraduate, graduate, and career-enhancing ed-
ucation. The education and training also includes the production of re-
searchers in HPCC technologies and applications, and a skilled workforce
able to cope with the demands of the information age. The supported
research directions include information-based learning tools, technologies
that support lifelong and distance learning for people in remote locations,
and curriculum development.
4 HPCC at NSF
As mentioned above, NSF is one of the 12 Federal agencies participating
in the HPCC program. The total HPCC budget and the NSF share in it
since the inception of the program are shown in Table 1. Thus, during this
period, NSF's share has ranged approximately between one-fourth and one-
third of the total Federal HPCC spending. The HPCC amount has remained
approximately 10% of the NSF's own total budget during the same period.
Table 1. HPCC Investment: Total budget and NSF's share (in $M)

Fiscal Year          1992  1993  1994  1995  1996  1997  1998  1999
Total HPCC budget     655   803   938  1039  1043  1009  1070   830
NSF's HPCC share      201   262   267   297   291   280   284   297
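The claim that NSF's share has stayed roughly between one-fourth and one-third of the total can be checked directly from the Table 1 figures. The short sketch below is only an illustrative calculation; the variable names are our own, and the numbers are copied from the table as printed.

```python
# Figures copied from Table 1 (in $M); variable names are illustrative.
years = [1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999]
total_hpcc = [655, 803, 938, 1039, 1043, 1009, 1070, 830]
nsf_share = [201, 262, 267, 297, 291, 280, 284, 297]

# NSF's fraction of the total HPCC budget for each fiscal year.
shares = {year: nsf / total
          for year, nsf, total in zip(years, nsf_share, total_hpcc)}

if __name__ == "__main__":
    for year, fraction in shares.items():
        print(f"{year}: {fraction:.1%}")

    lowest_year = min(shares, key=shares.get)
    highest_year = max(shares, key=shares.get)
    print(f"lowest:  {lowest_year} ({shares[lowest_year]:.1%})")
    print(f"highest: {highest_year} ({shares[highest_year]:.1%})")
```

With the figures as printed, the fraction ranges from about 27% (1998) to about 36% (1999), which matches the "approximately one-fourth to one-third" description, with the 1999 value slightly above one-third.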
The NSF objectives for its HPCC effort are:
* Enable the U.S. to uphold a position of world leadership in the science
  and engineering of computing, information and communications.
* Promote understanding of the principles and uses of advanced computing,
  communications, and information systems in service to science and
  engineering, to education, and to society.
* Contribute to universal, transparent, and affordable participation in an
  information-based society.
Thus NSF's HPCC-related work spans all five HPCC program components.
HPCC research penetrates, to varying depth, nearly all the scientific and
engineering disciplines at NSF. But most of this research is concentrated
in the NSF's Directorate of Computer and Information Science and
Engineering (CISE). This directorate is organized into 5 divisions, each of
which is, in turn, divided into 2-8 programs. The work of the CISE
divisions can be, respectively, characterized as: fundamental computation
and communications research; information, knowledge, intelligent systems,
and robotics research; experimental systems research and integrative
activities; advanced computational infrastructure research; and advanced
networking infrastructure research. While the phrase "high performance" may
not be explicitly present in the description of many programs, the actual
research they undertake is very much focused on HPCC. Indeed, the CISE
budget is almost entirely attributed to HPCC. Representative ongoing
research topics include: scalable parallel architectures; component
technologies for HPCC; simulation, analysis, design and test tools needed
for HPCC circuit and system design; parallel software systems and tools,
such as compilers, debuggers, performance monitors, program development
environments; heterogeneous computing environments; distributed operating
systems, tools for building distributed applications; network management,
authentication, security, and reliability; intelligent manufacturing;
intelligent learning systems; problem solving environments; algorithms and
software for computational science and engineering; integration of research
and learning technologies; very large data and knowledge bases;
visualization of very large data sets.
5 Large HPCC Projects
The HPCC program has led to several innovations in NSF's mechanisms
for supporting research and human resources development. The traditional
manner of funding individual researchers or small research teams continues
to be applied for HPCC work too. But to meet special HPCC needs, NSF
has initiated a number of totally new programs, such as supercomputing
centers, partnerships for advanced computational infrastructures, science
and technology centers, and various "challenges". Also launched were
special initiatives such as digital libraries, knowledge and distributed
intelligence, and the next generation internet. These projects are much
larger than the traditional ones in the scope of research, number of
participating investigators, research duration, and award size.
5.1 Science and Technology Centers (STCs)
The purpose, structure, and HPCC contributions of STCs were described in
[1]. So here we mainly state the developments that have taken place since.
STCs are intended to stimulate "integrative conduct of research,
education, and knowledge transfer." They provide an environment for
interaction among researchers in various disciplines and across
institutional boundaries. They also provide the structure to identify
important complex scientific problems beyond disciplinary and institutional
limits and scales, and the critical mass and funding stability and duration
needed for their successful solution. They carry out fundamental research,
facilitate research applications, promote technology transfer through
industrial affiliations, disseminate knowledge via visitorships,
conferences and workshops, educate and train people for scientific
professions, and introduce minorities and underrepresented groups to
science and technology through outreach activities.
STCs are large research projects, each of which typically involves 50+
principal investigators from 10+ academic institutions, and also has links
to industry. The participants work together on interdisciplinary research
unified by a single theme, such as parallel computing or computer graphics.
The projects are awarded initially for 5 years, are renewable for another 5
years, and are finally given an extra year for orderly phaseout. There is no
further renewal, so a center has to shut down definitively after at most 11
years. Of course, the investigators are free to regroup and compete again
in the program in the future if it continues.
As a result of the competitions that took place in 1989 and 1991, 25 STCs
were established by NSF. All of them have entered their final year now. The
following four of those STCs were supported by the HPCC program: The
Center for Research in Parallel Computation (CRPC) at Rice University; The
Center for Computer Graphics and Scientific Visualization at the University
of Utah; The Center for Discrete Mathematics and Theoretical Computer
Science (DIMACS) at Rutgers University; and The Center for Cognitive
Science at the University of Pennsylvania. These STCs have contributed
numerous theoretical results, algorithms, mathematical and computer science
techniques, libraries, software tools, languages, and environments. They
have also made significant advances in various scientific and engineering
application areas. Their output has been impressive in quality, quantity,
and impact.
In 1995, NSF undertook a thorough evaluation of the STC program. For
one study [14], Abt Associates, a private business and policy consulting
firm, was commissioned to collect various kinds of information about the
STCs, and the National Academy of Sciences was asked to examine that data
and evaluate the program. Another study [15] was conducted by the National
Academy of Public Administration. Both studies concluded that the STC program
represented excellent return on federal research dollar investment, and rec-
ommended that the program be continued further. The studies also endorsed
most of the past guidelines regarding the funding level, award duration, em-
phasis on education and knowledge transfer (additionally to research), review
and evaluation criteria, and management structure.
Based on these findings, NSF has decided to continue the STC program.
A new round of proposal solicitations took place in 1998. The submitted
proposals have been evaluated, and the awards are expected to be announced
soon (as of March 1999).
5.2 Partnerships for Advanced Computational Infrastructures (PACIs)
The precursor to PACIs was a program called Supercomputing Centers (SCs)
that was established by NSF in 1985 even before the start of the HPCC ini-
tiative. But the SC program greatly contributed to the momentum behind
HPCC, and, since its launch, became a significant part of the initiative. For
a 10-year duration, the program funded four SCs: Cornell Theory Center,
Cornell University; National Center for Supercomputing Applications, Uni-
versity of Illinois at Urbana-Champaign; Pittsburgh Supercomputer Center,
University of Pittsburgh; and San Diego Supercomputer Center, University
of California-San Diego. Several of their accomplishments and HPCC contri-
butions have been reported in [1].
A Task Force to evaluate the effectiveness of the SC program was com-
missioned by NSF in 1995. This resulted in a document which is popularly
known as the "Hayes Report" [16]. The study considered the alternatives of
renewing the SCs or having a new competition, and recommended the latter.
For a more effective national computing infrastructure development, it also
recommended funding fewer but larger alliances of research and experimental
facilities and national and regional high-performance computing centers.
Based on these findings, NSF instituted the PACI program in 1996, as the
successor to the SC program. The aim of the PACIs is to help maintain US
world leadership in computational science and engineering by providing ac-
cess nationwide to advanced computational resources, promoting early use
of experimental and emerging HPCC technologies, creating HPCC software
systems and tools, and training a high quality, HPCC-capable workforce.
After holding a competition, NSF made two PACI awards in 1997. These
are the National Computational Science Alliance (Alliance) led by the Na-
tional Center for Supercomputing Applications (NCSA) at the University of
Illinois at Urbana-Champaign, and the National Partnership for Advanced
Computational Infrastructure (NPACI) led by the San Diego Supercomputer
Center at the University of California at San Diego. Each consists of more
than 60 partner institutions, including academic and government research
labs, national, state-level and local computing centers, and business and in-
dustrial organizations. The leading sites, which maintain a variety of high-
performance computer systems, and the partners which maintain smaller
configurations of similar systems, jointly constitute a metacomputing envi-
ronment connected via high-speed networks. The partners contribute to the
infrastructure by developing in-house, using, and testing the necessary soft-
ware, tools, environments, applications, algorithms, and libraries, thereby
contributing to the further growth of a "national grid" of networked high-
performance computers.
The initial mission of the SCs was to satisfy the supercomputing needs
of US computational scientists and engineers. The major role of the PACIs
continues to be to provide supercomputing access to the research community
in all branches of science and engineering. But their expanded mission puts
a heavy emphasis on education and training at all levels.
5.3 Next Generation Internet (NGI)
The NGI initiative, a multi-agency Federal R&D program that began in
October 1997, is the main focus of the Large Scale Networking (LSN)
component. It represents consolidation and refinement of ideas behind the
vision of a National Information Infrastructure.
This infrastructure is a subject of various studies, most importantly [17,18].
The NGI initiative supports foundational work to lead to much more pow-
erful and versatile networks than the present-day Internet. To advance this
work, the initiative fosters partnerships among universities, industry and the
government. The participating federal government agencies include: DARPA,
DOE, NASA, NIH, NIST and NSF. The NGI goals are:
1. Promote research, development, and experimentation in networking tech-
nologies.
2. Deploy testbeds for systems scale testing of technologies and services.
3. Develop "revolutionary" applications that utilize the advancements in
network technologies and exercise the testbeds.
The aim of the advancement stipulated in Goal 1 is to dramatically im-
prove the performance of networks in reliability, security, quality of ser-
vice/differentiation of service, and network management. Two testbeds are
planned for Goal 2. The first testbed is required to connect at least 100 sites
and deliver speeds that are at least 100 times faster end-to-end than the
present-day Internet. The second testbed is required to connect about 10
sites with end-to-end performance speed faster than the present Internet by
at least a factor of 1000. The "revolutionary" applications called for in Goal
3 are to range over enabling applications technologies as well as disciplinary
applications. Suggested examples of the former include collaboration tech-
nologies, digital libraries, distributed computing, virtual reality, and remote
operation and simulation. Suggested application areas for the latter include
basic science, education, health care, manufacturing, electronic commerce,
and government information services.
The NGI work in progress was showcased at the Supercomputing 98
conference in a special session called Netamorphosis. The "Netamorphosis"
demonstrations consisted of 17 significant NGI applications, ranging over
visualization, scene analysis, simulation, manufacturing, remote operation,
etc. For example, a demonstration entitled "Real-Time Functional MRI:
Watching the Brain in Action" showed how one could remotely view brain
activity while a patient was performing cognitive or sensory-motor tasks.
The system could process functional MRI data in real-time, though the data
acquisition, main computations, and visualization all took place at
different sites connected by advanced networks. Another demonstration
entitled "Distributed Image Spreadsheet: Earth Data from Satellite to
Desktop" showed how scientists could analyze, process, and visualize
massive amounts of geologic, atmospheric, or oceanographic data transmitted
to their workstations from Earth Observing System satellites.
5.4 Digital Libraries Initiative (DLI)
The original DLI, now referred to as DLI Phase 1, started as a joint venture
of NSF, DARPA, and NASA. Now the initiative is in Phase 2, and includes
as sponsors those agencies as well as the National Library of Medicine, the
Library of Congress, and the National Endowment for the Humanities.
The initiative seeks to advance the technologies needed to offer information
essentially about anything, to anyone, located anywhere around the
nation and the world. A digital library is intended to be a very large-scale
storehouse of knowledge in multimedia form that is accessible over the net.
The construction and operation of digital libraries requires developing
technologies for acquiring information, organizing this information in distributed
multimedia knowledge bases, extracting information based on requested
criteria, and delivering it in the form appropriate for the user. Thus, the DLI
promotes research on information collection, analysis, archiving, search,
filtering, retrieval, semantic conversion, and communication.
Phase 1 is supporting six large consortia consisting of academic and
industrial partners. Their main project themes and lead institutions are:
geographic information systems, maps and pictures, content-based retrieval
(University of California-Santa Barbara); intelligent internet search,
semantic retrieval, scientific journal publishing alternatives (University of Illinois);
media integration and access, new models of "documents," natural language
processing (University of California-Berkeley); digital video libraries, speech,
image and natural language technology integration (Carnegie Mellon
University); intelligent agent architecture, resource federation, AI service market
economies, educational impact (University of Michigan); uniform access,
distributed object architectures, interface for distributed information retrieval
(Stanford University).
Phase 1 of the initiative was mainly concerned with learning, prototyping,
and experimenting in the small. Phase 2 expects to put this
experience into actually building larger, operational, and usable systems and
testbeds. There is emphasis on larger contents and collections, interoperability
and technology integration, and expansion of domains and user communities
for digital libraries. The supported activities are expected to range
through the full spectrum of fundamental research, content and collections
development, domain applications, testbeds, operational environments, and
applications for developing educational resources and preserving the national
cultural heritage.
5.5 Knowledge and Distributed Intelligence (KDI)
KDI is a new initiative that NSF established in 1998. The HPCC research
has traditionally been concentrated in the NSF's Computer and Information
Science and Engineering directorate. The KDI initiative stems from the
realization that the advances in computing, communications, and information
technologies provide unprecedented possibilities for accelerating progress in
all spheres of human thought and action. KDI stresses knowledge as opposed
to information, but realizes, of course, that intelligent gathering of
information is a prerequisite to creating knowledge. Thus, a goal of KDI is to improve
the human ability to discover, collect, represent, store, apply, and transmit
information. This is to lead to improvements in the ways to create knowledge
and in the actual acquisition of new knowledge. The KDI research is classified
into three components:
1. Knowledge Networking (KN)
2. Learning and Intelligent Systems (LIS)
3. New Computational Challenges (NCC)
The KN component aims at building an open and context-rich environment
for online interactions among individuals as well as groups. For such an
environment to arise, advances have to be made in the techniques for
collecting and organizing information and discovering knowledge from it. The
KN-enabled vast scale of information acquisition and the power to uncover
knowledge buried in collected data have grave implications for privacy and
other human interest matters. Hence, KN is also concerned with research on
social, societal, ethical, and other aspects of networked information.
The focus of the LIS component of KDI is to better understand the process
of learning itself, as it occurs in humans, animals, and artificial systems. This
understanding is to be used for improving our own learning skills, developing
better teaching methods, and creating intelligent artifacts.
The NCC component is in the spirit of NSF's "Challenges" programs, such
as Grand Challenges, National Challenges, and Multidisciplinary Challenges.
In [1], these programs were described, and their impact and some of their
accomplishments were stated. The NCC component continues to seek
solutions of very complex scientific and engineering problems, ones that are
computationally expensive, data intensive, and require multidisciplinary team
approaches. The Challenges research and the advances in high-performance
computing and communications systems have a mutually beneficial push-pull
relationship; the former stress-tests the latter, and the latter helps the former
grow in scale and scope. NCC research aims to improve our ability to model
and simulate complex systems such as the oceans or the brain. In adopting
the Challenges research, the KDI initiative sees it as another knowledge
creation activity.
In 1998, NSF made 40 awards for KDI research for a total funding of
$51.5M. The awards span a broad range of topics, vast scopes of research,
and investigators representing diverse disciplines and institutions. The 1999
KDI competition is in process.
6 HPCC Evaluation
General directions as well as clear objectives were defined for the HPCC pro-
gram from the very beginning. Thus, some evaluation is built into the pro-
gram. Some objectives naturally lead to quantifiable measures of progress,
such as computation speeds in teraflops, communication bandwidth in giga-
bits, network extent in number of connected nodes, etc. On the other hand,
there are qualitative aspects of progress, such as scientific breakthroughs,
innovative industrial practices, societal penetration of knowledge and tech-
nology, quality of work force trained, etc.
The evaluation of the STC and SC programs has already been mentioned.
Other parts of the NSF HPCC program have also produced impressive results.
A number of studies have evaluated the effectiveness of the HPCC program
as a whole. The "Branscomb Report" [19] is devoted to studying
the means for making the program more productive. A thorough assessment
of the effectiveness of the program is undertaken in the "Brooks-Sutherland
Report" [20]. The purpose of a more recent study [21] is to suggest the
most important future HPCC applications, especially the ones with the highest
national, societal, and economic impact.
There is consensus that the HPCC program has been successful on most
fronts. Not only have the year-by-year milestones for quantifiable progress
been met, but the activities undertaken by the program have led to several
significant, unanticipated beneficial developments. The launch of important
new HPCC-inspired initiatives attests to the program's strong momentum.
But as the next section shows, there is a perception that the HPCC program
is underfunded and that the progress resulting from it will decelerate unless
newer and larger investments are added to it.
7 President's Information Technology Advisory Committee (PITAC)
PITAC was established in February 1997 to provide advice to the Adminis-
tration on all areas of computing, communications, and information technol-
ogy. This committee at present consists of 26 research leaders representing
academia and the industry. It issued an interim report in August 1998 and
a final one in February 1999 [22], after a series of meetings and broad con-
sultations with the research community. This report examines the impact of
R&D in Information Technology (IT) on US business and science, and makes
a number of recommendations for further work.
The PITAC report observes that the past IT R&D through HPCC and
other programs is a significant factor in the nation's world leadership position
in science, industry, business, and the general well-being of the citizenry. IT
advances are responsible for a third of the US economic growth since 1992,
and have created millions of high-paying new jobs. The computational
approach to science, in conjunction with the HPCC algorithms, software, and
infrastructure, has helped US scientists make new discoveries. The
competitiveness of the US economy owes much to the efficiencies resulting from IT
in engineering design, manufacturing, business, and commerce.
If IT is the engine that is driving the economy, then obviously it needs to
be kept running by further investment. The PITAC report argues that the
IT industry is spending the bulk of its own resources, financial and human,
on near-term development of new products for an exploding market. The IT
industry can contribute only a small fraction of the long-term R&D invest-
ment needed. Moreover, the industry does not see any immediate benefits of
the scientific and social components of IT, and therefore has no interest in
pursuing them. After estimating the total US R&D expenditure on IT, and
the Federal and industrial shares of it, the PITAC conclusion is that the
Federal support of the Information Technology (IT) R&D is grossly inadequate.
Moreover, it is focused too much on near-term and applied research.
PITAC has recommended increments of about $1.3 billion per year for
the next 5 years. PITAC has also identified the following four high priority
areas as main targets of increased investment.
Software: Software production methodologies have to be dramatically
improved, by fundamental research, to deliver robust, usable, manageable,
cost-effective software.
Scalable Information Infrastructure: With the ever increasing size,
complexity, and sheer use of networks, research is needed on how to build
networks that can be easily extended yet remain reliable, secure, and easy to
use.
High-End Computing: Scientific research and engineering design are
becoming more and more computational. The increasing complexity of
problems demands ever faster computing and communications. Thus, sustained
research is needed on high performance architectures, networks, devices, and
systems.
Socioeconomic Impact: Research is needed to exploit the IT advances
to serve the society and to spread its benefits to all citizens. The
accompanying social, societal, ethical, and legal issues have to be studied, and ways
have to be sought for mitigating any potential negative impact.
Based on the PITAC recommendations, a new Federal interagency
initiative called Information Technology for the Twenty-first Century (IT²) is
being developed, as a possible successor to the HPCC program.
8 Conclusion
Scientific and engineering work is becoming more computational, because,
increasingly, computation is replacing physical experimentation and the
construction and testing of prototypes. (Indeed, the US Accelerated Strategic
Computing Initiative plans to depend totally on computational simulation
in its weapons research program for those weapons whose physical testing
is banned by international treaties.) Several recent scientific discoveries have
been possible because of computation. The HPCC program has played a key
role in the rise of computational science and engineering.
In [1], it was observed that collaboration and team work emerged as an
important modality of HPCC research. In particular, the HPCC programs have
emphasized 1) multi-disciplinary, multi-investigator, multi-institution teams,
2) partnerships among academia, business, and industry, and 3) cooperative,
interagency sponsorship of research. In recent years, the collaboration has
increased in intensity and scale. The transition from SCs to PACIs is a good
example.
The previous Challenge projects tended to be computation-intensive. In a
number of NCC projects, the data-intensive aspect dominates the
computation-intensive one. Because of this situation, data mining has emerged as a key
solution strategy for many Challenge-scale problems.
In practice, the HPCC program has so far been focused on applications
and infrastructure development. Partly this is because most of the
participating agencies in the HPCC program have special missions, and have rightly
emphasized the fulfillment of their missions rather than basic research. The
development of high performance computing infrastructure has also served
some critical research needs. But there is need now to bolster fundamental
research in order to stimulate further progress towards the original HPCC
goals. The PITAC report urges this.
References
1. Abdali S.K.: High Performance Computing Research at NSF, In G. Cooper-
man, G. Michler and H. Vinck (Eds.), Proc. Workshop on High Performance
Computation and Gigabit Local Area Networks, Lect. Notes in Control and
Information Sci. # 226, Springer-Verlag Berlin, 1997.
2. A National Computing Initiative: The Agenda for Leadership, Society for In-
dustrial and Applied Mathematics, Philadelphia, PA, 1987.
3. Toward a National Research Network, National Academy Press, Washington,
D.C., 1988.
4. Supercomputers: Directions in Technology and Applications, National Academy
Press, Washington, D.C., 1989.
5. Keeping the U.S. Computer Industry Competitive: Defining the Agenda, Na-
tional Academy Press, Washington, D.C., 1990.
6. Grand Challenges: High Performance Computing and Communications ("FY
1992 Blue Book"), Federal Coordinating Council for Science, Engineering, and
Technology, c/o National Science Foundation, Washington, D.C., 1991.
7. Grand Challenges 1993: High Performance Computing and Communications
("FY 1993 Blue Book"), Federal Coordinating Council for Science, Engineering,
and Technology, c/o National Science Foundation, Washington, D.C., 1992.
8. High Performance Computing and Communications: Toward a National
Information Infrastructure ("FY 1994 Blue Book"), Office of Science and Technology
Policy, Washington, D.C., 1993.
9. High Performance Computing and Communications: Technology for a National
Information Infrastructure ("FY 1995 Blue Book"), National Science and Tech-
nology Council, Washington, D.C., 1994.
10. High Performance Computing and Communications: Foundation for America's
Information Future ("FY 1996 Blue Book"), National Science and Technology
Council, Washington, D.C., 1995.
11. High Performance Computing and Communications: Advancing the Frontiers
of Information Technology ("FY 1997 Blue Book"), National Science and Tech-
nology Council, Washington, D.C., 1996.
12. Technologies for the 21st Century ("FY 1998 Blue Book"), National Science
and Technology Council, Washington, D.C., 1997.
13. Networked Computing for the 21st Century ("FY 1999 Blue Book"), National
Science and Technology Council, Arlington, VA, 1998.
14. National Science Foundation's Science and Technology Centers: Building an In-
terdisciplinary Research Program, National Academy of Public Administration,
Washington, D.C., 1995.
15. An Assessment of the National Science Foundation's Science and Technology
Centers Program, National Research Council, National Academy Press, Wash-
ington, D.C., 1996.
16. Report of the Task Force on the Future of the NSF Supercomputing Centers
("Hayes report"), Pub. NSF 96-46, National Science Foundation, Arlington,
VA.
17. The Unpredictable Certainty: Information Infrastructure through 2000, Na-
tional Research Council, National Academy Press, Washington, D.C., 1996.
18. More Than Screen Deep: Toward Every-Citizen Interfaces to the Nation's
Information Infrastructure ("Biermann Report"), National Research Council,
National Academy Press, Washington, D.C., 1997.
19. From Desktop to Teraflop: Exploiting the U.S. Lead in High Performance Com-
puting ("Branscomb Report"), Pub. NSB 93-205, National Science Foundation,
Washington, D.C., August 1993.
20. Evolving the High Performance Computing and Communications Initiative to
Support the Nation's Information Infrastructure ("Brooks-Sutherland Report"),
National Research Council, National Academy Press, Washington, D.C., 1995.
21. Computing and Communications in the Extreme: Research for Crisis Manage-
ment and Other Applications, National Research Council, National Academy
Press, Washington, D.C., 1996.
22. Information Technology Research: Investing in Our Future, President's
Information Technology Advisory Committee Report to the President, National
Coordination Office, Arlington, VA, 1999.
SRP: a Scalable Resource Reservation Protocol for the Internet
Werner Almesberger 1, Tiziana Ferrari 2, and Jean-Yves Le Boudec 1
1 EPFL ICA, INN (Ecublens), CH-1015 Lausanne, Switzerland
2 DEIS, University of Bologna, viale Risorgimento, 2, I-40136 Bologna, Italy; and
Italian National Inst. for Nuclear Physics/CNAF, viale Berti Pichat, 6/2,
I-40127 Bologna, Italy
Abstract. The Scalable Reservation Protocol (SRP) provides a light-weight
reservation mechanism for adaptive multimedia applications. Our main focus is on good
scalability to very large numbers of individual flows. End systems (i.e. senders and
destinations) actively participate in maintaining reservations, but routers can still
control their conformance. Routers aggregate flows and monitor the aggregate to
estimate the local resources needed to support present and new reservations. There
is neither explicit signaling of flow parameters, nor do routers maintain per-flow
state.
1 Introduction
Many adaptive multimedia applications [1] require a well-defined fraction of
their traffic to reach the destination and to do so in a timely way. We call this
fraction the minimum rate these applications need in order to operate
properly. SRP aims to allow such applications to make a dependable reservation
of their minimum rate.
The sender can expect that, as long as it adheres to the agreed-upon
profile, no reserved packets will be lost due to congestion. Furthermore,
forwarding of reserved packets will have priority over best-effort traffic.
Traditional resource reservation architectures that have been proposed for
integrated service networks (RSVP [2], ST-2 [3], Tenet [4], ATM [5,6], etc.)
all have in common that intermediate systems (routers or switches) need to
store per-flow state information. The more recently designed Differentiated
Services architecture [7] offers improved scalability by aggregating flows and
by maintaining state information only for such aggregates. SRP extends upon
simple aggregation by providing a means for reserving network resources in
routers along the paths flows take.
Recently, hybrid approaches combining RSVP and Differentiated Services
have been proposed (e.g. [8]) to overcome the scalability problems of RSVP.
Unlike SRP, which runs end-to-end, they require a mapping of the INTSERV
services onto the underlying Differentiated Services network, and a means to
tunnel RSVP signaling information through network regions where QoS is
provided using Differentiated Services.
Reservation mechanism. In short, our reservation model works as follows.
A source that wishes to make a reservation starts by sending data packets
marked as request packets to the destination. Packets marked as request are
subject to packet admission control by routers, based on the following
principle. Routers monitor the aggregate flows of reserved packets and maintain
a running estimate of what level of resources is required to serve them with
a good quality of service. The resources are bandwidth and buffer on
outgoing links, plus any internal resources as required by the router architecture.
Quality of service is loss ratio and delay, and is defined statically. When
receiving a request packet, a router determines whether hypothetically adding
this packet to the flow of reserved packets would yield an acceptable value of
the estimator. If so, the request packet is accepted and forwarded towards the
destination, while still keeping the status of a request packet; the router must
also update the estimator as if the packet had been received as reserved. In
the opposite case, the request packet is degraded and forwarded towards the
destination, and the estimator is not updated. Degrading a request packet
means assigning it a lower traffic class, such as best-effort. A packet sent as
request will reach the destination as request only if all routers along the path
have accepted the packet as request. Note that the choice of an estimation
method is local to a router and actual estimators may differ in their principle
of operation.
The destination periodically sends feedback to the source indicating the
rate at which request and reserved packets have been received. This
feedback does not receive any special treatment in the network (except possibly
for policing, see below). Upon reception of the feedback, the source can send
packets marked as reserved according to a profile derived from the rate
indicated in the feedback. If necessary, the source may continue to send more
request packets in an attempt to increase the rate that will be indicated in
subsequent feedback messages.
Thus, in essence, a router that agrees to forward a request packet as request
allows the source to send more reserved packets in the future; it is thus a
form of implicit reservation.
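The admission rule described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the names
WindowEstimator and admit are hypothetical and do not come from SRP, and the
estimator shown (a byte count checked against a fixed budget) merely stands in
for whatever estimation method a router actually uses.

```python
from dataclasses import dataclass
from enum import Enum


class PacketType(Enum):
    REQUEST = "request"
    RESERVED = "reserved"
    BEST_EFFORT = "best-effort"


@dataclass
class Packet:
    ptype: PacketType
    size_bytes: int


class WindowEstimator:
    """Hypothetical estimator: bytes of reserved-class traffic counted in
    the current measurement window, checked against a fixed byte budget."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.used_bytes = 0

    def would_accept(self, size_bytes: int) -> bool:
        # Hypothetically add the packet and test the resulting estimate.
        return self.used_bytes + size_bytes <= self.budget_bytes

    def update(self, size_bytes: int) -> None:
        self.used_bytes += size_bytes


def admit(estimator: WindowEstimator, packet: Packet) -> Packet:
    """Packet admission control for one aggregate, without per-flow state."""
    if packet.ptype is PacketType.RESERVED:
        # Reserved packets always update the estimate and pass unchanged.
        estimator.update(packet.size_bytes)
        return packet
    if packet.ptype is PacketType.REQUEST:
        if estimator.would_accept(packet.size_bytes):
            # Accepted: keep the request label, account for it as reserved.
            estimator.update(packet.size_bytes)
            return packet
        # Rejected: degrade to best-effort, estimator is left untouched.
        return Packet(PacketType.BEST_EFFORT, packet.size_bytes)
    return packet
```

Note that the router keeps one counter per aggregate only; nothing in the
sketch identifies the flow a packet belongs to.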
Aggregation. Routers aggregate flows on output ports, and possibly on any
contention point as required by their internal architecture. They use
estimator algorithms for each aggregated flow to determine their current reservation
levels and to predict the impact of accepting request packets. The exact
definition of what constitutes an aggregated flow is local to a router.
Likewise, senders and receivers treat all flows between each pair of them
as a single aggregate and use estimator algorithms for characterizing them.
The estimator algorithms in routers and hosts do not need to be the same. In
fact, we expect hosts to implement a fairly simple algorithm, while estimator
algorithms in routers may evolve independently over time.
Fairness and security. Denial-of-service conditions may arise if flows can
reserve disproportionate amounts of resources or if flows can exceed their
reservations. We presently consider fairness in accepting reservations a local policy
issue (much like billing) which may be addressed at a future time.
Sources violating the agreed-upon reservations are a real threat and need
to be policed. A scalable policing mechanism to allow routers to identify
non-conformant flows based on certain heuristics is the subject of ongoing
research. Such a mechanism can be combined with more traditional approaches,
such as policing of individual flows at locations where scalability is less
important, e.g. at network edges.
The rest of this paper is organized as follows. Section 2 provides a more
detailed protocol overview. Section 3 describes a simple algorithm for the
implementation of the traffic estimator. Finally, protocol operation is
illustrated with some simulation results in section 4 and the paper concludes with
section 5.
2 Architecture overview
The proposed architecture uses two protocols to manage reservations: a
reservation protocol to establish and maintain them, and a feedback protocol to
inform the sender about the reservation status.
[Figure 1: sender, router, and receiver, with data and reservations flowing from sender to receiver.]
Fig. 1. Overview of the components in SRP.
Figure 1 illustrates the operation of the two protocols:
• Data packets with reservation information are sent from the sender to the
  receiver. The reservation information consists of a packet type which can
  take three values, one of them being ordinary best-effort (section 2.2). It
  is processed by routers, and may be modified by routers. Routers may
  also discard packets (section 2.1).
• The receiver sends feedback information back to the sender. Routers only
  forward this information; they don't need to process it (section 2.3).
Routers monitor the reserved traffic that is actually present and adjust
their global state information accordingly. Sections 2.1 to 2.3 illustrate the
reservation and feedback protocol.
2.1 Reservation protocol
The reservation protocol is used in the direction from the sender to the re-
ceiver. It is implemented by the sender, the receiver, and the routers between
them. As mentioned earlier, the reservation information is a packet type
which may take three values:
Request: This packet is part of a flow which is trying to gain reserved status.
  Routers may accept, degrade or reject such packets. When routers accept
  some request packets, then they commit to accept in the future a flow of
  reserved packets at the same rate. The exact definition of the rate is part
  of the estimator module.
Reserved: This label identifies packets which are inside the source's profile
  and are allowed to make use of the reservation previously established by
  request packets. Given a correct estimation, routers should never discard
  reserved packets because of resource shortage.
Best effort: No reservation is attempted by this packet.
Packet types are initially assigned by the sender, as shown in figure 2.
A traffic source (i.e. the application) specifies for each packet if that packet
needs a reservation. If no reservation is necessary, the packet is simply sent as
best-effort. If a reservation is needed, the protocol entity checks if an already
established reservation at the source covers the current packet. If so, the
packet is sent as reserved, otherwise an additional reservation is requested by
sending the packet as request.
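[Figure 2: decision flow. The application marks each packet as needing or not needing a reservation. For a packet that needs one, the protocol stack checks whether a reservation is established: if yes, the packet is sent as reserved; if no, as request. A packet that needs no reservation is sent as best effort.]
Fig. 2. Initial packet type assignment by sender.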
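The assignment rule of figure 2 can be sketched as follows. The sketch is an
assumption-laden illustration, not the protocol entity itself: SenderProfile
and classify are hypothetical names, and "an already established reservation
covers the current packet" is modelled simply as a remaining byte allowance
for the current interval.

```python
from enum import Enum


class PacketType(Enum):
    REQUEST = "request"
    RESERVED = "reserved"
    BEST_EFFORT = "best-effort"


class SenderProfile:
    """Hypothetical per-destination profile: bytes that may still be sent
    as reserved during the current interval."""

    def __init__(self, reserved_bytes_per_interval: int) -> None:
        self.allowance = reserved_bytes_per_interval

    def covers(self, size_bytes: int) -> bool:
        return size_bytes <= self.allowance

    def consume(self, size_bytes: int) -> None:
        self.allowance -= size_bytes


def classify(profile: SenderProfile, size_bytes: int,
             needs_reservation: bool) -> PacketType:
    """Initial packet type assignment at the sender."""
    if not needs_reservation:
        return PacketType.BEST_EFFORT
    if profile.covers(size_bytes):
        # Inside the established reservation: send as reserved.
        profile.consume(size_bytes)
        return PacketType.RESERVED
    # Outside the profile: ask for additional reservation.
    return PacketType.REQUEST
```

With an allowance of 2000 bytes, a first 1500-byte packet that needs a
reservation goes out as reserved and the next one as request.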
Each router performs two processing steps (see also figure 3). First, for
each request and reserved packet the estimator updates its current estimate of
the resources used by the aggregate flows and decides whether to accept the
packet (packet admission control). Then, packets are processed by various
schedulers and queue managers inside the router.
• When a reserved packet is received, the estimator updates the resource
  estimation. The packet is automatically forwarded unchanged to the
  scheduler, where it will have priority over best-effort traffic and normally is not
  discarded.
• When a request packet is received, the estimator checks whether
  accepting the packet would exceed the available resources. If the packet
  can be accepted, its request label is not modified. If the packet cannot be
  accepted, then it is degraded to best-effort.
• If a scheduler or queue manager cannot accept a reserved or request
  packet, then the packet is either discarded or downgraded to best-effort.
[Figure 3: packet processing in two stages. SRP estimator: for a reserved packet, update the estimated bandwidth; for a request packet, test whether an update of the estimated bandwidth is acceptable (yes: the packet stays request; no: it becomes best effort). Packet scheduler: a reserved or request packet keeps its type if it can be scheduled in the reserved service class; a packet handled as best effort is forwarded if it can be scheduled in the best-effort class and is discarded otherwise.]
Fig. 3. Packet processing by routers.
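The second processing step, the scheduler, might look like the sketch below.
It assumes two bounded FIFO queues served with strict priority, and it chooses
to downgrade rather than discard a reserved or request packet that does not
fit in the reserved class; both choices, like the name TwoClassScheduler, are
assumptions made for illustration only.

```python
from collections import deque
from enum import Enum
from typing import Optional


class PacketType(Enum):
    REQUEST = "request"
    RESERVED = "reserved"
    BEST_EFFORT = "best-effort"


class TwoClassScheduler:
    """Hypothetical scheduler with one queue per service class."""

    def __init__(self, reserved_slots: int, best_effort_slots: int) -> None:
        self.reserved_slots = reserved_slots
        self.best_effort_slots = best_effort_slots
        self.reserved_queue: deque = deque()
        self.best_effort_queue: deque = deque()

    def enqueue(self, ptype: PacketType) -> Optional[PacketType]:
        """Return the type under which the packet was queued, or None if
        the packet had to be discarded."""
        wants_reserved = ptype in (PacketType.RESERVED, PacketType.REQUEST)
        if wants_reserved and len(self.reserved_queue) < self.reserved_slots:
            self.reserved_queue.append(ptype)
            return ptype
        # Best-effort packet, or downgrade of a reserved/request packet.
        if len(self.best_effort_queue) < self.best_effort_slots:
            self.best_effort_queue.append(PacketType.BEST_EFFORT)
            return PacketType.BEST_EFFORT
        return None

    def dequeue(self) -> Optional[PacketType]:
        """Strict priority: the reserved class is always served first."""
        if self.reserved_queue:
            return self.reserved_queue.popleft()
        if self.best_effort_queue:
            return self.best_effort_queue.popleft()
        return None
```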
Note that the reservation protocol may "tunnel" through routers that
don't implement reservations. This allows the use of unmodified equipment
in parts of the network which are dimensioned such that congestion is not a
problem.
2.2 Packet type encoding
RFC2474 [9] defines the use of an octet in the IPv4 and IPv6 header for
Differentiated Services (DS). This field contains the DS Code Point (DSCP),
which determines how the respective packet is to be treated by routers
(Per-Hop Behaviour, PHB). Routers are allowed to change the content of a packet's
DS field (e.g. to select a different PHB).
As illustrated in figure 4, SRP packet types can be expressed by
introducing two new PHBs (for request and for reserved), and by using the pre-defined
DSCP value 0 for best-effort. DSCP values for request and reserved can be
allocated locally in each DS domain.
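As a small illustration of this encoding, the sketch below places a DSCP in
the six high-order bits of the DS octet, as RFC2474 specifies. The two SRP
code points are hypothetical values picked here from the xxxx11 pool that
RFC2474 sets aside for experimental or local use; a real DS domain would
allocate its own.

```python
DSCP_BEST_EFFORT = 0b000000   # pre-defined default code point
DSCP_SRP_REQUEST = 0b000011   # hypothetical, locally allocated
DSCP_SRP_RESERVED = 0b000111  # hypothetical, locally allocated


def ds_octet(dscp: int, low_bits: int = 0) -> int:
    """Build the DS octet: DSCP in the upper six bits, and the two
    remaining low-order bits passed through unchanged."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP must fit in six bits")
    if not 0 <= low_bits < 4:
        raise ValueError("low_bits must fit in two bits")
    return (dscp << 2) | low_bits


def dscp_of(octet: int) -> int:
    """Extract the DSCP from a DS octet."""
    return (octet >> 2) & 0x3F


def degrade(octet: int) -> int:
    """Rewrite a packet's DS octet to best-effort, as a router does when
    it degrades a request packet; the two low-order bits are kept."""
    return ds_octet(DSCP_BEST_EFFORT, octet & 0x03)
```

Degrading a packet is then a single rewrite of the DS octet, which matches the
observation that routers are allowed to change the content of the DS field.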
[Figure 4: PHB-to-DSCP mapping (Default = 000000; SRP Request and SRP Reserved each map to a locally allocated code point) and the position of the DS field (the TOS octet) in the IPv4 header.]
Fig. 4. Packet type encoding using Differentiated Services (IPv4 example).
2.3 Feedback protocol
The feedback protocol is used to convey information on the success of
reservations and on the network status from the receiver to the sender. Unlike the
reservation protocol, the feedback protocol does not need to be interpreted by
routers, because they can determine the reservation status from the sender's
choice of packet types.
Feedback information is collected by the receiver and is periodically
sent to the sender. The feedback consists of the number of bytes in request
and reserved packets that have reached the receiver, and the local time at
the receiver at which the feedback message was generated.
Receivers collect feedback information independently for each sender and
senders maintain the reservation state independently for each receiver. Note
that, if more than one flow to the same destination exists, attribution of
reservations is a local decision at the source.
[Figure 5: feedback message layout with the fields t0, t, Num REQ (t0), Num REQ (t), Num RSV (t0), and Num RSV (t), plus reserved portions.]
Fig. 5. Feedback message format.
Figure 5 illustrates the content of a feedback message: the time when the
message was generated (t), and the number of bytes in request and reserved
packets received at the destination (REQ and RSV). All counters wrap back
to zero when they overflow.
In order to improve tolerance to packet loss, the information sent in the
previous feedback message (at time t0) is also repeated. Portions of the
message are reserved to allow for future extensions.
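The wrap-around rule matters when a sender turns two counter samples into a rate. The following is a minimal sketch of that computation, not the authors' implementation: the 32-bit counter width, the `Feedback` class and the function names are assumptions made for illustration, since the text does not give field sizes.

```python
from dataclasses import dataclass
from typing import Tuple

COUNTER_BITS = 32            # assumption: the counter width is not stated in the text
MODULUS = 1 << COUNTER_BITS


@dataclass
class Feedback:
    """One feedback message: the current sample plus the repeated previous one."""
    t: float     # receiver-local time of this message
    req: int     # bytes received in request packets (wrapping counter)
    rsv: int     # bytes received in reserved packets (wrapping counter)
    t0: float    # receiver-local time of the previous message
    req0: int    # request byte counter at t0
    rsv0: int    # reserved byte counter at t0


def wrapped_delta(new: int, old: int) -> int:
    """Difference between two samples of a counter that wraps back to zero."""
    return (new - old) % MODULUS


def rates(fb: Feedback) -> Tuple[float, float]:
    """Return the (request, reserved) byte rates observed between t0 and t."""
    dt = fb.t - fb.t0
    if dt <= 0:
        raise ValueError("feedback timestamps must increase")
    return (wrapped_delta(fb.req, fb.req0) / dt,
            wrapped_delta(fb.rsv, fb.rsv0) / dt)
```

Because each message carries both samples, a sender can compute these rates from a single message even if the previous one was lost. The sketch treats the timestamps as plain numbers and does not handle a wrapping time field.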
2.4 Shaping at the sender
The sender decides whether packets are sent as reserved or request based
on its own estimate of the reservation it has requested and on the level of
reservation along the path that has been confirmed via the feedback protocol.
A source always uses the minimum of these two parameters to determine the
appropriate output traffic profile.
Furthermore, the sender needs to filter out small differences between
the actual reservation and the feedback in order to avoid reservations from
drifting, and it must also ensure that request packets do not interfere with
congestion-controlled traffic (e.g. TCP) in an unfair way [10].
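The minimum rule and the filtering of small differences can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `previous` argument and the 5% relative tolerance are hypothetical, as the text does not specify how the filter is built.

```python
def reserved_rate(local_estimate: float, confirmed: float,
                  previous: float, tolerance: float = 0.05) -> float:
    """Rate at which the sender may emit reserved packets.

    local_estimate: the sender's own estimate of what it has requested
    confirmed:      the reservation level reported by the feedback protocol
    previous:       the rate currently in use (0 if none yet)
    tolerance:      hypothetical relative dead band against drifting
    """
    # The source always uses the minimum of the two parameters.
    candidate = min(local_estimate, confirmed)
    # Ignore small differences so that the reservation does not drift.
    if previous > 0 and abs(candidate - previous) <= tolerance * previous:
        return previous
    return candidate
```

Traffic above the returned rate would still have to be sent as request packets, subject to the fairness constraint towards congestion-controlled traffic, which this sketch does not model.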
2.5 Example
Figure 6 provides the overall picture of the reservation and feedback
protocols for two end-systems connected through routers R1 and R2. The
initial resource acquisition phase is followed by the generation of request
packets after the first feedback message arrives. Dotted arrows correspond
to degraded request packets, which passed the admission control test at
router R1 but could not be accepted at router R2 because of resource
shortage. Degradation of requests is taken into account by the feedback
protocol. After receiving the feedback information the source sends reserved
packets at an appropriate rate, which is smaller than the one at which
request packets were generated.
[Figure 6: message sequence showing REQUEST data packets (e.g. at 2 Mbps),
feedback with trafSpec = 1 Mbps, and RESERVED data packets. Legend: REQUEST
packet, RESERVED packet, degraded REQUEST packet, feedback packet
(BEST-EFFORT).]
Fig. 6. Reservation and feedback protocol diagram.
2.6 Multicast
In order to support multicast traffic, we have proposed a design that slightly
extends the reservation mechanism described in this section. Refinement of
this design is still the subject of ongoing work. A detailed description of
the proposed mechanism can be found in [11].
3 Estimation modules
[Figure 7: labels recoverable from the figure: session flow scheduling;
enforce maximum rate; control future rate; guarantee RESERVED packet
admission; control admission of REQUEST packets; estimate reserved rate;
schedule sending of feedback.]
Fig. 7. Use of estimators at senders, routers, and receivers
We call estimator the algorithm which attempts to calculate the amount
of resources that need to be reserved. The estimation measures the number
of requests sent by sources and the number of reserved packets which actually
make use of the reservation.
Estimators are used for several functions.
• Senders use the estimator for an optimistic prediction of the reservation
  the network will perform for the traffic they emit. This, in conjunction
  with feedback received from the receiver, is used to decide whether to
  send request or reserved packets.
• Routers use the estimator for packet-wise admission control and perhaps
  also to detect anomalies.
• In receivers, the estimator is fed with the received traffic and it
  generates an estimate of the reservation at the last router. This is used
  to schedule the sending of feedback messages to the source.
Figure 7 shows how the estimator algorithm is used in all network elements.
As described in section 2.1, a sender keeps on sending requests until
successful reservation setup is indicated with a feedback packet, i.e.
possibly even after the desired amount of resources has been reserved in the
network. It is the feedback returned to the sender that indicates the
allocation actually obtained on the path. When the source is
feedback-compliant, the routers on the path start releasing a part of the
over-estimated reservation already allocated. The feedback that is returned
to the sender may also show an increased number of requests. The sender must
not interpret those requests as a direct increase of the reservation.
Instead, the sender estimator must correct the feedback information
accordingly, which is achieved by computing the minimum of the feedback and
of the resource amount requested by the source.
Our architecture is independent of the specific algorithm used to implement
the estimator. Sections 3.1 and 3.2 describe two different solutions. The
definition and evaluation of algorithms for reservation calculation in hosts
and routers is still ongoing work. A detailed analysis of the estimation
algorithms and additional improvements can be found in [12].
3.1 Basic estimation algorithm
The basic algorithm we present here is suitable for sources and destinations,
and could be used as a rough estimator by routers. This estimator counts
the number of requests it receives (and accepts) during a certain observation
interval and uses this as an estimate for the bandwidth that will be used in
future intervals of the same duration.
In addition to requests for new reservations, the use of existing
reservations needs to be measured too. This way, reservations of sources that
stop sending or that decrease their sending rate can automatically be removed.
For this purpose the use of reservations can simply be measured by counting
the number of reserved packets that are received in a certain interval.
To compensate for deviations caused by delay variations, spurious packet
loss (e.g. in a best-effort part of the network), etc., reservations can be
"held" for more than one observation interval. This can be accomplished by
remembering the observed traffic over several intervals and using the maximum
of these values (step 3 of the following algorithm). Given a hold time of h
observation intervals, the maximum amount of resources that can be allocated,
Max, and the counters res and req (the total number of reserved and request
bytes received in a given observation interval), the reservation R (in bytes)
is computed by a router as follows. Given a packet of n bytes:
    if (packet_type == REQ)
        if (R + req + n < Max) {
            accept;
            req = req + n;    // step 1
        }
        else degrade;
    if (packet_type == RES)
        if (res + n < R) {
            accept;
            res = res + n;    // step 2
        }
        else degrade;
where initially R, res, req = 0. At the end of each observation cycle the
following steps are computed:
    for (i = h; i > 1; i--) R[i] = R[i-1];
    R[1] = res + req;
    R = max(R[h], R[h-1], ..., R[1]);    // step 3
    res = req = 0;
The same algorithm can be run by the destination with the only difference
that no admission checks are needed.
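The pseudocode above translates almost line by line into a small class. The sketch below is one possible rendering, not the authors' code: the class name, the string packet types "REQ" and "RES", and the `admission` flag (set to False for a destination) are naming assumptions; the comparisons and steps 1 to 3 follow the text.

```python
from collections import deque


class BasicEstimator:
    """Basic estimator with a hold time of `hold` observation intervals."""

    def __init__(self, max_bytes, hold, admission=True):
        self.max_bytes = max_bytes      # Max: most bytes that may be allocated
        self.admission = admission      # False at a destination (no checks)
        # R[1..h]: traffic observed in the last h intervals, newest first.
        self.history = deque([0] * hold, maxlen=hold)
        self.R = 0                      # current reservation in bytes
        self.res = 0                    # reserved bytes seen in this interval
        self.req = 0                    # request bytes accepted in this interval

    def on_packet(self, kind, n):
        """Return True if the packet is accepted, False if it is degraded."""
        if kind == "REQ":
            if not self.admission or self.R + self.req + n < self.max_bytes:
                self.req += n           # step 1
                return True
            return False
        if kind == "RES":
            if not self.admission or self.res + n < self.R:
                self.res += n           # step 2
                return True
            return False
        raise ValueError("unknown packet type: %r" % (kind,))

    def end_of_interval(self):
        """Shift the history, recompute R as its maximum, reset the counters."""
        self.history.appendleft(self.res + self.req)  # oldest entry drops out
        self.R = max(self.history)                    # step 3
        self.res = self.req = 0
        return self.R
```

With `hold=2`, a burst of 800 accepted request bytes keeps R at 800 for two intervals, after which R follows the lower observed traffic. This is the "held" reservation behaviour described in the text.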
Examples of the operation of the basic algorithm are shown in section 4.1.
This simple algorithm presents several problems. First of all, the choice of
the right value of the observation interval is critical and difficult. Small
values make the estimation dependent on bursts of reserved or request packets
and cause an overestimation of the resources needed. On the other hand, large
intervals make the estimator react slowly to changes in the traffic profile.
Second, the strictness of traffic acceptance control is fixed, while
adaptivity would be highly desirable in order to make the allocation of new
resources stricter as the amount of resources reserved gets closer to the
maximum.
These problems can be solved by devising an adaptive enhanced algorithm
like the one described in the following section.
3.2 Enhanced estimation algorithm
Instead of using the same estimator in every network component, we can
enhance the previous approach so that senders and receivers still run the
simple algorithm described above, while routers implement an improved
estimator.
[Figure 8: block diagram with a frequently updated part and an infrequently
updated part. Labels recoverable from the figure: RESERVED and accepted
REQUEST packets; effective bandwidth; estimate before smoothing;
exponentially weighted average; correction β; virtual queue; service rate.]
Fig. 8. Schematic design of an adaptive estimator.
We describe an example algorithm in detail in [11]. It consists of the
principal components illustrated in figure 8: the effective bandwidth used by
reserved and accepted request packets is measured and then smoothed by
calculating an exponentially weighted average (γ). This calculation is
performed for every single packet.
The estimate γ is multiplied with a correction factor β in order to correct
for systematic errors in the estimation. Packets are added to a virtual queue
(i.e. a counter), which is emptied at the estimated rate. If the estimate is
too high, the virtual queue shrinks. If the estimate is too low, the virtual
queue grows. Based on the size of the virtual queue, β can be adjusted.
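The interplay of γ, β and the virtual queue can be illustrated with a short sketch. It is not the algorithm of [11]: the smoothing weight `alpha`, the step size, the queue threshold, the clamping of β and the per-packet β update are all assumptions chosen only to make the control loop concrete.

```python
class AdaptiveEstimator:
    """Sketch of an adaptive estimator; all tuning values are hypothetical."""

    def __init__(self, alpha=0.1, step=0.05, high_water=1000.0,
                 beta_min=0.5, beta_max=2.0):
        self.alpha = alpha              # EWMA weight (assumption)
        self.step = step                # beta adjustment step (assumption)
        self.high_water = high_water    # queue threshold in bytes (assumption)
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.gamma = 0.0                # smoothed effective bandwidth, bytes/s
        self.beta = 1.0                 # correction factor
        self.vq = 0.0                   # virtual queue, i.e. a byte counter
        self.last = None                # arrival time of the previous packet

    def estimate(self):
        """Corrected estimate: gamma multiplied by beta."""
        return self.gamma * self.beta

    def on_packet(self, now, n):
        """Account for a reserved or accepted request packet of n bytes."""
        if self.last is not None and now > self.last:
            dt = now - self.last
            # Empty the virtual queue at the previously estimated rate.
            self.vq = max(0.0, self.vq - self.estimate() * dt)
            # Smooth the measured bandwidth, once per packet.
            self.gamma += self.alpha * (n / dt - self.gamma)
            if self.vq == 0.0:
                # Queue ran empty: the estimate was too high.
                self.beta = max(self.beta_min, self.beta - self.step)
        self.last = now
        self.vq += n
        if self.vq > self.high_water:
            # Queue keeps growing: the estimate is too low.
            self.beta = min(self.beta_max, self.beta + self.step)
        return self.estimate()
```

Figure 8 distinguishes frequently and infrequently updated parts; for brevity this sketch adjusts β on every packet, which a real router would probably avoid.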
4 Simulation
Section 4.1 provides a theoretical description of the behavior of the
reservation mechanism in a very simple example, while section 4.2 shows the
simulated behavior of the proposed architecture.
4.1 Reservation example
The network we use to illustrate the operation of the reservation mechanism
is shown in figure 9: the sender sends over a delay-less link to the router,
which performs the reservation and forwards the traffic over a link with a
delay of two time units to the receiver. The receiver periodically returns
feedback to the sender.
The sender and the receiver both use the basic estimator algorithm described
in section 3.1. The router may, and typically will, use a different
algorithm (e.g. the one described in section 3.2).
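[Figure 9: sender, router and receiver connected in a chain; the
sender-router link has a delay of 0 time units and the router-receiver link a
delay of 2 time units. The feedback carries the local estimate and
reservation.]
Fig. 9. Example network configuration.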
The bandwidth estimate at the source and the reservation that has been
acknowledged in a feedback message from the receiver are measured. In figure
10, they are shown with a thin continuous line and a thick dashed line,
respectively. The packets emitted by the source are indicated by arrows on
the reservation line. A full arrow head corresponds to request packets, an
empty arrow head corresponds to reserved packets. For simplicity, the sender
and the receiver use exactly the same observation interval in this example,
and the feedback rate is constant.
The source sends one packet per time unit. First, the source can only
send requests and the router reserves some resources for each of them. At
point (1), the estimator discovers that it has established a reservation for
six packets in four time units, but that the source has only sent four
packets in this interval. Therefore, it corrects its estimate and proceeds.
The first feedback message reaches the sender at point (2). It indicates a
reservation level of five packets in four time units (i.e. the estimate at
the receiver at the time when the feedback was sent), so the sender can now
send reserved packets instead of requests. At point (3), the next observation
interval ends and the estimate is corrected once more. Finally, the second
feedback arrives at point (4), indicating the final rate of four packets in
four time units. The reservation does not change after that.
[Figure 10: bandwidth reservation estimate at the sender and reservation
indicated in feedback (at sender), plotted over time, with points (1) to (4),
the observation interval at the sender, RTT/2, RTT + feedback cycle and the
feedback cycle marked; arrows denote request and reserved packets.]
Fig. 10. Basic estimator example.
4.2 Simulation results
The network configuration used for the simulation is shown in figure 11.¹
The grey paths mark flows we examine below.
Fig. 11. Configuration of the simulated network.
There are eight routers (labeled R1...R8) and 24 hosts (labeled 1...24).
Each of the hosts 1...12 tries occasionally to send to any of the hosts
13...24. Connection parameters are chosen such that the average number of
concurrently active sources sending via the R1-R2 link is approximately
fifty. Flows have an on-off behaviour, where the on and off times are
randomly chosen from the intervals [5, 15] and [0, 30] seconds, respectively.
The bandwidth of a flow remains constant while the flow is active and is
chosen randomly from the interval [1, 200] packets per second.
¹ The programs and configuration files used for the simulation are available
on http://ircwww.epfl.ch/srp/
All links in the network have a bandwidth of 4000 packets per second and
a delay of 15 ms.² We allow up to 90% of the link capacity to be allocated
to reserved traffic. The link between R1 and R2 is a bottleneck, which can
only handle about 72% of the offered traffic. The delay objective D of each
queue is 10 ms. The queue size per link is limited to 75 packets.
[Figure 12: total offered traffic and estimated reservation at R1, plotted
over time (s).]
Fig. 12. Estimation and actual traffic at R1 towards R2.
[Figure 13: real queue size at R1; x-axis: queue length (packets).]
Fig. 13. Queue length at R1 on the link towards R2.
Figure 12 shows the R1-R2 link as seen from R1. We show the total
offered rate, the estimated reservation (γ·β) and the smoothed actual rates
of request and reserved packets. Figure 13 shows the behaviour of the real
queue. The system succeeds in limiting queuing delays to approximately the
delay goal of 10 ms, which corresponds to a queue size of 40 packets. The
queue limit of 75 packets is never reached.
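The correspondence between the delay goal and the queue size follows directly from the link rate given above. A small sketch of the arithmetic, with hypothetical function names and queue sizes counted in packets:

```python
def queue_packets_for_delay(link_rate_pps, delay_s):
    """Queue length (packets) that a link drains in delay_s seconds."""
    return link_rate_pps * delay_s


def queuing_delay(queue_packets, link_rate_pps):
    """Time (seconds) needed to drain queue_packets at the link rate."""
    return queue_packets / link_rate_pps


LINK_RATE_PPS = 4000     # link bandwidth from the simulation setup
DELAY_GOAL_S = 0.010     # delay objective D of 10 ms
QUEUE_LIMIT = 75         # queue size limit per link, in packets

goal_queue = queue_packets_for_delay(LINK_RATE_PPS, DELAY_GOAL_S)  # about 40
limit_delay = queuing_delay(QUEUE_LIMIT, LINK_RATE_PPS)            # about 18.75 ms
reservable_pps = 0.90 * LINK_RATE_PPS                              # about 3600
```

A full queue of 75 packets would thus add roughly 18.75 ms, almost twice the delay goal, which is why staying near 40 packets matters.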
[Figure 14: RESERVED packets arriving at host 15, estimation at host 15 and
reservation desired by host 4, plotted over time (s).]
Fig. 14. End-to-end reservation from host 4 to host 15.
[Figure 15: RESERVED packets arriving at host 19, estimation at host 19 and
reservation desired by host 4, plotted over time (s).]
Fig. 15. End-to-end reservation from host 4 to host 19.
Finally, we examine some end-to-end flows. Figure 14 shows a successful
reservation of 84 packets per second from host 4 to 15. The requested rate,
the estimation at the destination, and the (smoothed) rate of reserved
packets are shown. Similarly, figure 15 shows the same data for a less
successful reservation host 4 attempts later to 19, at a time when the
offered traffic is almost twice as high as the bandwidth available at the
bottleneck.³
During the entire simulated interval of 50 seconds, 3'368 request packets
and 164'723 reserved packets were sent from R1 to R2. This is 83% of the
bandwidth of that link.
² Small random variations were added to link bandwidth and delay to prevent
the entire network from being perfectly synchronized.
5 Conclusion
We have proposed a new scalable resource reservation architecture for the
Internet. Our architecture achieves scalability for a large number of
concurrent flows by aggregating flows at each link. This aggregation is made
possible by delegating certain traffic control decisions to end systems, an
idea borrowed from TCP. Reservations are controlled with estimation
algorithms, which predict future resource usage based on previously observed
traffic. Furthermore, protocol processing is simplified by attaching the
reservation control information directly to data packets.
We did not present a conclusive specification but rather described the
general concepts, gave examples for implementations of core elements,
including the design of estimator algorithms for sources, destinations and
routers, and showed some illustrative simulation results. Further work will
focus on completing the specification, on evaluating and improving the
algorithms described in this paper, and finally on the implementation of a
prototype.
³ In this simulation, sources did not back off if a reservation progressed
too slowly.
References
1. Diot, Christophe; Huitema, Christian; Turletti, Thierry. Multimedia
   Applications should be Adaptive, f t p ://www. f r l r o d e o l d i o t / n c a - h p c s , ps . gz,
   HPCS'95 Workshop, August 1995.
2. RFC2205; Braden, Bob (Ed.); Zhang, Lixia; Berson, Steve; Herzog, Shai;
   Jamin, Sugih. Resource ReSerVation Protocol (RSVP) - Version 1 Functional
   Specification, IETF, September 1997.
3. RFC1819; Delgrossi, Luca; Berger, Louis. ST2+ Protocol Specification, IETF,
   August 1995.
4. Ferrari, Domenico; Banerjea, Anindo; Zhang, Hui. Network Support for
   Multimedia - A Discussion of the Tenet Approach, Computer Networks and
   ISDN Systems, vol. 26, pp. 1267-1280, 1994.
5. The ATM Forum, Technical Committee. ATM User-Network Interface (UNI)
   Signalling Specification, Version 4.0,
   ftp://ftp.atmforum.com/pub/approved-specs/af-sig-0061.000.ps,
   The ATM Forum, July 1996.
6. The ATM Forum, Technical Committee. ATM Forum Traffic Management
   Specification, Version 4.0,
   ftp://ftp.atmforum.com/pub/approved-specs/af-tm-0056.000.ps, April 1996.
7. RFC2475; Blake, Steven; Black, David; Carlson, Mark; Davies, Elwyn; Wang,
   Zheng; Weiss, Walter. An Architecture for Differentiated Services, IETF,
   December 1998.
8. Bernet, Yoram; Yavatkar, Raj; Ford, Peter; Baker, Fred; Zhang, Lixia;
   Speer, Michael; Braden, Bob; Davie, Bruce. Integrated Services Operation
   Over Diffserv Networks (work in progress), Internet Draft
   draft-ietf-issll-diffserv-rsvp-02.txt, June 1999.
9. RFC2474; Nichols, Kathleen; Blake, Steven; Baker, Fred; Black, David.
   Definition of the Differentiated Services Field (DS Field) in the IPv4 and
   IPv6 Headers, IETF, December 1998.
10. Floyd, Sally; Mahdavi, Jamshid. TCP-Friendly Unicast Rate-Based Flow
    Control, http://www.psc.edu/networking/papers/tcp_friendly.html,
    Technical note, January 1997.
11. Almesberger, Werner; Ferrari, Tiziana; Le Boudec, Jean-Yves. SRP: a
    Scalable Resource Reservation Protocol for the Internet, Proceedings of
    IWQoS'98, pp. 107-116, IEEE, May 1998.
12. Ferrari, Tiziana. QoS Support for Integrated Networks,
    http://www.cnaf.infn.it/~ferrari/tesidot.html, Ph.D. thesis,
    November 1998.
Differentiated Internet Services
Florian Baumgartner, Torsten Braun, Hans Joachim Einsiedler and
Ibrahim Khalil
Institute of Computer Science and Applied Mathematics
University of Berne, CH-3012 Bern, Switzerland,
Tel +41 31 631 8681 / Fax +41 31 631 39 65
http://www.iam.unibe.ch/~rvs/
Abstract. With the growing popularity of the Internet and the increasing use of
business and multimedia applications the users' demand for higher and more pre-
dictable quality of service has risen. A first improvement to offer better than best-
effort services was made by the development of the integrated services architecture
and the RSVP protocol. But this approach proved only suitable for smaller IP
networks and not for Internet backbone networks. In order to solve this problem
the concept of differentiated services has been discussed in the IETF, setting up a
working group in 1997. The Differentiated Services Working Group of the IETF has
developed a new concept which scales better than the RSVP-based approach.
Differentiated Services are based on service level agreements (SLAs) that are nego-
tiated between users and Internet service providers. With these SLAs users describe
the packets which should be transferred over the Internet with higher priority than
best-effort packets. The SLAs also define parameters such as the desired bandwidth
for these higher priority packets. The implementation of this concept requires addi-
tional functionality such as classification, metering, marking, shaping, policing etc.
within routers at the domain boundaries. This paper describes the Differentiated
Service architecture currently being defined by the IETF DiffServ working group
and the required components to implement the DiffServ architecture.
1 Introduction
The Internet, currently based on the best-effort model, delivers only one
type of service. With this model and FIFO queuing deployed in the network,
any non-adaptive source can grab a large share of the bandwidth while
depriving others. One can always run multiple web browsers or start multiple
FTP connections and grab a substantial amount of bandwidth by exploiting the
best-effort model. The Internet is also unable to support real-time
applications like audio or video.
The incredibly rapid growth of the Internet has resulted in massive increases
in demand for network bandwidth and performance guarantees to support both
existing and new applications. In order to meet these demands, new Quality
of Service (QoS) functionalities need to be introduced to satisfy customer
requirements, including efficient handling of both mission-critical and
bandwidth-hungry web applications. QoS, therefore, is needed for various
reasons:
• Better control and efficient use of network resources (e.g. bandwidth).
• Enabling users to enjoy multiple levels of service differentiation.
• Special treatment of mission-critical applications, while letting others
  get fair treatment without interfering with mission-sensitive traffic.
• Business communication.
• Virtual Private Networks (VPN) over IP.
1.1 A Pragmatic Approach to QoS
A pragmatic approach to achieve good quality of service (QoS) is an adap-
tive design of the applications to react to changes of the network characteris-
tics (e.g. congestion). Immediately after detecting a congestion situation the
transmission rate may be reduced by increasing the compression ratio or by
modifying the A/V coding algorithm. For this purpose functions to monitor
quality of service are needed. For example, such functions are provided by the
Real-Time Transport Protocol (RTP) [SCFJ96] and the Real-Time Control
Protocol (RTCP). A receiver measures the delay and the rate of the pack-
ets received. This information is transmitted to the sender via RTCP. With
this information the sender can detect if there is congestion in the network
and adjust the transmission rate accordingly. This may affect the coding of
the audio or video data. If only a low data rate is achieved, a coding algo-
rithm with lower quality has to be chosen. Without adaptation the packet
loss would increase, making the transmission completely useless. However,
rate adaptation is limited since many applications need a minimum rate to
work reasonably.
1.2 Reservation-based Approach
To achieve the QoS objective mentioned in the previous section, basically
two approaches can be offered in a heterogeneous network like the Internet:
Integrated Service Approach: The Integrated Services Architecture, based
on the Resource Reservation Setup Protocol (RSVP), relies on absolute
network reservation for specific flows. This can be supported in small
LANs, where routers can store a small number of flow states. In the
backbone, however, it would be extremely difficult, if not impossible, to
store millions of flow states even with very powerful processors. Moreover,
for short-lived HTTP connections, it is probably not practical to reserve
resources in advance.
Differentiated Service (DiffServ): To avoid the scaling problem of RSVP,
a differentiated service is provided for an aggregated stream of packets by
marking the packets and invoking some differentiation mechanism (e.g.
forwarding treatment to treat packets differently) for each marked packet
on the nodes along the stream's path. A very general approach of this
mechanism is to define a service profile (a contract between a user and the
ISP) for each user (or group of users), and to design other mechanisms in
the router that favor traffic conforming to those service profiles. These
mechanisms might be classification, prioritization and resource allocation
to allow the service provider to provision the network for each of the
offered classes of service in order to meet the application (user)
requirements.
2 DiffServ Basics and Terminology
The idea of differentiated services is based on the aggregation of flows, i.e.
reservations have to be made for a set of related flows (e.g. for all flows
between two subnets). Furthermore, these reservations are rather static since
no dynamic reservations for a single connection are possible. Therefore, one
reservation may exist for several, possibly consecutive connections.
IP packets are marked with different priorities by the user (either in an
end system or at a router) or by the service provider. According to the dif-
ferent priority classes the routers reserve corresponding shares of resources,
in particular bandwidth. This concept enables a service provider to offer dif-
ferent classes of QoS at different costs to his customers.
The differentiated services approach allows customers to set a fixed rate or
a relative share of packets which have to be transmitted by the ISP with high
priority. The probability of providing the requested quality of service depends
essentially on the dimensions and configuration of the network and its links,
i.e. whether individual links or routers can be overloaded by high priority
data traffic. Though this concept cannot, as a rule, guarantee any QoS
parameters, it is more straightforward to implement than continuous resource
reservations and it offers a better QoS than mere best-effort services.
2.1 Popular Services of the DiffServ Approach
At present, several proposals exist for the realization of differentiated services.
Examples are:
Assured and Premium Services: The approach allowing the combination of
different services like Premium and Assured Service seems to be very
promising. In both approaches absolute bandwidth is allocated for
aggregated flows. They are based on packet tagging indicating the service
to be provided for a packet. Actually, Assured Service does not provide an
absolute bandwidth guarantee but offers a soft guarantee that traffic
marked with high-priority tagging will be transmitted with high
probability.
User Share Differentiation and Olympic Service: An alternative approach
called User-Share Differentiation (USD) assigns bandwidth proportionally to
aggregated flows in the routers (for example all flows from or to an IP
address or a set of addresses). A similar service is provided by the
Olympic service. Here, three priority levels are distinguished, assigning
different fractions of bandwidth to the three priority levels gold, silver
and bronze, for example 60% for gold, 30% for silver and 10% for bronze.
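The Olympic split is a plain proportional division of link capacity. A minimal sketch, assuming a hypothetical helper `olympic_shares` and the example weights from the text:

```python
def olympic_shares(link_bps, weights=None):
    """Split a link's capacity among priority levels in proportion to weights.

    The default weights are the 60/30/10 example from the text; any other
    weights are normalised by their sum.
    """
    if weights is None:
        weights = {"gold": 60, "silver": 30, "bronze": 10}
    total = sum(weights.values())
    return {level: link_bps * w / total for level, w in weights.items()}


# A 10 Mbit/s link yields 6, 3 and 1 Mbit/s for gold, silver and bronze.
shares = olympic_shares(10_000_000)
```

How a router enforces these fractions (for example with a weighted scheduler) is a separate design decision that this sketch does not cover.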
2.2 DS byte marking
In differentiated services networks where service differentiation is the main
objective, the differentiation mechanisms are triggered by the so-called DS
byte (or ToS byte) marking of the IP packet header. Various service differ-
entiation mechanisms (queuing disciplines), as we will study them in section
3, can be invoked depending on the DS byte marking. Therefore, marking is
one of the most vital enabling components at the DS boundary, and all DS
routers must implement this facility.
[Figure 1: IPv4 header layout (Version, IHL, TOS, Total Length;
Identification, Flags, Fragment Offset; Time to Live, Protocol, Header
Checksum; Source Address; Destination Address), with the TOS field carrying
the DS byte.]
Fig. 1. DS byte in IPv4 [NBBB98]
In the latest proposal for packet marking, the first 6 bits, called the
Differentiated Services Code Point (DSCP), are used to invoke PHBs, rather
than a single first bit marking IN- or OUT-of-profile traffic (see Figure 1).
Router implementations should support the recommended code point-to-PHB
mappings. The default PHB, for example, is 000000. Since the DSCP field has
6 bits, the number of code points that can be defined is 2^6 = 64. This
proposal will be the basis of future DiffServ development.
Many existing routers already use the IP precedence field to invoke various
PHB treatments in a fashion similar to the DSCP. To remain compatible, routers
can be configured to ignore bits 3, 4 and 5. Code points 101000 and 101010
would, therefore, map to the same PHB. Router designers must consider
the semantics described above in their implementation and perform the
necessary and appropriate mapping in order to remain compatible with old
systems.
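The bit manipulation behind these statements is short enough to sketch. The helper names below are hypothetical; the sketch assumes that the DSCP occupies the six high-order bits of the DS byte and that bits are numbered 0 to 5 from the most significant end, so that bits 3, 4 and 5 are the three low-order DSCP bits.

```python
DSCP_BITS = 6
NUM_CODE_POINTS = 1 << DSCP_BITS     # 2^6 = 64 possible code points


def dscp_from_ds_byte(ds_byte):
    """Extract the 6-bit DSCP from the 8-bit DS (former ToS) byte."""
    return (ds_byte >> 2) & 0x3F


def ds_byte_from_dscp(dscp, unused=0):
    """Build a DS byte from a DSCP and the two remaining low-order bits."""
    return ((dscp & 0x3F) << 2) | (unused & 0x03)


def precedence_compatible(dscp):
    """Ignore DSCP bits 3, 4 and 5, keeping only the IP precedence bits."""
    return dscp & 0b111000


# 101000 and 101010 differ only in an ignored bit, so they select the same PHB.
same_phb = precedence_compatible(0b101000) == precedence_compatible(0b101010)
```

A router in compatibility mode would then look up the PHB using the masked value instead of the full code point.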
2.3 Per Hop Behavior (PHB)
An introduction to PHBs has already been given in the discussion of DS byte
marking in section 2.2. Further, [BW98] writes: "Every PHB is the externally
observable forwarding behavior applied at a DS capable node to a stream of
packets that have a particular value in the bits of the DS field (DS code
point). PHBs can also be grouped when it is necessary to describe the several
forwarding behaviors simultaneously with respect to some common constraints."
However, there is no rigid assignment of PHBs to DSCP bit patterns.
This has several reasons:
• There are (or will be) a lot more PHBs defined than DSCPs available,
  making a static mapping impossible.
• The understanding of good choices of PHBs is only at its beginning.
• It is desirable to have complete flexibility in the correspondence of PHB
  values and behaviors.
• Every ISP shall be able to create/map PHBs in his DiffServ domain.
For these reasons there are no static mappings between DS code points
and PHBs. The PHBs are enumerated as they become defined and can be
mapped to every DSCP within a DiffServ domain. As long as the enumeration
space contains a large number of values (2^32), there is no danger of running
out of space to list the PHB values. This list can be made public for maximum
interoperability. Because of this interoperability, mappings between PHBs
and DSCPs are proposed, even though every ISP can choose other mappings
for the PHBs in his DiffServ domain.
Until now, two PHBs and corresponding DSCPs have been defined.
Table 1. The 12 different AF code points
Drop Precedences          AF Code points
                          Class 1   Class 2   Class 3   Class 4
Low Drop Precedence       001010    010010    011010    100010
Medium Drop Precedence    001100    010100    011100    100100
High Drop Precedence      001110    010110    011110    100110
Assured Forwarding PHB: Based on the current Assured Forwarding
PHB (AF) group [HBWW99], a provider can provide four independent
AF classes where each class can have one of three drop precedence values.
These classes are not aggregated in a DS node and Random Early Detection
(RED) [FJ93] is considered to be the preferred discarding mechanism.
This requires altogether 12 different AF code points as given in
Table 1.
In a Differentiated Service (DS) domain each AF class receives a certain
amount of bandwidth and buffer space in each DS node. Drop precedence
indicates the relative importance of the packet within an AF class. During
congestion, packets with higher drop precedence values are discarded first
to protect packets with lower drop precedence values. By having multiple
classes and multiple drop precedences for each class, various levels of
forwarding assurances can be offered. For example, Olympic Service can
be achieved by mapping three AF classes to its gold, silver and bronze
classes. A low loss, low delay, low jitter service can also be achieved by using
the AF PHB group if the packet arrival rate is known in advance. AF does not
give any delay related service guarantees. However, it is still possible to
say that packets in one AF class have a smaller or larger probability of
timely delivery than packets in another AF class. The Assured Service
can be realized with AF PHBs.
Expedited Forwarding PHB: The forwarding treatment of the Expedited
Forwarding (EF) PHB [JNP98] provides aggregated traffic with a departure
rate that is higher than or equal to a configurable rate. Services
which need end-to-end assured bandwidth and low loss, low latency and
low jitter can use the EF PHB to meet the desired requirements. One
good example is premium service (or virtual leased line), which has such
requirements. Various mechanisms like Priority Queuing, Weighted Fair
Queuing (WFQ) and Class Based Queuing (CBQ) are suggested to implement
this PHB since they can preempt other traffic and the queue serving
EF packets can be allocated bandwidth equal to the configured rate. The
recommended code point for the EF PHB is 101110.
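The AF code points of Table 1 follow a regular bit pattern: the first three bits carry the class number, the next two bits carry the drop precedence (1 = low, 2 = medium, 3 = high), and the last bit is zero. A small Python sketch of this pattern is given below; the function name `af_code_point` is an illustrative assumption, not part of any standard API.

```python
EF_CODE_POINT = "101110"  # recommended EF code point quoted in the text


def af_code_point(af_class: int, drop_precedence: int) -> str:
    """Return the 6-bit AF code point for a class (1-4) and drop precedence (1-3)."""
    if af_class not in (1, 2, 3, 4):
        raise ValueError("AF class must be 1..4")
    if drop_precedence not in (1, 2, 3):
        raise ValueError("drop precedence must be 1 (low), 2 (medium) or 3 (high)")
    # class in bits 0-2, drop precedence in bits 3-4, bit 5 is always zero
    return format((af_class << 3) | (drop_precedence << 1), "06b")


# Rebuild Table 1 row by row (low, medium, high drop precedence).
TABLE_1 = [[af_code_point(c, dp) for c in (1, 2, 3, 4)] for dp in (1, 2, 3)]
```

Generating the table this way makes it easy to check a router configuration against all 12 values instead of copying them by hand.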
2.4 Service Profile
A service profile expresses an expectation of a service received by a user,
a group of users or a behavior aggregate from an ISP. It is, therefore, a contract
between a user and a provider and also includes rules and regulations a user
is supposed to obey. All these profile parameters are settled in an agreement
called a Service Level Agreement (SLA). It also contains a Traffic Conditioning
Agreement (TCA) as a subset, which specifies traffic conditioning actions
(described in the next subsection) and rules for traffic classification, traffic
re-marking, shaping, policing etc. In general, an SLA might include performance
parameters like peak rate, burst size, average rate, delay and jitter
parameters, drop probability and other throughput characteristics. An example
is:
Service Profile 1: Code point: X, Peak rate = 2 Mbps, Burst size = 1200 bytes,
avg. rate = 1.8 Mbps
Only a static SLA, which usually changes weekly or monthly, is possible
with today's router implementations. The profile parameters are set in the
router manually to take appropriate action. Dynamic SLAs change frequently
and need to be deployed by some automated tool which can renegotiate
resources between any two nodes.
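For illustration, a static service profile such as Service Profile 1 could be held in a small record like the one sketched below. The class and field names are assumptions made for this sketch; the code point is kept as the placeholder "X" used in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """Hypothetical container for the profile parameters named in an SLA."""
    code_point: str        # DS code point, left as a placeholder here
    peak_rate_bps: float   # peak rate in bit/s
    burst_size_bytes: int  # maximum burst size in bytes
    avg_rate_bps: float    # average rate in bit/s

    def __post_init__(self) -> None:
        # A profile whose average rate exceeds its peak rate is inconsistent.
        if self.avg_rate_bps > self.peak_rate_bps:
            raise ValueError("average rate cannot exceed peak rate")


# Service Profile 1 from the text: 2 Mbps peak, 1200 bytes burst, 1.8 Mbps average.
PROFILE_1 = ServiceProfile("X", 2_000_000, 1200, 1_800_000)
```

A static SLA would correspond to creating such records by hand; a dynamic SLA would require a tool that replaces them at run time.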
2.5 Traffic Conditioner
Traffic conditioners [BBC+98] are required to instantiate services in DS capable
routers and to enforce service allocation policies. These conditioners
are, in general, composed of one or more of the following: classifiers, markers,
meters, policers, and shapers. When a traffic stream at the input port
of a router is classified, it might then have to travel through a meter (used
where appropriate) to measure the traffic behavior against a traffic profile
which is a subset of the SLA. The meter classifies particular packets as IN or
OUT-of-profile depending on SLA conformance or violation. Based on the
state of the meter, further marking, dropping, or shaping action is activated.
Fig. 2. DS Traffic Conditioning in Enterprise Network (as a set of queues)
Traffic conditioners can be applied at any congested network node (Figure 2)
when the total amount of inbound traffic exceeds the output capacity
of the switch (or router). In Figure 2 routers between source and destination
are modeled as queues in an enterprise network to show when and where
traffic conditioners are needed. For example, routers may buffer traffic (i.e.
shape it by delaying) or mark it to be discarded later during medium
network congestion, but might be required to discard packets (i.e. police traffic)
during heavy network congestion when queue buffers fill up. As the number
of routers grows in a network, congestion increases due to the expanded volume
of traffic and hence proper traffic conditioning becomes more important.
Traffic conditioners might not need all four elements. If no traffic profile
exists then packets may only pass through a classifier and a marker.
Classifier: Classifiers categorize packets from a traffic stream based on the
content of some portion of the packet header. They match received packets
to statically or dynamically allocated service profiles and pass those packets
to an element of a traffic conditioner for further processing. Classifiers
must be configured by some management procedures in accordance with
the appropriate TCA.
Two types of classifiers exist:
BA classifier: classifies packets based on patterns of the DS byte (DS code
point) only.
MF classifier: classifies packets based on any combination of DS field,
protocol ID, source address, destination address, source port, destination
port or even application level protocol information.
Markers: Packet markers set the DS field of a packet to a particular code
point, adding the marked packet to a particular DS behavior aggregate.
The marker can (i) mark all packets which are mapped to a single code
point, or (ii) mark a packet to one of a set of code points to select a PHB
in a PHB group, according to the state of a meter.
Meters: After being classified at the input of the boundary router, traffic
from each class is typically passed to a meter. The meter is used to measure
the rate (temporal properties) at which traffic of each class is being
submitted for transmission, which is then compared against a traffic profile
specified in the TCA (negotiated between the DiffServ provider and the
DiffServ customer). Based on the comparison some particular packets
are considered conforming to the negotiated profile (IN-profile) or
non-conforming (OUT-of-profile). When a meter passes this state information
to other conditioning functions, an appropriate action is triggered
for each packet which is either IN or OUT-of-profile (see Table 1).
Shapers: Shapers delay some packets in a traffic stream using a token bucket
in order to force the stream into compliance with a traffic profile. A shaper
usually has a finite-size buffer and packets are discarded if there is not
sufficient buffer space to hold the delayed packets. Shapers are generally
placed after either type of classifier. For example, shaping for EF traffic
at the interior nodes helps to improve end to end performance and also
prevents the other classes from being starved by a big EF burst. Only
either a policer or a shaper is supposed to appear in the same traffic
conditioner.
Policer: When classified packets arrive at the policer, it monitors the dynamic
behavior of the packets and discards or re-marks some or all of
the packets in order to force the stream into compliance with a
traffic profile (i.e. force them to comply with configured properties like
rate and burst size). By setting the shaper buffer size to zero (or a few packets)
a policer can be implemented as a special case of a shaper. Like
shapers, policers can also be placed after either type of classifier. Policers,
in general, are considered suitable to police traffic between a site
and a provider (edge router) and after BA classifiers (backbone router).
However, most researchers agree that policing should not be done at the
interior nodes since it unavoidably involves flow classification. Policers
are usually present in ingress nodes and could be based on simple token
bucket filters.
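The interplay of meter and policer described above can be sketched with a simple token bucket. The following Python fragment is only an illustration under stated assumptions: the class name, the byte-based accounting and the explicit timestamps are choices made for this sketch and do not describe a particular router implementation.

```python
class TokenBucketMeter:
    """Token bucket meter: tokens are bytes, refilled at a fixed rate."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s
        self.depth = burst_bytes
        self.tokens = burst_bytes  # bucket starts full
        self.last_time = 0.0

    def in_profile(self, now: float, size_bytes: int) -> bool:
        """Return True (IN-profile) if enough tokens exist for the packet."""
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False  # OUT-of-profile packets consume no tokens


def police(meter: TokenBucketMeter, arrivals):
    """Policer with zero buffer: forward IN-profile packets, drop the rest."""
    return [(t, size) for (t, size) in arrivals if meter.in_profile(t, size)]
```

A shaper would differ only in that OUT-of-profile packets are held in a finite buffer until enough tokens have accumulated, instead of being dropped immediately.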
3 Realizing PHBs: The Queuing Components
Since differentiated service is a kind of service discrimination, some traffic
needs to be handled with priority, some traffic needs to be discarded
earlier than other traffic, some traffic needs to be serviced faster, and in
general, one type of traffic always needs to be treated better than another.
In earlier sections we have discussed service profiles and PHBs. It was made clear
that in order to conform to the contracted profile and implement the PHBs,
queuing disciplines play a crucial role. The queuing mechanisms typically
need to be deployed at the output port of a router.
Since we need different kinds of differentiation under specific situations,
the right queuing component (i.e. PHB) needs to be invoked by the use of
a particular code point. In this section, therefore, we describe some of
the most promising mechanisms which have already been or deserve to be
considered for implementation in a variety of DS routers.
3.1 Absolute Priority Queuing
In absolute priority queuing (Figure 3), the scheduler gives higher-priority
queues absolute preferential treatment over lower-priority queues. Therefore,
the highest priority queue receives the fastest service, and the lowest priority
queue experiences the slowest service among the queues.
The basic working mechanism is as follows: the scheduler would always
scan the priority queues from highest to lowest to find the highest priority
packet and then transmit it. When that packet has been completely served,
the scheduler would start scanning again. If any of the queues overflows,
packets are dropped and an indication is sent to the sender.
While this queuing mechanism is useful for mission critical traffic (since
this kind of traffic is very delay sensitive) this would definitely starve the
lower priority packets of the needed bandwidth.
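The scanning behavior described above can be sketched in a few lines of Python. The class below is an illustrative assumption (its name, the per-queue capacity and the return values are not taken from the text); notifying the sender on overflow is left out.

```python
from collections import deque


class AbsolutePriorityScheduler:
    """Index 0 is the highest priority queue; each queue has a fixed capacity."""

    def __init__(self, levels: int, capacity: int) -> None:
        self.queues = [deque() for _ in range(levels)]
        self.capacity = capacity

    def enqueue(self, priority: int, packet) -> bool:
        """Queue a packet; return False if the queue overflows and it is dropped."""
        queue = self.queues[priority]
        if len(queue) >= self.capacity:
            return False
        queue.append(packet)
        return True

    def dequeue(self):
        """Scan from highest to lowest priority and return the first packet found."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None  # all queues are empty
```

Because `dequeue` always restarts its scan at the highest priority queue, a lower priority queue is served only while every higher queue is empty, which is exactly the starvation effect mentioned above.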
3.2 WFQ
WFQ [Kes91] (Figure 4) is a discipline that assigns a queue to each flow. A
weight can be assigned to each queue to give it a different proportion of the
network capacity. As a result, WFQ can protect a flow against other
flows.
WFQ can be configured to give low-volume traffic flows preferential treat-
ment to reduce response time and fairly share the remaining bandwidth be-
tween high volume traffic flows. With this approach bandwidth hungry flows
are prevented from consuming much of network resources while depriving
other smaller flows.
WFQ does the job of dynamic configuration since it adapts automatically
to the changing network conditions. TCP congestion control and slow-start
features are also enhanced by WFQ, resulting in predictable throughput and
response time for each active flow.
Fig. 3. Absolute Priority Queuing. The queue with the highest priority is served first
Fig. 4. Weighted Fair Queuing (WFQ)
The weighted aspect can be related to values in the DS byte of the IP
header. A flow can be allocated more access to queue resources if it has a
higher precedence value.
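The effect of the weights can be illustrated with a simplified fair queuing sketch. Note that this is not the exact WFQ algorithm of [Kes91]: it uses a self-clocked approximation in which each packet receives a finish tag of start + size / weight and the packet with the smallest tag is sent next. All names are assumptions made for this example.

```python
import heapq


class FairQueue:
    """Simplified, self-clocked approximation of weighted fair queuing."""

    def __init__(self, weights: dict) -> None:
        self.weights = weights
        self.last_finish = {flow: 0.0 for flow in weights}
        self.virtual_time = 0.0  # finish tag of the packet sent most recently
        self.heap = []
        self.seq = 0             # tie-breaker that keeps arrival order

    def enqueue(self, flow: str, size: int) -> None:
        start = max(self.virtual_time, self.last_finish[flow])
        finish = start + size / self.weights[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow, size))
        self.seq += 1

    def dequeue(self):
        """Return (flow, size) of the packet with the smallest finish tag."""
        if not self.heap:
            return None
        finish, _, flow, size = heapq.heappop(self.heap)
        self.virtual_time = finish
        return flow, size
```

With weights 2 and 1 and equally sized packets, the flow with weight 2 is served about twice as often while both flows are backlogged, which is the proportional sharing described above.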
3.3 Class Based Queuing (CBQ)
In an environment where bandwidth must be shared proportionally between
users, CBQ [FJ95] (Figure 6) provides a very flexible and efficient approach:
it first classifies user traffic, then assigns a specified amount of resources
to each class of packets and serves the resulting queues in a round robin fashion.
A class can be an individual flow or aggregation of flows representing
different applications, users, departments, or servers. Each CBQ traffic class
has a bandwidth allocation and a priority. In CBQ, a hierarchy of classes
(Figure 5) is constructed for link sharing between organizations, protocol
families, and traffic types. Different links in the network will have different
link-sharing structures. The link sharing goals are:
• Each interior or leaf class should receive roughly its allocated link-sharing
  bandwidth over appropriate time intervals, given sufficient demand.
• If all leaf and interior classes with sufficient demand have received at
  least their allocated link-sharing bandwidth, the distribution of any excess
  bandwidth should not be arbitrary, but should follow some set of
  reasonable guidelines.
Fig. 5. Hierarchical Link-Sharing
The granular level of control in CBQ can be used to manage the allocation
of IP access bandwidth across the departments of an enterprise, or to provision
bandwidth to the individual tenants of a multi-tenant facility.
Other than the classifier that assigns arriving packets to an appropriate
class, there are three other main components that are needed in this CBQ
mechanism: scheduler, rate-limiter (delayer) and estimator.
Fig. 6. Class Based Queuing: Main Components
Scheduler: In a CBQ implementation, the packet scheduler can be implemented
with either a packet-by-packet round robin (PRR) or a weighted
round robin (WRR) scheduler. With priority scheduling, the scheduler
first schedules packets from the highest priority level.
Round-robin scheduling is used to arbitrate between traffic classes within
the same priority level. In weighted round robin scheduling the scheduler
uses weights proportional to a traffic class's bandwidth allocation. This
weight determines the number of bytes a traffic class is allowed to
send during a round of the scheduler. Each class at each round gets to
send its weighted share in bytes, including finishing sending the current
packet. That class's weighted share for the next round is decremented by
the appropriate number of bytes. When a packet to be transmitted by a
WRR traffic class is larger than the traffic class's weight but that class
is underlimit¹, the packet is still sent, allowing the traffic class to borrow
ahead from its weighted allotment for future rounds of the round-robin.
Rate-Limiter: If a traffic class is overlimit² and is unable to borrow from
its parent classes, the scheduler starts the overlimit action, which might
include simply dropping arriving packets for such a class or rate-limiting
overlimit classes to their allocated bandwidth. The rate-limiter computes
the next time that an overlimit class is allowed to send traffic. Until this
future time has arrived, this class will not be allowed to send another
packet.
Estimator: The estimator estimates the bandwidth used by each traffic class
over the appropriate time interval and determines whether each class is
over or under its allocated bandwidth.
¹ A class is underlimit if it has used less than a specified fraction of its link
sharing bandwidth (in bytes/sec, as averaged over a specified time interval).
² A class is overlimit if it has recently used more than its allocated link
sharing bandwidth (in bytes/sec, as averaged over a specified time interval).
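The byte accounting of the weighted round robin scheduler, including the borrowing from future rounds, can be sketched as follows. This is a deliberately reduced illustration: the estimator, priorities and the class hierarchy are omitted, every class is assumed to be underlimit, and resetting the unused credit of an empty class is an assumption of this sketch rather than something stated in the text.

```python
from collections import deque


def wrr_round(queues: dict, weights: dict, allowance: dict) -> list:
    """Run one weighted round robin round and return the (class, size) pairs sent.

    queues maps a class name to a deque of packet sizes in bytes, weights gives
    the bytes a class may send per round, and allowance carries the remaining
    (possibly negative) byte credit of each class from round to round.
    """
    sent = []
    for name, queue in queues.items():
        allowance[name] += weights[name]
        # A packet is sent whenever some credit is left, even if it is larger
        # than the credit: the class borrows from its share of future rounds.
        while queue and allowance[name] > 0:
            size = queue.popleft()
            allowance[name] -= size
            sent.append((name, size))
        if not queue:
            # Assumption: an idle class keeps its debt but saves no unused credit.
            allowance[name] = min(allowance[name], 0)
    return sent
```

In this sketch a class with 1500-byte packets and a weight of 1000 bytes sends one packet in each of two rounds and then has to pause for a round, so that it averages its weight of 1000 bytes per round.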
3.4 Random Early Detection (RED)
Random Early Detection (RED) [FJ93] is designed to avoid congestion by
monitoring traffic load at points in the network and stochastically discarding
packets when congestion starts increasing. By dropping some packets early
rather than waiting until the buffer is full, RED keeps the average queue size
low and avoids dropping large numbers of packets at once, which minimizes the
chances of global synchronization. Thus, RED reduces the chances of tail drop
and allows the transmission line to be used fully at all times. This approach
has certain advantages:
• Bursts can be handled better, as a certain queue capacity can always be
  reserved for incoming packets.
• Due to the lower average queue length, real-time applications are better
  supported.
The working mechanism of RED is quite simple. It has two thresholds, a
minimum threshold X1 and a maximum threshold X2, for the packet discarding
or admission decision, which is made by a dropper. Referring to Figure 7, when
a packet arrives at the queue, the average queue size (av_queue) is computed. If
av_queue < X1, the packet is admitted to the queue; if av_queue ≥ X2, the
packet is dropped. When the average queue size falls between the
thresholds, X1 ≤ av_queue < X2, the arriving packet is either dropped or
queued; more precisely, it is dropped with a probability that increases
linearly with av_queue.
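The decision rule above can be written down directly. The sketch below follows the thresholds X1 and X2 of the text; the maximum drop probability `max_p` reached just below X2 and the weight of the moving average are parameters of the original RED proposal [FJ93] that are not specified here, so their values and all function names are assumptions of this example.

```python
import random


def update_average(av_queue: float, current_queue: int, weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the queue size (assumed weight)."""
    return (1.0 - weight) * av_queue + weight * current_queue


def drop_probability(av_queue: float, x1: float, x2: float, max_p: float) -> float:
    """Drop probability as a function of the average queue size."""
    if av_queue < x1:
        return 0.0   # below the minimum threshold: always admit
    if av_queue >= x2:
        return 1.0   # at or above the maximum threshold: always drop
    return max_p * (av_queue - x1) / (x2 - x1)  # linear increase in between


def admit(av_queue: float, x1: float, x2: float, max_p: float, rng=random.random) -> bool:
    """Return True if the arriving packet is queued, False if it is dropped."""
    return rng() >= drop_probability(av_queue, x1, x2, max_p)
```

Passing the random source as `rng` keeps the dropper deterministic in tests, for example `admit(20, 10, 30, 0.1, rng=lambda: 0.5)`.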
When congestion occurs, the probability that RED notifies a particular
connection to reduce its window size is approximately proportional
to that connection's share of the bandwidth. The RED congestion control
mechanism monitors the average queue size for each output queue and, using
randomization, chooses connections to notify of that congestion.
Fig. 7. Random Early Detection
It is very useful to the network since it has the ability to flexibly specify
traffic handling policies to maximize throughput under congestion conditions.
RED is especially able to split bandwidth between TCP data flows in a fair
way, as lost packets automatically cause a reduction of a TCP data flow's
packet rate. More problematic is the situation if non TCP conforming data
flows (e.g. UDP based real-time or multicast applications) are involved. Flows
not reacting to packet loss have to be handled separately by reducing their
data rate to avoid overloading the network.
In general, RED statistically drops more packets from large users than
from small ones. Therefore, traffic sources that generate the most traffic are
more likely to be slowed down than traffic sources that generate little traffic.
3.5 RED with In and Out (RIO)
RIO (RED with In and Out) [CW97], the queuing algorithm proposed for
assured service, is an extension of the RED mechanism. This procedure shall
make sure that during overload primarily packets with high drop precedence
(e.g. best-effort instead of assured service packets) are dropped. A data flow
can consist of packets with various drop precedences, which can arrive at
a common output queue. In this way changes to the packet order can be avoided,
which positively affects TCP performance.
For in and out-of-profile packets a common queue using different dropping
techniques for the different packet types is provided. The dropper for
out-of-profile packets discards packets much earlier (i.e. at a lower queue
length) than the dropper for in-profile packets. Furthermore, the dropping
probability for out-of-profile packets increases faster than the probability
for in-profile packets. In this way the probability of dropping in-profile
packets is kept very low. While the out-dropper uses the number of all packets
in the queue to calculate its drop probability, the in-dropper only uses the
number of in-profile packets (see Figure 8). Because they use the same queue,
both types of packets will have the same delay. This might be a disadvantage
of this concept. By dropping all out-of-profile packets at a quite small
queue length this effect can be reduced but not eliminated.
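The two droppers can be sketched as two RED curves with different parameters and different inputs. The threshold values below are purely hypothetical numbers chosen to show the idea (out-of-profile packets are dropped earlier and more aggressively); they are not taken from [CW97].

```python
def linear_drop_probability(avg: float, min_th: float, max_th: float, max_p: float) -> float:
    """RED-style curve: 0 below min_th, 1 from max_th on, linear in between."""
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)


# Hypothetical parameter sets (min threshold, max threshold, max probability).
IN_PARAMS = (40, 70, 0.02)   # in-profile: late and gentle dropping
OUT_PARAMS = (10, 30, 0.5)   # out-of-profile: early and aggressive dropping


def rio_drop_probability(in_profile: bool, avg_in: float, avg_total: float) -> float:
    """The in-dropper looks at in-profile packets only, the out-dropper at all packets."""
    if in_profile:
        return linear_drop_probability(avg_in, *IN_PARAMS)
    return linear_drop_probability(avg_total, *OUT_PARAMS)
```

For an average of 20 in-profile packets and 35 packets in total, this sketch drops every out-of-profile packet while still admitting every in-profile packet.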
Fig. 8. RIO-Queuing
4 Differentiated Services in End-to-End Scenarios
4.1 Premium Service and Expedited Forwarding
With Premium Service the user negotiates with his ISP a maximum bandwidth
for sending packets through the ISP network. Furthermore, the aggregated
flow is described by the packets' source and destination addresses or
address prefixes. In Figure 9 users and ISPs have agreed on a rate of three
packets/s for traffic from A to B. The user configures the first-hop router in
the individual subnet accordingly. In the example above a packet rate of two
packets/s is allowed in every first-hop router as it can be expected that no
two end systems will use the full bandwidth of two packets/s at the same
time.
Fig. 9. Premium Service
First-hop routers have the task to classify the packets received from the
end systems, i.e. to analyze if the Premium Service shall be provided to the
packets or not. If yes, the packets are tagged as Premium Service and the data
stream is shaped according to the maximum bandwidth. The user's border
router re-shapes the stream (e.g. three packets per second) and transmits
the packets to the ISP's border router, which performs policing functions,
i.e. it checks whether the user's border router remains below the negotiated
bandwidth of three packets/s. If each of the two first-hop routers allows two
packets/s, one packet per second will be dropped by shaping or policing at
the border routers. All first-hop and border routers own two queues, one for
EF packets and one for all other packets (see Figure 9). If the EF queue contains
packets, these are transmitted prior to others. The implementation of two
queues in every router of the network (ISP and user network) is equivalent to the
realization of a virtual network for Premium Service traffic.
Premium Service offers a service corresponding to a private leased line,
with the advantage of making free network capacities available to other tasks,
resulting in lower fees for the users.
4.2 Assured Service
A potential disadvantage of Premium Service is the weak support for bursts
and the fact that a user has to pay even if he is not using the whole bandwidth.
The Assured Service tries to offer a service which cannot guarantee bandwidth
but provides a high probability that the ISP transfers high-priority-tagged
packets reliably. Concrete services have not yet been defined, but
an obvious choice is to offer services similar to the IntServ controlled load
service. The probability for packets to be transported reliably depends on the
network capacity. An ISP may choose the sum of all bandwidths for Assured Service
to remain below the bandwidth of the weakest link. In this case, only a
small portion of the available capacity may be allocated in the ISP network.
An advantage of the Assured Service is that users do not have to establish
a reservation for a relatively long time. With ISDN or ATM, users might be
unable to use the reserved bandwidth because of the burstiness of their traffic,
whereas Assured Service allows the transmission of short time bursts.
With the Assured Service the user negotiates a service profile with his
service provider, e.g. the maximum amount or rate of high priority, i.e. Assured
Service, packets. The user may then tag his packets as high priority
within the end system or the first-hop router, i.e. assign them a tag for assured
forwarding (AF) (see Figure 10). To avoid modifications in the end
systems the first-hop router may analyze the packets with respect to their IP
addresses and UDP/TCP ports and then assign them the corresponding priority,
i.e. set the AF DSCP for conforming Assured Service packets. The maximum
rate of high-priority (AF DSCP) packets must not be exceeded. This is enforced
by (re-)classification in the first-hop routers and in the user's border routers
at the border to the ISP network. Nevertheless, the service provider has to
check if the user remains below the maximum rate for high priority packets
and apply corrective actions such as policing if necessary.
For example, the border router at the network entrance will tag the non-
conforming packet as low priority (out of service, out of profile). An alterna-
tive would be to charge higher fees for non-conforming packets by the ISP.
The tagging of low and high priority packets is done by use of the DS byte.
Fig. 10. Assured Service
Bursts are supported by making buffer capacity available for buffering
bursty traffic. Inside the network, especially in backbone networks bursts can
be expected to be compensated statistically.
4.3 Traffic Conditioning for Assured and Premium Service
The implementation of Assured and Premium Service requires several modifications
of the routers. Mainly classification, shaping, and policing functions
have to be performed by the router. These functions are necessary at the
border between two networks, for example at the transition from the customer
network to the ISP or between ISPs. Service profiles have to be negotiated
between the ISPs, similar to those at the transition to the user.
First-hop router. Figure 11 shows the first-hop router function for Premium
and Assured Service. Received packets are classified and, according to this, the
AF or EF DSCP is set if the packet should be supported with Assured or
Premium Service. As parameters for the classification, source and destination
addresses or information of higher protocols (e.g. port numbers) may be used.
There are separate queues for each AF class, for EF and for best effort traffic.
So, a pure best-effort packet will be forwarded directly to the best-effort RED
queue and the Assured Service packets get to their RED queues. The Assured
Service packets are checked whether they conform to the service profile. The
drop precedence will only be kept unchanged if the Assured Service bucket
contains a token. Otherwise the drop precedence will be increased. The RED-based
queuing shall guarantee that AF packets with higher drop precedence
are dropped prior to AF packets with lower drop precedence if the capacity
is exceeded.
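The re-marking step for Assured Service packets can be sketched on the AF code points of Table 1. The text only states that the drop precedence is increased when no token is available; raising it by exactly one level and capping it at the highest level is an assumption of this sketch, as is the function name.

```python
def remark_af(dscp: int, token_available: bool) -> int:
    """Keep an AF code point if a token is available, else raise its drop precedence."""
    af_class = dscp >> 3               # bits 0-2: AF class 1..4
    drop_prec = (dscp >> 1) & 0b11     # bits 3-4: drop precedence 1..3
    if not (1 <= af_class <= 4 and 1 <= drop_prec <= 3 and dscp & 1 == 0):
        raise ValueError("not an AF code point")
    if token_available:
        return dscp                    # conforming packet: unchanged
    # Assumption: increase by one level, but never beyond the highest level.
    return (af_class << 3) | (min(drop_prec + 1, 3) << 1)
```

The class bits are never touched, so a re-marked packet stays in the queue of its AF class and packet order within the class is preserved.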
Fig. 11. First-hop router for Premium, Assured and best effort services
Border router. Similar to the first-hop router, an intermediate router has
to perform shaping functions in order to guarantee that not more than the
allowed packet rate is transmitted to the ISP. This is important since the
ISP will check whether the user remains within the negotiated service profile.
The border router in Figure 12 will therefore drop non-conforming Premium
Service packets and increase the drop precedence of non-conforming Assured
Service packets. Packets within an AF class but with different precedence
values share the same queue since both types of packets may belong to the
same source. A common queue avoids re-ordering of packets. This is especially
important for TCP performance reasons.
Fig. 12. Policing in a border router
First-Hop and Egress Border Routers. Figure 13 shows the working
principle of a first hop and an egress router for assured service. An egress
border router is the border router at which the packets leave the differentiated
service domain. Received packets are classified and the AF DSCP
is set if assured service should be given to the packet. Source and destination
addresses and information of higher protocols (e.g. port numbers) may
be used as classification parameters. A pure best effort packet is pushed
directly to the output queue.
The AF DSCP is set according to the availability of a token and the packet is
then written to the AF output queue. Normal best effort traffic is pushed
directly to the best effort queue.
The token buckets are configured according to the SLAs consisting of
bit rates and the burst parameter. The bucket may be capable of keeping
several tokens to support short time bursts. The bucket's depth depends on
the arranged burst properties.
The difference between a first hop and an egress border router is
that at the first hop router a packet is classified for the first time; for this
task information of higher protocols (TCP ports, type of the application) may
be used. The egress border router, in contrast, is capable of changing the drop
precedence to meet the negotiated service profile.
Ingress Border Router. The ISP has to ensure that the user meets the
negotiated traffic characteristics. To achieve this, the ISP has to check in
his ingress border router, which transmits the packets into his DS domain,
whether the user keeps the SLA. So the ingress border router of Figure 14
will change the drop precedence of non-conforming packets.
Fig. 13. First hop and egress border router for Assured Service
Fig. 14. Ingress border router with three drop precedences for Assured Service
4.4 User-Share Differentiation
Based upon packet t aggi ng Pr e mi u m and Assur ed Servi ce model s can fulfill
t he s t i pul at ed servi ce pa r a me t e r s like bi t r at es wi t h a hi gh degr ee of pr obabi l -
i t y onl y if t he I SP net wor k is di mensi oned a ppr opr i a t e l y and non bes t - ef f or t
t raffi c is t r a ns mi t t e d bet ween cer t ai n known net wor ks only.
I f for i nst ance t wo users have cont r act ed a bi t r at e of 1 Mbps for As s ur ed
Servi ce packet s wi t h an I SP and bot h wi sh t o recei ve d a t a s i mul t aneous l y at
a r at e of 1 Mbps each f r om a WWW ser ver whi ch is connect ed t o t he ne t wor k
wi t h a 1.5 Mbps link, t he r equest ed qual i t y of servi ce cannot be pr ovi ded.
The Us er - Shar e Di f f er ent i at i on a ppr oa c h [Wan97] avoi ds t hi s pr obl e m
by cont r act i ng not absol ut e ba ndwi dt h p a r a me t e r s but r el at i ve ba ndwi dt h
shares. A user will be guar ant eed onl y a cer t ai n r el at i ve a mo u n t of t he avai l -
abl e ba ndwi dt h in an I SP net wor k. I n pr act i ce, t he size of t hi s s har e will be
in di rect r el at i on t o t he char ged cost s.
In Figure 15, user A has been allocated only half of the bandwidth of user B and one third of the bandwidth of user C. If, for example, A and B access the network at the same time on low bandwidth links with a capacity of 30 kbps, user B will receive a bandwidth of 20 kbps but user A will get merely 10 kbps. If B and C access the same or possibly a different network via a common high bandwidth link with a capacity of 25 Mbps, B will receive 10 Mbps and C 15 Mbps.
Fig. 15. User Share Differentiation (USD)
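
The proportional split behind this example can be sketched in a few lines. The function name `usd_allocate` and the integer weights 1:2:3 are illustrative assumptions read off the Figure 15 example (A has half of B's share and one third of C's); the actual mechanism is the one specified in [Wan97].

```python
def usd_allocate(link_capacity, shares, active_users):
    """Split link_capacity among the active users in proportion to their shares."""
    total = sum(shares[u] for u in active_users)
    return {u: link_capacity * shares[u] / total for u in active_users}

# Assumed relative shares: A : B : C = 1 : 2 : 3.
SHARES = {"A": 1, "B": 2, "C": 3}

low = usd_allocate(30, SHARES, ["A", "B"])    # 30 kbps bottleneck shared by A and B
high = usd_allocate(25, SHARES, ["B", "C"])   # 25 Mbps link shared by B and C
```

Only the users that are active on a given bottleneck enter the sum, which is why the same user receives a different absolute rate on different links.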
Simpler router configuration is an important advantage of the USD approach. However, absolute bandwidth guarantees cannot be supported. An additional drawback is that not only edge routers must be configured (as in the case of Premium or Assured Service) but also interior routers must be configured with the bandwidth shares.
5 Conclusion and Outlook
Standardization of Differentiated Services is still under discussion. So far most of the discussions have centered around RED and Assured Service. Virtual Leased Line (or Premium Service) and its implementation by the EF PHB have recently been discussed in [JNP98]; this would require implementation of Priority Queuing, WFQ, CBQ etc. It is not clear where the policing and shaping should take place. Although both AF and EF PHBs have been proposed, the interaction between the two is a debatable issue.
RED and its variants are complementary to different scheduling algorithms, and fit very nicely with CBQ. RED is designed to keep queue sizes small (smaller than their maximum in a given implementation), and thus to avoid tail drop and global TCP resynchronization. It is, therefore, expected that in router implementations all these service disciplines need to coexist and that some of them will be complementary to each other. Nevertheless, new proposals for both the AF and EF PHBs strongly suggest that Class Based Queuing (CBQ), WFQ, and their variants will play stronger roles in the implementation of DiffServ.
Regarding interaction between the PHBs, the EF draft says that other PHBs can coexist at the same DS node given that the requirements of AF classes are not violated. These requirements include timely forwarding, which is at the heart of EF. On the other hand, the AF PHB group distinguishes between the classes based on timely forwarding. The AF draft also says that "any other PHB groups may coexist with the AF group within the same DS domain provided that the other PHB groups do not preempt the resources allocated to the AF classes". The question here is: if they coexist, should EF have more timely forwarding than the highest timely forwarded AF class by preempting any AF class, as the EF document basically states?
What is needed here is that EF must leave AF whatever has been allocated for AF. This would mean EF can actually preempt forwarding resources for AF. For example, one could take a 1.5 Mbps link and allow for 64 Kbps of it to be available to EF, with the remaining capacity available to AF. One could also state that EF has absolute priority over AF (up to the 64 Kbps allocated). In this case, EF would preempt AF (so long as it conforms to the 64 Kbps limit) and AF would always be assured that it has 1.5 Mbps - 64 Kbps of the link throughput.
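
The capacity split in this example can be written down as a minimal sketch. The function `split_link`, the offered loads, and the reading of 1.5 Mbps as 1500 kbps are assumptions made for illustration; neither the EF nor the AF draft prescribes this code.

```python
def split_link(link_kbps, ef_limit_kbps, ef_offered_kbps, af_offered_kbps):
    """Serve EF first, capped at its allocation; AF gets what remains of the link."""
    # EF traffic above its limit is assumed to be dropped or shaped at the edge.
    ef_served = min(ef_offered_kbps, ef_limit_kbps)
    af_served = min(af_offered_kbps, link_kbps - ef_served)
    return ef_served, af_served

LINK_KBPS = 1500      # assumption: 1.5 Mbps taken as 1500 kbps
EF_LIMIT_KBPS = 64

busy = split_link(LINK_KBPS, EF_LIMIT_KBPS, ef_offered_kbps=200, af_offered_kbps=2000)
idle_ef = split_link(LINK_KBPS, EF_LIMIT_KBPS, ef_offered_kbps=0, af_offered_kbps=2000)
```

Under these assumptions AF never receives less than 1500 - 64 = 1436 kbps, and it may use the whole link whenever EF is idle.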
There are many more issues which are debatable and need attention in further research. However, we should always keep in mind that the whole point of DiffServ is to allow service providers to implement QoS pricing strategies in the first place.
References
[BBC+98] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An architecture for differentiated services. Request for Comments 2475, December 1998.
[BW98] Marty Borden and Christoph White. Management of phbs. Internet Draft draft-ietf-diffserv-phb-mgmt-00.txt, August 1998. Work in progress.
[CW97] D. Clark and J. Wroclawski. An approach to service allocation in the internet. Internet Draft draft-clark-diff-svc-alloc-00.txt, July 1997. Work in progress.
[FJ93] Sally Floyd and Van Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, August 1993.
[FJ95] Sally Floyd and Van Jacobson. Link-sharing and resource management models for packet networks. IEEE/ACM Transactions on Networking, 3(4), August 1995.
[HBWW99] Juha Heinanen, Fred Baker, Walter Weiss, and John Wroclawski. Assured forwarding phb group. Internet Draft draft-ietf-diffserv-af-06.txt, February 1999. Work in progress.
[JNP98] Van Jacobson, K. Nichols, and K. Poduri. An expedited forwarding phb. Internet Draft draft-ietf-diffserv-af-02.txt, October 1998. Work in progress.
[Kes91] S. Keshav. Congestion Control in Computer Networks. PhD thesis, Berkeley, September 1991.
[NBBB98] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the differentiated services field (ds field) in the ipv4 and ipv6 headers. Request for Comments 2474, December 1998.
[SCFJ96] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. Rtp: A transport protocol for real-time applications. Request for Comments 1889, January 1996.
[Wan97] Z. Wang. User-share differentiation (usd) scalable bandwidth allocation for differentiated services. Internet Draft draft-wang-diff-serv-usd-00.txt, November 1997. Work in progress.
A Portable Subroutine Library for Solving Linear Control Problems on Distributed Memory Computers*
Peter Benner^1, Enrique S. Quintana-Ortí^2, and Gregorio Quintana-Ortí^3
1 Zentrum für Technomathematik, Fachbereich 3 - Mathematik und Informatik, Universität Bremen, D-28334 Bremen, Germany; benner@math.uni-bremen.de.
2 Departamento de Informática, Universidad Jaime I, 12080 Castellón, Spain; quintana@inf.uji.es.
3 Same address as second author; gquintan@inf.uji.es.
Abstract. This paper describes the design of a software library for solving the basic computational problems that arise in analysis and synthesis of linear control systems. The library is intended for use in high performance computing environments based on parallel distributed memory architectures. The portability of the library is ensured by using the BLACS, PBLAS, and ScaLAPACK as the basic layer of communication and computational routines. Preliminary numerical results demonstrate the performance of the developed codes on parallel computers.
1 Introduction
In recent years, many new and reliable numerical methods have been developed for analysis and synthesis of moderate size linear time-invariant (LTI) systems. In generalized state-space form, such systems are described by the following models.
Continuous-time LTI system:
$$\begin{aligned} E\dot{x}(t) &= Ax(t) + Bu(t), \quad t > 0, \quad x(0) = x^{(0)}, \\ y(t) &= Cx(t), \quad t \geq 0. \end{aligned} \tag{1}$$
Discrete-time LTI system:
$$\begin{aligned} Ex_{k+1} &= Ax_k + Bu_k, \quad k = 0, 1, 2, \ldots, \quad x_0 = x^{(0)}, \\ y_k &= Cx_k, \quad k = 0, 1, 2, \ldots. \end{aligned} \tag{2}$$
In both cases, $A, E \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, and $C \in \mathbb{R}^{p \times n}$. Here we assume that $E$ is nonsingular. Descriptor systems with singular $E$ also lead, after appropriate transformations, to the above problem formulation with a nonsingular (and usually, diagonal or triangular) matrix $E$; see, e.g., [42,55].
* Partially supported by the DAAD programme Acciones Integradas Hispano-Alemanas. Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí were also supported by the Spanish CICYT Project TIC96-1062-C03-03.
The traditional approach to design a regulator for the above LTI systems involves the minimization of a cost functional of the form
$$\mathcal{J}_c(x_0, u) = \frac{1}{2} \int_0^\infty \left( y(t)^T Q y(t) + 2 y(t)^T S u(t) + u(t)^T R u(t) \right) dt \tag{3}$$
in the continuous-time case, and
$$\mathcal{J}_d(x_0, u) = \frac{1}{2} \sum_{k=0}^{\infty} \left( y_k^T Q y_k + 2 y_k^T S u_k + u_k^T R u_k \right) \tag{4}$$
in the discrete-time case. The matrices $Q \in \mathbb{R}^{p \times p}$, $S \in \mathbb{R}^{p \times m}$, and $R \in \mathbb{R}^{m \times m}$ are chosen in order to weight inputs and outputs of the system. The linear-quadratic regulator problem consists in minimizing (3) or (4) subject to the dynamics (1) or (2), respectively. It is well-known that the solution of this optimal control problem is given by the closed-loop control
$$u^*(t) = -R^{-1}(B^T X_c E + S^T C)\, x(t) =: F_c x(t), \quad t > 0, \tag{5}$$
in the continuous-time case, and
$$u_k^* = -(R + B^T X_d B)^{-1}(B^T X_d A + S^T C)\, x_k =: F_d x_k, \quad k = 0, 1, 2, \ldots, \tag{6}$$
in the discrete-time case. See, e.g., [1,42,38] for details and further references.
The matrices $X_c$ and $X_d$ in (5) and (6) denote particular solutions of the (generalized) continuous-time algebraic Riccati equation (CARE)
$$0 = \mathcal{R}_c(X) := C^T Q C + A^T X E + E^T X A - (E^T X B + C^T S) R^{-1} (B^T X E + S^T C), \tag{7}$$
and the (generalized) discrete-time algebraic Riccati equation (DARE)
$$0 = \mathcal{R}_d(X) := C^T Q C + A^T X A - E^T X E - (A^T X B + C^T S)(R + B^T X B)^{-1}(B^T X A + S^T C). \tag{8}$$
The optimal control in (5) and (6) is obtained from the stabilizing solutions of (7) and (8). That is, we need to compute $X_c$ and $X_d$ such that the resulting closed-loop matrices
$$A_c := E^{-1}(A + B F_c), \quad F_c := -R^{-1}(B^T X_c E + S^T C), \tag{9}$$
and
$$A_d := E^{-1}(A + B F_d), \quad F_d := -(R + B^T X_d B)^{-1}(B^T X_d A + S^T C) \tag{10}$$
have stable spectra. In the continuous-time case this means that $A_c$ has all its eigenvalues in the open left half plane while in the discrete-time case all the eigenvalues of $A_d$ are of modulus less than one. We will call matrices (matrix pencils) with all eigenvalues in the open left half plane c-stable and those with spectra inside the open unit disk will be called d-stable. The matrices $F_c$ and $F_d$ are called the optimal feedback gain matrices. Under standard assumptions on LTI systems and the weighting matrices in the cost functionals $\mathcal{J}_c$ and $\mathcal{J}_d$, the stabilizing solutions of the CARE and DARE exist, are unique, and symmetric. See [38] for a detailed account on conditions for existence and uniqueness of solutions to (7) and (8).
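
As a small worked illustration of (5), (7), and (9), consider the scalar case n = m = p = 1 with E = C = 1 and S = 0, where the CARE reduces to a quadratic equation. The function name `scalar_lqr` and the numerical values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def scalar_lqr(a, b, q, r):
    """Stabilizing root of 0 = q + 2*a*x - (b*x)**2 / r and the resulting gain.

    Scalar special case of the CARE (7) with E = C = 1 and S = 0.
    Assumes b != 0, q >= 0 and r > 0.
    """
    x = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)  # larger root
    f = -b * x / r          # feedback gain, scalar form of (5)
    a_closed = a + b * f    # closed-loop coefficient, scalar form of (9)
    return x, f, a_closed

x, f, a_closed = scalar_lqr(a=1.0, b=1.0, q=3.0, r=1.0)
residual = 3.0 + 2.0 * 1.0 * x - (1.0 * x) ** 2 / 1.0
```

The quadratic has two roots; only the larger one yields a negative closed-loop coefficient, which is the scalar meaning of "stabilizing solution". The smaller root would give a closed-loop coefficient of a + sqrt(a^2 + b^2 q / r) > 0.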
The algebraic Riccati equations in (7) and (8) can be formulated in the more general forms
$$0 = \mathcal{R}_c(X) = \tilde{Q} + \tilde{A}^T X E + E^T X \tilde{A} - E^T X \tilde{G} X E \tag{11}$$
for the CARE, and
$$0 = \mathcal{R}_d(X) = \tilde{Q} + \tilde{A}^T X \tilde{A} - E^T X E - (\tilde{A}^T X \tilde{B} + \tilde{S})(\tilde{R} + \tilde{B}^T X \tilde{B})^{-1}(\tilde{B}^T X \tilde{A} + \tilde{S}^T) \tag{12}$$
for the DARE. Written in this form, they also include the algebraic Riccati equations arising in many areas of modern control theory like robust control, $H_2$- and $H_\infty$-control, model reduction, etc.; see, e.g., [27,47,48,56]. The algorithms used here for the numerical solution of equations of the forms (11) and (12) do not depend on the particular form given in (7) and (8) and hence can be used to solve any algebraic Riccati equations given as in (11) and (12). Throughout this paper we will assume that stabilizing solutions of the CARE (11) and the DARE (12) exist and hence that the closed-loop matrices $A_c$ and $A_d$ defined above are c-stable and d-stable, respectively.
In the course of solving the above nonlinear systems of equations via Newton's method and in many other analysis and synthesis problems for LTI control problems, linear matrix equations of the form
$$\tilde{A} X \tilde{B} + \tilde{C} X \tilde{D} + \tilde{E} = 0 \tag{13}$$
have to be solved. Here $\tilde{A}, \tilde{C} \in \mathbb{R}^{n \times n}$, $\tilde{B}, \tilde{D} \in \mathbb{R}^{m \times m}$, and $\tilde{E}, X \in \mathbb{R}^{n \times m}$. Linear systems of equations as in (13) are called generalized Sylvester equations. Some particular instances of (13) are given below:
$$\tilde{A} X + X \tilde{D} + \tilde{E} = 0, \quad \text{(Sylvester equation)} \tag{14}$$
$$\tilde{A} X \tilde{B} - X + \tilde{E} = 0, \quad \text{("discrete" Sylvester equation)} \tag{15}$$
and for $\tilde{E} = \tilde{E}^T$
$$\tilde{A} X + X \tilde{A}^T + \tilde{E} = 0, \quad \text{(Lyapunov equation)} \tag{16}$$
$$\tilde{A} X \tilde{C}^T + \tilde{C} X \tilde{A}^T + \tilde{E} = 0, \quad \text{(generalized Lyapunov equation)} \tag{17}$$
$$\tilde{A} X \tilde{A}^T - X + \tilde{E} = 0, \quad \text{(Stein equation)} \tag{18}$$
$$\tilde{A} X \tilde{A}^T - \tilde{C} X \tilde{C}^T + \tilde{E} = 0. \quad \text{(generalized Stein equation)} \tag{19}$$
Stein equations are often also referred to as discrete Lyapunov equations.
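
That (14)-(19) are all instances of (13) can be checked numerically with a small dense solver. The sketch below rewrites (13) as an ordinary linear system via the Kronecker product identity vec(AXB) = (B^T ⊗ A) vec(X). This is a textbook device suitable only for very small n, not one of the algorithms proposed for the library; the function name and test matrices are illustrative assumptions.

```python
import numpy as np

def solve_generalized_sylvester(A, B, C, D, E):
    """Solve A X B + C X D + E = 0 for X by the dense Kronecker (vec) formulation.

    A, C are n x n, B, D are m x m, E is n x m. The linear system has n*m
    unknowns, so this is only practical for very small problems.
    """
    n, m = E.shape
    K = np.kron(B.T, A) + np.kron(D.T, C)            # acts on column-major vec(X)
    x = np.linalg.solve(K, -E.flatten(order="F"))
    return x.reshape((n, m), order="F")

I2 = np.eye(2)
W = np.eye(2)                                        # symmetric constant term
A_c = np.array([[-1.0, 2.0], [0.0, -3.0]])           # c-stable example matrix
A_d = np.array([[0.5, 0.2], [0.0, -0.3]])            # d-stable example matrix

# Lyapunov equation (16): A X + X A^T + W = 0, i.e. B = I, C = I, D = A^T.
X_lyap = solve_generalized_sylvester(A_c, I2, I2, A_c.T, W)
# Stein equation (18): A X A^T - X + W = 0, i.e. B = A^T, C = -I, D = I.
X_stein = solve_generalized_sylvester(A_d, A_d.T, -I2, I2, W)
```

The coefficient matrix `K` has (nm)^2 entries, which makes the storage and operation count of this direct formulation prohibitive as n grows and motivates the structured methods discussed in the text.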
In addition to the above, we will consider special cases of (generalized) Lyapunov and Stein equations where $\tilde{E}$ is semidefinite and factored as $\tilde{E} = \pm \tilde{E}_1 \tilde{E}_1^T$. In this case, if $\tilde{A} - \lambda \tilde{C}$ is a stable matrix pencil (the generalized eigenvalues of the matrix pencil are stable), then the solution of the corresponding Lyapunov or Stein equation is also semidefinite and can be factored as $X = \pm X_1 X_1^T$. This is the case, e.g., when computing the controllability Gramian $W_c$ and observability Gramian $W_o$ of a continuous-time LTI system via the Lyapunov equations
$$A W_c E^T + E W_c A^T + B B^T = 0, \tag{20}$$
$$A^T W_o E + E^T W_o A + C^T C = 0. \tag{21}$$
In the discrete-time case these Gramians are given by the corresponding Stein equations
$$A W_c A^T - E W_c E^T + B B^T = 0, \tag{22}$$
$$A^T W_o A - E^T W_o E + C^T C = 0. \tag{23}$$
The Gramians of LTI systems play a fundamental role in many analysis and design problems of LTI systems such as computing balanced, minimal, or partial realizations, the Hankel singular values and Hankel norm of LTI systems, and model reduction. Often, the Cholesky factors $X_1$ of the solutions to the above equations are needed. Hence, special algorithms are designed to compute these factors without ever forming the solution matrix explicitly.
We consider special algorithms for all the above equations. The subroutines resulting from implementing these algorithms will be used in order to tackle some computational problems for LTI systems:
C1 stabilize an LTI system, i.e., find $F \in \mathbb{R}^{m \times n}$ such that $E - \lambda(A + BF)$ is a stable matrix pencil;
C2 model reduction, i.e., find low-order matrices $(E_r, A_r, B_r, C_r)$ such that the LTI system defined by these matrices approximates the input-output behavior of the original system;
C3 solve the linear-quadratic optimization problems discussed above using (5) and (6);
C4 compute the optimal $H_2$ controller;
C5 compute a suboptimal $H_\infty$ controller.
In addition to the computational subroutines provided by the PBLAS and ScaLAPACK [15] and the solvers for the above linear and nonlinear matrix equations we will also need tools for the spectral decomposition of matrices and matrix pencils in order to accomplish Task C1.
The need for parallel computing in this area can be seen from the fact that already for a system with state-space dimension n = 1000, the corresponding Sylvester, Lyapunov, Stein, or Riccati equations represent a set of linear or nonlinear equations with one million unknowns. Systems of such a dimension driven by ordinary differential(-algebraic) equations are not uncommon in chemical engineering applications and are standard for second order systems arising from modeling mechanical multibody systems or large flexible space structures. We assume here that the coefficient matrices are dense and n < 6000. Larger systems, such as those arising from the discretization of partial differential equations, usually involve sparse matrices. If sparsity is to be exploited, other computational techniques have to be employed [34,45]. The algorithms considered here are implemented in Fortran 77 using the kernels in the libraries BLACS, PBLAS, and ScaLAPACK. The resulting subroutines will form a subroutine library with tentative name PLILCO, Parallel Software Library for Linear Control theory.
This prospectus of the future PLILCO is organized as follows. In Section 2 we will review the basic numerical algorithms that can be employed in order to solve the computational problems needed to accomplish the required tasks. In order to obtain a high portability of the subroutines to be implemented, we will follow the guidelines and computation model used in ScaLAPACK [15] as well as the implementation and documentation standards given in [16]. A short review of the parallel computing paradigms used and a survey of the design and contents of the prospective library will be given in Section 3. Preliminary results in Section 4 will demonstrate the performance of the developed subroutines in several parallel computing environments with shared/distributed memory. An outlook on future activities is given in Section 5.
2 Numerical Algorithms
2.1 The QR and QZ algorithms
The traditional approaches to solving the computational problems introduced in the preceding section involve the computation of invariant/deflating subspaces by means of the QR/QZ algorithms; see, e.g., [26,49].
The QR algorithm consists of an initial reduction step which transforms a given matrix $A \in \mathbb{R}^{n \times n}$ to upper Hessenberg form, i.e.,
$$A_0 := U_0^T A U_0, \tag{24}$$
where $U_0$ is orthogonal. Afterwards, a sequence of similarity transformations $A_{j+1} := U_{j+1}^T A_j U_{j+1}$ for $j = 0, 1, 2, \ldots$ is performed. The transformation matrices $U_j$ are chosen such that all iterates $A_j$ are upper Hessenberg matrices and converge to upper quasi-triangular form. That is, if $A_* = \lim_{j \to \infty} A_j$, then $A_*$ is upper triangular with $1 \times 1$ and $2 \times 2$ blocks on the diagonal. The $1 \times 1$ blocks correspond to real eigenvalues of $A$ while $2 \times 2$ blocks represent pairs of complex conjugate eigenvalues of $A$. Usually, convergence takes place in $O(n)$ iterations. The similarity transformations with $U_j$ can be implemented at a computational cost of $O(n^2)$ such that the overall computational cost of this algorithm is $O(n^3)$. If we set $\hat{U} := \lim_{j \to \infty} \prod_{k=0}^{j} U_k$, then $A_* = \hat{U}^T A \hat{U}$. The upper quasi-triangular matrix $A_*$ is called the (real) Schur form of $A$. Applying a finite sequence of orthogonal similarity transformations to $A_*$, the diagonal blocks can be swapped such that the upper $k \times k$ block of the transformed matrix contains those eigenvalues of $A$ that are inside some subset of the complex plane which is closed under complex conjugation. If we denote the accumulated transformation matrices that achieve this re-ordering by $\tilde{U}$ and set $U := \hat{U}\tilde{U}$ then the first $k$ columns of $U$ span the $A$-invariant subspace corresponding to these eigenvalues.
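
The idea of driving a matrix to Schur form by repeated orthogonal similarity transformations can be illustrated with the unshifted textbook QR iteration. This is a deliberate simplification: it omits the Hessenberg reduction, the implicit double shifts, and the re-ordering step described above, and it only converges nicely for matrices such as the assumed example below, whose eigenvalues are real with distinct moduli.

```python
import numpy as np

def qr_iteration(A, iters=200):
    """Unshifted QR iteration: A_{j+1} = R_j Q_j, where A_j = Q_j R_j.

    Returns the final iterate T and the accumulated orthogonal matrix U
    with U^T A U = T (up to rounding).
    """
    T = np.array(A, dtype=float)
    U = np.eye(T.shape[0])
    for _ in range(iters):
        Q, R = np.linalg.qr(T)
        T = R @ Q        # equals Q^T T Q, an orthogonal similarity transformation
        U = U @ Q        # accumulate the transformations
    return T, U

A = np.array([[4.0, 1.0], [2.0, 3.0]])   # assumed example with eigenvalues 5 and 2
T, U = qr_iteration(A)
```

For this example the first column of `U` spans the invariant subspace belonging to the eigenvalue in `T[0, 0]`, which is the one-dimensional analogue of the statement about the first k columns of U.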
The QZ algorithm applied to a matrix pencil $A - \lambda E$ computes orthogonal matrices $\hat{U}, \hat{Z} \in \mathbb{R}^{n \times n}$ such that $\hat{U}^T (A - \lambda E) \hat{Z} = A_* - \lambda E_*$, where $A_*$ is upper quasi-triangular and $E_*$ is upper triangular. Again matrices $\tilde{U}, \tilde{Z}$ can be chosen such that the first $k$ columns of $Z := \hat{Z}\tilde{Z}$ span a particular right deflating subspace of $A - \lambda E$ corresponding to some desired subset of eigenvalues of $A - \lambda E$. The QZ algorithm is equivalent to applying the QR algorithm to $A E^{-1}$ without ever forming the product or the inverse explicitly. The matrix pencil $A_* - \lambda E_*$ is called the generalized (real) Schur form of $A - \lambda E$.
In order to compute a spectral decomposition of a matrix or matrix pencil as required, e.g., in Task C1, the QR (QZ) algorithm can be applied to the matrix (pencil). The re-ordering must then be performed such that the spectrum of the leading $k \times k$ diagonal block of $A_*$ ($A_* - \lambda E_*$) corresponds to the eigenvalues on the one side of the line dividing the spectrum while the trailing diagonal block corresponds to the eigenvalues on the other side of this line. For continuous-time systems, usually a spectral division along the imaginary axis is needed while for discrete-time systems, the usual spectral division line is the unit circle.
When solving the symmetric linear matrix equations (16)-(19) with the most widely used method, the Bartels-Stewart method, the QR and QZ algorithms are used for initial reductions of the involved matrix $\tilde{A}$ or matrix pencil $\tilde{A} - \lambda \tilde{C}$ to upper quasi-triangular form. This initial stage is followed by a backsubstitution process in order to solve the resulting triangular systems. Note that the main computational work is done during the initial reduction. This approach is used, e.g., in [5,22,23,45] for the equations (16)-(19) and also when solving semidefinite Lyapunov and Stein equations of the form (20)-(23) in [30,54,45].
For the nonsymmetric equations (14) and (15), it is usually sufficient to transform one of the coefficient matrices to upper quasi-triangular form and the other one to Hessenberg form. This approach is called the Hessenberg-Schur method following [25] and is extended to (13) in [22,23].
The algebraic Riccati equations (11) and (12) can be solved via the QR/QZ algorithms using the relation to certain invariant or deflating subspaces of the corresponding matrices/matrix pencils. If the stable right deflating subspace of
$$H - \lambda K := \begin{bmatrix} \tilde{A} & \tilde{G} \\ \tilde{Q} & -\tilde{A}^T \end{bmatrix} - \lambda \begin{bmatrix} E & 0 \\ 0 & E^T \end{bmatrix} \tag{25}$$
is spanned by $\begin{bmatrix} Z_{11} \\ Z_{21} \end{bmatrix}$, $Z_{11}, Z_{21} \in \mathbb{R}^{n \times n}$, and $Z_{11}$ is invertible, the stabilizing solution of (11) is given by $X_c = -Z_{21} Z_{11}^{-1} E^{-1}$. Hence the CARE (11) can be solved applying the QZ algorithm to $H - \lambda K$ and re-ordering the eigenvalues such that the stable eigenvalues (i.e., those with negative real parts) appear in the upper $n \times n$ diagonal block of the generalized Schur form of $H - \lambda K$. Then the first $n$ columns of the matrix $Z$ computed by the QZ algorithm span the required stable right deflating subspace of $H - \lambda K$. Note that the optimal control $u^*(t)$ can be computed using $F_c = R^{-1}(B^T Z_{21} Z_{11}^{-1} - S^T C)$ without solving the CARE explicitly. In case $E = I_n$, it is sufficient to apply the QR algorithm to the Hamiltonian matrix $H$ from (25) and to order the Schur form of $H$ accordingly. This approach was first suggested in [39] and outlined in [3] for $E \neq I_n$. The resulting methods are called the (generalized) Schur vector methods.
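
For the standard case E = I_n, the Schur vector method fits in a few lines when an ordered real Schur form is available. The sketch below uses SciPy's `scipy.linalg.schur` with `sort="lhp"`, a serial LAPACK-based routine and not the parallel software discussed in this paper. It assumes the CARE is written as 0 = Q + A^T X + X A - X G X with Hamiltonian matrix H = [A, G; Q, -A^T], so that X = -Z21 Z11^{-1}; the double-integrator data are an illustrative assumption.

```python
import numpy as np
from scipy.linalg import schur

def care_schur(A, G, Q):
    """Stabilizing solution of 0 = Q + A^T X + X A - X G X via ordered Schur vectors."""
    n = A.shape[0]
    H = np.block([[A, G], [Q, -A.T]])     # Hamiltonian matrix (assumed sign convention)
    # Real Schur form with the eigenvalues in the open left half plane ordered first.
    _, Z, sdim = schur(H, output="real", sort="lhp")
    if sdim != n:
        raise ValueError("H does not have exactly n stable eigenvalues")
    Z11, Z21 = Z[:n, :n], Z[n:, :n]
    X = -np.linalg.solve(Z11.T, Z21.T).T  # X = -Z21 Z11^{-1}
    return (X + X.T) / 2                  # symmetrize against rounding errors

# Assumed example: double integrator with Q = I and R = 1, so G = B B^T.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
G = B @ B.T
Q = np.eye(2)
X = care_schur(A, G, Q)
```

For this example the stabilizing solution is known in closed form, X = [[sqrt(3), 1], [1, sqrt(3)]], which gives a simple check of the sign convention.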
Similar observations as in the continuous-time case lead to Schur vector methods for DAREs as given in (12). Here the QZ algorithm and an appropriate re-ordering are to be applied to
$$M - \lambda L = \begin{bmatrix} \tilde{A} & 0 & \tilde{B} \\ \tilde{Q} & -E^T & \tilde{S} \\ \tilde{S}^T & 0 & \tilde{R} \end{bmatrix} - \lambda \begin{bmatrix} E & 0 & 0 \\ 0 & -\tilde{A}^T & 0 \\ 0 & -\tilde{B}^T & 0 \end{bmatrix}. \tag{26}$$
If the generalized Schur form of $M - \lambda L$ is ordered such that the leading $n \times n$ diagonal block contains the eigenvalues inside the unit disk, then the first $n$ columns of the $Z$-matrix computed by the QZ algorithm span the stable (with respect to the unit circle) right deflating subspace of $M - \lambda L$. Partitioning these $n$ columns of $Z$ as $[Z_{11}^T, Z_{21}^T, Z_{31}^T]^T$, where $Z_{11}, Z_{21} \in \mathbb{R}^{n \times n}$, $Z_{31} \in \mathbb{R}^{m \times n}$, and assuming $Z_{11}$ nonsingular, $X_d = Z_{21} Z_{11}^{-1} E^{-1}$ and $F_d = Z_{31} Z_{11}^{-1}$ [42]. Note that using this approach it is possible to compute the optimal control $u_k^*$ directly without solving the DARE explicitly. The computational cost of this approach can be lowered if $\tilde{R}$ is invertible and well-conditioned by applying the QZ algorithm to
$$\tilde{M} - \lambda \tilde{L} = \begin{bmatrix} \tilde{A} - \tilde{B}\tilde{R}^{-1}\tilde{S}^T & 0 \\ \tilde{Q} - \tilde{S}\tilde{R}^{-1}\tilde{S}^T & -E^T \end{bmatrix} - \lambda \begin{bmatrix} E & \tilde{B}\tilde{R}^{-1}\tilde{B}^T \\ 0 & -(\tilde{A} - \tilde{B}\tilde{R}^{-1}\tilde{S}^T)^T \end{bmatrix}. \tag{27}$$
If $E = I_n$, $\tilde{M} - \lambda \tilde{L}$ is a symplectic matrix pencil. These Schur vector methods for the discrete-time case have been proposed in [3,44,53].
If the standard approaches to the spectral division problem and to the solution of the linear and nonlinear matrix equations described above are to be used for computations on parallel distributed memory computers, we will need efficient implementations of the QR and QZ algorithms for these computing environments. In ScaLAPACK, only the QR algorithm is available so far. However, in order to solve the linear matrix equations considered here, the QR algorithm can only be used for (14)-(16) and (18). In all other cases, the QZ algorithm has to be employed in the initial stage when solving these equations via the most frequently used Hessenberg-Schur and Bartels-Stewart methods as described above. Solving (11) and (12) by the (generalized) Schur vector methods, again the QR algorithm can only be used in the CARE case with $E = I_n$; for all other cases, the QZ algorithm is needed.
A different approach to solving the algebraic Riccati equations (11) and (12) is to consider these equations as nonlinear sets of equations. From this perspective, the most obvious choice to solve algebraic Riccati equations is Newton's method. In each iteration step of Newton's method applied to CAREs or DAREs [3,33,37,38,42], a (generalized) Lyapunov or Stein equation of the form (16)-(19) has to be solved; see Section 2.4 below. Thus a parallel implementation of Newton's method also depends heavily on the parallel performance of the Lyapunov or Stein solver employed, i.e., if the Bartels-Stewart method is to be used, once more on the efficiency of the parallelized QR/QZ algorithms.
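
The structure "one Lyapunov solve per Newton step" can be made concrete for the standard CARE 0 = Q + A^T X + X A - X G X (that is, E = I_n). With A_k = A - G X_k, the Newton update solves A_k^T X_{k+1} + X_{k+1} A_k + Q + X_k G X_k = 0. In this sketch the Lyapunov equation is solved by a dense Kronecker formulation that is only sensible for tiny n; the function names, the example data, and the choice X_0 = 0 (stabilizing because the example A is itself c-stable) are assumptions for illustration.

```python
import numpy as np

def solve_lyapunov(Ak, W):
    """Solve Ak^T X + X Ak + W = 0 by the dense Kronecker formulation (small n only)."""
    n = Ak.shape[0]
    I = np.eye(n)
    K = np.kron(I, Ak.T) + np.kron(Ak.T, I)      # acts on column-major vec(X)
    x = np.linalg.solve(K, -W.flatten(order="F"))
    return x.reshape((n, n), order="F")

def care_newton(A, G, Q, X0, tol=1e-12, max_iter=50):
    """Newton's method for 0 = Q + A^T X + X A - X G X, started at a stabilizing X0."""
    X = X0
    for _ in range(max_iter):
        Ak = A - G @ X                           # current closed-loop matrix
        X_new = solve_lyapunov(Ak, Q + X @ G @ X)
        X_new = (X_new + X_new.T) / 2            # keep the iterate symmetric
        if np.linalg.norm(X_new - X, "fro") <= tol * max(1.0, np.linalg.norm(X_new, "fro")):
            return X_new
        X = X_new
    return X

A = np.array([[-1.0, 1.0], [0.0, -2.0]])         # c-stable, so X0 = 0 is stabilizing
B = np.array([[0.0], [1.0]])
G = B @ B.T
Q = np.eye(2)
X = care_newton(A, G, Q, X0=np.zeros((2, 2)))
```

All of the work in each step sits in `solve_lyapunov`, which is why the performance of the Lyapunov or Stein solver dominates a parallel implementation of Newton's method.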
From the above considerations we can conclude that in order to use the traditional algorithms for solving linear and algebraic Riccati matrix equations, it is necessary to have efficient parallelizations of the QR and QZ algorithms. However, several experimental studies report the difficulties in parallelizing the double implicit shifted QR algorithm on parallel distributed multiprocessors (see, e.g., [17,24,31,51]). The algorithm presents a fine granularity which introduces performance losses due to communication start-up overhead (latency). Besides, traditional data layouts (column/row block scattered) lead to an unbalanced distribution of the computational load. A different approach relies on a block Hankel distribution, which improves the balancing of the computational load [31]. Attempts to increase the granularity by employing multishift techniques have been recently proposed in [32]. Nevertheless, the parallelism and scalability of these algorithms are still far from those of matrix multiplications, matrix factorizations, triangular linear systems solvers, etc.; see, e.g., [15] and the references given therein.
Although the parallelization of the QR algorithm has been thoroughly studied, in contrast, the parallelization of the QZ algorithm remains unexplored to the best of our knowledge. Moreover, since both the QR and the QZ algorithms are composed of the same type of fine-grain computations, similar or even worse parallelism and scalability results are to be expected from the QZ algorithm.
In order to avoid the problems arising from the difficult parallelization of the QR and QZ algorithms, we will use a different computational approach here. It is well-known that under suitable assumptions, the above matrix equations can be solved via the sign function method. It has long been acknowledged that algorithms based on the sign function are relatively easy to parallelize. The methods that will be employed in the PLILCO will be considered in the next sections.
2.2 The Sign Function Method and the Smith Iteration
The sign function method was first introduced in 1971 by Roberts [46] for solving algebraic Riccati equations of the form (11) with $E = I_n$. Roberts also shows how to solve stable Sylvester and Lyapunov equations via the matrix sign function. The application to CAREs and DAREs with $E \neq I_n$ is investigated in [20,21] while the application to (16) with $E \neq I_n$ is examined in [13].
The computation of the sign function requires basic numerical linear algebra tools like matrix multiplication, inversion and/or solving linear systems. These computations are implemented efficiently on most parallel architectures and, in particular, ScaLAPACK [15] provides easy to use and portable computational kernels for these operations. Hence, the sign function method is an appropriate tool to design and implement efficient and portable numerical software for distributed memory parallel computers.
Let $Z \in \mathbb{R}^{n \times n}$ have no eigenvalues on the imaginary axis and denote by $Z = S \begin{bmatrix} J^- & 0 \\ 0 & J^+ \end{bmatrix} S^{-1}$ its Jordan decomposition with $J^- \in \mathbb{C}^{k \times k}$, $J^+ \in \mathbb{C}^{(n-k) \times (n-k)}$ containing the Jordan blocks corresponding to the eigenvalues in the open left and right half planes, respectively. Then the matrix sign function of $Z$ is defined as
$$\operatorname{sign}(Z) := S \begin{bmatrix} -I_k & 0 \\ 0 & I_{n-k} \end{bmatrix} S^{-1}. \tag{28}$$
Note that $\operatorname{sign}(Z)$ is unique and independent of the order of the eigenvalues in the Jordan decomposition of $Z$ (see, e.g., [38, Section 22.1]). Many other equivalent definitions for $\operatorname{sign}(Z)$ can be given; see, e.g., the recent survey paper [35].
The application of the matrix sign function method to a matrix pencil $Z - \lambda Y$ as given in [20] in case $Z$ and $Y$ are nonsingular can be presented as
$$Z_0 := Z, \quad Z_{k+1} := \frac{1}{2 c_k} \left( Z_k + c_k^2\, Y Z_k^{-1} Y \right), \quad k = 0, 1, 2, \ldots, \tag{29}$$
where $c_k$ is a scaling parameter. E.g., for determinantal scaling, $c_k$ is given as $c_k = (|\det(Z_k)| / |\det(Y)|)^{1/n}$ [20]. This iteration is equivalent to computing the sign function of the matrix $Y^{-1} Z$ via the standard Newton iteration as proposed in [46]. The property needed here is that if $Z_\infty := \lim_{k \to \infty} Z_k$, then $(Z_\infty - Y)/2$ (or $(Z_\infty + Y)/2$) defines the skew projection onto the stable (or anti-stable) right deflating subspace of $Z - \lambda Y$ parallel to the anti-stable (or stable) deflating subspace.
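
A direct transcription of iteration (29) with determinantal scaling shows how few building blocks the method needs: a determinant, a linear solve, and matrix products. The stopping rule, the function name, and the test matrices are assumptions made for this sketch, which works with dense NumPy arrays rather than distributed ones.

```python
import numpy as np

def generalized_sign_iteration(Z, Y=None, tol=1e-12, max_iter=100):
    """Iterate Z <- (Z + c^2 Y Z^{-1} Y) / (2c) with determinantal scaling c.

    With Y = I this is the scaled Newton iteration for sign(Z); in general
    the limit is Y sign(Y^{-1} Z). Z and Y must be nonsingular, and Y^{-1} Z
    must have no eigenvalues on the imaginary axis.
    """
    Z = np.array(Z, dtype=float)
    n = Z.shape[0]
    Y = np.eye(n) if Y is None else np.array(Y, dtype=float)
    abs_det_Y = abs(np.linalg.det(Y))
    for _ in range(max_iter):
        c = (abs(np.linalg.det(Z)) / abs_det_Y) ** (1.0 / n)
        Z_next = (Z + c * c * Y @ np.linalg.solve(Z, Y)) / (2.0 * c)
        if np.linalg.norm(Z_next - Z, 1) <= tol * np.linalg.norm(Z_next, 1):
            return Z_next
        Z = Z_next
    return Z

Z = np.array([[-2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, -1.0]])          # eigenvalues -2, 3, -1
S = generalized_sign_iteration(Z)         # sign(Z)
P = (np.eye(3) - S) / 2                   # projector onto the stable invariant subspace

Y = np.diag([2.0, 1.0, 1.0])
S_gen = generalized_sign_iteration(Z, Y)  # limit for the pencil Z - lambda*Y
```

In the standard case Y = I the matrix `P` is idempotent and its trace equals the number of stable eigenvalues, here 2.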
In [20] the iteration (29) is used to compute the stabilizing solution of the CARE (11) and the DARE (12) using the matrix pencils (25) and (27). The algebraic Riccati equations (11) can be solved by applying (29) to $Z - \lambda Y = H - \lambda K$ and then forming the resulting projector $Z_\infty - Y$ onto the stable deflating subspace of $H - \lambda K$. A basis of this subspace is then given by the range of that projector. This subspace is usually not computed explicitly as $X_E := X_c E$ can be obtained by solving the overdetermined but consistent set of linear equations
$$\begin{bmatrix} Z_{12} \\ Z_{22} + E^T \end{bmatrix} X_E = \begin{bmatrix} Z_{11} + E \\ Z_{21} \end{bmatrix}, \tag{30}$$
where $Z_\infty = \begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix}$; see [18,20,38,46]. The matrix $X_c$ can be obtained by solving $X_E = X_c E$ while the optimal gain matrix and therefore the optimal control is obtained directly using $F_c = -R^{-1}(B^T X_E + S^T C)$.
The DARE (12) cannot be solved directly using the sign function method
as we need the d-stable deflating subspace of the matrix pencil from (27). One
possibility to switch back and forth between c- and d-stable matrix pencils
$A - \lambda B$ (or c- and d-stable deflating subspaces) is the Cayley transformation
$$C_\mu(A - \lambda B) = (\mu A + B) - \lambda (A - \mu B), \qquad |\mu| = 1, \quad \det(A - \mu B) \neq 0.$$
In order to keep computations real, one has to choose $\mu = \pm 1$; here we restrict
ourselves to $\mu = 1$. It is well-known (see, e.g., [40,43]) that if $A - \lambda B$ is
c-stable (d-stable), then $C_\mu(A - \lambda B)$ is d-stable (c-stable) and the c-stable
(d-stable) right deflating subspace of $A - \lambda B$ is the d-stable (c-stable) right
deflating subspace of $C_\mu(A - \lambda B)$. Hence, the DARE (12) can be solved with
the sign function method applied to the Cayley transform $C_\mu$ of the matrix
pencil from (27). The solution $X_d$ is then obtained from (30) replacing $X_c$ by $X_d$.
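The effect of the Cayley transformation on the spectrum can be checked numerically: an eigenvalue $\lambda$ of $A - \lambda B$ is mapped to $(\mu\lambda + 1)/(\lambda - \mu)$. The sketch below uses $\mu = 1$; the helper name `cayley` and the small test pencil are assumptions made for illustration only.

```python
import numpy as np
from scipy.linalg import eigvals


def cayley(A, B, mu=1.0):
    """Return (A_c, B_c) with A_c - lambda * B_c = C_mu(A - lambda * B)."""
    return mu * A + B, A - mu * B


# Hypothetical c-stable pencil A - lambda * I with eigenvalues -1 and -3.
A = np.array([[-1.0, 2.0], [0.0, -3.0]])
B = np.eye(2)
A_c, B_c = cayley(A, B)

# Generalized eigenvalues of the transformed pencil; they are expected to
# be (lambda + 1) / (lambda - 1), i.e. 0 and 0.5, inside the unit disk.
lam_c = eigvals(A_c, B_c)
```

Applying `cayley` a second time maps the eigenvalues back into the open left half plane, which is the back-and-forth switching described above.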
Note that none of the methods considered so far can be used to solve
(12) via (26): as we need the d-stable deflating subspace of $M - \lambda L$, the sign
function method cannot be applied directly. Though this subspace is given
by the c-stable right deflating subspace of the Cayley transformed matrix
pencil $C_\mu(M - \lambda L)$, the sign function method can in general not be used here
as $M + L$ and $M - L$ may be singular.
A different approach to solve the spectral division problem and the
considered matrix equations is reviewed in Section 2.3. This approach will also
overcome the problems for the DARE (12) mentioned above.
The (generalized) Lyapunov and Stein equations (16) are special instances
of the CARE (11) and DARE (12), respectively. This implies that one can
solve (16) and (17) by means of the sign function method applied to the
matrix pencil in (25) which then takes the form
$$H - \lambda K = \begin{bmatrix} \tilde{A} & 0 \\ \tilde{E} & -\tilde{A}^T \end{bmatrix} - \lambda \begin{bmatrix} \tilde{C} & 0 \\ 0 & \tilde{C}^T \end{bmatrix}. \qquad (31)$$
For stable matrix pencils $\tilde{A} - \lambda \tilde{C}$, $H - \lambda K$ is regular and has an $n$-dimensional
stable deflating subspace such that the solution of (16) can be obtained
analogously to that of (11).
In [13] it is observed that applying the generalized Newton iteration (29)
to the matrix pencil $H - \lambda K$ in (31) and exploiting the block-triangular
structure of all matrices involved, (29) boils down to
$$A_0 := \tilde{A}, \qquad A_{k+1} := \tfrac{1}{2} \left( A_k + \tilde{C} A_k^{-1} \tilde{C} \right),$$
$$E_0 := \tilde{E}, \qquad E_{k+1} := \tfrac{1}{2} \left( E_k + \tilde{C}^T A_k^{-T} E_k A_k^{-1} \tilde{C} \right), \qquad k = 0, 1, 2, \ldots \qquad (32)$$
and that $X = \tfrac{1}{2} \tilde{C}^{-T} (\lim_{k \to \infty} E_k) \tilde{C}^{-1}$. In case $\tilde{C} = I_n$, the iteration in (32)
has already been derived by Roberts [46]. The semidefinite Lyapunov
equations as in (20)-(23) can be solved using a factored version of the iteration
for the $E_k$'s in (32), i.e., the iteration is performed starting with a factor
of the constant term, $\tilde{E} = \tilde{E}_1^T \tilde{E}_1$. This iteration then converges to $\sqrt{2} X_1 \tilde{C}$ if the solution is
factored as $X = X_1^T X_1$. Details of this algorithm can be found in [13] and
its application to computing the system Gramians for continuous-time LTI
systems as given in (20), (21) is described in [11].
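A serial NumPy sketch of the iteration (32) follows. It assumes, as inferred from (31) and (32) rather than quoted from (16), that the generalized Lyapunov equation has the form $\tilde{A}^T X \tilde{C} + \tilde{C}^T X \tilde{A} + \tilde{E} = 0$ with a stable pencil $\tilde{A} - \lambda\tilde{C}$. The function name `sign_lyapunov`, the unscaled iteration, the stopping test, and the test data are illustrative assumptions.

```python
import numpy as np


def sign_lyapunov(A, C, E, tol=1e-12, max_iter=100):
    """Solve A^T X C + C^T X A + E = 0 (assumed form) via iteration (32)."""
    Ak = np.array(A, dtype=float)
    Ek = np.array(E, dtype=float)
    C = np.array(C, dtype=float)
    for _ in range(max_iter):
        W = np.linalg.solve(Ak, C)            # W = A_k^{-1} C
        A_next = 0.5 * (Ak + C @ W)           # (A_k + C A_k^{-1} C) / 2
        E_next = 0.5 * (Ek + W.T @ Ek @ W)    # (E_k + C^T A_k^{-T} E_k A_k^{-1} C) / 2
        converged = (np.linalg.norm(A_next - Ak, "fro")
                     <= tol * np.linalg.norm(A_next, "fro"))
        Ak, Ek = A_next, E_next
        if converged:
            break
    # X = (1/2) C^{-T} E_inf C^{-1}, formed with two linear solves.
    E_Cinv = np.linalg.solve(C.T, Ek.T).T     # E_inf C^{-1}
    return 0.5 * np.linalg.solve(C.T, E_Cinv)


# Hypothetical test problem with a known symmetric solution.
A = np.array([[-2.0, 1.0], [0.0, -3.0]])
C = np.array([[1.0, 0.0], [0.5, 2.0]])
X_true = np.array([[2.0, 1.0], [1.0, 3.0]])
E = -(A.T @ X_true @ C + C.T @ X_true @ A)
X = sign_lyapunov(A, C, E)
```

Only one linear solve with $A_k$ and a few matrix products are needed per step, which is why the scheme is attractive on parallel machines.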
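In case the spectra of $\tilde{A} - \lambda \tilde{C}$ and $\tilde{D} - \lambda \tilde{B}$ satisfy $\sigma(\tilde{A}, \tilde{C}) \subset \mathbb{C}^-$ and
$\sigma(\tilde{D}, \tilde{B}) \subset \mathbb{C}^-$, the Sylvester equation (14) can also be solved using the sign
function method applied to
$$H - \lambda K = \begin{bmatrix} \tilde{A} & \tilde{E} \\ 0 & -\tilde{D} \end{bmatrix} - \lambda \begin{bmatrix} \tilde{C} & 0 \\ 0 & \tilde{B} \end{bmatrix}. \qquad (33)$$
Using again the block-triangular structure of the matrix pencil $H - \lambda K$, the
iteration can be performed on the blocks as follows:
$$A_0 := \tilde{A}, \qquad D_0 := \tilde{D}, \qquad E_0 := \tilde{E},$$
$$A_{k+1} := \tfrac{1}{2} \left( A_k + \tilde{C} A_k^{-1} \tilde{C} \right),$$
$$D_{k+1} := \tfrac{1}{2} \left( D_k + \tilde{B} D_k^{-1} \tilde{B} \right), \qquad k = 0, 1, 2, \ldots \qquad (34)$$
$$E_{k+1} := \tfrac{1}{2} \left( E_k + \tilde{C} A_k^{-1} E_k D_k^{-1} \tilde{B} \right).$$
The solution of (13) is then given by the solution of the linear system of
equations $2 \tilde{C} X \tilde{B} = \lim_{k \to \infty} E_k$.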
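The block iteration (34) can be sketched in the same way. The sketch assumes, as inferred from (34) and the final linear system rather than quoted from (13), that the generalized Sylvester equation reads $\tilde{A} X \tilde{B} + \tilde{C} X \tilde{D} + \tilde{E} = 0$. The name `sign_sylvester`, the stopping test, and the test data are hypothetical.

```python
import numpy as np


def sign_sylvester(A, B, C, D, E, tol=1e-12, max_iter=100):
    """Solve A X B + C X D + E = 0 (assumed form) via the iteration (34)."""
    Ak, Dk, Ek = (np.array(M, dtype=float) for M in (A, D, E))
    B = np.array(B, dtype=float)
    C = np.array(C, dtype=float)
    for _ in range(max_iter):
        C_Ainv = np.linalg.solve(Ak.T, C.T).T   # C A_k^{-1}
        Dinv_B = np.linalg.solve(Dk, B)         # D_k^{-1} B
        A_next = 0.5 * (Ak + C_Ainv @ C)
        D_next = 0.5 * (Dk + B @ Dinv_B)
        E_next = 0.5 * (Ek + C_Ainv @ Ek @ Dinv_B)
        change = (np.linalg.norm(A_next - Ak, "fro")
                  + np.linalg.norm(D_next - Dk, "fro"))
        size = np.linalg.norm(A_next, "fro") + np.linalg.norm(D_next, "fro")
        Ak, Dk, Ek = A_next, D_next, E_next
        if change <= tol * size:
            break
    # Solve 2 C X B = E_inf for X.
    E_Binv = np.linalg.solve(B.T, Ek.T).T       # E_inf B^{-1}
    return 0.5 * np.linalg.solve(C, E_Binv)


# Hypothetical test problem: both pencils A - lambda*C and D - lambda*B are stable.
A = np.array([[-2.0, 1.0], [0.0, -3.0]])
C = np.array([[1.0, 0.0], [0.5, 2.0]])
D = np.array([[-1.0, 0.0], [1.0, -2.0]])
B = np.array([[2.0, 0.0], [0.0, 1.0]])
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
E = -(A @ X_true @ B + C @ X_true @ D)
X = sign_sylvester(A, B, C, D, E)
```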
In case $\tilde{C} = I_n$ and $\tilde{B} = I_m$, other iterative schemes for computing the
sign function like the Newton-Schulz iteration or Halley's method can also be
implemented efficiently to solve the corresponding Lyapunov and Sylvester
equations (16) and (14); details of the resulting algorithms will be reported
in [14].
So far we have only considered the linear matrix equations for continuous-time
control problems. That is, we have assumed stability with respect to
the imaginary axis. In discrete-time control problems, stability properties are
given with respect to the unit circle. The linear matrix equations encountered
in discrete-time control problems are (15), (18), and (19). Let us first consider
(15) of which (18) is a special instance. If we rewrite the equation in fixed
point form, $X = \tilde{A} X \tilde{B} + \tilde{E}$, and form the fixed point iteration
$$X_0 := \tilde{E}, \qquad X_{k+1} = \tilde{E} + \tilde{A} X_k \tilde{B}, \qquad k = 0, 1, 2, \ldots,$$
then this iteration converges to $X$ if $\tilde{A}$ and $\tilde{B}$ are d-stable. The convergence
rate of this iteration is linear. A quadratically convergent version of the fixed
point iteration is suggested in [19,50],
$$A_0 := \tilde{A}, \qquad B_0 := \tilde{B}, \qquad X_0 := \tilde{E},$$
$$X_{k+1} := A_k X_k B_k + X_k, \qquad A_{k+1} := A_k^2, \qquad B_{k+1} := B_k^2, \qquad k = 0, 1, 2, \ldots \qquad (35)$$
The above iteration is referred to as the Smith iteration. We employ it to
solve (18) and (15). In case (19) is to be solved with the Smith iteration, one
has to apply (35) to $(\tilde{A} \tilde{C}^{-1})^T X (\tilde{A} \tilde{C}^{-1}) - X + \tilde{C}^{-T} \tilde{E} \tilde{C}^{-1} = 0$. This has the
disadvantage that the iteration is started with data that is already corrupted
by roundoff errors basically determined by $\operatorname{cond}(\tilde{C})$, i.e., the condition of $\tilde{C}$
with respect to matrix inversion defined by $\operatorname{cond}(\tilde{C}) = \|\tilde{C}\| \, \|\tilde{C}^{-1}\|$.
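The Smith iteration (35) is short enough to sketch directly. The code below is a serial illustration for the fixed point form $X = \tilde{A} X \tilde{B} + \tilde{E}$; the function name `smith_iteration`, the stopping test, and the test matrices are assumptions introduced here.

```python
import numpy as np


def smith_iteration(A, B, E, tol=1e-14, max_iter=60):
    """Squared Smith iteration (35) for X = A X B + E with d-stable A and B."""
    Ak = np.array(A, dtype=float)
    Bk = np.array(B, dtype=float)
    Xk = np.array(E, dtype=float)
    for _ in range(max_iter):
        update = Ak @ Xk @ Bk
        Xk = Xk + update                 # X_{k+1} = A_k X_k B_k + X_k
        if np.linalg.norm(update, "fro") <= tol * np.linalg.norm(Xk, "fro"):
            break
        Ak = Ak @ Ak                     # A_{k+1} = A_k^2
        Bk = Bk @ Bk                     # B_{k+1} = B_k^2
    return Xk


# Hypothetical d-stable coefficient matrices and a known solution.
A = np.array([[0.5, 0.1], [0.0, 0.3]])
B = np.array([[0.2, 0.0], [0.4, -0.6]])
X_true = np.array([[1.0, -2.0], [0.5, 3.0]])
E = X_true - A @ X_true @ B
X = smith_iteration(A, B, E)
```

After $k$ steps the iterate contains the first $2^k$ terms of the series $\sum_j \tilde{A}^j \tilde{E} \tilde{B}^j$, which is the source of the quadratic convergence mentioned above.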
One possibility to avoid the initial inversion of $\tilde{C}$ when solving (19) by
the Smith iteration is to transform (19) to a generalized Lyapunov equation
without inverting any matrices using the Cayley transformation and then
applying (32) to the transformed equation
$$(\tilde{A} + \tilde{C})^T X (\tilde{A} - \tilde{C}) + (\tilde{A} - \tilde{C})^T X (\tilde{A} + \tilde{C}) + 2 \tilde{E} = 0 \qquad (36)$$
which has the same solution as (19). Of course, the same approach can be
used for (18) setting $\tilde{C} = I_n$. But this yields a generalized Lyapunov equation.
In order to obtain a standard Lyapunov equation of the form (16) one has
to multiply (36) from the left by $(\tilde{A} - \tilde{C})^{-T}$ and by $(\tilde{A} - \tilde{C})^{-1}$ from the right.
This introduces again unnecessary rounding errors and we will therefore not
follow this approach here.
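That (36) and (19) share the same solution can be seen by expanding the two products. The form of (19) used in the last line, $\tilde{A}^T X \tilde{A} - \tilde{C}^T X \tilde{C} + \tilde{E} = 0$, is inferred from (36) and from the transformed equation given above for the Smith iteration; it is an assumption of this sketch rather than a quotation.

```latex
\begin{aligned}
(\tilde{A}+\tilde{C})^T X (\tilde{A}-\tilde{C})
  &= \tilde{A}^T X \tilde{A} - \tilde{A}^T X \tilde{C}
   + \tilde{C}^T X \tilde{A} - \tilde{C}^T X \tilde{C}, \\
(\tilde{A}-\tilde{C})^T X (\tilde{A}+\tilde{C})
  &= \tilde{A}^T X \tilde{A} + \tilde{A}^T X \tilde{C}
   - \tilde{C}^T X \tilde{A} - \tilde{C}^T X \tilde{C}, \\
\text{sum} + 2\tilde{E}
  &= 2\left( \tilde{A}^T X \tilde{A} - \tilde{C}^T X \tilde{C} + \tilde{E} \right).
\end{aligned}
```

The mixed terms cancel, so (36) is twice the assumed form of (19) and no matrix inverse is needed to set it up.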
2.3 The Disk Function Method
Let $Z - \lambda Y$, $Z, Y \in \mathbb{R}^{n \times n}$, be a regular matrix pencil having no eigenvalues
on the unit circle. Suppose the Weierstraß (Kronecker) canonical form of
$Z - \lambda Y$ is given by
$$Z - \lambda Y = T \begin{bmatrix} J^0 - \lambda I & 0 \\ 0 & J^\infty - \lambda N \end{bmatrix} S^{-1}, \qquad (37)$$
where Jordan blocks corresponding to eigenvalues inside the unit disk are
collected in $J^0$ while $J^\infty$ corresponds to eigenvalues outside the unit disk
and $N$ contains nilpotent blocks corresponding to infinite eigenvalues. The
matrix pencil disk function is defined in [6] as
$$\operatorname{disk}(Z, Y) := S \left( \begin{bmatrix} I_k & 0 \\ 0 & 0 \end{bmatrix} - \lambda \begin{bmatrix} 0 & 0 \\ 0 & I_{n-k} \end{bmatrix} \right) S^{-1} =: D_Z - \lambda D_Y.$$
A matrix disk function was also introduced in [46] using a different approach.
In [6] it is shown that this is a special case of the above definition using
$Y = I_n$. From the disk function, we can obtain the d-stable deflating subspace
of $Z - \lambda Y$ as $D_Z$ is a skew projector onto this subspace. Hence, a basis for
this subspace is given by a basis of the column space of $D_Z$.
The disk function has received some interest in recent years as it provides
the mathematical framework for an algorithm proposed in [41] and made
feasible for practical computations in [4] for solving the spectral division
problem. This inverse free spectral division algorithm can be given as follows:
$$Z_0 := Z, \qquad Y_0 := Y,$$
$$\begin{bmatrix} Y_k \\ -Z_k \end{bmatrix} =: \begin{bmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{bmatrix} \begin{bmatrix} R_k \\ 0 \end{bmatrix} \quad \text{(QR decomposition)}, \qquad k = 0, 1, 2, \ldots \qquad (38)$$
$$Z_{k+1} := U_{12}^T Z_k, \qquad Y_{k+1} := U_{22}^T Y_k.$$
It follows that $\operatorname{disk}(Z, Y) = (Z_\infty + Y_\infty)^{-1} (Y_\infty - \lambda Z_\infty)$, where
$\lim_{k \to \infty} (Z_k, Y_k) =: (Z_\infty, Y_\infty)$. Hence a basis for
the d-stable right deflating subspace of $Z - \lambda Y$ can be computed via a
rank-revealing QR decomposition of $(Z_\infty + Y_\infty)^{-1} Y_\infty$. Note that this QR
decomposition can be computed without explicitly inverting $(Z_\infty + Y_\infty)$;
see [4] for details. Moreover, a complete spectral
decomposition of $Z - \lambda Y$ along the unit circle can be computed using only
one iteration of the form (38); see [52].
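A minimal serial sketch of the iteration (38) is given below. It forms the projector $(Z_k + Y_k)^{-1} Y_k$ with an explicit linear solve purely for illustration, whereas the text notes that the inversion can be avoided; the function name `disk_projector`, the stopping test on the projector, and the test pencil are assumptions made here.

```python
import numpy as np
from scipy.linalg import qr


def disk_projector(Z, Y, tol=1e-12, max_iter=60):
    """Approximate D_Z = (Z_inf + Y_inf)^{-1} Y_inf via the iteration (38)."""
    n = Z.shape[0]
    Zk = np.array(Z, dtype=float)
    Yk = np.array(Y, dtype=float)
    D_old = None
    for _ in range(max_iter):
        # Full QR decomposition of the stacked 2n x n matrix [Y_k; -Z_k].
        U, _ = qr(np.vstack([Yk, -Zk]))
        U12, U22 = U[:n, n:], U[n:, n:]
        Zk, Yk = U12.T @ Zk, U22.T @ Yk
        # Current approximation of the skew projector D_Z.
        D = np.linalg.solve(Zk + Yk, Yk)
        if D_old is not None and np.linalg.norm(D - D_old, "fro") <= tol:
            break
        D_old = D
    return D


# Hypothetical pencil Z - lambda*Y with eigenvalues 0.5, -0.2 (inside the
# unit disk) and 3 (outside).
S = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
M = S @ np.diag([0.5, -0.2, 3.0]) @ np.linalg.inv(S)
Y = np.array([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 0.0, 2.0]])
D_Z = disk_projector(Y @ M, Y)
D_expected = S @ np.diag([1.0, 1.0, 0.0]) @ np.linalg.inv(S)
```

The individual iterates $Z_k$ and $Y_k$ depend on the (non-unique) QR factors, so the sketch monitors the projector rather than the iterates themselves.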
From the above considerations we can conclude that the DARE (12) can
be solved applying iteration (38) to $M - \lambda L$ from (26). It is shown in [6] that it
is not necessary to compute a basis for the d-stable deflating subspace
explicitly. Using the relation between the null spaces $\operatorname{Ker}(D_Y) = \operatorname{Ker}(Z_\infty)$, and the
fact that if the stabilizing solution $X_d$ of (12) exists, then
$[\, I_n \;\; (X_d E)^T \;\; F_d^T \,]^T \in \operatorname{Ker}(D_Y)$, one can show (see [6]) that the solution of the DARE and the
optimal gain matrix $F_d$ of the discrete-time optimal control problem can be obtained
from the solution of the overdetermined but consistent set of linear equations
$$\begin{bmatrix} Z_{12} & Z_{13} \\ Z_{22} & Z_{23} \\ Z_{32} & Z_{33} \end{bmatrix} \begin{bmatrix} X_d E \\ F_d \end{bmatrix} = - \begin{bmatrix} Z_{11} \\ Z_{21} \\ Z_{31} \end{bmatrix}. \qquad (39)$$
Here, the $Z_{kj}$, $k, j = 1, 2, 3$, define a block partitioning of $Z_\infty$ conformal to
the partitioning in (26). Solving (12) with this approach will be referred to
as the disk function method.
Note that the CARE (11) can also be solved with the disk function method
by applying the iteration (38) to $C_\mu(H - \lambda K)$ with $H - \lambda K$ as in (25). The
solution $X_c$ of (11) as well as the optimal gain matrix $F_c$ can be obtained
from the overdetermined but consistent set of linear equations
$$\begin{bmatrix} Z_{12} \\ Z_{22} \end{bmatrix} X_E = \begin{bmatrix} Z_{11} \\ Z_{21} \end{bmatrix}, \qquad Z_\infty := \begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix},$$
and $X_c = X_E E^{-1}$, $F_c = -R^{-1}(B^T X_E + S^T C)$.
The linear matrix equations (18) and (19) can also be solved via the disk
function method applied to (27) noting that (18), (19) are special instances
of (12). The corresponding matrix pencil then takes the form given in (40).
Equations (16) and (17) are special instances of (11) and hence can be solved
via the disk function method applied to $C_\mu(H - \lambda K)$ for $H - \lambda K$ as in (31).
Unfortunately, the iteration in (38) cannot be decomposed into iterations
on the matrix blocks, so that no computational savings are obtained
compared to the solution of the DARE. Hence, the computational cost for
solving linear matrix equations with the disk function method is in general
prohibitive.
More details and computational aspects of the disk function method can
be found in [4,6,7,52]. Though a general scaling strategy to accelerate
convergence in (38) is not yet known, an initial scaling of $H - \lambda K$ in (25)
can significantly improve the disk function method for CAREs; see [6].
2.4 Newton's Method
The methods presented so far have addressed the algebraic Riccati equations
by their relation to eigenproblems. By nature, they are systems of nonlinear
equations. It is therefore straightforward to apply methods for solving
nonlinear equations. In [37], Kleinman shows that Newton's method, applied to
the CARE (11) with $E = I_n$ and properly initialized, converges to the desired
stabilizing solution of the CARE. The application to the generalized
equation (11) is considered in [3,42]. Given some initial guess $X_0$, the resulting
algorithm can be stated in different ways. We have chosen here the variant
that is most robust with respect to accumulation of rounding errors.
FOR k = 0, 1, 2, ... "until convergence"
1. $A_k := \tilde{A} - \tilde{G} X_k$.
2. Solve for $N_k$ in the generalized Lyapunov equation
   $$0 = \mathcal{R}_c(X_k) + A_k^T N_k E + E^T N_k A_k.$$
3. $X_{k+1} := X_k + N_k$.
The main computational cost in this algorithm comes from the solution of
the (generalized) Lyapunov equation in each iteration step. If $\tilde{Q}$ is positive
semidefinite and under the assumption used throughout this paper, i.e., $X_c$
exists such that $\sigma(E^{-1}(\tilde{A} - \tilde{G} X_c)) \subset \mathbb{C}^-$, it can be shown that convergence
to $X_c$ is globally quadratic if $X_0$ is chosen such that $E^{-1} A_0$ is c-stable.
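For the standard case $E = I_n$ the three steps above can be sketched with SciPy. The sketch assumes the CARE in the form $0 = \tilde{Q} + \tilde{A}^T X + X \tilde{A} - X \tilde{G} X$, which is consistent with Step 1 but is not quoted from (11), and it uses SciPy's dense Lyapunov solver in place of the sign function iteration (32). The names `care_residual` and `newton_care` and the test data are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov


def care_residual(A, G, Q, X):
    """Residual of the assumed standard CARE 0 = Q + A^T X + X A - X G X."""
    return Q + A.T @ X + X @ A - X @ G @ X


def newton_care(A, G, Q, X0, tol=1e-12, max_iter=50):
    """Newton's method for the CARE with E = I_n, started at a stabilizing X0."""
    X = np.array(X0, dtype=float)
    for _ in range(max_iter):
        Ak = A - G @ X                                   # Step 1
        R = care_residual(A, G, Q, X)
        # Step 2: solve Ak^T N + N Ak = -R for the Newton step N.
        N = solve_continuous_lyapunov(Ak.T, -R)
        X = X + N                                        # Step 3
        if np.linalg.norm(N, "fro") <= tol * max(1.0, np.linalg.norm(X, "fro")):
            break
    return X


# Hypothetical example: A is already c-stable, so X0 = 0 is stabilizing.
A = np.array([[-1.0, 1.0], [0.0, -2.0]])
b = np.array([[0.0], [1.0]])
G = b @ b.T
Q = np.eye(2)
X = newton_care(A, G, Q, np.zeros((2, 2)))
```

Each pass through the loop costs one Lyapunov solve plus a few matrix products, matching the cost statement above.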
Remark 1. Convergence of Newton's method for algebraic Riccati equations
can also be proved under slightly more general assumptions than used here
[29,28]: suppose there exists a gain matrix $F_c$ such that $A_c$ in (9) is c-stable,
that is, the underlying LTI system (1) is c-stabilizable. Furthermore, assume
that $\tilde{G}$ is positive semidefinite and there exists a maximal symmetric
solution $X_+$ of (11), i.e., $X_+ \geq X$ for any other symmetric solution $X$ of (11).
Then $A_+ := E^{-1}(\tilde{A} - \tilde{G} X_+)$ has all its eigenvalues in the closed left half
plane. The matrix $X_+$ is therefore called almost stabilizing. It is the unique
solution of (11) with this property and coincides with $X_c$ if the latter exists
[38, Chapter 7]. Then the Newton iteration converges to $X_+$ from any
stabilizing initial guess $X_0$ [29,38]. The convergence rate is usually linear if $X_+$
is not stabilizing, but quadratic convergence may still occur. A simple trick
presented in [29] can improve the convergence in the linear case significantly.
Analogous observations hold in the discrete-time case [28].
Finding a stabilizing $X_0$ usually is a difficult task and requires the
stabilization of an LTI system, i.e., the solution of Task C1. The computational cost
is equivalent to one iteration step of Newton's method; see, e.g., [49] and the
references therein. Moreover, $X_0$ determined by a stabilization procedure may
lie far from $X_c$. Though ultimately quadratically convergent, Newton's method
may initially converge slowly. This can be due to a large error $\|X_0 - X_c\|$
or to a disastrously bad first step, leading to a large error $\|X_1 - X_c\|$; see,
e.g., [36,6,8]. Due to the initial slow convergence, Newton's method often
requires too many iterations to be competitive with other Riccati solvers.
Therefore it is most frequently only used to refine an approximate CARE
solution computed by any other method.
Recently an exact line search procedure was suggested that accelerates the
initial convergence and avoids "bad" first steps [6,8]. Specifically, Step 3 of
Newton's method given above is modified to $X_{k+1} = X_k + t_k N_k$, where $t_k$ is
chosen in order to minimize the Frobenius norm of the residual $\mathcal{R}_c(X_k + t N_k)$.
As computing the exact minimizer is very cheap compared to a Newton step
and usually accelerates the initial convergence significantly while benefiting
from the quadratic convergence of Newton's method close to the solution,
this method becomes attractive, even as a solver for CAREs (at least in
some cases); see [6,8,9] for details. Moreover, for some ill-conditioned CAREs,
exact line search improves Newton's method also when used only for iterative
refinement. Note that the line search strategy discussed in [6,8,9] also includes
the trick described in Remark 1 for accelerating the linear convergence in
case $X_c$ does not exist.
Similarly, Newton's method can be applied to the DARE (12). The
resulting algorithm is described in [33] for $E = I_n$ and in [3,42] for $E \neq I_n$.
The main computation there is again the solution of a linear matrix equation
which is in this case a (generalized) Stein equation. Again, line searches can
be employed to (partially) overcome the difficulties mentioned above as they
apply analogously to DAREs [6].
In both the continuous- and discrete-time case, in each iteration step of
Newton's method, a linear matrix equation has to be solved. Hence the key
to an efficient parallelization of Newton's method is an efficient solver for the
linear matrix equation in question. We therefore employ the iterative schemes
discussed above (sign function method or Smith iteration). Note that all other
computations required by Newton's method apart from solving Lyapunov
equations basically consist of matrix multiplications and can therefore be
implemented efficiently on parallel computers. The parallelization of Newton's
method with exact line search based upon solving the generalized Lyapunov
equations via (32) is discussed in [9] where also several numerical experiments
are reported.
3 Prospectus of the PLILCO
In this section we first describe the ScaLAPACK library [15] which is used
as the parallel infrastructure for our PLILCO routines. We then describe
the specific routines in PLILCO, including both the available routines and
those that will be included in the near future to extend the functionality of the
library.
3.1 The ScaLAPACK library
The ScaLAPACK (Scalable LAPACK) library [15] is designed as an
extension of the successful LAPACK library [2] for parallel distributed memory
multiprocessors. ScaLAPACK mimics LAPACK, both in structure and
notation. The parallel kernels in this library rely on the use of those in the
PBLAS (Parallel BLAS) library and the BLACS (Basic Linear Algebra
Communication Subprograms). The serial computations are performed by calls to
routines from the BLAS and LAPACK libraries; the communication routines
in BLACS are usually implemented on top of a standard communication
library such as MPI or PVM.
This structured hierarchy of dependences (see Figure 2) enhances the
portability of the codes. Basically, a parallel algorithm that uses ScaLAPACK
routines can be migrated to any vector processor, superscalar processor,
shared memory multiprocessor, or distributed memory multicomputer
where the BLAS and MPI (or PVM) are available.
ScaLAPACK implements parallel routines for solving linear systems,
linear least squares problems, eigenvalue problems, and singular value problems.
The performance of these routines depends on those of the serial BLAS and
the communication library (MPI or PVM).
ScaLAPACK employs the so-called message-passing paradigm. That is,
the processes collaborate in solving the problem and explicit communication
requests are performed whenever a process requires a datum that is not stored
in its local memory.
In ScaLAPACK the computations are performed by a logical grid of
$P_r \times P_c$ processes. The processes are mapped onto the physical processors,
depending on the available number of these. All data (matrices) have to
be distributed among the process grid prior to the invocation of a ScaLAPACK
routine. It is the user's responsibility to perform this data distribution.
Specifically, in ScaLAPACK the matrices are partitioned into nb x nb square
blocks and these blocks are distributed (and stored) among the processes in
column-major order. A graphical representation of the data layout is given
in Figure 1 for a logical grid of 2 x 3 processes.
Fig. 1. Data layout in a logical grid of 2 x 3 processes.
Although not strictly part of ScaLAPACK, the library also provides
routines for distributing a matrix among the process grid. The communication
overhead of this initial distribution is well balanced in most medium and
large-scale applications by the improvements in performance achieved with
parallel computation.
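The mapping of matrix entries to processes can be illustrated with a small index calculation. The sketch assumes the usual two-dimensional block-cyclic scheme with the first block placed on process (0, 0); the function name `block_cyclic_owner` and the zero-based indexing are choices made here and are not part of ScaLAPACK's interface.

```python
def block_cyclic_owner(i, j, nb, p_rows, p_cols):
    """Map the global entry (i, j) to its owning process and local indices.

    Indices are zero-based. Blocks of size nb x nb are dealt out cyclically
    over a p_rows x p_cols process grid, starting at process (0, 0).
    """
    block_row, block_col = i // nb, j // nb
    process = (block_row % p_rows, block_col % p_cols)
    local = ((block_row // p_rows) * nb + i % nb,
             (block_col // p_cols) * nb + j % nb)
    return process, local


# Entries of a matrix distributed with nb = 2 over a 2 x 3 grid as in Fig. 1.
owner_a = block_cyclic_owner(0, 0, 2, 2, 3)
owner_b = block_cyclic_owner(2, 0, 2, 2, 3)
owner_c = block_cyclic_owner(4, 7, 2, 2, 3)
```

Because consecutive blocks go to different processes, every process holds a share of each region of the matrix, which helps to balance the work when an algorithm proceeds through the matrix block by block.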
3.2 Structure of PLILCO
PLILCO heavily relies on the use of the available parallel infrastructure in
ScaLAPACK (see Figure 2). Although ScaLAPACK is incomplete, the kernels
available in the current version (1.6) allow us to implement most of
our PLILCO routines. PLILCO will benefit from future extensions and
developments in the ScaLAPACK project. Improvements in performance of the
PBLAS kernels will also be especially welcome.
Fig. 2. Structure of PLILCO.
In PLILCO the routines are named, following the convention in LAPACK
and ScaLAPACK, as PDxxyyzz. The PD- prefix in each name indicates that
this is a Parallel routine with Double-precision arithmetic. The following two
letters, xx, indicate the type of LTI system addressed by the routine. Thus, GE
or GG indicate, respectively, a standard LTI system ($E = I_n$) or a generalized
LTI system ($E \neq I_n$). The last four letters in the name indicate the
specific problem (yy), and the method employed for that problem (zz). In most
cases, a sequence Cyzz indicates that the routine deals with a continuous-time
problem while a sequence Dyzz indicates a discrete-time problem.
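The naming scheme can be made concrete with a small decoder. This is a hypothetical helper written for illustration, not part of PLILCO; the function name `parse_plilco_name` and the returned dictionary layout are assumptions. Since the time-domain rule holds only "in most cases", the decoder reports it as a hint: a name such as PDGGDKMA, where DK denotes the disk function, would be labelled discrete-time by this rule.

```python
SYSTEM_TYPES = {"GE": "standard", "GG": "generalized"}
TIME_DOMAIN_HINTS = {"C": "continuous-time", "D": "discrete-time"}


def parse_plilco_name(name):
    """Split a routine name of the form PDxxyyzz into its components."""
    if len(name) != 8 or not name.startswith("PD"):
        raise ValueError("expected a name of the form PDxxyyzz")
    xx, yy, zz = name[2:4], name[4:6], name[6:8]
    if xx not in SYSTEM_TYPES:
        raise ValueError("unknown system type code: " + xx)
    return {
        "system": SYSTEM_TYPES[xx],        # GE: E = I_n, GG: E != I_n
        "problem": yy,                     # problem code, left undecoded
        "method": zz,                      # method code, left undecoded
        # Heuristic only: first letter of the problem code.
        "time_domain_hint": TIME_DOMAIN_HINTS.get(yy[0], "not encoded"),
    }


info_lyap = parse_plilco_name("PDGECLNW")
info_dare = parse_plilco_name("PDGGDRSM")
```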
The PLILCO routines can be classified in 4 groups according to their
functionality (computation of basic matrix functions, linear matrix equation
solvers, optimal control, and feedback stabilization). Two more groups will
be included in the near future. We next review the routines in each of these
4 groups.
Group MTF: Basic matrix functions.
The routines in this group implement iterative schemes to compute functions
of matrices or matrix pencils. For instance, three routines are available for
computing the sign function of a matrix. These routines employ different
variants for the matrix sign function iteration:
- PDGESGNW. The Newton iteration.
- PDGESGNS. The Newton-Schulz iteration.
- PDGESGHA. The Halley iteration.
Two more routines are designed for computing the sign function or the
disk function of matrix pencils:
- PDGGSGNW. The generalized Newton iteration for the matrix sign function.
- PDGGDKMA. The iteration (38) for computing the disk function.
Note that the iteration (38) for the disk function only deals with matrix
pencils and therefore PLILCO does not provide any routine for the standard
problem. The disk function of a matrix $Z$ is obtained by applying routine
PDGGDKMA to $Z - \lambda I$.
Table 1 lists the PLILCO routines in the MTF group.

Type of function               Standard     Generalized
Matrix sign function           PDGESGNW     PDGGSGNW
                               PDGESGNS
                               PDGESGHA
Matrix pencil disk function    -            PDGGDKMA

Table 1. PLILCO routines in MTF group.
Group LME: Linear matrix equation solvers.
The routines in this group are solvers for several particular instances of
generalized Sylvester matrix equations, see (13)-(19).
All the solvers for those equations arising in continuous-time LTI systems
are based on the matrix sign function and require that the coefficient matrices
of the equation are stable.
PLILCO includes three solvers for stable Sylvester equations that differ
in the iteration used for the computation of the matrix sign function:
- PDGECSNW. The Newton iteration.
- PDGECSNS. The Newton-Schulz iteration.
- PDGECSHA. The Halley iteration.
In the generalized problem a solver for generalized Sylvester equations is
also included:
- PDGGCSNW. The generalized Newton iteration as in (34).
All these solvers have their analogous routines for stable Lyapunov
equations (three routines) and stable generalized Lyapunov equations (one
routine):
- PDGECLNW. The Newton iteration.
- PDGECLNS. The Newton-Schulz iteration.
- PDGECLHA. The Halley iteration.
- PDGGCLNW. The generalized Newton iteration as in (32).
Furthermore, in case the constant term $\tilde{E}$ is semidefinite it is also possible
to obtain the Cholesky factor of the solution directly by means of routines
- PDGECLNC. The Newton iteration for the Cholesky factor.
- PDGGCLNC. The generalized Newton iteration for the Cholesky factor.
In the discrete-time case, the iterative solvers in PLILCO are based on
the Smith iteration and require the coefficient matrices to be stable (in
the discrete-time sense). So far PLILCO only includes two solvers, for the
discrete-time Sylvester equation (15) and the Stein (or discrete-time
Lyapunov) equation (18), respectively:
- PDGEDSSM. The Smith iteration for (15).
- PDGEDLSM. The Smith iteration for (18).
Special versions of the Smith iteration for computing the Cholesky factors of
semidefinite Stein equations as in (22) and (23) are to be developed in the
future.
The solution of generalized discrete-time linear equations can be obtained
by transforming the equation into a standard one, as there is no generalized
version of the Smith iteration. Note that this transformation involves explicit
inversion of the coefficient matrices in the generalized equation.
Table 2 summarizes the PLILCO routines in the LME group.

Type of equation           Standard     Generalized
Sylvester                  PDGECSNW     PDGGCSNW
                           PDGECSNS
                           PDGECSHA
Lyapunov                   PDGECLNW     PDGGCLNW
                           PDGECLNS
                           PDGECLHA
                           PDGECLNC     PDGGCLNC
Discrete-time Sylvester    PDGEDSSM     -
Stein                      PDGEDLSM     -

Table 2. PLILCO routines in the LME group.

Group RIC: Riccati matrix equation solvers.
We include in this group solvers both for CARE and DARE.
In the continuous-time case, PLILCO provides solvers based on three
different methods, namely Newton's method, the matrix sign function,
and the matrix disk function. Moreover, in the standard case, three different
variants are proposed for Newton's method depending on the Lyapunov solver
that is employed. Thus, we have the following CARE solvers:
- PDGECRNW. Newton's method with the Newton iteration for solving the
  Lyapunov equations.
- PDGECRNS. Newton's method with the Newton-Schulz iteration for solving
  the Lyapunov equations.
- PDGECRHA. Newton's method with the Halley iteration for solving the
  Lyapunov equations.
- PDGECRSG. The matrix sign function method.
- PDGECRDK. The matrix disk function method.
Similarly, we have the following generalized CARE solvers:
- PDGGCRNW. Newton's method with the generalized Newton iterative scheme
  for solving the generalized Lyapunov equations.
- PDGGCRSG. The generalized matrix sign function method.
- PDGGCRDK. The matrix disk function method.
PLILCO also includes the following solvers for DARE (two routines) and
generalized DARE (two routines):
- PDGEDRSM. Newton's method with the Smith iteration for solving the
  discrete-time Lyapunov equations.
- PDGEDRDK. The matrix disk function method for DARE.
- PDGGDRSM. Newton's method with the Smith iteration for solving the
  discrete-time generalized Lyapunov equations.
- PDGGDRDK. The matrix disk function method for generalized DARE.
Table 3 lists the PLILCO routines in the RIC group.

Type of equation    Standard     Generalized
CARE                PDGECRNW     PDGGCRNW
                    PDGECRNS     PDGGCRSG
                    PDGECRHA     PDGGCRDK
                    PDGECRSG
                    PDGECRDK
DARE                PDGEDRSM     PDGGDRSM
                    PDGEDRDK     PDGGDRDK

Table 3. PLILCO routines in the RIC group.

Group STF: Feedback stabilization of LTI systems.
This group includes routines for partial and complete (state feedback)
stabilization of LTI systems. The routines in this group use the linear matrix
equation solvers in group LME to deal with the different equations arising
in standard and generalized, continuous-time and discrete-time LTI systems.
PLILCO thus includes several state feedback stabilizing routines, which differ
in the linear matrix equation that has to be solved and, therefore, the
iteration employed. The feedback stabilization of continuous-time LTI systems
can be obtained by means of routines:
- PDGECFNW. The Newton iteration.
- PDGECFNS. The Newton-Schulz iteration.
- PDGECFHA. The Halley iteration.
- PDGGCFNW. The generalized Newton iteration.
In the discrete-time case, the only routine available so far is the following:
- PDGEDFSM. The Smith iteration.
Table 4 lists the names of the routines in group STF.

Type of LTI system    Standard     Generalized
continuous-time       PDGECFNW     PDGGSTNW
                      PDGECFNS
                      PDGECFHA
discrete-time         PDGEDFSM     -

Table 4. PLILCO routines in STF group.
Future extensions of PLILCO will include at least two more groups:
Group MRD: Model reduction of LTI systems.
Group H2I: Computation of $H_2$- and $H_\infty$-controllers.
4 Preliminary Results
In this section we present some of the preliminary results obtained with the
PLILCO routines on different parallel architectures. Specifically, we report
results for the Lyapunov equation solver PDGECLNW and the generalized
Lyapunov equation solver PDGGCLNW.
As target parallel distributed memory architectures we evaluate our
algorithms on an IBM SP2 and a Cray T3E. In both cases we use the
native BLAS, the MPI communication library, and the LAPACK, BLACS, and
ScaLAPACK libraries [2,15] to ensure the portability of the algorithms.
The IBM SP2 that we used consists of 80 RS/6000 nodes at 120 MHz,
and 256 MBytes RAM per processor. Internally, the nodes are connected by
a TB3 high performance switch. The Cray T3E-600 has 60 DEC Alpha EV5
nodes at 300 MHz, and 128 MBytes RAM per processor. The communication
network has a bidimensional torus topology.
Table 5 reports the performance of the Level-3 BLAS matrix product (in
Mflops, or millions of floating-point arithmetic operations per second), and
the latency and bandwidth for the communication system of each platform.
            DGEMM      Latency       Bandwidth
            (Mflops)   (sec.)        (Mbit/sec.)
IBM SP2     200        30 x 10^-6    90
Cray T3E    400        50 x 10^-6    166
Table 5. Basic performance parameters of the parallel architectures.
In both matrix equations, the coefficient matrix $A$ is generated with
random uniform entries. This matrix is stabilized by a shift of the eigenvalues
($A := A - \|A\|_F I_n$ in the continuous-time case and $A := A / \|A\|_F$ in the
discrete-time case). In case an LTI system is required, $A - \lambda E$ is obtained
as $A = R$, $E = Q^H$, where $Q$ and $R$ are obtained from a QR factorization
of $A$. The solution matrix $X$ is set to a matrix with all entries equal to
one and matrix $Q$ is then chosen to satisfy the corresponding linear matrix
equation.
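The construction of a continuous-time test problem described above can be sketched as follows. This is a minimal pure-Python illustration, not the Fortran 77 code used for the experiments; the Lyapunov equation is assumed here to have the form A^T X + X A + Q = 0, and the function name, the entry range [0, 1) and the fixed seed are likewise assumptions.

```python
import math
import random


def make_lyapunov_test(n, seed=0):
    """Build (A, X, Q) with A shifted, X all ones, and A^T X + X A + Q = 0."""
    rng = random.Random(seed)
    # Coefficient matrix with uniformly distributed random entries.
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    # Shift the spectrum: A := A - ||A||_F * I. Every eigenvalue of A has
    # modulus at most ||A||_F, so the shifted matrix has no eigenvalue
    # with positive real part.
    fro = math.sqrt(sum(v * v for row in a for v in row))
    for i in range(n):
        a[i][i] -= fro
    # Prescribed solution: all entries equal to one.
    x = [[1.0] * n for _ in range(n)]
    # Right-hand side chosen so that the equation holds: Q = -(A^T X + X A).
    q = [[-(sum(a[k][i] * x[k][j] for k in range(n)) +
            sum(x[i][k] * a[k][j] for k in range(n)))
          for j in range(n)] for i in range(n)]
    return a, x, q


A, X, Q = make_lyapunov_test(4)
```

Fixing the solution in advance and deriving the right-hand side from it gives a reference against which a computed solution can be compared directly.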
All experiments were performed using Fortran 77 and IEEE double-precision
arithmetic ($\varepsilon \approx 2.2 \times 10^{-16}$). In our examples, the solution is obtained with
the accuracy that could be expected from the conditioning of the problem. A
more detailed study of the accuracy of these solvers is beyond the scope of this
paper. For details and numerical examples demonstrating the performance
and numerical reliability of the proposed equation solvers, see [9-14].
The figures show the Mflops ratio per node when the number of nodes
is increased and the ratio n/p is kept constant. Thus, we are measuring the
scalability of our parallel routines. The results in the figures are averaged over
5 executions on different randomly generated matrices. In these figures the
solid line indicates the maximum attainable real performance (that of DGEMM)
and the dashed line represents the performance of the corresponding linear
matrix equation solver.

Figure 3 reports the Mflops ratio per node for routine PDGECLNW on the
Cray T3E platform and routine PDGGCLNW on the IBM SP2 platform.
[Figure 3: two plots of Mflops per node versus number of nodes.]
Fig. 3. Mflops ratio for routine PDGECLNW on the Cray T3E with n/p = 750 (left),
and routine PDGGCLNW on the IBM SP2, with n/p = 1000 (right).
Both figures show similar results. The performance per node of the
algorithms decreases when the number of processors is increased from 1 to 4
due to the communication overhead of the parallel algorithm. However, as
the number of processors is further increased, the performance only decreases
slightly, showing the scalability of the solvers.
5 Concluding Remarks
We have described the development of a software library for solving the
computational problems that arise in analysis and synthesis of linear control
systems. The library is intended for solving medium-size and large-scale
problems and the numerical results demonstrate its performance on shared
and distributed memory parallel architectures. The portability of the library
is ensured by using the PBLAS, BLACS, and ScaLAPACK. It is hoped that
this high-performance computing approach will enable users to deal with
large scale problems in linear control theory.
References
1. B. D. O. Anderson and J. B. Moore. Optimal Control - Linear Quadratic
Methods. Prentice-Hall, Englewood Cliffs, NJ, 1990.
2. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen.
LAPACK Users' Guide. SIAM, Philadelphia, PA, second edition, 1995.
3. W. F. Arnold, III and A. J. Laub. Generalized eigenproblem algorithms and
software for algebraic Riccati equations. Proc. IEEE, 72:1746-1754, 1984.
4. Z. Bai, J. Demmel, and M. Gu. An inverse free parallel spectral divide and
conquer algorithm for nonsymmetric eigenproblems. Numer. Math.,
76(3):279-308, 1997.
5. R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX+XB = C:
Algorithm 432. Comm. ACM, 15:820-826, 1972.
6. P. Benner. Contributions to the Numerical Solution of Algebraic Riccati
Equations and Related Eigenvalue Problems. Logos-Verlag, Berlin, Germany, 1997.
Also: Dissertation, Fakultät für Mathematik, TU Chemnitz-Zwickau, 1997.
7. P. Benner and R. Byers. Disk functions and their relationship to the matrix
sign function. In Proc. European Control Conf. ECC 97, Paper 936. BELWARE
Information Technology, Waterloo, Belgium, 1997. CD-ROM.
8. P. Benner and R. Byers. An exact line search method for solving generalized
continuous-time algebraic Riccati equations. IEEE Trans. Automat. Control,
43(1):101-107, 1998.
9. P. Benner, R. Byers, E.S. Quintana-Ortí, and G. Quintana-Ortí. Solving
algebraic Riccati equations on parallel computers using Newton's
method with exact line search. Berichte aus der Technomathematik,
Report 98-05, Universität Bremen, August 1998. Available from
http://www.math.uni-bremen.de/zetem/berichte.html.
10. P. Benner, M. Castillo, V. Hernandez, and E.S. Quintana-Ortí. Parallel partial
stabilizing algorithms for large linear control systems. J. Supercomputing, to
appear.
11. P. Benner, J. M. Claver, and E.S. Quintana-Ortí. Efficient solution of coupled
Lyapunov equations via matrix sign function iteration. In A. Dourado et al.,
editor, Proc. 3rd Portuguese Conf. on Automatic Control CONTROLO'98,
Coimbra, pages 205-210, 1998.
12. P. Benner, J.M. Claver, and E.S. Quintana-Ortí. Parallel distributed solvers
for large stable generalized Lyapunov equations. Parallel Processing Letters, to
appear.
13. P. Benner and E.S. Quintana-Ortí. Solving stable generalized Lyapunov
equations with the matrix sign function. Numer. Algorithms, to appear.
14. P. Benner, E.S. Quintana-Ortí, and G. Quintana-Ortí. Solving linear matrix
equations via rational iterative schemes. In preparation.
15. L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon,
J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and
R.C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.
16. I. Blanquer, D. Guerrero, V. Hernandez, E. Quintana-Ortí, and P. Ruiz.
Parallel-SLICOT implementation and documentation standards. SLICOT
Working Note 1998-1, http://www.win.tue.nl/niconet/, September 1998.
17. D. Boley and R. Maier. A parallel QR algorithm for the unsymmetric eigenvalue
problem. Technical Report TR-88-12, University of Minnesota at Minneapolis,
Department of Computer Science, Minneapolis, MN, 1988.
18. R. Byers. Solving the algebraic Riccati equation with the matrix sign function.
Linear Algebra Appl., 85:267-279, 1987.
19. E. J. Davison and F. T. Man. The numerical solution of A'Q+QA = -C. IEEE
Trans. Automat. Control, AC-13:448-449, 1968.
20. J. D. Gardiner and A. J. Laub. A generalization of the matrix-sign-function
solution for algebraic Riccati equations. Internat. J. Control, 44:823-832, 1986.
21. J. D. Gardiner and A. J. Laub. Parallel algorithms for algebraic Riccati
equations. Internat. J. Control, 54:1317-1333, 1991.
22. J. D. Gardiner, A. J. Laub, J. J. Amato, and C. B. Moler. Solution of the Sylvester
matrix equation AXB + CXD = E. ACM Trans. Math. Software, 18:223-231,
1992.
23. J. D. Gardiner, M. R. Wette, A. J. Laub, J. J. Amato, and C.B. Moler. Algorithm
705: A Fortran-77 software package for solving the Sylvester matrix equation
AXB^T + CXD^T = E. ACM Trans. Math. Software, 18:232-238, 1992.
24. G. A. Geist, R. C. Ward, G. J. Davis, and R. E. Funderlic. Finding eigenvalues
and eigenvectors of unsymmetric matrices using a hypercube multiprocessor.
In G. Fox, editor, Proc. 3rd Conference on Hypercube Concurrent Computers
and Appl., pages 1577-1582, 1988.
25. G. H. Golub, S. Nash, and C. F. Van Loan. A Hessenberg-Schur method for
the problem AX + XB = C. IEEE Trans. Automat. Control, AC-24:909-913,
1979.
26. G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins
University Press, Baltimore, third edition, 1996.
27. M. Green and D. J. N Limebeer. Linear Robust Control. Prentice-Hall,
Englewood Cliffs, NJ, 1995.
28. C.-H. Guo. Newton's method for discrete algebraic Riccati equations when the
closed-loop matrix has eigenvalues on the unit circle. SIAM J. Matrix Anal.
Appl., 20:279-294, 1998.
29. C.-H. Guo and P. Lancaster. Analysis and modification of Newton's method
for algebraic Riccati equations. Math. Comp., 67:1089-1105, 1998.
30. S. J. Hammarling. Numerical solution of the stable, non-negative definite
Lyapunov equation. IMA J. Numer. Anal., 2:303-323, 1982.
31. G. Henry and R. van de Geijn. Parallelizing the QR algorithm for the
unsymmetric algebraic eigenvalue problem: myths and reality. SIAM J. Sci. Comput.,
17:870-883, 1997.
32. G. Henry, D.S. Watkins, and J. J. Dongarra. A parallel implementation of the
nonsymmetric QR algorithm for distributed memory architectures. LAPACK
Working Note 121, University of Tennessee at Knoxville, 1997.
33. G. A. Hewer. An iterative technique for the computation of steady state gains
for the discrete optimal regulator. IEEE Trans. Automat. Control, AC-16:382-384,
1971.
34. A.S. Hodel and K. R. Polla. Heuristic approaches to the solution of very large
sparse Lyapunov and algebraic Riccati equations. In Proc. 27th IEEE Conf.
Decis. Cont., Austin, TX, pages 2217-2222, 1988.
35. C. Kenney and A. J. Laub. The matrix sign function. IEEE Trans. Automat.
Control, 40(8):1330-1348, 1995.
36. C. Kenney, A. J. Laub, and M. Wette. A stability-enhancing scaling procedure
for Schur-Riccati solvers. Sys. Control Lett., 12:241-250, 1989.
37. D. L. Kleinman. On an iterative technique for Riccati equation computations.
IEEE Trans. Automat. Control, AC-13:114-115, 1968.
38. P. Lancaster and L. Rodman. The Algebraic Riccati Equation. Oxford
University Press, Oxford, 1995.
39. A. J. Laub. A Schur method for solving algebraic Riccati equations. IEEE
Trans. Automat. Control, AC-24:913-921, 1979.
40. A. J. Laub. Algebraic aspects of generalized eigenvalue problems for solving
Riccati equations. In C.I. Byrnes and A. Lindquist, editors, Computational
and Combinatorial Methods in Systems Theory, pages 213-227. Elsevier
(North-Holland), 1986.
41. A. N. Malyshev. Parallel algorithm for solving some spectral problems of linear
algebra. Linear Algebra Appl., 188/189:489-520, 1993.
42. V. Mehrmann. The Autonomous Linear Quadratic Control Problem, Theory
and Numerical Solution. Number 163 in Lecture Notes in Control and
Information Sciences. Springer-Verlag, Heidelberg, July 1991.
43. V. Mehrmann. A step toward a unified treatment of continuous and discrete
time control problems. Linear Algebra Appl., 241-243:749-779, 1996.
44. T. Pappas, A. J. Laub, and N. R. Sandell. On the numerical solution of the
discrete-time algebraic Riccati equation. IEEE Trans. Automat. Control,
AC-25:631-641, 1980.
45. T. Penzl. Numerical solution of generalized Lyapunov equations. Adv. Comp.
Math., 8:33-48, 1997.
46. J.D. Roberts. Linear model reduction and solution of the algebraic Riccati
equation by use of the sign function. Internat. J. Control, 32:677-687, 1980.
(Reprint of Technical Report No. TR-13, CUED/B-Control, Cambridge
University, Engineering Department, 1971).
47. A. Saberi, P. Sannuti, and B. M. Chen. H2 Optimal Control. Prentice-Hall,
Hertfordshire, UK, 1995.
48. G. Schelfhout. Model Reduction for Control Design. PhD thesis, Dept.
Electrical Engineering, KU Leuven, 3001 Leuven-Heverlee, Belgium, 1996.
49. V. Sima. Algorithms for Linear-Quadratic Optimization, volume 200 of Pure
and Applied Mathematics. Marcel Dekker, Inc., New York, NY, 1996.
50. R. A. Smith. Matrix equation XA + BX = C. SIAM J. Appl. Math.,
16(1):198-201, 1968.
51. G. W. Stewart. A parallel implementation of the QR algorithm. Parallel
Computing, 5:187-196, 1987.
52. X. Sun and E. S. Quintana-Ortí. Spectral division methods for block
generalized Schur decompositions. PRISM Working Note #32, 1996. Available from
http://www-c.mcs.anl.gov/Projects/PRISM.
53. P. Van Dooren. A generalized eigenvalue approach for solving Riccati equations.
SIAM J. Sci. Statist. Comput., 2:121-135, 1981.
54. A. Varga. A note on Hammarling's algorithm for the discrete Lyapunov
equation. Sys. Control Lett., 15(3):273-275, 1990.
55. A. Varga. Computation of Kronecker-like forms of a system pencil:
Applications, algorithms and software. In Proc. CACSD'96 Symposium, Dearborn, MI,
pages 77-82, 1996.
56. K. Zhou, J.C. Doyle, and K. Glover. Robust and Optimal Control.
Prentice-Hall, Upper Saddle River, NJ, 1995.
ParaStation User Level Communication
Joachim M. Blum, Thomas M. Warschko, and Walter F. Tichy
Institut für Programmstrukturen und Datenorganisation, Fakultät für Informatik,
Am Fasanengarten 5, Universität Karlsruhe, D-76128 Karlsruhe, Germany

Summary. PULC (ParaStation User Level Communication) is a user-level
communication library for workstation clusters. PULC provides a multi-user,
multi-programming communication library for user-level communication on top of
high-speed communication hardware. This paper describes the design of the
communication subsystem, a first implementation on top of the ParaStation
communication adapter, and benchmark results of this first implementation.
PULC removes the operating system from the communication path and
offers a multi-process environment with user-space communication. Additionally, it
moves some operating system functionality to the user-level to provide higher
efficiency and flexibility. Message demultiplexing, protocol processing, hardware
interfacing, and mutual exclusion of critical sections are all implemented in user-level.
PULC offers the programmer multiple interfaces including TCP user-level sockets,
MPI [CGH94], PVM [BDG+93], and Active Messages [CCHvE96]. Throughput and
latency are close to the hardware performance (e.g., the TCP socket protocol has
a latency of less than 9 µs).

Keywords: Workstation Cluster, Parallel and Distributed Computing,
User-Level Communication, High-Speed Interconnects.
1. Introduction
Common network protocols are designed for general-purpose communication
in a LAN/WAN environment. These protocols reside in the kernel of an
operating system and are built to interact with diverse communication hardware.
To handle this diversity, many standardised layers exist. Each layer offers an
interface through which the other layers can access its services. This layered
architecture is useful for supporting diverse hardware but leads to high and
inefficient protocol stacks. Protocols which use the standardised interfaces
of the operating system are unaware of superior hardware functionality and
often reimplement features in software even if the hardware already provides
them. Another inefficiency is due to copy operations between kernel- and
user-space and within the kernel itself. To transmit a message the kernel has to
copy the data from or to user-space. The copying across protected address
space boundaries often adds more latency than the physical transmission of a
message. In addition, the kernel copies the data several times from one buffer
to another while traversing layers of the protocol stack. On the positive side,
the traditional communication path with the kernel as single point of access
to the hardware ensures correct interaction with the hardware and mutual
exclusion of competing processes.
For parallel computation on clusters of workstations, many of the
protocols which are designed for wide area networks are too inefficient. Therefore,
cluster computing must take new approaches.

The most promising technique is to move protocol processing to user-level.
This technique opens up the opportunity to investigate optimised protocols
for parallel processing. With user-level protocols there is no need to use the
standardised interfaces between the operating system and the device driver.
Thus, the reimplementation of services in software which are already provided
by the hardware can be avoided.
[Figure 1.1: layer diagram with the application, user libc and system library
(user environment), the network protocols (e.g. TCP/IP, Ethernet) and device
driver (system), and the network hardware.]
Fig. 1.1. User-level communication highway
User-level communication removes the kernel from the critical path of
data transmission. Figure 1.1 shows how user-level communication shortcuts
the access to the communication hardware. High-performance communication
protocols are based on superior hardware features to speed up
communication. Copying data between kernel- and user-space is avoided and the
implementation of true zero-copy protocols is possible. These key issues minimise
latency and lead to high throughput.

But user-level communication also has its drawbacks, because now the
single point of access to the communication hardware, namely the kernel, is
missing. Therefore many user-level communication libraries restrict the
number of processes on a node to a single process. Enabling multiple processes on
one node in user-level raises difficulties, but also offers a lot of benefits. Once
problems such as demultiplexing of messages and ensuring correct
interaction between multiple processes are solved, the high-speed communication
network can be used similar to a cluster with regular communication
channels such as Unix sockets.

The goal of PULC is to provide a multi-user, multi-programming
communication library for user-level communication on top of high-speed
communication hardware. The first implementation of PULC uses the ParaStation
communication adapter, which is described in section 3. Section 4 presents
design alternatives and the optimisation techniques used. In section 5, this
paper describes the implementation of PULC on top of ParaStation.
Performance figures for two different hardware platforms are presented in section 6.
The last two sections present the conclusion and the plans for future work.
2. Related Work
There are several approaches targeting efficient parallel computing on
workstation clusters. Some of them use custom hardware which supports
memory-mapped communication. SHRIMP [DBDF97] builds a client-server computing
environment on top of a virtual shared memory. Similar to PULC, SHRIMP
offers standardised interfaces such as Unix sockets. Digital's Memory
Channel [FG97] is proprietary to DEC Alphas and uses address space mapping to
transfer data from one process to another. On top of this low-level
mechanism Memory Channel offers MPI and PVM. Many recent parallel machines,
e.g. IBM SP2, are a collection of regular workstations connected with a
high-speed interconnect.

Others use commodity hardware to implement communication
subsystems. OSCAR (e.g. [JR97]) implements MPI on top of SCI cards. Fast
Messages [CPL+97] and Active Messages [CCHvE96] are approaches for MPP
systems ported to workstation clusters. Both offer low-latency protocols which
can be used to build other communication libraries on top. As an example,
the Berkeley Fast Socket protocol [SR97] is built on top of Active Messages.
Similar to PULC, it provides an object-code-compatible socket interface. Its
latency is about 75 µs and its throughput reaches 33 MByte/s on Myrinet.
But in contrast to PULC it has some restrictions in the use of fork() and
exec() calls. Unlike the current PULC implementation, it provides
interoperability between Fast Socket and other applications on the same
cluster, whereas PULC only provides it for out-of-cluster communication.

BIP [PT97] and Myricom GM [myr] implement low-level interfaces to
the Myrinet hardware. They are comparable with the PULC hardware
abstraction layer but lack higher-level protocols. Gamma [CC97] builds Active
Messages on top of Fast Ethernet cards and gets nearly full performance by
adding a system call and building a special protocol in the Linux kernel.
UNet [WBvE97] uses Fast Ethernet and ATM to build an abstraction of the
network interface. Depending on the hardware support, they use kernel or
user-level communication. They've even built a memory management system
to enable DMA transfer to previously unpinned pages. Such a memory
management is not implemented in PULC, but could be done as soon as hardware
with DMA transfer and on-board processors is used.

In the Berkeley NOW project [ACP95], GLUnix offers a transparent
global view of a cluster. As in PULC the network of workstations can be
used similar to a single parallel machine. Their main focus is on Active
Messages and therefore no other protocols are implemented.
3. ParaStation Hardware
The first implementation of PULC uses the ParaStation high-speed
communication card as communication hardware. ParaStation is the reengineered
MPP-network of Triton/1 [HWTP93], an MPP-system built at the
University of Karlsruhe. Within a workstation cluster the ParaStation hardware is
dedicated to parallel applications while the operating system continues to use
standard hardware (e.g., Ethernet).

The network topology is based on a two-dimensional toroidal mesh.
Table-based, self-routing packet switching transports data using virtual cut-through
routing. The size of a packet can vary from 4 to 508 bytes. Packets are
delivered in order and no packets are lost. Flow control is provided at link level
and the unit of flow control is one packet. These features enable the software
to use a simple fragmentation/defragmentation scheme. The communication
processor used involves a routing delay of about 250 ns per node and offers a
maximum throughput of 16 MByte/s per link.

The ParaStation hardware resides on an interface card which plugs into
the PCI-bus of the host system. Thus, it is possible to use ParaStation on a
wide range of machines from different vendors. A more detailed description
of the hardware is given in [WBT97].
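Because packets arrive in order and are never lost, fragmentation and reassembly need no sequence numbers or retransmission logic. The following is a minimal Python sketch of such a scheme, not the ParaStation implementation: the 508-byte figure from the text is treated as the payload limit (an assumption, since the text does not say how much of a packet is header), and the `is_last` flag and function names are hypothetical.

```python
MAX_PAYLOAD = 508  # upper packet size quoted in the text, used here as payload limit


def fragment(message: bytes, max_payload: int = MAX_PAYLOAD):
    """Split a message into in-order packets of at most max_payload bytes.

    Each packet is an (is_last, payload) pair; the flag is a hypothetical
    stand-in for whatever header the real hardware protocol uses.
    """
    chunks = [message[i:i + max_payload]
              for i in range(0, len(message), max_payload)] or [b""]
    return [(i == len(chunks) - 1, chunk) for i, chunk in enumerate(chunks)]


def defragment(packets):
    """Reassemble one message from packets that arrive in order, none lost."""
    parts = []
    for is_last, payload in packets:
        parts.append(payload)
        if is_last:
            break
    return b"".join(parts)


msg = bytes(range(256)) * 5          # 1280-byte example message
packets = fragment(msg)
restored = defragment(packets)
```

On a network without these guarantees, each packet would additionally need a sequence number and the receiver a reordering or retransmission mechanism.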
4. Design of PULC
A new communication subsystem has to meet several requirements to be useful
for parallel computing. First, parallel computing is highly dependent on very low
latency and high throughput. The performance available to the user has to
be close to the hardware limits. Therefore, deep protocol stacks are deadly
for parallel computing.

Second, communication hardware is getting faster and more intelligent.
New approaches, such as DMA transfers and communication processors on
the interface cards, enable high performance and flexible protocol processing.
A new communication protocol has to be well-suited for these technologies.

Third, communication libraries offer different interfaces and semantics
to the programmer. Not every communication library is well-suited for all
users of a cluster of workstations. Therefore, a new communication subsystem
has to offer different interfaces (communication libraries). It should also be
extensible for new approaches in this field.

Fourth, workstation clusters are often used by several people for parallel
computing. Having user-level access to the hardware usually prohibits
simultaneous use of one node by several processes. A new approach should support
a multi-process environment.

Therefore the main goal was that PULC supports fine-grained parallel
programming on workstation clusters while still providing the benefits of
multi-process environments.

The most challenging problem in a multi-process environment is the
demultiplexing of incoming messages. Generally there are three possible places
where message demultiplexing can take place:
1. In the operating system: The operating system either periodically checks
the hardware for pending messages or it is interrupted by the hardware
when a message has arrived. The operating system unpacks the message
header and stores the message data in a corresponding queue in kernel
space. From the viewpoint of the kernel it doesn't matter if the message
is for the currently running process or for any other process.
2. In the communication processor: Each communicating process has a
memory area which is accessible by the communication hardware. The
communication processor checks the header and decides where the
message fragment should be stored. The number of accessible memory areas
is limited, however. To solve this problem the communication system can
either limit the number of communicating processes or buffer the
message intermediately, where the processes can access the data (in kernel
space, a common message area, or a trusted process' address space).
3. In the low-level communication software in user-space: A user process
periodically checks the hardware (or gets interrupted), and receives the
message. If the message is not addressed to the receiving process, the
process stores the message in a message pool accessible by the destination
process.
In all cases the destination process executes a receive call, gets the data
from the intermediate storage and stores it in the final destination. If the
final destination is known and accessible at the time of message
demultiplexing, the message can be stored directly in this area. This is known as true
zero copy message reception [BBVvE95].

PULC divides the message demultiplexing and the message reception into
two different modules. The PULC message handler demultiplexes incoming
messages. This message handler can either run on the communication
processor or it can be linked to each user process. The PULC interface receives
the message for the process. It always runs in the address space of the
communicating process. Both modules communicate by calling each other or by
updating queues in a shared message area.
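The third alternative, demultiplexing in user-space, can be illustrated with a toy model. The sketch below is only an assumption-laden illustration of the idea, not PULC code: the `Node` class, the `receive` function and the `(dest_pid, data)` packet format are hypothetical, and the mutual exclusion a real shared message pool would need is left out.

```python
from collections import defaultdict, deque


class Node:
    """Toy model of one node: a hardware receive queue plus a shared pool."""

    def __init__(self):
        self.hardware = deque()            # packets as (dest_pid, data)
        self.pool = defaultdict(deque)     # shared message area, per process


def receive(node, pid):
    """Return the next message for pid, or None if nothing is pending."""
    # Messages parked earlier by other processes are delivered first.
    if node.pool[pid]:
        return node.pool[pid].popleft()
    # Otherwise drain the hardware, parking messages meant for others.
    while node.hardware:
        dest, data = node.hardware.popleft()
        if dest == pid:
            return data
        node.pool[dest].append(data)
    return None


node = Node()
node.hardware.extend([(2, b"for-2"), (1, b"for-1"), (2, b"again-2")])
first = receive(node, 1)    # parks b"for-2", returns b"for-1"
second = receive(node, 2)   # served from the pool
third = receive(node, 2)    # polled from the hardware
```

Serving the pool before polling the hardware keeps the per-destination message order intact, since everything in the pool arrived before whatever is still queued in the hardware.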
Another challenging task is resource management. Resources (buffers,
sockets, etc.) are usually managed by the operating system. When moving the
communication out of the kernel, this task can be accomplished by a regular
user process. The resource manager has to control access to the hardware
and clean up after application shutdowns. In PULC, this task is performed
by the PULC resource manager.

Figure 4.1 gives an overview of the major parts of PULC.
[Figure 4.1: block diagram showing the PULC Interface, the PULC Protocol
Switch, the PULC Message Handler, and the PULC Resource Manager (PSID).]
Fig. 4.1. PULC Architecture
PULC Programmi ng Interface: This module acts as programmi ng interface
for any application. The design is not restricted to a particular interface
definition such as Unix sockets. It is possible and reasonable to have sev-
eral interfaces (or protocols) residing side by side, each accessible t hrough
its own API. Thus, different APIs and protocols can be i mpl ement ed to
support a different quality of service, ranging from standardised interfaces
(i.e. TCP or UDP sockets), widely used programmi ng environments (i.e.
MPI or PVM), to specialised and proprietary APIs (ParaSt at i on ports
and a true zero copy protocol called Rawdata). All in all, the PULC in-
terface is the programmer- visible interface to all implemented protocols.
PULC Message Handler: The message handler is responsible to handle all
kind of (low level) dat a transfer, especially incoming and outgoing mes-
95
sages, and is t he onl y par t t o i nt er act di r ect l y wi t h t he har dwar e. It
consists of a pr ot ocol - i ndependent par t and a specific i mpl e me nt a t i on
for each pr ot ocol defined wi t hi n PULC. The pr ot ocol - i ndependent par t
is t he protocol switch which di spat ches i ncomi ng messages and demul -
t i pl exes t hem t o pr ot ocol speci f i c receive handlers. To get hi gh- speed
communi cat i on, t he pr ot ocol s have t o be as lean as possible. Thus, PULC
pr ot ocol s are not l ayered on t op of each ot her ; t hey reside side by side.
Sendi ng a message avoids any i nt er medi at e buffering. Af t er checki ng t he
da t a buffer, t he sender di r ect l y t r ansf er s t he da t a t o t he har dwar e. The
specific pr ot ocol s inside t he message handl er are responsi bl e for t he cod-
ing of t he pr ot ocol header i nf or mat i on.
PULC Resource Manager: This module is implemented as a Unix daemon
process (PSID) and supervises allocated resources, cleans up after application
shutdowns, and controls access to common resources. Thus, it
takes care of tasks usually managed by the operating system.
To be portable among different hardware platforms and operating systems,
PULC implements all hardware and operating system specific parts in
a module called the hardware abstraction layer (HAL). Choosing an interconnection
network with a different quality of service would force an adaptation
of the PULC message handler to the services the communication hardware
provides. E.g., if the hardware doesn't provide in-order delivery, the message
handler has to use the PULC functions which provide a reordering of fragments.
4.1 Resources provided by PULC
PULC supports the implementation of different protocols by offering a variety
of resources together with associated interfaces to access them. The
protocol-independent resources are message fragments, communication ports,
semaphores, and process control blocks. A message fragment consists of a fragment
control block and the message data, and several fragments are concatenated
to form a message. Fragmentation is essential because the underlying
hardware has a limited packet size. Therefore, PULC fragments have fixed sizes
in memory and are allocated as fixed-size memory blocks. This
may waste memory, but allocating and managing variable-sized chunks of
memory is time consuming. Several messages together form the message queue
of a port. The port is the basic addressable element in PULC communication.
Different protocols use the ports as the channels to their communication
partners. The resource manager frees a port and all fragments inside its message
queue when no process is using it anymore. For the TCP/UDP protocol,
another resource called socket is provided. A socket uses a port as its communication
channel and stores additional socket-specific information. To know
about all the resources which are allocated by a specific process, PULC keeps
information about a process in a process control block. This information is
used to clean up the allocated resources when the process exits.
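The fragment scheme described above can be illustrated with a minimal sketch. The payload size FRAGMENT_SIZE, the fields of Fragment, and the function names are assumptions made for illustration only; they are not taken from the PULC sources.

```python
from dataclasses import dataclass
from typing import List

FRAGMENT_SIZE = 8  # hypothetical payload bytes per fragment (kept tiny for the demo)


@dataclass
class Fragment:
    """Hypothetical fragment: a small control block plus a fixed-size data block."""
    msg_id: int       # message this fragment belongs to
    index: int        # position of the fragment within the message
    length: int       # number of valid payload bytes
    data: bytearray   # always FRAGMENT_SIZE bytes, even if partly unused


def fragment_message(msg_id: int, payload: bytes) -> List[Fragment]:
    """Split a message into fixed-size fragments (the last one may be padded)."""
    fragments = []
    for index, start in enumerate(range(0, len(payload), FRAGMENT_SIZE)):
        chunk = payload[start:start + FRAGMENT_SIZE]
        block = bytearray(FRAGMENT_SIZE)   # fixed-size block, may waste memory
        block[:len(chunk)] = chunk
        fragments.append(Fragment(msg_id, index, len(chunk), block))
    return fragments


def reassemble(fragments: List[Fragment]) -> bytes:
    """Concatenate the valid bytes of all fragments in index order."""
    ordered = sorted(fragments, key=lambda f: f.index)
    return b"".join(bytes(f.data[:f.length]) for f in ordered)
```

Every block has the same size, so a real allocator could hand out blocks from a free list without searching, which is the trade-off between wasted memory and allocation time that the text mentions.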
If the PULC message handler runs on the host processor, several processes
can access common resources. To ensure mutual exclusion of processes
and to protect critical sections (manipulating queues or other resources), PULC
provides user-level semaphores. Processor-specific atomic operations, such as
test-and-set or load/store locked, are used to implement them.
For an easy implementation of the protocols, PULC offers support functions
to access the resources. E.g., PULC provides routines to store fragments
into the message queue of a port. There are only three different strategies
to store fragments in a message queue. PULC classifies the ports and the
protocol calls the appropriate routine. In general, the message queues of a port
can be classified in the following way:
- Single stream: All fragments are stored in a single queue disregarding any
message boundaries or message sources.
- Multiple stream: All fragments of the same source are stored in one queue.
Fragments of different sources are stored in different queues.
- Datagrams: Fragments of different messages and different sources are stored
in different queues. Each message has its own queue.
In addition to this classification, these routines have to know whether the hardware
delivers the fragments of a message in order or whether a reordering of the fragments
is necessary. Fortunately, the ParaStation hardware provides in-order delivery.
The same holds for our HAL implementation for the Myrinet card.
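The three queueing strategies differ only in the key under which a fragment is filed. The following is a minimal sketch of that idea, assuming in-order delivery; the class PortQueues and the kind names are hypothetical and do not reflect the actual PULC routines.

```python
from collections import defaultdict, deque


class PortQueues:
    """Hypothetical port that files incoming fragments by one of three strategies."""

    def __init__(self, kind):
        if kind not in ("single", "multiple", "datagram"):
            raise ValueError("unknown port kind: " + kind)
        self.kind = kind
        self.queues = defaultdict(deque)   # queue key -> fragments in arrival order

    def _key(self, source, msg_id):
        if self.kind == "single":
            return None                    # one queue; sources and messages ignored
        if self.kind == "multiple":
            return source                  # one queue per source
        return (source, msg_id)            # one queue per message

    def enqueue(self, source, msg_id, fragment):
        """Store a fragment in the queue selected by the port classification."""
        self.queues[self._key(source, msg_id)].append(fragment)
```

With the same four incoming fragments, a single stream port ends up with one queue, a multiple stream port with one queue per sender, and a datagram port with one queue per message.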
4.2 PSID: The PULC Coordinator
Since PULC is fully implemented in user space, the operating system does
not manage the resources. This task is done by a resource manager (PSID:
ParaStation Daemon). It cleans up resources of dead processes and organises
access to the message area. Before a process can communicate with PULC,
it has to register with the PSID. The PSID can grant or deny access
to the message area and the hardware.
The PSID also checks whether the versions used by the PULC interface and the
PULC message handler are compatible. This version check prevents data corruption
caused by mismatched versions. The PSID can restrict access to the communication
subsystem to a specific user or a maximum number of processes. This enables
the cluster to run in an optimised way, since multiple processes slow down
application execution due to scheduling overhead.
All PSIDs are connected to each other. They exchange local information
and transmit demands of local processes to the PSID of the destination node.
With this cooperation, PULC offers distributed resource management. The
single-system semantics of PULC is ensured by the PSIDs. They spawn and
kill client processes on demand of other processes. PULC transfers remote
spawning or killing requests to the PSID of the destination node. PULC uses
operating system functionality to spawn and kill the processes on the local
node. The spawned process runs with the same user id as the spawning process.
PULC redirects the output of a spawned process to the terminal of the mother
process. Therefore it offers a transparent view of the cluster.
The PSIDs periodically exchange load information. With this information
PULC provides load balancing when spawning new tasks. There are several
spawning strategies possible:
- Spawn a new task on the specified node: No selection is done by PULC.
The spawn request is transferred to the remote PSID, which creates the
new task. A new task identifier is returned in the result.
- Spawn a task on the next node: PULC keeps track of the node which was
used to spawn the last task. This strategy selects the next node by
incrementing the node number.
- Spawn a task on an unloaded node: Before spawning, PULC orders the
available nodes by their load. After that, PULC spawns on the nodes with
the lightest load.
These strategies allow a PULC cluster to run in a balanced fashion, while
still allowing the programmer to specify the exact node when the problem being
solved requires a specific communication pattern.
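The node selection behind the three spawning strategies can be sketched as follows. This is only an illustration under stated assumptions: the class SpawnSelector and its methods are hypothetical, the wrap-around to node 0 after the last node is assumed, and the load values stand in for the information the PSIDs exchange.

```python
class SpawnSelector:
    """Hypothetical node selection for the three spawning strategies."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.last_node = -1                # node used for the previous spawn

    def specified(self, node):
        """Strategy 1: the caller names the node; no selection is done."""
        if not 0 <= node < self.num_nodes:
            raise ValueError("no such node")
        self.last_node = node
        return node

    def next_node(self):
        """Strategy 2: increment the node number of the last spawn (wrapping)."""
        self.last_node = (self.last_node + 1) % self.num_nodes
        return self.last_node

    def least_loaded(self, loads, count=1):
        """Strategy 3: order nodes by reported load and take the lightest ones."""
        ordered = sorted(range(self.num_nodes), key=lambda n: loads[n])
        chosen = ordered[:count]
        if chosen:
            self.last_node = chosen[-1]
        return chosen
```

For example, with four nodes and loads [0.5, 0.1, 0.9, 0.2], asking for two nodes under the third strategy selects nodes 1 and 3.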
4.3 The PULC Message Handler
The PULC message handler is responsible for receiving and sending messages.
4.3.1 Sending messages. Sending a message avoids any intermediate buffering.
After checking the buffer, the sender directly transfers the data to the
hardware. The specific protocols inside the message handler are responsible
for the coding of the protocol header information. PULC doesn't restrict the
length or form of the header. PULC just specifies the form of the hardware
header with its protocol id. The rest of the message header must be interpretable
by the protocol-specific receive handler. If the receiver is on the local
node, the message handler optimises the message transfer by directly calling the
appropriate receive handler of the protocol.
4.3.2 Receiving a message. If the hardware supports demultiplexing of
messages, the PULC message handler runs on the communication processor
of the hardware. It shares some memory with each receiving process.
The data can be transferred directly to this memory area.
The first generation of the ParaStation card does not support any message
demultiplexing at hardware level, so the PULC message handler has to
be part of a process and runs in the address space of its host process. During
reception of a message the PULC message handler may detect that the message is not
addressed to its own host process. It then has to store the message in a commonly
accessible message area (SHM) where the destination process can read the
message. Whether a message is received with true zero copy or is stored
intermediately depends on the protocol used.
The PULC protocol switch reads only the hardware header of the message
and the protocol identifier. After decoding the id, the protocol switch directly
transfers control to the receive handler of the protocol, which reads the rest
of the message. This header forwarding is extremely fast and does not make any
unnecessary copy of the data. The protocols can store the data directly in
user data structures, as is done in the rawdata protocol, or queue the data
in a message queue (TCP, UDP, PORT-M/S/D). Other protocols can do
it in their specific way.
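The dispatch step of the protocol switch can be sketched as a table lookup on the protocol id. The header layout below (a 2-byte protocol id followed by a 2-byte length) and all names are assumptions chosen for illustration; the real hardware header format is not given in the text.

```python
import struct

# Hypothetical hardware header: 2-byte protocol id, 2-byte payload length.
HW_HEADER = struct.Struct("!HH")

receive_handlers = {}   # protocol id -> receive handler of that protocol


def register_protocol(proto_id, handler):
    """Install the receive handler of one protocol."""
    receive_handlers[proto_id] = handler


def protocol_switch(packet):
    """Read only the hardware header, then hand control to the protocol."""
    proto_id, length = HW_HEADER.unpack_from(packet, 0)
    handler = receive_handlers[proto_id]
    # A memoryview avoids copying here; the handler parses its own header.
    body = memoryview(packet)[HW_HEADER.size:HW_HEADER.size + length]
    return handler(body)
```

Because the switch never inspects anything beyond the hardware header, each protocol remains free to define its own header and to decide whether the payload is copied into user data structures or queued.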
PULC allows multiple processes to communicate concurrently since different
processes can use different communication ports. The protocol interface
and the protocol receive handler have to ensure the correct cooperation while
receiving a message.
In a hardware-supported PULC message handler a shared port must reside
in an area where both processes can access it. If both processes trust
each other, the port can reside in a message area which is mapped into both
processes. If they do not trust each other, the message handler has to protect
the port in its own memory area. Both processes would then have to access the
message in the port through the message handler API. This is much slower
than the solution with direct access.
4.4 PULC Interface
Each protocol in the message handler can have its own interface. The interface
is the counterpart of the message handler. The message handler receives
a message and puts it in the message area, whereas the interface functions
get these messages as soon as they are received completely. The cooperation
between the interface functions and the receive handler of the protocol includes
correct locking of the port and its message queues. Correct interaction
is necessary since PULC doesn't have control of the scheduling decisions of
the operating system. Thus the receive handler could be in a critical section
while the operating system switches to a process which conflicts with this
critical section. This could destroy consistency.
A process can use several interfaces at the same time. E.g., it can use the
sockets for regular communication and PULC's ability to spawn processes
through the Port-M interface.
The socket interface to PULC is the same as for BSD Unix sockets. This
interface allows easy porting of applications and libraries to the fast communication
protocols. Destinations which are not reachable inside the PULC
cluster are redirected to regular operating system calls. Most network communication
in Unix is based on the socket interface. Because PULC provides a compatible
interface, porting an application to PULC only requires relinking.
PULC sockets use specially tuned methods with caching of recently used
structures. This allows extremely fast communication with minimal protocol
overhead. Each socket has a port as its communication channel. The
socket receive handler only knows about the ports and uses different enqueuing
strategies for UDP sockets (datagram ports) and TCP sockets (single stream
ports). The socket interface provides the interaction between the communication
ports and the socket descriptor. Sockets can be shared among different
processes due to a fork() call and can be inherited across an exec() call. During
fork(), the socket is duplicated but both sockets share the same communication
port (the count attribute of the port is incremented). Thus, both
processes have access to the message queue of the socket. After an exec()
and a reconnection to PULC, the sockets and the ports of the message area
are inserted into the private socket and port descriptor tables. Therefore the
process has access to these abstractions again.
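The sharing of one port between the sockets of a parent and a child can be pictured as simple reference counting. The sketch below is an assumption-laden model, not PULC code: Port, Socket, fork_duplicate, and the freed flag are hypothetical names, and the real port lives in the shared message area rather than in a Python object.

```python
class Port:
    """Hypothetical communication port with a use count."""

    def __init__(self):
        self.count = 1          # number of sockets referring to this port
        self.queue = []         # message queue shared by all referring sockets
        self.freed = False


class Socket:
    """Hypothetical socket that uses a port as its communication channel."""

    def __init__(self, port=None):
        self.port = port if port is not None else Port()

    def fork_duplicate(self):
        """Duplicate the socket as during fork(): new socket, same port."""
        self.port.count += 1
        return Socket(self.port)

    def close(self):
        """Drop one reference; the port is freed when nobody uses it anymore."""
        self.port.count -= 1
        if self.port.count == 0:
            self.port.queue.clear()
            self.port.freed = True
```

A message queued through the parent's socket is visible through the child's socket, and the port is released only after both sockets have been closed.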
4.4.1 Communication Libraries on Top of PULC. There are several
communication libraries built on top of PULC. Most of them are just the
standard Unix distributions on top of sockets. The applications which use
these libraries just have to be linked with the PULC sockets. These libraries
include P4 [BL92] and tcgmsg [Har91]. Others, such as PVM [BDG+93], have
been changed [BWT96] in a way that they can be used simultaneously with the
standard sockets. This enables a direct comparison of the operating system
communication and PULC. The implementation shows that PVM adds a
significant overhead to the regular socket communication. This isn't obvious
when PVM is used with regular sockets (see section 6.). This led to a new
approach [OBWT97], which optimised PVM on top of the port-D interface.
PULC already provides efficient and flexible buffer management, and therefore
this functionality could be eliminated from the PVM source. This PSPVM2
is still interoperable with other PVMs running on any other cluster or supercomputer.
PSPVM2 views the whole PULC cluster as one single parallel
system.
The PULC MPI implementation is based on MPICH. MPICH provides a
channel interface which hardware manufacturers can use to port MPICH to
their own communication subsystem. This channel interface is implemented
on top of PULC's port-D protocol. MPICH on PULC uses PULC's dynamic
process creation at startup. The implementation is well suited for MPI-2,
which supports dynamic process creation at run time. It is possible to
support MPI directly as an interface to PULC. Most of the functionality is
already provided in the Port protocol.
5. Implementation
There exist two implementations of PULC, one for Intel PCs running Linux
and the other for DEC Alpha workstations running Digital Unix. Both of
them use the ParaStation high-speed communication card as communication
hardware. As described in section 3., ParaStation offers many useful services
to the software protocols, but unfortunately it has no communication processor
on board. Thus, the implementation uses a commonly accessible shared
memory area (see figure 5.1) to store messages and control information. The
PULC library itself, in particular the PULC message handler, acts as the
trusted base within the whole system. The library is statically linked to each
application and ensures correct interaction between all parts of the system.
The operating system is only invoked at system and application startup.
[Figure: Application A; Application B; ParaStation Library (User mode); Operating System (Kernel mode); Driver; Message Buffer; Control Information; ParaStation Hardware; paths labelled System Startup / Initialization, Application Startup, and Normal Operation]
Fig. 5.1. ParaStation User-Level Communication
Operating system and hardware specific parts of the library are placed in
a separate module (the HAL). Therefore only this module has to be changed
when porting PULC to another platform. This is currently being done for the
Myrinet communication card.
Since the message handler is part of each process, the message area is
mapped into each communicating process. This enables the message handler
to receive messages for different processes and to demultiplex them to the
correct receiving port. The multi-process ability of this solution is quite expensive
due to the locking of ports, as well as the locking of data transmission to
and from the hardware.
Using a commonly accessible message area suffers from a (minimal) lack of
protection. The implemented message demultiplexing implies that all communicating
processes trust each other. A malfunctioning process accessing
the common message area directly is able to corrupt data owned by another
process and can possibly crash the system. But the risk is minimal since the
address space of Alpha processors (64 bit addresses) is approximately 2^52
times larger than the size of the message area (configuration dependent). If
a wrong address is produced once a second, a corruption of data in the message
area could happen approximately every 2^27 years. On the other hand,
the trusted system is open to malicious hackers with access to the cluster,
but this is a tolerable disadvantage when compared to the performance benefits
gained from this policy. If this lack of protection is considered harmful,
PULC can be configured to allow only a specific number of processes or only a
specific user access to the communication system concurrently.
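The estimate of one corruption every 2^27 years follows from the stated ratio of 2^52 by a unit conversion. The short check below assumes that each wrong address is uniformly random over the address space and that exactly one is produced per second; the variable names are illustrative.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # about 3.16e7 seconds

# Ratio of address space to message area as stated in the text: 2^52.
ratio = 2 ** 52

# Assumption: one uniformly random wrong address per second, so a hit on the
# message area is expected about once every `ratio` seconds.
expected_years = ratio / SECONDS_PER_YEAR
exponent = math.log2(expected_years)     # close to 27
```

Since a year has roughly 2^25 seconds, 2^52 seconds correspond to roughly 2^27 years, which matches the figure quoted above.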
6. Performance Evaluation
This section shows the efficiency of the PULC implementation. The performance
of the different protocols is presented and the results are explained.
Performance is measured on a cluster where each node is a fully
configured workstation.
6.1 Communication Benchmark
Communication subsystems can be compared by evaluating the latency and
throughput of the systems. PULC offers several interfaces and runs on several
hardware/operating system environments. Our test clusters consist of two
Pentium PCs (166 MHz) running Linux 2.0 and two Alpha 21164 workstations
(500 MHz) running Digital Unix 4.0.
PULC's results are compared with the operating system performance
whenever possible. The test consists of a pairwise exchange program to measure
the throughput and a ping-pong test to measure the latency, which is
calculated as the round trip time divided by two.
    /* Sender code */
    StartTimer()
    for (i = 0; i < LOOPS; i++)
        SendMessage(buffer, size)
        ReceiveMessage(buffer, size)
    end
    EndTimer()

    /* Receiver code */
    StartTimer()
    for (i = 0; i < LOOPS; i++)
        SendMessage(buffer, size)
        ReceiveMessage(buffer, size)
    end
    EndTimer()
Fig. 6.1. Pairwise exchange test program
In the exchange program (see figure 6.1) both processes send a message
to the other and wait for the message from the other. Therefore both processes
always execute the same command.
In the ping-pong test (see figure 6.2), one process sends and the other
receives. After receiving the message, the receiver sends the message back to
the sender.
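As a runnable illustration of the ping-pong measurement, the following sketch times round trips over an ordinary local socket pair and halves the result. It uses the operating system's sockets from the Python standard library, not PULC; LOOPS, SIZE, and the function names are assumed values and names, and the absolute numbers it produces say nothing about ParaStation.

```python
import socket
import threading
import time

LOOPS = 1000          # number of round trips (assumed value)
SIZE = 64             # message size in bytes (assumed value)


def recv_exact(sock, n):
    """Receive exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def receiver(sock):
    """Ping-pong receiver: receive each message, then send it back."""
    for _ in range(LOOPS):
        sock.sendall(recv_exact(sock, SIZE))


def ping_pong_latency():
    """Return the one-way latency estimate in seconds (round trip / 2)."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=receiver, args=(b,))
    worker.start()
    message = bytes(SIZE)
    start = time.perf_counter()
    for _ in range(LOOPS):
        a.sendall(message)          # ping
        recv_exact(a, SIZE)         # pong
    elapsed = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    return elapsed / LOOPS / 2
```

Averaging over many round trips, as both test programs do with their LOOPS counter, keeps the timer resolution from dominating the result for small messages.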
Surprisingly, the slower Pentium system performs better than the Alpha
system in both latency and throughput at lower layers (see Table 6.3). This
    /* Sender code */
    StartTimer()
    for (i = 0; i < LOOPS; i++)
        SendMessage(buffer, size)
        ReceiveMessage(buffer, size)
    end
    EndTimer()

    /* Receiver code */
    StartTimer()
    for (i = 0; i < LOOPS; i++)
        ReceiveMessage(buffer, size)
        SendMessage(buffer, size)
    end
    EndTimer()
Fig. 6.2. Pingpong test program
                    Alpha 21164, 500 MHz                Pentium, 166 MHz
                 ParaStation     OS/Ethernet        ParaStation     OS/Ethernet
protocol        latency  band-  latency  band-     latency  band-  latency  band-
layer                    width           width              width           width
                 [µs]   [MB/s]   [µs]   [MB/s]      [µs]   [MB/s]   [µs]   [MB/s]
hardware          4.2    12.4                        3.4    15.6
rawdata           5.1    11.9                        6.4    14.8
port-M            8.9     9.5                       14.6    11.8
socket            9.0     9.6    115     1.1        13.9    11.9    308     1.0
PVM              78.0     8.7    289     1.0       158       7.8    776     0.8
PVM (port-M)     11.5     9.4                       27.2    11.5
socket (self)    [row illegible in the source; surviving fragment in the Pentium columns: "578 30.0"]
Fig. 6.3. Communication Performance of the PULC system
is due to the architectural differences between the two systems. In particular,
the Alpha's capability to combine writes to the same memory location requires
additional synchronisation. As the ParaStation communication interface is
implemented as a FIFO buffer, memory barrier instructions (MB) are inserted
after each write to the FIFO. The MB instruction itself waits for all
outstanding read and write operations and thus limits the performance. In
addition to the write-combining bottleneck, the semaphore mechanism which
is used on the Alphas is not as fast as the semaphores on the Pentium. A
lock operation on the Alphas takes about 1 µs whereas a Pentium provides
mutual exclusion within 200 ns. The semaphore bottleneck is also visible in
multi-process protocols.
The line titled hardware in the table above shows the performance of the
hardware abstraction layer described in section 5. and reflects the maximum
performance one can get using ParaStation on the stated workstation.
The additional latency of 0.9 µs on the Alpha (3 µs on the Pentium)
introduced by the rawdata protocol is due to guaranteeing mutual exclusion
and correct interaction between concurrent processes. Multiple ports are addressed
by the port protocol. This multi-programming environment adds an additional
3.8 µs (8.2 µs on the Pentium) to the rawdata protocol. Providing full
TCP socket functionality within 9 µs opens up a wide range of fine-grained
parallel programs on top of sockets. As reported in [BWT96], standard programming
environments, such as PVM, add a huge amount of latency to
the sockets. This is not noticeable when slow operating system sockets are
used. When running PVM on top of PULC sockets, 89% (91% on a Pentium)
of the latency is caused by these packages. These numbers show that these
standard environments do not adapt well to high-speed protocols. This led
to an optimisation of PVM on top of ports. As reported, the port-M protocol
already provides most of the functionality that PVM has to implement
on top of sockets, e.g. a very inefficiently implemented buffer management.
Using the whole functionality of PULC, PVM only adds 2.5 µs (13.4 µs
on the Pentium) to the port-M protocol latency. This shows that even with
standardised interfaces, PULC offers great performance.
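The per-layer overheads quoted above can be recomputed from the ParaStation latency columns of Fig. 6.3. The sketch below does only that arithmetic; the dictionary layout and the helper names added and pvm_share are illustrative choices, and the latencies are copied from the table.

```python
# ParaStation latencies in microseconds, copied from Fig. 6.3.
latency = {
    "alpha":   {"hardware": 4.2, "rawdata": 5.1, "port-M": 8.9,
                "socket": 9.0, "PVM": 78.0},
    "pentium": {"hardware": 3.4, "rawdata": 6.4, "port-M": 14.6,
                "socket": 13.9, "PVM": 158.0},
}


def added(system, upper, lower):
    """Latency that layer `upper` adds on top of layer `lower`, in microseconds."""
    return round(latency[system][upper] - latency[system][lower], 1)


def pvm_share(system):
    """Percentage of the PVM latency that is not spent in the socket layer."""
    pvm = latency[system]["PVM"]
    return 100.0 * (pvm - latency[system]["socket"]) / pvm
```

For the Alpha this gives 0.9 µs for rawdata over hardware and 3.8 µs for port-M over rawdata; the PVM share evaluates to about 88.5% on the Alpha and about 91.2% on the Pentium, in line with the 89% and 91% quoted in the text.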
6.2 Application Benchmark
A user doesn't focus on the pure message passing numbers. More important
is how the system behaves with real applications. This section
presents performance measurements of the system in two different areas with
two different communication libraries. First, the PVM implementation is
measured with a widely used linear algebra package and second, the NAS
parallel benchmark is used to compare the system to the Cray T3E, a dedicated
parallel system.
6.2.1 Linear Algebra Package on top of PVM. This test uses a linear
equation solver for dense systems, called xslu, which is part of ScaLAPACK
[CDD+95], a popular linear algebra package. ScaLAPACK uses BLACS
as a communication interface to different communication libraries such as
MPI or PVM. In this test PVM acts as the underlying subsystem. The test is
run on up to 8 Alphas (500 MHz, 256 MB RAM, Digital Unix 4.0b) connected
with the ParaStation hardware.
ScaLAPACK on 160 MBit ParaStation with PSPVM2

Problem    1 workstation   2 workstations    4 workstations    8 workstations
size (n)   MFlop           Speedup  MFlop    Speedup  MFlop    Speedup  MFlop
 3000      443             1.75      759     2.18      966     2.62     1161
 4000                      1.70      753     2.47     1093     3.23     1431
 5000                      1.85      821     2.68     1187     3.74     1656
 6000                                        2.90     1285     4.04     1789
 7000                                        3.11     1379     4.37     1939
 8000                                                          4.61     2044
 9000                                                          5.07     2247
10000                                                          5.22     2312
The table shows that on top of ParaStation the application scales well
in terms of problem size and number of processors. A maximum performance
of 2.3 GFlops is achieved, which compares quite well to dedicated parallel machines.
Unfortunately, xslu depends on high bandwidth and thus ParaStation
with about 10 MByte/s throughput is the real bottleneck.
6.2.2 NAS Parallel Benchmark on top of MPI. The second test measures
the performance of the system with the NAS Parallel Benchmark suite.
This suite is widely used to compare different parallel platforms. It is built
on top of MPI and runs without any source code modifications.
Some tests require a power-of-two number and others a square number of
processors. Therefore not all columns are filled in each test.
The FT benchmark is a 3-D FFT application. MG is a multigrid benchmark.
The LU benchmark does a matrix decomposition. It is the only benchmark
in the NPB 2.0 suite that sends large numbers of very small (40 byte)
messages. Therefore it shows the performance of the communication subsystem
for fine-grained applications. EP (embarrassingly parallel) usually shows
the performance of a single node. The communication subsystem is not used
frequently. IS (integer sort) sorts a number of integers in parallel. CG (conjugate
gradient) and IS exchange a lot of data in huge data chunks. All of these
codes require a power-of-two number of processors. The SP (pentadiagonal
solver) and BT (block tridiagonal solver) algorithms are more coarse-grained
implementations. They solve three sets of uncoupled systems of equations using
multipartition schemes. Both the SP and BT codes require a square
number of processors (footnote 1).
As a comparison the numbers achieved by a Cray T3E-900 are presented,
which has similar processors per node. The communication subsystem is a
highly optimised three-dimensional torus. The third-level cache is eliminated
and therefore tests which are memory intensive run well on the T3E and
tests which can mostly run in the cache perform worse than on regular workstations.
The Cray T3E provides a bandwidth of about 300 MB/s and a latency
of about 2 µs at hardware level. Therefore one could expect a comparable
performance for tests which do not depend on bandwidth.
1 For a detailed description of the tests please refer to
http://science.nas.nasa.gov/Software/NPB/
NAS Parallel Benchmark on ParaStation and T3E (Class A)
Columns in the source: 1, 2, 4, 8 nodes. Rows with fewer than four entries
lost their column alignment; their values are listed in source order.

Test  System          Results
BT    ParaStation     n/a     n/a     144.4   n/a
BT    Cray T3E-900    n/a     n/a     226.7   n/a
CG    ParaStation     19.7    44.5    55.15   75.72
CG    Cray T3E-900    46.5    86.0    241.4
EP    ParaStation     4       31.96
EP    Cray T3E-900    5.2     10.4    20.8
IS    ParaStation     1.46    2.23    2.15    3.72
IS    Cray T3E-900    6.6     12.9    22.1
LU    ParaStation     579.48
LU    Cray T3E-900    134.4   270.4   531.1
MG    ParaStation     299.77
MG    Cray T3E-900    171.5   313.9   720.8
FT    ParaStation     86.02
FT    Cray T3E-900    85.3    169.5   330.4
SP    ParaStation     n/a     106.55
SP    Cray T3E-900    n/a     n/a     172.4   n/a
The table shows the results measured on ParaStation and the results
taken from the NAS homepage for the T3E. Higher numbers mean better
performance.
In some tests ParaStation performs very well compared to the expensive
dedicated system. Unsurprisingly, these are the test with minimal communication
(EP) and the test with many small messages (LU), because the MPI
latency is about the same on both systems.
During the other tests, which depend on high throughput, the ParaStation
system began to swap received messages to user space due to an overflow of
the message storage. This effect limited the performance, and a new version
of PULC will optimise this swapping. But even with this swapping effect, the
resulting numbers are better than those of other, much more expensive dedicated
machines, such as the IBM SP/2, SGI Origin, and Cray T3D (footnote 2).
7. Conclusion
PULC shows extremely good performance on all protocols. Many programs
benefit from the high speed of the PULC library. PULC's design offers nearly
the raw performance of high-speed communication cards to the user while
still providing standardised interfaces. The design goal of a multi-user/multi-programming
environment at full speed was reached. PULC is also easily
adapted to new hardware and brings efficient parallel processing to workstation
clusters. The presented performance results compare well with parallel
systems. PULC is included in the ParaStation system, which was introduced
into the market in 1996 (footnote 3) and is currently being ported to the Myrinet
communication adapter. First results show that TCP throughput will rise to 60 MB/s
while latency will increase to about 20 µs. These are first numbers, where the
message handler still runs in low-level software.
2 See the performance numbers at
http://science.nas.nasa.gov/Software/NPB/NPB2Results/index.html
The pure user-level approach in the ParaStation system showed many
drawbacks which could only be resolved by introducing some security holes
and performance limitations. In particular, the performance of multiple processes
on one node depends on the coscheduling strategies used. Unfortunately,
I couldn't find a coscheduling strategy which is good for multithreaded
and interprocess communication at the same time. More research
has to be done in this area.
8. Future Work
In future, the ParaStation team will work on next-generation ParaStation
hardware. Current issues for a new network design are fiber optic links, optimised
packet switching, and flexible DMA engines to reach an application-to-application
bandwidth of about 100 MByte/s. Similar to Myrinet, the new
hardware will be able to run the message handler on board. Therefore any
security hole will be eliminated.
PULC and the full ParaStation environment are going to be ported to other
systems with PCI bus (e.g., Sun/Solaris, IBM-PowerPC/AIX, SGI/IRIX).
PULC itself will be ported to other communication hardware. Additional
interfaces and protocols, such as Active Messages, are being considered for
implementation as protocols inside of PULC. This would give them a performance
boost over the current implementations, which are built on top of sockets
or ports.
Furthermore, the analysis of the message demultiplexing showed that this
task can be done in the OS, the communication hardware, and the low-level
software. All three cases will be implemented and evaluated.
References
[ACP95] Thomas E. Anderson, David E. Culler, and David A. Patterson. A Case
for NOW (Network of Workstations). IEEE Micro, 15(1):54-64, February 1995.
3 For further information, see http://wwwipd.ira.uka.de/ParaStation.
[BBVvE95] Anindya Basu, Vineet Buch, Werner Vogels, and Thorsten von Eicken.
U-net: A user-level network interface for parallel and distributed computing. In
Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper
Mountain, Colorado, December 3-6, 1995.
[BDG+93] A. Beguelin, J. Dongarra, Al Geist, W. Jiang, R. Manchek, and V. Sun-
deram. PVM 3 User's Guide and Reference Manual. ORNL/TM-12187, Oak
Ridge National Laboratory, May 1993.
[BL92] Ralph Butler and Ewing Lusk. User's Guide to the P4 Parallel
Programming System. ANL-92/17, Argonne National Laboratory, October 1992.
[BWT96] Joachim M. Blum, Thomas M. Warschko, and Walter F. Tichy.
PSPVM:Implementing PVM on a high-speed Interconnect for Workstation
Clusters. In Proc. of 3rd Euro PVM Users' Group Meeting, Munich, Germany,
Oct.7-9, 1996.
[CC97] G. Chiola and G. Ciaccio. Gamma: a low-cost network of workstations
based on active messages. In 5th EUROMICRO workshop on Parallel and
Distributed Processing, 1997.
[CCHHvE96] Chi-Chao Chang, Grzegorz Czajkowski, Chris Hawblitzel, and
Thorsten von Eicken. Low-Latency Communication on the IBM RISC
System/6000 SP. In ACM/IEEE Supercomputing '96, Pittsburgh, PA, November
1996.
[CDD+95] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov,
A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A Portable
Linear Algebra Library for Distributed Memory Computers - Design Issues and
Performance. Technical Report UT CS-95-283, LAPACK Working Note #95,
University of Tennessee, 1995.
[CGH94] Lyndon Clarke, Ian Glendinning, and Rolf Hempel. The MPI Message
Passing Interface Standard. Technical report, March 94.
[CPL+97] Chien, Pakin, Lauria, Buchanan, Hane, Giannini, and Prusakova. High
Performance Virtual Machines (HPVM): Clusters with Supercomputing APIs
and Performance. In Eighth SIAM Conference on Parallel Processing for
Scientific Computing (PP97), 1997.
[DBDF97] Stefanos N. Damianakis, Angelos Bilas, Cezary Dubnicki, and
Edward W. Felten. Client Server Computing on Shrimp. IEEE Micro, pages
8-17, January/February 1997.
[FG97] Marco Fillo and Richard B. Gillett. Architecture and implementation of
memory channel 2. Technical report, Digital Equipment Corporation, 9 1997.
[Har91] R. J. Harrison. Portable tools and applications for parallel computers.
International Journal on Quantum Chem., 40:847-863, 1991.
[HWTP93] Christian G. Herter, Thomas M. Warschko, Walter F. Tichy, and
Michael Philippsen. Triton/1: A massively-parallel mixed-mode computer
designed to support high level languages. In 7th International Parallel
Processing Symposium, Proc. of 2nd Workshop on Heterogeneous Processing,
pages 65-70, Newport Beach, CA, April 13-16, 1993.
[JR97] H. Jin and W. Rehm. Performance of message passing and shared memory
on sci-based smp cluster. In Proceedings of Fifth High Performance Computing
Symposium, Atlanta, Georgia, April 6-10 1997.
[myr] The GM API.
[OBWT97] Patrick Ohly, Joachim M. Blum, Thomas M. Warschko, and Walter F.
Tichy. PSPVM2:PVM for ParaStation. In Proc. of 1st Workshop on Cluster
Computing, Chemnitz, Germany, Nov.6-7, 1997.
[PT97] Loic Prylli and Bernard Tourancheau. New protocol design for high per-
formance networking. Technical report, LIP-ENS Lyon, 69364 Lyon, France,
1997.
[SR97] Steve Rodrigues, Tom Anderson, and David Culler. High-performance
local-area communication using fast sockets. In USENIX '97, 1997.
[WBT97] Thomas M. Warschko, Joachim M. Blum, and Walter F. Tichy. ParaS-
tation: Efficient Parallel Computing by Clustering Workstations: Design and
Evaluation. Journal of Systems Architecture, 1997. Elsevier Science Inc., New
York, NY 10010. To appear.
[WBvE97] Matt Welsh, Anindya Basu, and Thorsten von Eicken. ATM and Fast
Ethernet Network Interfaces for user-level communication. In Proceedings of
the Third International Symposium on High Performance Computer Architecture
(HPCA), San Antonio, 1997.
TOP-C: Task-Oriented Parallel C for
Distributed and Shared Memory
Gene Cooperman*
College of Computer Science
Northeastern University
Boston, MA 02115
gene@ccs.neu.edu
Summary. The "holy grail" of parallel software systems is a parallel programming
language that will be as easy to use as a sequential one, while maintaining most of
the potential efficiency of the underlying parallel hardware. TOP-C (Task-Oriented
Parallel C) attempts such a model by presenting a task abstraction that hides
many of the details of the underlying hardware. DSM (Distributed Shared Memory)
also attempts such a model, but along an orthogonal direction. By presenting a
shared memory model, it hides many of the details of the message passing
required by the underlying hardware. This article reviews the TOP-C model and
then presents ongoing research on combining the advantages of both models in a
single system.
1. Introduction
This paper proposes the TOP-C model as a way to easily organize
computations on DSM systems with many processors, while maintaining high
concurrency. The proposed model allows the application writer to implicitly
declare segments of his environment that correspond to the program objects
that he is using. The segments are implicit in that the application writer need
only declare to TOP-C which segments are modified by a given routine.
TOP-C has been successful in executing many large, parallel
applications [4, 8, 10, 11, 12, 17]. TOP-C is implemented as a C library, and does
not require a modification of the programming language of the application.
As with any C library, the TOP-C library can also be used by a C++
program. One can choose among three TOP-C libraries, targeting SMP
(Symmetric MultiProcessing, or shared memory) architectures, distributed
memory architectures, and a sequential architecture. The application writer
may continue to use his or her favorite programming language as long as that
language has an interface to C libraries.
It should be noted that current high-end SMP architectures (many
processors) are quite similar to DSM systems with hardware support. Hence,
there appears to be a gradual progression from low-latency SMP through
medium-latency DSM systems, with no sharp dividing line. Accordingly, we
talk about the SMP version of the TOP-C model with the intention that this
also applies to DSM.
* Supported in part by NSF Grant CCR-9732330.
Section 2 describes the TOP-C model. Section 3 then motivates why the
model needs to be extended when the environment uses a lot of memory.
Section 4 then describes a natural way to enhance the TOP-C model by
providing an application abstraction of segments. If the application program
is an object-oriented C++ program, then each segment will often correspond
to an object.
Section 5 then describes how the enhanced TOP-C model maps onto
a DSM architecture. In particular, there is an important issue of how the
multiple segments of the TOP-C environment map onto the multiple pages
of a DSM system. We are still in the process of obtaining a suitable DSM,
and so we have not had the opportunity to test TOP-C in this environment.
Nevertheless, a paper analysis describes many of the DSM features that we
expect will be necessary for TOP-C to run efficiently on top of DSM.
2. The TOP-C Model
The TOP-C model has been described in [7]. The model is sufficiently flexible
to also be easily ported to interactive languages [5, 6]. The model has also
been applied to metacomputing [9], due to the ease of checkpointing the
current state and sending a copy of that state to a new process joining
the computation. The model has been successfully used in a variety of
applications [4, 8, 10, 11, 12, 17].
The model allows a single file of application code to be executed as a
sequential, SMP, or distributed memory application, by simply linking with
a different library. Portability is emphasized by building on top of a POSIX
threads library (for SMP) or MPI [14] (for distributed memory). MPI was
chosen as a widely available message-passing standard, with good efficiency.
The TOP-C distribution also contains its own small, unoptimized subset
implementation of MPI, allowing one to quickly set up a small, self-contained
application. Further, the portability of TOP-C makes it easy to re-target to
another message-passing platform, such as PVM. TOP-C is freely distributed
at ftp://ftp.ccs.neu.edu/pub/people/gene/top-c/.
The programming style is SPMD (Single Program, Multiple Data). This
is executed in the context of a master-slave architecture and an environment
or global state. This environment receives lazy, incremental updates, in a
fashion that will be made clear later.
The user interface has purposely been kept simple by restricting it to a
single, primary system call: master_slave(). That function requires as
parameters four application functions declared by the user:
set_task_input(), do_task(), get_task_output() and
update_environment(). The philosophy is to present the higher-level task
abstraction to the application. This should be contrasted with lower-level
interfaces that present either a message-passing abstraction or a shared
memory abstraction.
The task is the first abstraction. The first two application-defined
functions, set_task_input() and do_task(), implicitly define the input-output
behavior of the task. The third function, get_task_output(), returns an
action to be taken, based upon the task output. The three primary actions
are NO_ACTION, REDO, and UPDATE. When the application specifies the UPDATE
action, the application-specific function, update_environment(), is called on
each process (including the master). The routine, update_environment(),
uses the task output to introduce an incremental update.
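As a rough illustration of the control flow just described, the following C
sketch runs the four application routines within a single process. The
callback signatures, the integer task type, and the name master_slave_seq()
are assumptions made for this example; they are not the actual TOP-C
prototypes.

```c
/* Actions that get_task_output() may return (names taken from the text). */
typedef enum { NO_ACTION, REDO, UPDATE } action_t;

/* Hypothetical callback signatures; the real TOP-C prototypes may differ. */
typedef struct {
    int      (*set_task_input)(int *input);          /* 0 when no tasks remain */
    int      (*do_task)(int input);                  /* reads the environment  */
    action_t (*get_task_output)(int input, int output);
    void     (*update_environment)(int input, int output);
} callbacks_t;

/* Sequential stand-in for master_slave(): one process plays both roles. */
static void master_slave_seq(const callbacks_t *cb)
{
    int input;
    while (cb->set_task_input(&input)) {
        action_t action;
        do {
            int output = cb->do_task(input);
            action = cb->get_task_output(input, output);
            if (action == UPDATE)
                cb->update_environment(input, output); /* the only writer */
        } while (action == REDO);                      /* same input again */
    }
}

/* Example application: accumulate the squares of 1..5 in the environment. */
static int  next_input = 1;
static long env_sum = 0;                 /* the environment (global state) */

static int my_set_task_input(int *input)
{
    if (next_input > 5) return 0;
    *input = next_input++;
    return 1;
}
static int my_do_task(int input) { return input * input; }
static action_t my_get_task_output(int input, int output)
{
    (void)input; (void)output;
    return UPDATE;                       /* every result changes the sum */
}
static void my_update_environment(int input, int output)
{
    (void)input;
    env_sum += output;
}

static long run_example(void)
{
    callbacks_t cb = { my_set_task_input, my_do_task,
                       my_get_task_output, my_update_environment };
    next_input = 1;
    env_sum = 0;
    master_slave_seq(&cb);
    return env_sum;
}
```

Calling run_example() drives five tasks through the loop. In this sequential
setting a REDO can never be needed, because the environment cannot change
between the generation of a task input and the arrival of its output.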
The figure below illustrates the flow of control between master and each
of several slaves for a task.
[Figure 2.1: flow-of-control diagram with MASTER and SLAVE columns.
Recoverable labels: set_task_input, output, UPDATE,
update_environment(input, output).]
Figure 2.1. TOP-C Programmer's Model
A process always completes its current operation, before reading a pending
message for the next operation. A message from the master to a slave
requesting an update to the slave's copy of the environment always takes
precedence over a message specifying a new task. A REDO action results in
the original task input being sent back to the same slave, typically after a
message to update the environment.
In addition to the task, the second key to the TOP-C model is the
environment (global state). The environment, like the task, is not explicitly
declared by the application. Rather, it is implicitly defined by the application
routines. Each of the four application routines may read the most recent local
environment. However, only update_environment() may modify the data in
the environment. The environment is read and written only by the application
routines, and not by any TOP-C system routine.
The most important issue for TOP-C is to allow tasks to concurrently read
and make a request to modify the environment. As seen in Figure 2.1, a decision
to modify the environment can only happen if get_task_output() returns
an UPDATE action. This action both allows TOP-C to record at what "time"
the environment was last modified, and to then call update_environment().
In the case of distributed memory, update_environment() is called on each
process, including the master. In the case of shared memory or sequential
code, update_environment() is called only on the master.
3. Concurrency Issues for Shared Memory
Note that for any shared memory system (not just TOP-C), there is an
inherent reader-writer problem when one thread (in this case the master)
writes to a region of memory while another thread is reading the same region
of memory. The TOP-C methodology reduces this to a single writer-multiple
reader problem. The TOP-C solution is to allow both memory operations
to proceed, but to later detect the memory collision and account for it. The
method is analogous to the method of "optimistic concurrency" in distributed
databases.
Concurrency is maintained in TOP-C in an application-specific manner.
The system provides a utility, is_up_to_date(), callable from within the
application routine, update_environment(). This routine will determine
whether the environment was modified on the master after the task input
under consideration was generated on the master, and before the task output
was received by the master. Any memory collisions are a special case of this
more general situation, and so will also be detected.
If the environment was not modified, then the application trivially attains
perfect concurrency. If the environment was modified, then the application
routine, get_task_output(), may either return a REDO action, or employ an
application-specific technique to "patch" the task output to take account of
the modified environment. The get_task_output() routine receives the task
input, in addition to the task output, precisely to make it easier to patch the
output.
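The optimistic check described above can be sketched in a few lines of C.
This is a minimal sketch, assuming a single version counter for the whole
environment; task_t, make_task(), note_update() and decide() are
hypothetical names, and is_up_to_date() is given a task argument here only
to keep the example self-contained.

```c
typedef enum { NO_ACTION, REDO, UPDATE } action_t;

/* Incremented each time the environment is modified on the master. */
static unsigned long env_version = 0;

/* A task remembers the environment version seen when its input was made. */
typedef struct {
    int           input;
    unsigned long version_at_input;
} task_t;

static task_t make_task(int input)
{
    task_t t;
    t.input = input;
    t.version_at_input = env_version;
    return t;
}

/* True if the environment is unchanged since the task input was generated. */
static int is_up_to_date(const task_t *t)
{
    return t->version_at_input == env_version;
}

/* Called whenever an UPDATE action has been applied. */
static void note_update(void)
{
    env_version++;
}

/* Decision logic of a get_task_output()-style routine. */
static action_t decide(const task_t *t, int output_changes_env)
{
    if (!output_changes_env)
        return NO_ACTION;          /* nothing to write back            */
    if (!is_up_to_date(t))
        return REDO;               /* computed against a stale state   */
    return UPDATE;                 /* safe to apply the incremental update */
}
```

An application that knows how to patch a stale output could apply the patch
in place of the REDO branch and then return UPDATE.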
The effect of this concurrency strategy is that the environment acts as a
single large "page" of memory. If any task causes the page to be "touched",
then all processes may have to read an update to the page. The page update
is handled in a lazy manner, providing a type of latency hiding. However,
the presence of only a single, atomic environment effectively means that false
sharing of data is widespread within the system. This is the current state of
TOP-C.
The issue of false sharing of a single monolithic environment tends to
especially hurt TOP-C applications that require a shared memory model. This
occurs because of a natural dichotomy in TOP-C applications. Applications
that require only a smaller amount of memory for the environment tend to
run comfortably in the distributed memory model, in which the environment
is replicated among many processes. However, applications requiring a large
amount of memory for the environment will prefer a shared memory
environment. Otherwise, the cost of physical memory often makes it uneconomic
to find a site with sufficient memory on each processor to allow the replication
of a large environment within each process.
Thus, large environments favor a shared memory model. This software
view of memory can be achieved either by an SMP architecture or by a
DSM architecture on top of many workstations. The next section discusses
an experimental version of TOP-C that better accommodates a shared view
of memory by providing multiple pages, or segments, within the environment.
4. Multiple Segments within an Environment
In the experimental TOP-C model, the environment is replaced by multiple
segments. The use of multiple segments forces us to change one command
and one action in the TOP-C model: is_up_to_date() and UPDATE. All
other aspects of the TOP-C model retain the same simplicity.
Recall that the TOP-C environment is never explicitly declared. Rather,
it is implicitly defined by the application programmer as those portions
of memory within a slave process that are read by do_task() and that
are read or written to by update_environment(). (In addition, the master
routines set_task_input() and get_task_output() may also read the
environment.)
In our implementation of segments, we retain this idea that segments
are implicitly referenced, but never explicitly declared. Since the environment
is replaced by segments, the utility is_up_to_date() must be extended to
include a single parameter, specifying for which segments the query is being
made. Currently, this parameter is specified as a string representing a set of
numbers. For example, "1,3,5-7" represents segments 1, 3, and 5 through 7.
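A string such as "1,3,5-7" can be expanded into a set of segment numbers
with a short parser. The sketch below is one possible reading of that
notation, not the TOP-C implementation; parse_segments() and MAX_SEGMENTS
are hypothetical names, and the limit of 64 segments is an arbitrary choice
for the example.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_SEGMENTS 64   /* assumed upper bound on segment numbers */

/*
 * Parse a specification such as "1,3,5-7" into member[], where
 * member[s] == 1 means that segment s belongs to the set.
 * Returns 0 on success and -1 on malformed or out-of-range input.
 */
static int parse_segments(const char *spec, unsigned char member[MAX_SEGMENTS])
{
    const char *p = spec;

    memset(member, 0, MAX_SEGMENTS);
    while (*p != '\0') {
        char *end;
        long lo, hi, s;

        lo = strtol(p, &end, 10);
        if (end == p)
            return -1;                 /* no digits where a number is due */
        hi = lo;
        p = end;
        if (*p == '-') {               /* a range "lo-hi" */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p)
                return -1;
            p = end;
        }
        if (lo < 0 || hi < lo || hi >= MAX_SEGMENTS)
            return -1;
        for (s = lo; s <= hi; s++)
            member[s] = 1;
        if (*p == ',') {
            p++;
            if (*p == '\0')
                return -1;             /* trailing comma */
        } else if (*p != '\0') {
            return -1;                 /* unexpected character */
        }
    }
    return 0;
}
```

A membership array is used here so that a later lookup for a given segment
is a single index operation.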
Second, the command update_environment() is now used to update one
or more segments. It would be possible to add an additional requirement
for the application programmer to have update_environment() return a
string, such as "4-8", indicating which segments are being updated. This
would allow TOP-C to maintain an internal table that updates a timestamp
for each segment, and then answer any application queries of the form
is_up_to_date("1,3,5-7"). However, it was felt to be a simpler syntax to
instead extend the UPDATE action returned by get_task_output(). Since the
application programmer already must return the action UPDATE (implemented
as a C constant), we now require the application programmer to instead
return a parametrized action such as UPDATE("4-8") (implemented as a
C function macro).
It is clear that the internal table of timestamps for each segment
can be maintained only on the master process, since queries of the form
is_up_to_date() and updates of the form UPDATE() both originate on the
master process. As each new task originates on the master, a new task ID
is issued as a monotonically increasing sequence. The timestamps for each
segment are then implemented as task ID's.
So an is_up_to_date() query can be answered by TOP-C simply
by determining the task ID of the current task being processed by
get_task_output(). That current task ID is compared with the maximum of
the timestamps for each segment being queried by is_up_to_date(). Those
timestamps are maintained by TOP-C in its internal table, and are task ID's
corresponding to the last update_environment() for each queried segment.
If the current task ID is "newer" (larger), then TOP-C returns true. Other-
wise, it returns false.
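The table of timestamps can be sketched as an array indexed by segment
number. This is a minimal sketch under one stated assumption: the timestamp
written at update time is taken to be the most recently issued task ID, so
that every task generated before the update compares as stale. The names
issue_task_id(), record_update() and the array-based argument to
is_up_to_date() are hypothetical.

```c
#define MAX_SEGMENTS 64                 /* assumed bound for this example */

/* Per-segment timestamp: a task ID, or 0 if the segment was never updated. */
static unsigned long last_update[MAX_SEGMENTS];
static unsigned long next_task_id = 1;  /* task IDs increase monotonically */

static int valid_segment(int seg)
{
    return seg >= 0 && seg < MAX_SEGMENTS;
}

/* The master issues a new ID for each task it originates. */
static unsigned long issue_task_id(void)
{
    return next_task_id++;
}

/* Stamp the listed segments when an UPDATE for them is applied. */
static void record_update(const int *segs, int n)
{
    unsigned long stamp = next_task_id - 1;   /* most recently issued ID */
    int i;
    for (i = 0; i < n; i++)
        if (valid_segment(segs[i]))
            last_update[segs[i]] = stamp;
}

/* True if the current task is newer than every queried segment's stamp. */
static int is_up_to_date(const int *segs, int n, unsigned long current_task_id)
{
    unsigned long newest = 0;
    int i;
    for (i = 0; i < n; i++)
        if (valid_segment(segs[i]) && last_update[segs[i]] > newest)
            newest = last_update[segs[i]];
    return current_task_id > newest;
}
```

Because the table lives only on the master, no locking is shown; a query
touches only the entries for the segments it names, so an update to segment 4
does not invalidate a task that depends only on segment 1.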
Thus, the extensions to is_up_to_date() and UPDATE() impose a minimal
additional burden on the TOP-C application programmer, while provid-
ing strong benefits in the form of higher concurrency. The partition of the
environment memory into segments by the application will often be a natural
extension of the application. For example, large application tables or other
arrays can be subdivided by partitioning the index set into equal subinter-
vals. Object-oriented applications will often partition their environment by
associating an object ID with each object, and associating a TOP-C segment
with the memory used by an object. The object ID can then also be used as
a segment number.
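For the array case, partitioning the index set into equal subintervals
reduces to a single integer division. The helper below is a hypothetical
illustration (segment_of() is not a TOP-C routine); it assumes a zero-based
index and rounds the subinterval width up so that every index maps to a
valid segment.

```c
/*
 * Map index (0 <= index < n) of an array of n entries to one of
 * num_segments segments of equal width. The width is rounded up, so the
 * last segment may be shorter than the others.
 */
static int segment_of(long index, long n, int num_segments)
{
    long width = (n + num_segments - 1) / num_segments;  /* ceil(n / k) */
    return (int)(index / width);
}
```

With 100 entries and 8 segments the width is 13, so indices 0 through 12
fall in segment 0 and index 99 falls in segment 7.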
5. TOP-C over Distributed Shared Memory
Existing DSM systems primarily provide physical memory management and
memory consistency. TOP-C provides memory management in the form
of implicitly specified TOP-C segments, where the user is responsible for
the memory organization, and the TOP-C framework provides consistency
management for this memory. Therefore the functionality of TOP-C and
a DSM system intersect in the area of memory management. This section
discusses the possible benefits and design of a combined system. There is not
yet an implementation of the ideas in this section.
The introduction of shared memory to TOP-C introduces a new problem
that was not present in the distributed memory of TOP-C. When the master
calls update_environment(), writes on the master take effect immediately
on the slave, due to the shared memory. This is handled in SMP through a
standard single-writer-multiple-reader solution by which readers may later
re-read any modified segment through a REDO action. Nevertheless, this
strategy also imposes a burden on the application writer in that do_task()
may return a wrong answer after reading inconsistent data, but it must be
guaranteed never to hang due to inconsistent data. DSM systems can emulate
the lazy updates of TOP-C under distributed memory by implementing lazy
release consistency.
Many DSM systems, such as TreadMarks [1], Quarks [16] and the earlier
Munin [2] system, support release consistency. Release consistency allows for
a weaker memory model in which an acquire operation is required before
reading or writing a shared variable, and a release operation is required
before another processor can acquire a shared variable. Release consistency
allows initiation of a new acquire operation without waiting for pending reads
to complete, and it allows a new write without waiting for pending release
operations to complete.
A typical implementation of release consistency is to implement two
library routines, acquire and release (Tmk_lock_acquire(lock_handle)
and Tmk_lock_release(lock_handle) in the case of TreadMarks), which
operate on a lock handle (an integer in the case of TreadMarks). After
an acquire operation, all writes by the application are noted by the DSM
system until a corresponding release operation. (Interception of writes can
be implemented by the UNIX system call, mprotect().) If a second process
acquires the same lock, then all of the modified pages will be replicated on
the second process.
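The remark about mprotect() can be made concrete with a small sketch of
write interception. It is not taken from any particular DSM system: the
names track_writes(), on_write_fault(), region and dirty are hypothetical,
the code assumes a Linux-like system where a write to a read-only page
raises SIGSEGV, and it calls mprotect() from the signal handler, which is
common practice but not guaranteed async-signal-safe by POSIX.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 16                       /* assumed limit for the example */

static volatile char *region;              /* page-aligned tracked memory   */
static size_t region_pages;
static size_t page_size;
static volatile sig_atomic_t dirty[MAX_PAGES];  /* 1 = page was written     */

/* Fault handler: note the written page, then allow the write to proceed. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    size_t page;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + region_pages * page_size)
        _exit(1);                          /* a genuine fault, not ours     */
    page = (addr - base) / page_size;
    dirty[page] = 1;
    mprotect((void *)(base + page * page_size), page_size,
             PROT_READ | PROT_WRITE);      /* the faulting write is retried */
}

/* Map `pages` read-only pages and start tracking writes to them. */
static int track_writes(size_t pages)
{
    struct sigaction sa;
    void *p;

    if (pages == 0 || pages > MAX_PAGES)
        return -1;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region_pages = pages;
    p = mmap(NULL, pages * page_size, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    region = p;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

After track_writes(3), the first write to each page faults once, sets the
corresponding dirty flag, and then completes. A DSM layer could use such
flags at release time to decide which pages to send, and would re-protect
the pages afterwards to start a new tracking interval.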
Release consistency is typically implemented in one of two variations.
These two variations differ in how to handle write updates. The first variation
is lazy release consistency. In lazy release consistency, a write update occurs
only after the call to release() by the writing process, and when a second
process then calls acquire() in an attempt to access the same page of
memory. The second variation is eager release consistency. In this variation,
modified pages are updated for all processes holding a copy of the page at the
time of the call to release(). This update can be "batched" for efficiency,
but the original call to release() may not be seen to complete by a second
process until the second process has received the "eager" write updates.
The preferred DSM policy for TOP-C is one of lazy release consistency
in which there are no page updates seen by other processes and no page
invalidations until after the call to release() and at the time of a second call
to acquire(). This mimics the TOP-C memory model of lazy, incremental
updates. This fits well with the TOP-C methodology, in which writes to any
one TOP-C segment are likely to be infrequent.
If TOP-C were implemented on top of a DSM system, this would require
appropriate calls of acquire() and release() by TOP-C to the
underlying DSM system. One would call acquire() before a call to
update_environment() and release() after the call. Before a call to
do_task() (on a slave), one would call acquire() immediately followed by
release() in order to receive the modified pages.
If one has implemented multiple segments of the environment in TOP-C,
one would invoke a different lock handle for each segment. It might
become necessary for update_environment() to take an additional argument,
specifying which segment to update. TOP-C would then guarantee to call
update_environment() repeatedly, once for each segment that needs to be
updated.
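The calling pattern proposed in the last two paragraphs can be sketched as
follows. Since no implementation exists yet, everything here is
hypothetical: dsm_acquire() and dsm_release() are stand-ins that only record
a trace, the segment number doubles as the lock handle, and
update_environment() is assumed to take the segment as its argument.

```c
#include <stdio.h>
#include <string.h>

static char trace[256];                  /* records the order of calls */

static void log_call(const char *what, int id)
{
    char buf[32];
    snprintf(buf, sizeof buf, "%s(%d) ", what, id);
    strncat(trace, buf, sizeof trace - strlen(trace) - 1);
}

/* Hypothetical stand-ins for the lock calls of a DSM library. */
static void dsm_acquire(int lock_handle) { log_call("acquire", lock_handle); }
static void dsm_release(int lock_handle) { log_call("release", lock_handle); }

/* Placeholder application routines. */
static void update_environment(int segment) { log_call("update", segment); }
static void do_task(void)
{
    strncat(trace, "do_task ", sizeof trace - strlen(trace) - 1);
}

/* Master side: one acquire/update/release round per modified segment. */
static void update_segments(const int *segs, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        dsm_acquire(segs[i]);
        update_environment(segs[i]);
        dsm_release(segs[i]);
    }
}

/* Slave side: an empty acquire/release pair per segment pulls in the
 * modified pages before the task reads them. */
static void run_task(const int *segs, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        dsm_acquire(segs[i]);
        dsm_release(segs[i]);
    }
    do_task();
}
```

Under lazy release consistency the empty pair on the slave is where the
page transfer would actually happen, which matches the lazy, incremental
update style of TOP-C.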
Plans are under way to test TOP-C on top of a DSM system. The
experimental version of TOP-C (using shared memory) will be tested. This
will provide important feedback about merging the TOP-C shared memory
model with the shared memory model used by DSM.
References
1. C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu,
and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of
Workstations", IEEE Computer, Vol. 29, No. 2, pp. 18-28, February 1996.
2. J. Carter, J. Bennett, and W. Zwaenepoel, Implementation and Performance of
Munin, Proc. 13th ACM Symp. Operating System Principles, 1991, pp. 152-164.
3. R. Chow and T. Johnson, Distributed Operating Systems and Algorithms,
Addison Wesley Longman, 1997.
4. G. Cooperman, "Practical Task-Oriented Parallelism for Gaussian Elimination
in Distributed Memory", Linear Algebra and its Applications 275-276, 1998,
pp. 107-120.
5. G. Cooperman, GAP/MPI: Facilitating Parallelism, Proc. of DIMACS
Workshop on Groups and Computation II 28, DIMACS Series in Discrete
Mathematics and Theoretical Computer Science, L. Finkelstein and W.M. Kantor
(eds.), AMS, Providence, RI, 1997, 69-84.
6. G. Cooperman, STAR/MPI: Binding a Parallel Library to Interactive Symbolic
Algebra Systems, Proc. of International Symposium on Symbolic and Algebraic
Computation (ISSAC '95), ACM Press, 126-132.
7. G. Cooperman, TOP-C: A Task-Oriented Parallel C Interface, 5th
International Symposium on High Performance Distributed Computing
(HPDC-5), 1996, IEEE Press, 141-150 (software at
ftp://ftp.ccs.neu.edu/pub/people/gene/top-c/).
8. G. Cooperman, L. Finkelstein, M. Tselman and B. York, Constructing
Permutation Representations for Matrix Groups, J. Symbolic Computation 24,
1997, pp. 1-18.
9. G. Cooperman and V. Grinberg, "TOP-WEB: Task-Oriented Metacomputing
on the WEB", International Journal of Parallel and Distributed Systems and
Networks 1, 1998, pp. 184-192; a shorter version appears as: "TOP-WEB:
Task-Oriented Metacomputing on the Web", G. Cooperman and V. Grinberg,
Proceedings of Ninth IASTED International Conference on Parallel and
Distributed Computing and Systems (PDCS-97), IASTED/Acta Press, Anaheim,
1997, pp. 279-286.
10. G. Cooperman and G. Havas, Practical parallel coset enumeration, Proc.
of Workshop on High Performance Computation and Gigabit Local Area
Networks, G. Cooperman, G. Michler and H. Vinck (eds.), Lecture notes in
control and information sciences 226, Springer Verlag, pp. 15-27.
11. G. Cooperman, G. Hiss, K. Lux, and Jürgen Müller, The Brauer tree of the
principal 19-block of the sporadic simple Thompson group, J. of Experimental
Mathematics 6(4), 1997, pp. 293-300.
12. G. Cooperman and M. Tselman, New Sequential and Parallel Algorithms for
Generating High Dimension Hecke Algebras using the Condensation Technique,
Proc. of International Symposium on Symbolic and Algebraic Computation
(ISSAC '96), ACM Press, 155-160.
13. G.C. Fox, W. Furmanski, M. Chen, C. Rebbi and J. Cowie, WebWork:
Integrated Programming Environment Tools for National and Grand Challenges,
Proc. of Supercomputing '95.
14. W. Gropp, E. Lusk and A. Skjellum, Using MPI, MIT Press, 1994.
15. J. Protić, M. Tomašević, V. Milutinović, Distributed Shared Memory: Concepts
and Systems, IEEE Computer Society Press, 1998.
16. M. Swanson, L. Stoller, J. Carter, "Making Distributed Shared Memory
Simple, Yet Efficient", Proc. of the 3rd Int'l Workshop on High-Level Parallel
Programming Models and Supportive Environments (HIPS'98), pages 2-13, March,
1998.
17. M. Tselman, Computing permutation representations for matrix groups in a
distributed environment, Proc. of DIMACS Workshop on Groups and
Computation II 28, DIMACS Series in Discrete Mathematics and Theoretical
Computer Science, L. Finkelstein and W.M. Kantor (eds.), AMS, Providence,
RI, 1997, 371-382.
Metacomputing in the Gigabit Testbed West
Thomas Eickermann 1 and Ferdinand Hommes 2
1 Forschungszentrum Jülich, Germany
2 GMD - Forschungszentrum Informationstechnik, Sankt Augustin, Germany
Abstract. The 'Gigabit Testbed West' is one of two testbeds for the upgrade of
the German scientific network to Gigabit capacity, which is planned for the
year 2000. It currently uses a 2.4 Gigabit/second ATM link to connect the
Research Centre Jülich and the GMD - National Research Center for Information
Technology in Sankt Augustin. The testbed is the basis for several application
projects ranging from metacomputing to multimedia. This contribution gives an
overview of the infrastructure of the testbed and the applications.
1 Introduction
A common definition of metacomputing - the shared use of distributed
supercomputing resources - contains different topics like unified access to the
batch systems of different computing centers [1],[2] or the simultaneous use
of several supercomputers by a single application. The first approach aims
to simplify the access to supercomputers, the second should allow the
solution of problems that could not be treated so far or solve problems more
efficiently. The coupling of supercomputers offers a way to increase the peak
CPU-performance and main memory accessible by a single application. This
allows e.g. particle simulations with large numbers of particles, where main
memory is often the limiting resource. Even more appealing is the so called
'heterogeneous metacomputing', which combines computers of different
architecture, massively parallel computers, vector-computers or special-purpose
machines like visualization servers [3].
A serious drawback is that the bandwidth and latency which are achievable
over an external network, no matter if local or wide-area, can usually
not compete with the performance of the internal network of a massively
parallel computer. Because of that, only certain classes of applications can
benefit from metacomputing. One such class is represented by so-called
'coupled fields' applications. Here, two or more space- and time-dependent
fields interact with each other. An implementation of such applications can
make explicit use of the performance hierarchy of the networks in the
following way. The fields are distributed over the machines of the
metacomputer and for each field, a parallelization via domain decomposition
can be performed. Typically, the fields have to be exchanged over the network
once per simulation timestep, while the calculation of each field often
requires several iterations per timestep, and communication within each
iteration. This means that
120
al t hough t he r equi r ement s for t he ext er nal net wor k can be qui t e hi gh, t hey
are usual l y smal l compar ed t o t he i nt er nal communi cat i on needs. A second
class of appl i cat i ons benefits f r om bei ng di s t r i but ed over s uper comput er s of
different ar chi t ect ur e, because t hey cont ai n par t i al pr obl ems t ha t can best be
solved on massi vel y paral l el or vect or - super comput er s. For ot her appl i cat i ons,
r eal - t i me r equi r ement s are t he reason t o connect several machi nes.
2 The Gigabit Testbed West
In Germany, the network that connects research, science and educational
institutions with each other and the rest of the Internet is operated by the
DFN-Verein, an association of these institutions founded in 1984. Since 1996
this network has been based on ATM-technology and allows for access capacities
up to 155 Mbit/s. An extension into the Gbit/s range on a national basis is
planned for the year 2000. To prepare this transition, two testbeds have been
set up in the western and southern parts of Germany. They will serve to
evaluate new network technology as well as to gain experience with applications
requiring bandwidths beyond the currently available 155 Mbit/s. In the area
of scientific computation, such applications can be found, e.g., in multimedia,
distributed access to huge amounts of data and of course in metacomputing,
which is the subject of this article.
The Gigabit Testbed West started as the first of the two German testbeds
in August 1997. It is a joint project of the Research Centre Jülich and the
GMD - National Research Center for Information Technology in Sankt Augustin
close to Bonn. In the first year of operation the two locations, which are
approximately 100 km apart, were connected by an OC-12 ATM link (622 Mbit/s)
based upon Synchronous Digital Hierarchy (SDH/STM4) technology. In August 1998
this link was upgraded to OC-48 (2.4 Gbit/s). The connection is provided by
o.tel.o Service GmbH and uses the optical fiber infrastructure inside the power
lines of the German power supplier RWE AG. In the framework of a beta-test,
Fore Systems ATM switches (ASX-4000) were used to connect the local networks of
the research centers to the OC-48 line. Initial stability problems that were
observed during the test turned out to be related to signal attenuation and
timing. Those problems have been solved and both the SDH link and the switches
are in stable operation now.
The application projects that use the testbed can rely on a solid base
of installed supercomputer capacity. Jülich is equipped with 512-node Cray
T3E-600 and 256-node T3E-900 massively parallel computers and a 16-processor
Cray T90 vector-computer. An IBM SP2 and a 12-processor SGI Onyx 2
visualization server are installed in the GMD. Besides several institutes in
the research centers in Jülich and Sankt Augustin, other institutions
participate in the testbed with their applications. These are the Alfred
Wegener Institute for Polar and Marine Research (AWI), the German Climate
Computing Center (DKRZ), the Universities of Cologne and Bonn, the National
German Aerospace Research Center (DLR) in Cologne, the Academy of Media Arts in
Cologne, and the industrial partners Pallas GmbH and echtzeit GmbH.
3 Supercomputer connectivity
A key factor for the success of metacomputing activities is the availability
of communication networks that provide high-bandwidth and low-latency
connections between the components of the metacomputer. Compared to the
networking equipment for WAN-backbones, ATM-connectivity for supercomputers has
evolved quite slowly. While 622 Mbit/s interfaces are now available for all
common workstation platforms, solutions are still outstanding for the major
supercomputers used in the Testbed West. For the Cray T3E, the Cray T90, and
the IBM SP2 only 155 Mbit/s are available (and will be in the foreseeable
future). For the SGI Onyx 2 a 622 Mbit/s ATM-interface is expected to be
available in early 1999. Therefore a different solution had to be found to
connect the supercomputers in Jülich and Sankt Augustin to the testbed.
The best performing network connection of the Cray supercomputers is
the 'High Performance Parallel Interface' (HiPPI), which offers a peak
performance of 800 Mbit/s when a low-level protocol and large transfer blocks
(1 MByte or more) are used. Even with TCP/IP communication, transfer rates
of more than 400 Mbit/s can be achieved within the local Cray complex in
Jülich. This is mainly due to the fact that HiPPI networks allow IP-packets
of up to 64 KByte size (MTU size). One way to interconnect IP networks
based on HiPPI and ATM technology is to use the ATM/HiPPI-gateway by
Ascend Communications. A serious limitation of this solution is that on the
HiPPI side only MTU sizes up to 9182 Byte are supported. Therefore we
followed a different approach. A workstation is equipped with a Fore Systems
622 Mbit/s ATM interface and a HiPPI interface and acts as an IP-router
between the HiPPI and the ATM network. Since the Fore ATM adapter supports
large MTU sizes, IP packet sizes of 64 KByte are possible on each part
of the network. We are currently using an SGI O200 and a SUN Ultra 30 as
dedicated routers for the Cray systems in Jülich.
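
As a rough illustration of why the larger MTU matters, the following sketch
counts how many full-size IP packets per second a host has to handle in order
to sustain a given rate. It is not taken from the testbed software: the function
name is hypothetical, the 400 Mbit/s rate is the TCP/IP figure quoted above, and
protocol header overhead is ignored.

```python
def packets_per_second(rate_mbit_s: float, mtu_bytes: int) -> float:
    """Full-size IP packets per second needed to carry rate_mbit_s.

    Assumption: every packet is exactly mtu_bytes long and headers are ignored.
    """
    return rate_mbit_s * 1e6 / 8 / mtu_bytes


RATE_MBIT_S = 400.0  # TCP/IP rate quoted for the local Cray complex

for label, mtu in (("64 KByte MTU", 64 * 1024), ("9182 Byte MTU", 9182)):
    print(f"{label}: {packets_per_second(RATE_MBIT_S, mtu):.0f} packets/s")
```

Under these assumptions the smaller MTU requires roughly seven times as many
packets per second, and therefore correspondingly more per-packet processing on
the hosts and the router.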
A similar solution was chosen to connect the IBM SP2 in Sankt Augustin
to the testbed. Eight SP-nodes are equipped with 155 Mbit/s ATM adapters and
one with a HiPPI interface. The ATM adapters are connected to the testbed
via a FORE ASX 1000. The HiPPI network is routed by a SUN E5000, which
also has a FORE 622 Mbit/s ATM adapter. Preliminary measurements show
a throughput of more than 370 Mbit/s between the Cray T3E in Jülich and
the IBM SP2 in Sankt Augustin. The layout of the network as of September
1998 is depicted in figure 1. Throughput values that were measured in that
network with various hardware are shown in table 1. The delay that is
introduced by the 100 km SDH/ATM line is about 0.9 msec. This value is still
below the delay introduced by the operating systems of the T3E (≈3 msec)
and the SP2 (≈2 msec), which were measured with ping-pong tests in local
networks.

Fig. 1. Configuration of the Gigabit Testbed West in summer 1998. Jülich and
Sankt Augustin are connected via a 2.4 Gbit/s ATM-link. The supercomputers are
attached to the testbed via HiPPI-ATM gateways, several workstations via 622 or
155 Mbit/s ATM interfaces.
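
The quoted line delay can be compared with a simple propagation estimate. The
sketch below is only a plausibility check: the signal speed in optical fibre
(about two thirds of the vacuum speed of light) is an assumption, the function
name is hypothetical, and the text does not state whether 0.9 msec is a one-way
or a round-trip value or how long the actual fibre route is.

```python
FIBRE_SPEED_KM_PER_S = 2.0e5  # assumption: roughly 2/3 of the vacuum speed of light


def propagation_delay_ms(distance_km: float, round_trip: bool = False) -> float:
    """Pure propagation delay over a fibre of the given length, in milliseconds."""
    one_way_ms = distance_km / FIBRE_SPEED_KM_PER_S * 1e3
    return 2 * one_way_ms if round_trip else one_way_ms


print(f"one way:    {propagation_delay_ms(100):.2f} ms")
print(f"round trip: {propagation_delay_ms(100, round_trip=True):.2f} ms")
```

With these assumptions a 100 km line contributes about 0.5 ms one way and 1 ms
for a round trip, the same order of magnitude as the measured value and below
the operating system delays mentioned above.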
Table 1. TCP throughput in ATM classical IP networks

                              adapter [Mbit/s]   throughput [Mbit/s]
Sun Ultra 60, Solaris 2.6           622                 530
Sun E5000, Solaris 2.6              622                 501
SP2, Thin node, AIX 4.1.5           155                 118
T3E-900                             155                 115
Onyx 2, IRIX 6.4                    155                 126
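
To compare the entries of table 1 across adapter speeds, the measured TCP
throughput can be expressed as a fraction of the nominal adapter rate. The
following sketch does this with the values from the table; the names TABLE_1
and utilization are introduced here for illustration only.

```python
# (system, nominal adapter rate [Mbit/s], measured TCP throughput [Mbit/s])
TABLE_1 = [
    ("Sun Ultra 60, Solaris 2.6", 622, 530),
    ("Sun E5000, Solaris 2.6", 622, 501),
    ("SP2, Thin node, AIX 4.1.5", 155, 118),
    ("T3E-900", 155, 115),
    ("Onyx 2, IRIX 6.4", 155, 126),
]


def utilization(adapter_mbit_s: float, throughput_mbit_s: float) -> float:
    """Measured throughput as a fraction of the nominal adapter rate."""
    return throughput_mbit_s / adapter_mbit_s


for name, adapter, throughput in TABLE_1:
    print(f"{name:28s} {100 * utilization(adapter, throughput):5.1f} %")
```

All systems reach between roughly 74 % and 85 % of the nominal rate; part of the
remainder is protocol overhead that no host can avoid.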
Fig. 2. VAMPIR timeline display of a metacomputing application running on two
SP2 and two T3E nodes. The horizontal axis is the execution time, each horizontal
bar represents a processor. Light parts of the bars depict calculations, dark parts
MPI communication. The black lines represent MPI messages.
4 Tools
To make metacomputing usable for a broader range of users, the availability
of at least a minimum set of tools is mandatory. Most important is a
metacomputing-aware communication library. In the Gigabit Testbed West,
it was decided to rely mainly on MPI [4], which has become the de-facto
standard on distributed memory parallel computers. A couple of features that
are useful for metacomputing applications are part of the MPI-2 [5]
definition. Dynamic process creation and attachment, e.g., can be used for
realtime-visualization or computational steering; language-interoperability
between C and FORTRAN is needed to couple applications that are implemented in
different programming languages. When the project started no
metacomputing-aware MPI-2 implementation was available (this is still true
today, except for the LAM implementation, which implements the dynamic features
of MPI-2 on workstation clusters [6]). Therefore such a development was
assigned to Pallas GmbH. A first prototype was finished in September 1998.
Until then, the PACX/MPI-library developed by the University of Stuttgart was
used [7]. It supports a subset of MPI-1 and allows Cray T3Es to be coupled.
This library has been ported to the IBM SP2 and optimized for high-speed
networks by the project partners in Jülich and Sankt Augustin. For MPI
point-to-point communication, throughput values of 73 Mbit/s with a latency of
6 msec have been observed between the Cray T3E in Jülich and the IBM SP2 in
Sankt Augustin. For those measurements the 155 Mbit/s ATM interfaces were
used. First experiments with the HiPPI/ATM gateway show significant
improvements compared to that value.
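
The interplay of the two figures quoted above can be sketched with the usual
linear model, in which the time for a message is the latency plus its size
divided by the bandwidth. This model is an assumption made here for
illustration, not the authors' measurement method; in particular, 73 Mbit/s is
treated as the asymptotic bandwidth, and all function names are hypothetical.

```python
def transfer_time_s(nbytes: float, latency_s: float = 6e-3,
                    bandwidth_mbit_s: float = 73.0) -> float:
    """Linear model: time = latency + size / bandwidth."""
    return latency_s + nbytes * 8 / (bandwidth_mbit_s * 1e6)


def effective_mbit_s(nbytes: float, latency_s: float = 6e-3,
                     bandwidth_mbit_s: float = 73.0) -> float:
    """Throughput actually seen by a single message of nbytes."""
    return nbytes * 8 / transfer_time_s(nbytes, latency_s, bandwidth_mbit_s) / 1e6


def half_rate_size_bytes(latency_s: float = 6e-3,
                         bandwidth_mbit_s: float = 73.0) -> float:
    """Message size at which half of the asymptotic bandwidth is reached."""
    return latency_s * bandwidth_mbit_s * 1e6 / 8


for size in (1_000, 54_750, 1_000_000, 10_000_000):
    print(f"{size:>10d} bytes: {effective_mbit_s(size):6.1f} Mbit/s")
```

In this model a message needs about 55 KByte to reach half of the asymptotic
rate, which suggests why applications on the testbed benefit from exchanging
few large messages rather than many small ones.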
Also important are tools for performance evaluation and tuning. For
message passing applications VAMPIR [8] is a well known product. It was
developed in the Research Centre Jülich and is now distributed by Pallas GmbH.
For use in this project VAMPIR has been extended by some metacomputing
features. Tracefiles that have been created on the different machines
of the metacomputer can be synchronized, merged and visualized in the
timeline display. Figure 2 shows an example. A wrapper library for the
instrumentation of PACX/MPI applications for use with VAMPIR was
also developed. No attempt has been made to develop a meta-debugger.
With PACX/MPI, messages that are exchanged between the machines can
be traced. For other problems, parallel debuggers like Totalview have to be
used separately on each machine.
5 Applications
A couple of application subprojects that touch different aspects of
metacomputing have been defined within the Gigabit Testbed West. In the
following, the aims and the status in summer 1998 of each application are
described briefly. More details will be presented in separate publications.
5.1 Solute transport in ground water
A typical 'coupled fields' scenario is the transport of solutes in ground
water. The interacting fields are the velocity of the ground water flow and
the concentrations of the solutes. Two independent programs that perform
this kind of 3-D simulation have been developed in the Institute for
Petroleum and Organic Geochemistry at the Research Centre Jülich. The
program TRACE (Transport of Contaminants in Environmental Systems)
simulates the flow of water in variably saturated, porous, heterogeneous
media. It uses a finite-element discretization of the model equations and has
been parallelized at the Central Institute for Applied Mathematics in Jülich
based on a domain decomposition [9]. It is coded in FORTRAN 90 and uses
MPI. The C++ program PARTRACE (PARticle TRACE) performs the simulation
of the solutes using a Monte-Carlo method. In their original versions,
the programs could only 'communicate' via files. TRACE simulates the
water flow until a stationary flow evolves and writes the resulting fields into
a file which is then used as input for the particle simulation that is done by
PARTRACE.
It was considered a serious restriction of this approach that the simulation
of particle transport is limited to stationary flows. To resolve this limitation,
the applications were coupled using PACX/MPI. Each of them now runs
in its own MPI-communicator and the water flow fields are exchanged via
message-passing.
Currently, TRACE is run on the T3E in Jülich and PARTRACE on the
SP2 in Sankt Augustin. In a typical run, 10 MBytes are transferred over the
testbed at the beginning of each timestep. With one timestep taking
approximately 2 seconds, this results in a moderate average network load.
Nevertheless, the peak rates are much higher, since all data are transferred in
a single burst. Currently, work is under way to improve the performance and
scalability of both applications. This will also result in increasing network
requirements. Furthermore it is planned to implement an online visualization
of the computation.
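
The difference between average and peak load in this scenario can be made
concrete with a short calculation. The sketch below uses the 10 MBytes per
2-second timestep quoted above; taking 1 MByte as 10^6 bytes, and using the
73 Mbit/s and 370 Mbit/s throughput figures reported in earlier sections as
example burst rates, are assumptions made here for illustration.

```python
def average_load_mbit_s(mbytes_per_step: float, step_seconds: float) -> float:
    """Network load averaged over one timestep (1 MByte taken as 10**6 bytes)."""
    return mbytes_per_step * 8 / step_seconds


def burst_duration_s(mbytes_per_step: float, throughput_mbit_s: float) -> float:
    """Time the burst occupies the link at a given sustained throughput."""
    return mbytes_per_step * 8 / throughput_mbit_s


print(f"average load: {average_load_mbit_s(10, 2):.0f} Mbit/s")
for rate in (73.0, 370.0):
    print(f"burst at {rate:5.0f} Mbit/s lasts {burst_duration_s(10, rate):.2f} s")
```

The average load is only 40 Mbit/s, but at 73 Mbit/s the burst alone would take
more than half of the 2-second timestep, whereas at 370 Mbit/s it takes roughly
a tenth of it. If the particle code has to wait for the field, this transfer
time adds directly to the time per step.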
5.2 MEG analysis
An application that can benefit from heterogeneous metacomputing emerges
from the analysis of magnetoencephalography (MEG) data. The magnetic field
around a human head is measured with an array of superconducting quantum
interference devices (SQUIDs). From these data, the distribution of electric
currents in the brain can be reconstructed by solving an inverse problem. In
Jülich, this is done with the 'Multiple Signal Classification' (MUSIC)
algorithm [10]. With MUSIC, parameters of a finite number of current dipoles
are obtained in three phases [11].
• The number of dipoles is estimated using statistical methods that
  separate signal from noise.
• The positions of the dipoles are calculated. This is done by finding the
  extrema of a function that measures how well a dipole placed at a given
  location is able to reproduce the signal estimated in the former step.
• In the last step, the time evolution of dipole strength and orientation is
  calculated.
The second phase is the most time-consuming but can be implemented on a
massively parallel computer very efficiently. The first phase is better suited
for a vector-computer. The reason is that it involves operations on
matrices that are too small to be efficiently parallelized (typically 360x360).
Separate measurements of a parallel program that implements the MUSIC
algorithm on the T3E and the T90 confirm this. As soon as our MPI-2
implementation is able to couple those machines, a distributed version of
the program should be able to achieve an overall execution time that is below
the time needed on either the T3E or the T90.
5.3 Realtime fMRI
Another experiment in Jülich that deals with brain activity is based on
functional Magnetic Resonance Tomography (fMRI) [12]. Here a test person is
exposed to, e.g., periodic visual or acoustic stimulations. The areas of brain
activity are identified by fitting the parameters of a model of the expected
response of the brain to the MRI data. This not only improves the sensitivity
of the measurement compared to simpler correlation methods but also allows
those models to be checked. Head movements of the test person tend to produce
artefacts in the detected activity. Therefore it is essential to correct for those
movements. In order to allow interactive response of the experimentalist, all
this should be done and visualized in real time.
It is planned to implement this with the following setup. The raw data
is transferred from the MRI scanner to the T3E, where it is processed. The
resulting functional data is handed over to an SGI Onyx 2 at the GMD in
Sankt Augustin. This machine creates an interactive 3-D representation of
the brain on a Responsive Workbench that is again located in Jülich. For that
purpose, two stereo-images have to be transferred over the gigabit testbed. In
order to allow for interactive movement and slicing by a person operating the
workbench, these images have to be updated several times a second. Currently
only a simple 2-D visualization of the processed data is implemented. This
setup is sketched in figure 3. It should be noted that a similar application has
recently been demonstrated by the Pittsburgh Supercomputing Center [13].
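
A back-of-the-envelope estimate shows why sending rendered stereo images back
to the workbench is a demanding use of the testbed. The image size, colour depth
and update rate below are purely hypothetical values chosen for illustration
(the text only states "several times a second"), and no compression is assumed.

```python
def stereo_stream_mbit_s(width: int, height: int, bytes_per_pixel: int,
                         updates_per_second: float) -> float:
    """Uncompressed bandwidth for a pair of stereo images at a given update rate."""
    bits_per_image = width * height * bytes_per_pixel * 8
    return 2 * bits_per_image * updates_per_second / 1e6


# Hypothetical parameters: 1024x768 pixels, 24-bit colour, 10 updates per second.
print(f"{stereo_stream_mbit_s(1024, 768, 3, 10):.1f} Mbit/s")
```

Under these assumed parameters the image stream alone needs several hundred
Mbit/s, which is beyond a 155 Mbit/s interface but within the capacity of the
2.4 Gbit/s link.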
Fig. 3. Setup of the fMRI experiment. The raw scanner data are transferred through
a front-end workstation to the T3E where they are processed. From there,
anatomical and functional brain-images are transferred to either a workstation
with a simple 2-D display or over the testbed to an Onyx 2 in the GMD. The
rendered images are sent back over the testbed to a Responsive Workbench in
Jülich.
5.4 Distributed climate and weather models
A second 'coupled fields' application in the gigabit testbed deals with the
distributed calculation of climate and weather models. Here, the
Alfred-Wegener-Institute (AWI), the German Climate Computing Center (DKRZ)
and the GMD will use the supercomputers in Jülich and Sankt Augustin for a
coupled simulation of atmospheric processes and the ocean-ice system. There
are two main differences to the ground water scenario. One is that here the
fields interact only at a 2-D interface, the ocean surface, whereas water and
solutes interact in the full 3-D simulation domain. This further reduces the
amount of data to be exchanged. Nevertheless, shorter simulation times for a
single timestep and higher model resolution lead to similar total bandwidth
requirements.
The second difference is that in the ground water case data flows in one
direction only: there is no feedback from the solutes to the ground water
flow. In contrast to that, both the ocean and the atmosphere models need the
fields from the other model as boundary conditions. Because of that, peak
bandwidth and latency of the network are much more critical here than in the
ground water problem.
5.5 Distributed fluid-structure interaction
A more general approach to 'coupled fields' type problems is pursued in the
EC funded project CISPAR. The idea there is to use well-established com-
mercial computational fluid dynamics (STAR-CD) and structural mechanics
codes (PAM-SOLID, PERMAS) for problems that involve the interaction of a
fluid with flexible structures. Examples for such problems are artificial heart
valves, torque converters or ships. A standard interface for those codes as well
as a coupling library (COCOLIB) have been developed by the GMD and in-
dustrial project partners. Within the Gigabit Testbed West, the COCOLIB
will be ported to the metacomputer after the end of the CISPAR project in
1999.
5.6 New networks and applications
The testbed is currently being extended by connecting new sites to the original
link between Jülich and Sankt Augustin and by defining new applications that
use those extensions. A dark fibre that links the national German Aerospace
Research Center (DLR) and the University of Cologne to the GMD has just
been set up. This line will be used for projects that range from distributed
traffic simulation and visualization to distributed virtual TV-production (in
cooperation between GMD, DLR, Academy of Media Arts in Cologne, and
echtzeit GmbH). The latter relies on the results of a multimedia project that
evaluates components for studio quality digital video transmission over ATM
in the testbed. A new 622 Mbit/s ATM-link between the University of Bonn
and the GMD will be the basis for metacomputing projects that deal with
multiscale molecular dynamics and lithospheric fluids. Here the PARNASS-
cluster [14] of the Institute for Applied Mathematics of the University of
Bonn is connected to the IBM SP2 and the Cray T3E.
6 Conclusion
This contribution gave an overview of the metacomputing activities in the
Gigabit Testbed West. The underlying 2.4 Gbit/s SDH and ATM technology
for the wide area backbone seems to be mature, a necessary condition for
the upgrade of the German scientific network that is planned for the year
2000. In contrast to that, the networking capabilities of the supercomputers
that are attached to the testbed have to be improved. The concept of a
HiPPI/ATM gateway seems to be promising. A couple of applications that
deal with various aspects of metacomputing are using the infrastructure of the
testbed. Their results should enhance our understanding of the conditions
under which distributed high-performance computing is feasible.
7 Acknowledgements
Most of the activities that are reported in this contribution are not the work
of the authors but of several persons in the institutions that participate in
the Gigabit Testbed West project. The authors wish to thank D. Conrads,
W. Frings, D. Gembris, T. Graf, R. Niederberger, S. Posse, M. Sczimarowski,
and H. Vereecken from the Research Centre Jülich, U. Eisenblätter, H. Grund,
W. Joppich, G. Göbbels, M. Göbel, M. Kaul, E. Pless, R. Völpel, K. Wolf,
P. Wunderling, and L. Zier at the GMD, W. Hiller and T. Störtkuhl at the
AWI, V. Gülzow at the DKRZ and J. Henrichs and K. Solchenbach at Pallas
GmbH, to mention but a few. We also wish to thank the BMBF for
partially funding the Gigabit Testbed West and the DFN for its support.
Special thanks to the University of Stuttgart for the PACX/MPI-library.
References
1. Erwin, D., The UNICORE Architecture and Project Plan, Workshop on Seamless
Computing, ECMWF, Reading, September 16-17, 1997.
2. Sander, V., High Performance Computer Management, Workshop
Hypercomputing, Rostock, September 8-11, 1997.
3. Eickermann, Th., Henrichs, J., Resch, M., Stoy, R., and Völpel, R.,
Metacomputing in gigabit environments: Networks, tools, and applications,
Parallel Computing 24, p. 1847-1872, 1998.
4. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard,
University of Tennessee, http://www.mcs.anl.gov/mpi/index.html, 1995.
5. Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing
Interface, University of Tennessee, http://www.mcs.anl.gov/mpi/index.html,
1997.
6. Burns, G.D., Daoud, R.B., Vaigl, J.R., LAM: An Open Cluster Environment for
MPI, Supercomputing Symposium '94, Toronto, Canada, June 1994.
7. Beisel, T., Gabriel, E., Resch, M., An Extension to MPI for Distributed
Computing on MPPs, in Marian Bubak, Jack Dongarra, Jerzy Wasniewski, Eds.,
Recent Advances in Parallel Virtual Machine and Message Passing Interface, p.
75-83, Springer-Verlag Berlin Heidelberg, 1997.
8. Nagel, W.E., Arnold, A., Weber, M., Hoppe, H.C., Solchenbach, K., VAMPIR:
Visualization and analysis of MPI resources, Supercomp. 63, Vol. XII, no. 1, p.
69-80, 1996.
9. Wimmershoff, R., Entwicklung und Implementierung einer dreidimensionalen
Partitionierungsstrategie für das Programm TRACE auf einem massiv
parallelen Rechner. Technical Report Forschungszentrum Jülich, Jül-3157, 1995,
in German.
10. Mosher, J.C., Lewis, P.S., and Leahy, R.M., Multiple Dipole Modeling and
Localization from Spatio-Temporal MEG Data. IEEE Trans. Biomed. Eng.
39, p. 541-557, 1992.
11. Beucker, R. and Schlitt, H.A., Objective Signal Subspace Determination for
MEG, Forschungszentrum Jülich, ZAM, FZJ-ZAM-IB 9715, 1997.
12. Ogawa, S., Lee, T.M., Kay, A.R., Tank, D.W., Brain magnetic resonance
imaging with contrast depending on blood oxygenation. Proc. Natl. Acad. Sci. USA
87, p. 9868-9872, 1990.
13. Goddard, N.H., Hood, G., Cohen, J.D., Eddy, W.F., Genovese, C.R., Noll, D.C.,
and Nystrom, L.E., Online Analysis of Functional MRI Datasets on Parallel
Platforms. Journal of Supercomputing, 11, p. 295-318, 1997.
14. Griebel, M., Zumbusch, G., Parnass: Porting gigabit-LAN components to a
workstation cluster, in W. Rehm ed., Proceedings of the 1st Workshop
Cluster-Computing, held November 6-7, 1997, in Chemnitz, Chemnitzer Informatik
Berichte, CSR-97-05, p. 101-124, 1997.
High Performance Metacomputing in a
Transatlantic Wide Area Application Testbed
Edgar Gabriel, Michael Resch, Paul Christ, Alfred Geiger, and Ulrich Lang 1
High Performance Computing Center Stuttgart
Allmandring 30,
D-70550 Stuttgart,
Germany
{gabriel, resch}@hlrs.de
Abstract. During the last couple of years, a wide variety of tools and libraries
have been developed to enable distributed computing and visualisation. This paper
presents the technical background and the results of one such project, meant to
couple different computational resources. A metacomputing implementation of MPI
called PACX-MPI was used to make the applications run on such a cluster. Three
applications were used for demonstration purposes. These applications had to be
adapted for metacomputing to make them more latency tolerant.
1 Introduction
In 1997 the HLRS was involved in two transatlantic projects in the frame
of the G7 Global Information Society Initiative "Global Interoperability of
Broadband Networks" (GIBN). One was from PSC and HLRS and focused on
the application aspect of metacomputing. The other one was from SNL
and HLRS and concentrated on distributed visualization in a virtual
laboratory. During the first project phase it became clear that the projects
should be merged into a Global Wide Area Application Test-bed (G-WAAT).
This would allow simulation and visualization to be coupled in a metacomputing
scenario.
The main targets of the merger were:
• To set up a production test-bed for metacomputing applications and
  distributed visualization
• To combine supercomputing forces in order to solve much larger problems
  than any of the partners could solve on their own resources
• To integrate software components in order to establish a collaborative
  simulation and visualization environment
As a first step this meant setting up a network connection fast enough to
allow distributed simulation and visualization. Second, it was necessary to
find communication software that enables metacomputing for one single
application. Third, applications had to be adapted to be able to fully exploit
the provided metacomputer. Fourth, distributed visualization software had to
be adapted and extended. In response to these needs a transatlantic network
connection was set up. The communication issue was resolved by implementing
a completely new communication library based on the MPI standard.
An existing collaborative visualisation software was extended and improved
[11].
The paper is organized as follows. The technical details of the test-bed
are described in section 2. Section 3 presents a library which enables
message-passing even between different Massively Parallel Processing
Systems (MPPs) or Parallel Vector Processors (PVPs). The results achieved
during Supercomputing '97 in San Jose and Supercomputing '98 in Orlando
using several applications are presented in section 4. A brief overview
of future work in this field is given in section 6.
2 A Transatlantic Network Connection
For metacomputing applications and collaborative working to achieve sufficient
network throughput, the most relevant network Quality of Service (QoS)
requirements are small and constant delays and nearly no packet losses.
Measurements taken by HLRS on the standard Internet connection, which
is provided and shared by the German DFN community and includes a
transatlantic link of 2*45 Mbps shared bandwidth, showed that the available QoS
between HLRS and the US varied strongly. During working hours in Europe
and the eastern part of the USA the packet losses varied between 10%
and 40%, resulting in varying TCP throughputs which were not sufficient for
effective metacomputing and cooperative work.
Therefore a dedicated transatlantic test-bed was established connecting
the two CRAY T3Es at HLRS and PSC based on a dedicated 2 Mbps ATM
channel. For the Supercomputing '97 event, this network was extended to
Sandia National Laboratories, Albuquerque, New Mexico, and to San Jose. For
the Supercomputing '98 event, the dedicated transatlantic ATM link was rebuilt
based on a dedicated 10 Mbps ATM channel. Figure 1 shows the geographic
extension and the participating network providers of the transatlantic
metacomputing environment during SC'98 (see also
http://www.hlrs.de/news/events/1998/sc98/).
2.1 Network Performance Measurements
With respect to latency, comparing the cross-Atlantic standard path
provided by DFN and the dedicated ATM-link, it was interesting to note the
effect of the number of routers involved and the translation of packet
losses into additional delays. The results achieved on the network connection
between a test workstation at HLRS and the CRAY T3E in Pittsburgh
over the standard path and the dedicated link are shown in Table 1.
As already mentioned, the network performance of the standard
path is strongly influenced by the European and US working hours. During a
small time-window in the European early morning hours, the packet loss and
packet round-trip time were acceptable and a TCP throughput of approx. 300
kByte/s was achievable. However, during the daytime, the IP packet losses
(measured with a packet size of 1 kByte) reduced the TCP throughput to
less than 50 kByte/s. The mean packet round-trip time on the standard path
ranged from 160 to 300 ms.

Fig. 1. Network for transatlantic metacomputing demonstrations during SC'98

Table 1. TCP throughput and network QoS between HLRS and PSC on standard
Internet and dedicated ATM link, Summer 1997.

Connection  Bandwidth  no. of   tcp-throughput   packet losses   delay
            [Mb/s]     routers  [kB/s] (3)       [%]             [ms]
                                day     night    day    night    day      night
DFN         2*45       15       50      300      30     3        180 (1)  160
ATM-Link    2          5        200     - (4)    0      0        150 (2)  - (4)

(1) average value (variation between 160 and 300 ms). (2) variation between
150 and 155 ms. (3) the socket buffer used was 64 kB. (4) no tests done.
On the dedicated ATM-link, there were practically no packet losses (during
SC'97 a small number of packet losses appeared during the change over from
CANARIE's ATM-network to CA*Net II) with a nearly constant round-trip time of
150 ms. This good link performance resulted in a constant TCP throughput of
200 kByte/s, which is the maximum throughput available on an ATM-link with
2 Mbit/s bandwidth.
The higher number of routers on the standard path introduced a relatively
small latency, so in the case of a small load as seen during European
nighttime hours the round-trip time on the standard path is comparable to
that of the direct ATM link.
Figure 2 shows a comparison of the network delay and packet losses during a
24 hour period over the standard path and the dedicated ATM-link. The data on
the dedicated ATM-link was captured during SC'97, the data on the standard
path some time after SC'97.
[Figure 2: four plots over a 24 hour period (00:00 to 24:00), dated 21.11.97
and 18.11.97: packet loss [%] vs. time [h] and mean rtt [ms] vs. time [h];
panel a) Standard-Path (DFN), panel b) dedicated ATM-Link.]
Fig. 2. Round trip time and packet loss on the standard path and the direct
ATM link during a 24 hour period.
As is well known, the TCP performance on links with large bandwidth times
delay products is strongly dependent upon the TCP window size, which is
configured on the end-systems through the TCP socket buffer sizes. On the
2 Mbit/s ATM-link a socket buffer size of 64 kByte was sufficient for
maximum TCP throughput.
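The relation between window size and throughput can be checked with a little
arithmetic. The following is a minimal sketch, not part of the original
measurements: the helper names are hypothetical, and only the link figures
(2 Mbit/s, 150 ms, 160 ms, 64 kByte) are taken from the text.

```python
# Sketch: bandwidth-delay product and window-limited TCP throughput.
# Function names are illustrative assumptions; link figures come from the text.

def bandwidth_delay_product_bytes(bandwidth_bit_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of this speed and RTT full."""
    return bandwidth_bit_per_s * rtt_s / 8.0


def window_limited_throughput(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on TCP throughput (bytes/s): one window per round trip."""
    return window_bytes / rtt_s


SOCKET_BUFFER = 64 * 1024  # 64 kByte socket buffer used in the tests

# Dedicated ATM link: 2 Mbit/s bandwidth, 150 ms round-trip time.
atm_bdp = bandwidth_delay_product_bytes(2e6, 0.150)
atm_window_fills_link = SOCKET_BUFFER >= atm_bdp

# Standard path at night: 160 ms round-trip time, same 64 kByte window.
dfn_night_bound = window_limited_throughput(SOCKET_BUFFER, 0.160)

print(f"ATM bandwidth-delay product: {atm_bdp / 1000:.1f} kByte")
print(f"64 kByte window fills the ATM link: {atm_window_fills_link}")
print(f"Window bound at 160 ms RTT: {dfn_night_bound / 1000:.0f} kByte/s")
```

Under these assumptions the bandwidth-delay product of the ATM link stays
below the 64 kByte buffer, which is consistent with the statement that this
buffer size was sufficient there. On links with much higher bandwidth the
same window would become the limiting factor.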
3 Interoperable MPI
For metacomputing the question of communication is a crucial one. The library
should be able to fully exploit the fast network of each single machine in
the metacomputing scenario. At the same time it should be able to support the
full communication functionality between different machines that an
application requires. PACX-MPI (PArallel Computer eXtension MPI) was designed
to enable message passing both inside an MPP and across the boundaries of
an MPP.
To realize this goal PACX-MPI has to distinguish between messages which
remain inside a machine, in this context called internal communication, and
messages which have to be transferred to another MPP, called external
communication. For the internal communication, PACX-MPI uses the
vendor-implemented MPI library, since this is nowadays the only optimized and
portable protocol which is available on every system and which can fully
exploit the capabilities of the underlying network. For the external
communication PACX-MPI should use a standard protocol, and the decision was
to implement TCP/IP as the first protocol.
To avoid each application node having to open a socket connection to another
node on a different machine when communicating, two so-called daemon nodes
have been introduced. These two nodes take care of outgoing and incoming
messages respectively and are therefore transparent to the application.
Since PACX-MPI has the goal to support the whole MPI 1.2 standard,
problems like the configuration of a global communicator had to be solved.
Figure 3 explains the global configuration of MPI_COMM_WORLD on
a metacomputer consisting of two machines.
[Figure 3: two MPPs whose nodes are labeled with a global number and a local
number; the application nodes together form MPI_COMM_WORLD.]
Fig. 3. MPI_COMM_WORLD on a metacomputer consisting of two MPPs
On the left machine, which shall be the machine with the number one, the
first two nodes with ranks 0 and 1 are not part of MPI_COMM_WORLD, since
these are the daemon nodes. The next node with the local rank 2 is therefore
the first node in our global communicator and gets the global rank number 0.
All other application nodes get a global number according to their local
ranks minus two; the last node on this machine has the global rank 3. On the
next machine, the daemon nodes again are not considered in the global
MPI_COMM_WORLD. The node with the local rank 2 is number 4 in the global
communicator, since the numbering on this machine starts with the last global
rank on the previous machine plus one. By introducing this renumbering and
mapping of local pids to global ones, one gets a global MPI_COMM_WORLD
without losing the local information.
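The renumbering can be illustrated with a small sketch. It is not the
PACX-MPI implementation; the function names are hypothetical, and the machine
sizes (six local processes each, of which two are daemons) are an assumption
inferred from the global node numbers 0 to 7 used in the figures. The rule
applied is the one stated above: daemon nodes are skipped, and numbering on
each machine continues with the last global rank of the previous machine
plus one.

```python
# Sketch of the local-to-global rank mapping (hypothetical helper names).

DAEMONS_PER_MACHINE = 2  # local ranks 0 and 1 are the daemon nodes


def build_rank_map(local_sizes):
    """Map (machine, local_rank) -> global rank for all application nodes.

    local_sizes[i] is the number of local MPI processes on machine i + 1,
    including the two daemon nodes. Machines are numbered from one.
    """
    mapping = {}
    next_global = 0
    for machine, size in enumerate(local_sizes, start=1):
        for local_rank in range(DAEMONS_PER_MACHINE, size):
            mapping[(machine, local_rank)] = next_global
            next_global += 1
    return mapping


def locate(mapping, global_rank):
    """Inverse lookup: which machine and local rank host a global rank?"""
    for key, value in mapping.items():
        if value == global_rank:
            return key
    raise KeyError(global_rank)


# Assumed example: two machines with six local processes each.
rank_map = build_rank_map([6, 6])
for (machine, local_rank), global_rank in sorted(rank_map.items()):
    print(f"machine {machine}, local rank {local_rank} -> global rank {global_rank}")
```

Keeping the mapping as an explicit table preserves the local information, so
a global rank can always be translated back to a machine and a local rank.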
3.1 Point-to-point operations in PACX-MPI
A point-to-point operation in PACX-MPI can be briefly described as follows.
The sender first has to check whether the receiving node is on the same
machine or not. If it is on the same machine, it can directly send the
message to the receiving node using native MPI commands. If it is on a
different machine, as in the example of Figure 4, it first has to create a
header, which contains all information needed to identify a message, and then
a data package. Both packages are sent to a daemon node. The daemon node
transfers both packages to the destination machine, where another daemon node
receives the message and hands it over to the destination node.
[Figure 4: two machines with their nodes labeled by global number and local
number; legend: command package, data package, return value (optional).]
Fig. 4. Point-to-point operation from global node 2 to global node 7
The receiver also has to check whether the communication is internal or
external. For an internal communication it can directly execute an MPI_Recv
command. The only additional work which has to be done in this case is that
the MPI_Status has to be adapted, since global and local numbers are not
identical (see Figure 3).
If the communication is external, the receiving node first checks whether
the expected message is already in the buffer. If not, it has to receive the
header and the data packets from a daemon node.
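The sender-side decision described in this section can be sketched as
follows. This is an illustration only: the header fields, the function names
and the assumption of equally sized machines with four application nodes each
are hypothetical, since the text only states that the header contains all
information needed to identify a message.

```python
# Sketch of the internal/external routing decision (illustrative names only).
from dataclasses import dataclass


@dataclass
class Header:
    """Command package: assumed fields that identify a message."""
    source: int   # global rank of the sender
    dest: int     # global rank of the receiver
    tag: int      # message tag
    length: int   # payload size in bytes


def machine_of(global_rank, app_nodes_per_machine):
    """0-based machine index hosting a global rank (equal-sized machines assumed)."""
    return global_rank // app_nodes_per_machine


def plan_send(source, dest, tag, payload, app_nodes_per_machine=4):
    """Return the hops a message takes as a list of (action, detail) tuples."""
    src_machine = machine_of(source, app_nodes_per_machine)
    dst_machine = machine_of(dest, app_nodes_per_machine)
    if src_machine == dst_machine:
        # Internal communication: hand the message to the native MPI library.
        return [("native MPI send", f"{source} -> {dest}")]
    # External communication: header first, then the data package.
    header = Header(source, dest, tag, len(payload))
    return [
        ("header + data to outgoing daemon", f"machine {src_machine}"),
        ("TCP/IP transfer", f"machine {src_machine} -> machine {dst_machine}"),
        ("incoming daemon delivers", f"{header.length} bytes to node {header.dest}"),
    ]


internal_hops = plan_send(2, 3, tag=0, payload=b"abc")
external_hops = plan_send(2, 7, tag=0, payload=b"abc")
for action, detail in external_hops:
    print(f"{action}: {detail}")
```

The application node itself never opens a socket in this scheme; only the
daemon nodes take part in the TCP/IP transfer.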
3.2 Global operations
Roughly speaking global operations in MPI can be split in two groups. The
first group of operations has a root-node, which has to distribute (e.g.
broadcast) or to receive (e.g. reduce) some global data. The second group has
no such root-node; all nodes have the same status (e.g. barrier) or data
(e.g. allreduce, all-to-all) after the global operation.
The first group of global operations, which have such a root-node, are split
into two parts in PACX-MPI. One part is to distribute/collect the data
between the machines, and a second part is a local operation inside the
machine. The sequence of these two parts depends on the operation. For a
broadcast, data will first be distributed to all machines and afterwards the
local broadcast will be performed. For a reduce operation PACX-MPI first has
to perform the local operation, and only in the second step will the global
collecting of data be performed.
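The two-part structure of rooted operations can be sketched in plain Python.
The code below only simulates the order of the steps on nested lists; it is
not PACX-MPI code, the function names are hypothetical, and counting one
external message per non-root machine is an assumption made for illustration.

```python
# Sketch: ordering of local and global steps for rooted collectives.
from functools import reduce
import operator


def hierarchical_reduce(values_per_machine, op=operator.add):
    """Reduce: local operation first, global collecting second.

    values_per_machine[i] holds the contributions of the nodes on machine i.
    Returns the global result and the number of external messages, assuming
    every machine except the root's sends exactly one local result.
    """
    local_results = [reduce(op, values) for values in values_per_machine]  # step 1
    external_messages = len(local_results) - 1
    return reduce(op, local_results), external_messages                    # step 2


def hierarchical_broadcast(value, nodes_per_machine):
    """Broadcast: distribute to all machines first, local broadcast second."""
    per_machine = [value for _ in nodes_per_machine]                    # step 1
    return [[v] * n for v, n in zip(per_machine, nodes_per_machine)]     # step 2


total, messages = hierarchical_reduce([[1, 2, 3], [4, 5]])
copies = hierarchical_broadcast("x", [2, 3])
print(total, messages, copies)
```

The point of the ordering is that only one value per machine has to cross
the slow external link, regardless of how many nodes each machine has.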
For the second class of global operations without a root-node, there are
several possibilities. The main difference between the algorithms is whether
we are executing an all-to-all exchange of data between the machines, or
whether we are collecting the global result on a dedicated node and
distributing the global results in a second step.
Let's regard this situation using MPI_Allreduce. In the first algorithm each
machine would execute a local MPI_Reduce, using a local root-node. In a
second step, each machine would send its result to all other machines,
calculate locally the global result and distribute this to all nodes on its
machine. In this case we will have
\[ N \cdot (N - 1) \]
messages which have to be exchanged between all machines, with N being the
number of coupled machines. The advantage of this algorithm is that all
external communication steps can theoretically be performed in parallel.
In the second algorithm each machine again executes a local reduce, but in
the second step they all send their local result to a dedicated node, which
calculates the global result and then distributes this result to all other
machines. In this case we will have
\[ 2 \cdot N \]
external communication steps, but only N communications can be performed
in parallel at the same time. Which of these two algorithms performs faster
is a subject of current investigation and depends strongly on the network
configuration between the machines.
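To get a feeling for the two message counts, they can be tabulated for a few
values of N. This is only a sketch using the two expressions as given in the
text; the function names are hypothetical, and message counts alone do not
decide which algorithm is faster.

```python
# Sketch: external message counts of the two allreduce algorithms.

def all_to_all_messages(n_machines):
    """First algorithm: every machine sends its result to all others."""
    return n_machines * (n_machines - 1)


def dedicated_node_messages(n_machines):
    """Second algorithm: count of external steps as given in the text."""
    return 2 * n_machines


for n in (2, 3, 4, 8, 16):
    print(f"N={n:2d}: all-to-all {all_to_all_messages(n):3d}, "
          f"dedicated node {dedicated_node_messages(n):3d}")
```

With these expressions the all-to-all variant needs fewer messages for two
machines, the counts coincide for three, and the dedicated-node variant
needs fewer from four machines on. Whether the parallelism of the all-to-all
exchange compensates for its larger message count depends on the network.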
3.3 Related work
There are several works related to this theme, each having a somewhat
different approach. The well-known Globus project [7] tries to build up a
whole range of metacomputing services, including distributed computing for
MPI applications based on MPICH [6] and the NEXUS communication library. A
disadvantage of this approach is that every external communication step is
done by a direct node-to-node connection. For really big configurations this
can lead to problems because of too many open ports/sockets. Additionally the
underlying NEXUS library has no support for global operations. Therefore the
execution time for a broadcast operation, for example, is strongly dependent
on the distribution of nodes on the different MPPs.
The MagPie project [9] was set up to solve this problem. This project
implemented global operations for clusters of machines for MPICH, but on
the other hand they still do not solve the problem of the direct
point-to-point operations.
PVMPI [4] makes MPI applications run on a cluster of machines by using
PVM for the communication between the different machines. Unfortunately
the user can use only point-to-point operations and has to add some calls
that do not conform to MPI. The subsequent project, MPI_Connect, uses the
same ideas but replaced PVM by a library called SNIPE [5], and now supports
global operations too, in contrast to PVMPI.
A similar approach has been taken by PLUS [1]. This library additionally
supports communication between different message-passing libraries, e.g.
PARMACS, PVM and MPI. But again the user has to add some calls to his
application.
Another project called Stampi [8] has been presented recently. This project
already uses the MPI2 process model, but focuses mainly on local area
computing. On the other hand they distinguish between one/two/three hop
communication, and therefore a metacomputer need not perform direct
node-to-node communication but can use some kind of daemon for the external
communication.
4 Applications and Results
During Supercomputing '97 in San Jose and Supercomputing '98 in Orlando
several demonstrations were given using PACX-MPI. In this section we will
briefly describe the applications used in the metacomputing environment and
we will also present some of the results achieved.
4.1 URANUS
The first application is a CFD code called URANUS (Upwind Relaxation
Algorithm for Nonequilibrium Flows of the University of Stuttgart) [2]. This
program has been developed for simulating the reentry of a space vehicle in a
wide altitude velocity range. The reason why URANUS was tested in such an
environment is that soon two additional components of URANUS will have a
great demand on memory: the nonequilibrium part has been finished in the
sequential code and will be parallelized soon. Furthermore we will simulate
the Crew-Rescue-Vehicle (X-38) of the new international space station with
more than 3 million cells. Both components together require memory in the
range of hundreds of Gigabytes, which cannot be provided by a single machine
today. During SC98 we simulated the European space vehicle HERMES with
1.7 million cells using 992 CPUs on two Cray T3Es.
The code is based on a regular grid decomposition, which leads to a very
good load balancing and a simple communication pattern.
In the following we give the overall time it takes to simulate a medium size
problem with 880,000 grid cells. For the tests we simulated 10 iterations. We
compared a single machine with 128 nodes and two machines with 2 times 64
nodes. Obviously the unchanged code is much slower on two machines.
Method              128 nodes     2*64 nodes
                    using MPI     using PACX-MPI
URANUS unchanged    102.4         156.7
URANUS modified     91.2          150.5
URANUS pipelined    -             116.7
Table 2. Comparison of timing results (sec) in metacomputing for URANUS
However, the overhead of 50% is relatively small with respect to the slow
network. Modification of the pre-processing does not improve the situation
much. A lot more can be gained by fully asynchronous message-passing. Using
so-called 'Message Pipelining' [2], messages are only received if available.
The receiving node may continue the iteration process without having the most
recent data in that case. This helped to reduce the computing time
significantly. The implication of this method is that for convergence the
number of iterations has to be increased by about 10 percent. Additionally
one has to take care that the messages are not older than two iterations,
since this may prevent convergence altogether. Tests for one single machine
were not run because results are no longer comparable with respect to
numerical convergence.
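The idea of receiving only what is available, combined with a bound on the
age of the data, can be sketched as a small simulation. This is not the
URANUS code: the function name, the arrival schedule and the exact definition
of message age (current iteration minus the iteration that produced the data)
are assumptions made for illustration.

```python
# Sketch of 'Message Pipelining' with a staleness bound (simulation only).

MAX_AGE = 2  # messages must not be older than two iterations


def run_pipelined(n_iterations, arrival):
    """Simulate one receiving node.

    arrival[v] is the (hypothetical) iteration at which the neighbour's data
    of version v becomes available locally. Version -1 denotes the initial
    boundary values. Returns a list of (iteration, version_used, waited).
    """
    log = []
    newest = -1
    for it in range(n_iterations):
        # Non-blocking receive: accept every version that has already arrived.
        while newest + 1 < it and arrival[newest + 1] <= it:
            newest += 1
        # Staleness guard: if the data would be too old, block for newer data.
        waited = False
        while it - newest > MAX_AGE:
            newest += 1
            waited = True
        log.append((it, newest, waited))
    return log


# Assumed schedule: version 0 arrives on time, later versions are delayed.
schedule = [1, 5, 5, 5, 6]
history = run_pipelined(6, schedule)
for iteration, version, waited in history:
    print(f"iteration {iteration}: uses version {version}, waited={waited}")
```

In this sketch the node keeps iterating with older data as long as the age
bound holds and only blocks when the bound would be violated, which mirrors
the trade-off described above between hiding latency and convergence.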
4.2 P3T-DSMC
The second application is P3T-DSMC. This is an object-oriented Direct
Simulation Monte Carlo code which was developed at the Institute for Computer
Applications (ICA I) of Stuttgart University for general particle tracking
problems [10].
Since Monte Carlo methods are well suited for metacomputing, this
application gives a very good performance on the transatlantic connection.
For a small number of particles the metacomputing shows some overhead. But
for larger numbers, up to 125,000 particles, timings for one time step are
identical on one machine and in the metacomputing testbed. This excellent
behaviour is due to two basic features of the code. First, the computation to
communication ratio becomes better if more particles are simulated per
process. Second, latency can be hidden more easily if the number of particles
increases.
Particles/CPU   60 nodes      2*30 nodes
                using MPI     using PACX-MPI
1935            0.05          0.28
3906            0.1           0.31
7812            0.2           0.31
15625           0.4           0.4
31250           0.81          0.81
125000          3.27          3.3
500000          13.04         13.4
Table 3. Comparison of timing results (sec) in metacomputing for P3T-DSMC
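The effect of the computation to communication ratio can be read off the
timings in Table 3 by computing the relative overhead of the metacomputing
run. The following sketch only restates the table values; the variable and
function names are hypothetical.

```python
# Sketch: relative overhead of the metacomputing runs, from Table 3.

# (particles per CPU, seconds on 60 nodes with MPI, seconds on 2*30 nodes with PACX-MPI)
TABLE_3 = [
    (1935, 0.05, 0.28),
    (3906, 0.10, 0.31),
    (7812, 0.20, 0.31),
    (15625, 0.40, 0.40),
    (31250, 0.81, 0.81),
    (125000, 3.27, 3.30),
    (500000, 13.04, 13.40),
]


def relative_overhead(t_single, t_meta):
    """Extra time of the metacomputing run relative to the single machine."""
    return (t_meta - t_single) / t_single


overheads = {n: relative_overhead(single, meta) for n, single, meta in TABLE_3}
for particles, overhead in overheads.items():
    print(f"{particles:>7} particles/CPU: {100 * overhead:6.1f} % overhead")
```

For the smallest problem the fixed communication cost dominates the time
step, while for the larger problems the overhead shrinks to a few percent or
vanishes within the precision of the table.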
4.3 P3T-MD
The third application is also based on the P3T toolkit, but instead of a
Monte Carlo code this program solves the molecular-dynamics equations to
simulate the interactions between the particles. Therefore the code is more
strongly coupled than P3T-DSMC. During the SC98 event, a lot of tests were
performed with both P3T applications using up to 1024 processors. The results
could not yet be fully evaluated.
Often molecular-dynamics simulations generate large outputs. It makes no
sense to store a complete configuration of such a simulation in a
metacomputing environment, since this would require a lot of additional time.
For big simulations P3T-MD does a distributed postprocessing of the data. For
example, to calculate the force distribution between the particles, each node
does its own statistical analysis and only the result of this analysis is
stored instead of the raw data.
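Distributed postprocessing of this kind can be sketched with a histogram that
is computed per node and merged afterwards. The code is an illustration, not
taken from P3T-MD: the function names, the bin width and the sample force
values are all assumptions.

```python
# Sketch: per-node statistics instead of shipping raw particle data.
from collections import Counter


def local_force_histogram(forces, bin_width):
    """Analysis done on one node: histogram of force magnitudes by bin index."""
    return Counter(int(force // bin_width) for force in forces)


def merge_histograms(histograms):
    """Combine the small per-node results into the global distribution."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total


# Hypothetical raw data held by two nodes; it never leaves the node.
forces_per_node = [
    [0.1, 0.4, 1.2],
    [0.7, 1.9, 2.5, 0.2],
]
local_results = [local_force_histogram(f, bin_width=1.0) for f in forces_per_node]
global_histogram = merge_histograms(local_results)
print(sorted(global_histogram.items()))
```

Only the bin counts have to be transferred and stored, so the amount of data
crossing the external network no longer grows with the number of particles.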
5 Conclusions
The last two chapters have pointed out that one has to invest a lot of work
until the applications perform on a cluster of MPPs. An MPI implementation
suitable for such a cluster of machines has completely different requirements
than an MPI library working on a single machine. Optimizing point-to-point
operations by dealing with different protocols is required but still not
enough. The global operations have to be adapted to algorithms dealing with
latencies of different ranges.
Additionally applications have to be adapted to become more latency tolerant
and to use less bandwidth. A CFD application like URANUS is much more
difficult to adapt for such an environment, since it is strongly coupled and
it is not simple to save communication without losing numerical performance.
The key question in any case is whether one succeeds in overlapping
communication and computation.
A Monte Carlo method like P3T-DSMC communicates less than the application
above, and is therefore automatically better suited for metacomputing.
Problems may still arise when dealing with huge amounts of data which have to
be transferred to a single machine for the final output. Thus some
distributed postprocessing operations are indispensable in order to transfer
and save only really important data.
These points may be summarized with regard to the costs of metacomputing.
Since costs for the networks depend on the reserved bandwidth and the time
for which we are using the network, algorithms should be developed that
consider economic aspects as well. Nevertheless, there are some applications
for which metacomputing may nowadays be the only method to get some results,
and for which it is worth doing the whole work.
6 Outlook
The future metacomputing activities of the High Performance Computing
Center Stuttgart will be focused in a project called METODIS (MEtacomputing
TOols for DIstributed Systems). This project is supported by the European
Community and has the major goal of creating a set of tools for
metacomputing. This will include a MetaMPI based on PACX-MPI, a general ATM
interface that will be used by PACX-MPI, and a metacomputing version of the
performance analysis tool VAMPIR, which will be coupled to PACX-MPI.
To achieve full support for MPI 1.2, PACX-MPI has to be extended to
support more functions. Up to now we have implemented mainly the MPI calls
needed by our applications. Additionally PACX-MPI will be extended to
support not only TCP/IP for the external communication, but also other
protocols, like ATM or HiPPI.
Acknowledgement
The authors would like to thank the networking organizations and groups for
their helpful support, especially German Telekom, Teleglobe, CANARIE,
STAR TAP, vBNS, ESNet. In addition we would like to thank Pittsburgh
Supercomputing Center and the High Performance Computing Center Stuttgart
for providing their machines for our tests and demonstrations.
References
1. Matthias Brune, Jörn Gehring and Alexander Reinefeld (1997) Heterogeneous
Message Passing and a Link to Resource Management, Journal of
Supercomputing, Vol. 1, 1-17.
2. Thomas Bönisch and Roland Rühle (1998) Adapting a CFD Code for
Metacomputing, 10th International Conference on Parallel CFD,
Hsinchu/Taiwan, May 11-14.
3. Th. Eickermann, J. Heinrichs, M. Resch, R. Stoy, R. Völpel (1998)
Metacomputing in gigabit environments: Networks, tools and applications,
Parallel Computing 24, 1847-1872.
4. Graham E. Fagg, Jack J. Dongarra and Al Geist (1997) Heterogeneous MPI
Application Interoperation and Process management under PVMPI, in Marian
Bubak, Jack Dongarra, Jerzy Wasniewski (Eds.) 'Recent Advances in Parallel
Virtual Machine and Message Passing Interface', 91-98, Springer.
5. Graham E. Fagg, Keith Moore, Jack J. Dongarra, Al Geist (1997) Scalable
Networked Information Processing Environment (SNIPE), Technical Paper,
Supercomputing 1997.
6. Ian Foster, Jonathan Geisler, William Gropp, Nicholas Karonis, Ewing Lusk,
George Thiruvathukal, Steven Tuecke (1998) Wide-Area Implementation of the
Message Passing Standard, Parallel Computing 24.
7. Ian Foster, Carl Kesselman (1998) The Globus Project: A Status Report,
Proc. IPPS/SPDP '98 Heterogeneous Computing Workshop, pg. 4-18, 1998.
8. Toshiya Kimura, Hiroshi Takemiya (1998) Local Area Metacomputing for
Multidisciplinary Problems: A Case study for Fluid/Structure Coupled
Simulation, 12th ACM International Conference on Supercomputing, Melbourne,
July 13-17.
9. Thilo Kielmann, Rutger F.H. Hofman, Henri E. Bal, Aske Plaat, Raoul A.F.
Bhoedjang (1998) MagPie: MPI's Collective Communication Operations for
Clustered Wide Area Systems, to appear at PPoPP'99, online version available
at http://www.cs.vu.nl/albatross
10. Matthias Müller and Hans J. Herrmann (1998) DSMC - a stochastic algorithm
for granular matter, in Hans J. Herrmann, J.-P. Hovi and Stefan Luding
(Eds.) 'Physics of dry granular media', Kluwer Academic Publisher.
11. Andreas Wierse (1995) Performance of the COVISE visualization system under
different conditions, in Georges G. Grinstein, Robert F. Erbacher (Eds.)
Visual Data Exploration and Analysis II, Proc. SPIE 2410, pages 218-229,
San Jose.
MILESS - A Learning and Teaching Server for
Multi-Media Documents
Holger Gollan, Frank Lützenkirchen, and Dieter Nastoll
Computer Center, Essen University, Schützenbahn 70, 45117 Essen, Germany
Abstract. MILESS [7] is a joint project between the Computer Center and the
Central Library of Essen University, together with the two pilot departments
of linguistics and physics. The main purpose is to provide students and
faculty of Essen University with a library server that supports several
different functions that are needed within a digital library. Based on the
IBM DB2 Digital Library product [4], MILESS can store and retrieve digital
documents in any given format; moreover, searching is possible in a very
elaborate way, and access control is supported as well.
In this article, we will first discuss why there is a growing need for
digital library servers, followed by a description of how MILESS is built on
top of the IBM DB2 Digital Library. We will describe the software techniques
that are used to build the system, and we give a test case for the use of
MILESS when referencing different articles within a mathematical journal.
1 The Need for a Digital Library
With the evolving web technologies, the amount of digital data that is
accessible via the internet is growing dramatically. Usually, these data
will appear on certain websites, either personal or institutional. In
addition, there might be commercial places on the web that hold lots of
information in different digital formats. But this huge set of information
leads to several problems.
• It is sometimes hard to find.
• It has no systematic order.
• It might vanish without further notice.
On the other hand, classical libraries have to find new ways to enable their
customers to work with this new material in addition to the well-known books
and journals on the shelves. Moreover, university students and faculty want
to use digital and esp. multimedia material in learning, teaching and
research.
To face these problems and to meet these needs, a digital version of the
classical library services is needed. It should support the use of such
material by providing reliable, permanent, and systematically ordered access
to it.
2 MILESS and the IBM DB2 Digital Library
In late 1997, the Computer Center and the Central Library of Essen Uni-
versity started the MILESS project, which was funded by the local state
ministry and the university. The idea was to install a digital library server
that could solve the problems mentioned in the previous section. While the
Computer Center brought in its know-how in information technology and software
development, the Central Library started to redefine the classical library
techniques and services for the new types of digital and multimedia objects.
In addition, two pilot departments (linguistics and physics) started to fill the
digital library server with appropriate material.
To store and archive the digital documents, MILESS uses the IBM DB2
Digital Library product [4]. Its main parts consist of one library server and
several object servers, where the object servers are responsible for the actual
storage of the documents, but access is only possible via the library server
that controls and manages the documents that are put into the digital library.
Using this control mechanism, it is impossible to delete any documents in the
object servers without notification of the library server, hence there can be no
dead links within the system. The library server itself is running on top of a
DB2 database, and the object servers can be connected to an ADSTAR Dis-
tributed Storage Manager (ADSM) [1] that handles the storage and archiving
problems. E.g., documents can be archived when they have not been used for
a longer time.
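The control mechanism described above, where every access and every deletion
passes through the library server, can be sketched in a few lines. This is a
conceptual illustration only and does not use the IBM DB2 Digital Library
API; all class and method names are hypothetical.

```python
# Sketch: a library server mediating all access to an object server.

class ObjectServer:
    """Holds the actual files; clients never address it directly."""

    def __init__(self):
        self._files = {}

    def store(self, key, data):
        self._files[key] = data

    def fetch(self, key):
        return self._files[key]

    def remove(self, key):
        del self._files[key]

    def has(self, key):
        return key in self._files


class LibraryServer:
    """Keeps the index and is the only component that talks to the object server."""

    def __init__(self, object_server):
        self._objects = object_server
        self._index = {}  # document id -> storage key

    def add(self, doc_id, data):
        key = f"obj-{doc_id}"
        self._objects.store(key, data)
        self._index[doc_id] = key

    def get(self, doc_id):
        return self._objects.fetch(self._index[doc_id])

    def delete(self, doc_id):
        # Index entry and stored object are removed together.
        key = self._index.pop(doc_id)
        self._objects.remove(key)

    def dead_links(self):
        return [d for d, k in self._index.items() if not self._objects.has(k)]


library = LibraryServer(ObjectServer())
library.add("paper-1", b"%PDF ...")
library.add("paper-2", b"%!PS ...")
library.delete("paper-1")
print(library.dead_links())
```

Because the index and the stored object are only ever changed together, the
index cannot end up pointing at a document that no longer exists.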
The IBM DB2 Digital Library product offers a lot of features including
several services that are useful for a digital library server. It handles storage
and management of the documents via the object servers, and it enables
access control via rights management. Moreover, it has sophisticated search
techniques, e.g. text mining and Query By Image Content (QBIC), to enable
the user to find what he (she) is looking for within the stored digital
documents.
While the IBM DB2 Digital Library product does an excellent job when
it comes to the storage and retrieving problem, it is of no great help for
the implementation of the MILESS data model. Since we wanted to have
great flexibility in this respect, we adopted the Dublin Core [2] standard
for the description of electronic resources, adding some additional features
like contact information for the creators and contributors of the documents.
Thus a document within MILESS can e.g. have several titles, several creators
and contributors, several types and formats, etc. In particular, documents in
MILESS can have several derivates in different formats; no standard format
is required. Moreover, MILESS can handle hierarchical classifications that
are widely used in science to help capturing the subjects of a document
in a standardized way. This very complex and yet flexible data model for
the metadata of electronic documents enables the librarians in the project
to extend their classical library services to the digital material within the
MILESS system.
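The multi-valued character of this data model can be sketched with a few data
classes. MILESS itself is written in JAVA; the Python classes below are only
an illustration, and every class, field and value is a hypothetical stand-in
rather than the actual MILESS data model.

```python
# Sketch: a Dublin Core style document with repeatable elements.
from dataclasses import dataclass, field
from typing import List


@dataclass
class LegalEntity:
    """Creator or contributor, extended by contact information."""
    name: str
    contact: str = ""


@dataclass
class Derivate:
    """One representation of the document in a particular format."""
    format: str    # e.g. "PDF" or "PS"
    location: str  # hypothetical storage identifier


@dataclass
class Document:
    titles: List[str] = field(default_factory=list)
    creators: List[LegalEntity] = field(default_factory=list)
    contributors: List[LegalEntity] = field(default_factory=list)
    types: List[str] = field(default_factory=list)
    # Each classification is a path through a hierarchical scheme.
    classifications: List[List[str]] = field(default_factory=list)
    derivates: List[Derivate] = field(default_factory=list)

    def formats(self):
        """All formats in which this document is available."""
        return sorted({d.format for d in self.derivates})


example = Document(
    titles=["An Example Lecture", "Eine Beispielvorlesung"],
    creators=[LegalEntity("A. Author", "author@example.org")],
    types=["lecture notes"],
    classifications=[["Physics", "Mechanics"]],
    derivates=[
        Derivate("PDF", "lecture.pdf"),
        Derivate("PS", "lecture.ps"),
        Derivate("PDF", "slides.pdf"),
    ],
)
print(example.formats())
```

Modelling every element as a list is what allows several titles, creators
and derivates per document without requiring one standard format.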
Since the IBM DB2 Digital Library product cannot handle such complex
data models, additional software had to be written to enable MILESS to work
with the Dublin Core standard. We will take a closer look at the new software
in the next section. The following figure illustrates the different parts of
the IBM DB2 Digital Library product.
[Figure: components of the system. Recoverable labels: Web-Browser (HTML,
XML; Java VM: Applets, MILESS GUI for Authors); Web-Server with Java Servlet
Engine (MILESS Server Components); VideoCharger Server (streaming of
audio/video data); IBM Digital Library with Library Server (metadata: title,
author, ...), Object Server (files: PS, PDF, ...; central / distributed) and
Text Search Server (fulltext queries, text indices); ADSM Server (archiving
and backup system); IBM DB2 / Oracle.]
3 MILESS - A Closer Look
Besides the need for special software because of the complex data model, the
additional MILESS software is divided into different layers, dependent on
their functionality. To be platform independent, a design decision was made
to use JAVA as the programming language, made possible by a JAVA API for
the IBM DB2 Digital Library that can be used to connect JAVA code with
the Digital Library product. Thus the bottom layer of the system is given by
this programming interface that connects the Digital Library product with
the outside world. This API is used by a so-called Persistency Layer that is
responsible for the storing and retrieving of documents.
On top of that there is a collection of JAVA classes that implement the
functionalities for documents, legal entities (creators and contributors), etc.
Another part of the inner system is using JAVA servlets. MILESS is run-
ning inside a web server that is capable of using servlets, and these servlets
are used for the connection and communication between the user and the
system, e.g.
• A DocumentServlet is used to present the metadata of a document on a
page within the web server.
• The DerivateServlet is needed to access a certain format of a specified
document.
• The SearchServlet takes the user queries, connects to the Digital Library
product to do the search, and presents the results on a page within the
web server.
The following figure illustrates the different software layers of the MILESS
system.
[Figure: software layers of the MILESS system. Recoverable labels: GUI to
load and edit metadata and content; HTML pages to search, navigate and show
data; MILESS Communicator; Data Model Package (Java class library for the
Dublin Core data model); MILESS Persistency Layer (Java class library:
create, retrieve, update, delete and search MILESS data items).]
Besides the inner parts of the system that run on the server side, there are
other parts that run on the side of the user client. Basically, the only thing
the user needs is a web browser that connects him to the MILESS homepage
at http://miless.uni-essen.de. From here, the user has access to the full
functionality of the system; e.g. the search facilities can be reached just via
normal HTML-pages. In addition, there is the possibility for an author to
create or change documents inside the MILESS system. To do this, he (she)
can call a graphical user interface (GUI) that runs as a JAVA applet inside the
web browser. This GUI can be used to create new documents for the system,
or to change already existing documents or personal data of creators and
contributors. Moreover it helps navigating through hierarchical classifications
to find the correct subjects for the documents. The communication between
this GUI on the client side and the inner part of the system on the server
side is handled via a Servlet Communicator, the data exchange is done via
XML [3]. In the near future contributors can use XML directly to put material
into the system. Thus the application on the client side doesn't have to know
anything about the internal representations on the server side; in particular, the
IBM DB2 Digital Library product and its internal structure are totally invisible
to the outside world.
4 MILESS - From the user point of view
There are a lot of possible scenarios for a user of the MILESS system. Most
people will use it just the way they use a classical library, but with enhanced
features. One of these is the search facility. By making use of the search
techniques of the underlying IBM DB2 Digital Library product, the user
can not only search for certain authors or words and phrases within the
title or the keywords, but it is also possible to search for words and phrases
within the texts of the documents. Moreover, because of the way hierarchical
classifications are implemented inside MILESS, he (she) can navigate through
the hierarchy of such classifications, looking e.g. at all documents at a specific
level.
Because of the capability of the system to handle documents in any given
format, the user might have a problem with the actual format of a retrieved
document. To overcome this difficulty, MILESS has a plug-in collection that
could help the user and his browser to understand a strange format. Moreover,
we are collecting different converters that could be used to create new formats
from existing ones when putting new material into the system.
Creating new material is another scenario for the use of the MILESS system.
With the help of the user GUI, anybody can create new documents in
the system by providing the necessary metadata within the GUI and uploading
the data files of the document into the system. This can be used e.g. by
lecturers to put their lectures and exercises into the system, enabling students
to work with this material online whenever they want.
Another scenario sees a lecturer preparing his next talk and searching the
system for certain multimedia material he can use in the class. Such material
can be JAVA applets or simulations/animations, audio/video material, etc.
To access video material, a video server [5] is included in the system that uses
streaming techniques to deliver multimedia material in real time to multiple
users.
Yet another use of MILESS is the linking between different articles of a
mathematical journal; a first test case for this will be presented in the next
section.
5 MILESS and the "Archiv der Mathematik" - A Test Case
This final section presents a collaboration with the Institute for Experimental
Mathematics at Essen University, where six old volumes of the mathematical
journal "Archiv der Mathematik" will be retrodigitized to make them
available on the web in digital form (see [6]).
One small piece in this project is the automatic extraction of the biblio-
graphic data like title, authors, author address, journal name, volume num-
ber, etc. These data, stored in an XML-format [3], can be used to fill the
metadata fields of the Dublin Core standard automatically and to put the
retrodigitized articles into the MILESS system, with restricted and controlled
access because of copyright issues.
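The mapping from extracted bibliographic data to Dublin Core fields can be sketched as follows. This is only an illustration: the record layout, the `metadata` wrapper element and the choice of the `title`, `creator` and `source` elements are assumptions, since the XML format used inside MILESS is not described here.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(record):
    """Map a bibliographic record (hypothetical dict layout) to Dublin Core XML."""
    root = ET.Element("metadata")  # wrapper element name is an assumption

    def add(name, text):
        ET.SubElement(root, "{%s}%s" % (DC_NS, name)).text = text

    add("title", record["title"])
    for author in record["authors"]:
        add("creator", author)
    add("source", "%s %s, %s" % (record["journal"], record["volume"], record["pages"]))
    return root


# One of the two prototype articles, entered by hand instead of by OCR.
record = {
    "title": "On degenerations of tame and wild algebras",
    "authors": ["Christof Geiss"],
    "journal": "Arch. Math.",
    "volume": 64,
    "pages": "11-16",
}
xml_text = ET.tostring(to_dublin_core(record), encoding="unicode")
print(xml_text)
```

In a real pipeline the `record` dictionary would be filled by the extraction step rather than typed in.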
Another piece in this project is the automatic recognition of the cited ref-
erences at the end of any article. Using Optical Character Recognition (OCR)
and heuristics, an HTML-page is produced that contains the references and
tries to link, where possible, to an online copy of the cited article. To do
this, some standardization is needed to produce a correct link. To install a
prototype for such a referencing functionality, we have put two articles of the
"Archiv der Mathematik" into the MILESS system, namely
• William Crawley-Boevey, Tameness of biserial algebras, Arch. Math. 65,
  399-407.
• Christof Geiss, On degenerations of tame and wild algebras, Arch. Math.
  64, 11-16.
where the second one is a cited reference in the first article. After the auto-
matic recognition of the references of the first article, the link to the second
article will be produced automatically as
http://miless.uni-essen.de/iem/Archiv_der_Mathematik/64/11
where the volume number and the first page are used to uniquely identify the
cited article. Upon this request the MILESS system starts an internal search
to retrieve the referenced article. With this feature the reader can view the
first article, notice the citation, and follow the link by just clicking on
it. This can easily be extended to referenced articles published in other
journals, once these journals are available online and a unique way to reach
their articles can be derived just from the bibliographic data as in the example
above. An extension of the prototype in this direction is planned for the
near future. Such an extension can lead to a distributed library for scientific
journals, not necessarily restricted to mathematics, adding new features and
functionalities for the user, but creating some demands on the underlying
networks as well.
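A minimal sketch of the link construction, assuming that a recognized reference has already been reduced to a text line in the style of the two citations above. The function name and the regular expression are illustrative and are not taken from the MILESS code; only the URL pattern comes from the example link.

```python
import re

# URL prefix taken from the example link in the text.
BASE_URL = "http://miless.uni-essen.de/iem/Archiv_der_Mathematik"

# Matches "Arch. Math. <volume>, <first page>-<last page>".
CITATION = re.compile(r"Arch\.\s*Math\.\s*(\d+),\s*(\d+)\s*-\s*\d+")


def link_for_citation(citation):
    """Return the article link built from volume and first page, or None."""
    match = CITATION.search(citation)
    if match is None:
        return None  # other journal or unrecognized layout: no link is produced
    volume, first_page = match.groups()
    return "%s/%s/%s" % (BASE_URL, volume, first_page)


cited = "Christof Geiss, On degenerations of tame and wild algebras, Arch. Math. 64, 11-16."
print(link_for_citation(cited))
```

Returning `None` for anything that does not match keeps the generated page free of broken links, which matches the "where possible" wording above.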
6 Acknowledgement
The MILESS project is financially supported by the local state government
of North Rhine-Westphalia, Germany, and Essen University. Many people have
been involved in the design and implementation of the system, including,
but not restricted to D. Azkan, A. Bilo, E. Coelfen, B. Lix, V. Nordmeier,
B. Schlesiona, A. Sprick.
References
1. ADSTAR Distributed Storage Management,
   http://www.storage.ibm.com/software/adsm/
2. The Dublin Core Standard, http://purl.oclc.org/dc/
3. Extensible Markup Language, http://www.w3.org/TR/REC-xml
4. IBM DB2 Digital Library, http://www.software.ibm.com/is/dig-lib/
5. IBM DB2 Digital Library Video Charger,
   http://www.software.ibm.com/data/videocharger/
6. G. O. Michler, "A Prototype of a Combined Digital and Retrodigitized
   Searchable Mathematical Journal", Preprint.
7. MILESS - Multimedialer Lehr- und Lernserver Essen, http://miless.uni-essen.de
Rural Educational System Network (RESNET):
Design and Deployment
Salim Hariri and Wang Wei, Sung-Yong Park, Harvey Janelli
Center for Advanced TeleSysMatics (CAT)
University of Arizona
Tucson, AZ 85721
{hariri, wang}@ece.arizona.edu, www.ece.arizona.edu/~hpdc
Department of Computer Science
Sogang University
Seoul, Korea
sypark@ieee.org
Interactive Media Group, Inc.
14817 Sopras Circle
Addison, TX 75224
janelli_img@worldnet.att.net
ABSTRACT: The main objective of this project is to design and deploy the initial
infrastructure of the Rural Education System Network (RESNET) in eastern Texas.
We have selected the Asynchronous Transfer Mode (ATM) and single-mode fiber to
build the RESNET infrastructure. The RESNET network operated initially at a
backbone speed of OC-3c (155 Mbit/s), with the goal of upgrading to OC-12c (622
Mbit/s). The RESNET backbone connected the following sites: Tyler County
Courthouse, Woodville ISD High School, Tyler County Hospital in the city of
Woodville, Alabama & Coushatta Indian Reservation, Big Sandy ISD, Livingston
High School, the Polk County Courthouse, the Polk County Hospital, main campus of
Sam Houston State University in Huntsville, TX. The intended applications for
RESNET are classified into three types: 1) Telecommunication Services, 2)
Interactive Multimedia Services, and 3) Multimedia Services. The telecommunication
services include: Switched data, voice, and video ATM service at 25 Mbps and 155
Mbps. In addition, it is the goal of the Tribes to establish a Call Center and Network
Control Center. The interactive multimedia services include: Virtual Classroom,
Virtual Courtroom, Virtual County, Virtual Clinic, Teacher Network, Parent Network.
The multimedia services include: Video-On-Demand (MPEG I & II), Education-On-
Demand, Training-On-Demand, Multimedia Publishing, Electronic Publishing, and
Intra/Internet Broadcasting. In this paper, we give an overview of the RESNET
history, goals, design and technology adopted for RESNET, and conclude with future
RESNET activities.
1. Overview of RESNET - Historical Perspective
The Rural Education System Network or RESNET was founded in 1992. It is an
educational and public service partnership of private industry, local government,
universities, hospitals, rural independent school districts and the Alabama-
Coushatta Tribes. It was incorporated in the State of Texas on November 4,
1993. In 1994 the Tribes entered into the RESNET partnership to seek funding
for the Tribal Technology project. In 1996 RESNET was successful in securing
initial funding of $750,000 from the Houston Endowment (a private foundation).
Money was provided for the construction of an 80-mile fiber optic network with
an OC-3c (155 Mbps) Asynchronous Transfer Mode (ATM) backbone. This
network backbone connects three independent school districts (ISD) of
Livingston, Big Sandy, and Woodville; two hospitals, the Indian Health Service
(IHS) clinic and Tyler County hospital with UTMB Galveston; Sam Houston
State University (a regional university); and the Polk and Tyler county
governments. In addition, the Department of Agriculture provided $340,000 of
funding for the base line ATM electronics that connects these locations with the
Alabama-Coushatta Tribes.
It was the consortium's intent that RESNET become a low cost community
resource, operated and supported by tribal members. It provides a high-speed
interface to the Internet, and serves as an Intranet for the Tribes, hospitals, local
governments, and educational components.
RESNET has assembled a solid partnership of local business, international
technology providers, and world class academics. All of the partnerships have
been consummated with the use of a formal teaming agreement. This binding
agreement clearly defines the commitment of all the parties. The majority of the
partners have been involved for over three years and continue to get more and
more involved. Current business and technology partners include: Lucent
Technologies (4 years); Entergy Corp. (3 years); Sam Houston Electric
Cooperative (SHECO) (2.5 years); the CASE Center at Syracuse University (2
years); and the newest partner, the IBM Corporation, Network Hardware
Division. All of the current RESNET ISDs mentioned have been committed for
over four years and have played an active role in the definition of end-user
needs. Sam Houston State University, Rice University, Texas A&M University,
University of Texas, and the University of Houston are committed to providing
distance learning programs and higher education courseware for the network.
This has been a "grass-roots" effort, primarily in support of the Tribes and their
children. Unlike other programs, the Alabama-Coushatta Tribes will continue to
share resources with the surrounding community.
RESNET recently entered into a special partnership with SHECO. Both
parties have a joint sheath agreement whereby SHECO, in addition to giving
RESNET pro-bono pole contacts, matches additional fiber installation mile
for mile. For every mile of fiber RESNET installs, SHECO installs a mile and
both parties exchange fibers to form a common network. In addition, SHECO
maintains the entire fiber optic system upon completion of installation by
RESNET.
2. RESNET Goals and Strategies
The main goal of RESNET is the development of a community networking
environment to cooperatively develop applications that utilize advanced
communication technologies. RESNET will provide high-speed network
connectivity to all the RESNET sites (e.g., schools, courthouses, hospitals, and
Indian reservations) and state-of-the-art network-based applications (e.g., switched
telecommunication services, interactive multimedia services and multimedia
services).
RESNET also provides a high-speed connection to the Internet, including
NGI (Next Generation Internet), and the NII (National Information
Infrastructure). This federally mandated program was established by the High
Performance Computing Act of 1991 (a.k.a. "NREN" - the National Research and
Education Network). Its original goal was to establish a multimedia broadband
fiber-optic network, connecting over 1200 national universities and research
facilities in the U.S. and eventually overseas. The NII now has evolved into the
Internet. It provides access to a myriad of information sources. It is in effect, an
extension of thousands of electronic assets worldwide, with very high-bandwidth
requirements. Very large databases, multimedia databases with very large files,
high-bandwidth applications (e.g., the ability to log on to the Hubble telescope
and concurrently view on-going experiments) are all part of this "mega-
network".
It is RESNET's goal to see that the children in rural east Texas are not left
behind while this Information Superhighway (NII) comes to fruition. The
population base of rural America is highly economically disadvantaged and
highly populated by minorities. It is, in effect, a mirror image of the inner city,
with a more favorable environment and lower population density. The RESNET
technology goal is to install Fiber-to-the-Schools (FTTS) using single-mode fiber
backbone, with ATM transport at OC-12. There will be ATM switches at each
ISD campus and ATM network interface cards (25 Mbps) installed in each
student workstation.
The technology involved in our migration toward NII/NGI compatibility is fairly
simple. We must work with what is presently installed, wherever possible, and
upgrade to the compatible technology whenever possible. Our goal in the pilot is
to extend the fiber installed at the University of Houston and their full services,
into the three ISD campuses of the RESNET pilot. RESNET takes care of the
solutions to ensure the security and privacy of the data and network between the
two sites.
3. RESNET Design and Technology
The RESNET is a broadband fiber-optic based, private wide-area ATM network,
whose scope, by the year 2002, will include ten counties in eastern Texas area.
The first segment of this consortium-owned, not-for-profit managed network was
started around the Alabama-Coushatta Indian Reservation in Polk County,
Texas, and it was designed in a hierarchical way such that each town has a
backbone switch, several small-to-medium sized ATM switches, and customer
premise equipment. Therefore, the initial design of RESNET (Phase I), as we
can see from Figure 1, has been focused on building a high-speed ATM
backbone (OC-3) by connecting Indian Reservation to Livingston, Woodville
and Dallardsville (Big Sandy ISD), and on connecting the backbone switch to the
community facilities such as schools and hospitals in each town. The next phases
of RESNET design (Phase II, III, and IV) will be extended to cover other ISDs
located in west, north, and south regions as shown in Figure 1.
Figure 1: Overview of RESNET Design
[Diagram not reproduced. Legible labels include: backbone switches (IBM 8265)
joined by OC-3c SMF links; Huntsville, Livingston, the Alabama-Coushatta
Indian Reservation and Woodville; north, west, south and east terminations;
further ISDs such as New Waverly, Coldspring, Goodrich, Cut and Shoot,
Cleveland and Big Sandy; and the Houston GigaPoP (NGI/vBNS).]
The RESNET design is implemented mainly by using ATM products from IBM.
For example, in each RESNET site, we have installed a backbone switch (IBM
8265 or IBM 8260 Nways ATM switch), and a combination of IBM 8260 Nways
Multiprotocol Switching Hubs and IBM 8285 workgroup ATM switches to build
ATM clusters operating at 155 and 25 Mbps speed. Each ATM switch runs
PNNI version 1.0 and UNI 3.0/3.1 to connect to other ATM switches and
workstations. Each 8265 backbone switch has been equipped with a
Multiprotocol Switched Services (MSS) module that provides various routing
and switching services. The MSS is a key component of IBM's Switched Virtual
Network architecture and supports various features such as Classical IP over
ATM (CLIP) [1], LAN Emulation (LANE) [2] services and Next Hop
Resolution Protocol (NHRP) [3]. It also supports various routing (e.g., RIP,
OSPF, etc.) and bridging protocols.
For administrative purposes, each town is configured with a different IP subnet
and each subnet runs either CLIP or LANE services based on its requirements. In
this case, the MSS module installed in each backbone switch provides the
necessary functions such as the Address Resolution Protocol (ARP) server and
LANE servers (LECS, LES, and BUS). Since no cut-through routing (e.g.,
NHRP) capability had been configured at the time of this writing, the
communication between any two towns still passes through two routers, which
may become a performance bottleneck as the inter-town traffic grows. However, we
envision that this bottleneck can be resolved when
standards-based short-cut routing schemes such as Multiprotocol over ATM
(MPOA) [4] or Multiprotocol Label Switching (MPLS) [5] are implemented
within the MSS module.
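The routing consequence of the one-subnet-per-town layout can be sketched in a few lines. The addressing plan below is entirely hypothetical (the actual RESNET prefixes are not given in the text); the sketch only encodes the stated behaviour that, without cut-through routing, traffic within a town needs no router while traffic between two towns crosses two.

```python
import ipaddress

# Hypothetical addressing plan: one IP subnet per town (made-up prefixes).
TOWN_SUBNETS = {
    "Livingston": ipaddress.ip_network("10.1.0.0/16"),
    "Reservation": ipaddress.ip_network("10.2.0.0/16"),
    "Woodville": ipaddress.ip_network("10.3.0.0/16"),
    "Big Sandy": ipaddress.ip_network("10.4.0.0/16"),
}


def town_of(address):
    """Return the town whose subnet contains the given IP address."""
    ip = ipaddress.ip_address(address)
    for town, subnet in TOWN_SUBNETS.items():
        if ip in subnet:
            return town
    raise ValueError("address %s is not in any town subnet" % address)


def router_hops(src, dst):
    """Routers crossed without cut-through routing: 0 inside a town, 2 between towns."""
    return 0 if town_of(src) == town_of(dst) else 2


print(router_hops("10.1.0.5", "10.1.3.9"))  # same town
print(router_hops("10.1.0.5", "10.3.0.7"))  # Livingston to Woodville
```

A short-cut scheme such as NHRP or MPOA would aim to bring the inter-town case down to a direct ATM connection, which this toy model does not attempt to represent.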
The current RESNET backbone operates at the speed of 155 Mbps (OC-3) and
ATM is the main transport mechanism over the RESNET. However, with the
advances in high-speed interfaces (e.g., OC-12, OC-48, and OC-192) and optical
technologies (e.g., Dense Wavelength Division Multiplexing (DWDM)), the
future RESNET backbone is expected to be upgraded to support gigabit or
terabit applications.
RESNET also allows the sharing of networked resources within a school district,
town, county or region. A high-speed connection (DS-3 or OC-3) to the Houston
GigaPop is being sought by RESNET in order to access Rice University, Texas
A&M, University of Texas, and the University of Houston, most of which have a
connection to the vBNS (very-high-speed Backbone Network Service) national
ATM network. Sharing of the computational resources and the cooperative
educational programs (both K-12 and adult education) permits RESNET
participants to access resources that they could not afford on an individual
basis.
Figures 2 through 5 show the actual designs for RESNET sites, Livingston,
Alabama Coushatta, Woodville, and Big Sandy ISD, respectively.
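Figure 2: Detailed View of Livingston
[Diagram not reproduced. Legible labels include: Livingston H.S., Livingston
ISD and Polk County, attached through IBM 8260 and IBM 8265 switches over
OC-3c SMF links, with 25 Mbps desktop connections and a link to the Indian
Reservation backbone switch.]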
Figure 3: Detailed View of Alabama-Coushatta Indian Reservation
[Diagram not reproduced. Legible labels include: Pre-K, Head Start, Tribal
Enterprise and Distance Learning sites on an IBM 8285 over CAT5 at 25 Mbps;
the Reservation backbone switch (IBM 8265); and OC-3c links to the Livingston
and Woodville town switches.]
Figure 4: Detailed View of Woodville
[Diagram not reproduced. Legible labels include: Woodville H.S.,
tele-radiology, PCs attached at 155 Mbps, IBM 8260 and IBM 8285 switches, and
OC-3c SMF links.]
Figure 5: Detailed View of Big Sandy ISD
[Diagram not reproduced. Legible labels include: computer lab, classrooms in
the high school and the elementary school, CAT5 connections at 25 Mbps, an
IBM 8260 switch, and OC-3c SMF links.]
4. RESNET Demonstration
One of the design goals of RESNET is to develop and deploy network-based
multimedia applications that can fully utilize the high-speed ATM network
connectivity across RESNET sites. For this purpose, we have created three
proof-of-concept scenarios and demonstrated them at the RESNET opening
ceremony in November, 1997. The demonstrations were presented at the
Alabama Coushatta Indian Reservation and three scenarios were demonstrated:
1) Video-conferencing demonstration, 2) Video on demand demonstration, and
3) Voice over IP demonstration. These three applications were selected based on
the current needs of different RESNET sites and on the future business plans
using the RESNET infrastructure. In what follows, we briefly review the
demonstrations and some of the experience from the demonstrations.
4.1 Video-conferencing Demonstration
Video-conferencing is one of the most important applications that provide users
with advanced video collaboration solutions both for education and business
applications. In this demonstration, First Virtual's ATM-based video-conferencing
solutions [6] have been installed both at the Alabama-Coushatta Indian
Reservation and at Big Sandy ISD. The First Virtual video-conferencing
solution consists of a PC equipped with a plug-and-play 25 Mbps ATM NIC card
from First Virtual (VC-NIC) and MVIP-capable video-conferencing
equipment from PictureTel. Unlike other ATM NIC cards, the VC-NIC card is
specially designed to support video-networking applications and includes an
industry standard MVIP interface on the board. This onboard MVIP interface
provides direct connection to MVIP-capable multi-vendor video-conferencing
equipment and allows the video traffic to bypass the system bus. The VC-NIC
card is fully compliant with the standard UNI 3.x signaling protocol and the LANE 1.0
protocol. The video data is transmitted at 384 Kbps. Although 128
Kbps video-conferencing based on a single ISDN line meets the needs of
face-to-face conferencing, the ATM-based 384 Kbps conferencing provides video
of high enough quality to support most business applications and distance learning
applications.
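To put the quoted rates in proportion, the following back-of-the-envelope sketch estimates how many 384 Kbps video streams fit on a 25 Mbps desktop link and on a 155 Mbps backbone link. It assumes nominal line rates and accounts only for the ATM cell header (48 payload bytes per 53-byte cell); adaptation-layer, signalling and SONET framing overheads are ignored, so the results are upper bounds rather than measured capacities.

```python
ATM_CELL_BYTES = 53      # size of one ATM cell
ATM_PAYLOAD_BYTES = 48   # payload bytes carried per cell


def max_streams(line_rate_bps, stream_rate_bps):
    """Upper bound on constant-rate streams after the ATM cell header overhead."""
    return (line_rate_bps * ATM_PAYLOAD_BYTES) // (ATM_CELL_BYTES * stream_rate_bps)


VIDEO_BPS = 384_000  # video-conferencing rate quoted in the text

print(max_streams(25_000_000, VIDEO_BPS))   # 25 Mbps desktop link
print(max_streams(155_000_000, VIDEO_BPS))  # 155 Mbps backbone link
```

Integer arithmetic is used on purpose so that the bound is rounded down rather than up.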
4.2 Video on Demand Demonstration
In the business and educational applications, it is common to record video
presentations, classes, and movies, and store them in a centralized storage system
in order to eliminate the need for the replication and distribution of tape.
Retrieving this information from the remote PCs on demand is an important
application since most of the ISDs in RESNET can share a wealth of course
material and educational movies.
In order to implement this scenario, a PC server (IBM 330) with a 155 Mbps
ATM interface card (LANE interface) has been installed and connected to the
backbone ATM switch at Alabama-Coushatta Indian Reservation. This PC
server runs the Windows NT Server 4.0 operating system and is equipped with a
RAID-5 disk array to provide a high level of fault tolerance and better
performance. Several DVD movies were downloaded into the disk array so that
client PCs in remote sites can access the movies simultaneously over the
network. On the other hand, we have installed IBM Turboways ATM 25 Mbps
NICs into several client PCs located at different RESNET sites (e.g., Livingston,
Woodville, Big Sandy) and connected them to their local IBM 8285 workgroup
ATM switches. The client PCs were also equipped with an MPEG-2 decoder board
and a software DVD player to play the DVD movies stored in the PC server. The
public domain Network File System (NFS) software (NFS server and NFS
client) was used to implement the communication between the client and server.
For example, the PC server exports its file system and the client PCs mount the
remote file system into the network drive. Once the mount operations are
properly executed, each client PC opens the network drive and plays the DVD
movies stored in the PC server over the ATM network.
4.3 Voice over IP Demonstration
One of the main advantages of using ATM is that any type of traffic (voice, video,
data) can be mixed and transferred over the same network infrastructure. With
the explosive growth of the Internet and the increasing interest in building the Next
Generation Network (NGN), a future communication infrastructure that
integrates voice, data, and video traffic into a single common packet network,
Voice over IP has been gaining increasing popularity among researchers and
creating a lot of opportunities for business and educational applications.
In our demonstration, we have installed two Tempest Data Voice Gateways
(DVG) from Franklin Telecom [7] at the Alabama-Coushatta Indian Reservation and
Woodville High School. The two places are 15 miles apart and belong to
different LATAs. The Tempest DVG is a self-contained, PC-based standalone box
with three interface cards (DSP board, telephone interface board, LAN interface
board). This box runs the Linux operating system and contains system software from
Franklin Telecom. One of the problems we met was that the data interface
provided by the Tempest DVG was Ethernet only, so a direct connection to
the RESNET ATM backbone was not an option. In order to solve this problem,
two local Ethernet subnets were created, one at the Alabama-Coushatta Indian
Reservation and one at Woodville High School. In each subnet, we have also
installed a Windows NT-based router (a PC with dual data interfaces, Ethernet
and ATM) so that the Ethernet traffic generated from the Tempest DVG is routed
and transmitted across the RESNET ATM backbone to Woodville High School.
The router at Woodville High School in turn routes the data to the local
Ethernet subnet. Although each voice packet has to pass through two routers, the
quality of the voice was quite impressive. As we add more simultaneous
voice sessions, the quality of voice might be degraded due to the nature of
Ethernet and the two intermediate routers. Installing a T1 board from Franklin
Telecom and connecting it directly to the backbone switch (the IBM 8265/8260 has
an interface module for T1/E1) is another option to improve the throughput and
guarantee the quality of simultaneous voice sessions. Also, in a real
environment, we can create an ATM PVC between two Ethernet subnets and
bridge (or tunnel) the voice traffic to improve the performance.
5. Summary and Concluding Remarks
In this paper, we presented the design and deployment of the Rural Educational
System Network (RESNET) in eastern Texas. We reviewed how this project
was started and funded, and the steps involved in implementing the RESNET backbone
network. We also reviewed in further detail the technology adopted to design
each RESNET site. We are currently working with Texas A&M University to
take the responsibility of managing all the RESNET services. In addition, we are
aggressively pursuing initiatives to provide high-speed connectivity to
the national high-speed backbone (vBNS). Once this connection is established,
we will work with the researchers at the Center for Advanced TeleSysMatics
(CAT) at the University of Arizona and Texas A&M to establish an Adaptive
Distributed Virtual Computing Environment (ADVICE) [8] on RESNET.
References
[1] M. Laubach, "Classical IP and ARP over ATM", RFC 1577, January 1994.
[2] ATM Forum, "LAN Emulation over ATM Specification - ver 1.0", February
1994.
[3] D. Katz, D. Piscitello and B. Cole, "NBMA Next Hop Resolution Protocol",
Internet Draft, December 1995.
[4] ATM Forum, "Multiprotocol over ATM - ver 1.0", July 1997.
[5] R. Callon, P. Doolan, N. Feldman, A. Fredette, G. Swallow and A.
Viswanathan, "A Framework for Multiprotocol Label Switching", Internet
Draft, November 1997.
[6] http://www.fvc.com
[7] http://www.ftel.com
[8] Salim Hariri et al., "The design and evaluation of a virtual distributed
computing environment", Cluster Computing, Vol. 1, May 1998, pp. 81-93.
Some Performance Studies in Exact Linear
Algebra
George Havas 1 and Clemens Wagner 2
1 Centre for Discrete Mathematics and Computing
Department of Computer Science and Electrical Engineering
The University of Queensland, Queensland 4072, Australia
havas@csee.uq.edu.au
http://www.it.uq.edu.au/~havas/
2 Fachgruppe Praktische Informatik
Fachbereich Elektrotechnik und Informatik
Universität-GHS Siegen
D-57078 Siegen, Germany
wagner@informatik.uni-siegen.de
http://pi.informatik.uni-siegen.de/clemens/
Abstract. We consider parallel algorithms for computing the Hermite normal form
of matrices over Euclidean rings. We use standard types of reduction methods
which are the basis of many algorithms for determining canonical forms of matrices
over various computational domains. Our implementations take advantage of
well-performing sequential code and give very good performance.
1 Introduction
Algorithms for exact linear algebra have been much studied. Many different
strategies for calculation of canonical forms of matrices have been proposed.
A comprehensive bibliography and a number of earlier methods for integer
matrices are examined in [8], including references to various polynomial time
algorithms. Some parallel and some more recent methods are described in
[16,9,6,22,20,11,12,21]. Reduction methods for general Euclidean rings are
studied in detail in [23].
We concentrate on algorithms which use reduction as their underlying
principle. In spite of the fact that the worst case performance of reduction
methods can be exponentially bad (see [4] and [13]), such techniques provide
the basis for many sequential implementations. We do not consider modular
methods in this paper. Rather we study other recent implementations which
focus on finding well-performing algorithms and good heuristics for reduction
methods. We start by outlining the mathematical background. We then show
how to extend sequential algorithms to a parallel environment. We finish by
presenting some sample performance figures and outlining recommendations
for choosing appropriate parallel algorithms.
2 Mathematical background
A commutative ring $R$ with identity 1 is Euclidean if there is a value function
$\varphi : R^* \to \mathbb{N}_0$ (where $R^* = R \setminus \{0\}$ and $\mathbb{N}_0$ is the set of nonnegative integers)
such that the following properties hold for $a \in R$ and $b \in R^*$.
1. For $a \neq 0$, $\varphi(ab) \geq \varphi(a)$.
2. There exist $q, r \in R$ with $a = qb + r$, such that either $r = 0$ or $\varphi(b) > \varphi(r)$.
Paradigm examples of Euclidean rings are $\mathbb{Z}$ (the ring of integers, with
absolute value as value function) and $F[x]$ (the ring of univariate polynomials
with coefficients in a field $F$, with degree as value function).
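For the integers, both properties can be checked directly. The sketch below is an illustration only and is not taken from the implementations studied in this paper; it uses Python's built-in `divmod`, whose remainder has the sign of the divisor and satisfies $|r| < |b|$.

```python
def value(a):
    """Value function on the nonzero integers: the absolute value."""
    if a == 0:
        raise ValueError("the value function is only defined on nonzero elements")
    return abs(a)


def divide(a, b):
    """Return (q, r) with a == q*b + r and either r == 0 or value(r) < value(b)."""
    if b == 0:
        raise ZeroDivisionError("b must be nonzero")
    return divmod(a, b)


# Exhaustive check of the two defining properties on a small range.
for a in range(-20, 21):
    for b in range(-20, 21):
        if b == 0:
            continue
        q, r = divide(a, b)
        assert a == q * b + r
        assert r == 0 or value(b) > value(r)   # property 2
        if a != 0:
            assert value(a * b) >= value(a)    # property 1
```

The pair $(q, r)$ is not unique in general; for example $7 = 2 \cdot 3 + 1 = 3 \cdot 3 - 2$, which is why a fixed choice of remainder is needed later.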
An element $a \in R$ is a unit if it has an inverse $a^{-1} \in R$ such that
$aa^{-1} = a^{-1}a = 1$. The set $U(R)$ of all units of $R$ is a multiplicative group.
Elements $a, b \in R$ are associates if there exists a unit $c \in R$ such that $a = bc$
and we write $a \sim b$. Association is an equivalence relation with equivalence
classes $[a] := \{b \in R \mid a \sim b\}$. A subset $\mathcal{R} \subseteq R$ is a representative set for $R$
if: $\{[a] \mid a \in \mathcal{R}\} = R/{\sim}$; and for all distinct $a, b \in \mathcal{R}$, $a \not\sim b$.
For a Euclidean ring $R$ a function $\rho : R \times R^* \to R$ is called a residue
class system if for all $a, a' \in R$ and $b \in R^*$
$\rho(a,b) \in \{a - qb \mid q \in R\}$,
$\varphi(\rho(a,b)) < \varphi(b)$, and
$\rho(a,b) = \rho(a',b) \Leftrightarrow \exists t \in R : a' = a + tb$.
Let $M \subseteq R$ with $M \ne \{0\}$ be a finite, nonempty subset of $R$. The greatest
common divisor of $M$ ($\gcd(M)$) is the equivalence class $[g]$ such that: $\forall a \in [g]$,
$a \mid M$; and $\forall b \in R$ with $b \mid M$, $b \mid [g]$. If we have a representative set $\mathcal{R}$ for
$R$ and $d \in \gcd(M) \cap \mathcal{R}$ then $d$ is uniquely determined. Further background
material on Euclidean rings is given in [18,7,5].
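As a small illustration of these definitions, the following sketch specializes them to the integers. It assumes the nonnegative integers as representative set and the usual nonnegative remainder as residue class system; the function names `rho`, `gcd2` and `gcd_set` are ours and do not come from the paper.

```python
from functools import reduce


def rho(a: int, b: int) -> int:
    """Residue class system for Z: the remainder of a modulo b in 0..|b|-1."""
    if b == 0:
        raise ZeroDivisionError("b must be a nonzero ring element")
    return a % abs(b)


def gcd2(a: int, b: int) -> int:
    """Euclidean algorithm; returns the nonnegative representative of gcd(a, b)."""
    while b != 0:
        a, b = b, rho(a, b)
    return abs(a)


def gcd_set(M) -> int:
    """gcd of a finite collection M of integers, as its nonnegative representative."""
    return reduce(gcd2, M, 0)
```

For example, `gcd_set([12, -18, 30])` picks the representative 6 from the class {6, -6}, and `rho(a, b)` is unchanged when a multiple of `b` is added to `a`.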
Matrices $A$ and $B$ with entries in a Euclidean ring $R$ are column equivalent
if there exists a unimodular matrix $V$ such that $A = BV$. Matrix $V$
corresponds to a sequence of elementary column operations: multiplying a
column by a unit of $R$; adding any multiple by a ring element of one column
to another; or interchanging two columns.
For any matrix $B$ over a Euclidean ring $R$ with representative system $\mathcal{R}$
there exists a unique lower triangular matrix $H$ which is column equivalent
to $B$ and which satisfies the following conditions. Let $r$ be the rank of $B$.
1. The first $r$ columns of $H$ are nonzero and the remaining columns are zero.
2. For $1 \le j \le r$ let $H_{i_j,j}$ be the first nonzero entry in column $j$. Then
$i_1 < i_2 < \ldots < i_r$.
3. $H_{i_j,j} \in \mathcal{R}$, $1 \le j \le r$.
4. For $1 \le k < j \le r$, $H_{i_j,k} = \rho(H_{i_j,k}, H_{i_j,j})$.
This matrix is called the column Hermite normal form (HNF) of the
given matrix $B$ and has many important applications. As already mentioned,
there are many algorithms based on reduction methods for computing the
HNF. Descriptions of such methods for canonical form computation in
Euclidean rings (sometimes specialized to the integers) in the literature include
[18,7,5,19].
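The four HNF conditions can be made concrete with a minimal sequential sketch for integer matrices. It is a plain reduction method written for illustration, assuming nonnegative representatives and Python's floor division as DIV. It is not one of the algorithms of the cited papers, and it makes no attempt to control the growth of entries.

```python
def column_hnf(A):
    """Column Hermite normal form of an integer matrix given as a list of rows.

    Returns (H, pivots), where pivots holds the 0-based row indices i_1 < i_2 < ...
    of the first nonzero entry in each nonzero column of H.
    """
    H = [row[:] for row in A]
    m = len(H)
    n = len(H[0]) if m else 0

    def add_col(dst, src, q):
        # column dst <- column dst - q * column src
        for row in H:
            row[dst] -= q * row[src]

    def swap_col(a, b):
        for row in H:
            row[a], row[b] = row[b], row[a]

    c = 0          # next pivot column
    pivots = []
    for i in range(m):
        if c == n:
            break
        # Euclidean reduction of row i over columns c..n-1
        while True:
            nz = [j for j in range(c, n) if H[i][j] != 0]
            if not nz:
                break
            p = min(nz, key=lambda j: abs(H[i][j]))
            swap_col(c, p)
            if len(nz) == 1:
                break
            for j in range(c + 1, n):
                if H[i][j] != 0:
                    add_col(j, c, H[i][j] // H[i][c])
        if H[i][c] == 0:
            continue              # row i depends on earlier rows
        if H[i][c] < 0:           # condition 3: pivot in the representative set
            for row in H:
                row[c] = -row[c]
        for k in range(c):        # condition 4: reduce entries left of the pivot
            add_col(k, c, H[i][k] // H[i][c])
        pivots.append(i)
        c += 1
    return H, pivots
```

On the input `[[2, 4], [3, 5]]` this sketch returns `[[2, 0], [0, 1]]`, which is lower triangular with positive pivots and reduced off-diagonal entries.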
3 Sequential algorithms
Deterministic polynomial-time HNF algorithms (non-modular) include those
of Kannan and Bachem [14], Chou and Collins [3], and Havas, Majewski and
Matthews [12] for the integers; of Kannan [15] for $\mathbb{Q}[x]$; and of Wagner [23]
for $\mathbb{F}_q[x]$. Heuristic algorithms (often faster and/or "better") include those of
Havas and Majewski [9] for the integers and of Wagner [23] for $\mathbb{F}_q[x]$. All in
all, even the sequential algorithm situation is a quite complicated story, which
is addressed in much more detail in [23]. We do not go into this further here,
but rather build parallel algorithms based upon effective sequential ones.
4 Parallel implementations
The problem we consider is: Given $A \in \mathrm{Mat}_{m \times n}(R)$ compute in parallel $H$,
the HNF of $A$, together with a unimodular matrix $V$ such that $H = AV$. To
unify the computation we actually compute the HNF of a working matrix
$$W = \begin{pmatrix} A \\ I_n \end{pmatrix}.$$
Let $K$ be the Hermite normal form of $W$. The first $m$ rows of $K$
are the Hermite normal form of $A$ and the last $n$ rows of $K$ give a unimodular
transformation matrix $V$.
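The reason the working matrix records the multiplier can be checked directly: any elementary column operation applied to the stacked matrix acts on both blocks at once, so the top block always equals $A$ times the bottom block. The sketch below uses integer entries and helper names of our own choosing, and applies an arbitrary sequence of operations rather than an actual HNF computation.

```python
def matmul(X, Y):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def working_matrix(A):
    """Stack A on top of the n x n identity matrix."""
    n = len(A[0])
    identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    return [row[:] for row in A] + identity


def add_multiple(W, dst, src, q):
    for row in W:
        row[dst] += q * row[src]


def swap_columns(W, a, b):
    for row in W:
        row[a], row[b] = row[b], row[a]


def negate_column(W, c):
    for row in W:
        row[c] = -row[c]


A = [[2, 4, 6], [3, 5, 7]]
m = len(A)
W = working_matrix(A)
add_multiple(W, 1, 0, -2)   # column 1 <- column 1 - 2 * column 0
swap_columns(W, 0, 2)
negate_column(W, 1)
top, V = W[:m], W[m:]
```

After these three operations `top` equals `matmul(A, V)`, and the same invariant holds for the column operations performed during an HNF computation.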
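A parallel computer $P := \{\pi_0, \ldots, \pi_{N-1}\}$ consists of $N$ processors with
distinct memory and a communication network. Let
$$\delta(l,\tau) := \begin{cases} \lfloor l/N \rfloor + 1 & \text{if } \tau < l \bmod N, \\ \lfloor l/N \rfloor & \text{otherwise} \end{cases}$$
for each $0 \le \tau \le N-1$.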
We use the matrix distribution model from [24]. Each processor $\pi_\tau$ on
the parallel computer stores part of the working matrix $W$ corresponding
to a $(\delta(m,\tau) + 1) \times n$ matrix $A^{(\tau)}$ and a $\delta(n,\tau) \times n$ matrix $V^{(\tau)}$ which are
submatrices of the working versions of $A$ and the multiplier $V \in GL_n(F[x])$,
respectively. An extra row, row $\delta(m,\tau) + 1$ of $A^{(\tau)}$, is used to control the
computations. We call this the computation row.
The matrix is distributed rowwise to processors, but in stripes not in
blocks. For each operational row we also do a broadcast. Thus, we distribute
the input matrix $A$ by storing the $i$th row of $A$ in row $i'$ of matrix $W^{(\tau)}$ on
processor $\pi_\tau$ where $\tau := (i - 1) \bmod N$ and $i' := \lfloor (i-1)/N \rfloor + 1$ for $1 \le i \le m$.
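The striped distribution amounts to a simple index mapping, sketched below with 1-based row indices as in the text. The function names are ours, and `rows_on_processor` is our reading of the per-processor row count implied by the striping rule, not a formula quoted from [24].

```python
def owner_and_local_row(i: int, N: int):
    """Map the 1-based global row i to (processor index, 1-based local row)."""
    return (i - 1) % N, (i - 1) // N + 1


def rows_on_processor(l: int, tau: int, N: int) -> int:
    """Number of the global rows 1..l that the striping assigns to processor tau."""
    return l // N + (1 if tau < l % N else 0)
```

With `N = 4` processors and 10 rows, processors 0 and 1 hold three rows each and processors 2 and 3 hold two, so consecutive rows always land on different processors.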
The parallel, hybrid HNF algorithm PARALLEL-HNF is given by pseudocode
in Figure 1. It uses the standard DIV operator for the appropriate Euclidean
ring and calls two subprocedures: COMPUTE-GCD and PARALLEL-ROD.
PARALLEL-HNF(W^(τ), α, β)
  input  τ:      processor index
         W^(τ):  π_τ-part of a full column-rank, distributed matrix W
         α, β:   non-negative integers
  k ← 1
  (i_1, ..., i_n) ← 0
  while k ≤ n do
    r ← 0
    i ← 1
    s ← min{k + α − 1, n}
    f ← k
    while r < s do
      f ← max{r + 2, f}
      μ ← (i − 1) mod N
      if μ = τ then
        j ← ⌊(i − 1)/N⌋ + 1
        broadcast W^(τ)_{j,r+1}, W^(τ)_{j,f}, ..., W^(τ)_{j,s} to all other processors
      else
        j ← δ(m, τ) + 1        ▷ Index of computation row
        receive W^(τ)_{j,r+1}, W^(τ)_{j,f}, ..., W^(τ)_{j,s} from π_μ
      if i = i_{r+1} then
        for l ← f to s do
          q ← DIV(W^(τ)_{j,l}, W^(τ)_{j,r+1})
          W^(τ)_{·,l} ← W^(τ)_{·,l} − q W^(τ)_{·,r+1}
      W^(τ) ← COMPUTE-GCD(W^(τ), j, r + 1, f, s)
      if W^(τ)_{j,r+1} ≠ 0 then
        (i_1, ..., i_n) ← (i_1, ..., i_r, i, i_{r+1}, ..., i_{n−1})
        r ← r + 1
    W^(τ) ← PARALLEL-ROD(W^(τ), r, β, i_1, ..., i_r)
    k ← k + α
  return W^(τ)
Fig. 1. Parallel hybrid Hermite normal form algorithm
A call COMPUTE-GCD($B, i, j_0, j_1, j_2$) takes a matrix $B$ with $n$ columns,
an integer $i \ge 1$ and $1 \le j_0 < j_1 \le j_2 \le n$ as input, where $i$ is a valid row
index of $B$. It produces a right equivalent matrix $B'$ as output where
$B'_{i,j_0} = \gcd(B_{i,j_0}, B_{i,j_1}, \ldots, B_{i,j_2})$ and $B'_{i,j} = 0$
for $j_1 \le j \le j_2$. This algorithm computes $B'$ from $B$ by unimodular column
operations (swapping of two columns, multiplying a column with a unit, and
adding a multiple of one column to another column). There are various different
methods to obtain $B'$ from $B$ (e.g. [1,2,10,23]). By using gcd algorithms
whose execution depends only on the entries in the operational row we need
no additional communication for this purpose.
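A minimal sketch of such a row-gcd step for integer matrices is given below. It uses a simple "smallest entry as pivot" Euclidean strategy chosen for brevity, not the Sorting-GCD algorithm or any other method from the cited references. The function name and the 1-based index convention follow the description above, and nonnegative gcd representatives are an assumption. Every decision depends only on row $i$, which is the property that removes the need for extra communication.

```python
def compute_gcd(B, i, j0, j1, j2):
    """Return a column-equivalent copy of B whose row i carries the gcd of the
    entries in columns j0, j1..j2 at column j0 and zeros in columns j1..j2.
    All indices are 1-based; B is a list of integer rows."""
    B = [row[:] for row in B]
    r = i - 1
    cols = [j0 - 1] + list(range(j1 - 1, j2))
    p = cols[0]
    while True:
        nz = [c for c in cols if B[r][c] != 0]
        if not nz:
            break
        s = min(nz, key=lambda c: abs(B[r][c]))
        if s != p:                       # bring the smallest entry to column j0
            for row in B:
                row[p], row[s] = row[s], row[p]
        if len(nz) == 1:
            break
        for c in cols[1:]:               # reduce the other entries modulo the pivot
            q = B[r][c] // B[r][p]
            if q:
                for row in B:
                    row[c] -= q * row[p]
    if B[r][p] < 0:                      # choose the nonnegative representative
        for row in B:
            row[p] = -row[p]
    return B
```

For instance, `compute_gcd([[6, 10, 15], [1, 0, 0]], 1, 1, 2, 3)` leaves `[1, 0, 0]` in the first row, since gcd(6, 10, 15) = 1.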
The function PARALLEL-ROD reduces the off-diagonal entries. It is a
hybrid algorithm which is controlled by parameter $\beta \in \mathbb{N}_0$. For $\beta + 1$ greater
than or equal to the rank of the input matrix it is a parallel variant of the
standard reduction algorithm. For $\beta = 1$ this algorithm is a parallel version of
Chou-Collins' reduction method. If we choose $\beta$ equal to zero the algorithm
does not change the input matrix. A more detailed description of the
PARALLEL-ROD algorithm can be found in [24].
Theorem 1. Let $A \in \mathrm{Mat}_{m \times n}(R)$ with rank $r$ be in echelon form. Let $1 \le \beta \le r$
and $q := \lfloor r/\beta \rfloor$. Then the PARALLEL-ROD algorithm uses $\frac{q(q+1)}{2}\beta + r$
broadcasts to transfer at most $\frac{r^2 + r + q^2\beta + q\beta}{2}$ ring elements.
A proof is given in [24].
The PARALLEL-HNF algorithm divides the input matrix into vertical
blocks of width $\alpha$. For $k := \alpha, 2\alpha, \ldots, (p-1)\alpha, n$ (where $p := \lceil n/\alpha \rceil$) the HNF
of the leading $k$-column submatrix is computed. This is shown in Figure 2.
Fig. 2. Computing the HNF of the leading $k$ columns, for $k = \alpha, 2\alpha, \ldots$
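Theorem 2. Let $A \in \mathrm{Mat}_{m \times n}(R)$ with rank $r$ and $1 \le \alpha, \beta \le n$. Then
the PARALLEL-HNF algorithm uses $O\left(\frac{n^3}{\alpha\beta} + \frac{nm}{\alpha}\right)$ broadcasts with distributed
$W := \begin{pmatrix} A \\ I_n \end{pmatrix}$, $\alpha$ and $\beta$ as input. It transfers $O(n\alpha + p^2\alpha^2 + p r^2)$ ring elements, where $p := \lceil n/\alpha \rceil$.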
For R = ~x] the procedure uses O( ( m + n - r)n5/324~iiAi]2 ) field opera-
tions. At most 0(n4~2: IIAll ) field elements are trans/erred via broadcasts.
Proof. Transforming a $(m + n) \times s$ principal submatrix of $W$ with $s \in
\{\alpha, 2\alpha, \ldots, (p-1)\alpha, n\}$ into echelon form requires at most $m + n - r$
broadcasts. Transforming the echelon form into HNF requires (by Theorem 1)
at most $\frac{q(q+1)}{2}\beta + s$ broadcasts with $q := \lfloor s/\beta \rfloor$. In total this leads to the broadcast
bound
$$\sum_{i=1}^{p}(m+n-r) + \sum_{i=1}^{p}\left(\frac{q_i(q_i+1)}{2}\beta + i\alpha\right) \le p(m+n-r) + \frac{\alpha p(p+1)}{2} + \frac{\alpha^2(2p^3+3p^2+p)}{12\beta} + \frac{\alpha p(p+1)}{4},$$
where $q_i := \lfloor i\alpha/\beta \rfloor \le i\alpha/\beta$.
To compute the echelon form we do not need to broadcast the linearly
dependent rows. Thus, the number of broadcast ring elements can be majorized
by
$$\sum_{i=1}^{s}(1 + (\alpha - 1)) = s\alpha$$
for a $(m + n) \times s$ principal submatrix of $W$. Transforming the echelon form
to Hermite normal form requires broadcasting at most $O(r^2)$ ring elements.
Computing the Hermite normal form of all $(m + n) \times s$ principal submatrices
of $W$ with $s \in \{\alpha, 2\alpha, \ldots, (p-1)\alpha, n\}$ we need
$$O\Big(n\alpha + \sum_{i=1}^{p-1} i\alpha^2\Big) + p\,O(r^2) \subseteq O\Big(n\alpha + \frac{p(p-1)}{2}\alpha^2\Big) + O(pr^2)$$
ring elements to be broadcast.
The proofs of the other two estimates are quite lengthy. They can be
found in [23].
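The summation step in the broadcast bound can be checked numerically. The sketch below assumes that the $i$th block contributes $\frac{q_i(q_i+1)}{2}\beta$ broadcasts with $q_i$ bounded by $i\alpha/\beta$, which is our reading of the proof. It compares the term-by-term sum with the closed form $\frac{\alpha^2(2p^3+3p^2+p)}{12\beta} + \frac{\alpha p(p+1)}{4}$ using exact rational arithmetic.

```python
from fractions import Fraction


def summed_bound(p: int, alpha: int, beta: int, use_floor: bool = False) -> Fraction:
    """Sum over blocks i = 1..p of q_i (q_i + 1) / 2 * beta."""
    total = Fraction(0)
    for i in range(1, p + 1):
        q = Fraction((i * alpha) // beta) if use_floor else Fraction(i * alpha, beta)
        total += q * (q + 1) / 2 * beta
    return total


def closed_form(p: int, alpha: int, beta: int) -> Fraction:
    """Closed form of the sum when q_i is replaced by its upper bound i*alpha/beta."""
    return (Fraction(alpha**2 * (2 * p**3 + 3 * p**2 + p), 12 * beta)
            + Fraction(alpha * p * (p + 1), 4))
```

With the floor taken, the sum can only be smaller, so the closed form remains an upper bound.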
5 Performance examples
We have implemented this and related algorithms in C/C++ on the IBM
SP2 at the GMD in Sankt Augustin. We have used the xlC compiler and
the message passing library MPL (both IBM products). We have used the
Sorting-GCD algorithm, due to Majewski-Havas [17], for the implementation
of the COMPUTE-GCD function, where we used a heap for determining the
polynomial with the largest degree or the integer with the largest absolute
value, respectively, in a subvector.
We have done many practical studies with these algorithms. In this paper
we give some details of the behavior of PARALLEL-HNF for some random
matrices over $\mathbb{F}_2[x]$ and $\mathbb{Z}$. Thus we used an input matrix over $\mathbb{F}_2[x]$ which
is a random 80 x 80 matrix, where the degree of each entry is less than or
equal to 80. The rank of this matrix is $r = 80$. The input matrix over $\mathbb{Z}$ is
a random 100 x 100 matrix. The absolute value of each entry is less than or
equal to 64.
Table 1 and Figures 3 and 4 show the results of experiments in which we
varied $\alpha$ and $\beta$. We used 16 nodes of the SP2. The first row of each
measurement is the total running time (minutes:seconds.hundredths). The second row
gives the maximum degree ($\mathbb{F}_2[x]$), or the number of bits in the largest
absolute value ($\mathbb{Z}$) which arose during the computation. In Figure 3 the x-axis
shows $\alpha$ while the y-axes show run times and maximum degrees. In Figure 4
the x-axis shows $\alpha$ while the y-axes show run times and maximum number
of bits.
Matrix over F2[x] (first row: running time, second row: maximum degree)
  α     β = 1       β = α       β = n+1-α
  1     06:53.08    06:57.10    06:52.68
        12708       12708       12717
  3     05:34.19    07:14.12    10:28.21
        12708       15073       18646
  5     05:45.01    06:41.58    10:51.69
        13701       15016       23377
  8     04:43.10    05:06.79    10:02.18
        12232       19614       26823
  16    04:11.13    04:40.08    09:05.69
        12384       27387       33377
  20    04:28.23    05:13.12    08:40.16
        11270       29951       35593
  40    03:49.84    06:12.17    06:02.98
        9982        37468       37883
  60    03:32.68    08:11.38    05:00.42
        11270       35593       30719
  80    01:33.32    07:01.55    01:32.47
        9955        43486       9955

Matrix over Z (first row: running time, second row: maximum number of bits)
  α     β = 1       β = α       β = n+1-α
  1     00:20.35    00:20.28    00:18.92
        1401        1401        1412
  3     00:23.25    00:22.91    00:21.99
        1698        1698        2138
  5     00:19.68    00:19.59    00:22.39
        1598        1744        2639
  8     00:20.57    00:20.35    00:21.78
        1641        2374        3146
  20    00:21.01    00:22.65    00:27.84
        1263        5851        6506
  40    00:31.24    00:40.86    00:46.43
        1464        13977       14266
  60    00:43.38    01:14.43    01:11.70
        2043        20222       20222
  80    01:37.72    03:09.35    01:49.23
        4544        64587       51291
  100   11:36.60    11:36.20    11:38.67
        20987       20987       20987
Table 1. Effect of varying α and β
Fig. 3. Effect of varying α and β for a matrix over F2[x]: (a) running times, (b)
maximum degree
6 Concluding Remarks
The reuse of sequential code for parallel implementations is well supported by
the operation row concept. We have used earlier sequential GCD algorithms
for our parallel implementations. This leads to parallel implementations for
Hermite normal form computation with good speed-up. In the integer case
the hybrid algorithm gives best results for small $\alpha$ and $\beta = n$. For $R = \mathbb{F}_2[x]$
we get the best performance for $\alpha = n$ and small $\beta$.
For integer matrices with large rank, the new procedures are fastest. For
distributed computation it is good to use the parallel, hybrid procedure with
small $\alpha$ ($< 10$) and $\beta := n$. These procedures produce very good
transformation matrices (i.e., entries with small absolute value) if the rank of the
matrix is less than the number of columns. The transformation matrices are
almost as good as ones obtained using LLL lattice basis reduction methods,
which are orders of magnitude slower. For matrices over $\mathbb{F}_2[x]$ and for integer
matrices with small rank, "Gaussian elimination" type procedures ($\alpha := n$)
are fastest. For distributed computation use small $\beta$ ($\in \{1, 2\}$).
Fig. 4. Effect of varying α and β for an integer matrix: (a) running times, (b)
maximum number of bits needed
Acknowledgements
The first author was partially supported by the Australian Research Council.
References
1. W.A. Blankinship. A new version of the Euclidean algorithm. Amer. Math.
Monthly 70 (1963) 742-745.
2. G.H. Bradley. Algorithm and bound for the greatest common divisor of n
integers. Comm. ACM 13 (1970) 433-436.
3. T-W.J. Chou and G.E. Collins. Algorithms for the solution of systems of linear
Diophantine equations. SIAM J. Comput. 11 (1982) 687-708.
4. X.G. Fang and G. Havas. On the worst-case complexity of integer gaussian
elimination. ISSAC'97 (Proc. 1997 Internat. Sympos. Symbolic Algebraic
Comput.), ACM Press (1997) 28-31.
5. K.O. Geddes, S.R. Czapor and G. Labahn. Algorithms for Computer Algebra.
Kluwer Academic Publishers, 1992.
6. M. Giesbrecht. Fast computation of the Smith normal form of an integer matrix.
ISSAC'95 (Proc. 1995 Internat. Sympos. Symbolic Algebraic Comput.), ACM
Press (1995) 110-118.
7. B. Hartley and T.O. Hawkes. Rings, Modules and Linear Algebra. Chapman
and Hall, 1976.
8. G. Havas, D.F. Holt and S. Rees. Recognizing badly presented Z-modules.
Linear Algebra Appl. 192 (1993) 137-163.
9. G. Havas and B.S. Majewski. Hermite normal form computation for integer
matrices. Congressus Numerantium 105 (1994) 87-96.
10. G. Havas and B.S. Majewski. Extended gcd calculation. Congressus
Numerantium 111 (1995) 104-114.
11. G. Havas and B.S. Majewski. Integer matrix diagonalization. J. Symbolic
Computation 24 (1997) 399-408.
12. G. Havas, B.S. Majewski and K.R. Matthews. Extended gcd and Hermite
normal form algorithms via lattice basis reduction. Experimental Mathematics 7
(1998) 125-135.
13. G. Havas and C. Wagner. Matrix reduction algorithms for Euclidean rings.
Proc. 1998 Asian Symposium on Computer Mathematics, Lanzhou University
Press (1998) 65-70.
14. R. Kannan and A. Bachem. Polynomial algorithms for computing Smith and
Hermite normal forms of an integer matrix. SIAM J. Comput. 8 (1979) 499-507.
15. R. Kannan. Solving systems of linear equations over polynomials. Theoretical
Computer Science 39 (1985) 69-88.
16. E. Kaltofen, M.S. Krishnamoorthy, and B.D. Saunders. Parallel algorithms
for matrix normal forms. Linear Algebra Appl. 136 (1990) 189-208.
17. B.S. Majewski and G. Havas. A solution to the extended gcd problem.
ISSAC'95 (Proc. 1995 Internat. Sympos. Symbolic Algebraic Comput.), ACM
Press (1995) 248-253.
18. M. Newman. Integral Matrices. Academic Press, 1972.
19. C.C. Sims. Computation with finitely presented groups. Cambridge University
Press (1994).
20. A. Storjohann. Near optimal algorithms for computing Smith normal forms of
integer matrices. ISSAC'96 (Proc. 1996 Internat. Sympos. Symbolic Algebraic
Comput.), ACM Press (1996) 267-274.
21. A. Storjohann. Computing Hermite and Smith Normal Forms of Triangular
Integer Matrices. Linear Algebra Appl. 282 (1998) 25-45.
22. A. Storjohann and G. Labahn. Asymptotically fast computation of Hermite
normal forms of integer matrices. ISSAC'96 (Proc. 1996 Internat. Sympos.
Symbolic Algebraic Comput.), ACM Press (1996) 259-266.
23. C. Wagner. Normalformberechnung von Matrizen über euklidischen Ringen.
PhD thesis, Institut für Experimentelle Mathematik, Universität-GH Essen,
1997. Published by Shaker-Verlag, 52013 Aachen/Germany, 1998.
24. C. Wagner. Fast parallel Hermite normal form computation of matrices over
F[x]. Euro-Par'98 Parallel Processing, Lecture Notes Comput. Sci. 1470 (1998)
821-830.
Performance Analysis of Wavefront Algorithms on
Very-Large Scale Distributed Systems
Adolfy Hoisie, Olaf Lubeck and Harvey Wasserman
<hoisie, oml, hjw>@lanl.gov
Scientific Computing Group
Los Alamos National Laboratory
Los Alamos, NM 87545
Abstract. We present a model for the parallel performance of algorithms that consist of
concurrent, two-dimensional wavefronts implemented in a message passing environment.
The model combines the separate contributions of computation and communication wave-
fronts. We validate the model on three important supercomputer systems, on up to 500
processors. We use data from a deterministic particle transport application taken from the
ASCI workload, although the model is general to any wavefront algorithm implemented
on a 2-D processor domain. We also use the validated model to make estimates of per-
formance and scalability of wavefront algorithms on 100-TFLOPS computer systems ex-
pected to be in existence within the next decade as part of the ASCI program and else-
where. On such machines our analysis shows that, contrary to conventional wisdom, in-
ter-processor communication performance is not the bottleneck. Single-node efficiency is
the dominant factor.
1. Introduction
Wavefront techniques are used to enable parallelism in algorithms that have
recurrences by breaking the computation into segments and pipelining the segments
through multiple processors [1]. First described as "hyperplane" methods by
Lamport [2], wavefront methods now find application in several important areas
including particle physics simulations [3], parallel iterative solvers [4], and
parallel solution of triangular systems of linear equations [5-7].
Wavefront computations present interesting implementation and performance
modeling challenges on distributed memory machines because they exhibit a
subtle balance between processor utilization and communication cost. Optimal
task granularity is a function of machine parameters such as raw computational
speed, and inter-processor communication latency and bandwidth. Although it is
simple to model the computation-only portion of a single wavefront, it is
considerably more complicated to model multiple wavefronts existing simultaneously,
due to potential overlap of computation and communication and/or overlap of
different communication or computation operations individually. Moreover,
specific message passing synchronization methods impose constraints that can
further limit the available parallelism in the algorithm. A realistic scalability
analysis must take into consideration these constraints.
Much of the previous parallel performance modeling of software-pipelined appli-
cations has involved algorithms with one-dimensional recurrences and/or one-
dimensional processor decompositions [5-7]. A key contribution of this paper is
the development of an analytic performance model of wavefront algorithms that
have recurrences in multiple dimensions and that have been partitioned and pipe-
lined on multidimensional processor grids. We use a "compact application"
called SWEEP3D, a time-independent, Cartesian-grid, single-group, "discrete
ordinates" deterministic particle transport code taken from the DOE Accelerated
Strategic Computing Initiative (ASCI) workload. Estimates are that deterministic
particle transport accounts for 50-80% of the execution time of many realistic
simulations on current DOE systems; this percentage may expand on future 100-
TFLOPS systems. Thus, an equally-important contribution of this work is the
use of our model to explore SWEEP3D scalability and to show the sensitivity of
SWEEP3D to per-processor sustained speed, and MPI latency and bandwidth on
future-generation systems.
Efforts devoted to improving performance of discrete ordinates particle transport
codes date back many years and have extended recently to massively-parallel
systems [8-12]. Research has included models of performance as a function of
problem and machine size, as well as other characteristics of both the simulation
and the computer system under study. For example, Koch, Baker, and Alcouffe
[3] developed a parallel efficiency formula that considered computation only,
while Baker and Alcouffe [9] developed a model specific to CRAY T3D put/get
communication. However, these previous models had limiting assumptions about
the computation and/or the target machines.
In this work, we model parallel discrete ordinates transport and account for both
computation and communication. We validate the model on several architectures
within the realistic limits of all parameters appearing in the model. Sections 2 and
3 of the paper briefly describe the algorithm and its implementation. Sections 4
and 5 derive the performance model and give validation results. In the final sec-
tions of the paper, the model is used to estimate SWEEP3D performance on fu-
ture generation parallel systems, showing the sensitivity of this application to
system computation and communication parameters.
Note that although we present results for three different parallel systems, no
comparison of achieved system performance or scalability is intended. Rather,
measurements from the three systems are presented in an effort to demonstrate
generality of the performance model and sensitivity of application performance to
machine parameters.
2. Description of Discrete Ordinates Transport
Although much more complete treatments of discrete ordinates neutron transport
have appeared elsewhere [12-14], we include a brief explanation here to make
clear the origin of the wavefront process in SWEEP3D. The basis for neutron
transport simulation is the time-independent, multigroup, inhomogeneous Boltz-
mann transport equation, which is formulated as
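$$\nabla \cdot \Omega\,\Psi(r,E,\Omega) + \sigma(r,E)\,\Psi(r,E,\Omega) =
\iint dE'\,d\Omega'\,\sigma_s(r,E' \to E,\Omega \cdot \Omega')\,\Psi(r,E',\Omega') +
\frac{1}{4\pi}\iint dE'\,d\Omega'\,\chi(r,E' \to E)\,\nu\sigma_f(r,E')\,\Psi(r,E',\Omega') + Q(r,E,\Omega).$$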
The unknown quantity is $\Psi$, which represents the flux of particles at the spatial
point $r$ with energy $E$ traveling in direction $\Omega$.
Numerical solution involves complete discretization of the multi-dimensional
phase space defined by $r$, $\Omega$, and $E$. Discretization of energy uses a "multigroup"
treatment, in which the energy domain is partitioned into subintervals in which
the dependence on energy is known. In the discrete ordinates approximation, the
angular direction $\Omega$ is discretized into a set of quadrature points. This is also
referred to as the $S_N$ method, where (in 1D) $N$ represents the number of angular
ordinates used. The discretization is completed by differencing the spatial domain
of the problem on to a grid of cells.
The numerical solution to the transport equation involves an iterative procedure
called a "source iteration" (see Ref. 13). The most time-consuming portion is
the "source correction scheme," which involves a transport sweep through the
entire grid-angle space in the direction of particle travel. Because the resulting
iteration matrix is lower triangular, the grid needs to be traversed only once to
invert it. In Cartesian geometries, each octant of angles has a different
sweep direction through the mesh, and all angles in a given octant sweep the
same way.
For a given discrete angle, each grid cell has a spatially-exact particle "balance
equation" with seven unknowns. The unknowns are the particle fluxes on the six
cell faces and the flux within the cell. Boundary conditions and the spatial dif-
ferencing approximation are used to provide closure to the system. Boundary
conditions (typically vacuum or reflective) allow the sweep to be initiated at the
object's exterior. Thereafter, for any given cell, the fluxes on the three incoming
cell planes for particles traveling in a given discrete angle are known and are
used to solve for the cell center and the three cell faces through which particles
leave the cell. Thus, each interior cell requires in advance the solution of its
three upstream neighboring cells - a three-dimensional recursion. This is illus-
trated in Figure 1 for a 1-D arrangement of cells and in Figure 2 for a 2-D grid.
Figure 1. Dependences for a 1-D Transport Sweep.
Figure 2. 2-D Transport Sweep along a Diagonal Wavefront.
3. Parallelism in Discrete Ordinates Transport
The only inherent parallelism is related to the discretization over angles. How-
ever, reflective boundary conditions limit this parallelism to, at most, angles
within a single octant.
The two-dimensional recurrence may be partially eliminated because solutions
for cells within a diagonal are independent of each other (as shown in Figure 2).
The success of this "diagonal sweep" scheme on SIMD computers such as single-
processor vector systems (using 2-D plane diagonals) and the Thinking Ma-
chines, Inc. Connection Machine (using 3-D body diagonals) has been demon-
strated [3].
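The independence of cells on a diagonal can be illustrated with a toy 2-D sweep. In the sketch below each cell value depends on its west and north neighbors through a made-up update rule (source plus the two upstream values), which stands in for the balance equation and is not the transport discretization used in SWEEP3D. Processing the grid diagonal by diagonal gives the same result as a row-major sweep, and the inner loop over one diagonal has no dependences between its iterations.

```python
def sweep_row_major(src):
    """Reference sweep: visit cells row by row, west to east."""
    ny, nx = len(src), len(src[0])
    phi = [[0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            west = phi[j][i - 1] if i > 0 else 0
            north = phi[j - 1][i] if j > 0 else 0
            phi[j][i] = src[j][i] + west + north
    return phi


def sweep_by_diagonals(src):
    """Same recurrence, visiting cells with i + j == d for d = 0, 1, 2, ..."""
    ny, nx = len(src), len(src[0])
    phi = [[0] * nx for _ in range(ny)]
    for d in range(nx + ny - 1):
        # every cell on diagonal d needs only values from diagonal d - 1,
        # so this inner loop could be executed concurrently
        for j in range(max(0, d - nx + 1), min(ny, d + 1)):
            i = d - j
            west = phi[j][i - 1] if i > 0 else 0
            north = phi[j - 1][i] if j > 0 else 0
            phi[j][i] = src[j][i] + west + north
    return phi
```

A grid with `nx` by `ny` cells has `nx + ny - 1` diagonals, which is the number of sequential steps left after exploiting this concurrency.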
Diagonal concurrency can also be the basis for implementation of a transport
sweep using a decomposition of the mesh into subdomains using message passing
to communicate the boundaries between processors, as described in [12] and
shown in Figure 3. The transport sweep is performed subdomain by subdomain
in a given angular direction. Each processor's exterior surfaces are computed by,
and received in a message from, "upstream" processors owning the subdomains
sharing these surfaces.
However, as pointed out by Baker [9] and Koch [3], the dimensionality of the SN
parallelism is always one order lower than the spatial dimensionality because re-
cursion in one spatial direction cannot be eliminated.
Because of this, parallelization of the 3-D SN transport in SWEEP3D uses a 2-D
processor decomposition of the spatial domain.
Parallel efficiency would be limited if each processor computed its entire local
domain before communicating information to its neighbors. A strategy in which
blocks of planes in one direction (k, in the current implementation) and angles
are pipelined through this 2-D processor array improves the efficiency, as shown
in Figure 3. Varying the k- and angle-block sizes changes the balance between
parallel utilization and communication time.
Figure 3. Illustration of the 2-D Domain decomposition on eight processors
with 2 k-planes per block. The transport sweep has started at top of the
processor in the foreground. Concurrently-computed cells are shaded.
4. A Performance Model for Parallel Wavefronts
This section describes a performance model of a message passing implementation
of SWEEP3D. Our model uses a pipelined wavefront as the basic abstraction
and predicts the execution time of the transport sweep as a function of primary
computation and communication parameters. We use a two-parameter (la-
tency/bandwidth) linear model for communication performance, which is
equivalent to the LogGP model [15]. We use the term latency to mean the sum
of L and o in the LogGP framework, and bandwidth to mean the inverse of G.
Since different implementations of MPI use different buffering strategies as a
function of message size, a single set of latency/bandwidth parameters describes
a limited range of message sizes. Consequently, multiple sets are used to de-
scribe the entire range. Computation time is parameterized by problem size, the
number of floating-point calculations per grid point, and a characteristic single-
CPU floating-point speed.
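A two-parameter linear communication model with several parameter sets can be sketched as follows. The class name, the field names and the numerical values in the usage note are hypothetical placeholders chosen for illustration; they are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommRange:
    """One latency/bandwidth parameter set, valid up to max_bytes."""
    max_bytes: float        # largest message size covered by this set
    latency_s: float        # start-up cost in seconds
    bandwidth_Bps: float    # sustained bandwidth in bytes per second


def message_time(nbytes: float, ranges) -> float:
    """Time for one message: latency + size / bandwidth of the matching range.

    `ranges` must be sorted by increasing max_bytes.
    """
    for r in ranges:
        if nbytes <= r.max_bytes:
            return r.latency_s + nbytes / r.bandwidth_Bps
    raise ValueError("message size not covered by any parameter set")
```

For example, with the made-up sets `[CommRange(1024, 10e-6, 100e6), CommRange(float("inf"), 30e-6, 200e6)]`, a 2048-byte message is charged 30 microseconds of latency plus 10.24 microseconds of transfer time.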
4.1 Pipelined Wavefront Abstraction
An abstraction of the SWEEP3D algorithm partitioned for message passing on a
2-D processor domain (ij plane) is described in Figure 4. The inner-loop body of
this algorithm describes a wavefront calculation with recurrences in two dimen-
sions. Each processor must wait for boundary information from neighboring
processors to the north and west before computing on its subdomain. For con-
venience, we assume that the implementation uses MPI with synchronous,
blocking sends/receives. There is little loss of generality in this assumption since
the subdomain computation must wait for message receipt. Multiple waves initi-
ated by the octant, angle-block and k- block loops are pipelined one after another
as shown in Figure 5, in which two inner loop bodies (or "sweeps") are executing
on a Px by Py processor grid. Each diagonal line of processors is executing the
same k-block loop iteration in parallel on a different subdomain; two such
diagonals are highlighted in the figure.
Using this pipeline abstraction as the foundation, we can build a model of
execution time for the transport sweep. The number of steps required to execute a
computation of $N_{sweep}$ wavefronts, each with a pipeline length of $N_s$ stages and a
repetition delay of $d$, is given by equation (1).
$$\mathrm{Steps} = N_s + d\,(N_{sweep} - 1) \qquad (1)$$
The first wavefront exits the pipeline after $N_s$ stages and subsequent waves exit at
the rate of $1/d$.
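Equation (1) translates directly into a small helper. The function and parameter names below are ours.

```python
def pipeline_steps(n_stages: int, n_sweeps: int, delay: int = 1) -> int:
    """Steps needed for n_sweeps wavefronts through a pipeline of n_stages stages,
    when a new wavefront can enter every `delay` steps (equation (1))."""
    if n_sweeps < 1:
        raise ValueError("at least one wavefront is required")
    return n_stages + delay * (n_sweeps - 1)
```

With a pipeline of 7 stages, one wavefront needs 7 steps, while 10 wavefronts need 16 steps for a delay of 1 and 25 steps for a delay of 2, so the fill time of the pipeline is amortized as the number of wavefronts grows.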
The pipeline consists of both computation and communication stages. The
number of stages of each kind and the repetition delay per wavefront need to be
determined as a function of the number of processors and shape of the processor
grid. The cost of each individual computation/communication stage is dependent
on problem size, processor speed and communication parameters.
FOR EACH OCTANT DO
  FOR EACH ANGLE-BLOCK IN OCTANT DO
    FOR EACH K-BLOCK DO
      IF (NEIGHBOR_ON_WEST) RECEIVE FROM WEST (BOUNDARY DATA)
      IF (NEIGHBOR_ON_NORTH) RECEIVE FROM NORTH (BOUNDARY DATA)
      COMPUTE_MESH (EVERY I,J DIAGONAL; EVERY K IN K-BLOCK;
                    EVERY ANGLE IN ANGLE-BLOCK)
      IF (NEIGHBOR_ON_EAST) SEND TO EAST (BOUNDARY DATA)
      IF (NEIGHBOR_ON_SOUTH) SEND TO SOUTH (BOUNDARY DATA)
    END FOR
  END FOR
END FOR
Figure 4. Pseudo Code for the Wavefront Algorithm
4.2 Computation Stages
Figure 5 shows that the number of computation stages is simply the number of
diagonals in the grid.
A different number of processors is employed at each stage, but all stages take
the same amount of time since processors on a diagonal are executing
concurrently. The cost of one computational stage is thus the time to complete
one COMPUTE_MESH function (see algorithm abstraction above) on a processor's
subdomain. The discussion can be summarized with two equations. Equation (2)
gives the number of computation steps in the pipeline,
$$N_s^{comp} = P_x + P_y - 1 \tag{2}$$
and Equation (3) gives the cost of each step,
$$T_{cpu} = \left( \frac{N_x}{P_x} \times \frac{N_y}{P_y} \times \frac{N_z}{K_b} \times \frac{N_a}{A_b} \right) \frac{N_{flops}}{R_{flops}} \tag{3}$$
where Nx, Ny, and Nz are the number of grid points in each direction; Kb is the
size of the k-plane block; Ab is the size of the angular block; Nflops is the
number of floating-point operations per gridpoint; and Rflops is a
characteristic floating-point rate for the processor. The next sweep can begin
as soon as the first processor completes its computation, so the repetition
delay, d^comp, is 1 computational step (i.e., the time for completing one
diagonal in the sweep).
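The stage count of equation (2) and the unit repetition delay can be checked
with a small dependency simulation. This is an illustrative sketch only: the
name `compute_steps` is hypothetical, every COMPUTE_MESH call is assumed to cost
exactly one step, and communication is ignored.

```python
def compute_steps(px: int, py: int, n_sweep: int) -> int:
    """Steps until the last processor finishes its last sweep.

    A processor at (row, col) may start sweep k only after it has finished
    sweep k-1 and after its west and north neighbors have finished sweep k.
    """
    finish = {}
    for k in range(n_sweep):
        for row in range(py):
            for col in range(px):
                earliest = max(
                    finish.get((row, col, k - 1), 0),   # own previous sweep
                    finish.get((row, col - 1, k), 0),   # west neighbor
                    finish.get((row - 1, col, k), 0),   # north neighbor
                )
                finish[(row, col, k)] = earliest + 1
    return max(finish.values())


# One sweep on a 4 x 4 grid takes Px + Py - 1 = 7 steps;
# each additional sweep adds a single step.
print(compute_steps(4, 4, 1))   # 7
print(compute_steps(4, 4, 10))  # 16
```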
4.3 Communication Stages
The number and cost of communication stages are dependent on specific charac-
teristics of the communication system. The effect of blocking synchronous com-
munications is that messages initiated by the same processor occur sequentially in
time and messages must be received in the same order that they are sent. As im-
plemented, the order of receives is first from the west, then from the north, and
the order of sends is first to the east and then to the south. These rules lead to the
ordering (and concurrency) of the communications for a 4 x 4 processor grid as
shown in Figure 6 for a sweep that starts in the upper-left quadrant.
Figure 5. Multidimensional Pipelined Wavefronts. (Processor grid with axes Px
and Py.)
Figure 6. Communication Pipeline. (Processor grid with axes Px and Py; edges
numbered by communication step.)
In Figure 6, edges labeled with the same number are executed simultaneously, and
the graph shows that it takes 12 steps to complete one communication sweep on a
4 x 4 processor grid. We assume that a logical processor mesh can be embedded
into the machine topology such that each mesh node maps to a unique processor
and each mesh edge maps to a unique router link. One can generalize the number
of stages to a grid of Px by Py processors by observing that communication for
each row of processors is initiated by a message from a north neighbor in the
first column of processors. South-going messages in the first column of
processors occur on every other step since each processor in the column a) has
no west neighbor, and b) must send east before sending south. Thus the last
processor in the first column receives a message on step 2(Py-1). This initiates
a string of east-going messages along the last row that are also sent on every
other step, and the number of stages in the communication pipeline is given by
$$N_s^{comm} = 2(P_y - 1) + 2(P_x - 1) \tag{4}$$
Analogous to the computational pipeline, different stages of the communication
pipeline have different numbers of point-to-point communications. However,
since these occur simultaneously, the cost of any single communication stage is
the time of a one-way, nearest neighbor communication. This time is given by:
$$T_{msg} = t_0 + \frac{N_{msg}}{B} \tag{5}$$
where latency + overhead (t_0) and bandwidth (B) are defined in LogGP as noted
above.
The repetition delay for the communication pipeline, d^comm, is 4 because a
message sent from the top-left processor (processor 0) to its east neighbor
(processor 1) on the second sweep cannot be initiated until processor 1
completes its communication with its south neighbor from the first sweep
(Figure 6).
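The 12-step count for the 4 x 4 grid and the repetition delay of 4 can be
reproduced with a small lock-step simulation of blocking, synchronous message
passing. This is a sketch under stated assumptions, not the SWEEP3D code:
`comm_steps` is a hypothetical name, every matched send/receive pair is assumed
to take exactly one step, and computation is omitted.

```python
def comm_steps(px: int, py: int, n_sweep: int = 1) -> int:
    """Count lock-step rounds for n_sweep communication-only sweeps.

    px is the number of processor columns, py the number of rows.
    """
    def program(row, col):
        ops = []
        if col > 0:
            ops.append(("recv", (row, col - 1)))   # receive from west
        if row > 0:
            ops.append(("recv", (row - 1, col)))   # receive from north
        if col < px - 1:
            ops.append(("send", (row, col + 1)))   # send to east
        if row < py - 1:
            ops.append(("send", (row + 1, col)))   # send to south
        return ops * n_sweep

    prog = {(r, c): program(r, c) for r in range(py) for c in range(px)}
    pos = {p: 0 for p in prog}   # index of each processor's pending operation

    def pending(p):
        return prog[p][pos[p]] if pos[p] < len(prog[p]) else None

    steps = 0
    while any(pending(p) is not None for p in prog):
        # A message moves only when the sender's pending send and the
        # receiver's pending receive name each other.
        matched = []
        for p in prog:
            op = pending(p)
            if op is not None and op[0] == "send" and pending(op[1]) == ("recv", p):
                matched.append((p, op[1]))
        if not matched:
            raise RuntimeError("deadlock in communication schedule")
        for sender, receiver in matched:
            pos[sender] += 1
            pos[receiver] += 1
        steps += 1
    return steps


print(comm_steps(4, 4))      # 12 = 2(Py-1) + 2(Px-1)
print(comm_steps(4, 4, 2))   # 16, i.e. 4 steps after the first sweep
```

Because matches are collected from the state at the start of a step and applied
afterwards, all edges that fire in the same step are truly simultaneous, as in
Figure 6.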
4.4 Combining Computation and Communication Stages
In the previous two sections, we derived formulas for the modeling of SWEEP3D
that are general for any pipelined wavefront computation. We can summarize the
discussion in two equations that give the separate contributions of computation
and communication:
$$T^{comp} = \left[ (P_x + P_y - 1) + (N_{sweep} - 1) \right] T_{cpu} \tag{6}$$
$$T^{comm} = \left[ 2(P_x + P_y - 2) + 4(N_{sweep} - 1) \right] T_{msg} \tag{7}$$
The major remaining question is whether the separate contributions, T^comp and
T^comm, can be summed to derive the total time. They would not be additive if
there were any additional overlap of communication with computation not already
accounted for in each term. To see that this is not the case, consider the
task graph for an execution consisting of two wavefronts on a 3 x 3 processor
grid (Figure 7). This graph shows communication tasks (circles numbered with a
send/receive processor pair) and computation tasks (squares numbered by a
computing processor). The total number of stages in the combined
communication/computation pipeline is equal to the number of nodes (of each
type) in the longest path through the graph (the critical path) shown in dotted
boxes in the figure. The critical path for the first sweep can be counted from
Figure 7: 5 computational tasks and 8 communication tasks. This result is
exactly the number given by eqns. (2) and (4). One can further verify that
there is no further overlap between two pipelined sweeps other than the
predicted sum of eqns. (6) and (7). The second sweep completes exactly 1
computation and 4 communication steps after the first.
Figure 7. Pipelined Wavefront Task Graph.
In summary, total time for the sweep algorithm is the sum of eqns. (6) and (7),
where Tcpu is given by eqn. (3) and Tmsg is given by eqn. (5). The validation of
the model against experiment involves the measurement and/or modeling of
Tmsg and Tcpu. We take Tmsg to be the time needed for the completion of a
send/receive pair of an appropriate size and Tcpu to be the computational work
associated with the subgrid computation on each processor.
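The combined model of eqns. (5), (6) and (7) can be written as a short sketch.
The function and parameter names (`message_time`, `sweep_time`, `t0`, `n_msg`,
`bandwidth`) are illustrative; Tcpu and Tmsg are taken as inputs here, since
the text obtains them from measurement or separate modeling.

```python
def message_time(t0: float, n_msg: float, bandwidth: float) -> float:
    """Equation (5): latency plus overhead, plus message size over bandwidth."""
    return t0 + n_msg / bandwidth


def sweep_time(px: int, py: int, n_sweep: int, t_cpu: float, t_msg: float) -> float:
    """Total sweep time as the sum of equations (6) and (7)."""
    comp_stages = (px + py - 1) + (n_sweep - 1)           # eq. (6) coefficient
    comm_stages = 2 * (px + py - 2) + 4 * (n_sweep - 1)   # eq. (7) coefficient
    return comp_stages * t_cpu + comm_stages * t_msg


# Two wavefronts on a 3 x 3 grid with unit stage costs, as in Figure 7:
# the first sweep has 5 computation and 8 communication stages, and the
# second adds 1 and 4, giving 6 + 12 = 18 stages in total.
print(sweep_time(3, 3, 1, 1, 1))  # 13
print(sweep_time(3, 3, 2, 1, 1))  # 18
```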
5. Validation of the Model
In this section, we present results that validate the model with performance data
from SWEEP3D on three different machines, with up to 500 processors, over the
entire range of the various model parameters. Inspection of eqns. (6) and (7)
leads to identification of the following validation regimes:
Nsweep = 1: This case validates the number of pipeline stages in T^comp and
T^comm as functions of (Px+Py), in the available range of processor
configurations.
Nsweep ~ (Px+Py): Validation of a case where the contributions of the (Px+Py)
and Nsweep terms are comparable.
Nsweep >> (Px+Py): This case validates the repetition rate of the pipeline.
For each of these three cases, we analyze problem sizes chosen in such a way as
to make:
T^comp >> T^comm (validate eqn. (6) only)
T^comp = 0 (validate eqn. (7) only)
T^comp ~ T^comm (validate the sum of eqns. (6) and (7)).
5.1 Nsweep = 1
For a single sweep, the coefficients of Tmsg and Tcpu in equations 7 and 6
represent the number of communication and computation stages in the pipeline,
respectively. Any overlap in communication or computation during the single
sweep of the mesh is encapsulated in the respective coefficients. In
hypothetical problems with Tmsg ~ Tcpu, and in the limit of large processor
configurations (large Px+Py), equations 6 and 7 show that the communication
component of the elapsed time would be twice as large as the contribution of
the computation time. In reality, for problem sizes and partitionings
reasonably designed (small subgrid surface-to-volume ratio), Tcpu is
considerably larger than Tmsg. Computation is the dominant component of the
elapsed time.
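The factor of two quoted for the hypothetical case follows directly from the
stage counts in eqns. (6) and (7) with Nsweep = 1. The short derivation below
is a worked restatement of those equations, not an additional result:

```latex
\[
\left.\frac{T^{comm}}{T^{comp}}\right|_{N_{sweep}=1}
  = \frac{2\,(P_x + P_y - 2)\,T_{msg}}{(P_x + P_y - 1)\,T_{cpu}}
  \;\longrightarrow\; 2\,\frac{T_{msg}}{T_{cpu}}
  \qquad \text{as } P_x + P_y \to \infty ,
\]
```

so that with Tmsg equal to Tcpu the ratio approaches 2. On a 4 x 4 grid, for
instance, the ratio of stage counts is 12/7, or about 1.7.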
This is apparent in Figure 8, which presents the model-experiment comparison
for a weak scalability analysis of a 16 x 16 x 1000 subgrid size sweeping only
one octant. This size was chosen to reflect an estimate of the subgrid size for
a 1-billion cell problem running on a machine with about 4,000 processors; the
former is a canonical goal of ASCI and the latter is simply an estimate of the
machine size that might satisfy a 3-TFLOPS peak performance requirement. In a
"weak scalability" analysis, the problem size scales with the processor
configuration so that the computational load per processor stays constant. This
experiment shows that the contribution of communication is small (in fact, the
model shows that it is about 150 times smaller than computation), and the model
is in very good agreement with the experiment.
We note that in the absence of communication our model reduces to the linear
"parallel computational efficiency" models used by Baker [9] and Koch [3] for
SN performance, in which parallel computational efficiency is defined as the
fraction of time a processor is doing useful work.
To validate the case with Nsweep = 1 and "comparable" contributions of
communication and computation we had to use a subgrid size that is probably
unrealistic for actual production simulation purposes (5 x 5 x 1). Even with
this size, computation outweighs communication by about a factor of 6. Figure 9
depicts a weak scalability analysis on the SGI Origin 2000 for this size. The
model-experiment agreement is again very good.
Validation of cases where T^comp = 0 involved the development of a new code to
simulate the communication pattern in SWEEP3D in the absence of computation.
The code developed for this purpose simply implements a receive-west,
receive-north, send-south, send-east communication pattern enclosed in loops
that initiate multiple waves. Figure 10 shows a very good agreement of the
model with the measured data from this code.
Figure 8. T^comp dominant. Nsweep = 1. IBM RS/6000. (Horizontal axis: Px+Py;
series: Measured, Model, Tcomp from Model.)
Figure 9. T^comp ~ T^comm. Nsweep = 1. SGI Origin. (Horizontal axis: Px+Py;
series: Measured, Model, Tcomp from Model.)
Figure 10. T^comp = 0. Nsweep = 1. SGI Origin. (Time in seconds versus Px+Py;
series: Measured, Model.)
Figure 11. T^comp dominant. Nsweep = 10. SGI Origin. (Time in seconds versus
Px+Py; series: Measured, Model.)
5.2 Nsweep ~ (Px+Py)
As described in Section 4, sweeps of the domain generated by successive octants,
angle blocks, and k-plane blocks are pipelined, with the depth of the pipeline,
Nsweep, given by the product of the number of octants, angle blocks, and k-plane
blocks. We can select k-plane and angle block sizes so that Nsweep = 10, which,
in turn, balances the contribution of Nsweep and (Px+Py) for processor
configurations used in this work. In Figure 11 the comparison using a data size
for which T^comp is dominant is presented, showing an excellent agreement with
the measured elapsed time. The case with no computation is in fact a succession
of 10 sweeps of the domain, with the communication overlap described by
equation 7. Figure 12 shows a very good agreement with experimental data for
this case.
An excellent model-experiment agreement is similarly shown in Figure 13, for a
subgrid size 5 x 5 x 1, which leads to balanced contributions of the
communication and computation terms to the total elapsed time of SWEEP3D.
Figure 12. T^comp = 0. Nsweep = 10. CRAY T3E. (Horizontal axis: Px+Py; series:
Measured, Model.)
Figure 13. T^comp dominant. Nsweep = 10. SGI Origin. (Horizontal axis: Px+Py;
series: Measured, Model.)
5.3 Nsweep >> Px+Py
We present model-data comparisons using weak scalability experiments for cases
in which Nsweep is large compared with (Px+Py) in Figure 14 (6 x 6 x 360
subgrid; T^comp ~ T^comm) and in Figure 15 (16 x 16 x 1000 subgrid; T^comp
dominant). The model is in good agreement with the measured execution times of
SWEEP3D in both cases.
5.4 Strong Scalability
In a "strong scalability" analysis, the overall problem size remains constant as
the processor configuration increases. Therefore, Tmsg and Tcpu vary from run to
run as the problem size per processor decreases. Figure 16 shows the comparison
between measured and modeled time for the strong scalability analysis out to
nearly 500 processors on the problem size 50 x 50 x 50. The agreement is
excellent.
Figure 14. T^comp ~ T^comm. 6 x 6 x 360. Nsweep large. CRAY T3E. Kb = 10.
(Horizontal axis: Px+Py; series: Measured, Model.)
Figure 15. T^comp dominant. 16 x 16 x 1000. Nsweep large. IBM RS/6000 SP.
(Horizontal axis: Px+Py; series: Measured, Model.)
Figure 16. Strong Scalability. CRAY T3E. (Horizontal axis: Px+Py; series:
Measured, Model.)
6. Applications of the Model. Scalability Predictions
Performance models of applications are important to computer designers trying
to achieve proper balance between performance of different system components.
ASCI is targeting a 100-TFLOPS system in the year 2004, with a workload defined
by specific engineering needs. For particle transport, the ASCI target involves
O(10^9) mesh points, 30 energy groups, O(10^4) time steps, and a runtime goal of
about 30 hours. With 5,000 unknowns per grid point, this requires about
40 TBytes total memory. In this section we apply our model to understanding the
conditions under which the runtime goal might be met.
Two sources of difficulty with such a prognosis are (1) making reasonable esti-
mates of machine performance parameters for future systems, and (2) managing
the SWEEP3D parameter space (i.e., block sizes). We handle the first by
studying a range of values covering both conservative and optimistic changes in
technology. We handle the second by reporting results that correspond to the shortest
execution time (i.e., we use block sizes that minimize runtime).
We assume a 100-TFLOPS-peak system composed of about 20,000 processors
(5 GFLOPS peak per processor, an extrapolation of Moore's law). With this
processor configuration, given the proposed size of the global problem, the
resulting subgrid size is approximately 6 x 6 x 1000.
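The sizing figures quoted in this section can be cross-checked with simple
arithmetic. The sketch below uses only numbers stated in the text, except for
the 8 bytes per unknown, which is an assumption (64-bit floating point) made
here to reproduce the 40 TBytes estimate.

```python
MESH_POINTS = 10**9           # O(10^9) mesh points
UNKNOWNS_PER_POINT = 5_000    # unknowns per grid point
BYTES_PER_UNKNOWN = 8         # assumption: 64-bit floating-point values
PROCESSORS = 20_000
PEAK_GFLOPS_PER_PROC = 5

total_tbytes = MESH_POINTS * UNKNOWNS_PER_POINT * BYTES_PER_UNKNOWN // 10**12
peak_tflops = PROCESSORS * PEAK_GFLOPS_PER_PROC // 1000
cells_per_proc = MESH_POINTS // PROCESSORS
subgrid_cells = 6 * 6 * 1000

print(total_tbytes)     # 40 TBytes of memory
print(peak_tflops)      # 100 TFLOPS peak
print(cells_per_proc)   # 50000 cells per processor for exactly 10^9 points
print(subgrid_cells)    # 36000 cells in the quoted 6 x 6 x 1000 subgrid
```

The quoted subgrid is therefore of the same order as, but somewhat smaller
than, an even split of exactly 10^9 points, which is consistent with the mesh
size being given only as an order of magnitude.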
Plots showing dependence of runtime on sustained processor speed and latency
for MPI communications are shown in Figures 17 and 18 for several k-plane
block sizes and using optimal values for the angle-block size. Table 1 collects
some of the modeled runtime data for a few important points: sustained
processor speeds of 10% and 50% of peak, and MPI latencies of 0.1, 1, and 10
microseconds. Our model shows that the dependence on bandwidth (1/G in LogGP)
is small, and as such no sensitivity plot based on ranges for bandwidth is
presented. Table 1 data assumes a bandwidth of 400 Mbytes/s.
One immediate observation is that the runtime under the most optimistic
technological estimates in Table 1 is still larger than the 30-hour goal by a
factor of two. The execution time goal could be met if, in addition to these
values of processor speed and MPI latency (L+o in LogGP), we used what we
believe to be an unrealistically high bandwidth value of 4 GBytes/s.
Assuming a more realistic sustained processor speed of 10% of peak (based on
data from today's systems), Table 1 shows that we miss the goal by about a
factor of six, even when using 0.1 µs MPI latency. With the same assumption for
processor speed, but with a more conservative value for latency (1 µs), the
model predicts that we are a factor of 6.6 off. In fact, our results show that
the best way to decrease runtime is to achieve better sustained per-processor
performance. Changing the sustained processor rate by a factor of five
decreases the runtime by a factor of three, while decreasing the MPI latency by
a factor of 100 reduces runtime by less than a factor of two. This is a result
of the relatively low communication/computation ratio that our model predicts.
For example, using values of 1 µs and 400 MB/sec for the communication latency
and bandwidth, and a sustained processor speed of 0.5 GFLOPS, the communication
time will only be 20% of the total runtime.
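The factors quoted in the preceding paragraphs can be recomputed from the
modeled runtimes in Table 1. The sketch below is only a cross-check of that
arithmetic; the dictionary layout and the helper name `factor_over_goal` are
illustrative.

```python
GOAL_HOURS = 30

# Modeled runtimes in hours from Table 1, keyed by
# (sustained speed as percent of peak, MPI latency in microseconds).
runtime_hours = {
    (10, 0.1): 180, (10, 1.0): 198, (10, 10.0): 291,
    (50, 0.1): 56,  (50, 1.0): 74,  (50, 10.0): 102,
}


def factor_over_goal(percent_of_peak: int, latency_us: float) -> float:
    """How many times the modeled runtime exceeds the 30-hour goal."""
    return runtime_hours[(percent_of_peak, latency_us)] / GOAL_HOURS


print(round(factor_over_goal(50, 0.1), 1))   # 1.9: most optimistic case
print(round(factor_over_goal(10, 0.1), 1))   # 6.0
print(round(factor_over_goal(10, 1.0), 1))   # 6.6

# Five-fold faster processors at 0.1 us latency:
print(round(runtime_hours[(10, 0.1)] / runtime_hours[(50, 0.1)], 1))    # 3.2
# Hundred-fold lower latency at 10% and at 50% of peak:
print(round(runtime_hours[(10, 10.0)] / runtime_hours[(10, 0.1)], 1))   # 1.6
print(round(runtime_hours[(50, 10.0)] / runtime_hours[(50, 0.1)], 1))   # 1.8
```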
Figure 17. Sensitivity of the billion-point transport sweep time to sustained
per-processor CPU speed on a hypothetical 100-TFLOPS system as projected by the
model for several k-plane block sizes and with MPI latency = 15 µs, and
bandwidth = 400 Mbytes/s. (Horizontal axis: Sustained CPU Speed (MFLOPS);
series: 10, 100, and 500 k-planes per block.)
Figure 18. Sensitivity of the billion-point transport sweep time to MPI latency
on a hypothetical 100-TFLOPS system as projected by the model for several
k-plane block sizes and with sustained per-processor CPU speed = 500 MFLOPS,
bandwidth = 400 Mbytes/s. (Horizontal axis: Latency (µs); series: 10, 100, and
500 k-planes per block.)
Table 1. Estimates of SWEEP3D Performance on a Future-Generation System as a
Function of MPI Latency and Sustained Per-Processor Computing Rate

              10% of Peak                       50% of Peak
MPI Latency   Runtime    Amount of              Runtime    Amount of
              (hours)    Communication          (hours)    Communication
0.1 µs        180        16%                    56         52%
1.0 µs        198        20%                    74         54%
10 µs         291        20%                    102        58%
7. Conclusions
A scalability model for parallel, multidimensional, wavefront calculations has
been proposed with machine performance characterized using the LogGP frame-
work. The model accounts for overlap in the communication and computation
components. The agreement with experimental data is very good under a variety
of model sizes, data partitionings, blocking strategies, and on three different par-
allel architectures. Using the proposed model, performance of deterministic
transport codes on future generation parallel architectures of interest to ASCI has
been analyzed. Our analysis showed that contrary to conventional wisdom, inter-
processor communication performance is not the bottleneck. Single-node effi-
ciency is the dominant factor.
8. Acknowledgements
We would like to thank Ken Koch and Randy Baker of LANL Groups X-CM and
X-TM for many helpful discussions and for providing several versions of the
SWEEP3D benchmark. We thank Vance Faber and Madhav Marathe of LANL
Group CIC-3 for interesting discussions regarding mapping problem meshes to
processor meshes. We acknowledge the use of computational resources at the
Advanced Computing Laboratory, Los Alamos National Laboratory, and support
from the U.S. Department of Energy under Contract No. W-7405-ENG-36. We
also thank Pat Fay of Intel Corporation for help running SWEEP3D on the San-
dia National Laboratory ASCI Red TFLOPS system, and SGI/CRAY for a gen-
erous grant of computer time on the CRAY T3E system. We also acknowledge
the use of the IBM SP2 at the Lawrence Livermore National Laboratory.
References
1. G. F. Pfister, In Search of Clusters - The Coming Battle in Lowly Parallel Computing,
Prentice Hall PTR, Upper Saddle River, NJ, 1995, pages 219-223.
2. L. Lamport, "The Parallel Execution of DO Loops," Communications of the ACM,
17(2):83:93, ?., 19?.
3. K. R. Koch, R. S. Baker and R. E. Alcouffe, "Solution of the First-Order Form of the
3-D Discrete Ordinates Equation on a Massively Parallel Processor," Trans. of the Amer.
Nuc. Soc., 65, 198, 1992.
4. W. D. Joubert, T. Oppe, R. Janardhan, and W. Dearholt, "Fully Parallel Global M/ILU
Preconditioning for 3-D Structured Problems," to be submitted to SIAM J. Sci. Comp.
5. J. Qin and T. Chan, "Performance Analysis in Parallel Triangular Solve," In Proc. of
the 1996 IEEE Second International Conference on Algorithms & Architectures for Par-
allel Processing, pages 405-412, June, 1996.
6. M. T. Heath and C. H. Romine, "Parallel Solution of Triangular Systems on Distrib-
uted Memory Multiprocessors," SIAM J. Sci. Statist. Comput. Vol. 9, No. 3, May 1988,
7. R. F. Van der Wijngaart, S. R. Sarukkai, and P. Mehra, "Analysis and Optimization of
Software Pipeline Performance on MIMD Parallel Computers," Technical Report NAS-
97-003, NASA Ames Research Center, Moffett Field, CA, February, 1997.
8. R. E. Alcouffe, "Diffusion Acceleration Methods for the Diamond-Difference
Discrete-Ordinates Equations," Nucl. Sci. Eng. 64, 344 (1977).
9. R. S. Baker and R. E. Alcouffe, "Parallel 3-D SN Performance for DANTSYS/MPI on
the CRAY T3D," Proc. of the Joint Int'l Conf. on Mathematical Methods and
Supercomputing for Nuclear Applications, Vol. 1, page 377, 1997.
10. M. R. Dorr and E. M. Salo, "Performance of a Neutron Transport Code with Full
Phase Space Decomposition and the CRAY Research T3D," ???
11. R. S. Baker, C. Asano, and D. N. Shirley, "Implementation of the First-Order Form of
the 3-D Discrete Ordinates Equations on a T3D," Technical Report LA-UR-95-1925, Los
Alamos National Laboratory, Los Alamos, NM, 1995; 1995 American Nuclear Society
Meeting, San Francisco, CA, 10/29-11/2/95.
12. M. R. Dorr and C. H. Still, "Concurrent Source Iteration in the Solution of Three-
Dimensional Multigroup Discrete Ordinates Neutron Transport Equations, " Technical
Report UCRL-JC-116694, Rev 1, Lawrence Livermore National Laboratory, Livermore,
CA, May, 1995.
13. E. E. Lewis and W. F. Miller, Computational Methods of Neutron Transport, Ameri-
can Nuclear Society, Inc., LaGrange Park, IL, 1993.
14. R. E. Alcouffe, R. Baker, F. W. Brinkley, D. Marr, R. D. O'Dell and W. Walters,
"DANTSYS: A Diffusion Accelerated Neutral Particle Transport Code," Technical Report
LA-12969-M, Los Alamos National Laboratory, Los Alamos, NM, 1995.
15. D. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. Schauser, R. Subramonian,
and T. von Eicken, "LogP: A Practical Model of Parallel Computation," Communications
of the ACM, 39(11):79:85, Nov., 1996.
16. H. J. Wasserman, O. M. Lubeck, Y. Luo and F. Bassetti, "Performance Evaluation of
the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications,"
Proceedings of SC97, IEEE Computer Society, November, 1997.
17. C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. L. Hennessy, "The Effects of
Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors, "
Stanford University Computer Science Report CSL-TR-95-660, January, 1995.
The DFN Gigabitwissenschaftsnetz G-WiN
Eike Jessen,
DFN/TU München
Abstract. The German national scientific networking association, DFN, will provide a
gigabit network (G-WiN) in spring 2000. The paper analyzes history and trend of DFN
network throughput, bandwidth, and cost, and the traditional and innovative load to be
carried by the network is evaluated and forecast. Testbeds, which promote gigabit
applications and pilot technology, are described. The current status of G-WiN
specification is compared to US projects.
1 Deutsches Forschungsnetz
G-WiN is the abbreviation for Gigabit-Wissenschaftsnetz. It will go into
operation in spring 2000. It is provided to research and education in Germany by
the Verein zur Förderung eines Deutschen Forschungsnetzes (DFN), an
association of 400 members, mainly universities and polytechnics, research
institutes and other technical and scientific institutions. The purpose of DFN
is to
• provide networking for research and education in Germany; DFN does so by
specifying and procuring network services; it acts as a cooperative for the
interest of its members
• promote efficiency and quality of research and education by innovative
network usage
In 1998, DFN has a turnover of 174 million DM, of which 110 million DM are spent
for data networking services, including the links from Germany to foreign
countries. The cost of networking services is paid by the participating
institutions, except transient funding by the Federal Ministry of Education and
Science, Research and Technology (BMBF), for the introduction of new network
generations.
2 Evolution of Network Throughput and Cost
The first DFN Wissenschaftsnetz, the X.25 WiN, began its operation in 1990.
Since then, the original network has been upgraded to access rates of 2 Mb/s
(1992) and been widely replaced by the Breitbandwissenschaftsnetz B-WiN
(broad band science network), with access rates of up to 155 Mb/s. These
networks have been specified by DFN and are operated by the Deutsche
Telekom. The G-WiN will offer access rates of up to 2488 Mb/s.
Fig. 1 shows the evolution of the annual average throughput and the bandwidth
of the DFN networks. The bandwidth is the maximum throughput that can be
carried by the network, given the existing configuration and routing and a
traffic distribution as observed in current operation. The bandwidth has to
surpass the average throughput by the peak hour factor (1.8), and a burstiness
surcharge factor (2 to 3 at least). In 1995/96 the X.25 WiN suffered from heavy
congestion, as the average throughput approached network bandwidth.
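The dimensioning rule in the preceding paragraph can be written as a short
sketch. The two factors are taken from the text; the function name
`required_bandwidth` and the choice to return a (low, high) range are
illustrative assumptions.

```python
PEAK_HOUR_FACTOR = 1.8          # peak hour factor from the text
BURSTINESS_RANGE = (2.0, 3.0)   # burstiness surcharge, "2 to 3 at least"


def required_bandwidth(avg_throughput: float) -> tuple[float, float]:
    """Return the (low, high) bandwidth needed for an average throughput.

    The result is in the same unit as avg_throughput.
    """
    low, high = (avg_throughput * PEAK_HOUR_FACTOR * b for b in BURSTINESS_RANGE)
    return low, high


# The bandwidth should be roughly 3.6 to 5.4 times the average throughput.
low, high = required_bandwidth(100.0)
print(f"{low:.0f} to {high:.0f}")
```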
Network average throughput has grown by factors of 2 to 5 per year, where there
was only one year, that of the introduction of B-WiN, with the high factor. For
the future, one can expect a growth of 2 to 2.5 per year for the traditional
network load. In Fig. 1, this leads to the forecast area marked by the two
diverging straight lines, roughly a factor of 5 from summer 1998 to summer 2000.
Besides, the network will enable its users to go into new forms of
communication, principally characterized by high bandwidth real time
applications. With large uncertainty, a further factor of 1.5 accounts for this
innovative usage (i.e. 2.5 times the capacity of the B-WiN for today). So,
summing it up, G-WiN should have approximately 8 times the throughput in
2000, compared to B-WiN in mid 1998, or 4.5 Gb/s. Our estimates are
conservative, compared to estimates for future US nets.
The diagram of Fig. 1 describes the cost of science nets in Germany (core
network for connecting 600 institutions) by lines of constant cost. The lines
rise over time by a factor of 1.4 per year, i.e. networks of this class have - at
constant bandwidth - become cheaper by a factor of 1.4 (or 30 % less cost) per
year. This trend will increase in the years to come, and an additional factor of
2 per year seems credible in the next 3 years, because of excess fiber
capacities in Germany (investment of the competing line and network providers),
multiplex usage of existing fibers (wavelength division multiplexing, WDM), and
an effective competition on the market, which in the high speed data
communication area only begins to work. This specific forecast makes the
iso-cost lines bend in 1998.
The distance of the iso-cost lines describes the cost of bandwidth at constant
time. It is slightly better (from the point of view of the network customer)
than Grosch's law in the first two decades of the computer market. The customer
pays only square root of n times the price for an n-fold increase in power. This
is an "economy of scale" effect. During their history, the DFN science networks
have taken their way up the slope of cost; for the 400 fold increase in
bandwidth, only 5 times the cost is paid (core net only!), as a consequence of
the trend in time and the progress to higher bandwidth, using the economy of
scale. Extrapolating the analysis to the year 2000 and to the G-WiN with 8 times
the bandwidth of B-WiN, the cost of G-WiN should be well under 2 times that of
B-WiN, maybe 1.4. By the way, this means that the price per bit x kilometer will
be 6 times lower, recommending G-WiN not only as a high performance network for
innovative usage, but also a cheap medium for mass data transport.
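The rounded figures in this section follow from a few multiplications and
divisions. The sketch below only re-derives them from the factors stated in the
text; the variable names are illustrative.

```python
import math

traditional_growth = 5.0   # traditional load, summer 1998 to summer 2000
innovative_factor = 1.5    # surcharge for innovative realtime usage
throughput_factor = traditional_growth * innovative_factor

bandwidth_factor = 8.0     # G-WiN bandwidth relative to B-WiN
cost_factor = 1.4          # "maybe 1.4" times the cost of B-WiN
price_drop = bandwidth_factor / cost_factor

yearly_saving = 1.0 - 1.0 / 1.4              # cheaper by a factor of 1.4 per year
square_root_law = math.sqrt(bandwidth_factor)  # cost factor under a pure sqrt(n) law

print(f"throughput factor: {throughput_factor:.1f}")    # quoted as about 8
print(f"price per bit x km drops by: {price_drop:.1f}")  # quoted as 6
print(f"yearly saving: {yearly_saving:.0%}")              # quoted as 30 %
print(f"sqrt(8) cost factor: {square_root_law:.2f}")
```

The square-root value is larger than the "well under 2" expected for G-WiN
because the estimate in the text also includes the yearly price decline between
1998 and 2000.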
3 Network Load
Traditional network load will grow by a factor of 4 to 6 between spring 1998 and
spring 2000. This seems to be well established by the experiences of the past,
see Fig. 1. The network will, however, enable its users to put a far more
demanding load on the net, in particular high bandwidth communication under
realtime constraints. Fig. 2 gives a survey of the types and limiting factors of
innovative load, of the resulting traffic volume in the network, and of the peak
bandwidth per connection.
Among the real-time communication types, metacomputing (distributed
supercomputing) has the highest requirements for peak bandwidth per
connection. Data may be fed into the network at speeds of 0.8 and 1.6 Gb/s. The
number of concurrent computations in the network, however, will be small,
perhaps 5, and the repetition frequency and volume of the communication bursts
will be small, resulting in a medium throughput. There is competition with local
computing; realistic bandwidth cost and security arguments will limit the usage.
4 Innovative Load
Visualisation and control of remote computing processes, executed on special
and high performance computers, will be widely used, but - under adequate
compression techniques - will not produce high data throughput per connection.
Even virtual reality can be reduced to multiple video and graphics channels with
only moderate bandwidth (some Mb/s each) per connection.
Interpersonal communication by network phone/video, as well as interpersonal
cooperation, will be very frequent and therefore result in a high bandwidth
demand, though the demand per connection is also moderate. Media servers, for
instance for distribution of distance learning and teaching multimedia material,
will be few in number, but with high throughput, though largely not in realtime,
so that they do not belong in our survey of realtime innovative load. There is,
however, a class of realtime high bandwidth media communication visible, where
studio media data are transported to remote processing systems and
retransmitted in realtime, so that during the recording the processed signal
can be checked and used for the control of the recording.
Type | Factors limiting the usage | Network volume (total) | Peak bandwidth per connection
realtime
Distributed computation | Algorithms, economy, security | Medium | high
Remote visualisation & control of computation | Competition with workstations | Low | low ... high
Interpersonal communication & cooperation (phone, video, CSCW, virtual reality) | Terminal devices | Medium | low ... medium
Media servers | Server technology, portable media | Low | low ... medium
Signal processing | Competition with local processing | Medium | low ... high
non-realtime
Update, distribution, backups, non-interactive visualisation, experiments | Competition with portable media | Low | medium
Fig. 2: Classes of innovative network load and probable volume (in total) and
peak bandwidth (per connection)
As the network has much lower costs for the volume x distance product,
non-realtime mass data transports may be attractive, compared to the shipment of
physical media. This will bring software and data distribution onto the network,
as well as backups, experimental data and non-interactive visualisation output
of remote computations, to an extent far increased compared to today. The
limiting factor will be the progress in the technology of portable media.
Although this kind of load may be considerable in volume, it does not define
network peak performance, as it may be delayed.
5 Gigabit Testbeds (1997-2000)
DFN runs two gigabit testbeds (see Fig. 3) for application promotion and for
technology piloting. The testbeds are to deliver experience for the
specification of the Gigabitwissenschaftsnetz G-WiN.
The first testbed was established in 1997 and is situated in North
Rhine-Westphalia. It connects institutions in Jülich, Bonn, Cologne and Essen.
It has been upgraded to 2.5 Gb/s and studies applications in metacomputing,
remote visualisation, media processing and simulation. The problems come from
physics, material sciences, geosciences, traffic control, and studio data
processing.
The second testbed was established in 1998 and connects Munich, Erlangen, and
Berlin. It pilots wavelength division multiplexing (3x2488 Mb/s per fiber) and
optical amplification. Its applications come from the same areas and from
medicine and distance teaching.
6 G-WiN Specification
The general requirements for G-WiN are, as of late autumn 1998,
• data communication infrastructure for German science, as the successor
  of B-WiN, i.e. a full scale network, comprising high performance and
  commodity traffic
• built with components from the market (this seems achievable)
• a production network (not merely a research object), though probably
  beyond state of the market in 2000
• roughly 10 fold bandwidth, compared to B-WiN 1998
• more configuration flexibility, to use advantageous regional
  opportunities and to offer more flexible conditions of usage to the
  participating institutions
• efficient IP-based communication; it is not clear in 1998 whether the
  network will, in the long run, base IP on ATM and SDH, as B-WiN (and
  vBNS e.g.) has done with very good operational results, or base IP
  directly on SDH, as the Abilene network in USA will do, or base the IP
  protocol stack directly on the "black" fiber, multiplexed by WDM
  (which can be combined with the first two options as well).
  The decision reflects mainly the tradeoff between bandwidth
  consumption, complexity, and bandwidth allocation granularity. In any
  case, the network is to cooperate with ATM based access systems, and it
  is likely that G-WiN should start with ATM and skip it when the
  competing architectures are mature.
• guaranteed quality of service; it is, however, by autumn 1998, by no
  means clear how this can be achieved; the solution is closely related to
  the ATM/SDH variants. B-WiN offers ATM permanent virtual channels
  as the only means of guaranteed service. ATM traffic classes still lack
  the embedding in application oriented quality of service requirements.
  Besides, ATM will not generally be available on an end-to-end basis
  between the host computers. So there is much debate on schemes based
  fully on IP, such as RSVP and (more recently) MPLS (multi-protocol label
  switching). These techniques are, however, not yet well understood in
  actual operation. So, at least now, the way of providing adequate quality
  of service to the flows in the network remains open.
• specification for potential line and network providers until the end of
  1998, start of G-WiN operation in spring 2000
7 Comparison with other Scientific Networks
Fig. 4 puts the G-WiN into perspective with similar projects of the preceding
generation and that of tomorrow. vBNS is structurally very similar to B-WiN, but
operates on a higher level of bandwidth and lower level of load; the distance
between G-WiN and Abilene (one year ahead) will be smaller. Both US scientific
nets are mainly (80%) funded by sponsors; the participating institutions see only
a small percentage of the actual network costs. vBNS and Abilene are defined not
to carry the lower bandwidth commodity traffic, which deliberately is left to
commercial providers, though it probably could be carried by the broadband
networks at lower cost. B-WiN connects a much larger number of institutions
(even neglecting the 4000 users of the B-WiN dial-up service WiNShuttle).
Gigabit Testbeds (1997-2000)
* Technology piloting
* Application promotion
            West                                South
            Jülich, Bonn, Cologne, Essen ... ?  Erlangen, München, Berlin, Stuttgart (?)
Start:      August 97                           July 98
Lines:      0.6, 2.5 Gb/s                       0.6, 2.5 Gb/s, WDM
Provider:   o.tel.o                             DTAG
            HiPPI, ATM/SDH                      ATM/SDH
            (2 Gb/s ATM achieved!)
Applications
Metacomputing + Visualisation
* Molecular Dynamics
* Earth Shell
Media Processing
* Distrib. virtual TV Prod.
Simulation + Visualisation
* Traffic
* Black Holes
* Surface Effects
* Distrib. TV Prod./Server
* Virtual Laboratory
* Distance Learning
* Medicine
Fig. 3: DFN Gigabit Testbeds (1997-2000)
Network            vBNS            B-WiN         Abilene                  G-WiN
Operational since  1996            1996          1999?                    2000?
Ordered by         NSF             DFN           UCAID                    DFN
Provider selected  MCI             DTAG e.a.     Qwest, UCAID             to be decided
Paid by            MCI/NSF/instit  instit/BMBF   Qwest/instit             instit/BMBF
Protocols          IP/ATM/Sonet    IP/ATM/SDH    IP/Sonet                 IP/?
Sites              71              600           120                      600
Usage restriction  "projects"      (science)     (no commodity traffic)   (science)
Trunks             622 Mb/s        41...94       2.5 Gb/s                 0.6/2.5 Gb/s
                   Gb/s            Mb/s
Access rates       155/622 Mb/s    2...155 Mb/s  155/622 Mb/s             2...622 Mb/s
QoS                SVCs, PVCs      PVCs          MPLS                     SVCs, PVCs, MPLS?
Fig. 4: Comparison between G-WiN and other scientific networks.
Abbreviations:
ATM: Asynchronous Transfer Mode
B-WiN: Breitbandwissenschaftsnetz
DTAG: Deutsche Telekom AG
MPLS: Multiprotocol Label Switching
NSF: National Science Foundation
SDH: Synchronous Digital Hierarchy
S/PVC: Switched/Permanent virtual circuit
UCAID: University Corporation for Advanced Internet Development
vBNS: very high-speed Backbone Network services
On Network Resource Management for End-to-End QoS*
Ibrahim Matta
Computer Science Department
Boston University
Boston, MA 02215, USA
matta@cs.bu.edu
Abstract. This article examines issues and challenges in building distributed
Quality-of-Service (QoS) architectures. We consider architectures that facilitate
cooperation between the applications and the network so as to achieve the
following tasks. Applications are allowed to express in their own language their
QoS (performance and cost) requirements. These application-specific requirements
are communicated to the network in a language the network understands.
Resources are appropriately allocated within the network so as to satisfy these
requirements in an efficient, scalable, and reliable manner. Furthermore, the
applications and the network have to cooperate "actively" (or continually) to be
able to achieve these tasks under time-varying conditions.
1 Introduction
Advanced network-based applications have received a lot of attention recently.
An example of such applications is depicted in Figure 1. Here, a number of
scientists are collaborating in a videoconference to control a distributed
simulation. This application requires distributing simulator output to the
scientists, the exchange of audio and video signals among them, etc. Such an
application demands high-quality services from the network. For example, the
simulator data should be distributed to the scientists quickly and with no loss.
The audio and video signals should be transmitted to other participants in a
timely and regular fashion to ensure interactivity, etc. The network in turn
should provide these services while utilizing the network resources efficiently
(to maximize revenue). Also, the network should deliver these services in a
scalable and reliable way.
In this article, we examine architectures designed to provide such advanced
applications with varying QoS. We start by discussing in Section 2 traditional
network services that do not provide QoS support as well as traditional
applications that do not express their QoS or are not aware of the QoS that the
network is providing them with. We will argue why we need the network and
applications to be aware of (and sensitive to) QoS, and we discuss in Section 3
how QoS support is achieved. Then, for these QoS-aware applications and network
to cooperate, we present in Section 4 a generic integrated architecture,
describe its components and discuss its main features. These features include
the scalability of the architecture to large networks and large numbers of
applications or users, and also its robustness to various dynamics. We conclude
in Section 5.

Fig. 1. Example of an advanced network-based application.

* This work was done while the author was with the College of Computer Science
of Northeastern University. The work was supported in part by NSF grants
CAREER ANIR-9701988 and MRI EIA-9871022.
2 QoS-oblivious Architectures
In this section, we discuss traditional network architectures and their
shortcomings due to their lack of QoS support.
2.1 QoS-oblivious Network Services
Traditionally, a network, such as the current Internet, provides only a
best-effort service, which means that the data which applications send can
experience arbitrary amounts of loss or delay. On the positive side, the network
needs to implement only simple mechanisms for traffic control. For example, any
application can access (or send data over) the network at any time. Network
devices (like switches or routers) can serve packets carrying applications'
data in a simple first-come-first-serve fashion. Packets can be routed to their
destination over paths that are optimal with respect to a single metric, for
example paths that have the minimum number of links (or that traverse the
least number of routers). With these mechanisms, however, the application's
QoS may not be met. For example, shortest-distance paths may be loaded
and not provide the least delay and so may not satisfy the requirements of
delay-critical applications. This kind of best-effort network has served quite
well traditional data applications (e.g. Telnet, FTP, E-mail) that have
relatively modest QoS requirements.
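
The point that a minimum-hop path need not be a minimum-delay path can be
illustrated with a small sketch. The three-node topology, the node names and the
link delays below are hypothetical values chosen for illustration only; they do
not come from the article.

```python
import heapq
from collections import deque

# Hypothetical topology (assumption): undirected links with illustrative delays.
LINKS = {
    ("A", "B"): 10,  # direct link, assumed to be heavily loaded
    ("A", "C"): 2,
    ("C", "B"): 3,
}

def neighbors(node):
    """Yield (neighbor, delay) pairs for an undirected link table."""
    for (u, v), delay in LINKS.items():
        if u == node:
            yield v, delay
        elif v == node:
            yield u, delay

def _unwind(parent, dst):
    """Rebuild the path from the parent pointers, source first."""
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def min_hop_path(src, dst):
    """Breadth-first search: fewest links, link delay is ignored."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt, _ in neighbors(node):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return _unwind(parent, dst)

def min_delay_path(src, dst):
    """Dijkstra's algorithm using link delay as the metric."""
    dist, parent = {src: 0}, {src: None}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, delay in neighbors(node):
            nd = d + delay
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], parent[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    return _unwind(parent, dst)

def path_delay(path):
    """Sum of link delays along a path."""
    return sum(LINKS.get((u, v), LINKS.get((v, u))) for u, v in zip(path, path[1:]))
```

Under these assumed delays, the single-metric (hop count) route uses the loaded
direct link, while a delay-aware route takes the two-link detour.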
More advanced applications need additional support, for example, for
multicasting a data packet to many destinations. The network should then
be able to establish a delivery tree rooted at the source and whose branches
lead to the various destinations. In the case of many senders, the network can
build one tree for each sender. This is what current multicast protocols like
DVMRP (Distance Vector Multicast Routing Protocol) [27] and MOSPF
(Multicast Open Shortest Path First) [23] do. The tree usually consists of
paths with minimum delay from the source to each destination. Figure 2
shows an example three-node network with two senders S1 and S2 and with
a receiver at each node. A packet from S1 is replicated on both links to reach
the two other nodes/receivers. Similarly for S2. So this multicast uses 4 links,
i.e. the total cost (in number of links) is 4. This cost may reflect the overhead
of replication or the amount of bandwidth consumed by this communication
group. Also, each packet experiences a maximum delay of going over 1 link.
Fig. 2. Sender-based multicast trees (Cost = 4, Delay = 1).
Other multicast routing protocols would build a single shared tree that
spans all members of the group. A major objective of such protocols is to
minimize the sum of the link costs, or build a so-called "Steiner tree." Since
it is well known that it is computationally expensive (NP-complete) to find
such a minimum-cost tree [11], protocols usually implement heuristics that are
less expensive and give close to optimal trees. With such trees, the goal is to
minimize the cost of replication and bandwidth, possibly at the expense of
higher delays from a source to a destination. Figure 3 shows a shared tree.
Here, the data packet from S2 needs to be generated only once, as opposed to
being replicated in Figure 2 when source-rooted trees are used. This shared
tree uses 3 links only, as opposed to 4 links with source-rooted trees. However,
the packet from S2 experiences a maximum delay of traversing 2 links, as
opposed to 1 link with source-rooted trees. Some protocols like PIM-sparse
[8] and CBT (Core Based Tree) [3] try to achieve a balance between cost
and delay by having a single shared tree with the root at a center node
and minimum-delay paths built from the center to members of the multicast
group.
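
The cost and delay figures quoted for Figures 2 and 3 can be reproduced with a
short sketch. It rests on an assumption of ours: that "cost" counts the distinct
directed links used by all senders, and that the shared tree is the chain
N2 - N1 - N3 with S1 at N1 and S2 at N2. The node names N1, N2, N3 are
hypothetical labels, not taken from the figures.

```python
# Each sender maps to the set of directed links its packets traverse.
# Assumed placement: S1 at node N1, S2 at node N2, N3 hosts only a receiver.
ROOTS = {"S1": "N1", "S2": "N2"}

source_trees = {
    "S1": {("N1", "N2"), ("N1", "N3")},  # S1 replicates on both of its links
    "S2": {("N2", "N1"), ("N2", "N3")},  # S2 replicates on both of its links
}

shared_tree = {
    "S1": {("N1", "N2"), ("N1", "N3")},  # S1 sits in the middle of the chain
    "S2": {("N2", "N1"), ("N1", "N3")},  # S2 sends once; N1 forwards to N3
}

def tree_cost(flows):
    """Number of distinct directed links used by all senders together."""
    return len(set().union(*flows.values()))

def max_delay(flows, roots):
    """Largest hop count from any sender to any node its packets reach."""
    worst = 0
    for sender, links in flows.items():
        depth = {roots[sender]: 0}
        pending = set(links)
        while pending:
            progressed = False
            for u, v in list(pending):
                if u in depth:  # link leaves an already reached node
                    depth[v] = depth[u] + 1
                    pending.remove((u, v))
                    progressed = True
            if not progressed:
                raise ValueError("links do not form a tree rooted at the sender")
        worst = max(worst, max(depth.values()))
    return worst
```

With this accounting, the source-rooted trees give cost 4 and delay 1, and the
shared tree gives cost 3 and delay 2, matching the numbers in the text.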
Fig. 3. A shared multicast tree (Cost = 3, Delay = 2).
One can easily see that regardless of what type of tree a best-effort multicast
routing protocol builds, this tree may not be appropriate. This depends
on several factors such as the location of the group members, the layout
(topology) of the network, the requested QoS, etc. Figure 4(a) shows a case
where a source-based tree should be built rather than a shared tree (with
boldfaced links) as it costs less (using 2 links as opposed to 3). Figure 4(b)
shows a case where a shared tree achieves lower cost (using 2 links as opposed
to 3). This illustrates the need for adaptive network services that establish
appropriate structures so as to efficiently utilize the network resources as
well as satisfy the QoS requirements of applications. We later discuss such
QoS-sensitive multicast routing protocols.
Fig. 4. (a) A source-based tree costs less, (b) a shared tree costs less.
(S: source; R1, R2: receivers.)
2.2 QoS-oblivious Applications
Just as we traditionally had networks that are insensitive to various parameters
describing the state of the network and applications, we traditionally had
applications that are insensitive to the state of the network and to the kind of
QoS they are getting from it. A major consequence of this is that the
application can experience "arbitrary" QoS. For example, under loaded network
conditions, a video application can start losing its frames and suffering
arbitrary degradation in quality. If the application had adapted its coding
strategy to further compress its data and thus sent fewer frames, this might
have increased the likelihood that the transmitted frames make it through
the network. As a result, the application would get a consistent service
(although of lesser quality due to compression). Another example of a
QoS-oblivious application is one that "arbitrarily" assigns a distributed
computation over the network.
Figure 5 shows a video example where the decoder (receiver) could decode
either MPEG- or JPEG-coded streams. MPEG is more expensive to decode
than JPEG. Thus, if there are enough computation cycles, and communication
is expensive, we could just send MPEG-coded video and have a matching
MPEG decoder, where video traffic is routed over the shortest (one-link)
path. On the other hand, if there are not enough cycles and communication
is cheap, we could use the decoder in JPEG mode to reduce computation cost
and transparently insert an intermediate hardware-based (computationally
inexpensive) transcoder that translates from MPEG format to JPEG format.
Here, video traffic is routed through the transcoder over a longer (two-link)
path, which is acceptable as communication is assumed to be cheap. This
example illustrates the advantages of having applications that could adapt
their operation mode based on the state of the system.
Fig. 5. Video transport example. (Components: Video Source (MPEG), Transcoder
(MPEG to JPEG), Video Display (MPEG or JPEG). Cases: Cheap Computation &
Costly Communication; Costly Computation & Cheap Communication.)
3 QoS-sensitive Architectures
Now that we have argued for the flexibility and potential benefits of
QoS-sensitive network and applications over traditional ones, we next discuss
various issues and challenges that must be addressed to implement them.
The network has to distinguish among traffic streams (or flows) requiring
different QoS. A flow generally defines a stream of data packets that belong
to the same application or a pre-defined aggregated traffic class. The works
of several standardization groups, such as the IETF integrated-services [6]
and differentiated-services [4] working groups and the ATM Forum Traffic
Management working group [14], are based on allocating each traffic flow a
given (absolute or relative) share of resources (bandwidth, buffers, etc.). This
provides different flows with different services: better (but typically more
expensive) service to higher priority flows at the expense of worse (and
usually cheaper) service to lower priority flows.
Figure 6 shows a general architecture of a QoS network. A QoS manager
is responsible for receiving requests for some QoS through some signaling
protocol, such as the Internet RSVP protocol [7] or the ATM signaling protocol.
The manager communicates with a routing component to find the outgoing
link(s) or path that can likely satisfy the request and over which the flow
would be routed. This path selection is typically based on an outdated view
of the state of the network. The manager then communicates with an
admission control component which decides whether indeed there are enough
resources on the selected link(s) to satisfy the given QoS without violating
the QoS already promised to existing flows. If the flow request is accepted
(admitted), the QoS manager installs appropriate state in other components:
a classifier that recognizes packets belonging to the flow; a route lookup
component that forwards the flow's packets over the selected path; and a shaper
(or dropper) that shapes the flow (or drops excess traffic) according to the
initial traffic specifications declared by the request and based on which
resource allocations have been made. Finally, the manager has to set the
parameters of the scheduler to allocate to the flow the resources that are
needed to satisfy the requested QoS.
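
The admission control step described above can be sketched as follows. This is a
minimal illustration under our own assumptions: QoS is reduced to a single
reserved rate per flow, and the `Link` class and `admit` function are
hypothetical names, not part of RSVP or ATM signaling.

```python
class Link:
    """A link with a fixed capacity and a running total of reserved rate."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = 0.0

def admit(path_links, rate):
    """Admit a flow only if every link on the selected path can carry its
    declared rate on top of what is already promised to existing flows."""
    if any(link.reserved + rate > link.capacity for link in path_links):
        return False  # reject: existing guarantees would be violated
    for link in path_links:
        link.reserved += rate  # install the reservation on each link
    return True
```

A rejected request leaves all reservations untouched, so flows admitted earlier
keep the resources they were promised.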
Fig. 6. A general architecture of a QoS network. (Components shown: QoS
requests, Routing, QoS Manager, Admission, Classifier, Route Lookup, Dropper.)
Figure 7 shows the architecture of a general scheduler and a traffic shaper.
The scheduler isolates different traffic flows by allocating them fractions of
the link's resources (bandwidth, buffer space). The traffic of a flow is shaped
before entering the scheduler to compete for resources. The shaper shown
is called a "token bucket shaper." Tokens accumulate (fill the bucket) at some
specified rate. A packet is allowed to enter the scheduler only if there are one
or more tokens to drain. The depth of the bucket allows the flow to burst (i.e.
packets can enter back-to-back), but then the rate at which data enters so as
to contend for resources is bounded by the rate at which tokens are generated.
This bounded traffic specification makes it possible to test whether a new flow
could be supported by considering its worst-case behavior.
Fig. 7. A general scheduler with a traffic shaper.
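
The token bucket behavior described above can be sketched in a few lines. The
class and parameter names are our own, one token per packet is an assumption,
and time is passed in explicitly so that the example is deterministic.

```python
class TokenBucket:
    """Token bucket shaper: `rate` tokens per second, at most `depth` tokens."""

    def __init__(self, rate, depth):
        self.rate = rate        # token generation rate (tokens per second)
        self.depth = depth      # bucket depth = maximum burst size
        self.tokens = depth     # assumption: the bucket starts full
        self.last = 0.0         # time of the previous update (seconds)

    def allow(self, now, cost=1):
        """Return True if a packet arriving at time `now` may enter the scheduler."""
        # Refill for the elapsed time, never exceeding the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost  # drain tokens for this packet
            return True
        return False             # no token available: hold (or drop) the packet
```

With `rate=1` and `depth=3`, three packets may enter back-to-back at time 0, a
fourth is held, and one more is allowed a second later, which is the bounded
worst-case behavior that admission control can rely on.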
As for routing (multicast) traffic, the routing protocol should choose a
path that satisfies the QoS requirements of (all) the receiver(s). Thus, in
multicasting, the type of multicast tree that should be constructed depends
on the state of the network. Consider the example in Figure 8, where the
numbers shown reflect the maximum link delays in the presence of a
three-participant application, two of which send and receive at nodes A and C.
Assume the application's QoS requirements are maximum end-to-end delay
of 13 from the sender to any receiver, and jitter (or maximum difference
between individual end-to-end delays) of 7. A shared tree would violate the
application's delay and jitter requirements since the maximum end-to-end
delay is 20 and the jitter is 10. Thus, under this network state, source-based
trees should be constructed. Figure 9 shows the network in a different state,
where a shared tree should be constructed.
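
The delay and jitter test applied in this example can be written down directly.
In the sketch below, the bounds (13 and 7) and the shared-tree figures (maximum
delay 20, jitter 10, hence a minimum delay of 10) are taken from the text; the
per-receiver delays used for the source-based trees are hypothetical values,
since the figure is not legible in this copy.

```python
def satisfies_qos(delays, delay_bound, jitter_bound):
    """Check end-to-end delays from one sender to its receivers against a
    maximum-delay bound and a jitter bound (largest minus smallest delay)."""
    worst = max(delays)
    jitter = worst - min(delays)
    return worst <= delay_bound and jitter <= jitter_bound

DELAY_BOUND, JITTER_BOUND = 13, 7

shared_tree_delays = [10, 20]   # consistent with max delay 20 and jitter 10
source_tree_delays = [9, 10]    # hypothetical values for illustration only
```

A tree type is acceptable only if every sender's delays pass this check, which
is why the shared tree is rejected in this network state.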
Fig. 8. (a) Original network, (b) shared tree, (c) source-based trees. Example where
source-based trees should be constructed.
Clearly, to be QoS-sensitive, protocols have to account for the various
dynamics at all levels: application, host, and network. Protocols have to be
aware of the current participants in the application. Who communicates with
whom? Is a participant aware of the quality of her transmission so she can
adapt to varying qualities? How is data generated during this transmission?
What are the delivery requirements? How many resources are available at
hosts (or end-systems) and at the network (switches/routers)? What does the
layout (topology) of the network look like now? By accounting for all these
dynamics, we would have a system that is capable of satisfying a wide variety
of QoS requirements for diverse applications at all times, while operating
efficiently.

Fig. 9. (a) Original network, (b) source-based trees, (c) shared tree. Example where
a shared tree should be constructed.
Going back to multicast routing as an example of network control, the
system may migrate from one multicast tree to another in response to changes
at the application, host or network level. Figure 10 shows an example where,
if the multicast protocol builds a single shared tree, then 5 links are used, and
only S2's packet needs to be replicated. Also, if, say, each source transmits
at x bits/sec, then to guarantee the rate for R2 to receive from both senders, we
need to reserve 2x bits/sec on link (S2, R2). If the available bandwidth on
(S2, R2) were less than 2x, then a QoS network would not admit this
application. However, if the multicast routing protocol adapts its tree
construction mechanism and switches to source-based trees, this application
could be admitted since the traffic is now better distributed over the network.
Thus, it is sometimes beneficial to switch to building a new type of tree.
Building source-based trees in the previous configuration allowed for
admitting the application, although at the expense of more replication and
more routing state information, as we have to maintain 2 trees (one for each
source) as opposed to 1 tree. This previous example also contradicts the
common view that a minimum-cost shared tree reduces the amount of bandwidth
consumed compared to sender-based trees, especially when we have multiple
senders, as the path from some sender to a receiver may turn out to be very
long. Thus, a QoS system must employ a more intelligent tree construction
strategy that adapts the shape of the tree dynamically, where we can trade
off between revenue from QoS support and cost of overhead to provide this
support. We present an example of such a strategy later.

Fig. 10. (a) Original network, (b) shared tree (Cost = 5), (c) source-based trees
(Cost = 4). Example illustrates benefits of adaptive multicast tree construction.
4 Integrated QoS Architecture
The objective is to build an architecture that allows for protocols to adapt
their behavior so as to account for various parameters and dynamics. The
architecture should facilitate the exchange of information between the
applications and network; the application should express an acceptable QoS
region and be able to adapt its behavior to one that matches the level of
QoS that the network currently delivers. The network should of course
communicate that QoS level to the application. The goal is to efficiently utilize
the network (or maximize its revenue) under the QoS constraints imposed by
existing applications.
Figure 11 shows a generic integrated QoS architecture. In this architecture,
applications express their QoS requirements in application-specific
terms to an application-specific QoS manager. This manager understands the
various attributes of a specific type of application, and maps the
application-dependent requirements into application-independent and
implementation-independent requirements. A host QoS manager maps these in turn
into implementation-dependent requirements so that necessary resources are
allocated to the application by the operating system and network subsystem
at the hosts as well as by router QoS managers within the network. QoS
managers communicate with their peers or directly with their neighbor
managers to coordinate the allocation of resources. Router QoS managers control
the allocation of paths within the network by communicating with routing
managers, which may use different types of routing protocols to locally or
globally build paths that are capable of satisfying the QoS requirements of
applications.
Fig. 11. Integrated QoS architecture.
Such an architecture allows for supporting QoS between the endpoints of
communication (i.e. the applications) in the presence of various dynamics.
The QoS is controlled in a manner sensitive to the specifics of the application
so that the system can successfully and efficiently deliver targeted QoS. The
applications and network are allowed to exchange information so as to adapt
mutually to system dynamics. In the following subsections, we elaborate on
the delivery of targeted (application-oriented) QoS and the application's
capability to be aware of (and adaptive to) system state. Also, as an example
of network control, we elaborate on building multicast routing trees that are
sensitive to QoS.
We finally discuss mechanisms that should be in place in order to scale
to large networks and to provide stability and reliability in the presence of
changes in the state of the system.
4.1 Application-oriented QoS Mappings
An end-to-end QoS architecture has to deal with mapping the logical view of
the applications to a physical allocation of resources. This involves
application-specific QoS managers that take as input abstract QoS requests, such
as an application input graph with various tasks and interdependencies between
them. Through a protocol to discover the state and location of physical
resources (this knowledge is typically outdated), the QoS manager produces as
output the needed physical resources that are likely to satisfy the
application's QoS in an efficient manner. See Figure 12. This in turn involves
host QoS managers that would communicate with other control entities at hosts
and within the network using some signaling protocol to finally allocate the
physical resources.
Fig. 12. Application-oriented QoS mapping. (Components: Resource Request
Handling, Resource Discovery, Resource Selection and Optimization.)
Application-oriented Resource Selection: Given a (typically outdated)
view of the system, which physical resources should be selected to satisfy an
application-level QoS request? The answer to this question depends on the
nature of the application. For example, consider an application that is
real-time in nature and requires the scheduling of real-time tasks on a set of
hosts. A real-time task needs to be reserved some amount of CPU cycles to meet
a deadline. The host that is selected may end up rejecting the assigned task if
the host finds that it does not currently have enough cycles (capacity). For
this application, the major objective is to minimize the task rejection rate. A
traditional load distribution scheme is "load balancing." The goal here is to
equalize the load over candidate hosts. However, as shown in Figure 13, this
load-balancing strategy may result in capacity fragmentation and so higher
rejection. Thus, although load balancing is adequate for providing best-effort
QoS (optimizing average measures), it is not adequate for providing guaranteed
QoS (optimizing real-time measures). A more appropriate scheme is load
profiling [20]. The idea here is to have a more diverse profile of available
capacity on the candidate hosts, so we increase the likelihood of finding a
feasible host for future requests.
Fig. 13. Load balancing versus load profiling. (Panels: Load-balanced System and
Load-profiled System, each showing alternative resources 1 to 6.)
Figure 14 shows how after choosing the least-loaded host (with idle
capacity of 15) for a class-1 task, we can then only accept 4 consecutive class-2
tasks. On the other hand, if we choose the most-loaded host (with idle
capacity of 11), we can then accept 5 consecutive class-2 tasks, as we won't
have fragmented capacity in the system. Choosing the most-loaded candidate
is a "load packing" strategy, which is only asymptotically optimal for
large systems and accurate feedback about system state. In a distributed
system with delayed (inaccurate) feedback, a strategy that has the same effect
but operates probabilistically is less sensitive to the inaccuracies in feedback
information and is more appropriate. We call it a "load profiling" strategy.
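
The fragmentation effect in this example can be replayed with a small sketch.
The idle capacities (11 and 15) come from the text; the request sizes (10 for
class 1 and 3 for class 2) are our assumptions, chosen because they reproduce
the counts of 4 and 5 quoted above. The function names are hypothetical.

```python
def place(idle, request, policy):
    """Return the idle capacities after placing `request` on one feasible host,
    or None if no host can take it."""
    feasible = [i for i, cap in enumerate(idle) if cap >= request]
    if not feasible:
        return None  # request rejected
    if policy == "balance":   # least-loaded host = most idle capacity
        pick = max(feasible, key=lambda i: idle[i])
    elif policy == "pack":    # most-loaded feasible host = least idle capacity
        pick = min(feasible, key=lambda i: idle[i])
    else:
        raise ValueError(policy)
    after = list(idle)
    after[pick] -= request
    return after

def consecutive_accepts(idle, request):
    """Number of back-to-back requests of size `request` that still fit."""
    return sum(cap // request for cap in idle)

IDLE = [11, 15]          # idle capacities from the text
CLASS1, CLASS2 = 10, 3   # assumed request sizes (not stated legibly in the source)
```

Load balancing leaves capacities (11, 5) and room for 3 + 1 class-2 tasks, while
load packing leaves (1, 15) and room for 5, because the remaining capacity is
concentrated on one host rather than fragmented across two.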
Fig. 14. Example illustrates difference between (a) load balancing and (b) load
packing/profiling. (Hosts shown with IdleCapacity = 11 and IdleCapacity = 15.)
The main idea behind load profiling is illustrated in Figure 15, where the
probabilities of selecting each candidate resource are adjusted so we would
bring the distribution of QoS requests as close as possible to the distribution
of available capacity. This is the well-known supply-demand matching
problem.
The gain from load profiling (due to reducing fragmentation) is more
significant when we have large requests, which is especially the case when a
request represents the aggregate of many micro-requests. This gain over load
balancing is also more pronounced as the system becomes more loaded.
Extended models that consider the lifetimes of tasks, the costs of migration of
tasks, etc. require more careful and more complicated analysis. So in summary,
resource selection is an important and difficult problem: how to select
resources so as to optimize some application-oriented measure(s) subject to
QoS constraints and possibly other constraints on the type of resources,
interdependencies between tasks to be assigned, etc. This is a multi-constrained
optimization problem that needs fast heuristics that can produce high-quality
solutions.
4.2 Network-aware Appl i cati ons
To make resource selection easier, applications should specify a range of ac-
ceptable QoS if possible. This has many other benefits: most i mport ant l y is
Percentage of resources as a
function of available capacity
(Desired) (Current)
Probabilistic selection of resources
QoS demands ~ Available resources
0.50
211
0.00 0.25 0.50 0.75 1.00
Available Capacity
Fig. 15. Maintaining a resource availability profile that matches the characteristics
of QoS requests.
that if the requested QoS can not be delivered by the network in a strictly
guaranteed manner, then the application can adapt to the currently delivered
QoS in a controlled way. For example, the application could send fewer dat a
that have a chance of making it through the network and so the application
gets a consistent (although of lesser quality) service. To do this, application-
specific QoS managers need to maintain for applications QoS measures of
interest to them, and inform applications about the current QoS operating
point so that applications adapt accordingly. Applications could also main-
tain quality by compensating for QoS violations. For example, knowing the
loss rate at the receiver, a video source could adjust its FEC (Forward Error
Correction) error recovery scheme to compensate for errors. QoS managers
could also try to hide QoS violations from applications, for example, by re-
ducing latency through caching or prefetching requested data, by overlapping
communication and computation, by migrating processes to where the data
resides, etc.
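The FEC adaptation mentioned above can be illustrated with a minimal sketch. It assumes a block code in which r redundant packets protect k data packets, and it sizes r from the loss rate reported by the receiver. The function name, the safety margin and the sizing rule are illustrative assumptions, not part of any QoS manager described in this article.

```python
import math

def parity_packets(k, loss_rate, margin=1.5, max_parity=None):
    """Hypothetical sizing rule for FEC redundancy.

    Chooses the smallest r such that the expected number of lost packets
    in a block of k + r packets, inflated by `margin`, does not exceed r:
        r >= p * (k + r)  with  p = margin * loss_rate
    which gives r >= p * k / (1 - p).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    p = margin * loss_rate
    if p >= 1.0:
        raise ValueError("loss rate too high for this sizing rule")
    r = math.ceil(p * k / (1.0 - p))
    if max_parity is not None:
        r = min(r, max_parity)  # cap the overhead the source is willing to pay
    return r
```

A video source would call such a function each time the QoS manager reports a new operating point, trading extra bandwidth for fewer unrecoverable losses.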
4.3 QoS Multicast Tree Construction
Another important component of an end-to-end QoS architecture is efficient
network services that are sensitive to the QoS requested by applications.
One important service is multicast routing. A major goal here is to build a
delivery multicast tree with minimum cost and which satisfies QoS delivery
constraints. Again, this is a multi-constrained optimization problem and we
need good and fast heuristics.
An example heuristic is called QDMR (QoS Dependent Multicast Routing)
[15]. A nice feature of this heuristic is that it constructs a low-cost tree
using a greedy strategy that augments the partially constructed tree with
nodes of minimum cost. However, since this can lead to paths that violate
QoS delay bounds, the tree construction policy is adapted on the fly to give
up some cost savings so as to increase the likelihood of satisfying the QoS
delay bound. Figure 16 illustrates the idea.
[Figure 16, panels (a) to (d); delay bound = 6; links are labeled c/d (cost/delay).
(a) Low-cost tree: Cost(v) = I(u) * Cost(u) + C(u,v), where I(u) = 0 if u is a
receiver and 1 otherwise.
(b) Tree of least-delay paths.
(c) Tree of (a) with infeasible paths replaced by least-delay paths; TreeCost = 11.
(d) QDMR tree: Cost(v) = I(u) * Cost(u) + C(u,v), where I(u) = Delay(u)/DelayBound
if u is a receiver and 1 otherwise; TreeCost = 8.
Legend: source node, non-destination nodes, destination nodes, network links,
tree links, removed links, added least-delay links.]
Fig. 16. QoS-aware multicast tree construction.
Figure 16(a) shows a low-cost tree construction policy. The cost of a new
node v, denoted by Cost(v), is defined in terms of Cost(u), the cost of node
u that is already on the tree, and C(u,v), the cost of the link from u to v.
Cost(u) does not contribute to Cost(v) if u is a receiver. The idea is to give
priority to tree paths going through destination nodes so they are extended
to add new nodes (as they would likely have lower cost). By leveraging the
cost of reaching a destination to reach other destinations, the total cost of
the tree is lowered. This, however, may violate the requested delay bound
(which is the case for destination nodes D3 and D4). In Figure 16(b), the
(sender-based) tree of least-delay paths is shown, which would satisfy the
delay bound if this were indeed feasible. However, the tree cost is high.
In Figure 16(c), we modify the tree in (a) by replacing the infeasible paths
for destinations D3 and D4 by the corresponding least-delay paths so as to
obtain a tree that satisfies the delay bound. However, QDMR can generate a
lower cost feasible tree as shown in Figure 16(d). The cost of a new node to
be added to the current tree depends on how far we are from violating the
delay bound. Paths through destination nodes are no longer given priority
as we get closer to violating the delay bound. This makes the tree "bushier"
and the delay bound is satisfied at a lower cost.
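The two node-cost rules discussed for Figure 16 can be written down as a short sketch. The first function follows the rule stated in the text (the cost of a receiver does not propagate). The second assumes the scaling factor Delay(u)/DelayBound as read from the figure. The function names are illustrative, and the remaining steps of QDMR in [15] are not shown.

```python
def low_cost_rule(cost_u, link_cost, u_is_receiver):
    """Cost(v) for the low-cost policy of Figure 16(a).

    Cost(u) does not contribute to Cost(v) when u is a receiver, so paths
    through destination nodes are always preferred.
    """
    indicator = 0.0 if u_is_receiver else 1.0
    return indicator * cost_u + link_cost


def qdmr_rule(cost_u, link_cost, u_is_receiver, delay_u, delay_bound):
    """Cost(v) for the delay-sensitive policy of Figure 16(d), as assumed here.

    For a receiver u, Cost(u) is scaled by Delay(u)/DelayBound: close to 0
    when u is far from the bound (priority kept), close to 1 when u is about
    to violate it (priority lost).
    """
    if delay_bound <= 0:
        raise ValueError("delay_bound must be positive")
    indicator = delay_u / delay_bound if u_is_receiver else 1.0
    return indicator * cost_u + link_cost
```

With a delay bound of 6, a receiver reached with delay 3 passes on half of its cost to candidate children, which is how the tree gradually becomes "bushier" as the bound is approached.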
4.4 Scalability
Scalability is another important aspect of an end-to-end QoS architecture,
especially for large wide-area systems. One main goal is to reduce the view
a QoS manager has about the state of the system, and based on which it
schedules resources (hosts, paths, etc.). This manager could be a sender or a
receiver or an agent on behalf of the application, depending on where resource
selection/allocation is done. A key to scalability is to separate the way the
view is collected from how resources are selected. One way is to have
predefined classes of applications and collect class statistics to attach to the
view, as opposed to statistics about individual applications, for example, the
total capacity used by a class rather than the individual capacities used by
each application.
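As a small illustration of class statistics, the following sketch collapses per-application records into one total per class. The record layout (application id, class name, capacity used) is a hypothetical choice made only for this example.

```python
from collections import defaultdict

def class_capacity(flows):
    """Aggregate capacity per application class.

    flows: iterable of (app_id, app_class, capacity_used) tuples.
    Returns a dict mapping each class to its total used capacity, so the
    view grows with the number of classes, not the number of applications.
    """
    totals = defaultdict(float)
    for _app_id, app_class, capacity in flows:
        totals[app_class] += capacity
    return dict(totals)
```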
Another approach is to have special control entities, called view-servers
[1], where each view-server maintains only a small view of its surrounding
area, as opposed to a full view of the whole system. If a larger area is needed,
more than one view-server could be queried and their views merged. Figure 17
illustrates the idea of view-servers.
Another more traditional approach is area-based, where nodes are grouped
into level-1 areas, level-1 areas are grouped into level-2 areas, etc. See
Figure 18. This is, for example, the scaling approach used in PNNI ATM routing
[13]. The idea is that a node has only a detailed view of its own area, and
less detailed (summarized or aggregated) views of remote areas, i.e. a
summarized view of level-1 areas in the same level-2 area, of level-2 areas in the
same level-3 area, and so on. Different schemes could be used to aggregate an
Fig. 17. Viewserver hierarchy.
area. For example, an area may be represented by a fully connected logical
graph connecting all B border nodes (those nodes connecting the area to
other areas), or by a logical star graph with a virtual node in the center, or
by a logical single node. The accuracy of the view, which is presented to the
resource selection process, decreases with more aggressive aggregation, in
exchange for lower overhead.
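The overhead of the three aggregation schemes can be compared with a short sketch that counts the logical items advertised for an area with B border nodes. The counting convention (logical links for the mesh and the star, one item for the single node) and the scheme names are assumptions made for illustration.

```python
def advertised_items(num_border_nodes, scheme):
    """Number of logical items needed to represent an aggregated area.

    full-mesh   : one logical link per pair of border nodes, O(B^2)
    star        : one logical link from each border node to a virtual
                  center node, O(B)
    single-node : the whole area collapses into one logical node, O(1)
    """
    b = num_border_nodes
    if b < 1:
        raise ValueError("an area needs at least one border node")
    if scheme == "full-mesh":
        return b * (b - 1) // 2
    if scheme == "star":
        return b
    if scheme == "single-node":
        return 1
    raise ValueError(f"unknown scheme: {scheme}")
```

For an area with 10 border nodes this gives 45, 10 and 1 items respectively, which makes the trade-off between accuracy and overhead concrete.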
4.5 Robustness
Reliability is another important aspect of an end-to-end QoS architecture.
This involves replication of important control entities to survive their
failures. It involves avoiding oscillations between alternative configurations that
would arise because of the performance interdependencies among the various
applications. This could happen if we blindly honor QoS requests, violating
existing QoS promises which are then reinstated by again violating other
promises. Also, robustness involves reliable switchover to new configurations.
For example, switching to a new multicast tree may require keeping trans-
mission over the old tree until the new tree is fully established and the new
QoS can be reliably delivered.
5 Conclusion
This article surveys some of the grand challenges in building integrated end-
to-end QoS architectures. As we have seen, this involves a plethora of issues in
finding fast and good heuristics, defining secure interfaces between different
components, investigating the interactions between these components
horizontally and vertically, etc. Another important issue is how to develop such
complex software in an easy and reusable way. One recent approach is
aspect-oriented programming [18], which differs from the traditional object-oriented approach
in that different aspects of the application or protocol, such as communi-
cation, core behavior, structure, etc. are not tangled together, which makes
maintenance much easier.
[Figure 18: (a) area hierarchy with border nodes, level-1 areas, logical links
and physical links, showing the summarized information in the view of a node
in area A; (b) aggregation of area C as a full mesh (O(B^2) overhead), a
simple node (O(1) overhead) or a star (O(B) overhead).]
Fig. 18. (a) Area hierarchy, (b) aggregation of area C.
Research and development efforts to build an end-to-end QoS global system
are necessarily multi-disciplinary. For such a system to become reality,
solutions to different problems have to be integrated into a flexible QoS
architecture, and the overall performance and cost be evaluated. Initiatives,
such as Internet2 [17] and NGI (Next Generation Internet) [16], are providing
the infrastructure to deploy and test such advanced architectures.
References
1. C. Alaettinoglu, I. Matta, and A.U. Shankar. A Scalable Virtual Circuit
Routing Scheme for ATM Networks. In Proc. International Conference on
Computer Communications and Networks - ICCCN '95, pages 630-637, Las Vegas,
Nevada, September 1995.
2. C. Aurrecoechea, A. Campbell, and L. Hauw. A Survey of QoS Architectures.
ACM/Springer Verlag Multimedia Systems, Special Issue on QoS Architecture,
May 1998.
3. A. Ballardie, P. Francis, and J. Crowcroft. Core Based Trees. In Proc.
SIGCOMM '93, San Francisco, California, September 1993.
4. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An
Architecture for Differentiated Services. RFC 2475, December 1998.
5. J-C. Bolot. Adaptive Applications Tutorial. ADAPTS BOF, IETF Meeting,
Washington DC, December 1997.
6. B. Braden, D. Clark, and S. Shenker. Integrated Services in the Internet
Architecture: An Overview. Internet Draft, October 1993.
7. B. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin. Resource ReSerVation
Protocol (RSVP) - Version 1 Functional Specification. Internet Draft, March
1996.
8. S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, and L. Wei. Protocol
Independent Multicast (PIM): Protocol Specification. Internet Draft, 1995.
9. S. Fischer, A. Hafid, G. Bochmann, and H. de Meer. Cooperative QoS
Management for Multimedia Applications. In Proc. Fourth IEEE International
Conference on Multimedia Computing and Systems (ICMCS'97), pages 303-310,
June 1997.
10. I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit.
Intl. J. Supercomputing Applications, 11(2):115-128, 1997.
11. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the
Theory of NP-Completeness. W.H. Freeman and Company, New York, 1979.
12. A. Grimshaw, W. Wulf, and Legion Team. The Legion Vision of a Worldwide
Virtual Computer. Communications of the ACM, 40(1), January 1997.
13. ATM Forum PNNI Subworking Group. Private Network-Network Specification
Interface v1.0 (PNNI 1.0). Technical report, March 1996.
14. ATM Forum Traffic Management Working Group. ATM Forum Traffic
Management Specification v4.0. Technical report, 1996.
15. L. Guo and I. Matta. QDMR: An Efficient QoS Dependent Multicast Routing
Algorithm. Technical Report NU-CCS-98-05, College of Computer Science,
Northeastern University, Boston, MA 02115, August 1998. To appear in IEEE
RTAS '99.
16. NGI (Next Generation Internet). http://www.ngi.gov.
17. Internet2. http://www.internet2.edu/.
18. G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J-M. Loingtier,
and J. Irwin. Aspect-Oriented Programming. In Proc. European Conference
on Object-Oriented Programming, pages 220-242. Springer Verlag, 1997.
19. J. Kurose. Open Issues and Challenges in Providing Quality of Service
Guarantees in High-Speed Networks. ACM Computer Communication Review, January
1993.
20. I. Matta and A. Bestavros. A Load Profiling Approach to Routing Guaranteed
Bandwidth Flows. In Proc. IEEE INFOCOM, March 1998. Extended version in
European Transactions on Telecommunications - Special Issue on Architectures,
Protocols and Quality of Service for the Internet of the Future, February-March
1999.
21. I. Matta and M. Eltoweissy. A Scalable QoS Routing Architecture for
Real-Time CSCW Applications. In Proc. Fourth IEEE Real-Time Technology and
Applications Symposium (RTAS'98), June 1998.
22. I. Matta, M. Eltoweissy, and K. Lieberherr. From CSCW Applications to
Multicast Routing: An Integrated QoS Architecture. In Proc. IEEE International
Conference on Communications (ICC'98), June 1998.
23. J. Moy. Multicast Extensions to OSPF. Internet draft, Network Working
Group, September 1992.
24. K. Nahrstedt and J. Smith. The QoS Broker. IEEE Multimedia, 2(1):53-67,
Spring 1995.
25. P. Steenkiste, A. Fisher, and H. Zhang. Darwin: Resource Management in
Application-Aware Networks. Technical Report CMU-CS-97-195, Carnegie
Mellon University, December 1997.
26. H. Topcuoglu, S. Hariri, D. Kim, Y. Kim, X. Bing, B. Ye, I. Ra, and J. Valente.
The Design and Evaluation of a Virtual Distributed Computing Environment.
Cluster Computing, 1(1):81-93, 1998.
27. D. Waitzman, C. Partridge, and S. Deering. Distance Vector Multicast Routing
Protocol. Request for Comments RFC-1075, November 1988.
28. J. Zinky, D. Bakken, and R. Schantz. Architectural Support for Quality of
Service for CORBA Objects. Theory and Practice of Object Systems, 3(1):55-73,
1997.
A Prototype of a Combined Digital and
Retrodigitized Searchable Mathematical
Journal
Gerhard O. Michler
Institute for Experimental Mathematics, Essen University, Ellernstr. 29, 45326
Essen, Germany
1 Introduction
Recently, many mathematical journals have begun to appear in both digital and
paper form. The digital version offers many advantages. It is searchable,
and in due course it will be technically possible for a quoted article
to be retrieved and partially shown in a second window on screen, provided
it is part of a searchable digital library system. Over scientific wide
area networks like the German B-Win all digital libraries of the connected
universities can be combined to a national distributed research library, and
this is about to happen in Germany and in many other countries. Once the legal
problems concerning the authentication of the subscribers to a distributed
on-line digital library system are solved, the authorized members
of a German university or research institute will be able to view, read and
print the wanted text of digital issues of an available scientific journal at
their personal computers. In the future they will also want to search the whole
distributed research library, and not only the recent articles. Therefore
the Deutsche Forschungsgemeinschaft (DFG) has provided financial support
for the establishment of two Centers for Retrospective Digitization, one at
the State and University Library at Göttingen and the other at the Bavarian
State Library Munich.
The mathematicians are lucky, because the Göttingen library has long been
the DFG-Sammelstelle for Mathematics, which means that almost
all essential mathematical journals and books have been collected there with
financial support of the Deutsche Forschungsgemeinschaft. If all the digital
versions of the mathematical journals were collected at the digital library
in Göttingen, and the document management system AGORA [2] were
implemented at the Göttingen Center for Retrospective Digitization, then the
authorized German mathematicians could use this searchable digital library
from their workstations at their university or research institute.
Of course, there is still a long way to go to achieve this ideal situation. Besides
financial and difficult legal problems there are also a lot of technical problems
which still have to be solved. It is the latter point of view which will be
addressed in this article.
2 A prototype mathematical text recognition system
Since 1997 my study group has been cooperating with the publisher Birkhäuser
(Basel) in order to retrodigitize 6 volumes of the Archiv der Mathematik
published by Birkhäuser in the years 1993 till 1996. This is a widely
distributed mathematical journal which publishes short articles from all major
areas of mathematics. Furthermore, its typesetting is excellent. This offers
a chance for the use of optical character recognition (OCR) systems for the
retrospective digitization task.
Many ordinary texts can automatically be retrodigitized by means of
commercial OCR systems. In [1] you can read: "With Adobe Acrobat Capture
2.01 software, you can easily turn volumes of paper into searchable Portable
Document Format (PDF) libraries. It's perfect for forms, manuals,
specifications, books - any important document you need to make accessible on your
Web site or intranet." Besides Adobe [1], there are also many other
commercial OCR systems, like the one of the World Scientific Publishing Company
[4], and FineReader [8].
What is so special with mathematical texts? A convincing answer is given
by R. Fateman in his article [5]. There he writes in the introduction:
"Conventional OCR programs have low accuracy for mathematics for several reasons.
The very sensible heuristics typically used for text recognition include
computing the locations of text lines and estimating character sizes using global
statistics as well as local processing. These programs may also use
language-based statistics (perhaps a spelling dictionary) as tools to improve recognition
rates. By contrast, mathematics is not necessarily arranged on lines, its
character sizes vary, the letter and symbol frequencies are distinct from normal
text, and many other text-oriented heuristics are directly counter-productive.
Additionally, even if the mathematics were somehow recognized, conventional
OCR programs whose traditional output is (say) ASCII text, need to be
substantially augmented with some meta-level language before they can express
'math results' as their output. Although most advanced word-processing
programs have some escape mechanism for 'doing' mathematics, there is still no
uniform standard for expressing two-dimensional layouts, subscript
positioning, variable-sized characters, unusual math operators, etc."
In 1997 the Deutsche Forschungsgemeinschaft agreed to support the
research project "Retrodigitization of the mathematical journal Archiv der
Mathematik" for one and then for two more years. Its purpose is to
retrodigitize the 6 volumes of this journal which appeared from 1993 to 1996.
Without the legal support of the publisher Birkhäuser this project could not
have been started. The outcome will be a prototype of a text recognition
system which allows one to search in the ordinary text of an originally printed
article. Furthermore, it produces a version of the mathematical formulas and
symbols that reflects the semantics of the mathematical part of the text such
that the retrodigitized formulas are written in tex format. Thus they can be
incorporated into new mathematical manuscripts.
Many computer algebra systems like AXIOM, MAPLE, MATHEMATICA or
MAGMA allow symbolic formula manipulation and can display the
calculated formulas as typeset expressions. Unfortunately, at the moment
the computer algebra systems are not able to read in digitized mathematical
formulas. It is hoped that this deficiency will be overcome in the future. Then
the retrospective digitization text systems will become even more important
for mathematical research.
2.1 Recognition of mathematical expressions and formulas
In [3], [6] and [7] R. Fateman, T. Tokuyasu and coauthors have described
a package of LISP programs for optical mathematical formula recognition
and translation of a scanned mathematical text into digital LISP format. My
former collaborator Dr. J. Rosenboom has trained this special OCR system
so that it became acquainted with the special typesetting of mathematical
formulas used by the printers of the Archiv der Mathematik. Furthermore,
he has extended the recognition algorithms for special mathematical symbols
and the geometry of combined formulas. Inspired by R. Fateman's suggestions
Rosenboom has written a program to parse the digital mathematical LISP
formulas and then to produce an output in tex format. In particular, he
incorporated procedures enabling the LISP program to recognize different
types of fonts like normal, italics, Greek, bold face, etc.
In the LISP code there are several procedures for understanding the layout
of the printed pages of a journal. So it is necessary to have segmentation
procedures for dissecting a page into lines, lines into words, words into letters
or a mathematical formula into mathematical symbols.
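The segmentation of a page into lines can be illustrated with a horizontal projection profile, a common technique for this step. The sketch below is not the LISP code referred to above, whose details are not given here. It assumes a binarized page stored as a list of pixel rows, and the function name is illustrative.

```python
def segment_lines(page):
    """Split a binarized page into text lines.

    page: list of rows, each row a list of 0/1 pixels (1 = ink).
    Returns a list of (top, bottom) row ranges, bottom exclusive, one per
    maximal run of consecutive rows that contain ink.
    """
    lines = []
    start = None
    for y, row in enumerate(page):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y))     # blank row closes the line
            start = None
    if start is not None:                # line running to the page bottom
        lines.append((start, len(page)))
    return lines
```

Applying the same idea to the columns of one line gives a first cut at words and letters. Formulas need more care, since their symbols are not arranged on a single baseline.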
Inspired by R. Fateman's article [5] Rosenboom has written a program
which separates ordinary text of a scanned page from areas of the page con-
sisting only of mathematical formulas. Such a separation leads to a substantial
improvement of the retrospective digitization procedure. This was
demonstrated by J. Rosenboom in his lecture [14] at the international workshop on
"Retrodigitalization of mathematical journals and automated formula
recognition" organized by R. Fateman (Berkeley), E. Mittler (Göttingen) and the
author at the Institute for Experimental Mathematics of Essen University in
December 1997.
The idea of separating the ordinary text of a scanned page from the
remainder led us to use a commercial OCR system for the recognition of
this part of a scanned page. In our project we use FineReader [8], because it is
very reliable and has a very good application programming interface for C++
and other modern programming languages. This is important, because we do
not have access to the source code of the commercial product FineReader.
The following copy of a Tiff file of a scanned page of an article of the
"Archiv der Mathematik" will be used to explain the different retrodigitizing
procedures.
[Facsimile of the scanned first page of W. Crawley-Boevey, Tameness of biserial
algebras, Arch. Math. 65, 399-407 (1995).]
2.2 Recognition of ordinary mathematical texts
FineReader can recognize articles written in different languages like English,
French, German and Russian. However, it does not like mixtures of languages
in a given paper. It achieves recognition by checking its dictionaries of words
in a chosen language. For the retrospective digitization of mathematical
articles these dictionaries do not suffice, because they do not contain the special
mathematical terms, abbreviations, names of authors, quoted scientists or
mathematical journals. Therefore additional mathematically oriented
dictionaries have been written in C++. They can be read by FineReader over its
application interface. Thus the ordinary text of a scanned page of a
mathematical article is recognized by FineReader almost perfectly. It can be read
on screen and its digital version is written in ASCII. Therefore it is possible
to search in this digitized text for words. Those parts of the scanned page
which have not been recognized by FineReader like mathematical formulas,
geometric pictures or diagrams are defined by our retrospective digitization
system to be mathematical formulas. These sections of the scanned page are
sent to the LISP program. Its output is an ASCII text in tex format. This
mathematical content of the scanned page can also be viewed on screen.
However, this is done in a different window than the ordinary FineReader text.
In section 2.5 it is described how these two parts of the text of a scanned
article are linked to each other.
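The division of labour described in this section can be sketched as a simple dispatch step: regions for which the text recognizer returned something are kept as ordinary text, and all other regions are handed to the formula recognizer. The region dictionaries and their keys are hypothetical and do not correspond to the FineReader programming interface.

```python
def route_regions(regions):
    """Split page regions into (text_regions, formula_regions).

    regions: list of dicts with at least the key "text", holding the
    recognized string, or an empty string / None when recognition failed.
    Unrecognized regions are treated as mathematical formulas, following
    the convention described in the text.
    """
    text_regions, formula_regions = [], []
    for region in regions:
        if region.get("text"):            # non-empty OCR result
            text_regions.append(region)
        else:                             # hand over to formula recognition
            formula_regions.append(region)
    return text_regions, formula_regions
```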
2.3 Getting bibliographic data of an article
In order to recognize the special layout of the first page of a scanned article
another special program has been written. It reads any scanned page. By
analyzing the first recognized letters of it, it is able to decide whether this
page is the first page of the article. If so, then the program recognizes all
bibliographic data about this paper: name of the journal, number of the
volume containing the article, first and last page, year of appearance, the
international standard serials number (ISSN), owner of copyright (Birkhäuser
Verlag), the town of the publisher (Basel).
Moreover, the program recognizes the title, the number of authors and
their names together with their first names and initials. From the Tiff file of
the example given in section 2.1 it gives the following output.
William Crawley-Boevey, Tameness of biserial algebras, Arch. Math. 65, 399-
407 (1995)
From these data another program produces the following bibliographic
data in SGML format:
<journal> Arch. Math. </journal>
<volume> 65 </volume>
<firstPage> 399 </firstPage>
<lastPage> 407 </lastPage>
<year> 1995 </year>
<authors>
  <author>
    <lastName> Crawley-Boevey </lastName>
    <firstName> William </firstName>
    <firstInitial> </firstInitial>
    <secondInitial> </secondInitial>
  </author>
</authors>
<title> Tameness of biserial algebras </title>
These data written in the standard generalized markup language (SGML)
will enable other digital library document management systems like AGORA
[2] or MILESS [11] to retrieve these bibliographic records. Such an example
is described in [10].
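The step from the recognized citation line to the SGML record can be illustrated as follows. The sketch assumes a single author and the layout "First Last, Title, Journal Volume, First-Last (Year)" of the example above. The regular expression and the function names are illustrative and are not taken from the project's programs.

```python
import re

# Assumed layout: "<author>, <title>, <journal> <volume>, <first>-<last> (<year>)"
CITATION = re.compile(
    r"^(?P<author>.+?), (?P<title>.+), (?P<journal>[^,]+?) (?P<volume>\d+), "
    r"(?P<first>\d+)-(?P<last>\d+) \((?P<year>\d{4})\)$"
)

def parse_citation(line):
    """Extract bibliographic fields from a one-line, single-author citation."""
    m = CITATION.match(line.strip())
    if m is None:
        raise ValueError("unrecognized citation layout")
    first_name, _, last_name = m.group("author").rpartition(" ")
    return {
        "journal": m.group("journal"), "volume": m.group("volume"),
        "firstPage": m.group("first"), "lastPage": m.group("last"),
        "year": m.group("year"), "lastName": last_name,
        "firstName": first_name, "title": m.group("title"),
    }

def to_sgml(fields):
    """Render the fields with the tag names used in the record above."""
    lines = [f"<{tag}>{fields[tag]}</{tag}>"
             for tag in ("journal", "volume", "firstPage", "lastPage", "year")]
    lines += ["<authors>", "<author>",
              f"<lastName>{fields['lastName']}</lastName>",
              f"<firstName>{fields['firstName']}</firstName>",
              "</author>", "</authors>",
              f"<title>{fields['title']}</title>"]
    return "\n".join(lines)
```

A production version would have to handle several authors, initials and line breaks inside the page range, which this sketch deliberately leaves out.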
The Archiv der Mathematik always prints the complete addresses of all
authors at the end of an article. In recent volumes their e-mail addresses
are also given. Both are recognized by our retrodigitizing programs. If necessary
this information can also be produced in SGML format. The end of the
last address is also used to mark the end of the retrodigitized article.
2.4 Recognizing the references of an article
My collaborators Dr. G. Hennecke and Dr. H. Gollan have written another
special program for the recognition of the references. It reads any scanned
page and decides by itself whether or not this page contains the beginning or
the remaining part of the references. Each reference is digitized in full text,
including the abbreviations of the cited journals, volumes, years and page
numbers. In the example of section 2.1 the reference [12] mentioned on the
first page is recognized as follows.
[12] B. WALD and J. WASCHBÜSCH, Tame biserial algebras, J. Algebra
95, 480-500 (1985).
From these data the program produces the following SGML file:
<referenceNumber> [12] </referenceNumber>
<authors>
  <author>
    <lastName> Wald </lastName>
    <firstName> </firstName>
    <firstInitial> B. </firstInitial>
    <secondInitial> </secondInitial>
  </author>
  <author>
    <lastName> Waschbüsch </lastName>
    <firstName> </firstName>
    <firstInitial> J. </firstInitial>
    <secondInitial> </secondInitial>
  </author>
</authors>
<title> Tame biserial algebras </title>
<journal> Journal of Algebra </journal>
<series> </series>
<volume> 95 </volume>
<firstPage> 480 </firstPage>
<lastPage> 500 </lastPage>
<year> 1985 </year>
Later, using the functions of a distributed digital library document
management system, these data will make it possible to search the digitized volumes
of the mathematical journal and retrieve the quoted digital articles. Such an
example is described in [10].
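Splitting the author part of a recognized reference into the fields of the SGML record can be sketched as follows. The input format (initials followed by a surname in capitals, authors joined by commas or "and") is assumed from the single example above, and the function name is illustrative.

```python
import re

def parse_authors(text):
    """Split an author string such as "B. WALD and J. WASCHBUSCH".

    Returns one dict per author with the keys lastName, firstInitial and
    secondInitial. Surnames are converted from capitals with str.title(),
    which is only an approximation (it would turn MCKAY into Mckay).
    """
    authors = []
    # Authors are separated by commas, by "and", or by ", and".
    for part in re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", text.strip()):
        if not part:
            continue
        *initials, last = part.split()
        authors.append({
            "lastName": last.title(),
            "firstInitial": initials[0] if initials else "",
            "secondInitial": initials[1] if len(initials) > 1 else "",
        })
    return authors
```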
2.5 Incorporation of the different digitized texts into a
multivalent document system
Many mathematical articles of the Archiv der Mathematik contain pictures,
complicated diagrams and tables which cannot be recognized by any OCR
system. These parts of a scanned page have to be stored as images. However,
they do not contain any information which is necessary for the searchability
of a mathematical article.
In order to enable the reader to view the complete content of an article of
the retrodigitized mathematical journal on screen each page is scanned with
600 dots per inch, and a TIFF file of the whole page is produced. Applying
then the procedures described in sections 2.1, 2.2, 2.3 and 2.4 we obtain
the following separate files:
1) TIFF file of the whole scanned page,
2) ASCII file of its ordinary text,
3) Tex file of its mathematical formulas text,
4) Text and SGML files of the quoted references,
5) SGML file of its bibliographic data.
These different files are only useful for the scientists if they can be linked
to each other. This is done by the multivalent document system designed by
T.A. Phelps and R. Wilensky [13]. This software system has been produced by
T.A. Phelps [12], recently. It is a new general paradigm that regards complex
documents as multivalent documents comprising multiple layers of distinct
but intimately related content. Phelps and Wilensky write in [13]: "Small,
dynamically-loaded program objects, or 'behaviors', activate the content and
work in concert with each other and layers of content to support arbitrarily
specialized document types. Behaviors bind together the disparate pieces of
a multivalent document to present the user with a single unified conceptual
document. Examples of the diverse functionality in multivalent documents
include: 'OCR select and paste', where the user describes a geometric region
on the scanned image of a printed page and the corresponding text characters
are copied out."
Therefore this multivalent document system is very useful for our
retrospective digitization project. The following picture describes the different
layers containing the separate files 1) till 5), and the layer for user
annotations.
[Figure: stack of layers, from top to bottom: Future contents; Bibliographic
data; User annotations; Cited references; Mathematical formulas; Ordinary
text parts; TIFF file of a whole scanned page of a printed mathematical
article]
Multiple active semantic layers of the contents of a scanned page
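The "OCR select and paste" behavior quoted above from [13] can be illustrated by a minimal Python sketch. It is not the code of the MVD system: the Word class, the function select_and_paste and the vertical coordinates of the sample words are hypothetical, chosen only to show how a region on the image layer can be mapped to words of the text layer.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with its bounding box on the scanned page."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def select_and_paste(words, region):
    """Return the text of all words whose box overlaps the selected region."""
    rx0, ry0, rx1, ry1 = region
    hits = [w for w in words
            if w.x0 < rx1 and w.x1 > rx0 and w.y0 < ry1 and w.y1 > ry0]
    # Reading order: top to bottom, then left to right.
    hits.sort(key=lambda w: (w.y0, w.x0))
    return " ".join(w.text for w in hits)


# Sample words; the box heights are made up for the illustration.
page_words = [
    Word("Tameness", 844, 700, 1271, 774),
    Word("of", 1312, 700, 1411, 774),
    Word("biserial", 1440, 700, 1756, 774),
    Word("algebras", 1798, 700, 2165, 774),
    Word("By", 1461, 940, 1537, 1009),
]

print(select_and_paste(page_words, (800, 690, 1790, 780)))
```

The selection rectangle is given in the coordinates of the scanned image, while the result is taken from the text layer, which is the kind of linkage between layers described above.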
The program of Phelps's multivalent document system (MVD) can automatically
read the TIFF files 1) of the scanned pages of an article and show them on
screen. Thus the user can obtain a printout of the whole manuscript. The
TIFF file of the mathematical article is contained in layer 1 of its MVD
system.
In order to incorporate the ordinary text file 2) into this system, another
program has been written which enables FineReader to produce the digitized
ordinary text in Xdoc format. Besides the ASCII text, this Xdoc file also
describes the coordinates of each of its letters. The MVD system now makes
it possible to search for a word in the second layer and to show the result
on the first layer on screen. Thus the system provides full searchability in
the ordinary text of a retrodigitized page of a mathematical article. The
ordinary text parts of this mathematical article are contained in layer 2 of
its MVD system.
As an example we now present the Xdoc file of the ordinary text of W.
Crawley-Boevey's article "Tameness of biserial algebras" corresponding to the
TIFF file given in section 2.1. It is free of mathematical formulas or
expressions.
[a;"XDOC.10.0";E;"FrEngine40-CBl"]
[d;"65_5_399.xdc"]
[p;1;P;83;S;0;1666;0;0;3010;4618]
[t;1;1;0;0;A;"";"";"";0;0;0;0;1]
[f;0;"<DEFAULT>";R;q;10;V;0;0;0;10;100]
[f;1;"Courier";R;q;10;V;60;50;10;15;100]
[s;1;88;0;70;p;1]Arch.[h;196;32]Math.,[h;426;28]Vol.[h;569;25]
65,[h;677;30]399-407[h;965;29](1995)[h;1177;802]0003-889X/95/
6505-0399[h;2719;28]$[h;2778;23]3.30/0[y;2978;0;70;0;H]
[s;1;1980;0;166;p;1][h;2044;34]1995[h;2209;32]Birkh\"auser
[h;2565;25]Verlag,[h;2795;32]Basel[y;2978;0;166;1;S]
[s;1;844;0;774;p;1]Tameness[h;1271;41]of[h;1411;29]biserial
[h;1756;42]algebras[y;2165;0;774;1;S]
[s;1;1461;0;1009;p;1]By[y;1537;0;1009;1;S]
[s;1;1092;0;1148;p;1]WILLIAM[h;1351;28]CRAWLEY-BOEVEY[y;1914;0;
1148;1;S]
[s;1;108;0;1441;p;1]In[h;176;34]this[h;327;33]paper[h;548;32]
we[h;667;32]study[h;879;31]finite[h;1076;31]dimensional[h;
1511;33]associative[h;1902;32]algebras[h;2209;31](with[h;
2407;37]1)[h;2496;33]over[h;2674;31]an[h;2783;34]alge-
[y;2973;0;1441;0;H]
[s;1;28;0;1539;p;1]braically[h;315;27]closed[h;545;30]field
[h;712;36]K.[h;818;31]An[h;944;29]algebra[h;1217;29]is[h;1293;
30]biserial[h;1558;26]if[h;1632;18]every[h;1826;27]
indecomposable[h;2393;28]projective[h;2748;27]left[h;2874;30]
or[y;2975;0;1539;0;H]
[s;1;28;0;1636;p;1]right[h;183;35]module[h;466;39]P[h;547;36]
contains[h;862;34]two[h;1018;36]uniserial[h;1335;35]submodules
[h;1765;36]whose[h;2006;34]sum[h;2174;35]is[h;2257;34]the[h;
2391;35]unique[h;2652;34]maximal[y;2973;0;1636;0;H]
[s;1;27;0;1734;p;1]submodule[h;392;27]of[h;487;24]P[h;553;27]
and[h;702;27]whose[h;934;26]intersection[h;1346;27]is
[h;1421;25]either[h;1635;23]zero[h;1798;28]or[h;1897;25]simple.
[h;2150;27](A[h;2252;26]module[h;2526;26]is[h;2600;27]
uniserial[h;2908;26]if[y;2982;0;1734;0;H]
[s;1;26;0;1832;p;1]it[h;69;41]has[h;218;38]a[h;291;39]unique
[h;556;36]composition[h;1005;40]series.)[h;1262;37]An[h;1395;
40]algebra[h;1679;42]A[h;1772;36]has[h;1917;40]tame[h;2108;37]
representation[h;2612;37]type[h;2783;38]if[h;2869;29]its
[y;2972;0;1832;0;H]
[s;1;26;0;1930;p;1]indecomposable[h;565;38]modules[h;881;39]lie
[h;994;38]in[h;1092;39]one-parameter[h;1621;36]families,[h;
1926;40]see[h;2061;38]for[h;2192;36]example[h;2503;40][[3].[h;
2646;41]Here[h;2847;37]we[y;2972;0;1930;0;H]
[s;1;26;0;2045;p;1]prove[h;213;31]the[h;344;30]following[h;684;
30]result.[y;911;0;2045;1;S]
[s;1;106;0;2224;p;1]Theorem[h;398;24]A.[h;492;36]Biserial[h;772;
30]algebras[h;1073;31]have[h;1245;32]tame[h;1428;31]
representation[h;1927;27]type.[y;2106;0;2224;1;S]
[s;1;106;0;2372;p;1]This[h;250;39]appears[h;545;39]to[h;649;38]
have[h;842;36]been[h;1030;37]conjectured[h;1453;38]for[h;1584;
37]some[h;1791;38]time;[h;1997;36]it[h;2077;38]is[h;2163;37]
already[h;2445;36]known[h;2708;38]for[h;2840;36]so-[y;2972;0;
2372;0;H]
[s;1;27;0;2469;p;1]called[h;217;26]special[h;464;28]biserial
[h;729;26]algebras[h;1030;26][[12][h;1171;28]and[h;1321;25]
for[h;1439;25]many[h;1648;26]other[h;1850;25]biserial[h;2112;
26]algebras[h;2413;24](see[h;2556;25]for[h;2674;24]example
[y;2973;0;2469;0;H]
[s;1;29;0;2581;p;1][[7][h;104;22]and[h;248;22][[10,[h;382;22]
Chapter[h;676;22]2]).[h;804;23]The[h;954;21]two[h;1096;20]
main[h;1281;22]ingredients[h;1668;21]in[h;1750;21]our[h;1886;
18]proof[h;2091;12]are[h;2203;23]Vila-Freyer's[h;2656;19]
structure[y;2972;0;2581;0;H]
[s;1;28;0;2665;p;1]theorem[h;300;26]for[h;419;27]basic[h;611;25]
biserial[h;873;28]algebras,[h;1192;27]which[h;1414;27]was
[h;1563;26]motivated[h;1925;28]by[h;2032;26]this[h;2175;26]
problem,[h;2496;27]and[h;2646;26]a[h;2708;25]modifi[y;2971;0;
2665;0;H]
[s;1;26;0;2763;p;1]cation[h;230;28]of[h;327;17]Gei's[h;552;27]
Theorem[h;878;29]that[h;1037;27]algebras[h;1339;26]with[h;
1509;27]a[h;1572;25]tame[h;1758;27]degeneration[h;2216;27]are
[h;2343;28]tame,[h;2548;27]in[h;2636;25]which[h;2857;27]we
[y;2971;0;2763;0;H]
[s;1;27;0;2880;p;1]vary[h;171;21]the[h;291;21]relations[h;598;
20]instead[h;853;20]of[h;941;14]the[h;1054;20]structure
[h;1370;18]coefficients[h;1753;21]of[h;1843;12]the[h;1956;19]
algebra.[h;2235;22]If is[h;2804;21]the[h;2725;19]variety
[y;2971;0;2860;0;H]
[s;1;26;0;2958;p;1]of[h;95;26]associative[h;478;36]unital[h;
704;35]algebra[h;984;33]structures[h;1343;36]on[h;1461;36]
an[h;1575;34]vector[h;2321;34]space,[h;2550;35]then
[h;2728;35]Gei's[y;2971;0;2958;0;H]
[s;1;26;0;3054;p;1]Theorem[h;325;26][[6][h;426;25]states
[h;635;24]that[h;789;25]if[h;861;17]the[h;977;23]closure
[h;1236;23]of[h;1328;15]the[h;1443;25]of contains[h;2726;24]
a[h;2786;23]tame[y;2970;0;3054;0;H]
[s;1;26;0;3152;p;1]algebra[h;270;30]then[h;443;36]A[h;530;26]is
[h;604;32]tame.[h;812;33]In[h;914;31]our[h;1060;30]version
[h;1328;32]the[h;1459;31]algebras[h;1764;30]may[h;1936;30]have
[h;2119;30]different[h;2423;28]dimensions:[y;2850;0;3152;1;S]
[s;1;102;0;3348;p;1]Theorem[h;394;25]B.[h;484;37]Let[h;627;38]
be[h;813;34]a[h;884;20]finite[h;1079;35]dimensional
[h;1494;30]algebra,[h;1781;34]let[h;1891;37][h;1980;38]be
[h;2086;33]an[h;2195;30]irreducible[h;2567;33]variety[h;2819;
35]and[y;2971;0;3348;0;H]
[s;1;27;0;3445;p;1]let[h;101;34]
be[h;802;44]morphisms[h;1189;47]of[h;1302;39]
varieties[h;1609;44](where[h;1866;53]A[h;1970;46]has[h;2120;
47]its[h;2240;46]natural[h;2522;46]structure[h;2854;48]
as[y;2968;0;3445;0;H]
[s;1;24;0;3543;p;1]affine[h;209;55]space).[h;475;60]For[h;647;
57]write[h;1101;58]Let[h;1819;55]If[h;2338;49]is
[h;2594;54]tame[h;2798;56]and[y;2970;0;3543;0;H]
[s;1;28;0;3641;p;1]for[h;447;27]
general[h;712;22]i.e.[h;1044;18]
for[h;1157;25]all[h;1262;23]in[h;1408;23]a[h;1468;21]
non-empty[h;1825;27]open[h;2003;25]subset[h;2224;23]
then[h;2568;23]is[h;2773;25]tame.[y;2966;0;3641;1;S]
[s;1;101;0;3787;p;1]In[h;169;33]practice[h;463;31]Gei's[h;702;
33]Theorem[h;1034;32]has[h;1175;34]usually[h;1443;32]been
[h;1626;34]used[h;1806;33]in[h;1900;32]the[h;2032;30]form
[h;2218;34]of[h;2321;22]Theorem[h;2644;25]B,[h;2732;33]but
[h;2874;32]in[y;2967;0;3787;0;H]
[s;1;22;0;3884;p;1]this[h;139;18]case[h;293;22]one[h;432;19]also
[h;581;21]has[h;711;21]to[h;796;20]check[h;1004;21]that
[h;1155;20]all[h;1253;22]the[h;1375;18]algebras[h;1669;25]
A^[h;1772;20]have[h;1946;20]the[h;2066;20]same[h;2251;20]
dimension.[h;2630;22]Of[h;2742;11]course[y;2967;0;3884;0;H]
[s;1;20;0;3982;p;1]Theorem[h;319;28]B[h;394;25]and[h;541;26]
Gei's[h;775;26]theorem[h;1073;27]have[h;1253;25]a[h;1314;24]
common[h;1630;26]refinement[h;2008;25]in[h;2093;26]which
[h;2315;25]both[h;2493;26]the[h;2620;24]algebra[h;2888;31]
A[y;2971;0;3982;0;H]
[s;1;20;0;4080;p;1]and[h;141;39]relations[h;467;38]are[h;606;
38]allowed[h;900;38]to[h;1003;38]vary.[h;1201;39]The
[h;1367;38]author[h;1630;36]is[h;1714;38]supported[h;2087;39]
by[h;2205;37]the[h;2342;39]EPSRC[h;2638;39]of[h;2746;29]Great
[y;2966;0;4080;0;H]
[s;1;21;0;4178;p;1]Britain.[y;266;0;4178;1;S]
[s;1;103;0;4375;p;1]1.[h;151;29]Proof[h;372;21]of[h;462;20]
Theorem
[h;775;25]B.[h;865;29]If[h;950;20]an[h;1048;30]algebraic
[h;1379;27]group[h;1604;31]g[h;1685;29]acts[h;1841;29]
on[h;1952;29]a[h;2017;27](not[h;2176;29]necessarily[h;2562;27]
irreducible)[y;2967;0;4375;0;H]
[s;1;20;0;4472;p;1]variety[h;246;31]Y[h;324;24]then[h;491;27]
the[h;618;25]number[h;879;26]of[h;972;15]parameters[h;1346;26]
of[h;1441;20]G[h;1510;27]on[h;1619;31]Y[h;1698;22]is
[h;1769;24]
[s;1;19;0;4570;p;1]where[h;215;43]is[h;436;38]the
[h;574;40]union[h;804;40]of[h;913;30]the[h;1042;41]orbits
[h;1273;40]of[h;1381;29]dimension[h;1764;39]s.[h;1835;42]By
[h;1964;38]a[h;2038;40]G-stable[h;2345;40]subset[h;2590;37]
we[y;2965;0;4570;1;S]
[g;1666;0;0;3010;4618]
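How the word search in the second layer can use these coordinates may be sketched in Python as follows. This is an illustration only, not the MVD or FineReader code. The meaning of the marker fields is inferred from the listing and is an assumption: [s;_;x;_;y;...] is taken to start a text line at horizontal position x and vertical position y, [h;x;w] to describe a gap of width w beginning at x, and [y;x;...] to end the line at x. The function names are hypothetical.

```python
import re

# A marker looks like [s;1;88;0;70;p;1] or [h;196;32]; a literal "[" in the
# text is written "[[" (as in "[[3]." in the listing).
TOKEN = re.compile(r"\[\[|\[([a-z]);([^\]]*)\]|([^\[]+)")


def words_with_positions(xdoc_text):
    """Return (word, x_start, x_end, y) tuples for an Xdoc fragment.

    x_end is read directly from the marker that follows the word. x_start is
    inferred as the end of the preceding gap, so it is only a lower bound
    where an omitted formula precedes the word.
    """
    words = []
    x = y = None
    current = ""
    # Markers may be wrapped across lines in the printout, so join the lines.
    for m in TOKEN.finditer(xdoc_text.replace("\n", "")):
        if m.group(0) == "[[":
            current += "["
        elif m.group(1):
            kind = m.group(1)
            fields = m.group(2).split(";")
            if kind == "s":
                x, y = int(fields[1]), int(fields[3])
                current = ""
            elif kind in ("h", "y"):
                end = int(fields[0])
                if current:
                    words.append((current, x, end, y))
                current = ""
                if kind == "h":
                    x = end + int(fields[1])
        else:
            current += m.group(3)
    return words


def find_word(words, query):
    """Return the (x_start, x_end, y) positions of all occurrences of query."""
    return [(x0, x1, y) for text, x0, x1, y in words
            if text.strip(".,;:()") == query]


sample = (
    "[s;1;844;0;774;p;1]Tameness[h;1271;41]of[h;1411;29]biserial\n"
    "[h;1756;42]algebras[y;2165;0;774;1;S]\n"
)
title_words = words_with_positions(sample)
print(find_word(title_words, "biserial"))
```

The positions found in this way refer to the coordinate system of the scanned page, so a viewer could highlight the hit on the TIFF image of the first layer.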
In order to incorporate the LaTeX file 3) of the mathematical formulas into
the MVD system, the software of the MVD has been extended such that it can
read in file 3) as its third layer. To each mathematical formula or
expression its coordinates are attached. This is necessary in order to keep
the connection between the ordinary text of the second layer and the
mathematical text of the third layer. Here the reader can mark a relevant
mathematical formula and show it on screen. Another program will be written
enabling the viewer to transfer the obtained mathematical formula into a new
mathematical LaTeX manuscript.
As an example we now print out the mathematical formula text of
Crawley-Boevey's article "Tameness of biserial algebras".
Alg(d) [h;2537;19]
d-dimensional [h;2080;35]
GL(d,K)-orbit [h;1967;24] of A \in [h;2188;23] Alg(d) [h;2426;20]
Theorem [h;394;25] B. [h;484;37] Let [h;627;38] A [h;715;31]
X [h;1980;38]
f_1, \ldots, f_r : [h;434;31] X [h;515;30] \to [h;607;33] A [h;691;36]
x \in X [h;881;58]
A_x = [h;1320;32] A/(f_i(x)). [h;1654;59]
x_0, x_1 \in X [h;2222;59]
A_{x_0} [h;2494;52]
A_x [h;107;29] \cong [h;187;39] A_{x_1} [h;325;33]
x [h;774;22] \in [h;833;28] X [h;934;26]
x [h;1324;26]
X [h;2406;26]
A_{x_1} [h;2696;29]
\dim_G [h;1961;21] Y [h;2030;21] = [h;2105;27] \max [h;2272;23]
\{\dim [h;2450;22] Y_{(s)} [h;2564;27] - [h;2645;28] s [h;2700;16] | [h;2743;18]
s [h;2788;27] \geq [h;2867;28] 0\} [y;2962;0;4472;0;H] [s;1;19;0;4570;p;1]
Y_{(s)} [h;349;39]
Z [h;2682;27] \subseteq [h;2762;36] Y [h;2848;29]
We remark that Theorem B has only been put into this printout of the
mathematical formula text in order to help the reader find the corresponding
text in the Xdoc file of the ordinary text and in the TIFF file.
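The way the attached coordinates keep the second and the third layer connected can be illustrated with the first line of Theorem B. The following Python sketch is not part of the MVD software. It assumes that the first number of each [h;x;w] marker is the horizontal position at which the preceding word or formula ends, and it simply orders the tokens of both layers by that position. The function name merge_layers is hypothetical.

```python
def merge_layers(text_tokens, formula_tokens):
    """Interleave (token, x_end) pairs of both layers by their end position."""
    merged = sorted(text_tokens + formula_tokens, key=lambda t: t[1])
    return " ".join(token for token, _ in merged)


# Words of the line "Theorem B. ..." from the ordinary text layer, each with
# the position taken from the marker that follows it in the Xdoc listing.
text_layer = [("Theorem", 394), ("B.", 484), ("Let", 627), ("be", 813),
              ("a", 884), ("finite", 1079), ("dimensional", 1494),
              ("algebra,", 1781), ("let", 1891), ("be", 2086)]

# Formulas of the same line from the mathematical formula layer.
formula_layer = [("A", 715), ("X", 1980)]

print(merge_layers(text_layer, formula_layer))
# Theorem B. Let A be a finite dimensional algebra, let X be
```

The two formulas fall exactly into the gaps which the ordinary text layer leaves open after "Let" and after "let".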
The fourth layer contains file 4) with the bibliographic data of the quoted
articles and their SGML formats.
The layer with the user annotations allows communication between the author
or library and the possible readers of a retrodigitized article. So
misprints, comments and suggestions for further improvement of the article
and its presentation in the digital library can be mentioned here.
The sixth layer is used for the preparation of the searchability between the
different articles of the retrodigitized volumes of the mathematical journal.
Therefore it contains all the bibliographic data about the given article in
SGML format.
Since our retrospective digitization programs can recognize the special
layout of the first page and of the references, the MVD system of the
prototype allows the user to mark and call up a reference of a retrodigitized
article contained in the digital library within the MVD format.
3 Combining retrodigitized and recent digital volumes
All the programs described in the second section will be applied to
retrodigitize the 6 volumes of the Archiv der Mathematik published in the
years 1993 to 1996. The result can be considered a prototype for the
retrospective digitization of general mathematical journals. Of course, for
every other periodical the training systems of the OCR programs have to be
modified. However, the developed and applied technologies can be adjusted.
It is therefore even more important to combine the retrodigitized volumes
of the Archiv der Mathematik with the recent digital issues of this periodical.
Since 1997 this journal has been published in paper and in digital form. The
digital articles can be received from the Springer Link in Heidelberg over
the internet by the authorized members of the universities and research
institutes, because the Springer data base contains them in portable document
format (PDF). This format allows each page of the received article to be
viewed on screen. Furthermore, the reader can produce a printout of it. The
full text can also be searched for words, but not for mathematical formulas
or expressions. At
the Springer Link the bibliographic records and abstracts of the articles are
stored in SGML format.
The publisher Birkhäuser has agreed to extend our present collaboration
in order to produce software connecting the retrodigitized and the
recent digitized volumes of the Archiv der Mathematik in one experimental
digital library system. For that purpose the publisher will provide the
digital texts of the most recent 6 volumes in PDF format together with
bibliographic records about their articles in SGML format.
Since we will then have the bibliographic data of the retrodigitized and
of the digital volumes in SGML format, we can use the MILESS system [11]
to produce a common platform for both parts of the digital articles of
the Archiv der Mathematik. MILESS is a library server, developed at Essen
University in a joint research project of the Computer Center and the
Central Library. It uses the IBM DB2 Digital Library product [9] to provide
access to digital and multimedia documents in a reliable and systematic way.
As described in [10], MILESS allows the storage of such material in any
format such as audio, video, HTML, XML, PDF. Since both the retrodigitized
and the new digital articles of the Archiv der Mathematik have an SGML
description of their bibliographic data, we can use these to provide MILESS
with the necessary information about any given article. For the old
retrodigitized articles we will use the MVD system described in the second
section as the storage format. The new articles will be stored as PDF files,
as is done in the Springer LINK. So both parts of the digital journal can be
linked within the MILESS library system. Furthermore, the search functions
of MILESS now make it possible to search in the bibliographic data and also
for words in ASCII versions of the articles of both parts of the Archiv der
Mathematik.
This part of the project will be pursued in due course. The director of the
Essen Computer Center has agreed to provide access to the MILESS digital
library program. We do not need any access to the source code of the IBM
DB2 Digital Library product which is called by the useful public domain
programs of MILESS.
There is no doubt that this experiment will be successful. It is the last
cornerstone for building a prototype of a combined digital and retrodigitized
searchable mathematical journal. Instead of MILESS the publisher could also
use the digital library data base of the Springer LINK. Since the Center for
Retrospective Digitization of the Göttingen State and University Library is
about to introduce AGORA, a new document management system [2], it
is very likely that retrodigitized and recent digital volumes of a
mathematical journal can also be linked by means of that library system. The
director of the Göttingen State and University Library has agreed to support
such experiments with our prototype in another joint project.
4 Future improvements
As the reader will have observed, so far the multivalent document system
(MVD) does not produce the ordinary mathematical text parts of a scanned
mathematical article in PDF format. On the other hand, the programs used at
the Springer LINK do not provide any access to the mathematical formulas.
They are also not able to call up the text of a quoted article in the
references which is available in the publisher's digital library data base.
In order to show that referencing within the digital articles of the Archiv
der Mathematik is possible, one could also retrodigitize the recent digital
texts and put the obtained ordinary text part of a page into the multivalent
document system (MVD) in Xdoc format. Such an experiment is planned, and
will have the support of the publisher Birkhäuser. However, it will only
prove that it is very promising to start a new joint project, in which the
software of the present prototype is extended in such a way that the
retrodigitized and the recent digital volumes will be stored in a common PDF
or PostScript format. Furthermore, the bibliographic and referencing records
have to be written in a common SGML or XML format. Such a future project
requires joint efforts of the software developers for the publisher, for the
MILESS project and for the Essen retrodigitization project.
If successful, the resulting new prototype of one combined retrodigitized
and digital mathematical journal will show that such a software system
can also be used for retrospective digitization and for combining its results
with the digital volumes of other mathematical journals contained in a
distributed digital research library. In particular, it will then be possible
to retrieve and read quoted articles of different journals as long as they
are part of a digital library system. Furthermore, such software will make it
possible to mark a retrodigitized or recent digital mathematical formula on
screen and have it transferred into a new mathematical manuscript.
Acknowledgements
The author gratefully acknowledges financial support by DFG grant III N 2 -
542 81(1) Essen BIB45 ENug 01-02.
He is very grateful to the Birkhäuser Verlag for its generous technical and
legal support. The author also thanks Professor R. Fateman, T. A. Phelps
Ph.D. and Professor R. Wilensky of the University of California (Berkeley)
for their advice and their programs. Finally, he owes his thanks to his
former collaborators Dr. G. Hennecke, Dr. J. Rosenboom, and present
collaborators C. Begall, Dr. H. Gollan and Dr. R. Staszewski who have done
or do all the programming and hard work.
References
1. Adobe Acrobat Capture 2.01 for Windows 95 and Windows NT(R),
http://www.adobe.com/prodindex/acrobat/capture.htm
2. Agora - Digitales Dokumentenmanagementsystem für die Inhalte Ihrer
Bibliothek, http://www.agora.de
3. Benjamin P. Berman, Richard J. Fateman, Nicholas Mitchell and Taku
Tokuyasu, "Optical character recognition and parsing of typeset mathematics",
Journal of Visual Communication and Image Representation, vol. 7 (1996),
2-15.
4. D. Blostein and A. Grbavec, "Recognition of Mathematical Notation", Chapter
22 in P.S.P. Wang and H. Bunke (eds.), Handbook on Optical Character
Recognition and Document Analysis, World Scientific Publishing Company, 1996.
5. R. J. Fateman, "How to find mathematics on a scanned page", Preprint 1997,
Univ. Calif. Berkeley.
6. R. Fateman and T. Tokuyasu, "A suite of programs for document structuring
and image analysis using Lisp", UC Berkeley, technical report, 1996.
7. R. Fateman and T. Tokuyasu, "Progress in recognizing typeset mathematics",
1997, Univ. California Berkeley.
8. FineReader OCR Engine, Developer's Guide, ABBYY Software House (BIT
Software), Moscow, 1993-1997.
9. IBM DB2 Digital Library, http://www.software.ibm.com/is/dig-lib/
10. H. Gollan, F. Lützenkirchen, D. Nastoll, "MILESS - a learning and teaching
server for multi-media documents", Preprint.
11. MILESS - Multimedialer Lehr- und Lernserver Essen,
http://miless.uni-essen.de
12. T. A. Phelps, "Multivalent Documents: Anytime, Anywhere, Any Type, Every
Way User-Improvable Digital Documents and Systems", dissertation, UC
Berkeley, 1998.
13. T. A. Phelps, R. Wilensky, "Multivalent Documents: Inducing Structure and
Behaviors in Online Digital Documents", in Proceedings of the 29th Hawaii
International Conference on System Sciences, Maui, Hawaii, January 3-6, 1996.
14. J. Rosenboom, "A prototype mathematical text recognition system", Lecture
at the international workshop on "Retrodigitalization of mathematical journals
and automated formula recognition", Institute for Experimental Mathematics,
Essen University, 10 - 12 December 1997.
Gigabit Networking in Norway
Infrastructure, Applications and Projects
Thomas Plagemann
UniK - Center for Technology at Kjeller
University of Oslo
http://www.unik.no/~plageman
Abstract. Norway is a country with large geographical dimensions and a very
low number of inhabitants. This combination makes advanced telecommunication
services and distributed multimedia applications, like telemedicine and distance
education, very important for Norwegian society. Obviously, an appropriate
networking infrastructure is necessary to enable such services and applications. In
order to cover all important locations in Norway, this network represents a very
large Wide-Area Network (WAN) within a single nation. This paper describes the
Norwegian academic networking infrastructure, and gives an overview of Norwegian
research institutions, programs, and projects. Furthermore, we describe in two case
studies one exemplary multimedia application and one ongoing research project in
the area of gigabit networking and multimedia middleware.
1 Introduction
Traditionally, Norway is a very advanced country in the area of networking
and data communications. For example, in 1975 Kjeller (which is situated
30 kilometers north-east of Oslo) and London were the only European nodes
in the predecessor of the internet, called ARPANET. Today, Norway is one of
the leading countries in the world with respect to access to and usage of the
internet. According to the Norwegian Gallup institute, which is specialized
in interview based market analysis, 46% of all Norwegians have access to the
internet, 33% are regularly using the internet, and 24% of all Norwegian
households are connected to the internet [8]. There are two further facts
about Norway that make a study of advanced networking in Norway quite
interesting:
• Norway has approximately 4.3 million inhabitants. Consequently, there
are only a few universities and research institutions.
• Norway stretches over 2000 kilometers from south to north. This
geographical dimension combined with the low number of inhabitants makes
advanced telecommunication services and distributed multimedia
applications, like telemedicine and distance education, very important for
Norwegian society. Obviously, an appropriate networking infrastructure
is necessary to enable such services and applications. In order to cover
all important locations in Norway, this network represents a very large
Wide-Area Network (WAN) within a single nation.
This paper has two main goals: (1) to give a general overview of Norwegian
research activities in the area of gigabit networking and to provide appro-
priate references; and (2) to give a more detailed description of two typical
examples of research projects and the usage of the Norwegian networking
infrastructure. In this survey, we also consider distributed multimedia systems
and applications. These systems typically require only a few Mbit/s of
bandwidth per user, but the potentially large number of concurrent users
places considerable demands on gigabit networks.
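As a rough illustration of this aggregation effect, the following sketch multiplies a per-user stream rate by the number of concurrent users. The function name and the example figures (2 Mbit/s per user, 1000 users) are assumptions chosen for illustration and are not taken from the paper.

```python
def aggregate_demand_mbit(per_user_mbit: float, concurrent_users: int) -> float:
    """Total demand in Mbit/s if every user receives an individual stream."""
    return per_user_mbit * concurrent_users


# Hypothetical figures: 2 Mbit/s per user, 1000 concurrent users.
demand = aggregate_demand_mbit(2.0, 1000)
print(f"{demand / 1000:.1f} Gbit/s")
```

Even modest per-user rates therefore reach the gigabit range once the user population is large, unless multicast or caching reduces the number of individual streams.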
The first part of this paper describes the Norwegian academic research
network infrastructure, the connected research institutions, research pro-
grams, and relevant projects. In the second part, we present two case studies.
The first case study describes the electronic classroom system that is used for
teaching regular university courses. The second case study presents the on-
going MULTE (Multimedia Middleware for Low-Latency High-Throughput
Environments) project.
2 National Networking Infrastructure
Since 1987, the organization UNINETT has been responsible for the academic
networking infrastructure in Norway. Its responsibilities are [17]:
• to develop and maintain the national data network for research and higher
  education,
• to propagate the usage of open standards, and
• to stimulate research and development that is important in the context
  of UNINETT's activities.
It is a strategic goal for Norway to keep up with research and development
on new network services, like Internet 2 and Next Generation Internet, in
the USA, and to actively participate in the 5th Framework of the European
Union. In this context, multimedia and real-time services, like IP telephony,
digital libraries, distance education, and virtual reality, play an important
role and require a considerable amount of bandwidth. Table 1 summarizes
the bandwidth requirements that UNINETT estimates for the period 1998 - 2003
for the Norwegian backbone network, regional networks, access networks, and
internal educational networks [18].
In order to meet current and future requirements, UNINETT, the Norwegian
Research Council, and Telenor officially opened the National Research
Network in September 1998. This network comprises two parts: the research
network and the test network. The research network is a stable network to be
used for productive services and for new (multimedia) applications. All Nor-
wegian Universities, four Engineering Schools (Mo i Rana, Stavanger, Grim-
stad, and Halden) and research institutions at Kjeller are connected by the
research network. Additionally, Lillehammer will be connected during 1999.
Table 1. Estimated bandwidth requirements in Mbit/s [18]

                               1998      1999      2000      2003
Backbone network               40-150    100-300   300-600   2000-4000
Regional networks              0.25-30   10-60     20-150    150-600
Access networks                0.1-10    0.5-20    2-150     150-300
Internal educational networks  2-10      10-40     10-100    80-300
[Figure: network map with the nodes Tromsø, Trondheim, Bergen, Stavanger, Grimstad, Halden, Lillehammer, Oslo, and Kjeller; link labels 15, 25, 30, 60, and 70 Mb/s]
Fig. 1. Topology of the Norwegian test network (according to [19])
Figure 1 specifies the links and the available amount of bandwidth between
these institutions.
In contrast, the test network enables academic research institutions to
experiment with new network protocols, e.g., IPv6 and RSVP, and applications
in a Wide-Area Network (WAN) without interfering with the productive
services in the research network. Only the leading (academic) research
institutions in the area of networking and distributed systems, i.e., the four
Universities, UniK - Center for Technology at Kjeller and Telenor Research
and Development have access to the test network.¹ In particular, the test
network offers an infrastructure to [18]:
• realize the national IPv6 infrastructure,
• experiment with protocol mechanisms to support QoS on top of ATM,
  IPv6, or other internet based services,
• introduce reliable multicast, and
• perform experimental research with new protocols and services.
Both networks - the research network and the test network - are based on the
commercial 155 Mbit/s ATM/SDH WAN from Telenor, called Nordicom. In
order to manage and control the access to the research network, UNINETT
connects each node² in the National Research Network with an ATM switch
(Cisco LightStream A1010) to Nordicom. Based on Virtual Paths (VPs) in
Nordicom, these ATM switches establish Virtual Circuits (VCs), by using the
Private Network-to-Network Interface (PNNI) signalling protocol. Thus, the
Cisco switch supports the User-Network Interface (UNI) signalling protocol.
Additionally, the ATM switches from UNINETT are connected to IP routers
that route IPv4 packets in the research network (and IPv6 packets in the
test network) over the VCs towards their destination. Research institutions
with local ATM networks can choose whether they want to use IP services or
directly ATM services. Figure 2 illustrates the basic architecture at a node.
Fig. 2. Node architecture
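When IP packets are routed over ATM virtual circuits, part of the nominal 155 Mbit/s is consumed by cell headers and padding. The following sketch estimates the share left for IP data. It assumes AAL5 as the adaptation layer (an 8-byte trailer per packet, padded to a whole number of 48-byte cell payloads), which the paper does not state, and it ignores SDH framing overhead; the function names are hypothetical.

```python
import math

CELL_BYTES = 53     # size of one ATM cell
PAYLOAD_BYTES = 48  # payload bytes carried by one cell
AAL5_TRAILER = 8    # assumed AAL5 trailer added to every packet


def cells_per_packet(ip_bytes: int) -> int:
    """Number of ATM cells needed for one IP packet under the AAL5 assumption."""
    return math.ceil((ip_bytes + AAL5_TRAILER) / PAYLOAD_BYTES)


def ip_goodput_mbit(link_mbit: float, ip_bytes: int) -> float:
    """Share of the link rate that carries IP bytes for a given packet size."""
    wire_bytes = cells_per_packet(ip_bytes) * CELL_BYTES
    return link_mbit * ip_bytes / wire_bytes


for size in (40, 576, 1500):
    print(size, cells_per_packet(size), round(ip_goodput_mbit(155.0, size), 1))
```

Small packets suffer most, because a packet that barely exceeds one cell payload still occupies two full cells.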
3 Overview of Research Activities
3.1 Research Institutions
The main academic research institutions in the area of gigabit networking
and related areas are:
¹ Commercial institutions can apply for access to the test network on a per-project
basis.
² A node corresponds to Gigabit Point of Presence (GigaPOP) in the Internet 2
terminology.
• At the University of Bergen, the Department of Information Science does
  research in the area of information systems and the Department of
  Informatics in the areas of algorithms, bioinformatics, code theory, numerical
  analysis, program development, and optimization.
• The Department of Informatics at the University of Oslo is actively working
  in the areas: computer science, microelectronics, mathematical modeling,
  systems development, and image processing. Relevant activities
  include: Swipp (Switched Interconnection of Parallel Processors), SCI
  (Scalable Coherent Interface), Multimedia Communication Laboratory
  (MMCL), and ENNCE (Enhanced Next Generation Networked Computing
  Environment).
• At the Norwegian University of Science and Technology in Trondheim
  (NTNU), the Department of Telematics is working in the areas of distributed
  systems, traffic analysis, and reliability. The Department of Computer
  Science and Information Science is doing research in the areas of
  artificial intelligence, image processing, human-computer interfaces and
  systems development, information management, algorithms, and database
  systems. Important projects include the PaP (plug-and-play) project
  and WIRAC (Wideband Radio Access).
• UniK is a foundation at which faculty members are either affiliated with
  the University of Oslo or the NTNU. Areas of research interest at UniK
  are: distributed multimedia systems, telecommunications, opto-electronics,
  and mathematical modeling. Relevant activities include: OMODIS
  (Object-Oriented Modeling and Database Support in Distributed Systems),
  INSTANCE (Intermediate Storage Node Concept), and ENNCE/MULTE
  (Multimedia Middleware for Low-Latency High-Throughput Environments).
• The Department of Computer Science at the University of Tromsø is
  focusing its activities on distributed operating systems and open
  distributed systems. Relevant activities include: TACOMA, MacroScope,
  Vortex, and ENNCE/MULTE.
• The Norwegian Computing Center performs applied research in the fields
  of information technology. Selected activities include: LAVA (Delivery of
  video over ATM), IMiS (Infrastructure for Multimedia Services in Seamless
  Networks), and ENNCE.
• SINTEF Telecom and Informatics performs research in the areas of computer
  science, telecommunications, electronics, and acoustics. Relevant
  activities include IMiS and OMODIS.
• NORUT IT is working in the areas of earth observation, information
  and communication technology. Selected activities include NorTelemed
  (telemedicine applications) and LAVA.
• The Norwegian Defence Research Establishment (FFI) is a state operated,
  civilian research establishment reporting directly to the Norwegian
  Ministry of Defence. FFI is an interdisciplinary establishment representing
  most of the engineering fields, as well as biology, medicine, political
  science, and economics. Relevant and not classified research activities
  include ENNCE/MULTE.
The leading commercial research institution is Telenor Research and
Development (R&D), a part of the former national PTT. Main areas of interest
include service development and network solutions. Interesting projects in-
clude: Project I (next generation IP protocols and applications) and DOVRE
(Distributed Object-oriented Virtual Reality Environment). Further compa-
nies that perform research in the scope of this paper are Ericsson, Alcatel,
and Thomson-CSF Norcom.
3.2 Research Programs and Projects
National research projects are mainly funded by the Norwegian Research
Council (NFR). NFR is organized in areas, like the area of Natural Sciences
and Technology (NT), which in turn are organized in research programs and
activities. Ongoing research programs of interest in NT include:
• The Distributed IT Systems (DITS) program runs from 1996 to 2000. This
  program has a budget of 70 million Norwegian kroner (MNOK). The program
  supports basic research within the following three main areas: (1)
  construction and usage of distributed IT systems, (2) methods for
  construction and maintenance of systems and applications for distributed
  information handling, and (3) basic software and hardware technology
  for distributed IT systems.
• The main goal of the Basic Telecommunication Research Program (GT) is to
  support strategic and basic telecom research at universities and research
  institutes in the following four areas: (1) mobile systems, (2) broadband
  systems, (3) transport networks and end-systems, and (4) telecommunication
  systems for people with special needs. The program has a budget
  of 78 MNOK for the period 1997 - 2001.
• The main goal of the Super Computing program is to provide access to
  national super computer resources to scientific research projects. These
  resources include: one Cray J90 and one Cray T3E at NTNU, a Silicon
  Graphics Origin2000 at the University of Bergen, and an IBM SP2 at
  the University of Oslo. NFR contributes 110 MNOK to the budget in the
  program period 1999 - 2003.
Furthermore, universities and industry are financially supporting research
projects. In the following, we briefly describe some research projects in three
prominent areas: cluster technology, middleware, and multimedia applications. A
list of these and further research projects, and references to online documen-
tation can be found in the Appendix.
Cluster technology: The SCI (Scalable Coherent Interface) research group
at the University of Oslo studies how cluster software and hardware can be
created, analyzed, efficiently utilized, and maintained. In particular, I/O and
network access within SCI and ATM, and performance studies of SCI are of
interest in this context [13]. At the University of Tromsø, two projects are
concerned with multicomputer systems, clusters, and distributed operating
systems. The primary goal of the MacroScope project is to design and build
a multicomputer via a distributed operating system based on distributed
shared memory. The experimental hardware development platform consists
of eight Hewlett-Packard NetServers, each equipped with four Intel Pentium
Pro processors running at 166 MHz. Each NetServer has 128 MB memory
and two peer PCI buses. The NetServers are connected via a Myrinet
interconnect from Myricom. The Myrinet has a peak bandwidth of 2x1.28
Gb/s, and hosts one megabyte of SRAM on the network interface. The Vortex
operating system is currently running on uniprocessors and on 2-, 4-, and 8-way
Pentium II/Pentium Pro based multiprocessors. The current implementation
includes support for multithreaded processes, virtual memory, network
communication over UDP/IP (100 Mbit ethernet), gigabit network
communication (Myrinet), a RAM based file system, basic synchronization (mutexes,
semaphores, etc.), APIC symmetric I/O mode, a Plan9 like namespace for
resources, and other features.
Middleware: The TACOMA project (Tromsø And COrnell Moving Agents)
focuses on operating system support for agents and how agents can be used
to solve problems traditionally addressed by other distributed computing
paradigms, e.g., the client/server model [10]. The plug-and-play project is
financed by GT and will specify and explore aspects of a self-configuring
and self-adapting architecture with plug-and-play functionality for transport
and teleservice components. The goal is to develop the technology and to
demonstrate the ideas central to this plug-and-play concept. The objective of
OMODIS is to create basic research results within the domain of modeling for
distributed multimedia systems with emphasis on object-oriented modeling
and Quality of Service (QoS) modeling, based on a distributed persistent
object architecture [9]. OMODIS is financed by DITS in the period 1996 -
2001. The ENNCE/MULTE project is described in more detail in Section 5.
Applications: The NorTelemed project finishes in 1999. New services for
telemedicine, like remote diagnostics and remote consultation, are developed,
tested, and evaluated in this project. The focus of the main LAVA project
(1995 - 1996) was the delivery of video and audio over ATM, including video
compression technology, transport protocols for ATM, and multimedia
databases. Results of LAVA include an MPEG System Stream player application
and a server for delivery of streams [5]. In 1998, a LAVA extension was
started, called LAVA Education, which focuses on the use of interactive mul-
timedia systems for educational purposes. DOVRE is a project at Telenor
R&D. DOVRE is a software platform for developing networked real-time 3D
applications. The primary goal of DOVRE is to provide a platform for work,
education, entertainment and co-operation in distributed digital 3D worlds
[3]. The Electronic Classroom is described as a case study in the following
section.
4 Case Study 1: The Electronic Classroom
The first case study we present in this paper discusses an example of an
advanced application that is used on top of the research network to provide
a reliable service.
In the MUNIN project [4] and the MultiTeam project [1], the Center for
Information Technology Services at the University of Oslo (USIT), Telenor
R&D, and the Center for Technology at Kjeller (UniK) have developed the
so-called electronic classroom for distance education. Since 1993, the two
electronic classrooms at the University of Oslo have been used for teaching
regular courses, overcoming separation in space by exchanging digital audio,
video, and whiteboard information. Currently, four electronic classrooms are
established in Norway: two at the University of Oslo, one at the University
of Bergen, and one at the Engineering School of Hedmark. Since 1997, the
electronic classroom system has been commercially available from New Learning AS.
The following sections give an overview of the application and describe
the system architecture. A detailed analysis of QoS aspects in this system
can be found in [15].
4.1 Application
The main goal of the distributed electronic classroom is to make the teach-
ing situation in a distributed classroom as similar as possible to an ordinary
classroom. Thus, the number of seats for students is limited to a maximum of 20
in each classroom. During a lecture, at least two electronic classrooms are
connected. Teacher and students can freely interact with each other,
regardless of whether they are in the same or in different classrooms.
This interactivity is achieved through the three main parts of each electronic
classroom: electronic whiteboard, audio system, and video system. All par-
ticipants can see each other, can talk to each other, and may use the shared
whiteboard to write, draw, and present prepared material from each site.
The electronic whiteboard, audio, and video system in turn consist of several
components. Figure 3 shows the students in the classroom with the teacher
and Figure 4 shows the two whiteboards (one with the picture of the teacher)
in the remote classroom.
In addition to the ordinary classroom structure that is visible on these
pictures, i.e., student and teacher area, a technical back room is located
behind the classroom. Figure 5 illustrates the basic layout of an electronic
classroom.
Fig. 3. Electronic classroom with teacher
Fig. 4. Remote electronic classroom with students only
The electronic whiteboard is a synonym for a collection of software and
hardware elements to display and edit lecture notes and transparencies that
are written in Hypertext Markup Language (HTML). The whiteboard itself
is a 10ft' semi-transparent shield that is used together with a video canon and
a mirror as a second monitor of an HP 725/50 workstation in the back room.
A light-pen is the input device for the whiteboard. A distributed application
has been developed that can be characterized as World-Wide Web (WWW)
browser with editing and scanning features. When a WWW page is displayed,
lecturer and students in all connected classrooms can concurrently write,
draw, and erase comments on it by using the light-pen. Thus, floor control
is achieved through the social protocol - as in an ordinary classroom - and
is not enforced by the system. Furthermore, a scanner can be used to scan
and display on the fly new material, like a page from a book, on the shared
whiteboard. The entire application can be managed from a workstation in
the classroom.
[Figure: floor plan with areas labeled Students, Teacher, and Back Room]
Fig. 5. Basic layout of an electronic classroom
The video system comprises three cameras, a video switch, a set of monitors,
and an H.261 coding/decoding device (codec) to generate a compressed
digital video stream. One camera is automatically following and focusing on
the lecturer. The other two cameras capture all events happening in the two
slightly overlapping parts of the student area in the classroom. The audio
system detects the location in the classroom with the loudest audio source,
i.e., a student or the teacher that is talking. A video switch selects one of
the three cameras that captures this location and person in order to produce
the outgoing video signal. Two control monitors are placed in the back of
each classroom. The upper monitor displays the incoming video stream, i.e.,
pictures from the remote classroom, and the lower monitor displays the
outgoing video stream, i.e., video information from the local classroom. Thus,
the teacher can see the students in the remote classroom and can control
the outgoing video information while facing the local students. The students
in turn can see the remote classroom on a second large screen which is also
assembled out of a whiteboard, a video canon that is connected to the output
of the H.261 codec, and a mirror in the back room.
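The audio-driven camera selection described above can be sketched as follows. This is a minimal illustration and not the classroom's actual control logic: the zone names, camera names, and the idea of a fixed zone-to-camera table are assumptions introduced here.

```python
from typing import Dict

# Hypothetical mapping from microphone zones to the camera covering each zone.
ZONE_TO_CAMERA: Dict[str, str] = {
    "lecturer": "tracking_camera",
    "students_left": "student_camera_1",
    "students_right": "student_camera_2",
}


def select_camera(zone_levels: Dict[str, float]) -> str:
    """Return the camera that covers the zone with the loudest audio level."""
    loudest_zone = max(zone_levels, key=zone_levels.get)
    return ZONE_TO_CAMERA[loudest_zone]


levels = {"lecturer": 0.2, "students_left": 0.7, "students_right": 0.4}
print(select_camera(levels))  # student_camera_1
```

A real system would probably also smooth the levels over time so that the outgoing picture does not switch on every short noise.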
The audio system includes a set of microphones that are mounted on
the ceiling. The microphones are evenly distributed in order to capture the
voice of all the participants and to identify the location of the loudest audio
signal in the classroom. Furthermore, the teacher is equipped with a wireless
microphone. To generate a digital audio stream, two codecs are available: the
audio codec from the workstation and the audio codec in the H.261 codec.
Thus, one of the three coding schemes can be selected: 8 bit 8 kHz PCM
coding (64 kbit/s), 8 bit 16 kHz PCM coding (128 kbit/s), and 16 bit 16 kHz
linear coding (256 kbit/s). Speakers are mounted on the ceiling to reproduce
the audio stream from the remote site.
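The three bit rates follow directly from sample size times sampling rate. The short sketch below reproduces them, assuming a single (mono) channel and no packet overhead; the function name is hypothetical.

```python
def pcm_bitrate_kbit(bits_per_sample: int, sample_rate_khz: int) -> int:
    """Bit rate in kbit/s of an uncompressed single-channel sample stream."""
    return bits_per_sample * sample_rate_khz


for bits, khz in [(8, 8), (8, 16), (16, 16)]:
    print(f"{bits} bit, {khz} kHz: {pcm_bitrate_kbit(bits, khz)} kbit/s")
```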
4.2 Platform
The aim of the electronic classroom system is to be an open system. There-
fore, standardized internet protocols have been used as far as possible (see
Figure 6). There are four streams, which are using IPv4 as network protocol:
management, audio, video, and whiteboard stream. The management part
of the classroom, e.g., setting up a session, is performed in a point-to-point
manner and utilizes the reliable TCP protocol. The data exchange (audio,
video, and whiteboard stream) during a lecture requires multicast capable
protocols, because more than two classrooms can be interconnected. There-
fore, UDP is used on top of IP multicast. The audio and video streams have
stringent timing requirements, and audio and video packets are time-stamped
with the RTP protocol. For both streams, software modules are used in the
application to adapt the streams from the codecs to the RTP protocol, i.e.,
fill a certain number of samples or parts of video frames into a protocol data
unit (PDU). In contrast to audio and video, which tolerate errors, the white-
board application cannot tolerate errors and is therefore placed on top of
a proprietary multicast error control protocol (based on retransmissions) on
top of UDP.
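The adaptation step for audio, i.e., filling a fixed number of samples into a time-stamped PDU, can be sketched with the standard 12-byte RTP fixed header. This is an illustration under assumptions and not the classroom's code: the payload type 0 (PCMU in the RTP audio/video profile), the 160 samples per packet, and the function names are chosen here for the example. Each resulting packet would then be sent as one UDP datagram to the multicast group.

```python
import struct

RTP_VERSION = 2


def rtp_packet(payload_type: int, seq: int, timestamp: int, ssrc: int,
               payload: bytes, marker: bool = False) -> bytes:
    """Prepend the 12-byte RTP fixed header (no padding, extension, or CSRC)."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


def packetize_audio(samples: bytes, samples_per_packet: int, ssrc: int) -> list:
    """Split 8-bit samples into RTP packets; the timestamp counts samples."""
    packets = []
    for seq, start in enumerate(range(0, len(samples), samples_per_packet)):
        chunk = samples[start:start + samples_per_packet]
        packets.append(rtp_packet(0, seq, start, ssrc, chunk))
    return packets


packets = packetize_audio(bytes(400), 160, ssrc=0x1234)
print(len(packets), [len(p) for p in packets])
```

The receiver uses the sequence number to detect loss and the timestamp to play the samples out at the right time, which is why errors can be tolerated without retransmission.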
[Figure: protocol stacks. Whiteboard application over multicast error control, audio and video over RTP, all three over UDP and IP multicast; management of the application over TCP and IP]
Fig. 6. Protocol stacks used in the electronic classroom
The network topology between the electronic classrooms is basically defined
by the research network. The classrooms are connected either via a
local ATM switch (ForeRunner 200) or via dedicated ethernet, routers and an
FDDI ring to the research network. Addressing and routing of traffic on the
IP layer is mainly performed from the workstations in the back rooms and
the routers that are directly attached to the research network. As a backup
solution, six ISDN lines can be used to interconnect two classrooms.
5 Case Study 2: The ENNCE/MULTE Project
In this section, we discuss the ENNCE/MULTE project, because it establishes
a Metropolitan-Area Network (MAN) with Gigabit/s capacity and utilizes
the particular features of the test network.
5.1 Overview
The ENNCE/MULTE project is a collaboration project between the University
of Tromsø, University of Oslo, FFI, Thomson-CSF Norcom, and UniK.
The project is funded by the Norwegian Research Council under the Basic
Telecommunication Research Program in the period 1997 - 2001.
The need for QoS, real-time behavior, and high performance in distributed
multimedia applications like the electronic classroom, or command-and-control
systems is the starting point for the ENNCE/MULTE project. At
first glance, it seems that the necessary technology to build an appropriate
system platform for such applications is already commercially available: ATM
networks that offer high bandwidth and (guaranteed) QoS to higher layer
protocols, and implementations of the distributed object computing middleware
standard Common Object Request Broker Architecture (CORBA) from the
Object Management Group (OMG). Object Request Brokers (ORBs) represent
the heart of CORBA and enable the invocation of methods of remote
objects, regardless of their location, underlying network and transport protocols,
and end-system heterogeneity [20]. However, nearly all CORBA implementations
(e.g., IONA's Orbix and Visigenic's VisiBroker) are based on the
communication protocols TCP/IP. It is well known that TCP/IP is not able
to support the wide range of multimedia requirements, even if it runs over
high-speed networks like ATM. Furthermore, CORBA itself is not well-suited
for performance sensitive multimedia and real-time applications, because it
lacks streams support, standard QoS policies and mechanisms, real-time
features, and performance optimizations [16].
The main hypothesis of the ENNCE/MULTE project is that satisfying
the broad range of requirements of current and future distributed multimedia
applications requires flexible and adaptable middleware that can be dynamically
tailored to specific application needs. A further hypothesis is that a
flexible protocol system is an appropriate basis for the construction of flexible
and adaptable middleware. Based on these hypotheses, the MULTE project
breaks down into the following areas of concern:
• Analysis of application requirements, based on multimedia applications
  that are developed by the project partners: video journaling at the
  University of Tromsø, command-and-control systems on naval vehicles at
  FFI, and the electronic classroom at UniK [15].
• Low latency high throughput transmission is based on the Gigabit service
  offered by the Gigabit ATM Network Kit from the Washington
  University in St. Louis, an SCI network, and Gigabit Ethernet.
• Stream binding and enhanced interoperable multicast for heterogeneous
  environments requires appropriate abstractions at the upper API and
  configuration and management of filters at intermediate and end-systems
  [7].
• Flexible connection management that comprises mechanisms to adapt
  connection set-up and release mechanisms to QoS requirements and to
  make them independent of the particular protocol functionality.
• Flexible protocol systems that perform the communication tasks of the
  ORB-core.
In the following subsections, we describe the architecture of the first prototype
of a flexible multimedia ORB and the MULTE Gigabit network over which
the ORB will be used.
5.2 Flexible Multimedia ORB
A flexible protocol system allows dynamic selection, configuration and
reconfiguration of protocol modules to dynamically shape the functionality of a
protocol to satisfy specific application requirements and/or adapt to changing
service properties of the underlying network. The basic idea of flexible
end-to-end protocols is that they are configured to include only the necessary
functionality required to satisfy the application for the particular connection.
This might even include filter modules to resolve incompatibilities among
stream flow endpoints and/or to scale stream flows due to different network
technologies in intermediate networks. The goal of a particular configuration
of protocol modules is to support the required QoS for requested connections.
This will include point-to-point, point-to-multipoint, and
multipoint-to-multipoint connections. As a starting point, we use the Da CaPo (Dynamic
Configuration of Protocols) system [14] to build a flexible multimedia ORB.
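The idea of configuring a protocol from only the modules a connection needs can be sketched as follows. This is a toy illustration and does not reflect the Da CaPo interfaces: the module table, the function names, and the XOR stand-in for encryption are all assumptions made for the example.

```python
import zlib
from typing import Callable, Dict, List, Tuple

# A module is a pair of functions: one for the send path, one for the receive path.
Module = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]


def add_crc(data: bytes) -> bytes:
    return data + zlib.crc32(data).to_bytes(4, "big")


def check_crc(data: bytes) -> bytes:
    body, crc = data[:-4], data[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        raise ValueError("checksum mismatch")
    return body


def xor_cipher(data: bytes) -> bytes:
    return bytes(b ^ 0x5A for b in data)  # toy stand-in, not real encryption


AVAILABLE: Dict[str, Module] = {
    "error_detection": (add_crc, check_crc),
    "encryption": (xor_cipher, xor_cipher),
}


def configure(required: List[str]) -> List[Module]:
    """Build a protocol that contains only the requested functions."""
    return [AVAILABLE[name] for name in required]


def send(stack: List[Module], data: bytes) -> bytes:
    for down, _ in stack:
        data = down(data)
    return data


def receive(stack: List[Module], data: bytes) -> bytes:
    for _, up in reversed(stack):
        data = up(data)
    return data


stack = configure(["encryption", "error_detection"])
wire = send(stack, b"frame")
print(receive(stack, wire))  # b'frame'
```

An application that tolerates errors would simply call `configure([])` and pay no per-packet cost for functions it does not need.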
Overview of Da CaPo: Da CaPo splits communication systems into three
layers denoted A, C, and T. End-systems communicate via the transport
infrastructure (layer T), representing the available communication infrastructure
with end-to-end connectivity (i.e., T services are generic). In layer C
the end-to-end communication support adds functionality to T services such
that at the AC-interface, services are provided to run distributed applications
(layer A). Layer C is decomposed into protocol functions instead of
sublayers. Each protocol function encapsulates a typical protocol task like
error detection, acknowledgment, flow control, encryption/decryption, etc.
Data dependencies between protocol functions are specified in a protocol
graph. T layer modules and A layer modules terminate the module graph of
a module configuration. T modules realize access points to T services and A
modules realize access points to layer C services. Both module types "consume"
or "produce" packets. For example, in a distributed video application
a frame grabber and compression board produces video data. Applications
specify their requirements within a service request and Da CaPo configures in
real-time layer C protocols that are optimally adapted to application
requirements, network services, and available resources. This includes determining
appropriate protocol configurations and QoS at runtime, ensuring through
peer negotiations that communicating peers use the same protocol for a layer
C connection, initiating connection establishment and release, and handling
errors which cannot be treated inside single modules. Furthermore, Da CaPo
coordinates the reconfiguration of a protocol if the application requirements
are no longer fulfilled. The main focus of the Da CaPo prototype is on the
relationship of functionality and QoS of end-to-end protocols as well as the
corresponding resource utilization. Applications specify in a service request
their QoS requirements in form of an objective function. On the basis of this
specification, the most appropriate modules from a functional and resource
utilization point of view are selected. Furthermore, it is ensured that sufficient
resources are available to support the requested QoS without decreasing the
QoS of already established connections (i.e., admission control within layer
C).
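Selection by objective function combined with admission control can be sketched as below. The sketch is not the Da CaPo algorithm, whose details are not given here: the `Configuration` fields, the candidate values, the weighting in the objective, and the single CPU-share resource are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Configuration:
    """Estimated properties of one candidate module configuration (hypothetical)."""
    name: str
    delay_ms: float
    loss_rate: float
    cpu_share: float


def select_configuration(candidates: List[Configuration],
                         objective: Callable[[Configuration], float],
                         free_cpu: float) -> Optional[Configuration]:
    """Pick the admissible candidate with the lowest objective value."""
    admissible = [c for c in candidates if c.cpu_share <= free_cpu]
    if not admissible:
        return None  # admission control: reject instead of degrading others
    return min(admissible, key=objective)


candidates = [
    Configuration("plain", delay_ms=5.0, loss_rate=0.02, cpu_share=0.1),
    Configuration("fec", delay_ms=8.0, loss_rate=0.001, cpu_share=0.3),
    Configuration("retransmit", delay_ms=30.0, loss_rate=0.0, cpu_share=0.2),
]


def video_objective(c: Configuration) -> float:
    # Hypothetical weighting: delay in ms plus a penalty for lost packets.
    return c.delay_ms + 1000.0 * c.loss_rate


print(select_configuration(candidates, video_objective, free_cpu=0.5).name)  # fec
```

With less free capacity the same request falls back to a cheaper configuration, and with too little capacity it is rejected, which keeps established connections unaffected.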
Integration of Da CaPo in COOL: At UniK, we develop a new multi-threaded
version of Da CaPo on top of the real-time micro-kernel operating
system Chorus, that takes full advantage of the real-time support of Chorus
[11]. Furthermore, we integrate Da CaPo into the CORBA implementation
COOL such that the COOL-ORB is able to negotiate QoS and utilizes optimized
protocol configurations instead of TCP/IP [2]. Figure 7 illustrates the
architecture of the extended COOL-ORB on top of Chorus.
The COOL communication subsystem is split in two parts to separate the
message protocol, i.e., the Internet Inter-ORB Protocol (IIOP) and the proprietary
COOL Protocol, from the underlying transport protocols, i.e., TCP/IP and
Chorus Inter-Process Communication (IPC). A generic message protocol
provides a common interface upwards, thus generated IDL stubs and skeletons
are protocol independent. A generic transport protocol provides a common
interface for the different transport implementations.
There are two alternatives to integrate Da CaPo in this architecture: (1) Da
CaPo represents simply another transport protocol. This alternative is our
first prototype implementation for Da CaPo in COOL, accompanied with an
extended version of IIOP called QoS-IIOP, or QIOP. QIOP encapsulates QoS
information from application level IDL interfaces and conveys this information
down to the transport layer and performs at the peer system the reverse
operation. Da CaPo uses this information for configuration of protocols.
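The encapsulation step can be pictured with the following sketch, in which QoS parameters travel in front of the marshaled request and are stripped again at the peer. The wire layout, the two QoS fields, and all names are assumptions made for illustration; they do not describe the actual QIOP message format.

```python
import struct
from typing import NamedTuple, Tuple


class QoS(NamedTuple):
    """Hypothetical QoS parameters taken from an application-level interface."""
    throughput_kbit: int
    max_delay_ms: int


_HEADER = struct.Struct("!II")  # assumed layout: two unsigned 32-bit fields


def encapsulate(qos: QoS, request: bytes) -> bytes:
    """Sender side: put the QoS fields in front of the marshaled request."""
    return _HEADER.pack(qos.throughput_kbit, qos.max_delay_ms) + request


def decapsulate(message: bytes) -> Tuple[QoS, bytes]:
    """Peer side: the reverse operation, recovering QoS fields and request."""
    fields = _HEADER.unpack_from(message)
    return QoS(*fields), message[_HEADER.size:]


message = encapsulate(QoS(throughput_kbit=1500, max_delay_ms=100), b"invoke")
print(decapsulate(message))
```

The transport below the message protocol can read the QoS fields without understanding the request itself, which is what lets it choose a protocol configuration per connection.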
[Figure: COOL Communication Subsystem on top of the Chorus Operating System. Generic Message Protocol with IIOP, COOL Protocol, QIOP, and Da CaPo (ii); Generic Transport Protocol with TCP/IP, Chorus IPC, and Da CaPo (i)]
Fig. 7. Integration of Da CaPo into COOL and Chorus
The next step is to implement the second alternative, where Da CaPo
additionally configures a message protocol. The message protocols are then
Da CaPo modules formatting requests for marshaling and demarshaling in
stubs and skeletons.
5.3 MULTE Gigabit Network
At FFI and UniK, we are currently establishing a metropolitan area network that
combines traditional network technologies, like 100 Mbit/s Ethernet and 155
Mbit/s ATM, with the following Gigabit network technologies [12]:
• The Gigabit ATM Network Kits are based on technology that has been
  developed at the Washington University in St. Louis [6]. The ATM switches
  support several different link speeds up to 2.4 Gb/s. The ATM Network
  Interface Cards (NICs) operate at up to 1.2 Gb/s.
• SCI is standardized by ANSI-IEEE (Std 1596-1992). SCI provides
  distributed shared memory to a cluster of nodes, e.g., workstations, memory,
  disks, high speed network interfaces etc. Hardware-supported shared
  memory can be used in various applications, ranging from closely
  synchronized parallel programming to LAN support. The aggregated bandwidth
  of the SCI ring used at FFI is 1.2 Gbit/s.
• Gigabit Ethernet supports data transfer rates of 1 Gbit/s and is
  standardized in the IEEE 802.3z standard.
Figure 8 illustrates the network topology and infrastructure. Two Gigabit
ATM switches, one at FFI and one at UNIK, connected by a 1.2 Gigabit/s
link, form the core of the network. At FFI, five PCs are connected with PCI
cards to this switch and additionally to an SCI ring. At UNIK, we connect
Sun Ultra workstations and PCs to the Gigabit switch, to Gigabit Ethernet,
and to the local 155 Mbit/s ATM network from Fore Systems. The available
access to different types of gigabit networks in one end-system enables us
to directly compare these technologies. In particular, we will use the
flexible multimedia ORB to evaluate experimentally the possibility of
selecting an appropriate network service on the fly. The gigabit network is
directly connected to the test network via the Fore ATM switch. This enables
the University of Tromsø to access the gigabit network in Kjeller, and we
are currently using the flexibility of Da CaPo and the possibilities of the
test network to study the influence of various protocol configurations,
combined with different network reservations, on the QoS of streamed video
transfer between Kjeller and Tromsø (distance of approx. 1500 km).
[Figure 8: network diagram showing Gigabit Ethernet, Fore ATM and Gigabit
ATM at UniK, linked to Gigabit ATM at FFI.]
Fig. 8. Gigabit network at Kjeller
6 Concluding Remarks
The aim of this paper is twofold. On the one hand, we intend to give an
overview of gigabit networking and related areas in Norway. Thus, the first
part describes the Norwegian academic networking infrastructure, research
institutions, programs, and projects. On the other hand, we want to provide
a more detailed description of two exemplary activities, the electronic
classroom and the ENNCE/MULTE project. In this context, the definition of
relevant, interesting, and important activities is always based to a certain
extent on subjective measures, even if it is intended to present an
objective selection of activities and projects. Therefore, we apologize if
we have missed important activities.
Acknowledgements: I wish to thank Tom Kristensen for a lot of help
in preparing the paper. Furthermore, I would like to acknowledge Petter
Kongshaug and Olav Kvittem from UNINETT for providing details about
the research and test network.
References
1. Bakke, J. W., Hestnes, B., Martinsen, H. (1994) Distance Education in the
Electronic Classroom. Technical Report, Telenor Research and Development,
TF R 20/94, Kjeller (Norway)
2. Blindheim, R. (1999) Extending the Object Request Broker COOL with
flexible Quality of Service Support. Master Thesis at the University of
Oslo, Department of Informatics, February 1999
3. Bottar, E. (1997) Telepresence through Distributed Augmented Reality.
Scientific Report R&D 44/97, Telenor
4. Bringsrud, K., Pedersen, G. (1993) Distributed Electronic Classrooms with
Large Electronic White Boards. Proceedings of 4th Joint European Networking
Conference (JENC 4), Trondheim (Norway), May 1993, 132-144
5. Bryhni, H., Lovett, H., Maartmann-Moe, E., Solvoll, T., Sorensen, T. (1996)
On-demand regional television over the Internet. Proceedings of the ACM
Multimedia 96 Conference, Boston, November 1996, 99-108
6. Chaney, T., Fingerhut, A., Flucke, M., Turner, J. (1997) Design of a Gigabit
ATM Switch. Proceedings of IEEE Infocom, April 1997
7. Eliassen, F., Mehus, S. (1998) Type Checking Stream Flow Endpoints.
Proceedings of Middleware'98, Chapman & Hall, Sept. 1998
8. Gallup (1998) Intertrack December 1998 (in Norwegian). Available at:
http://www.gallup.no/menu/internett/default.htm
9. Goebel, V., Plagemann, T., Berre, A.-J., Nygård, M. (1996) OMODIS -
Object-Oriented Modeling and Database Support for Distributed Systems.
Proceedings of Norsk Informatikk Konferanse (NIK'96), Alta (Norway),
November 1996, 7-18
10. Johansen, D., Schneider, F. B., van Renesse, R. (1998) What TACOMA Taught
Us. To appear in: Mobility, Mobile Agents and Process Migration - An edited
Collection, Milojicic, D., Douglis, F., Wheeler, R. (Eds.), Addison Wesley
Publishing Company
11. Kristensen, T. (1999) Extending the Object Request Broker COOL with
flexible Quality of Service Support (in Norwegian). Master Thesis at the
University of Oslo, Department of Informatics, in progress
12. Macdonald, R. (1998) End-to-end Quality of Service Architecture for the
TDF. ENNCE/WP2 Technical report TR02/98F
13. Omang, K. (1998) Performance of a Cluster of PCI Based UltraSparc
Workstations Interconnected with SCI. Proceedings of Network-Based Parallel
Computing, Communication, Architecture, and Applications, CANPC'98, Las
Vegas, Nevada, Jan/Feb 1998, Lecture Notes in Computer Science, No. 1362,
232-246
14. Plagemann, T. (1994) A Framework for Dynamic Protocol Configuration. PhD
Thesis, Swiss Federal Institute of Technology Zurich (Diss. ETH No. 10830),
Zurich, Switzerland, September 1994
15. Plagemann, T., Goebel, V. (1999) Analysis of Quality-of-Service in a
Wide-Area Interactive Distance Learning System. To appear in
Telecommunication Systems Journal, Balzer Science Publishers
16. Schmidt, D. C., Gokhale, A. S., Harrison, T. H., Parulkar, G. (1997) A
High-Performance End System Architecture for Real-Time CORBA. IEEE
Communications Magazine, Vol. 35, No. 2, February 1997, 72-77
17. UNINETT (1993) Research Network and Internet2 (in Norwegian). UNINyTT
nr. 1 1993, electronically available at:
http://www.uninett.no/UNINyTT-1-93.html
18. UNINETT (1998) Research Network and Internet2 (in Norwegian). UNINyTT
nr. 1/2 1998, electronically available at:
http://www.uninett.no/UNINyTT/1-98
19. UNINETT (1998) Research Network Established! (in Norwegian). UNINyTT
nr. 3 1998, electronically available at:
http://www.uninett.no/UNINyTT/3-98
20. Vinoski, S. (1997) CORBA: Integrating Diverse Applications Within
Distributed Heterogeneous Environments. IEEE Communications Magazine,
Vol. 35, No. 2, February 1997, 46-55
A Appendix
Table 2. Institutions

Institution                                               URL
University of Bergen, Department of Information Science   http://www.ifi.uib.no/index.html
University of Bergen, Department of Informatics           http://www.ii.uib.no/index_e.shtml
University of Oslo, Department of Informatics             http://www.ifi.uio.no/
UniK - Center for Technology at Kjeller                   http://www.unik.no
University of Tromsø, Department of Computer Science      http://www.cs.uit.no/EN/
NTNU, Department of Telematics                            http://www.item.ntnu.no/index-e.html
NTNU, Department of Computer Science and
  Information Science                                     http://www.idi.ntnu.no/
Norwegian Defence Research Establishment                  http://www.ffi.no
Norwegian Computing Center                                http://www.nr.no/ekstern/engelsk/www.generelt.html
SINTEF Telecom and Informatics                            http://www.informatics.sintef.no/
NORUT IT                                                  http://www.norut.no/itek/
Telenor Research and Development                          http://www.fou.telenor.no/english/
Ericsson                                                  http://www.ericsson.no
Alcatel                                                   http://www.alcatel.no/telecom/
Thomson CSF Norcom                                        http://www.thomson-csf.no/
UNINETT                                                   http://www.uninett.no/index.en.html
Table 3. Programs, projects, and activities

Name                                      URL
DITS                                      http://www.ifi.uio.no/~dits/translate/index.html
GT                                        http://www.sol.no/forskningsradet/program/profil/gtele/
Supercomputing program                    http://www.sol.no/forskningsradet/program/tungregn/index.htm
Telecom 2005 - Mobile communication       http://www.item.ntnu.no/~tc2005/

ADAPT-FT                                  http://www.ifi.uio.no/~adapt/
CAGIS                                     http://www.idi.ntnu.no/~cagis/
ClustRa                                   http://www.clustra.com/
DOVRE                                     http://televr.fou.telenor.no/html/dovre.html
ENNCE                                     http://www.unik.no/~paal/ennce.html
ENNCE/MULTE                               http://www.unik.no/~plageman/multe.html
GDD                                       http://www.gdd.cs.uit.no/gdd/
GOODS                                     http://www.ifi.uio.no/~goods/
INSTANCE                                  http://www.unik.no/~plageman/instance.html
IMiS                                      http://www.informatics.sintef.no/www/prosj/imis/imis.htm
LAVA                                      http://www.nr.no/lava/
MacroScope                                http://www.cs.uit.no/forskning/DOS/MacroScope
Mice-nsc                                  http://sauce.uio.no/mice-nsc/
MMCL                                      http://www.ifi.uio.no/~mmcl/
Multimedia Databases                      http://www.idi.ntnu.no/grupper/DB-grp/projects/multimedia.html
Network Management                        http://www.item.ntnu.no/~netman/
Plug and Play (PAP)                       http://www.item.ntnu.no/~plugandplay/
project I                                 http://pi.nta.no/indexe.html
SCI activities at the University of Oslo  http://www.ifi.uio.no/~sci
Swipp (Switched Interconnection of
  Parallel Processors)                    http://www.ifi.uio.no/~swipp/
TACOMA                                    http://www.tacoma.cs.uit.no/
Vortex                                    http://www.vortex.cs.uit.no/vortex.html
Wirac                                     http://www.tele.ntnu.no/wirac/
Low Speed ATM over ADSL and the Need for
High Speed Networks
A case study in Göttingen
Gerhard J. A. Schneider
Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, Am Fassberg,
D-37077 Göttingen, Germany
Abstract. The use of modern technology from non-standard sources allows IT
centres to find temporary solutions for operational needs, for example to
give the local scientific community quick access to a networking
infrastructure. This paper describes the experiences of GWDG with ADSL
equipment. Although originally a consumer technology, it can also be used in
a scientific environment. Despite some clear limitations, ADSL technology
can be used quite well if other means are not readily available.
In addition, various networking issues that arise in a scientific
environment are discussed, using the situation in Göttingen as an example.
1 Introduction
GWDG, the Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen,
is the joint IT centre of the University of Göttingen and the Max Planck
Society. Five major research institutes of the Society are situated in the
Göttingen area. While four of them are within the city boundaries, the fifth
is some 30 km away. In order to provide adequate access to the network
infrastructure for this institute as well, a dark fibre link has been
installed in cooperation with the local water supply company. It turned out
that this solution was cheaper than a 3 year lease of a high speed link from
any of the telecommunication carriers.
Apart from doing research in applied computer science, the major tasks
of GWDG are to provide strategic services for its customers, as well as the
operation of local midrange parallel computers and of the high speed data
backbone GÖNET in the Göttingen area.
2 The WAN infrastructure
The Internet connectivity for the German science community is provided by
the Deutsches Forschungsnetz DFN ([1]). It operates B-WiN, a nation-wide
ATM backbone (which is physically part of the network of Deutsche Telekom)
with access points of 34 Mbit/s and 155 Mbit/s. This network will migrate
to Gigabit speed in the year 2000 and it will then offer access speeds of up
to 622 Mbit/s and later even more. The main B-WiN nodes are currently
housed on the premises of Deutsche Telekom. Customers either have their own
access points (pricing is independent of the location and varies from EUR
400000 p.a. for 34 Mbit/s to EUR 600000 p.a. for 155 Mbit/s) or connect to
the nearest access point via leased or private lines. While the second
option allows the sharing of an access point, it may add the overhead of the
cost for leased lines. Although prices for leased lines have started to fall
since the liberalisation of the telecommunication market in early 1998, they
still pose a problem for remote sites.
The ATM backbone is primarily used to transport IP traffic between member
sites. Thus PVCs exist between the various routers in the main nodes.
For detailed information about the current infrastructure of the network,
including an up-to-date map, see [2]. The map in Fig. 1 reflects the
situation in April 1999.
Fig. 1. High-speed network B-WiN of the Deutsches Forschungsnetz
In addition to plain IP traffic, it is also possible to order PVCs and,
quite recently, SVCs between individual sites. Prices for such connections
are very moderate and therefore not prohibitive. The reasons for these
quality of service connections can be special demands from research groups,
like priority access to supercomputers, or video conferences. It should be
added that the majority of such individual PVCs also carry only IP traffic,
but in a guaranteed environment. Thus many aspects of the current discussion
on quality of service connections over the Internet have successfully been
solved within the German science network by using appropriate transport
technologies. While ATM to the desktop may still be discussed and perhaps
never arrive, ATM is certainly well suited as a backbone technology,
especially with respect to quality of service.
The B-WiN also offers connectivity to the US networks via its Hannover
node, currently at 155 Mbit/s, with another upgrade to 310 Mbit/s due in
July 1999.
The US connectivity highlights various political problems that many
European network providers to the scientific communities are facing:
although there is a significant flow of data from Europe to the US,
commercial US providers reject the idea of cofunding transatlantic lines. In
addition, the diverse provider infrastructure in the United States basically
forces Europeans to build their own distribution network infrastructure in
the US to allow for adequate connectivity with different leading IP
subnetworks, in order to achieve decent throughput rates to US universities
and other research partners. So in essence, while US sites are benefitting
from European data sources, the European networks have to pay for this.
DFN's B-WiN is also part of the European ATM network TEN-155, which
provides interconnectivity between the different European science networks.
Rather than relying on an obscure peering via CIXes, the ATM network
allows for bilateral agreements between the various institutions. It seems
that this model is superior to the US model with its shortcomings described
above, at least from the point of view of international access.
Exchange with the commercial Internet in Germany is ensured via a 34
Mbit/s link¹ to the DE-CIX in Frankfurt. Although load on this link is
heavy, many commercial providers currently seem unwilling or unable to allow
an increase of the link speed, as their networks may be unable to cope with
the flow of data requested by their users from the servers in the science
community. As a result, some commercial Internet providers ordered direct
links to the German Science Network DFN.
3 The Landeswissenschaftsnetz Nord
In past centuries it was customary to found universities away from the main
political centres, to keep the influence of riotous students away from
politics. In fact, Göttingen is a perfect example, since it was located at
the far southern end of the Kingdom of Hannover in 1737. Similarly, in the
mid seventies of this century many newly founded universities and
polytechnics were placed in remote areas, mainly to provide some local
infrastructure in otherwise poor areas.

¹ This link will be upgraded to 68 Mbit/s in May 1999.
Smaller sites which do not have the bandwidth requirements or the financial
power to justify a B-WiN node now face the additional cost for leased
lines to access the nearest B-WiN site and the Internet. Thus the two German
states of Lower Saxony (Niedersachsen) and Bremen joined forces to improve
the situation for such remote sites by providing a statewide infrastructure
for teleteaching and video conferences as well as telemedicine. It still
seems unclear whether teleseminars will be the choice of the future for
university education. In any case it is necessary to train students to use
these tools, which no doubt will become part of their working life later on.
In addition, the export of knowledge from universities to companies may
become an additional challenge. Teleteaching methods could be used to enable
direct transfer of ideas and modern developments to industry in order to
provide a competitive advantage. Similar arguments hold for the medical
sector. The dense population in Germany may not create a need for
teleoperations, but developing and providing the necessary tools and methods
may eventually be important for the export oriented medical industry.
The new state network LWN (LandesWissenschaftsnetz Nord) became
operational in March 1999. It consists of a 155 Mbit/s ATM ring (see fig. 2)
connecting the major institutions as well as access lines for smaller remote
sites operating with at least 2 Mbit/s. Thus access to Internet technology
at rates required by modern developments is now guaranteed for all state
institutions in higher education. Since more and more local schools connect
to the nearest university or polytechnic (at their own cost), the
availability of appropriate connectivity also has a positive effect on
secondary education.
The LWN is fully compatible with DFN's Science Network B-WiN and
interconnects via the three sites in Bremen, Hannover and Göttingen. Each
interconnect operates at 155 Mbit/s. Because of the price structure of the
DFN, part of the funding of the State network came from merging various 34
Mbit/s access points and thus benefitting from the economy of scale. Thanks
to the compatibility, PVCs via LWN and B-WiN do not present any problems.
In particular, this ensures that participants of the LWN are not cut off
from forthcoming developments but on the contrary can participate much
better than before.
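
The economy of scale mentioned above can be made concrete with the access
prices quoted in Section 2. The following is only a rough sketch: the number
of merged sites is a hypothetical value chosen for illustration, and the cost
of the lines needed to reach a shared access point is deliberately ignored.

```python
# Annual B-WiN access prices quoted in Section 2 (EUR per year).
PRICE_PER_YEAR = {34: 400_000, 155: 600_000}


def cost_per_mbit(speed_mbit: int) -> float:
    """Annual price per Mbit/s of access capacity."""
    return PRICE_PER_YEAR[speed_mbit] / speed_mbit


def saving_from_merging(n_sites: int) -> int:
    """Annual saving when n_sites separate 34 Mbit/s access points are
    replaced by one shared 155 Mbit/s access point (line costs ignored)."""
    return n_sites * PRICE_PER_YEAR[34] - PRICE_PER_YEAR[155]


if __name__ == "__main__":
    for speed in (34, 155):
        print(f"{speed} Mbit/s: EUR {cost_per_mbit(speed):,.0f} per Mbit/s p.a.")
    # n_sites = 3 is a hypothetical figure, not taken from the paper.
    print("saving for 3 merged sites: EUR", saving_from_merging(3))
```

Under these assumptions a single site gains nothing from moving to the
larger access point, while any group of two or more sites that can share one
access point saves money.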
The ring infrastructure allows better use of the available resources as
there are always two paths between any two institutions. In addition it
provides an obvious fault tolerance in accessing the three interconnected
sites. In particular, in Göttingen the lines for LWN and B-WiN physically
arrive at different locations on campus.
Fig. 2. Landeswissenschaftsnetz Nord
4 Dial-in
The classical situation with respect to providing dial-in support for users
required universities to purchase appropriate equipment and to lease access
lines from Deutsche Telekom. The ongoing liberalisation forces carriers to
generate traffic to compensate for declining revenue. As a result Deutsche
Telekom is now placing routers on university premises and is providing the
necessary lines at no cost for the scientific institutions. Thus the dial-in
capacity has been boosted significantly in the past months. As the Telekom
infrastructure is very modern, this means that rather than large modem
banks, a few S2m ISDN trunk lines provide the required capacity both for
ISDN and for analogue connections, including V.90. Since connection charges
are identical for ISDN and analogue users, more and more users switch to
ISDN because of its faster and more reliable performance. Connecting to the
university for one hour at 64 kbit/s currently costs less than EUR 1 during
off-peak hours. Since the basic ISDN S0 subscriber access offers two
(virtual) 64 kbit/s lines, connections at 128 kbit/s are also possible, but
cost twice as much. Since connections are charged based on time intervals
(which may be up to 4 minutes long), demand-driven automatic dialing of the
second connection of the S0 access may not be wise in all cases.
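
Why demand-driven dialing of the second channel can be expensive follows
from interval-based charging: every started interval is billed in full. The
sketch below assumes a fixed 4 minute interval and uses the upper bound of
EUR 1 per hour from the text as the price; both values are illustrative
assumptions, since actual tariffs varied.

```python
import math

INTERVAL_SECONDS = 4 * 60    # longest charging interval mentioned in the text
PRICE_PER_HOUR_EUR = 1.0     # assumed upper bound ("less than EUR 1" per hour)


def billed_intervals(seconds: float, interval: int = INTERVAL_SECONDS) -> int:
    """Number of charging intervals billed; every started interval counts."""
    return math.ceil(seconds / interval) if seconds > 0 else 0


def channel_cost(seconds: float,
                 interval: int = INTERVAL_SECONDS,
                 price_per_hour: float = PRICE_PER_HOUR_EUR) -> float:
    """Cost in EUR of keeping one 64 kbit/s channel open for `seconds`."""
    price_per_interval = price_per_hour * interval / 3600
    return billed_intervals(seconds, interval) * price_per_interval


if __name__ == "__main__":
    # A 10 second burst on the second channel costs as much as 4 minutes.
    print(channel_cost(10), channel_cost(240))
```

With these assumptions, a second channel that is dialed for a short burst of
traffic is billed exactly like one that stays open for the whole interval.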
Although 64 kbit/s does not seem to be much in these days of high speed
networking, especially with respect to the 155 Mbit/s WAN technology, the
number of connections may present a challenge. GWDG is currently operating
some 10 S2m dial-in trunks which may result in a demand of up to 20
Mbit/s on WAN performance, in addition to the LAN traffic. Fortunately
dial-in demand is typically at its peak in the evening and night hours and
nicely complements daytime LAN demand. The dial-in characteristics are
best shown by the following diagram (see fig. 3), which seems typical for
a German science institution, but also reflects the pricing structure of the
carrier (currently rates for local calls are cheaper after 18:00 and
cheapest after 21:00). It also shows that scientists and students tend to
work late.
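
The figure of "up to 20 Mbit/s" can be checked with a few lines of
arithmetic. The sketch assumes the standard ISDN primary rate layout of 30
B channels of 64 kbit/s per S2m trunk; the function name is illustrative.

```python
B_CHANNELS_PER_S2M = 30   # user channels on one ISDN primary rate (S2m) trunk
B_CHANNEL_KBIT = 64       # capacity of one B channel in kbit/s


def dialin_peak_mbit(trunks: int) -> float:
    """Aggregate user bandwidth in Mbit/s if every B channel is in use."""
    return trunks * B_CHANNELS_PER_S2M * B_CHANNEL_KBIT / 1000


if __name__ == "__main__":
    print(dialin_peak_mbit(10))   # ten trunks, as operated by GWDG
```

Ten fully loaded trunks thus correspond to 300 simultaneous 64 kbit/s
connections, or slightly under 20 Mbit/s.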
Fig. 3. usage of dial-in lines over 24 hours
5 LAN
While access to WAN technologies can now be acquired at relatively short
notice due to the abundance of fibre optic cables with the long distance
carriers, bringing the local LAN onto modern technologies is a time and
money consuming exercise. Shortage of funds in the public sector means that
many construction plans have to be postponed.
Göttingen is a nice in-town University, with many departments housed
in old and picturesque buildings, including C. F. Gauss' original observatory.
Although this does boost the academic atmosphere and makes the Univer-
sity very attractive, it turns into a nightmare for networkers. Connecting a
building to the backbone not only means installing the cabling locally while
conforming to the requirements of conservation laws but also digging across
roads and public premises. Only recently has the legal framework been lib-
eralised in this respect.
As a result the backbone infrastructure in Göttingen is up-to-date with a
622 Mbit/s ATM backbone as well as an FDDI ring connecting the various
central points of the university. Most science faculties now have either 100
Mbit/s or 10 Mbit/s access to the backbone.
Arts and Social Sciences are typically not placed high on the list of
priorities since the need for networking was not obvious for these
disciplines when priorities were set 10 years ago. Thus GWDG is now faced
with rising demands from a new group of users but with little or no extra
funds to meet this specific demand.
However, the University owns a large and extensive copper network which
was installed together with the PABX in the late 1960s. Basically each
office is on this telephone network and there are plenty of spare wires into
each building. While the PABX itself is now more of historic interest and up
for replacement, the copper network still seems to be in excellent
condition. Although classical modem connections provide a first way to
access the backbone over these wires, the speed is not adequate even for
modest requirements from science.
6 ADSL
In early 1998, GWDG teamed up with Ericsson to investigate the possibility
of providing higher speeds over this copper network. The then newly released
Ericsson ANxDSL equipment was to be at the centre of the investigation.
Analysis of the equipment showed at a very early stage that it offered some
advantages over other solutions in the ADSL sector which made it
particularly interesting for deployment in a LAN environment. The most
appealing feature is that Ericsson's ANxDSL delivers native ATM to
the customer premises. The network terminating equipment offers two native
ATM ports (25 Mbit/s ATM, in fact) as well as an Ethernet port
to carry IP LAN traffic over ATM. This port is bridge-tunneled according
to RFC 1483. Therefore it is very easy to transport at least two different
LANs, e.g. a VLAN for administrative purposes as well as the standard LAN
infrastructure for science. Thus the typical paranoia of the administrative
sector with respect to IP traffic can be overcome at no extra cost. Other
institutions in the State were forced to install a separate administrative
LAN, and since funds are available only once, this meant that essentially
scientific needs had to be sacrificed to accommodate administrative demands.
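
To illustrate what bridging Ethernet over ATM according to RFC 1483 involves,
the sketch below builds the LLC/SNAP header that the RFC defines for bridged
Ethernet frames and counts the ATM cells an AAL5 frame then occupies. It is
an assumption that the equipment uses LLC encapsulation without a preserved
LAN FCS (the RFC also defines VC based multiplexing); the paper does not say
which variant was configured. All function names are illustrative.

```python
import math

# LLC/SNAP header for bridged Ethernet/802.3 frames as defined in RFC 1483.
LLC = bytes([0xAA, 0xAA, 0x03])             # LLC header announcing SNAP
OUI_802_1 = bytes([0x00, 0x80, 0xC2])       # IEEE 802.1 organisation code
PID_ETHERNET_NO_FCS = bytes([0x00, 0x07])   # bridged Ethernet, FCS not kept
PAD = bytes([0x00, 0x00])                   # aligns the MAC frame that follows

AAL5_TRAILER = 8     # bytes of AAL5 CPCS trailer
CELL_PAYLOAD = 48    # payload bytes carried in each 53 byte ATM cell


def encapsulate(mac_frame: bytes) -> bytes:
    """Prefix an Ethernet MAC frame (without FCS) with the RFC 1483 header."""
    return LLC + OUI_802_1 + PID_ETHERNET_NO_FCS + PAD + mac_frame


def cells_needed(pdu: bytes) -> int:
    """ATM cells used once AAL5 pads PDU plus trailer to a multiple of 48."""
    return math.ceil((len(pdu) + AAL5_TRAILER) / CELL_PAYLOAD)


if __name__ == "__main__":
    for size in (60, 1514):   # minimum and maximum Ethernet frame, no FCS
        pdu = encapsulate(bytes(size))
        print(size, "byte frame ->", len(pdu), "byte PDU ->",
              cells_needed(pdu), "cells")
```

Because each LAN is carried on its own virtual circuit, the administrative
VLAN and the science LAN stay separated on the ATM layer even though they
share the same copper pair.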
The main ATM hub for the ADSL equipment was placed next to a main
ATM switch on the campus network. Thus a seamless integration became
possible.
Since copper lines are readily available, it became possible to deliver a
connection to the GÖNET to many sites almost immediately. The waiting time
for a LAN connection was reduced from several years to several days.
Although ADSL is primarily a consumer technology, offering high bandwidth
to the customer site and comparatively little bandwidth in the opposite
direction, it turns out that it also offers a fully functional solution in a
high performance LAN environment. It enables users to experience the
possibilities of high speed Internet so that skills will have been developed
by the time the proper connection is installed.
The actual ADSL system consists of two parts:
• At the customer site a network terminating device ANxDSL-NT (fig. 4)
is installed. This device has 3 ports: two ports offering native 25.6 Mbit/s
ATM access and a twisted pair port for direct Ethernet access.
Fig. 4. network termination at customer site
A filter in the NT allows the splitting of plain old telephony service
(POTS) and ADSL on the customer premises. A similar splitter for the
European ISDN infrastructure will soon be available. Thus POTS and ISDN
are not affected by a power failure, whereas the NT requires electrical
power to operate.
• At the central site a line terminating device ANxDSL-LT (fig. 5) is
installed. This consists of a shelf holding up to 15 cards (two ports each;
at the time of writing 4-port cards were about to be released), connected
via the backplane to a 155 Mbit/s STM-1 interface.
Fig. 5. ANxDSL equipment overview
Ericsson also provides a concentrator allowing the connection of 16 such
shelves onto one 155 Mbit/s STM-1 interface. In principle, up to 480 ADSL
lines can thus be connected to the WAN, depending on traffic and
performance requirements.
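
The figure of 480 lines follows directly from the shelf layout described
above. A minimal sketch of the arithmetic, with illustrative parameter names:

```python
def max_adsl_lines(shelves: int = 16,
                   cards_per_shelf: int = 15,
                   ports_per_card: int = 2) -> int:
    """Number of ADSL lines that can be terminated behind one concentrator."""
    return shelves * cards_per_shelf * ports_per_card


if __name__ == "__main__":
    print(max_adsl_lines())            # full concentrator with 2-port cards
    print(max_adsl_lines(shelves=1))   # a single shelf
```

A single shelf with 2-port cards terminates 30 lines. What the announced
4-port cards would allow can be explored by changing `ports_per_card`,
although that configuration is not described in the paper.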
The original setup at GWDG consisted of one ANxDSL-LT located close
to the central PABX of the university and 30 ANxDSL-NT devices. Now
almost 50 lines are in operation. The telephone functionality was not
tested.
It is clear, however, that current ADSL technology cannot be seen as a
replacement for traditional LAN technology, especially with respect to
multimedia applications and high end central services, like a centralized
backup for large data.
In particular, the core system is not designed to handle massive LAN traffic
but rather to support the occasional dial-in from many users at different
times with a resulting moderate demand on networking. The coupling of LANs
via ADSL, however, tends to create a continuous demand for bandwidth,
especially when student dorms are on the net. In our trial, the 30 lines
generated a theoretical bandwidth of 240 Mbit/s and a practical peak demand
of almost 100 Mbit/s.
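
The relation between these numbers and the 155 Mbit/s STM-1 uplink of the
shelf can be sketched as follows. The per-line downlink rate of 8 Mbit/s is
an assumption derived from the quoted totals (240 Mbit/s over 30 lines); the
function names are illustrative.

```python
STM1_MBIT = 155.0   # uplink of the line terminating shelf


def theoretical_demand_mbit(lines: int, downlink_mbit: float = 8.0) -> float:
    """Sum of the nominal downlink rates of all lines."""
    return lines * downlink_mbit


def oversubscription(lines: int, uplink_mbit: float = STM1_MBIT,
                     downlink_mbit: float = 8.0) -> float:
    """Ratio of theoretical demand to the capacity of the shared uplink."""
    return theoretical_demand_mbit(lines, downlink_mbit) / uplink_mbit


if __name__ == "__main__":
    print(theoretical_demand_mbit(30))          # theoretical demand in Mbit/s
    print(round(oversubscription(30), 2))       # demand relative to STM-1
    print(round(100 / STM1_MBIT, 2))            # observed peak relative to STM-1
```

Even a single shelf of 30 lines can therefore, in theory, ask for more than
its STM-1 uplink provides, and the observed peak already used a large share
of it.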
6.1 Experiences
Provided the available copper wire is of reasonable quality, the ADSL
technology works amazingly well and is very easy to set up. Under reasonably
good conditions, the bandwidth achieved with this technology does indeed
live up to the promises made in the manuals. Yet ADSL puts a serious demand
on a network and it seems that carriers who are thinking of offering the
technology to consumers may underestimate the necessary upgrades in the
backbone.
Most of the performance problems observed during the trials originated
from previously unnoticed defects in the copper wires.
Apart from private copper lines, GWDG also experimented with running ADSL
over a 64 kbit/s leased line from Deutsche Telekom (in fact a cheap copper
wire), which at the beginning was not (officially) known to Deutsche
Telekom. The results were just as promising. As a consequence Deutsche
Telekom is now monitoring the progress in Göttingen and is about to sign a
contract with GWDG. This contract will enable GWDG to rent additional
copper lines at a very moderate cost (well below the traditional cost of
leased lines) to connect remote sites as well as the homes of some
University staff members to GÖNET. Although Deutsche Telekom is about to
launch its own ADSL pilots in various German cities, this contract will
allow for a special infrastructure in Göttingen, offering more functionality
and options because of the scientific interest behind the setup.
To highlight the experiences, Table 1 gives an overview of some of the
speeds obtained over the network. Length of cable is certainly one limiting
factor, but there are obviously others which we could not identify due to
the lack of measuring equipment.
Table 1. Line lengths and link speeds of ADSL

Location                          Line length  Downlink speed  Uplink speed
                                  (m)          (kbit/s)        (kbit/s)
Kolosseum                         1700         7968            640
Studentendorf                     1900         7968            640
Neuer botanischer Garten          2750         7712            832
Gerichtsmedizin                   5400         2304            640
Studienzentrum Subtropen          2000         7968            800
Studienzentrum U. of California   3600         4320            608
Medizinische Physik               2000         7456            640
Botanischer Garten                2900         6688            608
ZENS                              2000         7264            736
Volkskunde                        3500         6592            544
Völkerkundemuseum                 3200         6560            224
Sprachlehrzentrum                 2900         5824            704
Anthropologie                     5100         3872            640
Ibero-Amerika-Institut            2600         6720            672
Umweltgeschichte                  2700         7968            160
Heizkraftwerk                     800          4352            448
Akademie der Wissenschaften       3650         5536            544
Restaurierungsstelle              3500         4736            832
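
The observation that cable length is one limiting factor but not the only
one can be checked directly against the table. The sketch below transcribes
the measurements and compares short and long lines; the 3000 m threshold is
an arbitrary choice made for illustration, not a value from the paper.

```python
from statistics import mean

# (location, line length in m, downlink in kbit/s, uplink in kbit/s)
LINES = [
    ("Kolosseum", 1700, 7968, 640),
    ("Studentendorf", 1900, 7968, 640),
    ("Neuer botanischer Garten", 2750, 7712, 832),
    ("Gerichtsmedizin", 5400, 2304, 640),
    ("Studienzentrum Subtropen", 2000, 7968, 800),
    ("Studienzentrum U. of California", 3600, 4320, 608),
    ("Medizinische Physik", 2000, 7456, 640),
    ("Botanischer Garten", 2900, 6688, 608),
    ("ZENS", 2000, 7264, 736),
    ("Volkskunde", 3500, 6592, 544),
    ("Völkerkundemuseum", 3200, 6560, 224),
    ("Sprachlehrzentrum", 2900, 5824, 704),
    ("Anthropologie", 5100, 3872, 640),
    ("Ibero-Amerika-Institut", 2600, 6720, 672),
    ("Umweltgeschichte", 2700, 7968, 160),
    ("Heizkraftwerk", 800, 4352, 448),
    ("Akademie der Wissenschaften", 3650, 5536, 544),
    ("Restaurierungsstelle", 3500, 4736, 832),
]


def mean_downlink(min_m: int, max_m: int) -> float:
    """Mean downlink speed of all lines with min_m < length <= max_m."""
    return mean(down for _, length, down, _ in LINES if min_m < length <= max_m)


shortest = min(LINES, key=lambda row: row[1])
slowest = min(LINES, key=lambda row: row[2])

if __name__ == "__main__":
    print("mean downlink up to 3000 m:", round(mean_downlink(0, 3000)))
    print("mean downlink above 3000 m:", round(mean_downlink(3000, 10_000)))
    print("shortest line:", shortest)
    print("slowest downlink:", slowest)
```

On average the longer lines are clearly slower, and the longest line is also
the slowest. The shortest line, however, stays well below the best downlink
speed in the table, which matches the remark that factors other than length
are at work.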
7 City of Göttingen interconnected
In another attempt to speed up the building of the GÖNET, talks started with
the City of Göttingen with the aim of jointly using the available
infrastructure. Both ATM and secure end-to-end encryption provide the
technology to run LANs with contradicting security requirements over the
same cable infrastructure, thus providing the potential for the
cost-effective sharing of resources.
The Ci t y of GSt t i ngen also owns a fi bre opt i c net wor k (used t o cont r ol
traffic lights and t o connect sites like publ i c swi mmi ng pool s t o t he cent r al
admi ni st r at i on in t he Ci t y Hall) as well as a copper net wor k. The copper
net wor k is of i nt er est t o offer pe r ma ne nt i nt er net connect i ons t o pr i ma r y and
secondar y schools, cur r ent l y at mode m speed.
It t ur ned out t ha t at some specific l ocat i on GSNET and t he Ci t y net wor k
are j ust 30m apar t . As a resul t , t he Uni versi t y, t he Ci t y and GWDG si gned
a cont r act in l at e 1998 and deci ded t o j oi n forces.
Aft er t he net works became physi cal l y connect ed in Mar ch 1999, vari ous
uni versi t y buildings can now easily be r eached vi a exi st i ng fibres of ci t y net -
work as some of t hem are close t o publ i c sites or t raffi c lights. The posi t i ve
experi ences at GWDG wi t h ADSL equi pment has t r i gger ed t he deci si on t o
connect all local schools in Göttingen via ADSL to a central site in the City
Hall and from there via a 2 Mbit/s PVC over ATM directly to the German
Science Network. Access to GÖNET will be at a higher speed, so that local
schools may also gain insight into the paradigm changes caused by high speed
networking.
8 Summary
Modern telecommunication systems allow the rapid deployment of currently
adequate bandwidth to a large number of sites. Protocols like ATM as well
as encryption permit the operation of different LANs over the same infrastructure.
In addition, issues concerning quality of service like guaranteed or
restricted bandwidth can be solved easily with ATM, both locally and on a
nationwide basis.
The sudden decline in prices for WAN connections leads to an inverted
networking pyramid: While backbone and WAN are capable of delivering
the bandwidth required by modern communication, the local infrastructure
both to and in the buildings does not keep up with this development, due to
funding issues. ADSL provides a way to quickly connect sites at reasonable
speed and to bridge the time gap until fibre is installed.
In fact the Deutsche Forschungsgemeinschaft (DFG) has acknowledged
this inverted networking phenomenon and is working on a memorandum to
highlight the need for additional resources for local networks.
References
1. Verein zur Förderung eines deutschen Forschungsnetzes, DFN, Berlin,
http://www.dfn.de
2. B-WiN-Karte, http://www.dfn.de/b-winkarte.html
3. ADSL-Projekt der GWDG, http://www.gwdg.de/adsl
The NRW Metacomputing Initiative *
Uwe Schwiegelshohn and Ramin Yahyapour
Computer Engineering Institute, University Dortmund, 44221 Dortmund,
Germany
Abstract. In this paper the Northrhine-Westphalian metacomputing initiative is
described. We start by discussing various general aspects of metacomputing and
explain the reasons for founding the initiative with the goal to build a metacomputer
pilot. The initiative consists of several subprojects that address metacomputing
applications and the generation of a suitable infrastructure. The latter includes the
components user interface, security, distributed file-systems and the management
of a metacomputer. Finally, we specifically discuss the aspect of job scheduling in
a metacomputer and present an approach that is based on a brokerage and trading
concept.
1 The Need for High Performance Computing
High Performance Computing (HPC) has become an important tool for research
and development in many different areas [2,14]. Originally, supercomputers
were mainly used to address problems in physics. Then, the term
Grand Challenges was introduced in the eighties to describe a variety
of technical problems which require the availability of significant computing
power. Today the number of fields which are not in need of computers is
rapidly decreasing. In addition to the core areas of physics and engineering,
high performance computer equipment is essential for e.g. the design of new
drugs, an accurate weather forecast [13] or the creation of new movies. Other
new applications, especially in the field of education, are currently under
development.
But HPC is not just a necessity for a few companies or institutions. For
instance, many companies in various areas of engineering are faced today
with the task to constantly design new complex systems and bring them
into the production line as soon as possible. Time to market has become an
important parameter for high tech companies trying to grow in the global
market. Therefore, many of those companies use system simulation as a key
element for rapid prototyping to reduce development cycles. On the other
hand the complexity of many products in the fields of telecommunication and
information technology makes the availability of large computing resources
indispensable. Hence, the access to high performance computing for a broad
range of users may be a key factor to further stimulate innovation [10].
* Supported by the NRW Metacomputing grant
In recent years the computer industry constantly increased the computing
power of their products. While a top-of-the-line PC was equipped with
a 33MHz processor and 4MB DRAM in 1991 [7] a similar computer in 1998
included a 450MHz processor and 128MB memory [8]. Note that this does
not consider the additional technical advances in the architecture of the
processor. But the demand is growing at an even faster pace. For instance, the
number of computers in an average company has increased significantly from
1991 to 1998. The same development can also be observed for HPC. We
claim that no matter how much technology advances there will always be a
non negligible number of applications which require the fastest computing
equipment available.
Unfortunately, the mentioned short development cycles in information
technology equipment result in severe drawbacks for a HPC provider. By
definition, HPC uses technology at the leading edge. Therefore, HPC components
are expensive. In addition, today's HPC equipment will certainly not fall
into this category five years from now unless it is frequently upgraded during
this time span [19]. This results in high costs and a significant maintenance
effort. In order to balance those costs a high degree of utilization is a must
for HPC resources. For instance, few commercial owners of supercomputers
can afford to see their machines idle during the night or the weekend as it is
commonplace with most PCs.
Small companies may therefore face a dilemma. While the access to HPC
resources is necessary for the development of new products they do not have
enough applications to efficiently use such equipment. It certainly does not
pay off to run secretary programs like word processing on a supercomputer.
Those users would need a company or an institution where they can easily
get access to HPC resources when needed. In the academic environment the
computer centers assume such a role for the various research labs within a
university. As it cannot be expected that all potential users are living in close
proximity to each other and to HPC resources, powerful networks are
an essential component of a suitable HPC infrastructure.
Further, there are significant architectural differences between today's
high performance computers [12], like parallel vector computers (e.g. Fujitsu
VPP700 [20]), large symmetric multiprocessors (e.g. SGI Power Challenge
[16]), tightly coupled parallel computers with distributed memory (e.g.
IBM RS/6000 SP [3]) and large clusters of workstations (e.g. Beowulf [4,17]).
Also, machines from different manufacturers typically require or support
different software. On the other hand, some HPC applications need a specific
architecture as
• they are not portable for historic reasons,
• they are optimized for this machine or
• they can make best use of the available architectural properties.
It is therefore unlikely that a single supercomputer will be sufficient for all
potential users in a region. But if no other HPC resources are locally available
some users face the choice to run their application on the local equipment
or to ask for an account at another location. The first approach results in a
decreased efficiency while the second approach is typically quite cumbersome.
2 Metacomputing Infrastructure
Such a HPC infrastructure may be based upon a single HPC center which
provides all required resources, that is all HPC equipment is concentrated and
maintained at one location. Access to resources is rented to users for their
applications. Therefore, HPC users only pay for their actual usage while they
are not forced to care for support and maintenance of the system. Unfortunately,
this approach also has a few drawbacks:
• All HPC use needs network bandwidth. Therefore, large investments in
a dedicated network structure are necessary.
• The center may either be a potential single point of failure or special
care must be taken to prevent situations like the disruption of the whole
infrastructure by a single power failure.
• The center is completely decoupled from the applications. This may be
a disadvantage for some users like e.g. those designing new applications.
In addition, a single HPC center requires central planning and may show
little flexibility.
Alternatively, the concept of a distributed heterogeneous supercomputer
can be used. Such an infrastructure is also called a metacomputer. It consists
of geographically distributed HPC installations which are linked by an efficient
network. The location of a HPC component will depend on the demand
of local users. A suitable distribution of HPC resources allows a significant
reduction of the network load in comparison to the central approach. Further,
HPC resources from different providers may be included into the infrastructure
and can compete for customers. This absence of a single institution
controlling all HPC resources may be a significant advantage especially for
commercial users. In addition, the failure of any single component will not
lead to a breakdown of the whole metacomputer.
While metacomputing offers a variety of promising prospects it is not
clear whether this concept is actually feasible. To this end several questions
must be addressed:
• What are the technological requirements for metacomputing?
• Will this concept find acceptance in the user community including
potentially new users from industry?
• Which problems will arise in the management of a metacomputer?
• What will be the performance of a metacomputer in comparison to a
large installation of a supercomputer?
• What are the costs for building and maintaining a metacomputer?
2.1 Metacomputing Scenarios
Before finding a method to answer those questions it is necessary to precisely
define the use of a metacomputer. In general there are three scenarios with
different degrees of user involvement and with different system requirements.
Single Site Application. In this scenario each job is executed on a single
HPC component in the metacomputer. If any component has not enough
resources, like e.g. processors, to execute a job completely, it will also not run
parts of that job. Of course, a job may be assigned in parallel to several HPC
components for reasons of performance, that is to increase the probability
that the job will be completed at a given deadline. But in this case all copies
of the job are independent from each other. For single site applications the
maximum job size for the metacomputer is determined by the size of the
largest component.
The user need not modify any of his applications. It is only necessary to
specify the execution requirements of his job like e.g. the amount of memory
or the minimal number of processors or the necessary software. Taking the
requirements and possibly additional restrictions into account the metacomputer
picks its best suited component for the execution of the job (location
transparency). Even if all HPC resources in the metacomputer are working
at full load, the metacomputer can increase overall efficiency by running jobs
on the HPC component best suited for them.
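
The selection step for single site applications can be illustrated with a small sketch. It is not the selection logic of the NRW metacomputer: the class names, the fields and the rule that "best suited" means "eligible and least loaded" are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One HPC installation in the metacomputer (hypothetical model)."""
    name: str
    processors: int
    memory_mb: int
    software: frozenset
    load: float  # fraction of capacity currently in use, 0.0 to 1.0


@dataclass
class JobRequirements:
    """Execution requirements the user specifies for an unmodified job."""
    min_processors: int
    memory_mb: int
    software: frozenset = frozenset()


def eligible(component, job):
    """A component qualifies only if it can run the job completely."""
    return (
        component.processors >= job.min_processors
        and component.memory_mb >= job.memory_mb
        and job.software <= component.software
    )


def pick_component(components, job):
    """Return the eligible component with the lowest load, or None."""
    candidates = [c for c in components if eligible(c, job)]
    return min(candidates, key=lambda c: c.load, default=None)


# Illustrative data only; sizes and loads are invented.
COMPONENTS = [
    Component("vector", 16, 8192, frozenset({"fortran"}), 0.9),
    Component("cluster", 64, 4096, frozenset({"mpi", "fortran"}), 0.5),
    Component("smp", 32, 16384, frozenset({"mpi"}), 0.2),
]

fortran_job = JobRequirements(8, 4096, frozenset({"fortran"}))
chosen = pick_component(COMPONENTS, fortran_job)
print(chosen.name if chosen else "no suitable component")
```

The user only states requirements and never names a machine, which is the location transparency described above. A job larger than the largest component yields no candidate, matching the limit on the maximum job size.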
Homogeneous Multi Site Applications. In addition to single site applications,
some jobs may also be executed in parallel on different HPC components
of the same type, e.g. several IBM RS/6000 computers are combined
to jointly run a large job. In a large metacomputer this scenario significantly
expands the number of HPC resources which are potentially available to a
single job by forming a virtual supercomputer. As the cost for most types
of supercomputers grows superlinearly with the size, this approach may be
an interesting option for all cases where such big jobs must only be executed
once in a while.
However, multi site applications require the concurrent availability of several
HPC components. This includes the network that links the compute components.
Therefore, management of such a system becomes more difficult. In
addition, the user will not receive the same communication performance as
on a single large supercomputer. Hence, she must design her applications
accordingly. Further, some problems with large random communication patterns
may not run as a multi site application or take a huge performance hit.
Nevertheless, there are numerous applications that require limited communication
overhead and can therefore benefit from a multi site execution. This is
especially true when applications are developed with a metacomputing system in
mind.
Heterogeneous Multi Site Applications. This scenario further expands
the homogeneous multi site concept by allowing that potentially all HPC
resources of a metacomputer are used for the execution of a single job. However,
it is not necessary that all those components are actually running the same
executable. It is also possible that a job is automatically piped from one set
of HPC components to the next. Nevertheless, this will result in a substantial
coordination effort. Further, the workflow of the job must be carefully
planned taking into consideration various resource constraints like network
bandwidth or the size of different machines in the metacomputer. This requires
a new programming paradigm and significant additional user effort.
On the other hand, the prospect of a gigantic virtual supercomputer may be
well worth the work.
2.2 Requirements for a Metacomputing Pilot Project
The best approach to answer the previously posed questions is the establishment
of a pilot project including some applications. This will help to determine
the actual technical problems and suitable solutions for them. Early user
participation will provide a helpful feedback for the system designers. Such
a close cooperation between developers and users of a metacomputer is an
essential element of a pilot project. Unfortunately, any new HPC installation
leads to high initial costs. It is therefore highly advisable to select locations
for this pilot study where most if not all of the required HPC equipment is
already in place.
Taking all these requirements into consideration the German state of
Northrhine Westphalia (NRW) offers an excellent basis for the realization
of the project. It hosts the largest concentration of universities and other
research institutions in Germany. Most of these institutions already own
HPC equipment in their computing centers which operate independently.
This equipment includes almost all common HPC platforms like e.g. Sun
Enterprise 10000, IBM RS/6000 SP, Cray T3E, SGI Origin and others. The
inclusion of many different platforms guarantees the desired degree of
flexibility. It also makes the new metacomputer attractive for a wide range of
users as almost everyone can find her favorite HPC hardware in the system.
In addition, the state has a powerful network infrastructure which links these
institutions. It is presently based on an ATM backbone which allows
Quality-of-Service features and virtual channels (PVC/SVC). This all together
constitutes a suitable system infrastructure for metacomputing.
Note further that the large number of research institutions from many
areas also guarantees a diversity of research projects requiring HPC. Finally,
several technology centers with small high-tech companies are also located in
NRW resulting in a large pool of potential users. Therefore, all requirements
for a metacomputing pilot project are met in Northrhine Westphalia. On the
other hand, the setup of a metacomputer may provide significant benefits for
the economy and research projects in Northrhine Westphalia.
3 The NRW Metacomputing Initiative
Based on these thoughts, the NRW Metacomputing Initiative was proposed
by A. Bachem, B. Monien and F. Ramme in 1996 ([1]). It started in July
1996 and is planned to conclude in June 1999. The project is coordinated
by B. Monien of University Paderborn. It is jointly funded by the state of
Northrhine-Westphalia and the participating research institutions which are
named below:
• Paderborn Center for Parallel Computing (PC², University of Paderborn)
• University of Cologne
• University of Dortmund
• Technical University (RWTH) Aachen
• Central Institute for Applied Mathematics (ZAM), Forschungszentrum
Jülich
• GMD National Research Center for Information Technology, Bonn
Besides generating a working metacomputer the initiative has the goal to
find answers to the following specific questions:
• What are the system requirements for HPC components in a metacomputer?
• Does the metacomputer generate a need for a new type of HPC component
or for significant modifications of the existing ones?
• Which applications can benefit most from a metacomputer?
The initiative consists of several system and application projects that
work on different aspects of metacomputing, see Fig. 1.
3.1 Application Projects
The inclusion of application projects from the beginning had the goal of
supporting a constant communication process between users and system
designers. These applications can further be used for test and evaluation of the
metacomputer pilot. This includes functionality and performance aspects.
The first user projects may also give an indication about the characteristic
properties of future multi site applications regarding
• communication patterns,
• network requirements, and
• software adaptations.
The subjects of the application projects are Molecular Dynamic Simulation,
Traffic Simulation, and Weather Forecast. However, in this description
we will primarily focus on the system design of the metacomputer and
therefore not go into the details of those projects.
Fig. 1. Projects of the Initiative
3.2 System Projects
As already mentioned the metacomputer uses existing HPC installations.
This includes both hardware and system software (operating systems, local
management software). In order to combine those resources into a working
metacomputer the following problems must be addressed:
• Coordinated management
• Interfaces
• Security
These problems became the subject of several projects in the initiative. In
the next sections all those system projects are briefly described while the
project Schedule is discussed in more detail.
4 Data Distribution and Authentication with DCE/DFS
As metacomputing in this initiative is done over the public Internet, insecure
channels are used for communication. Also, computers of different political
administration domains are part of the metacomputer. This requires
authentication of remote users. Finally, hardware and software of the HPC
components must be protected from unauthorized access. Hence, there is need
for secure authentication and secure communication. On the other hand, it
is important to limit the resulting overhead for users and administrators to
achieve a high degree of acceptance and participation.
In the initiative, it was decided to use the standardized Distributed Computing
Environment (DCE) as an existing and proven software solution. DCE
allows secure authentication and communication as well as cross authentication
between cells, that is separate administrative domains. Therefore, user
login or job startup is possible without the need to supply a password for
every machine. Furthermore, the Distributed File System (DFS) is used
to generate a shared file system that provides a dedicated home or project
directory on every platform. As DFS uses DCE features for encryption and
authentication, system and user files are secured. DCE/DFS has further the
advantage of being available for most common platforms.
This system project has the goal to set up DCE/DFS cells for various
NRW institutions and to provide cross authentication between them for
metacomputing users. Further, mechanisms are developed to enter the
authentication cells from outside the DCE/DFS framework. This allows job submission
from machines that are not using DCE. The project further includes
performance measurements for DFS and the available network infrastructure. The
results show a significant speedup in comparison to NFS. Nevertheless, data
prefetching is still beneficial for data intensive applications.
5 Metacomputing User Interface
This project deals with the development of a user interface to the
metacomputer [21]. To achieve transparency and a high degree of usability the
interface should be unique and available for all platforms. Thus, the interface
is written in Java and is able to run over the net on all common Java Virtual
Machines in e.g. web browsers. Therefore, new versions of the interface are
instantly available to all users who download it from the web on startup as
a Java applet.
The interface allows the setting of mandatory and voluntary parameters
for a job. It provides status information about jobs and available machines.
To maintain security for passwords and jobs, the communication is encrypted
via third-party software (Cryptix). Signed applets ensure that only the
authorized applet from the original site is used.
The Java user interface connects to the HPCM management of the NRW
metacomputer. It transmits job requests and authentication information. If
the user is not working from a DFS enabled host, the applet can upload
application data into the DFS cell.
6 Management Architecture HPCM
The HPCM project provides the infrastructure for the metacomputing
management. It consists of a management daemon and several coupling modules
which communicate platform specific information to the HPCM layer. The
management daemon executes on the HPCM server machine and receives
requests from the Java user interface. It is the administrative instance that
generates the global view of the metacomputer with its components and the
participating users. Note that similar multi-tier architectures can be found
in other metacomputing projects, as e.g. in Globus [6].
Fig. 2. Screenshot of the Java User Interface for HPCM
The coupling module is an interface to various computing platforms. Besides
abstracting the available information and access methods from the
management it interacts with the available local management of the HPC
component. The current implementations of coupling modules in the initiative
range from NQE [5], CCS [9] to LoadLeveler. Additional modules as e.g. for
LSF [22] can easily be derived from the existing implementations.
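
The role of a coupling module can be pictured as a common interface that hides each local batch system behind the same few operations. The sketch below is an illustration under assumptions, not the HPCM interface: the class and method names are hypothetical, and an in-memory stand-in replaces real systems such as NQE, CCS or LoadLeveler so that the example runs on its own.

```python
from abc import ABC, abstractmethod


class CouplingModule(ABC):
    """Hypothetical uniform view of one local resource management system."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def system_load(self):
        """Return the current load as a fraction between 0.0 and 1.0."""

    @abstractmethod
    def submit(self, script):
        """Hand a job to the local management and return its identifier."""

    @abstractmethod
    def job_status(self, job_id):
        """Return 'running', 'queued' or 'unknown' for the given job."""


class InMemoryCoupling(CouplingModule):
    """Stand-in for a real batch system; it only keeps job states in a dict."""

    def __init__(self, name, slots):
        super().__init__(name)
        self.slots = slots
        self._jobs = {}

    def _running(self):
        return sum(1 for state in self._jobs.values() if state == "running")

    def system_load(self):
        return self._running() / self.slots

    def submit(self, script):
        job_id = f"{self.name}-{len(self._jobs) + 1}"
        # A job starts at once while slots are free, otherwise it waits.
        self._jobs[job_id] = "running" if self._running() < self.slots else "queued"
        return job_id

    def job_status(self, job_id):
        return self._jobs.get(job_id, "unknown")


def global_view(modules):
    """What a management daemon could assemble: the load of every component."""
    return {module.name: module.system_load() for module in modules}
```

Under this reading, supporting a further batch system means writing one more subclass, which is in the spirit of the statement that additional modules can be derived from the existing implementations.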
7 Metacomputing Scheduling
Typically, owners of HPC installations are only willing to include their
resources into a metacomputer, if the performance of their components will not
degrade in the new environment. This is especially true for commercial
owners. Similarly, users expect a better performance for their jobs. Note that the
expression performance has not been defined as different people may attach a
different meaning to it. In most cases however, an owner wants a high system
load for her machine while a user is interested in a short response time for
his job or at least a fair resource allocation.
Therefore, job scheduling and resource allocation are among the core
problems in the metacomputing architecture. As existing system software is used,
the metacomputer scheduler must interact with the local schedulers on all
HPC components. To avoid the bottleneck of a centralized scheduler and
to increase flexibility a distributed approach is employed. Therefore, the
paradigms for job schedulers of parallel computers and for metacomputer
schedulers differ significantly, see Table 1.
Table 1. Different Scheduling Paradigms

Parallel Computer Scheduling               Metacomputer Scheduling
The network is ignored.                    The network is a resource.
Load information is instantly available.   Load information must be collected.
Homogeneous system environment             Heterogeneous system environment
Mostly first-come-first-serve scheduling   Resource reservation for future time frames
Central scheduler                          Distributed scheduler
To implement a distributed metacomputing scheduler we use an architecture
which is based upon so called MetaDomains. All MetaDomains of a
metacomputer form a redundant network. Typically, a MetaDomain is
associated with local HPC resources. That is all HPC resources at one site are
connected to a single MetaDomain. For local HPC access any user can choose
to either submit her job directly to the local component or to use the local
MetaDomain.
Therefore, the metacomputer does not require exclusive access to a HPC
resource. The logical structure of such a scheduler is described in Fig. 3. This
network can be dynamically extended or altered. Such a property is
advantageous as, for instance, individual HPC components may be temporarily
unavailable due to maintenance or new HPC resources may be introduced
into the metacomputer. The presented architecture guarantees a high degree
of flexibility.
MetaDomains communicate among one another by transmitting or
requesting information about resources and jobs. To this end a MetaDomain
inquires local schedulers about system load and job status. A MetaDomain
can also allocate local HPC resources to requests. The distributed scheduling
itself is based upon a brokerage and trading concept which is executed
between the MetaDomains. In detail, a MetaDomain tries to
• satisfy local demand if possible,
• ask other MetaDomains for resources, if the local demand cannot be
satisfied,
• offer local HPC resources to other MetaDomains for suitable remote jobs,
and
• act as an intermediary for remote requests.
Fig. 3. Logical Infrastructure
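
The four behaviours of a MetaDomain can be sketched as a small program. This is only an illustration under stated assumptions, not the scheduler developed in the initiative: a resource request is reduced to a number of processors, the class and method names are hypothetical, and a request is forwarded depth-first with a set of visited domains so that it cannot circulate forever in the redundant network.

```python
class MetaDomain:
    """Hypothetical MetaDomain that brokers processors for its site."""

    def __init__(self, name, free_processors):
        self.name = name
        self.free_processors = free_processors
        self.neighbours = []

    def connect(self, other):
        """Link two MetaDomains in both directions."""
        self.neighbours.append(other)
        other.neighbours.append(self)

    def request(self, processors, visited=None):
        """Return the name of the domain that granted the request, or None."""
        visited = set() if visited is None else visited
        if self.name in visited:
            return None  # this domain has already seen the request
        visited.add(self.name)

        # Satisfy the demand locally if possible; for a forwarded request
        # this is the same as offering local resources to a remote job.
        if self.free_processors >= processors:
            self.free_processors -= processors
            return self.name

        # Otherwise ask the other MetaDomains, acting as an intermediary
        # for domains that are not directly connected to the requester.
        for neighbour in self.neighbours:
            granted_by = neighbour.request(processors, visited)
            if granted_by is not None:
                return granted_by
        return None


# Illustrative chain of three sites with invented sizes.
site_a = MetaDomain("a", 4)
site_b = MetaDomain("b", 8)
site_c = MetaDomain("c", 64)
site_a.connect(site_b)
site_b.connect(site_c)

print(site_a.request(2))     # small job stays at the local site
print(site_a.request(32))    # large job is brokered through b to c
```

A real trading system would compare several offers instead of accepting the first one found; the sketch only shows how requests can travel through the network of MetaDomains.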
Note that we did not address the actual job submission. This process is
not necessarily a task of the scheduler. Once a suitable allocation of HPC
resources (including network resources) to a job has been found the actual
submission is independent of the scheduler. Also, the scheduling objectives
are not specified. As already mentioned there may not be a single scheduling
objective in a metacomputer. Each HPC component can define its specific
objectives. Similarly, each user may associate specific constraints with his
job like a deadline or a cost limit. It is the task of the trading system to
find matches between requests and offers. This way not all users and all
components are forced to fit into a single framework as is usually done in
conventional scheduling. Now, it is their responsibility to define their own
objectives. The implementation of the metacomputing scheduler need only
provide the framework for such a definition and it must be able to compare
any request with any offer to find a match.
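The comparison of a request with an offer can be illustrated with a small sketch. It is not the scheduler of the initiative: the class names, the fields `finish_time` and `cost`, and the rule that the cheapest matching remote offer wins are all assumptions made for illustration. Only the constraints (deadline, cost limit) and the order "local first, then other MetaDomains" come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    nodes: int                          # processors the job needs
    deadline: Optional[float] = None    # latest acceptable finish time
    cost_limit: Optional[float] = None  # highest acceptable price


@dataclass
class Offer:
    domain: str         # MetaDomain that makes the offer (hypothetical field)
    nodes: int          # processors offered
    finish_time: float  # estimated finish time of the job on this resource
    cost: float         # price asked for running the job


def matches(req: Request, offer: Offer) -> bool:
    """True if the offer satisfies every constraint attached to the request."""
    if offer.nodes < req.nodes:
        return False
    if req.deadline is not None and offer.finish_time > req.deadline:
        return False
    if req.cost_limit is not None and offer.cost > req.cost_limit:
        return False
    return True


def broker(req: Request, local: Optional[Offer],
           remote: List[Offer]) -> Optional[Offer]:
    """Satisfy local demand if possible, otherwise ask other MetaDomains."""
    if local is not None and matches(req, local):
        return local
    candidates = [o for o in remote if matches(req, o)]
    # Assumption of this sketch: among matching remote offers the cheapest wins.
    return min(candidates, key=lambda o: o.cost) if candidates else None
```

A request without constraints is served locally whenever the local resource is large enough; a request that no offer can satisfy yields `None`, which a MetaDomain could forward as an intermediary.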
In our metacomputer scheduling concept only the local HPC scheduler
is responsible for the load distribution on the corresponding HPC resource.
Therefore, it can also accept jobs from sources other than the metacomputer.
The metacomputer scheduler only addresses the load imbalance between
different HPC resources. To execute multi-site applications however, the
concurrent availability of different HPC resources and sufficient network
bandwidth between them becomes necessary as already described in Sec. 2.1.
For reasons of efficiency this requires resource reservation for future time
frames and the concept of guaranteed availability. Although most HPC
schedulers do not presently support such an approach it can be implemented
by using preemption (a checkpoint-restart facility) while still maintaining a
high system load.
In the project SCHEDULE [18] of the initiative a metacomputer scheduler
was designed using CORBA [15] to allow transparent and language
independent access to distributed management instances. For the evaluation of
different scheduling methods a simulation framework has further been
implemented. It is used to compare different scheduling algorithms regarding their
applicability for a metacomputing network. The benefit of possible technology
enhancements, like for example preemption, to the quality of the schedule
is also determined with the help of the simulator. As already mentioned
communication between resources during a multi-site job execution must be taken
into account as well. To this end the available network must be considered as
a limited resource that is managed by the schedulers in the MetaDomains.
The inclusion of this objective into the scheduler is part of the future work.
8 Status
The NRW-Metacomputing initiative has developed a functioning management
system that has been deployed in a pilot installation to connect the
parallel computers of the participating institutions, namely a Cray T3E
in Jülich, an IBM RS6000/SP in Dortmund, a Sun Enterprise 10000 in Cologne
and a Parsytec CC in Paderborn.
In this test phase the application projects of the initiative have been used
to show the benefits of such a metacomputer. These projects represent typical
examples of problems that are well suited for metacomputing. They can
easily be ported to different architectures and have a small network
communication footprint. The developed HPCM management software is going
into production use in 1999, providing public access to all users.
The Scheduling project provides a working interface to the mentioned
system types. Currently only single-site applications are supported as the
present implementation does not include reservation. However, simulations
have been executed to evaluate whether the backfilling strategy [7] can be
used to consider reservations. These simulations have yielded promising
results. As many commercial local schedulers are already using backfilling, only
small changes to these schedulers are required.
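How backfilling can respect a reservation may be sketched as follows. This is a generic backfilling admission test, not the rule used in the simulations reported here; the function name and all parameters are hypothetical. A waiting job may start immediately if it fits into the currently free nodes and either finishes before the reservation begins or uses only nodes the reservation does not need.

```python
def can_backfill(now, free_now, job_nodes, job_runtime,
                 reservation_start, spare_nodes_at_reservation):
    """Decide whether a job may start now without delaying a reservation.

    free_now: nodes idle at time `now`
    spare_nodes_at_reservation: nodes that stay unused even after the
        reservation has taken its share at `reservation_start`
    """
    if job_nodes > free_now:
        return False  # the job does not fit at all right now
    if now + job_runtime <= reservation_start:
        return True   # the job is gone before the reservation begins
    # The job overlaps the reservation, so it may only use spare nodes.
    return job_nodes <= spare_nodes_at_reservation
```

The test depends on the runtime estimate of the job; a job that exceeds its estimate would have to be preempted or the reservation would be delayed.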
9 Conclusion
It is the primary goal of metacomputing to provide users with easy access
to more HPC resources. This includes a platform independent simple user
interface. The user herself has the ability to exploit the flexibility of the
system to her advantage by clearly specifying the resource requirements of
her job. In order to benefit from multi-site computing she may need to apply
new programming paradigms.
The owner of HPC components in a metacomputer need only focus on
the maintenance of a single platform. He need not strive to satisfy all local
users with a limited budget as some specific demands can be forwarded to other
HPC installations within the network. Also, there is no need to maintain a
separate user interface. With an independent user interface the integration
of new resources will become easier. On the other hand the owner may face
some pressure to increase standardization. Although the approach guarantees
a high degree of flexibility he may also lose some control over the allocation
of local resources.
The manufacturers of HPC resources may see a decrease in sales of really
big machines, while it will become more common to buy midsize systems and
integrate them into an infrastructure of existing resources. The overall user
community will increase as more users will gain access to these resources.
In today's systems manageability and open interfaces to other management
systems are not a strong selling argument. This may change if more large
systems become part of a heterogeneous metacomputing environment.
References
1. A. Bachem, B. Monien, F. Ramme. Der Forschungsverbund NRW-
Metacomputing - verteiltes Höchstleistungsrechnen (1996),
http://www.uni-paderborn.de/pc2/nrw-mc/html_rep/html_rep.html
2. High Performance Computing and Communication (1997). NSTC.
http://www.hpcc.gov/pubs/blue97
3. IBM RS/6000 SP Product Line.
http://www.rs6000.ibm.com/hardware/largescale/
4. D. Becker, T. Sterling, D. Savarese, J. Dorband, U. Ranawak, C. Packer.
Beowulf: A parallel workstation for scientific computation (1995), Proceedings,
International Conference on Parallel Processing
5. Introducing NQE. (1998), Cray Research Publication, Silicon Graphics, Inc.
6. I. Foster, C. Kesselman. Globus: A metacomputing infrastructure toolkit
(1997), The International Journal of Supercomputer Applications and High
Performance Computing, 11(2), pp 115-128
7. Intel Microprocessors, Volume II (1991) Handbook. Intel Corporation
8. Pentium II Xeon[tm] Processor Technology Brief (1998), Intel Corporation
9. A. Keller, A. Reinefeld. CCS Resource Management in Networked HPC
Systems (1998), In Proceedings Heterogeneous Computing Workshop (HCW) at
IPPS/SPDP'98
10. S. Karin, S. Graham. The High Performance Continuum (Nov 1998),
Communications of the ACM, pp. 32 - 35
11. D. A. Lifka. The ANL/IBM SP Scheduling System, Springer LNCS 949,
Proceedings of the Job Scheduling Strategies for Parallel Processing Workshop,
IPPS'95, pp. 295 - 303
12. P. Messina, D. Culler, W. Pfeiffer, W. Martin, J. Oden, G. Smith. Architecture
(Nov 1998), Communications of the ACM, pp. 36 - 44
13. Robert C. Malone, Richard D. Smith, and John K. Dukowicz. Climate, the
Ocean, and Parallel Computing (1993), Los Alamos Science, No.21
14. Grand Challenges, National Challenges, and Multidisciplinary Challenges
(1998), NSF Report
http://www.cise.nsf.gov/general/workshops/nsf_gc.html
15. Object Management Group Document. The Common Object Request Broker:
Architecture and Specification (1998), Revision 2.2
16. SGI PowerChallenge XL Product Line
http://www.sgi.com/remanufactured/challenge, SGI
17. T. Sterling. Applications and Scaling of Beowulf-class Clusters (1998), Work-
shop on Personal Computers based Networks Of Workstations, IPPS' 98
18. U. Schwiegelshohn, R.Yahyapour. Resource Allocation and Scheduling in Meta-
systems, Springer LNCS 1593, Proceedings of the Distributed Computing and
Metacomputing Workshop, HPCN'99, Amsterdam, pp. 851 - 860
19. J. Dongarra, H. Meurer, E. Strohmaier. (Nov. 1998), TOP500 Supercomputing
Sites, http://www.top500.org
20. Fujitsu VPP700E,
http://www.fujitsu.co.jp/index-e.html
21. V. Sander, D. Erwin, V. Huber. High-Performance Computer Management
Based on Java (1998), Proceedings of High Performance Computing and Net-
working Europe (HPCN), Amsterdam, pp. 526 - 534
22. S. Zhou. LSF: load sharing in large-scale heterogeneous distributed systems
(1992), In Proceedings Workshop on Cluster Computing
Design and Evaluation of ParaStation2
Thomas M. Warschko and Joachim M. Blum and Walter F. Tichy
Institut für Programmstrukturen und Datenorganisation, Fakultät für Informatik,
Am Fasanengarten 5, Universität Karlsruhe, D-76128 Karlsruhe, Germany
Summary. ParaStation is a communications fabric to connect off-the-shelf
workstations into a supercomputer. This paper presents ParaStation2, an adaptation
of the ParaStation system (which was built on top of our own hardware) to the
Myrinet hardware. The main focus lies on the design and implementation of
ParaStation2's flow control protocol to ensure reliable data transmission at network
interface level, which is different from most other projects using Myrinet.
One-way latency is 14.5 µs to 18 µs (depending on the hardware platform) and
throughput is 50 MByte/s to 65 MByte/s, which compares well to other approaches.
At application level, we were able to achieve a performance of 5.3 GFLOP running
a matrix multiplication on 8 DEC Alpha machines (21164A, 500 MHz).
1. Introduction
ParaStation2 is a communication subsystem on top of Myricom's Myrinet
hardware [BCF+95] to connect off-the-shelf workstations and PCs into a
parallel supercomputer. The approach is to combine the benefits of a high-speed
MPP network with the excellent price/performance ratio and the standardized
programming interfaces (e.g. Unix sockets, PVM, MPI) of conventional
workstations. Well-known programming interfaces ensure portability over a
wide range of different systems. The integration of a high-speed MPP network
opens up the opportunity to eliminate most of the communication overhead.
ParaStation was originally developed for the ParaStation hardware, a
self-routing network with autonomous distributed switching, hardware
flow-control at link-level combined with a back-pressure mechanism, and a reliable
and deadlock-free transmission of variable sized packets (up to 512 byte). This
base system is now being adapted to the Myrinet hardware, which has a fully
programmable network interface and a much better base performance than
the classic ParaStation hardware (see section 2.). The major difference is
the absence of reliable data transmission, which has to be implemented at
network interface level on the Myrinet hardware (see sections 3. and 4.).
ParaStation offers as programming interfaces well-known and standardized
protocols (e.g. TCP and UDP Unix sockets) and programming environments
(e.g. MPI and PVM) at a reasonable and acceptable performance level
(see section 5.) rather than squeezing the most out of the hardware using a
communication layer with nonstandard semantics.
2. ParaStation vs. Myrinet
Table 2.1 presents a brief comparison between the ParaStation [WBT96] and
the Myrinet [BCF+95] hardware.
Table 2.1. Comparison between ParaStation and Myrinet
                     ParaStation        Myrinet
Technology           PCI-Bus adapter    PCI-Bus adapter & Switches
Topology             2D-Torus           hierarchical crossbar
Bandwidth            128 Mbit/s         1.28 Gbit/s
Flow control         Link level         Link level
Flow control policy  back-blocking      back-blocking & discard
Error detection      Parity             CRC
Error management     Fatal              implementation dependent
Interface            FIFO               SRAM
Processor            none (FPGA)        32bit RISC (LanAI)
ParaStation with its two incoming and two outgoing links naturally uses a 2D
torus as its network topology. The necessary switching elements (between the
X and Y dimension) are located on each ParaStation adapter and therefore no
central switch is needed. Myrinet instead uses cascadable switching elements
(8 or 16 way crossbars) and therefore has no limitations on the network
topology built. In terms of transmission speed there is also a clear advantage
for Myrinet. Both systems implement flow control at link level but with
different policies. Whereas ParaStation implements a strict back-blocking
mechanism between all nodes, Myrinet only blocks for a while and then starts
discarding packets. This behaviour helps to keep the network alive even in
case of faulty components, but also forces the implementation of a higher-level
flow control protocol to guarantee reliable transmission. ParaStation
simply blocks if there is not enough buffer space in the next node on the way
to the final destination and waits until the receiver starts accepting messages.
Reference [War98] proves that this behaviour is deadlock free and reliable, as
long as the receiver keeps consuming packets. As a consequence ParaStation
does not need any higher-level flow control mechanism.
Another major difference between ParaStation and Myrinet is the
programming interface. ParaStation provides a simple FIFO interface to send
and receive messages along with some flags describing the status of the
incoming and outgoing FIFO. Prior to a send operation the sender checks the
flags to ensure that the sender FIFO can accept a complete packet¹. If there
is enough space, it writes the complete packet into the FIFO and ParaStation's
flow control mechanism ensures that the packet will eventually make its
way to the receiver. On the receiving side the status flags indicate whether
a complete packet has arrived in the receiver FIFO. Thus, the receiver is
able to receive the whole packet at once rather than polling for individual
flits. Writing to and reading from the transmission FIFO is done by the CPU
(PIO²) rather than using the DMA³ engines. The Myrinet board uses a 32bit
RISC CPU called LanAI, fast SRAM memory (up to 1 MByte) and three
programmable DMA engines - two on the network side to send and receive
packets and one as interface to the host. The LanAI is fully programmable
(in C / C++) and the necessary development kit (especially a modified gcc
compiler) is available from Myricom. The kit opens up the opportunity to
implement and test a much broader design space for high speed transmission
protocols than with the ParaStation system. In fact this capability in
addition to the high performance of the Myrinet hardware was the main criterion
for choosing Myrinet as the hardware platform for ParaStation2.
¹ A packet is up to 512 bytes long.
3. Design considerations for ParaStation2
The first question to answer is how to interface the Myrinet hardware to
the rest of the ParaStation software, especially the upper layers with their
variety of implemented protocols (Ports, Sockets, Active Messages, MPI, PVM,
FastRPC, Java Sockets and RMI). There are three different approaches:
1. Emulating ParaStation on the Myrinet adapter: Simulating ParaStation's
transmission FIFO with a small LanAI program running on the Myrinet
adapter would not be a problem. But as ParaStation is using programmed
I/O to receive incoming packets this approach would lead to unacceptable
performance (see [BRB98b]).
2. Emulating ParaStation at software level: As the ParaStation system
already has a small hardware dependent software layer called HAL
(hardware abstraction layer), this approach allows the use of all Myrinet
specific communication features as well as a simple interface to the upper
protocol layers of the ParaStation system.
3. Designing a new system: This approach would lead to an ideal system
and probably the best performance, but we would have to rewrite or
redesign most parts of the ParaStation system.
Because of its simplicity, we chose the second strategy to interface the
existing ParaStation software to the Myrinet hardware. The second question to
answer is how to guarantee reliable transmission of packets with the Myrinet
hardware. As said before, the original ParaStation hardware offers reliable
and deadlock free packet transmission as long as the receiver keeps accepting
packets. Myrinet instead discards packets (after blocking a certain amount
of time) which may happen when the receiver is running out of resources or
is unable to receive packets fast enough. Additionally the Myrinet hardware
seems to lose packets under certain circumstances, e.g. in heavy bidirectional
traffic with small packets. The upper layers of the ParaStation system rely on
a reliable data transmission, so a low level flow control mechanism - either
within the Myrinet control program running on the LanAI processor or as
part of the HAL interface - is required.
² Programmed I/O
³ Direct Memory Access
4. Implementation of ParaStation2
The goal of this section is to explain the basics of the ParaStation2 protocol.
Most parts of the protocol are implemented in a Myrinet control program
(MCP) running on the Myrinet adapter. The protocol guarantees a reliable
data transmission so that only minor changes to the HAL have to be made
and it is possible to use all upper layers of the ParaStation system without
any changes.
4.1 Basic operation
Figure 4.1 shows the basic operation during message transmission of the
ParaStation2 protocol. The basic protocol has four independent parts: (a) the
interaction between the sending application and the sender network interface
(NI), (b) the interaction between the sending and the receiving NI, (c) the
interaction between the receiving NI and the receiving host, and (d) the
interaction between the receiving application and the host.
First, the sender checks if there is a free send buffer (step 1). This is
accomplished by a simple table lookup in the host memory, which reflects
the status of the buffers of the send ring located in the fast SRAM of the
network interface (Myrinet adapter). If there is buffer space available, the
sender copies (step 2) the data to a free slot of the circular send buffer located
in the network interface (NI) using programmed I/O. Afterwards the NI is
notified (a descriptor is written) that the used slot in the send ring is ready
for transmission and the buffer in host memory is marked as in transit. A
detailed description of the buffer handling is given in section 4.2. In step (3),
the NI sends the data to the network using its DMA engines.
When the NI receives a packet (step 4) it stores the packet in a free slot
of the receive ring using its receive DMA engine. The flow control protocol
ensures that there is at least one free slot in the receive ring to store the in-
coming packet. Once the packet is received completely and if there's another
free slot in the receive ring, the flow control protocol acknowledges the re-
ceived packet (step 5). The flow control mechanism is discussed in section 4.3.
As soon as the sender receives the ACK (step 6), it releases the slot in the
send ring and the host is notified (step 7) to update the status of the send
ring.
In the receiving NI the process of reading data from the network is
completely decoupled from the transmission of data to the host memory. When
a complete packet has been received from the network, the NI checks for a
free receive buffer in the host memory (step A). If there is no buffer space
available, the packet will stay in the NI until a host buffer becomes available.
Otherwise the NI copies the data into host memory using DMA and notifies
the host about the reception of a new packet by writing a packet descriptor
(step B). Concurrently, the application software checks (polls) for new
packets (step C) and eventually, after a packet descriptor has been written in step
(B), the data is copied to application memory (step D).
[Figure: Host A (sender) and Host B (receiver), each with a Myrinet interface
holding a send ring and a receive ring. Labelled steps: (1) check buffer,
(2) copy data & notify, (4) receive data, (5) send ACK, (A) check buffer,
(B) copy data & notify, (C) check for new packets, (D) receive data.]
Fig. 4.1. Data transmission in ParaStation2
Obviously, the data transmission phases in the basic protocol (steps 2, 3, 4,
and B) can be pipelined between consecutive packets. The ring buffers in the
NI (sender and receiver) are used to decouple the NI from the host processor.
At the sender, the host is able to copy packets to the NI as long as there is buffer
space available although the NI itself might be waiting for acknowledgements.
The NI uses a transmission window to allow a certain amount of outstanding
acknowledgements, which need not equal the size of the send ring.
At the receiver the NI receive ring is used to temporarily store packets if the
host is not able to process the incoming packets fast enough.
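The decoupling of send ring and transmission window can be pictured with a small counter model. It is only a sketch: the class `SenderNI` and its methods are hypothetical names, and real buffers, DMA and timeouts are left out. The host may fill slots as long as the ring has space, while the NI puts a packet on the wire only if fewer than `window` acknowledgements are outstanding.

```python
class SenderNI:
    """Counter model of the sender side: ring size and window are independent."""

    def __init__(self, ring_size, window):
        self.ring_size = ring_size
        self.window = window
        self.queued = 0       # slots filled by the host, not yet sent
        self.outstanding = 0  # packets sent but not yet acknowledged

    def host_copy(self):
        """Host copies a packet into a free slot (step 2); False if ring is full."""
        if self.queued + self.outstanding >= self.ring_size:
            return False
        self.queued += 1
        return True

    def ni_send(self):
        """NI sends a queued packet (step 3) if the window allows it."""
        if self.queued == 0 or self.outstanding >= self.window:
            return False
        self.queued -= 1
        self.outstanding += 1
        return True

    def ack(self):
        """An ACK arrives (step 6): the slot is released (step 7)."""
        if self.outstanding == 0:
            return False
        self.outstanding -= 1
        return True
```

With a ring of four slots and a window of two, the host can queue four packets although the NI stalls after sending two, and it resumes as soon as one acknowledgement arrives.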
4.2 Buffer handling
Each buffer or slot in one of the send or receive rings can be in one of the
following states:
IDLE: The buffer is empty and can be used to store a packet (up to 4096
byte).
INTRANSIT: This buffer is currently involved in a send or receive operation,
which has been started but which is still active.
READY: This buffer is ready for further operation, either a send to the receiver
NI (if it's a send buffer) or a transfer to host memory (if it's a receive
buffer).
RETRANSMIT: This buffer is marked for retransmission, because of a negative
acknowledgement or a timeout (send buffer only).
Figure 4.2 shows the state transition diagrams for both send and receive
buffers in the network interface.
[Figure: state transition diagrams for send buffer handling and receive buffer
handling. Legible edge labels: send, receive ACK, ACK received, NACK received,
error, data (out of sequence).]
Fig. 4.2. Buffer handling in sender and receiver
At the sender the NI waits until a send buffer becomes READY, which is
accomplished by the host after it has copied the data and the packet descriptor
to the NI (step 2 in figure 4.1). After the buffer becomes READY the NI starts
a send operation (network DMA) and marks the buffer INTRANSIT. When
an acknowledgement (ACK) for this buffer arrives (step 6 in figure 4.1), the
buffer is released (step 7) and marked IDLE. If a negative acknowledgement
(NACK) arrives or the ACK does not arrive in time (or gets lost) the buffer is
marked for retransmission (RETRANSMIT). The next time the NI tries to send
a packet it sees the RETRANSMIT buffer and resends this buffer, changing the
state to INTRANSIT again. This RETRANSMIT - INTRANSIT cycle may happen
several times until an ACK arrives and the buffer is marked IDLE.
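The life cycle of a send buffer described above can be written down as a transition table. The sketch below is illustrative only: the event names (`host_copied`, `send`, `ack`, `nack`, `timeout`) are labels chosen here, not identifiers from the Myrinet control program.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    READY = "READY"
    INTRANSIT = "INTRANSIT"
    RETRANSMIT = "RETRANSMIT"


# (current state, event) -> next state for a send buffer
SEND_TRANSITIONS = {
    (State.IDLE, "host_copied"): State.READY,      # step 2: data and descriptor written
    (State.READY, "send"): State.INTRANSIT,        # step 3: network DMA started
    (State.INTRANSIT, "ack"): State.IDLE,          # steps 6 and 7: slot released
    (State.INTRANSIT, "nack"): State.RETRANSMIT,
    (State.INTRANSIT, "timeout"): State.RETRANSMIT,
    (State.RETRANSMIT, "send"): State.INTRANSIT,   # resend on the next send attempt
}


def step(state, event):
    """Return the next state; events that do not apply leave the state unchanged."""
    return SEND_TRANSITIONS.get((state, event), state)
```

Feeding the events `host_copied, send, timeout, send, ack` through `step` walks a buffer once around the RETRANSMIT - INTRANSIT cycle and back to IDLE.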
At the receiver the buffer handling is quite similar (see figure 4.2). As
soon as the NI sees an incoming packet it starts a receive DMA operation
and the state of the associated buffer changes from IDLE to INTRANSIT (see
step 4 in figure 4.1). Assuming that the received packet contains user data, is
not corrupted, and has a valid sequence number⁴ the NI checks for another
free buffer in the receive ring. If there is another free buffer it sends an
ACK back to the sender and the buffer is marked READY. Otherwise a NACK
is sent, the packet discarded and the buffer released immediately (marked
IDLE). The check for a second free buffer in the receive ring ensures that
there is at least one free buffer to receive incoming packets anytime, because
any packet eating up the last buffer will be discarded. When the received
packet contains protocol data (ACK or NACK), the NI processes the packet
and releases the buffer. In case of an error (CRC) the buffer is marked IDLE
immediately without further processing. If the received data packet does not
have a valid sequence number, the packet is discarded and the sender is
notified by sending a NACK back. Thus, the receiver refuses to accept data
out of sequence and waits until the sender resends the missing packet.
⁴ For a discussion of the ACK/NACK protocol see section 4.3.
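The rule that a data packet is only accepted when a second buffer is still free can be sketched as follows. The receive ring is modelled as a plain list of state names and the function name is hypothetical; the sequence check is reduced to a boolean here.

```python
def on_data_packet(ring, crc_ok=True, seq_ok=True):
    """Handle one incoming data packet; return "ACK", "NACK" or None.

    ring: list of buffer states ("IDLE", "INTRANSIT", "READY"); the flow
    control rule guarantees that at least one entry is "IDLE" on entry.
    """
    slot = ring.index("IDLE")
    ring[slot] = "INTRANSIT"      # receive DMA in progress (step 4)
    if not crc_ok:
        ring[slot] = "IDLE"       # corrupted: drop without any reply
        return None
    if not seq_ok:
        ring[slot] = "IDLE"       # out of sequence: discard and notify sender
        return "NACK"
    if "IDLE" in ring:            # is there a second free buffer?
        ring[slot] = "READY"      # keep the packet for transfer to the host
        return "ACK"
    ring[slot] = "IDLE"           # last buffer: discard so one slot stays free
    return "NACK"
```

Starting from `["IDLE", "IDLE"]`, the first packet is acknowledged and the ring becomes `["READY", "IDLE"]`; a second packet arriving before the host drains the ring is answered with a NACK and the ring is unchanged, so one slot always remains free.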
4.3 Flow control protocol
ParaStation2 uses a flow control protocol with a fixed sized transmission
window and 8 bit sequence numbers (related to individual sender/receiver
pairs), where each packet has to be acknowledged (either with a positive or
a negative acknowledgement) in combination with a timeout and
retransmission mechanism in case that an acknowledgement gets lost or does not arrive
within a certain amount of time. The protocol assumes the hardware to be
unreliable and is able to deal with any number of corrupted or lost
packets (containing either user data or protocol information). Table 4.1 gives an
overview of possible cases within the protocol, an explanation of each case as
well as the resulting action initiated.
Table 4.1. Packet processing within receiver
packet type  sequence check  explanation         resulting action
DATA         <               lost ACK            resend ACK
             =               ok                  check buffer space (see fig 4.2)
             >               lost data           ignore & send NACK
ACK          <               duplicate ACK       ignore packet
             =               ok                  release buffer
             >               previous ACK lost   ignore packet
NACK         none                                mark buffer for retransmission
CRC          none            error detected      ignore packet
When a data packet is received, the NI compares the sequence number of
the packet with the assumed sequence number for the sending node. If the
numbers are equal, the received packet is the one that is expected and the
NI continues with its regular operation. A received sequence number smaller
than expected indicates a duplicated data packet caused by a lost or late
ACK. Thus the correct action to take is to resend the ACK, because the
sender expects one. If the received sequence number is larger than expected,
the packet with the correct sequence number has been corrupted (CRC) or
lost. As the protocol does not have a selective retransmission mechanism
the packet is simply discarded and the sender is notified with a negative
acknowledgement (NACK). Thus, this packet will be retransmitted later,
either because the sender got the NACK, or because of a timeout. As the
missing packet also causes a timeout at the sending side, the packets will
eventually arrive in the correct order.
On the reception of an ACK packet, the NI also checks the sequence
number and if it is ok it continues processing and releases the acknowledged
buffer. If the received sequence number is smaller than assumed, we've
received a duplicated ACK because the sender ran into a transmission timeout
before the correct ACK was received and the receiver has resent an ACK
upon the arrival of an already acknowledged data packet⁵. The response in
this case is simply to ignore the ACK. A received sequence number larger
than what is expected indicates that the correct ACK has been corrupted or
lost. Thus the action taken is to ignore the ACK, but the associated buffer
is marked for retransmission to force the receiver to resend the ACK. The
buffer associated with the assumed (and missing) ACK will time out and be
resent which also forces the receiver to resend the ACK.
A received NACK packet does not need sequence checking; the associated
buffer is marked for retransmission as long as it is in the INTRANSIT state.
Otherwise the NACK is ignored (the buffer is in RETRANSMIT state anyway). In
case of a CRC error the packet is dropped immediately and no further action
is initiated, because the protocol is unable to detect errors in the protocol
header.
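The case analysis of Table 4.1 can be condensed into a small dispatch function. This is a sketch under assumptions: the paper does not say how 8 bit sequence numbers are compared across wraparound, so `seq_cmp` uses a modulo-256 half-range comparison chosen here for illustration, and the returned strings merely repeat the "resulting action" column of the table.

```python
def seq_cmp(received, expected):
    """Compare 8 bit sequence numbers: -1 (smaller), 0 (equal), +1 (larger).

    Assumption of this sketch: numbers up to 127 steps ahead of `expected`
    (modulo 256) count as larger, everything else as smaller.
    """
    diff = (received - expected) & 0xFF
    if diff == 0:
        return 0
    return 1 if diff < 128 else -1


def receiver_action(ptype, received=None, expected=None, crc_ok=True):
    """Return the action of Table 4.1 for one incoming packet."""
    if not crc_ok:
        return "ignore packet"                   # CRC error: no sequence check
    if ptype == "NACK":
        return "mark buffer for retransmission"  # only if buffer is INTRANSIT
    order = seq_cmp(received, expected)
    if ptype == "DATA":
        return {-1: "resend ACK",
                0: "check buffer space",
                1: "ignore & send NACK"}[order]
    if ptype == "ACK":
        return {-1: "ignore packet",
                0: "release buffer",
                1: "ignore packet"}[order]
    raise ValueError("unknown packet type: %r" % ptype)
```

For example, a DATA packet with sequence number 0 while 255 is expected is treated as "larger" under the wraparound assumption and is answered with a NACK.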
The resulting protocol is able to handle any number of corrupted or lost
packets containing either user data or protocol information, as long as the NI
and the connection between the incorporated nodes is working. The protocol
was developed to ensure reliability of data transmission at NI level, not to
handle hardware failures in terms of fault tolerance. The protocol itself can be
optimized in some cases (e.g. better handling of ACKs with a larger sequence
number), but this is left to future implementations. In comparison to existing
protocols, this protocol can roughly be classified as a variation of the TCP
protocol.
5. Basic performance of the protocol hierarchy
In table 5.1, performance figures of all software layers in the ParaStation2
system are presented⁶. The various levels presented are the hardware abstraction
layer (HAL), which is the lowest layer of the hierarchy, the so called ports and
TCP layers, which are built on top of the HAL, and standardized
communication libraries such as MPI and PVM, which are optimized for ParaStation2
and built on top of the ports layer. Latency is calculated as round-trip/2 for
a 4 byte ping-pong and throughput is measured using a pairwise exchange
for large messages (up to 32K). N/2 denotes the packet size in bytes when
half of the maximum throughput is reached. The performance data is given
for three different host systems, namely a 350MHz Pentium II running Linux
(2.0.35), a 500MHz and a 600MHz Alpha 21164 system running Digital Unix
(4.0D).
⁵ This case may sound strange, but we've seen this behaviour several times.
⁶ For a detailed discussion of the ParaStation2 protocol hierarchy, have a look at
our paper called ParaStation User Level Communication in these proceedings.
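The two derived quantities, latency as round-trip/2 and N/2, can be computed from raw measurements as sketched below. The function names are hypothetical, and taking the smallest measured size that reaches half of the peak (without interpolating between sizes) is an assumption of this sketch.

```python
def one_way_latency(round_trip_us):
    """Latency as defined in the text: half of the ping-pong round-trip time."""
    return round_trip_us / 2.0


def n_half(samples):
    """Return N/2 from (packet_size_bytes, throughput) measurements.

    N/2 is taken here as the smallest measured packet size whose throughput
    reaches at least half of the maximum throughput in the series.
    """
    peak = max(throughput for _, throughput in samples)
    for size, throughput in sorted(samples):
        if throughput >= peak / 2.0:
            return size
```

With a made-up series `[(64, 10.0), (256, 30.0), (1024, 50.0), (4096, 56.0)]` the peak is 56, so `n_half` returns 256, the first size at or above 28.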
Table 5.1. Basic performance parameters of ParaStation2
                                         Programming interface
System        Measurement            HAL    Ports  TCP    MPI    PVM
Pentium II    Latency [µs]           14.5   18.7   20.2   25
350 MHz       Throughput [MByte/s]   56     48     51     43
              N/2 [Byte]             256    500    500    400
Alpha 21164   Latency [µs]           17.5   24     24     30     29
500 MHz       Throughput [MByte/s]   65     55     57     50     49
              N/2 [Byte]             512    500    500    500    1000
Alpha 21164   Latency [µs]           18.0   24     25     25     28
600 MHz       Throughput [MByte/s]   64     56     59     51     48
              N/2 [Byte]             350    700    700    500    700
The latency at HAL level of 14.5 µs to 18 µs is somewhat higher than for
comparable systems such as LFC (11.9 µs) or FM (13.2 µs) [BRB98a]. This
is because, first, neither LFC nor FM copies the data it receives to the
application and, second, both LFC and FM incorrectly assume Myrinet to be reliable.
The maximum throughput of ParaStation2, 56 MByte/s to 65 MByte/s, lies
between the throughput of FM (40.5 MByte/s) and LFC (up to
70 MByte/s). If LFC or FM starts copying the received data to the
application (as ParaStation does) the throughput decreases for large messages to
30 - 35 MByte/s [BRB98a] whereas ParaStation2's throughput keeps quite
stable close to maximum level (~50 MByte/s).
Switching from a single-programming environment (HAL) to multi-programming
environments (upper layers) results in a slight performance degradation
regarding latency as well as throughput. The increased
latencies are due to locking overhead to ensure correct interaction between
competing applications. The decreased throughput is caused by additional
buffering and a complex buffer management.
6. Performance at application level
Focusing only on latency and throughput is too narrow for a complete evaluation.
It is necessary to show that a low-latency, high-throughput communication
subsystem also achieves a reasonable application efficiency. For
this reason we installed the widely used and publicly available ScaLAPACK 7
library [CDD+95], which uses both BLACS 8 [DW95] and MPI as communication
subsystem, on ParaStation2. The benchmark we use is the parallel
matrix multiplication for general dense matrices from the PBLAS library,
which is part of ScaLAPACK. Table 6.1 shows the performance in MFLOPS
running on our 8 processor DEC-Alpha cluster (500 MHz, 21164A).
Table 6.1. Parallel matrix multiplication on ParaStation2
Performance in MFlop (efficiency)
Problem    Uniprocessor   1 Node        2 Nodes        4 Nodes        6 Nodes        8 Nodes
size (n)
1000       782 (100%)     731 (93.5%)   1276 (81.6%)   2304 (73.6%)   3243 (69.1%)   3871 (61.9%)
2000       785 (100%)     743 (94.6%)   1359 (86.6%)   2546 (81.1%)   3582 (76.1%)   4683 (74.6%)
3000       790 (100%)     755 (95.6%)   1396 (88.4%)   2700 (85.4%)   3908 (82.4%)   4887 (77.3%)
4000       772 (100%)     -             1398 (90.5%)   2694 (87.2%)   4044 (87.3%)   5337 (86.4%)
First, we measured the uniprocessor performance of a highly optimized matrix
multiplication (cache aware assembler code), which acts as a reference to
calculate the efficiency of the parallel versions. A uniprocessor performance of
772 to 790 MFLOP on a 500 MHz processor proves that the program is highly
optimized (IPC of more than 1.5). Obviously the parallel version executed
on a uniprocessor has to be somewhat slower, but the measured efficiency
of 93.5% to 95.6% is very high. Using more nodes, the absolute performance
in MFLOP increases steadily while the efficiency decreases smoothly. The
maximum performance achieved was 5.3 GFLOP using 8 nodes, which is quite
good compared to the 10.1 GFLOP of the 100 node Berkeley NOW cluster 9.
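The efficiency figures in Table 6.1 can be reproduced from the raw rates. The following is a minimal sketch, assuming that efficiency is the measured parallel rate divided by the number of nodes times the uniprocessor reference rate; the function name is ours and not part of ParaStation2 or ScaLAPACK.

```python
def parallel_efficiency(parallel_mflop, nodes, uniprocessor_mflop):
    """Efficiency as a fraction: measured rate / (nodes * reference rate)."""
    return parallel_mflop / (nodes * uniprocessor_mflop)


# Values copied from Table 6.1 for n = 4000 (uniprocessor reference: 772 MFLOP).
measurements = {2: 1398, 4: 2694, 6: 4044, 8: 5337}
efficiencies = {
    nodes: parallel_efficiency(rate, nodes, 772)
    for nodes, rate in measurements.items()
}

for nodes, eff in efficiencies.items():
    print(f"{nodes} nodes: {eff:.1%}")
```

Under this assumption the sketch yields the percentages listed in the n = 4000 row of the table.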
7. Related work
There are several approaches which use Myrinet as a hardware interconnect to
build parallel systems: Active Messages and the Berkeley NOW cluster, especially
Active Messages-II [CMC97], Illinois Fast Messages (FM) [PLC95], the
7 Scalable Linear Algebra Package.
8 Basic Linear Algebra Communication Subroutines.
9 See http://now.cs.berkeley.edu
basic interface for parallelism from the University of Lyon (BIP) [PT97], the
link-level flow control protocol (LFC) [BRB98a] from the distributed ASCI
supercomputer, PM [TOHI98] from the Real World Computing Partnership
in Japan, the virtual memory mapped communication VMMC and VMMC-
II [DBLP97] from Princeton University, Hamlyn [BJM+96], the user-level
network interface U-Net [vEB+95], and Trapeze [YCGL97].
The major difference between these projects and ParaStation2 is twofold.
First, ParaStation2 focuses on a variety of standardized programming inter-
faces, such as UNIX sockets (TCP and UDP), MPI, PVM, and Java sockets
and RMI with a reasonable performance at each level rather than a single
purpose, nonstandard, proprietary interface which squeezes the most out of
the hardware for a specific application.
The second difference is due to the reliability assumptions made about the
Myrinet hardware (see figure 7.1).
[Figure 7.1: decision diagram with the labels "Assume Myrinet is reliable?",
"Reliability strategy" (prevent buffer overflow / recovery; application, NI, host),
"Reliability protocol?" (no: unreliable; yes), and the systems AM-II, VMMC-2
and ParaStation2.]
Fig. 7.1. Myrinet and Reliability (from [BRB98b])
Most approaches assume Myrinet to be reliable or pass the unreliability on
to the application layer. Only AM-II, VMMC-2 and ParaStation2 accept
the unreliability of Myrinet and provide mechanisms to ensure reliable data
transmission. The reason why most projects assume Myrinet to be reliable
is mainly due to the rather low error rate at hardware level. We have ob-
served that the link-level flow control mechanism seems to fail by overwriting
or dropping complete packets under certain circumstances. The only way to
detect this behaviour is to count packets or to use sequence numbers within
packets, because the hardware neither blocks the transmission nor signals
any error. Furthermore, the hardware does not distinguish between data
and control packets when dropping one of these. Thus, a simple flow control
protocol to prevent buffer overflow, which assumes that control packets will be
delivered reliably, is not sufficient to ensure reliable transmission. Although
AM-II [CMC97] and VMMC2 [DBLP97] do not explicitly state problems
with the Myrinet hardware, they introduced a protocol to ensure reliable
communication as they switched from AM to AM-II or VMMC to VMMC2
respectively. The same holds for ParaStation2, which also started out using strict
back-blocking until serious problems arose.
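The detection of silently dropped packets by sequence numbers can be illustrated with a small sketch. It is not the ParaStation2 ACK/NACK protocol itself, only a hypothetical receiver-side check under the assumptions that packets carry consecutive integer sequence numbers and that the numbers do not wrap around; the function name is ours.

```python
def find_missing(received_seqs, first_expected=0):
    """Return the sequence numbers skipped in a stream of received packets."""
    missing = []
    expected = first_expected
    for seq in received_seqs:
        if seq > expected:
            # A gap: every number between expected and seq was never delivered.
            missing.extend(range(expected, seq))
        if seq >= expected:
            expected = seq + 1
        # seq < expected would be a duplicate or late packet; it is ignored here.
    return missing


# Packets 2, 5 and 6 were dropped without any error signal from the hardware.
print(find_missing([0, 1, 3, 4, 7]))
```

A receiver that finds such a gap could then request retransmission of the missing packets, which is the role a negative acknowledgement plays in a retransmission protocol.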
8. Conclusion and further work
In this paper we have presented the design of ParaStation2, especially the
ACK/NACK retransmission protocol to ensure reliable data transmission
at network interface level. The advantage of this approach was that we could
reuse the ParaStation code with minor changes and get the complete
functionality of the ParaStation system (especially the variety of standardized
and well-known interfaces) for free.
The evaluation of ParaStation2 shows that ParaStation2 compares well
with other approaches in the cluster community using Myrinet. ParaStation2
is not the fastest system in terms of pure latency and throughput, but in
contrast to most other approaches it offers a reliable interface, which is - in
our experience - more important to the user than an ultra high-speed, but
unreliable interface. At the level of application performance, the 5.3 GFLOPS
has not been achieved before with this small number of nodes.
The future plans for ParaStation2 are, first, to optimise the interface between
the software and the Myrinet hardware to get even more performance out
of the system. Second, ports to other platforms such as Sparc/Solaris and
Alpha/Linux are under way.
References
[BCF+95] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik,
Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A
Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29-36, February
1995.
[BJM+96] G. Buzzard, D. Jacobson, M. MacKey, S. Marovich, and J. Wilkes. An
Implementation of the Hamlyn Sender-Managed Interface Architecture. In The
2nd USENIX Symp. on Operating Systems Design and Implementation, pages
245-259, Seattle, WA, October 1996.
[BRB98a] Raoul A. F. Bhoedjang, Tim Rühl and Henri E. Bal. LFC: A Communication
Substrate for Myrinet. Fourth Annual Conference of the Advanced School
for Computing and Imaging, June 1998, Lommel, Belgium.
[BRB98b] Raoul A. F. Bhoedjang, Tim Rühl and Henri E. Bal. User-Level Network
Interface Protocols. IEEE Computer, 31(11), pp. 52-60, November 1998.
[CDD+95] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet,
K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra
library for distributed memory computers - design issues and performance.
Technical Report UT CS-95-283, LAPACK Working Note #95, University of
Tennessee, 1995.
[CMC97] B. Chung, A. Mainwaring, and D. Culler. Virtual Network Transport
Protocols for Myrinet. In Hot Interconnects'97, Stanford, CA, April 97.
[DBLP97] C. Dubnicki, A. Bilas, K. Li, and J. Philbin. Design and Implementation
of Virtual Memory-Mapped Communication on Myrinet. In 11th Int. Parallel
Processing Symposium, pages 388-396, Geneva, Switzerland, April 1997.
[DW95] J. Dongarra and R. C. Whaley. A User's Guide to the BLACS v1.0. Technical
Report UT CS-95-281, LAPACK Working Note #94, University of Tennessee,
1995.
[PLC95] S. Pakin, M. Lauria, and A. Chien. High Performance Messages on
Workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing '95, San
Diego, CA, December 1995.
[PT97] L. Prylli and B. Tourancheau. Protocol Design for High Performance
Networking: A Myrinet Experience. Technical Report 97-22, LIP-ENS Lyon, July
1997.
[TOHI98] H. Tezuka, F. O'Carroll, A. Hori, and Y. Ishikawa. Pin-down Cache: A
Virtual Memory Management Technique for Zero-copy Communication. In 12th
Int. Parallel Processing Symposium, pages 308-314, Orlando, FL, March 1998.
[YCGL97] K. Yocum, J. Chase, A. Gallatin, and A. Lebeck. Cut-Through Delivery
in Trapeze: An Exercise in Low-Latency Messaging. In The 6th Int. Symp. on
High Performance Distributed Computing, Portland, OR, August 1997.
[vEB+95] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level
Network Interface for Parallel and Distributed Computing. In Proc. of the 15th
Symp. on Operating System Principles, pages 303-316, Copper Mountain, CO,
December 1995.
[WBT96] Thomas M. Warschko, Joachim M. Blum, and Walter F. Tichy. The
ParaStation Project: Using Workstations as Building Blocks for Parallel Computing.
In Proceedings of the International Conference on Parallel and Distributed
Processing, Techniques and Applications (PDPTA '96), pages 375-386,
Sunnyvale, CA, August 9-11, 1996.
[War98] Thomas M. Warschko. Effiziente Kommunikation in Parallelrechnerarchitekturen.
Dissertation, Universität Karlsruhe, Fakultät für Informatik. Published
as: VDI Fortschrittberichte, Reihe 10: Informatik / Kommunikationstechnik
Nr. 525. ISBN: 3-18-352510-0. Ms 1998.
Broadcast Communication in ATM Computer
Networks and Mathematical Algorithm
Development
Michael Weller
Institute for Experimental Mathematics, Ellernstr. 29, 45326 Essen, Germany
Abstract. This article emphasizes the importance of collective communication and
especially broadcasts in mathematical algorithms. It points out that algorithms in
discrete mathematics usually transmit higher amounts of data faster than typical
floating point algorithms. It describes the o.tel.o ATM Testbed of the GMD National
Research Center for Information Technology, St. Augustin, and the Institute
of Experimental Mathematics, Essen, and the experiences with distributed computing
in this network. It turns out that the current implementations of IP over ATM
and libraries for distributed computing are not yet suited for high performance
computing. However, ATM itself is able to perform fast broad- or multicasts. Hence it
might be worthwhile to design a message passing library based on native ATM.
1 Broadcasts in mathematical algorithms
Distributed and parallel computing plays an important role in chemistry,
physics, engineering and thus numerical mathematics. However, there are
parts of pure and discrete mathematics like cryptography, computational
group theory and representation theory which can also benefit from using
a computer. In general, the algorithms transform the original problem to a
huge problem in linear algebra over a finite field [2].
Dense equation systems in 300,000 or more variables [3,4] or the computation
and enumeration of hundreds of millions of vectors or even subspaces [5,6]
show up easily. However, an entry of a matrix or vector can often be realized
by a small number of bits. Operations on the machine words representing
small vectors of such entries are either done by integer additions, logical bit
operations, or (in the more complex cases) by table lookups.
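As an illustration of such word-level operations, the following sketch treats a Python integer as a packed vector over the field with two elements, one bit per entry. This is only the simplest case, chosen by us for illustration; the cited programs and larger fields (where table lookups come in) are not reproduced here, and the function names are hypothetical.

```python
def gf2_add(u, v):
    """Add two packed GF(2) vectors: addition modulo 2 is a bitwise XOR."""
    return u ^ v


def gf2_dot(u, v):
    """Inner product over GF(2): the parity of the bitwise AND."""
    return bin(u & v).count("1") % 2


u = 0b1011  # the vector (1, 0, 1, 1), one bit per entry
v = 0b0110  # the vector (0, 1, 1, 0)

print(format(gf2_add(u, v), "04b"))
print(gf2_dot(u, v))
```

One machine instruction thus processes as many field entries as fit into a word, which is why the arithmetic is cheap compared with the cost of moving the data.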
For the hardware involved in solving such problems this means that there
are many trivial (hence fast) arithmetical operations one has to perform. In
return, the problems are big themselves and parallelization only makes sense
if each computing node receives a substantial amount of data to deal with. For
the communication in a distributed application this means that it usually has
to transfer a huge amount of data fast. Therefore the computing nodes must
have large amounts of memory with a high memory bandwidth to satisfy the
speed requirements of the CPU handling many trivial integer operations. If
table lookups are involved in the program, a cache is often unable to hide the
lack of actual memory bandwidth from the application program.
Numerical applications typically involve slower operations on floating
point numbers and no table lookups, which reduces the imbalance between memory,
communication and CPU speed. However, there are no problems with
numerical stability in discrete mathematics.
There are many interesting computational problems in discrete mathematics
requiring access to parallel supercomputers. Since such machines are rare, it
seems interesting to couple smaller parallel machines and workstation clusters
of different institutions over a WAN to perform these computational
tasks.
Many of the linear algebra algorithms use broadcasts and other group
communications, like the algorithm sketched in Fig. 1.
[Figure 1: diagram of the current node, with the boxes "Solve" and "History" and
the pivot[] arrays, and a broadcast arrow to the other nodes.]
Fig. 1. Data flow of parallel gaussian elimination over a finite field: One of the
processes solves a few columns of the equation system residing in its memory. It
broadcasts instructions to the other nodes which perform the same operations as
the sender on the remaining columns. Then control passes to the next node. The
processes perform the pre-solving and broadcasting in a round-robin fashion to
achieve a better load distribution. The algorithm is described in [4] in more detail
and was also used in a world record computation achieved in [3]
Special networks designed for parallel computers like the CM-5 by Thinking
Machines are able to perform broadcasts and reduction operations in
hardware. Currently it does not seem possible to perform reductions using standard
network hardware. Also, broadcasts appear more often in our problems
and are even used during the initialization phase of algorithms which do not
use them afterwards.
2 Implementation of Broadcasts
Ideally, the application programmer should not care about the implementation
of broadcasts. This ought to be done by a communication library.
Typical implementations are shown in Fig. 2. Often a tree like implementation
is used. Each node which has already received a copy of the data helps
distributing it. While this reduces the broadcast to log2(n) steps of point to
point communications, there is more than one node sending at the same time.
This causes collisions in any non switched network connecting the nodes and
leads to network congestion if, for example, several hosts in the same cluster
send data to hosts in a cluster at another location over a WAN connection.
[Figure 2: three panels, "Tree structured broadcast", "Simple (PVM) broadcast"
and "Cyclic broadcast", each drawn on a non-switched network (Ethernet).]
Fig. 2. Different implementations of broadcasts. From left to right: Elaborate
implementation using a tree like distribution of data in log2(n) steps but causing
collisions in non switched networks (some MPI, POE on old IBM SP), simple
implementation for non switched networks (PVM), cyclic communication done by the
application
Just sending the data in a loop to all n - 1 recipients avoids such congestion,
but is much slower. It is worth noting that this way is still faster
than using the tree broadcast on a shared media Ethernet. Not only must
the tree broadcast degenerate to a sequential scheme here, but the collisions
reduce the efficiency even more. We tried several types of broadcast in our
experiment shown in Fig. 1. Using sequential broadcast from the beginning
on shared media 10Mbps Ethernet required 8h 43m, whereas the attempt to use
a tree broadcast required 11h 44m.
In that experiment we obtained good results using a cyclic pattern for the
sequential broadcast. The master only sends the data to a successor, then
resumes its computational work. This has benefits as the master already
had to do some precomputation to find the data to be broadcast. That
is, in a sense it is already behind the other nodes and must not be delayed
any further. Its successor then sends the data to the next node and so on,
until all nodes have received the data. As we are using a round robin method to
move the master node in this algorithm anyway, the cyclic pattern fits in
nicely. This observation was also made with numerical solvers of floating
point equation systems.
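The three broadcast patterns can be compared with a small scheduling sketch. It only counts rounds and concurrent senders for an idealized model in which every transfer takes one round; it does not model collisions or timing, and the node numbering (root is node 0) and function names are assumptions made for illustration.

```python
def tree_schedule(n):
    """Tree broadcast: in each round every node holding the data sends one copy."""
    rounds, have = [], 1  # nodes 0 .. have-1 already hold the data
    while have < n:
        sends = [(src, src + have) for src in range(have) if src + have < n]
        rounds.append(sends)
        have += len(sends)
    return rounds


def sequential_schedule(n):
    """Simple broadcast: the root sends to all n - 1 recipients in a loop."""
    return [[(0, dst)] for dst in range(1, n)]


def cyclic_schedule(n):
    """Cyclic broadcast: each node forwards the data to its successor."""
    return [[(i, i + 1)] for i in range(n - 1)]


def root_sends(schedule):
    """Number of transfers the root (node 0) has to perform itself."""
    return sum(1 for sends in schedule for src, _ in sends if src == 0)


for name, make in [("tree", tree_schedule),
                   ("sequential", sequential_schedule),
                   ("cyclic", cyclic_schedule)]:
    s = make(8)
    print(name, "rounds:", len(s),
          "max concurrent senders:", max(len(r) for r in s),
          "sends by root:", root_sends(s))
```

In this model the tree needs the fewest rounds but has several nodes sending at once, which is exactly what causes collisions on a shared medium, while the cyclic pattern keeps one sender per round and occupies the root for a single transfer only.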
3 ATM and distributed computing
Fig. 3. The Essen - St. Augustin o.tel.o ATM-Testbed
The Institute of Experimental Mathematics (IEM) is connected to an
ATM Testbed as shown in Fig. 3. The carrier o.tel.o provides a 100km
155Mbps connection to the GMD in St. Augustin and a short 622Mbps connection
through the city of Essen into the building of the main computer
centre (HRZ) of the University of Essen. Classical IP over ATM is used to
run TCP/IP over this network. It turned out that any routers in this network
have to be avoided and LAN Emulation is not to be used to achieve sensible
performance. This way we can reach up to 80Mbps file transfers (each node
has a 155Mbps adapter).
It is now possible to use the standard libraries for distributed computing
over TCP/IP for parallel computing in such an environment. However, we
only obtained bad performance this way. Only a peak bandwidth of 7.5Mbps
could be measured when using PVM. MPI-CH using P4 for communication
only achieved 0.6Mbps of maximum transmission bandwidth. There also exists
a package PLUS [1] which interconnects MPI implementations of different
workstation clusters or parallel computers over TCP/IP. It could not be
tested in the experiment of Fig. 3 as it is not yet available for AIX, but its
creators already tested it in a 34Mbps ATM environment. Scaling the performance
there to 155Mbps we would achieve only 11.45Mbps, which is still
less than a plain TCP/IP file transfer can achieve.
After the period of low-level ATM tests in the joint project with the
lab of our carrier o.tel.o and support of Siemens and GN Nettest described
below, the dark fibre to the main computer centre of Essen University was
built and we did our experiments again, this time running them on several
nodes of all three institutions of Fig. 3 at the same time. Doing so, we found
no significant technical difference in running our application on nodes of
one, two or three institutions. However, as more people and institutions were
involved, it became much more difficult to ensure a correct setup of routing,
host names and network for the computations.
Also, as some time had passed, firmware upgrades took place on the switch
of the institute (V.3.1.0 to V.3.2.1), and the GMD machines were upgraded
to AIX 4.3 from 4.2. Other ATM-related fixes of IBM were applied to the
AIX 4.2 machines in Essen.
This way, we were able to measure a peak bandwidth under MPI-CH
using P4 of 93Mbps during the initialization of a parallel application actually
performing an MPI_Bcast. We used the same program, even the same binary,
used for the former tests. Only switch firmware and the operating system on
some of the nodes had changed. This program used to need 47 minutes (!)
to broadcast 75MB of data to six to eight nodes. After the upgrades this time
was reduced to 30 seconds.
Scattering an equation system of 800MB on 6 nodes still required more
than 8000 seconds. But this data is scattered manually by the program in
mostly small chunks. Hence it might not profit from the advantages of the
newer firmware.
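To put these timings next to the peak bandwidths quoted above, they can be converted to average payload rates. The sketch below assumes 1 MB = 10^6 bytes and simply divides the payload size by the elapsed time; it ignores how many copies actually crossed the network, so the numbers are rough averages and not measurements from the experiment.

```python
def effective_mbps(megabytes, seconds):
    """Average payload rate in Mbit/s, taking 1 MB = 10**6 bytes (an assumption)."""
    return megabytes * 8 / seconds


before = effective_mbps(75, 47 * 60)  # 75MB broadcast before the upgrades
after = effective_mbps(75, 30)        # the same broadcast after the upgrades
scatter = effective_mbps(800, 8000)   # upper bound: the scatter took more than 8000 s

print(f"broadcast before upgrades: {before:.2f} Mbps")
print(f"broadcast after upgrades:  {after:.2f} Mbps")
print(f"scatter of 800MB:          at most {scatter:.2f} Mbps")
```

Under these assumptions the broadcast improves by roughly two orders of magnitude, while the manual scatter stays below 1 Mbps on average.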
On the other hand, using native ATM communication, we were able to
transfer up to 123Mbps (AAL and other overhead is already subtracted) from
point to point in this network. The highest performance was achieved using
large packets of 40KB, which was the largest size the AIX operating system
was able to handle.
Fig. 4. Single to Multipoint ATM communication: Based on a point-to-point
connection (solid line) from the lower left to the upper left, ATM allows further
recipients to be added (dashed lines). Still data is only transmitted once over the
WAN lines between the three switches. The receiving switches to the right and upper
left distribute the data locally to all receiving parties
In addition, there is another interesting feature of ATM. As sketched in
Fig. 4, ATM is able to broadcast. Although ATM per se is a strictly point to
point connection oriented protocol, it is possible to specify multiple receivers
of the same data stream which splits at the ATM switches as close to the
receivers as possible. When connecting two clusters of workstations over an
ATM WAN link, this means that data will only be transmitted once through
the WAN link and then distributed automatically by the switch in the remote
ATM cluster.
In our experiments, we were able to achieve the same peak transmission
rate of 123Mbps (not counting any ATM or SDH overhead) even when distributing
data to the GMD and the IEM. Unfortunately we experienced a
very high rate of 1-2% data loss. Latency or delay variation could not be
measured reliably with the workstations as they had no exact synchronous
time source.
In a joint project (Fig. 5) with the lab of our carrier o.tel.o and support
of Siemens and GN Nettest we found that there is actually no data loss in
the network or the switches. It appears that the losses are due either to the
workstations being unable to accept the data fast enough or to their being
unable to avoid overcommitting a specified link bandwidth. As the data loss was even
higher for very slow links, it appears that the latter might be the reason.
[Figure 5: test setup with the labels "GMD St. Augustin", "IEM Essen" and
"Single 155Mbps physical link".]
Fig. 5. Testing a single to multipoint connection: The data was sent through a
unidirectional PVC to a loop on the remote switch and from there back to points
B & C at once (using a single to multipoint connection). As the bandwidth on the
IEM-GMD link was limited, it is guaranteed that the data was duplicated at the
switch at the IEM, not at the other side
With GN Nettest equipment provided by the carrier and Siemens we could
measure network latencies and delay variations which we found to be very
small and below a millisecond even in the WAN segment (see also Fig. 6).
Definitely these should be of no relevance for distributed computing. However,
ATM is connection oriented and the time required to set up connections
cannot be ignored. We found that a typical ATM LAN switch can withstand
bursts of at most about 100 connection setups per second (assuming there is
no other traffic). Therefore, for a typical distributed application, it will be
too slow to initiate the necessary connections when a data packet is to be
sent. One can consider initiating all required connections at the beginning
of the program, but there will be many such connections and the resources
of the ATM switches are usually limited to a few thousand connections per
interface. Thus, an interesting approach could be to initiate each connection
in advance, before the data to be sent is actually available. Of course,
Fig. 6. Delay variation and throughput measured on one leaf of a single to
multipoint connection. The GN Nettest Interwatch does not allow absolute
delays to be measured in this configuration as the data was not received on the
card which generated the traffic. Hence there are no sensible cell delay values
reported. The values for delay variation and throughput are reported correctly.
We had no access to a synchronous clock driving the switches, SDH equipment and
traffic generators. This might result in increased jitter effects. The values
measured on the other leaves did not differ significantly.
this requires that the application can foresee the recipient at such an early
point and that the application programming interface of the message passing
library allows the sending of messages to be prepared in such a way.
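The trade-off between setting up all connections at program start and the limits of the switches can be estimated with a few lines of arithmetic. The sketch assumes one unidirectional connection per ordered pair of nodes and the burst rate of about 100 setups per second mentioned above; the node count of 64 is a hypothetical example, not a configuration from the testbed.

```python
def full_mesh_connections(nodes):
    """Unidirectional connections needed so that every node can send to every other."""
    return nodes * (nodes - 1)


def setup_seconds(connections, setups_per_second=100):
    """Time to establish the connections at a given setup rate."""
    return connections / setups_per_second


nodes = 64  # hypothetical cluster size
connections = full_mesh_connections(nodes)
print(connections, "connections,", setup_seconds(connections), "seconds to set up")
```

Even for this moderate node count the number of connections is already in the range of a few thousand, the order of magnitude named above as the per-interface limit, which is why setting up connections shortly before they are needed looks attractive.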
The tests also included the use of network management systems. These are
an indispensable tool to administer any large ATM network. Without them,
a connection has to be configured separately on each switch it crosses. Using
such a management system it is possible just to specify endpoints and have
the system find a route through the network. However, distributed computing
would require switched virtual connections which are normally not handled
by the management system and were not a major part of this project. Still,
the management system in question allows a network carrier to assign some
backbone bandwidth to a customer, who can then use this bandwidth at his
own discretion. He can also book ATM links in advance which are then made
available automatically at a later point (for example at night, when the traffic
is low and the ATM connection is cheaper), maybe for a batch computation.
In conclusion, it appears to be worthwhile to perform distributed high
performance computing over an ATM WAN network. However, currently a
message passing library utilizing native ATM connections is not available but
appears to be necessary to achieve the required data rates. In addition, such
a library must be able to deal with the data loss which cannot be avoided
when using ATM, although the network itself appears to be very reliable in
this respect, as long as the ATM interfaces of the computing nodes are fast
enough and don't exceed the traffic contracts.
4 Acknowledgements
The author gratefully acknowledges financial support by the Ministry of Science
and Education North Rhine Westphalia, Essen University and o.tel.o
Düsseldorf. His work was also supported by the DFG-NSF exchange program,
DFG grant # Mi-89/24-1. He is also grateful to Siemens Essen for providing
access to a wide area switch and a management system, and to GN Nettest
Munich for technical support.
Finally he would like to thank the GMD National Research Center for
Information Technology, St. Augustin, and the Computer Center of the University
of Essen for the permission to use their resources and the help of their
maintenance staff in setting up the systems for the wide-area distributed
computations.
References
1. Matthias Brune, Jörn Gehring, and Alexander Reinefeld. Communicating across
parallel message-passing environments. Preprint submitted to Elsevier, February
1998.
2. P. Fleischmann, G. O. Michler, P. Roelse, J. Rosenboom, R. Staszewski, C. Wagner,
and M. Weller. Linear Algebra over Small Finite Fields on Parallel Machines,
volume 23 of Vorlesungen aus dem Fachbereich Mathematik. University
of Essen, 1995.
3. Peter Roelse. Factoring high-degree polynomials over F2 with Niederreiter's
algorithm on the IBM-SP/2. Math. Comp. to appear.
4. M. Weller. Parallel gaussian elimination over small finite fields. In Proceedings
of the 9th International Conference on Parallel and Distributed Computing
Systems. ISCA, September 1996.
5. Michael Weller. Construction of large permutation representations for matrix
groups. In E. Krause and W. Jäger, editors, High Performance Computing in
Science and Engineering '98, pages 430-452. HLRS Stuttgart, Springer-Verlag
Berlin, Heidelberg, New York, 1998.
6. Michael Weller. Construction of large permutation representations for matrix
groups II. Submitted, 1999.
Highly Available Distributed Storage Systems
Lihao Xu 1 and Jehoshua Bruck 2
1 Department of Computer Science, Washington University, Campus Box 1045,
Saint Louis, MO 63130, USA. Email: lihao@cs.wustl.edu. This work was done
while this author was at the California Institute of Technology.
2 Department of Electrical Engineering, California Institute of Technology, Mail
Stop 136-93, Pasadena, CA 91125, USA. Email: bruck@paradise.caltech.edu.
1. Introduction
Information is generated, processed, transmitted and stored in various forms:
text, voice, image, video and multimedia types. Here all these forms will be
treated as general data. As the need for data increases exponentially with the
passage of time and the increase of computing power, data storage becomes
more and more important. From scientific computing to business transactions,
data is the most precious part. How to store the data reliably and
efficiently is the essential issue, and it is the focus of this chapter.
As with distributed computing, distributed storage is coming of age as a
good solution to achieve scalability, fault tolerance and efficiency. Historically,
since the speed of storage devices, such as tapes and disks, is much slower
than the speed of computing devices, e.g., CPUs, I/O is a bottleneck in computing
systems. To improve the data throughput of storage devices, RAID
(Redundant Array of Independent Disks) was proposed [14][16] to store data
over multiple storage devices in a distributed way, so that the total I/O bandwidth
is the sum of the bandwidths of the individual storage devices. That was
the start of distributed (networked) storage. Since then, storage technologies
have been advancing rapidly; the capacity of magnetic devices continuously
increases and access speed constantly improves. But as with CPUs, there are
physical limits to the density of disks, seek time and rotational speed of the
disk drives. These limits mean that the capacity and access speed of a single
storage device cannot be improved indefinitely. The need for storage capacity
and access speed can be met by improving storage systems at the architectural
level, i.e., using multiple distributed storage devices connected via a fast
network, such as the Fiber Channel, which reduces data latency incurred over
the network to much less than the latency time of a single storage device. A
distributed structure not only can increase the capacity and speed of storage
systems, but also can bring fault tolerance and scalability.
As with computing, fault tolerance (or reliability) is increasingly important
in storage systems. Some critical data should be available and some
services should be provided even when faults occur in storage units. Besides,
a storage system that allows faulty units to be replaced on the fly
would have great value for business transactions, such as airport management,
banking systems, and internet service provider systems. Naturally, reliability
of storage systems can be achieved more easily using a distributed structure.
Scalability is another natural feature of distributed systems: addition or
replacement of components is much more flexible in a distributed system than
in a central system. Thus distributed storage systems can adapt better to
dynamic and growing data demands.
In this chapter, the reliability, efficiency and scalability of distributed
storage systems are all considered aspects of availability. A highly available
storage system has high reliability (or can tolerate more faults), high efficiency
(or performance) and scalability. Achieving high availability in distributed
storage systems is the main topic of this chapter.
This chapter mainly consists of two parts. The first part discusses the
reliability issue. Reliability is usually achieved by introducing data
redundancy into a storage system. The second part shows that the efficiency
of a data storage system can be improved by properly using the data
redundancy in the system. So the approaches to introducing data redundancy are
very important to a storage system, for both reliability and efficiency. This
chapter will describe a few MDS array codes, a class of error-control codes
that are very suitable for introducing data redundancy in storage
systems.
2. MDS Array Codes for Reliability
2.1 Array Codes
Reliability of storage systems is often achieved by storing redundant data
in the systems using error-control codes. Usually in storage systems, the
failure of a single storage unit can be detected by the storage controllers
and then can be masked. Thus erasure-correcting codes are often used, since
the device failures can be marked as erasures. Erasure-correcting codes are
a mathematical means of representing data so that lost information can be
recovered. With an (n, k) erasure-correcting code, we represent k symbols of
the original data with n symbols of encoded data (n - k is called the amount of
redundancy or parity). With an m-erasure-correcting code, the original data
can be recovered even if m symbols of the encoded data are lost [13], and
the distance d of this code is defined to be d = m + 1. A code is said to be
Maximum Distance Separable (MDS) if m = n - k. An MDS code is optimal
in terms of the amount of redundancy versus the erasure recovery capability.
The Reed-Solomon code [13] is a well-known example of an MDS code.
The complexity of the computations needed to construct the encoded data
(a process called encoding) and to recover the original data (a process called
decoding) is an important consideration for practical systems. Array codes are
ideal in this respect. Array codes have been studied extensively [2][3][4][8].
A common property of these codes is that the encoding and decoding procedures
use only simple binary exclusive-or (XOR), which can be implemented
easily in hardware and/or software; thus these codes are much more efficient
than Reed-Solomon codes in terms of computation complexity and are very
suitable for use in storage systems, for both reliability and efficiency.
In an array code, the information (original) and parity (redundant) bits
are placed in a 2-dimensional array of size l x n. In a distributed storage
system, the bits in the same column are stored in the same disk. If any bit
in a disk is corrupted, then the disk is considered to be a failed disk that
needs repair, i.e., the corresponding column of the code is considered to be
an erasure.
Current RAID (Redundant Array of Independent Disks) systems can tolerate
at most one disk failure at a time, i.e., the code used is only a
1-erasure-correcting code. In more and more applications, tolerance of only a
single disk fault is not enough. A system that can tolerate more than one failure at
the same time would be more robust and flexible. For example, when one disk
fails, the system can still have some non-stop fault-tolerance capability while
the bad disk is being replaced by a good one. This level of fault tolerance
requires codes with higher erasure-correcting capability. A 2-erasure-correcting
code can provide a much longer non-stop functioning time to a distributed
storage system than a 1-erasure-correcting code.
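To make the 1-erasure-correcting case concrete, the following minimal sketch shows the single-XOR-parity idea on small integer "blocks". The function names and the sample values are illustrative assumptions, not taken from any particular RAID implementation.

```python
from functools import reduce
from operator import xor


def add_parity(data_blocks):
    """Append one XOR parity block to a list of integer data blocks."""
    parity = reduce(xor, data_blocks, 0)
    return data_blocks + [parity]


def recover(stored, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stored) if i != lost_index]
    return reduce(xor, survivors, 0)


# Three hypothetical data disks plus one parity disk.
disks = add_parity([0b1011, 0b0110, 0b1100])
# Pretend disk 1 failed and rebuild its content from the others.
rebuilt = recover(disks, 1)
```

Any single block, data or parity, can be rebuilt this way, but two simultaneous losses cannot, which is why the codes discussed next add a second, independent parity.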
Considering all the above factors, i.e., computation complexity, optimal
redundancy and high-level fault tolerance, we will focus on three classes
of 2-erasure-correcting MDS array codes: the EVENODD code [2][3], the
X-Code [21] and the B-Code [22]. These codes can be used effectively to achieve
the reliability and efficiency of storage systems.
2.2 EVENODD Code
The EVENODD code has a very simple structure: all bits are placed in an
array of size (p - 1) x (p + 2), where p is a prime number, i.e., it has p + 2
columns. All the information bits are placed in the first p columns, and the
last 2 columns contain all parity bits. The 2 parity columns are constructed
using the diagonals of slope 0 and slope -1, respectively. Details about the
construction of the EVENODD code can be found in [2]. The following example
shows a construction for a (7,5) EVENODD code:
Example 2.1. A (7,5) EVENODD code
Table 2.1 shows an encoding rule of a (7,5) EVENODD code, and Table
2.2 is a numerical example of Table 2.1.
[]
Table 2.1. Encoding of a (7,5) EVENODD code, where s = a5 + b4 + c3 + d2
a1 a2 a3 a4 a5 | a1 + a2 + a3 + a4 + a5 | s + a1 + b5 + c4 + d3
b1 b2 b3 b4 b5 | b1 + b2 + b3 + b4 + b5 | s + a2 + b1 + c5 + d4
c1 c2 c3 c4 c5 | c1 + c2 + c3 + c4 + c5 | s + a3 + b2 + c1 + d5
d1 d2 d3 d4 d5 | d1 + d2 + d3 + d4 + d5 | s + a4 + b3 + c2 + d1
Table 2.2. Numerical example of a (7,5) EVENODD code
1 0 1 1 0 | 1 | 0
0 1 1 0 0 | 0 | 0
1 1 0 0 0 | 0 | 1
0 1 0 1 1 | 1 | 0
It was proven in [2] that the EVENODD code is a 2-erasure-correcting
array code, i.e., it is MDS. An algorithm for recovering 2 erased columns for
the EVENODD code and other details can be found in [2]. A generalization
of the EVENODD code that recovers more erasures while still maintaining the
MDS property is described in [3].
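The encoding rule of Table 2.1 can be sketched in a few lines. This is a minimal illustration, assuming the usual convention of an imaginary all-zero row below the array; the function name `evenodd_encode` and the variable names are hypothetical and not taken from [2].

```python
def evenodd_encode(info, p):
    """Encode a (p-1) x p bit array into a (p-1) x (p+2) EVENODD array."""
    rows = p - 1

    def bit(r, c):
        # Row p-1 is an imaginary all-zero row.
        return info[r][c] if r < rows else 0

    # s is the XOR along the diagonal that ends in the imaginary row
    # (s = a5 + b4 + c3 + d2 in Table 2.1 for p = 5).
    s = 0
    for c in range(1, p):
        s ^= bit(p - 1 - c, c)

    coded = []
    for r in range(rows):
        row_parity = 0
        diag_parity = s
        for c in range(p):
            row_parity ^= info[r][c]            # slope-0 parity
            diag_parity ^= bit((r - c) % p, c)  # slope -1 parity
        coded.append(list(info[r]) + [row_parity, diag_parity])
    return coded


# Information columns of Table 2.2.
table_2_2_info = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
]
encoded = evenodd_encode(table_2_2_info, 5)
```

Running the sketch on the information bits of Table 2.2 yields the two parity columns printed in that table.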
2.3 Update Complexity
One important parameter of array codes is the average number of parity
bits affected by a change of a single information bit; this parameter is called
the update complexity here. The update complexity is particularly crucial
when the codes are used in storage applications that update information
frequently. It also measures the encoding complexity of the code. The
lower this parameter is, the simpler the encoding operations are. If a code
is described by a parity check matrix [13], then this parameter is the average
row density (the number of nonzero entries in a row) of the parity check
matrix. Research has been done to reduce this parameter or to make the
density of the parity check matrix of codes as low as possible [9][17]. The obvious
lower bound of the update complexity of any 2-erasure-correcting code is 2.
The update complexity of EVENODD codes approaches 2 as the length
(number of columns) of the codes increases. But it was proven in [3] that
for any linear array code with separate information and parity columns, the
update complexity is always strictly larger than 2. Then a natural question is
whether the update complexity of 2 is achievable for general array codes. The
answer is, fortunately, yes. The next two subsections will describe two classes
of codes, called X-Code and B-Code respectively, whose update complexity
is exactly 2.
2.4 X-Code
The X-Code is a class of 2-erasure-correcting MDS array codes. Its update
complexity is exactly 2, i.e., it has the optimal encoding (update) property.
It has a very simple geometrical construction.
2.4.1 Structure of the X-Code. In the X-Code, information bits are placed
in an array of size (n - 2) x n. As in other array codes [2][3][5][11], parity bits
are constructed by adding the information bits along several parity check lines
or diagonals of given slopes. The addition operation is just the binary XOR.
But instead of being put in separate columns, the parity bits of the X-Code
are placed in two additional rows. So the coded array is of size n x n, with the
first n - 2 rows containing information bits, and the last two rows containing
parity bits. Notice that each column has information bits as well as parity
bits, i.e., information bits and parity bits are mixed in each column. By the
structure of the code, if two columns are erased, the number of remaining
bits is n(n - 2), which is equal to the number of original information bits,
making it possible to recover the two erased columns.
2.4.2 Encoding Rules. The encoding rule of the X-Code is simple: let
C_{i,j} be the bit at the i-th row and j-th column; then the parity bits are
constructed according to the following rules:

$$C_{n-2,i} = \sum_{k=0}^{n-3} C_{k,(i+k+2)_n}, \qquad C_{n-1,i} = \sum_{k=0}^{n-3} C_{k,(i-k-2)_n} \qquad (2.1)$$

where i = 0, 1, ..., n - 1, and (x)_n = x mod n. Geometrically speaking, the
two parity rows are just the checksums along diagonals of slopes 1 and -1
respectively.
The following example shows the encoding of a (7,5) X-Code:
Example 2.2. A (7,5) X-Code
The first parity row is calculated along the diagonals of slope 1, with the
last row being an imaginary 0-row, as follows:
[Figure: array with the diagonals of slope 1 marked by symbols; the last two rows are labeled "1st parity row" and "imaginary 0-row".]
The second parity row is calculated along the diagonals of slope -1, with
the last row being an imaginary 0-row, as follows:
[Figure: array with the diagonals of slope -1 marked by symbols; the last two rows are labeled "2nd parity row" and "imaginary 0-row".]
Table 2.3 shows a numerical example of a (7,5) X-Code.
Table 2.3. Numerical example of a (7,5) X-Code
1 0 1 1 0 1 0
0 1 1 0 0 0 0
1 1 0 0 0 0 1
0 1 0 1 1 1 0
1 0 0 1 0 1 0
[]
From the construction of the X-Code, it is easy to see that the two parity
rows are obtained independently; more specifically, each information bit
affects exactly one parity bit in each parity row. All parity bits depend only on
information bits, but not on each other. So, updating a single information bit
results in updating only two parity bits. Thus the X-Code has the optimal
encoding (or update) property, i.e., its update complexity of 2 matches the
lower bound for any 2-erasure-correcting code.
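The update property can be checked directly with a small sketch of the encoding rule in Eq. (2.1). The function names are hypothetical, and the five rows printed in Table 2.3 are used here as information rows of a (7,5) X-Code purely as sample input (an assumption about how that table is to be read).

```python
def xcode_encode(info, n):
    """Return an n x n array: n-2 information rows plus two parity rows."""
    assert len(info) == n - 2 and all(len(row) == n for row in info)
    parity_1 = [0] * n  # row n-2: checksums along diagonals of slope 1
    parity_2 = [0] * n  # row n-1: checksums along diagonals of slope -1
    for i in range(n):
        for k in range(n - 2):
            parity_1[i] ^= info[k][(i + k + 2) % n]
            parity_2[i] ^= info[k][(i - k - 2) % n]
    return [list(row) for row in info] + [parity_1, parity_2]


def parity_bits_changed(info, n, row, col):
    """Count parity bits that differ after flipping one information bit."""
    before = xcode_encode(info, n)
    flipped = [list(r) for r in info]
    flipped[row][col] ^= 1
    after = xcode_encode(flipped, n)
    return sum(before[r][c] != after[r][c]
               for r in (n - 2, n - 1) for c in range(n))


# Sample information rows (the five rows printed in Table 2.3).
sample_info = [
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 0],
]
changes = {(r, c): parity_bits_changed(sample_info, 7, r, c)
           for r in range(5) for c in range(7)}
```

Every entry of `changes` equals 2: one parity bit in each parity row, regardless of which information bit is flipped.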
In addition, notice that each column has two parity bits, each of which is
the checksum of n - 2 information bits. Thus computing the parity bits of each
column needs 2(n - 3) XORs. This balanced computation property of the X-Code
is very useful in applications that require evenly distributed computations.
It was proven in [21] that
Theorem 2.1. (MDS Property of the X-Code)
The X-Code can recover up to two erased columns, i.e., it is MDS, if and only if n
is a prime number.
The procedure to recover up to two erased columns is called erasure-correcting
or erasure-decoding. A formal description and correctness proof of
the erasure-decoding algorithm for the X-Code can be found in [21]. A pseudo-code
description of the algorithm can also be found in [23]. More details
about the X-Code, including examples of its erasure-decoding algorithm, are
discussed in [21] and [23].
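For very small n, Theorem 2.1 can be checked by brute force. The sketch below is not the decoding algorithm of [21]; it relies only on linearity: any two erased columns are recoverable exactly when no nonzero codeword is confined to at most two columns. The function names are hypothetical, and the exhaustive search is only practical for n up to about 5.

```python
from itertools import product


def xcode_encode(info, n):
    """X-Code encoding following Eq. (2.1): n-2 information rows + 2 parity rows."""
    parity_1 = [0] * n
    parity_2 = [0] * n
    for i in range(n):
        for k in range(n - 2):
            parity_1[i] ^= info[k][(i + k + 2) % n]
            parity_2[i] ^= info[k][(i - k - 2) % n]
    return [list(row) for row in info] + [parity_1, parity_2]


def is_mds(n):
    """True if every nonzero codeword has more than two nonzero columns."""
    for bits in product((0, 1), repeat=n * (n - 2)):
        if not any(bits):
            continue  # skip the all-zero codeword
        info = [list(bits[r * n:(r + 1) * n]) for r in range(n - 2)]
        code = xcode_encode(info, n)
        nonzero_columns = sum(
            any(code[r][c] for r in range(n)) for c in range(n))
        if nonzero_columns <= 2:
            return False  # this codeword vanishes if those columns are erased
    return True


results = {n: is_mds(n) for n in (3, 4, 5)}
```

For n = 4 the search finds a codeword confined to two columns (a single information bit whose two parity bits fall into the same column), which illustrates why a non-prime n fails.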
2.5 B-Code
Now we describe another class of 2-erasure-correcting MDS array codes, called
the B-Code. Its update complexity is also exactly 2, i.e., it has the optimal
encoding (update) property. The construction of the B-Code is not as direct as
that of the X-Code. It is related to a classical graph theory problem.
2.5.1 Structure of the B-Code. The B-Code is of size n x l, where l = 2n
or 2n + 1. Denote such a B-Code B_l. As in the X-Code, parity bits are placed
in a row rather than in columns. For B_{2n}, the first n - 1 rows are information
rows, and the last row is a parity row, i.e., all the bits in the first n - 1 rows
are information bits, while the 2n bits in the last row are parity bits. The
structure of B_{2n+1} can be derived from that of B_{2n} simply by adding one
more information column as the last column. Their structures are shown in
Figure 2.1.
[Figure: (a) an array of 2n columns with n - 1 information rows above one parity row; (b) the same array with one additional information column.]
Fig. 2.1. Structures of (a) B_{2n} and (b) B_{2n+1}.
Intuitively, if the roles of the information and parity bits of the B-Code
are exchanged, i.e., the parity bits are placed in the entries which originally
were for the information bits and vice versa, then we get the dual code of the
B-Code. Denote the dual B-Code of length l by \hat{B}_l. A more rigorous definition of
the dual code for general array codes can be found in [22]. It was also proven
in [22] that
Theorem 2.2. The dual code of an MDS array code is also MDS.
So the dual B-Code is also an MDS array code; it has distance l - 1, i.e., the
dual B-Code can be recovered from any two of its columns. Figure 2.2 shows
the structures of \hat{B}_{2n} and \hat{B}_{2n+1}.
[Figure: (a) an array of 2n columns with one information row above n - 1 parity rows; (b) the same array with one additional parity column.]
Fig. 2.2. Structures of (a) \hat{B}_{2n} and (b) \hat{B}_{2n+1}
2.5.2 A New Graph Description of the B-Code. Typically, an array
code is described by its geometrical construction lines or diagonals [2][3][5][11],
as is the X-Code. Constructions of array codes are difficult to obtain using
this description. Here we describe the B-Code and its dual using a new graph
approach, which allows us to get the construction of the B-Code easily.
For any array code, each parity bit is the sum of some information bits.
For binary codes, the addition is just the simple XOR (binary exclusive OR)
operation. Now consider the dual B-Code \hat{B}_l. Simple counting [22] shows
that each parity bit must be the sum of exactly 2 information bits. Thus if we
represent an information bit as a vertex, then a parity bit can be represented
by an edge, where the parity bit is the sum of the two information bits whose
vertices form the edge. This is the key idea of describing the B-Code and its
dual with graphs.
Since the constructions of \hat{B}_{2n}, B_{2n} and B_{2n+1} can be easily obtained
from that of \hat{B}_{2n+1}, here we give a graph description of \hat{B}_{2n+1}. Detailed
justifications of this description can be found in [22].
Description 2.1. Graph Description of \hat{B}_{2n+1}
Given a complete graph K_{2n} with 2n vertices, which are labeled with
integers from 1 to 2n, find an edge labeling scheme such that
1) Each edge is labeled exactly once by an integer from 1 to 2n + 1.
2) For any pair of vertices (i, j) and any other vertex k, where i, j, k are in
[1, 2n], there is always a path to k from either i or j, using only the edges
labeled with i or j.
3) For any vertex i and any other vertex k, where i, k are in [1, 2n], there is
always a path from i to k, using only the edges labeled with i or 2n + 1.
With the above description, it is easy to see that the vertex and edges
with the label i in K_{2n} represent the information bit and parity bits in
the i-th column of \hat{B}_{2n+1}. Properties 2) and 3) ensure that any two
columns of the code can recover the information bits in all other columns;
thus the code is of column distance 2n. Figure 2.3 shows such a labeling of
K_4 and the corresponding \hat{B}_5, where a1 through a4 are the information bits.
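The graph description translates almost literally into code. The sketch below uses one edge labeling of K_4 that is consistent with the array of Fig. 2.3(b) and the decoding chains of Fig. 2.5; the labeling is a reconstruction and should be treated as an assumption, as should all function names.

```python
from itertools import combinations, product

# Edge labeling of K4: label (column index) -> edges carrying that label.
# Column i (1..4) holds information bit a_i; column 5 holds parity bits only.
LABELS = {1: [(2, 3)], 2: [(3, 4)], 3: [(4, 1)], 4: [(1, 2)],
          5: [(2, 4), (1, 3)]}


def encode(a):
    """a maps vertex -> information bit; returns column -> (info, parities)."""
    columns = {}
    for label, edges in LABELS.items():
        info = a.get(label)  # None for column 5
        parities = {e: a[e[0]] ^ a[e[1]] for e in edges}
        columns[label] = (info, parities)
    return columns


def decode(columns, i, j):
    """Recover all information bits using only columns i and j."""
    known, edges = {}, {}
    for c in (i, j):
        info, parities = columns[c]
        if info is not None:
            known[c] = info
        edges.update(parities)
    progress = True
    while progress:  # follow the decoding chain along usable edges
        progress = False
        for (u, v), parity in edges.items():
            if u in known and v not in known:
                known[v] = known[u] ^ parity
                progress = True
            elif v in known and u not in known:
                known[u] = known[v] ^ parity
                progress = True
    return known


def all_pairs_recover():
    """Check that every pair of columns recovers every information word."""
    for bits in product((0, 1), repeat=4):
        a = dict(zip((1, 2, 3, 4), bits))
        cols = encode(a)
        for i, j in combinations(range(1, 6), 2):
            if decode(cols, i, j) != a:
                return False
    return True
```

The propagation loop in `decode` is exactly the path-following argument of properties 2) and 3): starting from the known vertices, each usable edge reveals one more information bit.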
[Figure: (a) the complete graph K_4 on vertices 1 to 4 with labeled edges; (b) the corresponding array:
a1 | a2 | a3 | a4 | a1 + a3
a2 + a3 | a3 + a4 | a4 + a1 | a1 + a2 | a2 + a4]
Fig. 2.3. (a) Graph and (b) array representations of \hat{B}_5
2.5.3 Construction of the B-Code. As already described above, constructing
the B-Code amounts to the same problem as designing an edge
labeling scheme such as in Description 2.1 for a complete graph K_{2n}.
Fortunately this can be related to another graph theory problem, namely the
perfect one-factorization (P1F) problem.
Definition 2.1. [19] Let G = (V, E) be a graph. A factor or spanning subgraph
of G is a subgraph with vertex set V. In particular, a one-factor is a factor
which is a regular graph of degree 1. A factorization of G is a set of factors
of G which are pairwise edge disjoint, and whose union is all of G. A
one-factorization of G is a factorization of G whose factors are all one-factors.
In particular, a one-factorization is perfect if the union of any pair of its
one-factors is a Hamilton cycle, a cycle that passes through every vertex of
G.
Figure 2.4 shows a perfect one-factorization of K_4.
[Figure: three copies of K_4 on vertices 1, 2, 3, 4, panels (a), (b) and (c), each showing one one-factor.]
Fig. 2.4. (a), (b) and (c) are 3 one-factors that together form a perfect one-factorization
of K_4
The perfect one-factorization of complete graphs has been studied for
many years since its introduction in the 1960s in [12]. It is now known that [19]:
Theorem 2.3. If p is an odd prime, then K_{p+1} and K_{2p} have perfect
one-factorizations.
Constructions of P1Fs for K_{p+1} and K_{2p} can be found in [1] and [18].
Additionally, constructions of P1Fs for K_{2n} for some other sporadic
integers n have also been found [18][19]. However, it still remains a conjecture
[18][19] that:
Conjecture 2.1. For any positive integer n, K_{2n} has perfect one-factorization(s).
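Definition 2.1 and the K_{p+1} case of Theorem 2.3 can be explored with a short sketch. It uses the classical rotational construction on the vertices 0, ..., p - 1 plus one extra vertex (often called GK_{p+1}); whether this is the same construction as in [1] or [18] is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def gk_one_factorization(p):
    """One-factors F_0..F_{p-1} of K_{p+1} on vertices 0..p-1 and 'inf'."""
    factors = []
    for i in range(p):
        factor = [("inf", i)]
        for j in range(1, (p - 1) // 2 + 1):
            factor.append(((i + j) % p, (i - j) % p))
        factors.append(factor)
    return factors


def is_hamilton_cycle(f1, f2, num_vertices):
    """The union of two disjoint one-factors is 2-regular, so it is a
    Hamilton cycle exactly when it is connected."""
    adjacency = {}
    for u, v in f1 + f2:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == num_vertices


def is_perfect(factors, num_vertices):
    """Check the P1F condition for every pair of one-factors."""
    return all(is_hamilton_cycle(a, b, num_vertices)
               for a, b in combinations(factors, 2))


# The three one-factors of K4 (K4 has no others).
k4_factors = [[(1, 2), (3, 4)], [(1, 4), (2, 3)], [(1, 3), (2, 4)]]
k4_is_perfect = is_perfect(k4_factors, 4)
k6_is_perfect = is_perfect(gk_one_factorization(5), 6)
```

The same check applied to the rotational factorization of K_10 (p = 9, not a prime) fails, since the union of F_0 and F_3 contains a cycle of length 4.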
It was proven in [22] that:
Theorem 2.4. Let P_m be a P1F for K_m. Constructing \hat{B}_{2n+1} (or
equivalently B_{2n+1}) is equivalent to constructing P_{2n+2}.
Theorem 2.4 was proven constructively in [22]. Combining Theorem 2.3
and Theorem 2.4, we get that:
Theorem 2.5. For any odd prime p, a B-Code and its dual code of size n x l
can be constructed, where n is either p - 1 or (p - 1)/2, and l = 2n or 2n + 1.
2.5.4 Erasure Recovery. Recall that the dual B-Code can recover all
information bits from any two columns. Erasure decoding for the dual B-Code
is almost obvious from its graph description (Description 2.1). The two paths,
starting from i and j and leading to all the other vertices in the graph, give
the decoding chain used in recovering a \hat{B}-Code from its i-th and j-th columns.
Figure 2.5 shows the decoding chains used in recovering \hat{B}_5 from its 1st column
and its 2nd, 3rd and 5th columns, respectively.
[Figure: three copies of the labeled K_4 with the edges used for decoding highlighted. Decoding chains: (a) 1, 2 -> 3 -> 4; (b) 1 -> 4, 3 -> 2; (c) 1 -> 3 -> 2 -> 4.]
Fig. 2.5. Erasure decoding of \hat{B}_5: recovering from its 1st and (a) 2nd, (b) 3rd and
(c) 5th columns. The decoding chains for each case are also listed. 1 through 4 are
the information bits in the corresponding columns.
Formal erasure recovery algorithms for the B-Code and its dual code, and
more details about the B-Code, can be found in [22] and [23].
2.6 Comparisons of Array Codes
As already seen above, the X-Code and B-Code have the optimal update
property, i.e., their update complexity is exactly 2. The B-Code also achieves
the maximum length possible for MDS codes with the optimal update; thus
the B-Code has optimal length, twice that of the X-Code with the same
column size. In addition, the parity bits are evenly distributed over all columns,
and each parity bit requires the same number of XOR operations.
Consequently, the computational complexity for computing parity bits is balanced,
i.e., the X-Code and B-Code feature balanced computation as well. This
property is quite useful in distributed storage systems, since the computational
loads are naturally distributed to all disks evenly, eliminating another
bottleneck. The properties of the X-Code and the B-Code are summarized in Table
2.4, together with a comparison with Reed-Solomon and EVENODD codes.
Table 2.4. X-Code, B-Code vs. Reed-Solomon and EVENODD.
Codes \ Properties | MDS | XOR | Optimal Update | Optimal Length | Balanced Computation
Reed-Solomon       | Yes | No  | No             | Yes            | No
EVENODD            | Yes | Yes | No             | No             | No
X-Code             | Yes | Yes | Yes            | No             | Yes
B-Code             | Yes | Yes | Yes            | Yes            | Yes
3. Efficiency through Redundancy
While it is conventional wisdom that redundancy is necessary for fault
tolerance, redundancy is in general regarded as a passive cost (overhead) to
achieve reliability. However, in this section, it will be shown that in a
distributed storage system, redundancy is an active part of the system in the
sense that proper data redundancy can help to improve the performance
(data throughput) of storage systems. Thus data redundancy will improve
not only the reliability of a system, but also the efficiency of a system. A
similar idea was first shown in [7], namely that redundant data can make packet
routing more efficient by reducing the mean and variance of the routing
delay. Recently, more scalable and efficient reliable multicast schemes have been
proposed, based on data redundancy in the messages to be multicast [10]. We
will show here a more systematic way of using proper redundancy, based on
error-correcting codes (particularly the MDS array codes described above),
to improve the performance of data server systems, which are a superset of
storage systems.
Our data server system setup is shown in Figure 3.1: a cluster of servers is
connected via some reliable communication network. In addition, broadcast
is supported over the network, so that a client can broadcast its request
for certain data to some or all of the n servers in the system. The data
is distributed over the servers in such a way that a client can recover the
complete requested data after it gets data from at least k of the n servers,
and this is true for any k servers. Such a distributed data server system is
called an (n, k) server system. Again, such (n, k) systems can be implemented
by using error-correcting codes, particularly MDS array codes.
For the above data server system, there are two problems to solve:
(1) What is the proper redundancy when the total number of servers is
given? Or how should k be determined when n is given, in order to achieve the
best system performance? This is the so-called data distribution problem at
the server side. (2) Once data redundancy is properly distributed among the
servers, how should matching read approaches be chosen to optimize the mean
service time? This is the problem called data acquisition at the client
side. Both problems will be explored in this section, mostly theoretically.
[Figure: a client and a cluster of n servers connected by a communication network.]
Fig. 3.1. An (n, k) server system
3.1 Preliminary Analysis
Before we seek the solutions to the above problems, we first define the server
system model we will be using, based on probability analysis. Then we give
some basic analytical results that can be used further to solve the data
distribution and the data acquisition problems.
3.1.1 System Model. Define the service time T_i of the server i (1 <= i <= n)
to be the elapsed time from when the client sends its request to the server i
to when it receives data from the server i. Notice that T_i does not include the
time needed at the client side to do any necessary computations to recover the
final data, since here we assume that the computations are rather simple and
thus take much less time than does the data delivery through communication
media. We model T_i as a continuous random variable with probability density
function (pdf) f_i(t) [15]. For simplicity of analysis, we assume that all T_i's are
i.i.d. (independent, identically distributed) random variables, i.e., f_i(t) = f(t),
1 <= i <= n.
3.1.2 Analysis Results. Let F_i(t) be the cumulative distribution function
(cdf) of T_i, i.e. [15],

$$F_i(t) = \mathrm{Probability}(T_i \le t) = \int_{-\infty}^{t} f_i(x)\,dx$$

Now let T(n, k) be the elapsed time from when the client broadcasts its data
request to the servers to when it receives data from at least k out of the n
servers. Then T(n, k) is another random variable and is a simple function of
all the T_i's:

$$T(n,k) \ge T_i, \quad \text{where } \|\{i\}\| \ge k$$

In the above equation, ||S|| is the number of the elements in the set S; that
is, T(n, k) is the k-th smallest of the T_i's.
Let f_{(n,k)}(t) and F_{(n,k)}(t) be the pdf and cdf of T(n, k) respectively; then
it is easy to relate F_{(n,k)}(t) and f_{(n,k)}(t) to F(t) and f(t) [7]:

$$F_{(n,k)}(t) = \sum_{i=k}^{n} \binom{n}{i} F(t)^{i}\,[1 - F(t)]^{n-i} \qquad (3.1)$$

or [7][20]:

$$f_{(n,k)}(t) = \frac{dF_{(n,k)}(t)}{dt} = k \binom{n}{k} F(t)^{k-1}\,[1 - F(t)]^{n-k} f(t) \qquad (3.2)$$

The mean of T(n, k), E[T(n, k)], is a good measurement of the server system's
performance. It can be calculated once f_{(n,k)}(t) is known:

$$E[T(n,k)] = \int_{-\infty}^{\infty} t\, f_{(n,k)}(t)\,dt \qquad (3.3)$$
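As a numerical illustration of Eqs. (3.2) and (3.3), the sketch below integrates t f_{(n,k)}(t) with the trapezoidal rule. The exponential service-time distribution with rate 1 is purely an assumption chosen because its k-th order statistic has a well-known closed-form mean, 1/n + 1/(n-1) + ... + 1/(n-k+1); it is not the service-time model of this chapter, and the function names are hypothetical.

```python
import math


def mean_service_time(n, k, pdf, cdf, t_max, steps=100_000):
    """Approximate E[T(n,k)] = integral of t * f_(n,k)(t) over [0, t_max]."""
    def order_pdf(t):
        big_f = cdf(t)
        # Eq. (3.2): density of the k-th smallest of n i.i.d. service times.
        return (k * math.comb(n, k) * big_f ** (k - 1)
                * (1.0 - big_f) ** (n - k) * pdf(t))

    h = t_max / steps
    total = 0.0
    for s in range(steps + 1):
        t = s * h
        weight = 0.5 if s in (0, steps) else 1.0  # trapezoidal end weights
        total += weight * t * order_pdf(t)
    return total * h


def exp_pdf(t, rate=1.0):
    return rate * math.exp(-rate * t)


def exp_cdf(t, rate=1.0):
    return 1.0 - math.exp(-rate * t)


def exp_closed_form(n, k):
    """Known mean of the k-th order statistic of n Exp(1) variables."""
    return sum(1.0 / j for j in range(n - k + 1, n + 1))


numeric = mean_service_time(3, 2, exp_pdf, exp_cdf, t_max=50.0)
exact = exp_closed_form(3, 2)
```

Under these assumptions, the numerical value for a (3, 2) system agrees with the closed form 1/3 + 1/2 to within the integration error.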
3.1.3 Properties of Mean Service Time. Though it is usually hard to
get a clean closed form of E[T(n, k)] for a general pdf f(t), it is still possible
to get some of its properties with respect to n and k. Intuitively, for a fixed
pdf f(t), a bigger n and/or a smaller k leads to a smaller E[T(n, k)], and this
can be proven mathematically [23]:
Theorem 3.1. For a random variable T with a fixed pdf f(t), the following
inequalities hold for 1 <= k <= n:
1. E[T(n, k)] > E[T(n + m, k)], for m >= 1;
2. E[T(n, k)] < E[T(n, k + m)], for m >= 1;
3. E[T(n, k)] < E[T(n + m, k + m)], for m >= 1;
4. E[T(i, j)] >= E[T(n, k)], if n >= i and k <= j; equality holds only when
n = i and k = j;
5. E[T(i, j)] < E[T(n, k)], if n >= i, k > j and n - k < i - j.
[]
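The first three inequalities of Theorem 3.1 can be checked exactly for one concrete distribution. The sketch below assumes i.i.d. exponential service times with rate 1, for which E[T(n, k)] = 1/(n-k+1) + ... + 1/n is a standard order-statistics result; this choice of distribution and the function names are assumptions made for illustration only.

```python
from fractions import Fraction


def mean_exp(n, k):
    """Exact E[T(n,k)] for i.i.d. Exp(1) service times."""
    return sum(Fraction(1, j) for j in range(n - k + 1, n + 1))


def check_theorem_3_1(max_n=8, max_m=3):
    """Check inequalities 1-3 of Theorem 3.1 for all small n, k, m."""
    for n in range(1, max_n + 1):
        for k in range(1, n + 1):
            for m in range(1, max_m + 1):
                if not mean_exp(n, k) > mean_exp(n + m, k):
                    return False  # more servers should lower the mean
                if k + m <= n and not mean_exp(n, k) < mean_exp(n, k + m):
                    return False  # waiting for more replies raises it
                if not mean_exp(n, k) < mean_exp(n + m, k + m):
                    return False  # equal redundancy n - k, larger system
    return True


holds = check_theorem_3_1()
```

Exact rational arithmetic is used so that the strict inequalities are not blurred by floating-point rounding.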
We will use the properties above as guidelines for the data distribution
and the data acquisition problems. One would hope that the variances of the
random variables also had similar properties. Unfortunately, however,
the above properties do not hold for the variances. One such example is
shown in [23].
3.2 Server Performance Model
From Eq. (3.2) and Eq. (3.3), E[T(n, k)] is a function of the pdf f(t) of an
individual server's data service time. The goal of the data distribution and the
data acquisition problems is to reduce E[T(n, k)] under various conditions.
Before we analyze the data distribution and the data acquisition problems,
it is necessary to establish some model of f(t).
3.2.1 Abstraction from Experiments. The data service time T depends
on many factors in a practical server system, such as computing power (i.e.,
CPU speed) of the servers and the client, local disk I/O speed of the servers
and bandwidth and latency of the communication medium (usually including
a reliable communication software layer) connecting the servers and the client.
A model considering all the factors will be fairly complex. In this section, we
will try to model the data service time as a simple probability distribution
that can be analyzed rather easily, and yet can approximate the real data
service time closely. Such a model will be abstracted from experimental results
of a real data server system.
Our experimental server system consists of several servers, which are PCs
running Linux. Each server has data stored on its local hard disk. Data
is accessed via the Linux file system. The client is also a PC running the
same Linux. The nodes are connected via Myrinet switches. A sliding window
protocol is used to ensure reliable communication. Experiments are conducted
in such a real system to measure the service time for data of different sizes.
The procedure of the experiment is as follows: (1) the client sends a request
for a certain amount of data to a server; (2) the server reads the data from its
local disk and sends it to the client through the reliable communication layer;
(3) the data is delivered to the client through the reliable communication
layer. The data service time is measured from the instant that the client
finishes sending its request to the instant that the client gets the data. We
run the above procedure a few thousand times for data of a given size, and
get the service time pdf according to the observed frequencies of different
ranges of service time. Figure 3.2 shows empirical service time pdfs for data
sizes (a) 32 Kbytes, (b) 320 Kbytes and (c) 3200 Kbytes.
[Figure 3.2: panels (a) data size: 32 Kbytes, (b) data size: 320 Kbytes, (c) data size: 3200 Kbytes]
Fig. 3.2. Empirical pdfs of service time for data of different sizes
The effective data bandwidths in these experiments are quite low, since
they are the concatenation of the local disk bandwidth and the reliable
communication layer bandwidth. But the shape of the pdfs is more
interesting. The experimental results show that the shapes of the empirical pdfs
for different data sizes can be approximated by the same distribution. A closer
look shows that the width of the distribution base is approximately
proportional to the data size. More complex distributions, such as the Gamma
distribution or the Beta distribution, might be more accurate. But to simplify
the analysis that follows, we will regard the data service time T as a random
variable defined on [a, b] (a and b are two parameters of a real system), which
follows a triangular distribution, denoted Tr[a, b]:
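$$ f(t) = \begin{cases} \dfrac{4(t-a)}{(b-a)^2}, & a \le t \le \dfrac{a+b}{2} \\[1ex] \dfrac{4(b-t)}{(b-a)^2}, & \dfrac{a+b}{2} < t \le b \end{cases} \tag{3.4} $$
Its cdf (cumulative distribution function) is
$$ F(t) = \begin{cases} \dfrac{2(t-a)^2}{(b-a)^2}, & a \le t \le \dfrac{a+b}{2} \\[1ex] 1 - \dfrac{2(b-t)^2}{(b-a)^2}, & \dfrac{a+b}{2} < t \le b \end{cases} \tag{3.5} $$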
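The triangular model of Eq. (3.4) and Eq. (3.5) translates directly into code. The sketch below is illustrative only; the function names and the midpoint-rule helper `tri_mean` are assumptions, not part of the source. The helper is a quick consistency check: because Tr[a, b] is symmetric, its mean should equal (a + b)/2.

```python
def tri_pdf(t, a, b):
    """Triangular pdf Tr[a, b] of Eq. (3.4), peaking at the midpoint."""
    if t < a or t > b:
        return 0.0
    width_sq = (b - a) ** 2
    if t <= (a + b) / 2.0:
        return 4.0 * (t - a) / width_sq
    return 4.0 * (b - t) / width_sq


def tri_cdf(t, a, b):
    """Triangular cdf of Eq. (3.5)."""
    if t <= a:
        return 0.0
    if t >= b:
        return 1.0
    width_sq = (b - a) ** 2
    if t <= (a + b) / 2.0:
        return 2.0 * (t - a) ** 2 / width_sq
    return 1.0 - 2.0 * (b - t) ** 2 / width_sq


def tri_mean(a, b, steps=1000):
    """Midpoint-rule estimate of the mean of Tr[a, b] (hypothetical helper)."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        t = a + (i + 0.5) * h
        total += t * tri_pdf(t, a, b) * h
    return total


if __name__ == "__main__":
    # Tr[1, 2] is the distribution used for the analytical curves of T(n, 1).
    print(tri_pdf(1.5, 1.0, 2.0), tri_cdf(1.5, 1.0, 2.0), tri_mean(1.0, 2.0))
```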
One explanation for this model is as follows: in a real system, data is
delivered in packets of some small size. The delivery time of the ith packet
is a random variable t_i, whose probability distribution can be characterized
by a uniform distribution over some time span; the t_i's are assumed to be
i.i.d. random variables. Then the service time T of the whole data is
T = s + Σ_i t_i, where s is another uniform random variable describing the setup
(or overhead) time for sending a certain amount of data. Thus the pdf of T
is a Gaussian-like function, whose base width is approximately proportional
to the number of packets in the data, which in turn is proportional to
the data size. For simplicity, we approximate the Gaussian-like function by a
suitable triangular function. The distributions are shown in Figure 3.3.
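The packet-level explanation can be explored with a small simulation. This is a sketch under stated assumptions: packet times are taken as uniform on [0, packet_span] and the setup time as uniform on [0, setup_span]; these spans, the function names and the sample counts are hypothetical and are not measurements from the source.

```python
import random


def simulate_service_time(packets, packet_span, setup_span, rng):
    """Draw one sample of T = s + sum_i t_i with uniform s and t_i."""
    s = rng.uniform(0.0, setup_span)
    return s + sum(rng.uniform(0.0, packet_span) for _ in range(packets))


def sample_stats(packets, packet_span=1.0, setup_span=1.0, runs=5000, seed=0):
    """Return (min, max, mean) of simulated service times."""
    rng = random.Random(seed)
    samples = [simulate_service_time(packets, packet_span, setup_span, rng)
               for _ in range(runs)]
    return min(samples), max(samples), sum(samples) / runs


if __name__ == "__main__":
    # Doubling the number of packets roughly doubles the base width
    # (setup_span + packets * packet_span) and the mean of T.
    for packets in (20, 40):
        print(packets, sample_stats(packets))
```

Plotting a histogram of the samples should show the bell-like shape that the triangular function approximates.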
3.2.2 Verification with T(n, 1). Intuitively, having more servers should
provide better performance when the amount of data stored on each server
is fixed, i.e., E[T(n, k)] decreases as n increases and/or k decreases. We can
get pdfs of the T(n, k) for a data server system by evaluating Eq. (3.2) for
the service time distribution in Eq. (3.4) and Eq. (3.5). Figure 3.4(a) shows
the pdfs of T(n, 1), where 1 ≤ n ≤ 3 and T is of the triangular distribution
Tr[1, 2]. Here we can see the pdf of T(n, k) shifts left as n increases, which
indicates that the average of the random variable T(n, k) decreases as n
increases.
[Figure 3.3: panels (a), (b), (c)]
Fig. 3.3. Probability distributions of data service time of (a) single packet, (b) the
whole data, (c) the approximation with Tr[a, b]
To further verify the properties of E[T(n, k)], simple experiments to
measure T(n, 1) were done on the experimental server system described in the
previous subsection. The system consists of three servers. In order to remove
other factors that also affect data service time, such as contention in the
communication medium (including the reliable communication layer, which
is a bottleneck if we use a single client which communicates with the three
servers), we use three clients, each of which is served by a separate server.
Conceptually the three clients are regarded as a single client, thus the whole
data service time is the minimum of the three individual service times of the
server-client pairs. Figure 3.4(b) shows the service times (T1, T2, and T3)
of the three individual server-client pairs for 3200 Kbytes of data each. Since
the variance among the three pairs is bigger than the variance within each
pair, the whole service time (Tmin), which is the minimum of the three, is
determined by the service time of the best client-server pair, as can be seen
in the experimental results. In this case, the pdf of Tmin is very close to
that of T1. To make the experimental results more interesting, some random
loads are added to each server, so that the variance among the three
client-server pairs is less than the variance within each pair, i.e., each pair behaves
more similarly. The service times of the three individual pairs (T1, T2, and T3)
and the whole service time (Tmin) are shown in Figure 3.4(c). Of those four
pdfs (Tmin, T1, T2 and T3), that of Tmin is the leftmost, which supports the
analytical properties of T(n, k) and the pdf model of T.
3.3 Data Distribution Scheme
Now let's turn to the data distribution problem: in a server system with
a given total number of servers, n, we need to determine the number k of
the servers which store the raw data in order to maximize the performance
of the whole system (i.e., to minimize the mean service time of a client's data
request); given k, the rest of the servers can store the redundant data. When n
and the pdf f(t) are fixed, E[T(n, k)] decreases monotonically as k decreases.
[Figure 3.4: panels (a), (b), (c); curves T1, T2, T3 and Tmin; horizontal axis T (sec)]
Fig. 3.4. pdfs of T(n, 1): (a) analytical result, where the pdf of T is Tr[1, 2], and
experimental service time for data of size 3200 Kbytes, where (b) no other loads on
the servers, and (c) other random loads on the servers
This means that in order to make E[T(n, k)] small, k should be as small as
possible. On the other hand, however, the smaller k is, the more data needs
to be stored on each server, since the total amount of the data a client needs
is always fixed; this means higher service time from each server. Our goal is to
find a k such that, when both sides of the problem are considered, E[T(n, k)]
is minimized.
After the parameter k is determined, in order to achieve optimal
performance in terms of E[T(n, k)], we can use MDS array codes to distribute the
redundant data so that data from any k servers can be assembled to form
the whole of the requested data, as was shown in the previous section. The
only remaining problem is to determine k to minimize E[T(n, k)]. Applying
the pdf model of each server's service time, T, and using MDS codes for
distributing the redundant data, we get that if the pdf of T is Tr[a, b] when
k = 1, then for general k, the corresponding pdf is Tr[a/k, b/k], since the base
width of the pdf is proportional to the data size. Theoretically, the optimal
k can be calculated as follows:
$$ k_{min} = \arg\min_k \int k \binom{n}{k} F(t)^{k-1} [1 - F(t)]^{n-k} \, t f(t) \, dt \tag{3.6} $$
where f(t) and F(t) are as in Eq. (3.4) and Eq. (3.5), except that a and b
should be replaced by a/k and b/k respectively. Notice that kmin is a function of
the entire pdf f(t), not only the mean E[T] and the variance Var[T].
Even for a simple pdf such as Tr[a, b], the above equation cannot be
solved in closed form. But in practice, the system parameters a and b can be
determined by experiments, and then the above equation can be solved
numerically. Figure 3.5 gives several examples of solving the above equation. In the
examples, a = 1 and b = 5. For n = 10, 20, and 40, E[T(n, k)] is calculated
for 1 ≤ k ≤ n. The results are shown in Figure 3.5(a)(b)(c), where (b) and
(c) only show the last few values of k, since for small k E[T(n, k)] decreases
monotonically as k increases. From the results, we can see (a) kmin = 10, when
n = 10, (b) kmin = 19, when n = 20, and (c) kmin = 37, when n = 40.
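The numerical solution of Eq. (3.6) can be sketched as a direct search over k. The code below is an illustration rather than the authors' program: it assumes the per-server distribution Tr[a/k, b/k] for general k, integrates with Simpson's rule, and uses hypothetical names (`mean_T`, `k_min`, `steps`).

```python
from math import comb


def tri_pdf(t, a, b):
    """Triangular pdf Tr[a, b], Eq. (3.4)."""
    if t <= a or t >= b:
        return 0.0
    rise = (t - a) if t <= (a + b) / 2.0 else (b - t)
    return 4.0 * rise / (b - a) ** 2


def tri_cdf(t, a, b):
    """Triangular cdf, Eq. (3.5)."""
    if t <= a:
        return 0.0
    if t >= b:
        return 1.0
    if t <= (a + b) / 2.0:
        return 2.0 * (t - a) ** 2 / (b - a) ** 2
    return 1.0 - 2.0 * (b - t) ** 2 / (b - a) ** 2


def mean_T(n, k, a, b, steps=4000):
    """E[T(n, k)] for T ~ Tr[a, b] via Eq. (3.2)-(3.3), Simpson's rule."""
    coeff = k * comb(n, k)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        t = a + i * h
        F = tri_cdf(t, a, b)
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += (weight * t * coeff * F ** (k - 1)
                  * (1.0 - F) ** (n - k) * tri_pdf(t, a, b))
    return total * h / 3.0


def k_min(n, a, b):
    """Search k = 1..n, scaling the per-server pdf to Tr[a/k, b/k]."""
    means = {k: mean_T(n, k, a / k, b / k) for k in range(1, n + 1)}
    best = min(means, key=means.get)
    return best, means


if __name__ == "__main__":
    for n in (10, 20, 40):
        best, means = k_min(n, 1.0, 5.0)
        print(n, best, round(means[best], 4))
```

The text reports kmin = 10, 19 and 37 for n = 10, 20 and 40; the differences between neighbouring k values near the minimum are small, so the integration step should be kept fine when comparing against those values.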
Even though the above examples use specific pdfs, the same method also
applies to other pdfs by plugging a suitable f(t) into Eq. (3.6). Thus, for a
given server system, such a kmin can always be found. Proper MDS array
codes can then be used based on the (n, k) pair. Thus we get an optimal data
distribution scheme for a given server system.
[Figure 3.5: panels (a) n = 10, (b) n = 20, (c) n = 40; horizontal axis k]
Fig. 3.5. E[T(n, k)] vs. k for different n, where a = 1 and b = 5
3.4 Data Acquisition Scheme
Once the data distribution scheme is set, i.e., k is determined and the proper
MDS array code is chosen, the client needs to decide how to request (or
read) data. In general, a client should send its request to as many servers as
possible and also make the amount of data it needs from each server as small
as possible, since the properties of E[T(n, k)] show that more redundancy
brings better performance. For a specific distribution scheme, the client needs
to calculate the pdfs of all possible data read schemes, and then choose an
optimal read scheme. Since the read schemes are closely related to the MDS
array code being used, here we will give an example using a specific code to
show the guidelines for choosing an optimal read scheme.
In this example, the server system has 2n servers, and the data that
the client requests can be assembled from any 2n - 2 servers, i.e., this is
a (2n, 2n - 2) system. The B-Code can be used to implement this system.
The data distribution using the B-Code is as follows: (1) the whole raw
(information) data is partitioned into 2n(n - 1) blocks of equal size (some
padding is added if necessary); (2) each of the 2n servers stores n - 1
blocks of the data; (3) 2n blocks of redundant (or parity) data are calculated
according to the encoding rules of the B-Code, i.e., each parity block is an
XOR of suitable 2n - 2 raw data blocks, and then each server stores 1 parity
block. The structure of the B-Code is shown in Figure 2.1.
The MDS property of the B-Code gives 3 schemes for reconstructing the
whole raw data from the data stored on 2n servers, each of which has n - 1
blocks of raw data and 1 block of parity data: (1) read from all of the 2n
servers, each of which sends its n - 1 blocks of raw data; (2) read from any
2n - 2 servers, each of which sends all of its n blocks of data (including raw
and parity data); (3) read from all of the 2n servers, each of which sends all
of its n blocks of data. The 3 schemes are shown in Figure 3.6, where the
shaded parts are the data to be read.
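The block counts behind the layout and the three read schemes can be tabulated with a few lines of arithmetic. This sketch only counts blocks; it does not implement the B-Code encoding rules, and the function and key names are hypothetical.

```python
def b_code_layout(n):
    """Block counts for the (2n, 2n - 2) layout described in the text."""
    servers = 2 * n
    return {
        "servers": servers,
        "raw_blocks": servers * (n - 1),   # 2n(n - 1) information blocks
        "parity_blocks": servers,          # one parity block per server
        "blocks_per_server": n,            # n - 1 raw + 1 parity
    }


def read_schemes(n):
    """For each scheme: servers contacted, blocks sent per server,
    and the number of server replies the client must wait for."""
    return {
        1: {"contacted": 2 * n, "per_server": n - 1, "awaited": 2 * n},
        2: {"contacted": 2 * n - 2, "per_server": n, "awaited": 2 * n - 2},
        3: {"contacted": 2 * n, "per_server": n, "awaited": 2 * n - 2},
    }


def blocks_needed(scheme):
    """Blocks the client must actually receive before it can decode."""
    return scheme["awaited"] * scheme["per_server"]


if __name__ == "__main__":
    n = 3
    print(b_code_layout(n))
    for number, scheme in read_schemes(n).items():
        print(number, scheme, blocks_needed(scheme))
```

In all three schemes the client decodes from 2n(n - 1) received blocks; the schemes differ in how many servers are contacted and in how many replies must arrive, which is what drives the difference in service time.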
[Figure 3.6: panels (a) Scheme 1, (b) Scheme 2, (c) Scheme 3]
Fig. 3.6. Three read schemes using the B-Code
Notice that there is no redundant data in scheme (1) or scheme (2), so
the client must wait until it receives all the data from all the servers. But in
scheme (3), there is redundant data, so the client only needs to receive data
from any 2n - 2 of the 2n requested servers. Let E[T(2n, 2n)]_{n-1},
E[T(2n - 2, 2n - 2)]_n and E[T(2n, 2n - 2)]_n denote the mean data service time of
the three schemes respectively (the subscript is the number of blocks each
server sends). From Property 1 of Theorem 3.1,
E[T(2n - 2, 2n - 2)]_n > E[T(2n, 2n - 2)]_n. But the relation between E[T(2n, 2n)]_{n-1}
and either E[T(2n - 2, 2n - 2)]_n or E[T(2n, 2n - 2)]_n is not so obvious, since
in scheme (1) the client needs to wait for more servers, but needs less data
(thus less service time) from each server. So to determine which scheme is
the best for a given system, we need to calculate the pdf of the whole
service time for all candidate schemes, which are scheme (1) and scheme
(3) in this case.
Assume that the pdf of the time T for each server to send n blocks of data
to the client is Tr[a, b]; then the pdf of T in scheme (1) is Tr[(n-1)a/n, (n-1)b/n],
since each server only needs to send n - 1 blocks of data, and the pdf of T
in scheme (2) or (3) is Tr[a, b]. Now the pdfs of the whole service time in
the different schemes can be calculated according to Eq. (3.2), Eq. (3.4) and
Eq. (3.5). Figure 3.7 shows the pdfs for different values of n, where a = 1 and
b = 10.
Using Eq. (3.3), the mean of the whole service time of different schemes
can be calculated. These means are listed in Table 3.1, for a = 1 and b = 10.
Table 3.1. Mean service time of different data read schemes, where a = 1 and b = 10
Scheme                        n = 3     n = 7     n = 10
E[T(2n, 2n)]_{n-1}            5.2195    7.3128    7.8857
E[T(2n - 2, 2n - 2)]_n        7.4089    8.4207    8.6976
E[T(2n, 2n - 2)]_n            5.8910    7.2466    7.6786
The above calculations show that the performance of the three schemes
depends on the system parameter n (when a and b are fixed). In a small
server system, scheme (1) is the best. As n increases, scheme (3) becomes
better. For a system of 6 servers (n = 3), scheme (1) is the best, but for
systems of 14 servers (n = 7) and 20 servers (n = 10), scheme (3) is the best.
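The comparison of the three read schemes can be reproduced numerically from Eq. (3.2) to Eq. (3.5). The following is a sketch under the assumptions stated in the text (per-server time Tr[a, b] for n blocks, scaled by (n - 1)/n in scheme (1)); the function names and the Simpson step count are illustrative choices, not the authors' code.

```python
from math import comb


def tri_pdf(t, a, b):
    """Triangular pdf Tr[a, b], Eq. (3.4)."""
    if t <= a or t >= b:
        return 0.0
    rise = (t - a) if t <= (a + b) / 2.0 else (b - t)
    return 4.0 * rise / (b - a) ** 2


def tri_cdf(t, a, b):
    """Triangular cdf, Eq. (3.5)."""
    if t <= a:
        return 0.0
    if t >= b:
        return 1.0
    if t <= (a + b) / 2.0:
        return 2.0 * (t - a) ** 2 / (b - a) ** 2
    return 1.0 - 2.0 * (b - t) ** 2 / (b - a) ** 2


def mean_T(n, k, a, b, steps=4000):
    """E[T(n, k)] for T ~ Tr[a, b] via Eq. (3.2)-(3.3), Simpson's rule."""
    coeff = k * comb(n, k)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        t = a + i * h
        F = tri_cdf(t, a, b)
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += (weight * t * coeff * F ** (k - 1)
                  * (1.0 - F) ** (n - k) * tri_pdf(t, a, b))
    return total * h / 3.0


def scheme_means(n, a, b):
    """Mean whole-service time of read schemes (1), (2) and (3)."""
    shrink = (n - 1) / n  # scheme (1) sends n - 1 of the n blocks per server
    return {
        1: mean_T(2 * n, 2 * n, shrink * a, shrink * b),
        2: mean_T(2 * n - 2, 2 * n - 2, a, b),
        3: mean_T(2 * n, 2 * n - 2, a, b),
    }


def best_scheme(n, a, b):
    """Scheme number with the smallest mean service time."""
    means = scheme_means(n, a, b)
    return min(means, key=means.get)


if __name__ == "__main__":
    for n in (3, 7, 10):
        print(n, scheme_means(n, 1.0, 10.0), best_scheme(n, 1.0, 10.0))
```

With a = 1 and b = 10 the computed means can be compared entry by entry with Table 3.1.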
Though quite simple, the above example shows that after the data
distribution is set at the server side, the client has different ways of reading data
from the servers. For a given system (i.e., a certain pdf of T, a fixed (n, k)
pair and a particular code), there always exists an optimal read scheme for
the client. Finding this scheme requires careful calculation. Since the read
schemes are highly related to the codes used, exploring codes that offer more
read choices is an interesting research problem. It is conjectured in [23] that
all MDS codes have a so-called strong MDS property, which provides the
flexibility of reading schemes.
4. Summary
This chapter deals with two issues in highly available distributed storage
systems: reliability and efficiency. To achieve reliability, three classes of MDS
array codes are described. They are suitable for storage applications because
of their simple computations for encoding and decoding, their MDS property
and their low (or optimal) update complexity. Two problems, namely the data
distribution problem and the data acquisition problem, are posed, and solutions
are proposed that use the redundancy in storage systems to improve
their performance.
A practical distributed storage system is implemented as part of the RAIN
(Reliable Array of Independent Nodes) system, a reliable and efficient
computing environment at the Parallel and Distributed Computing Lab of
Caltech, using the approaches discussed in this chapter. A detailed description
of the RAIN system can be found in [6].
[Figure 3.7: panels (a) n = 3, (b) n = 7, (c) n = 10; horizontal axis T]
Fig. 3.7. PDFs of different data read schemes, where a = 1, b = 10; 1, 2 and 3
represent scheme (1), (2) and (3) respectively.
References
1. B. A. Anderson, "Symmetry Groups of Some Perfect 1-Factorizations of Com-
plete Graphs," Discrete Mathematics, 18, 227-234, 1977.
2. M. Blaum, J. Brady, J. Bruck and J. Menon, "EVENODD: An Efficient Scheme
for Tolerating Double Disk Failures in RAID Architectures," IEEE Trans. on
Computers, 44(2), 192-202, Feb. 1995.
3. M. Blaum, J. Bruck, A. Vardy, "MDS Array Codes with Independent Parity
Symbols," IEEE Trans. on Information Theory, 42(2), 529-542, March 1996.
4. M. Blaum, P. G. Farrell and H. C. A. van Tilborg, "Chapter on Array Codes",
Handbook of Coding Theory, edited by V. S. Pless and W. C. Huffman, to
appear.
5. M. Blaum, R. M. Roth, "New Array Codes for Multiple Phased Burst Correc-
tion," IEEE Trans. on Information Theory, 39(1), 66-77, Jan. 1993.
6. V. Bohossian, C. Fan, P. LeMahieu, M. Riedel, L. Xu and J. Bruck, "Comput-
ing in the RAIN: A Reliable Array of Independent Nodes", Caltech Technical
Report, 1998.
Available at: http://paradise.caltech.edu/papers/etr029.ps.
7. M. N. Frank, "Dispersity Routing in Store-and-Forward Networks," Ph.D. the-
sis, University of Pennsylvania, 1975.
8. P. G. Farrell, "A Survey of Array Error Control Codes," ETT, 3(5), 441-454,
1992.
9. R. G. Gallager, "Low-Density Parity-Check Codes," MIT Press, Cambridge,
Massachusetts, 1963.
10. J. Gemmell, "Scalable Reliable Multicast Using Erasuring-Correcting Re-
Sends," Technical Report MSR-TR-97-20, Microsoft Research, June, 1997.
11. R. M. Goodman, R. J. McEliece and M. Sayano, "Phased Burst Error Correct-
ing Arrays Codes," IEEE Trans. on Information Theory, 39, 684-693,1993.
12. A. Kotzig, "Hamilton Graphs and Hamilton Circuits," Theory of Graphs and
Its Applications (Proc. Sympos. Smolenice), 63-82, 1963.
13. F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting Codes,
Amsterdam: North-Holland, 1977.
14. Norman K. Ouchi, "System for Recovering Data Stored in Failed Memory
Unit," US Patent 4092732, May 30, 1978.
15. A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd Edi-
tion, McGraw-Hill, Inc., 1984.
16. D. A. Patterson, G. A. Gibson and R. H. Katz, "A Case for Redundant Arrays
of Inexpensive Disks," Proc. SIGMOD Int. Conf. Data Management, 109-116,
Chicago, IL, 1988.
17. R. M. Tanner, "A Recursive Approach to Low Complexity Codes," IEEE Trans.
on Information Theory, 27(5), 533-547, Sep. 1981.
18. D. G. Wagner, "On the Perfect One-Factorization Conjecture," Discrete Math-
ematics, 104, 211-215, 1992.
19. W. D. Wallis, One-Factorizations, Kluwer Academic Publisher, 1997.
20. Samuel S. Wilks, Mathematical Statistics, John Wiley & Sons, Inc., 1963.
21. L. Xu and J. Bruck, "X-Code: MDS Array Codes with Optimal Encoding,"
IEEE Trans. on Information Theory, 45(1), 272-276, Jan., 1999.
Also available at: http://paradise.caltech.edu/papers/etr020.ps.
22. L. Xu, V. Bohossian, J. Bruck and D. Wagner, "Low Density MDS Codes
and Factors of Complete Graphs," Proceedings of 1998 IEEE Symposium on
Information Theory, Aug., 1998; Revised version to appear in IEEE Trans. on
Information Theory, Sep. 1999.
Also available at: http://paradise.caltech.edu/papers/etr025.ps.
23. L. Xu, "Highly Available Distributed Storage Systems," Ph.D. thesis, California
Institute of Technology, 1998.
Also available at: http://paradise.caltech.edu/~lihao/thesis.html.
List of Lectures
K. Abdali               Advanced Computing and Communication Research under NSF Support
W. Almesberger          SRP - a Scalable Resource Reservation Protocol for the Internet
J. Blum                 Para-Station: High Performance Environment for Clusters
T. Braun                Differentiated Internet Services
J. Bruck                Reliable Distributed High Performance Computing
H. Busch                BRAIN - Berlin Research Area Information Network
G. Cooperman            Parallel TOP-C and Scaling Up with DSM
T. Eickermann           Metacomputing in the Gigabit Testbed West
L. Finkelstein          Experiences at Northeastern University Connecting to a High Performance National Network (presented by G. Cooperman)
E. Gabriel              High Performance Metacomputing in a Transatlantic Wide Area Application Testbed
G. Havas                Some Performance Studies in Exact Linear Algebra
A. Hoisie               Performance and Scalability Analysis of Applications on Teraflop-class Distributed Architectures
P. Holleczek            Controlling the Quality of Service in Wide Area ATM Networks
E. Jessen               The Gigabitwissenschaftsnetz of DFN
M. Köster               High Performance Computing across the ATM-WAN Essen-Bonn
H. Lederer              Visual Supercomputing and Metacomputing - Gigabit Testbed Projects with Contributions of the Max Planck Society
M. Mähler               Value-added Services Based on Virtual LANs
I. Matta                Quality of Service in Wide Area Networks: Issues and Protocols
G. Michler              The Monster: A Challenge for High Performance Computing
D. Nastoll and H. Gollan  MILESS - A Learning and Teaching Server for Multi-Media Documents
T. Plagemann            Gigabit Networking in Norway - Infrastructure, Applications and Projects
E. Quintana-Orti        A Portable Subroutine Library for Solving Linear Control Problems on Distributed Memory Computers
E. Rathgeb              Gigabit Wide Area Networks - Options and Trends
A. Rieke                Encryption in ATM Systems
G. Schneider            Low-Speed ATM over ADSL and the Need for High-Speed Networks
U. Schwiegelshohn       The NRW Metacomputing Initiative
R. Staszewski           Retrodigitalization and Multivalent Document Systems
T. Warschko             ParaStation 2: Efficient Parallel Computing in Workstation Clusters
M. Weller               Multi-Broadcast Communication in ATM Computer Networks and Mathematical Algorithm Development
List of Registered Participants
Dr. Kamal Abdali
National Science Foundation
1800 G Street
N. W. Washington DC 20550
USA
Kabdali@ndf.gov
Dr. W. Almesberger
Département d'informatique
Laboratoire de réseaux de
communication - LRC
IN-Ecublens
CH-1015 Lausanne
Werner.almesberger@di.epfl.ch
J. Blum
Fakultät für Informatik
Universität Karlsruhe
Am Fasanengarten 5
76131 Karlsruhe
blum@ira.uka.de
Prof. Dr. T. Braun
Universität Bern
Institut für Informatik
und angewandte Mathematik
Neubrückstr. 10
3012 Bern
braun@iam.unibe.ch
Prof. J. Bruck
Department of Electrical Engineering
California Institute of Technology
Pasadena, CA 91125
USA
bruck@vangogh.paradise.caltech.edu
H. Busch
Konrad-Zuse-Zentrum für
Informationstechnik Berlin
Bereich Rechenzentren
Abteilung Höchstleistungsrechner -
Leiter, Takustr. 7
14195 Berlin-Dahlem
busch@zib.de
E. Gabriel
Rechenzentrum Universität Stuttgart
Abteilung Paralleles Rechnen
Allmandring 30
70550 Stuttgart
gabriel@hlrs.de
H. Gollan
Institut für Experimentelle Mathematik
Universität GH Essen
Ellernstr. 29
45326 Essen
holger@exp-math.uni-essen.de
Prof. George Havas
Dept. of Computer Science
University of Queensland
Queensland 4072
Australia
havas@cs.uq.edu.au
Dr. Adolfy Hoisie
Scientific Computing, CIC-19
MS B256
Los Alamos National Laboratory
Los Alamos, NM 87545
hoisie@lanl.gov
Dr. P. Holleczek
Abt. Kommunikationssysteme
Regionales Rechenzentrum Erlangen
Martensstr. 1
91058 Erlangen
peter.holleczek@rrze.uni-erlangen.de
Prof. Dr.-Ing. E. Jessen
Institut für Informatik
TU München
Augustenstr. 77
80290 München
jessen@informatik.tu-muenchen.de
Prof. Gene Cooperman
College of Computer Science
Northeastern University
M/S 215 CN
Boston, MA 02115
gene@ccs.neu.edu
Dr. T. Eickermann
Forschungszentrum Jülich GmbH
ZAM
Leo-Brandt-Straße
52428 Jülich
th.eickermann@fz-juelich.de
Dr. B. Lix
Hochschulrechenzentrum
Universität GH Essen
Schützenbahn 70
45141 Essen
lix@hrz.uni-essen.de
Dr. M. Mähler
IBM Deutschland GmbH
European Network Center, Heidelberg
Vangerowstr. 18
69115 Heidelberg
maehler@heidelbg.ibm.com
Prof. Dr. P. Martini
Rheinische Friedrich-Wilhelms-
Universität Bonn
Institut für Informatik IV
Römerstrasse 164
D-53117 Bonn
Peter.Martini@cs.uni-bonn.de
Prof. Dr. G. Michler
Institut für Experimentelle
Mathematik
Universität GH Essen
Ellernstr. 29
45326 Essen
archiv@exp-math.uni-essen.de
Dipl.-Inf. M. Köster
Rheinische Friedrich-Wilhelms-Universität
Bonn
Institut für Informatik IV
Römerstrasse 164
D-53117 Bonn
koester@cs.uni-bonn.de
Dr. H. Lederer
Rechenzentrum Garching der Max-Planck-Gesellschaft
Max-Planck-Institut für Plasmaphysik
Boltzmannstr. 2
85748 Garching
lederer@rzg.mpg.de
Prof. E. Quintana-Orti
Departamento de Informatica
Universidad Jaime I
Campus Penyeta Roja
E-12071 Castellon, Spain
Quintana@nuvol.uji.es
Prof. Dr.-Ing. E. P. Rathgeb
Institut für Experimentelle Mathematik
Universität GH Essen
Ellernstr. 29
45326 Essen
erwin.rathgeb@exp-math.uni-essen.de
Andreas Rieke
Lehrstuhl für Kommunikationssysteme
Fachbereich Elektrotechnik
Fernuniversität Hagen
Feithstr. 142
58084 Hagen
andreas.rieke@fernuni-hagen.de
Prof. Dr. G. Schneider
Gesellschaft für wissenschaftliche
Datenverarbeitung
mbH Göttingen (GWDG)
Am Faßberg
37077 Göttingen
gschnei2@gwdg.de
D. Nastoll
Hochschulrechenzentrum
Universität GH Essen
Schützenbahn 70
45141 Essen
nastoll@hrz.uni-essen.de
Prof. Dr. H. Obrecht
Lehrstuhl für Baumechanik-Statik
Universität Dortmund
Fakultät Bauwesen
August-Schmidt-Straße 8
44221 Dortmund
msobr@busch.bauwesen.uni-dortmund.de
Dr. D. V. Pasechnik
Faculty of Technical Mathematics and
Informatics
Department of Statistics, Probability
and Operations
Mekelweg 4
NL-2628 CD Delft
D.Pasechnik@twi.tudelft.nl
Dr. T. Plagemann
University of Oslo, UNIK
Granveien 33, P.O Box 70
N-2007 Kjeller
plagemann@unik.no
Dr. T. Warschko
Fakultät für Informatik
Universität Karlsruhe
Am Fasanengarten 5
76131 Karlsruhe
warschko@ira.uka.de
Dr. M. Weller
Institut für Experimentelle
Mathematik
Universität GH Essen
Ellernstr. 29
45326 Essen
eowmob@exp-math.uni-essen.de
Prof. Dr. U. Schwiegelshohn
Universität Dortmund
Lehrstuhl Datenverarbeitungssysteme
Otto-Hahn-Str. 4
44221 Dortmund
uwe.@ds.e-technik.uni-dortmund.de
Prof. Dr. U. Stammbach
ETH Zürich
Forschungsinstitut für Mathematik
ETH Zentrum
CH-8092 Zürich
Stammb@math.ethz.ch
Dr. R. Staszewski
Institut für Experimentelle Mathematik
Universität GH Essen
Ellernstr. 29
45326 Essen
reiner@exp-math.uni-essen.de
Dr. R. Völpel
GMD
SCAI (Institut für Wissenschaftliches
Rechnen)
Schloss Birlinghoven
D-53754 Sankt Augustin
Roland.voelpel@gmd.de
P. Wunderling
GMD
IMK (Institut für Medienkommunikation)
Schloss Birlinghoven
D-53754 Sankt Augustin
wunderling@gmd.de
R. Yahyapour
Fakultät für Elektrotechnik
Lehrstuhl für Datenverarbeitungssysteme
Otto-Hahn-Str. 4
44221 Dortmund
yahya@peggy.E-technik.Uni-Dortmund.de
Lecture Notes in Control and Information Sciences
Edited by M. Thoma
1993-1999 Published Titles:
Vol. 186: Sreenath, N.
Systems Representation of Global Climate
Change Models. Foundation for a Systems
Science Approach.
288 pp. 1993 [3-540-19824-5]
Vol. 187: Morecki, A.; Bianchi, G.;
Jaworeck, K. (Eds)
RoManSy 9: Proceedings of the Ninth
CISM-IFToMM Symposium on Theory and
Practice of Robots and Manipulators.
476 pp. 1993 [3-540-19834-2]
Vol. 188: Naidu, D. Subbaram
Aeroassisted Orbital Transfer: Guidance
and Control Strategies
192 pp. 1993 [3-540-19819-9]
Vol. 189: Ilchmann, A.
Non-Identifier-Based High-Gain Adaptive
Control
220 pp. 1993 [3-540-19845-8]
Vol. 190: Chatila, R.; Hirzinger, G. (Eds)
Experimental Robotics II: The 2nd
International Symposium, Toulouse,
France, June 25-27 1991
580 pp. 1993 [3-540-19851-2]
Vol. 191: Blondel, V.
Simultaneous Stabilization of Linear
Systems
212 pp. 1993 [3-540-19862-8]
Vol. 192: Smith, R.S.; Dahleh, M. (Eds)
The Modeling of Uncertainty in Control
Systems
412 pp. 1993 [3-540-19870-9]
Vol. 193: Zinober, A.S.I. (Ed.)
Variable Structure and Lyapunov Control
428 pp. 1993 [3-540-19869-5]
Vol. 194: Cao, Xi-Ren
Realization Probabilities: The Dynamics of
Queuing Systems
336 pp. 1993 [3-540-19872-5]
Vol. 195: Liu, D.; Michel, A.N.
Dynamical Systems with Saturation
Nonlinearities: Analysis and Design
212 pp. 1994 [3-540-19888-1]
Vol. 196: Battilotti, S.
Noninteracting Control with Stability for
Nonlinear Systems
196 pp. 1994 [3-540-19891-1]
Vol. 197: Henry, J.; Yvon, J.P. (Eds)
System Modelling and Optimization
975 pp approx. 1994 [3-540-19893-8]
Vol. 198: Winter, H.; Nüßer, H.-G. (Eds)
Advanced Technologies for Air Traffic Flow
Management
225 pp approx. 1994 [3-540-19895-4]
Vol. 199: Cohen, G.; Quadrat, J.-P. (Eds)
11th International Conference on
Analysis and Optimization of Systems -
Discrete Event Systems: Sophia-Antipolis,
June 15-16-17, 1994
548 pp. 1994 [3-540-19896-2]
Vol. 200: Yoshikawa, T.; Miyazaki, F. (Eds)
Experimental Robotics III: The 3rd
International Symposium, Kyoto, Japan,
October 28-30, 1993
624 pp. 1994 [3-540-19905-5]
Vol. 201: Kogan, J.
Robust Stability and Convexity
192 pp. 1994 [3-540-19919-5]
Vol. 202: Francis, B.A.; Tannenbaum, A.R.
(Eds)
Feedback Control, Nonlinear Systems,
and Complexity
288 pp. 1995 [3-540-19943-8]
Vol. 203: Popkov, Y.S.
Macrosystems Theory and its Applications:
Equilibrium Models
344 pp. 1995 [3-540-19955-1]
Vol. 204: Takahashi, S.; Takahara, Y.
Logical Approach to Systems Theory
192 pp. 1995 [3-540-19956-X]
Vol. 205: Kotta, U.
Inversion Method in the Discrete-time
Nonlinear Control Systems Synthesis
Problems
168 pp. 1995 [3-540-19966-7]
Vol. 206: Aganovic, Z.; Gajic, Z.
Linear Optimal Control of Bilinear Systems
with Applications to Singular Perturbations
and Weak Coupling
133 pp. 1995 [3-540-19976-4]
Vol. 207: Gabasov, R.; Kirillova, F.M.;
Prischepova, S.V.
Optimal Feedback Control
224 pp. 1995 [3-540-19991-8]
Vol. 208: Khalil, H.K.; Chow, J.H.;
Ioannou, P.A. (Eds)
Proceedings of Workshop on Advances
in Control and its Applications
300 pp. 1995 [3-540-19993-4]
Vol. 209: Foias, C.; Özbay, H.;
Tannenbaum, A.
Robust Control of Infinite Dimensional
Systems: Frequency Domain Methods
230 pp. 1995 [3-540-19994-2]
Vol. 210: De Wilde, P.
Neural Network Models: An Analysis
164 pp. 1996 [3-540-19995-0]
Vol. 211: Gawronski, W.
Balanced Control of Flexible Structures
280 pp. 1996 [3-540-76017-2]
Vol. 212: Sanchez, A.
Formal Specification and Synthesis of
Procedural Controllers for Process Systems
248 pp. 1996 [3-540-76021-0]
Vol. 213: Patra, A.; Rao, G.P.
General Hybrid Orthogonal Functions and
their Applications in Systems and Control
144 pp. 1996 [3-540-76039-3]
Vol. 214: Yin, G.; Zhang, Q. (Eds)
Recent Advances in Control and Optimization
of Manufacturing Systems
240 pp. 1996 [3-540-76056-5]
Vol. 215: Bonivento, C.; Marro, G.;
Zanasi, R. (Eds)
Colloquium on Automatic Control
240 pp. 1996 [3-540-76060-1]
Vol. 216: Kulhavý, R.
Recursive Nonlinear Estimation: A Geometric
Approach
244 pp. 1996 [3-540-76063-6]
Vol. 217: Garofalo, F.; Glielmo, L. (Eds)
Robust Control via Variable Structure and
Lyapunov Techniques
336 pp. 1996 [3-540-76067-9]
Vol. 218: van der Schaft, A.
L2-Gain and Passivity Techniques in
Nonlinear Control
176 pp. 1996 [3-540-76074-1]
Vol. 219: Berger, M.-O.; Deriche, R.;
Herlin, I.; Jaffré, J.; Morel, J.-M. (Eds)
ICAOS '96: 12th International Conference on
Analysis and Optimization of Systems -
Images, Wavelets and PDEs:
Paris, June 26-28 1996
378 pp. 1996 [3-540-76076-8]
Vol. 220: Brogliato, B.
Nonsmooth Impact Mechanics: Models,
Dynamics and Control
420 pp. 1996 [3-540-76079-2]
Vol. 221: Kelkar, A.; Joshi, S.
Control of Nonlinear Multibody Flexible Space
Structures
160 pp. 1996 [3-540-76093-8]
Vol. 222: Morse, A.S.
Control Using Logic-Based Switching
288 pp. 1997 [3-540-76097-0]
Vol. 223: Khatib, O.; Salisbury, J.K.
Experimental Robotics IV: The 4th
International Symposium, Stanford, California,
June 30 - July 2, 1995
596 pp. 1997 [3-540-76133-0]
Vol. 224: Magni, J.-F.; Bennani, S.;
Terlouw, J. (Eds)
Robust Flight Control: A Design Challenge
664 pp. 1997 [3-540-76151-9]
Vol. 233: Chiacchio, P.; Chiaverini, S. (Eds)
Complex Robotic Systems
189 pp. 1998 [3-540-76265-5]
Vol. 234: Arena, P.; Fortuna, L.; Muscato, G.;
Xibilia, M.G.
Neural Networks in Multidimensional
Domains: Fundamentals and New Trends in
Modelling and Control
179 pp. 1998 [1-85233-006-6]
Vol. 225: Poznyak, A.S.; Najim, K.
Learning Automata and Stochastic
Optimization
219 pp. 1997 [3-540-76154-3]
Vol. 226: Cooperman, G.; Michler, G.;
Vinck, H. (Eds)
Workshop on High Performance Computing
and Gigabit Local Area Networks
248 pp. 1997 [3-540-76169-1]
Vol. 227: Tarbouriech, S.; Garcia, G. (Eds)
Control of Uncertain Systems with Bounded
Inputs
203 pp. 1997 [3-540-76183-7]
Vol. 228: Dugard, L.; Verriest, E.I. (Eds)
Stability and Control of Time-delay Systems
344 pp. 1998 [3-540-76193-4]
Vol. 229: Laumond, J.-P. (Ed.)
Robot Motion Planning and Control
380 pp. 1998 [3-540-76219-1]
Vol. 230: Siciliano, B.; Valavanis, K.P. (Eds)
Control Problems in Robotics and Automation
328 pp. 1998 [3-540-76220-5]
Vol. 231: Emel'yanov, S.V.; Burovoi, I.A.;
Levada, F.Yu.
Control of Indefinite Nonlinear Dynamic
Systems
196 pp. 1998 [3-540-76245-0]
Vol. 232: Casals, A.; de Almeida, A.T. (Eds)
Experimental Robotics V: The Fifth
International Symposium Barcelona,
Catalonia, June 15-18, 1997
190 pp. 1998 [3-540-76218-3]
Vol. 235: Chen, B.M.
H∞ Control and Its Applications
361 pp. 1998 [1-85233-026-0]
Vol. 236: de Almeida, A.T.; Khatib, O. (Eds)
Autonomous Robotic Systems
283 pp. 1998 [1-85233-036-8]
Vol. 237: Kriegman, D.J.; Hager, G.D.;
Morse, A.S. (Eds)
The Confluence of Vision and Control
304 pp. 1998 [1-85233-025-2]
Vol. 238: Elia, N.; Dahleh, M.A.
Computational Methods for Controller Design
200 pp. 1998 [1-85233-075-9]
Vol. 239: Wang, Q.G.; Lee, T.H.; Tan, K.K.
Finite Spectrum Assignment for Time-Delay
Systems
200 pp. 1998 [1-85233-065-1]
Vol. 240: Lin, Z.
Low Gain Feedback
376 pp. 1999 [1-85233-081-3]
Vol. 241: Yamamoto, Y.; Hara, S.
Learning, Control and Hybrid Systems
472 pp. 1999 [1-85233-076-7]
Vol. 242: Conte, G.; Moog, C.H.; Perdon,
A.M.
Nonlinear Control Systems
192 pp. 1999 [1-85233-151-8]
Vol. 243: Tzafestas, S.G.; Schmidt, G. (Eds)
Progress in Systems and Robot Analysis and
Control Design
624 pp. 1999 [1-85233-123-2]
Vol. 244: Nijmeijer, H.; Fossen, T.I. (Eds)
New Directions in Nonlinear Observer Design
552 pp. 1999 [1-85233-134-8]
Vol. 245: Garulli, A.; Tesi, A.; Vicino, A. (Eds)
Robustness in Identification and Control
448 pp. 1999 [1-85233-179-8]
Vol. 246: Aeyels, D.;
Lamnabhi-Lagarrigue, F.; van der Schaft, A. (Eds)
Stability and Stabilization of Nonlinear Systems
408 pp. 1999 [1-85233-638-2]
Vol. 247: Young, K.D.; Özgüner, Ü. (Eds)
Variable Structure Systems, Sliding Mode
and Nonlinear Control
400 pp. 1999 [1-85233-197-6]
Vol. 248: Chen, Y.; Wen, C.
Iterative Learning Control
216 pp. 1999 [1-85233-190-9]