Distributed high performance computing and gigabit wide area networks. Experience with nationwide high performance wide area computer networks. This meeting surveys recent research directions in large scale networking.
G. Cooperman, E. Jessen and G. Michler

College of Computer Science, Northeastern University, Boston, MA 02115, USA
Technische Universität München, Institut für Informatik, Lehrstuhl VIII - Rechnerstruktur/architektur, D-80290 München
Institut für Experimentelle Mathematik, Universität Essen, Ellernstraße 29, D-45326 Essen

In many countries there are nationwide high performance wide area computer networks, e.g. the German broadband science network B-WiN or the very high speed Backbone Network Service (vBNS) of the American National Science Foundation (NSF). They form a part of the Internet, the widest wide area network of them all. Experience with these networks provides a wealth of information, in particular on hardware and software for wide area networking and high performance computing. The lectures and discussions at the meeting have inspired several fruitful international research collaborations on distributed information technologies and applications. They emphasize the interdisciplinary cooperation of electrical engineers, mathematicians and computer scientists in this very active area of research. In particular, this meeting surveys recent research directions in large scale networking: a) technologies enabling optical, electrical and wireless communications; b) network engineering and network management systems; c) system software and program development environments for distributed high performance computing; d) distributed applications such as electronic commerce, distance learning and digital libraries; e) high confidence systems with secure access, high availability, and strong privacy guarantees. Experimental results for such systems of the future can only be gained through access to today's gigabit wide area networks. Therefore another important section of this workshop was devoted to reports about existing gigabit testbeds and planned high speed networks in the United States, Europe, and in particular in Germany.
These testbeds include: the American initiatives for the Next Generation Internet (NGI) and the Internet 2 project of about 120 universities in the United States (cf. K. Abdali); the Norwegian National Research Network (cf. T. Plagemann); and the planned German Gigabitwissenschaftsnetz (GWin) (cf. E. Jessen). The NGI initiative is financially supported by the American federal agencies DARPA, DOE, NASA, NIH, NIST and NSF. It aims: a) to promote research, development and experimentation in advanced networking technologies; b) to deploy an NGI testbed emphasizing end-to-end performance, end-to-end quality of service and security; c) to develop ultra high speed switching and transmission technologies; d) to develop demanding applications that make use of the advances in network technologies. Among the proposed application areas are: health care, crisis management and response, distance learning, and distributed high performance computations for biomedicine, climate modeling, and basic science. The Internet 2 project is funded by 77 American universities and some industrial partners. It is driven by education and research. Internet 2 will include a gigabit network (Project Abilene), which will begin operating in 1999. Of course it will benefit from the experiments and results of the NGI initiative. In Germany, in spring 2000, GWin, the gigabit network of DFN (Deutsches Forschungsnetz; the German national scientific networking association), will start its operation. As a forerunner for its gigabit network, DFN is supporting two gigabit testbeds in West and South Germany (with a link to Berlin) where experiments are performed. Several articles of this volume report both on planned and on completed experiments. The gigabit testbed West connects the research centers GMD St. Augustin and FZ Jülich in North Rhine-Westphalia with a bandwidth of 2.5 Gbps.
It has broadband connections to the computer centers of the DLR (Deutsches Zentrum für Luft- und Raumfahrt) in Cologne-Porz and the Universities of Cologne and Essen. The gigabit testbed Süd connects the University of Erlangen and the Technical University of Munich. It currently consists of a dark fiber connection between the computer centers of these universities. The bandwidth of this switched ATM network is initially 3 times 2.5 Gbps, and it has a capacity many times larger through the use of wavelength division multiplexing. The gigabit testbed Süd will be extended to Berlin and Stuttgart, in order to connect the supercomputers at the Konrad Zuse Institute in Berlin, the Leibniz Computing Center in Munich and the Computing Center of Stuttgart University. This wide area network of supercomputers will be used for demanding distributed high performance computations in applied mathematics, physics, chemistry, engineering, economics and medicine. In 1997 the United States established the National Computational Science Alliance and the National Partnership for Advanced Computational Infrastructure, led respectively by the supercomputer centers at the University of Illinois at Urbana-Champaign and the University of California at San Diego. Each of these alliances consists of more than 60 partner institutions, including academic and government research labs and industrial organizations. These cooperating institutions all benefit from a metacomputing environment based on high speed networks. K. Abdali of the NSF provides further details in the article "Advanced computing and communications research under NSF support". The strictest requirements for high bandwidth applications can currently be found in the areas of metacomputing and distributed high performance computation. These applications serve a secondary purpose in stress testing the network, helping the engineers and computer scientists design better ones. However, the number of such large scale experiments is currently rather small.
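The capacity claim above can be made concrete with a one-line calculation. This is an illustrative sketch: the 2.5 Gbps per-channel rate and the three initial channels are from the text, while the larger channel counts are hypothetical examples of what wavelength division multiplexing makes possible.

```python
# Capacity of a wavelength-multiplexed link: channels x per-channel rate.
# The 3 x 2.5 Gbps starting configuration is from the text; the 16- and
# 40-channel counts are hypothetical, to illustrate "many times larger".

def wdm_capacity_gbps(channels: int, per_channel_gbps: float = 2.5) -> float:
    """Aggregate bandwidth of a WDM link in Gbps."""
    return channels * per_channel_gbps

for ch in (3, 16, 40):
    print(f"{ch} wavelengths -> {wdm_capacity_gbps(ch)} Gbps")
```

Each added wavelength multiplies the fiber's aggregate rate without relaying the cable, which is why a dark fiber link can grow far beyond its initial 7.5 Gbps configuration.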
As long as the cooperating institutions interconnected by a wide area high speed network are not given extra resources for distributed computer applications, this situation is likely to continue. Another set of lectures at the meeting was devoted to the interplay between communication hardware and software for high speed computer networks and mathematical algorithm development for distributed high performance computations. In particular, implementations of parallel linear algebra algorithms help to create experiments checking the technical communication properties of a broadband computer network with a bandwidth of 155 Mbit/s and higher. On the other hand, such benchmarks also help to analyze the efficiency of a mathematical algorithm. This volume also contains several contributions concerning the very organization of scientific knowledge itself. Many scientific publications quoting high performance computer applications lack proper documentation of the original computer programs and of the memory intensive output data. Recently many mathematical and other scientific journals have begun offering both paper and digital formats. The digital versions offer many advantages for the future: they can be searched, and they can be incorporated into a distributed library system. Over scientific wide area networks such as the planned German Gigabitwissenschaftsnetz, the digital libraries of universities and research institutes can be combined into a national distributed research library. The members of these institutions can be allowed to search, read, print and even annotate the digital texts (and their computational appendices containing the original programs and the output data) at their personal computers. The wide area networks offer not only distributed library applications but also completely new applications making use of multimedia, distance learning, and computer-aided teaching.
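The interplay between network properties and algorithmic efficiency described above can be sketched with a back-of-the-envelope model. Only the 155 Mbit/s link rate comes from the text; the sustained flop rate of a node and the matrix sizes are assumptions chosen purely to illustrate the communication/computation trade-off in a distributed linear algebra benchmark.

```python
# Back-of-the-envelope balance of a distributed dense matrix multiply.
# LINK_BITS_PER_S is from the text; FLOPS and the sizes are assumptions.

LINK_BITS_PER_S = 155e6   # broadband testbed link (from the text)
FLOPS = 1e8               # assumed sustained rate of one late-1990s node

def comm_seconds(n: int) -> float:
    """Time to ship one n x n block of 8-byte doubles over the link."""
    bits = n * n * 8 * 8  # n^2 doubles, 8 bytes each, 8 bits per byte
    return bits / LINK_BITS_PER_S

def comp_seconds(n: int) -> float:
    """Time for a dense n x n matrix multiply (~2 n^3 flops)."""
    return 2 * n**3 / FLOPS

for n in (500, 2000):
    print(f"n={n}: comm {comm_seconds(n):.2f} s, comp {comp_seconds(n):.2f} s")
```

Because transfer cost grows as O(n^2) while arithmetic grows as O(n^3), the ratio of the two times shows both how well the link performs and how communication-efficient the algorithm is, which is exactly the dual role of such benchmarks described in the text.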
Therefore this volume also contains several contributions describing such applications.

Advanced Computing and Communications Research under NSF Support

S. Kamal Abdali
National Science Foundation, Arlington, VA 22230, USA

Abstract. This paper discusses the research initiatives and programs supported by the National Science Foundation to promote high-end computing and large-scale networking. This work mainly falls under the US interagency activity called High Performance Computing and Communications (HPCC). The paper describes the Federal government context of HPCC, and the HPCC programs and their main accomplishments. Finally, it describes the recommendations of a recent high-level advisory committee on information technology, as these are likely to have a major impact on the future of government initiatives in high-end computing and networking.

1 Introduction

A previous paper [1] described the activities of the National Science Foundation (NSF) in the U.S. High Performance Computing and Communications (HPCC) program until 1996. The purpose of the present paper is to update that description to cover the developments since then. While some management changes have taken place during this period, and there is some redirection of its thrusts, the HPCC program continues to flourish, to say the least. The main new activities at the NSF are Partnerships for Advanced Computational Infrastructures (PACIs), the Next Generation Internet (NGI), and the Knowledge and Distributed Intelligence (KDI) initiative, and there are renewed programs for Science and Technology Centers and Digital Libraries.
New initiatives that may replace the program or change its direction substantially are also expected to result from the recommendations of the Presidential Information Technology Advisory Committee (PITAC). The paper is mainly concerned with these new issues. But to make it self-contained, the entire HPCC context is briefly described also.

2 The HPCC program

The US High Performance Computing and Communications (HPCC) program was launched in 1991. It operated as a congressionally mandated initiative from October 1991 through September 1996, following the enactment of the High Performance Computing Act of 1991. Since October 1996, it has continued as a program under the leadership of the Computing, Information, and Communications (CIC) Subcommittee of the Committee on Technology (CT), which is itself overseen by the National Science and Technology Council, a US Cabinet-level organization. Instrumental in the establishment of the program was a series of national-level studies of scientific and technological trends in computing and networking [2-5]. These studies concluded and persuasively argued that a federal-level initiative in high-performance computing was needed to ensure the preeminence of American science and technology. Solving the challenging scientific and engineering problems that were already on the horizon required significantly more computational power than was available. Another factor was the progress made abroad, especially the Japanese advances in semiconductor chip manufacture and supercomputer design, and the Western European advances in supercomputing applications in science and engineering.
It was also clear that the advances in information technology would have a far-reaching impact beyond science and technology, and would affect society in general in profound, unprecedented ways. The HPCC program was thus established to stimulate, accelerate, and harness these advances for coping with scientific and engineering challenges, solving societal and environmental problems, meeting national security needs, and improving the nation's economic productivity and competitiveness. As late as 1996, the goals of the HPCC initiative were stated separately (e.g., in [10]) from the CIC mission descriptions. Now that HPCC has become a CIC research and development (R&D) program, its goals are subsumed in the CIC goals, which are formally stated as follows ([13]):

* Assure continued US leadership in computing, information, and communications technologies to meet Federal goals and to support U.S. 21st century academic, defense, and industrial interests
* Accelerate deployment of advanced and experimental information technologies to maintain world leadership in science, engineering, and mathematics; improve the quality of life; promote long term economic growth; increase lifelong learning; protect the environment; harness information technology; and enhance national security
* Advance U.S. productivity and industrial competitiveness through long-term scientific and engineering research in computing, information, and communications technologies

3 HPCC Participants and Components

The HPCC program at present involves 12 Federal agencies, each with its specific responsibilities.
In alphabetical order, the participating agencies are: Agency for Health Care Policy and Research (AHCPR), Defense Advanced Research Projects Agency (DARPA), Department of Energy (DOE), Department of Education (ED), Environmental Protection Agency (EPA), National Aeronautics and Space Administration (NASA), National Institutes of Health (NIH), National Institute of Standards and Technology (NIST), National Oceanic and Atmospheric Administration (NOAA), National Security Agency (NSA), National Science Foundation (NSF), and Department of Veterans Affairs (VA). The activities sponsored by these agencies have broad participation by universities as well as the industry. The program activities of the participating organizations are coordinated by the National Coordination Office for Computing, Information, and Communications (NCO), which also serves as the liaison to the US Congress, state and local governments, foreign governments, universities, industry, and the public. The NCO disseminates information about HPCC program activities and accomplishments in the form of announcements, technical reports, and the annual reports that are popularly known as "blue books" [6-13]. The NCO also maintains the web site http://www.ccic.gov to provide up-to-date, online documentation about the HPCC program, as well as links to the HPCC-related web pages of all participating organizations. The program currently has five components: 1) High End Computing and Computation, 2) Large Scale Networking, 3) High Confidence Systems, 4) Human Centered Systems, and 5) Education, Training, and Human Resources. Together, these components are meant to foster, among other things, scientific research, technological development, industrial and commercial applications, growth in education and human resources, and enhanced public access to information.
In addition to these components, there is a Federal Information Services and Applications Council to oversee the application of CIC-developed technologies for federal information systems, and to disseminate information about HPCC research to other Federal agencies not formally participating in the program. The goals of the HPCC components are as follows (see the "Blue Book" [13] for an official description):

1. High End Computing and Computation: To assure US leadership in computing through investment in leading-edge hardware, software, and algorithmic innovations. Some representative research directions are: computing devices and storage technologies for high-end computing systems; advanced computing architectures; advanced software systems, algorithms, and software for modeling and simulation. This component also supports investigation of ideas such as optical, quantum, and biomolecular computing that are quite speculative at present, but may lead to feasible computing technologies in the future, and may radically change the nature of computing.

2. Large Scale Networking: To assure US leadership in high-performance communications. This component seeks to improve the state of the art in communications by investing in research on networking components, systems, services, and management. The supported research directions include: advanced technologies that enable wireless, optical, mobile, and wireline communications; large-scale network engineering; system software and program development environments for network-centric computing; and software technology for distributed applications, such as electronic commerce, digital libraries, and health care delivery.

3. High Confidence Systems: To develop technologies that provide users with high levels of security, protection of privacy and data, reliability, and restorability of information services.
The supported research directions include: system reliability issues, such as network management under overload, component failure, and intrusion; survival of threatened systems by adaptation and reconfiguration; technologies for security and privacy assurance, such as access control, authentication, and encryption.

4. Human Centered Systems: To make computing and networking more accessible and useful in the workplace, school, and home. The technologies enabling this include: knowledge repositories and servers; collaboratories that provide access to information repositories and that facilitate sharing knowledge and control of instruments at remote labs; systems that allow multi-modal human-system interactions; and virtual reality environments and their applications in science, industry, health care, and education.

5. Education, Training, and Human Resources: To support HPCC research that enables modern education and training technologies. All levels and modes of education are targeted, including elementary, secondary, vocational, technical, undergraduate, graduate, and career-enhancing education. The education and training also includes the production of researchers in HPCC technologies and applications, and a skilled workforce able to cope with the demands of the information age. The supported research directions include information-based learning tools, technologies that support lifelong and distance learning for people in remote locations, and curriculum development.

4 HPCC at NSF

As mentioned above, NSF is one of the 12 Federal agencies participating in the HPCC program. The total HPCC budget and the NSF share in it since the inception of the program are shown in Table 1. Thus, during this period, NSF's share has ranged approximately between one-fourth and one-third of the total Federal HPCC spending. The HPCC amount has remained approximately 10% of the NSF's own total budget during the same period.
Table 1. HPCC Investment: Total budget and NSF's share (in $M)

Fiscal Year         1992  1993  1994  1995  1996  1997  1998  1999
Total HPCC budget    655   803   938  1039  1043  1009  1070   830
NSF's HPCC share     201   262   267   297   291   280   284   297

The NSF objectives for its HPCC effort are:

* Enable U.S. to uphold a position of world leadership in the science and engineering of computing, information and communications.
* Promote understanding of the principles and uses of advanced computing, communications, and information systems in service to science and engineering, to education, and to society.
* Contribute to universal, transparent, and affordable participation in an information-based society.

Thus NSF's HPCC-related work spans all of the five HPCC program components. HPCC research penetrates to varying depth nearly all the scientific and engineering disciplines at NSF. But most of this research is concentrated in the NSF's Directorate of Computer and Information Science and Engineering (CISE). This directorate is organized into 5 divisions, each of which is, in turn, divided into 2-8 programs. The work of the CISE divisions can be, respectively, characterized as: fundamental computation and communications research; information, knowledge, intelligent systems, and robotics research; experimental systems research and integrative activities; advanced computational infrastructure research; and advanced networking infrastructure research. While the phrase "high performance" may not be explicitly present in the description of many programs, the actual research they undertake is very much focused on HPCC. Indeed, the CISE budget is almost entirely attributed to HPCC.
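The "between one-fourth and one-third" characterization of Table 1 can be re-derived directly from the published figures (all values below are copied from the table):

```python
# Re-derive NSF's share of the total Federal HPCC budget from Table 1 ($M).
total = {1992: 655, 1993: 803, 1994: 938, 1995: 1039,
         1996: 1043, 1997: 1009, 1998: 1070, 1999: 830}
nsf = {1992: 201, 1993: 262, 1994: 267, 1995: 297,
       1996: 291, 1997: 280, 1998: 284, 1999: 297}

shares = {year: nsf[year] / total[year] for year in total}
for year, share in shares.items():
    print(year, f"{100 * share:.1f}%")
```

The shares fall between roughly 26% and 36%; only the 1999 figure edges slightly above one-third, consistent with the text's approximate range.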
Representative ongoing research topics include: scalable parallel architectures; component technologies for HPCC; simulation, analysis, design and test tools needed for HPCC circuit and system design; parallel software systems and tools, such as compilers, debuggers, performance monitors, and program development environments; heterogeneous computing environments; distributed operating systems; tools for building distributed applications; network management, authentication, security, and reliability; intelligent manufacturing; intelligent learning systems; problem solving environments; algorithms and software for computational science and engineering; integration of research and learning technologies; very large data and knowledge bases; and visualization of very large data sets.

5 Large HPCC Projects

The HPCC program has led to several innovations in NSF's mechanisms for supporting research and human resources development. The traditional manner of funding individual researchers or small research teams continues to be applied for HPCC work too. But to meet special HPCC needs, NSF has initiated a number of totally new programs, such as supercomputing centers, partnerships for advanced computational infrastructures, science and technology centers, and various "challenges". Also launched were special initiatives such as digital libraries, knowledge and distributed intelligence, and the next generation internet. These projects are much larger than the traditional ones in the scope of research, number of participating investigators, research duration, and award size.

5.1 Science and Technology Centers (STCs)

The purpose, structure, and HPCC contributions of STCs were described in [1]. So here we mainly state the developments that have taken place since. STCs are intended to stimulate "integrative conduct of research, education, and knowledge transfer."
They provide an environment for interaction among researchers in various disciplines and across institutional boundaries. They also provide the structure to identify important complex scientific problems beyond disciplinary and institutional limits and scales, and the critical mass and funding stability and duration needed for their successful solution. They carry out fundamental research, facilitate research applications, promote technology transfer through industrial affiliations, disseminate knowledge via visitorships, conferences and workshops, educate and train people for scientific professions, and introduce minorities and underrepresented groups to science and technology through outreach activities. STCs are large research projects, each of which typically involves 50+ principal investigators from 10+ academic institutions, and also has links to the industry. The participants work together on interdisciplinary research unified by a single theme, such as parallel computing or computer graphics. The projects are awarded initially for 5 years, are renewable for another 5 years, and are finally given an extra year for orderly phaseout. There is no further renewal, so a center has to shut down definitely in at most 11 years. Of course, the investigators are free to regroup and compete again in the program in the future if it continues. As a result of the competitions that took place in 1989 and 1991, 25 STCs were established by NSF. All of them have entered their final year now. The following four of those STCs were supported by the HPCC program: the Center for Research in Parallel Computation (CRPC) at Rice University; the Center for Computer Graphics and Scientific Visualization at the University of Utah; the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) at Rutgers University; and the Center for Cognitive Science at the University of Pennsylvania.
These STCs have contributed numerous theoretical results, algorithms, mathematical and computer science techniques, libraries, software tools, languages, and environments. They have also made significant advances in various scientific and engineering application areas. Their output has been impressive in quality, quantity, and impact. In 1995, NSF undertook a thorough evaluation of the STC program. For one study [14], Abt Associates, a private business and policy consulting firm, was commissioned to collect various kinds of information about the STCs, and the National Academy of Sciences was asked to examine that data and evaluate the program. Another study [15] was conducted by the National Academy of Public Administration. Both studies concluded that the STC program represented an excellent return on federal research dollar investment, and recommended that the program be continued further. The studies also endorsed most of the past guidelines regarding the funding level, award duration, emphasis on education and knowledge transfer (in addition to research), review and evaluation criteria, and management structure. Based on these findings, NSF has decided to continue the STC program. A new round of proposal solicitations took place in 1998. The submitted proposals have been evaluated, and the awards are expected to be announced soon (as of March 1999).

5.2 Partnerships for Advanced Computational Infrastructures (PACIs)

The precursor to PACIs was a program called Supercomputing Centers (SCs) that was established by NSF in 1985, even before the start of the HPCC initiative. But the SC program greatly contributed to the momentum behind HPCC, and, since its launch, became a significant part of the initiative.
For a 10-year duration, the program funded four SCs: Cornell Theory Center, Cornell University; National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign; Pittsburgh Supercomputing Center, University of Pittsburgh; and San Diego Supercomputer Center, University of California-San Diego. Several of their accomplishments and HPCC contributions have been reported in [1]. A Task Force to evaluate the effectiveness of the SC program was commissioned by NSF in 1995. This resulted in a document which is popularly known as the "Hayes Report" [16]. The study considered the alternatives of renewing the SCs or holding a new competition, and recommended the latter. For a more effective national computing infrastructure development, it also recommended funding fewer but larger alliances of research and experimental facilities and national and regional high-performance computing centers. Based on these findings, NSF instituted the PACI program in 1996, as the successor to the SC program. The aim of the PACIs is to help maintain US world leadership in computational science and engineering by providing access nationwide to advanced computational resources, promoting early use of experimental and emerging HPCC technologies, creating HPCC software systems and tools, and training a high quality, HPCC-capable workforce. After holding a competition, NSF made two PACI awards in 1997. These are the National Computational Science Alliance (Alliance) led by the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, and the National Partnership for Advanced Computational Infrastructure (NPACI) led by the San Diego Supercomputer Center at the University of California at San Diego. Each consists of more than 60 partner institutions, including academic and government research labs, national, state-level and local computing centers, and business and industrial organizations.
The leading sites, which maintain a variety of high-performance computer systems, and the partners, which maintain smaller configurations of similar systems, jointly constitute a metacomputing environment connected via high-speed networks. The partners contribute to the infrastructure by developing in-house, using, and testing the necessary software, tools, environments, applications, algorithms, and libraries, thereby contributing to the further growth of a "national grid" of networked high-performance computers. The initial mission of the SCs was to satisfy the supercomputing needs of US computational scientists and engineers. The major role of the PACIs continues to be to provide supercomputing access to the research community in all branches of science and engineering. But their expanded mission puts a heavy emphasis on education and training at all levels.

5.3 Next Generation Internet (NGI)

The NGI initiative, a multi-agency Federal R&D program that began in October 1997, is the main focus of Large Scale Networking (LSN). It represents consolidation and refinement of ideas behind the vision of a National Information Infrastructure. This infrastructure is the subject of various studies, most importantly [17,18]. The NGI initiative supports foundational work to lead to much more powerful and versatile networks than the present-day Internet. To advance this work, the initiative fosters partnerships among universities, industry and the government. The participating federal government agencies include: DARPA, DOE, NASA, NIH, NIST and NSF. The NGI goals are:

1. Promote research, development, and experimentation in networking technologies.
2. Deploy testbeds for systems-scale testing of technologies and services.
3. Develop "revolutionary" applications that utilize the advancements in network technologies and exercise the testbeds.
The aim of the advancement stipulated in Goal 1 is to dramatically improve the performance of networks in reliability, security, quality of service/differentiation of service, and network management. Two testbeds are planned for Goal 2. The first testbed is required to connect at least 100 sites and deliver speeds that are at least 100 times faster end-to-end than the present-day Internet. The second testbed is required to connect about 10 sites with end-to-end performance faster than the present Internet by at least a factor of 1000. The "revolutionary" applications called for in Goal 3 are to range over enabling application technologies as well as disciplinary applications. Suggested examples of the former include collaboration technologies, digital libraries, distributed computing, virtual reality, and remote operation and simulation. Suggested application areas for the latter include basic science, education, health care, manufacturing, electronic commerce, and government information services. The NGI work in progress was showcased at the Supercomputing 98 conference in a special session called Netamorphosis. The "Netamorphosis" demonstrations consisted of 17 significant NGI applications, ranging over visualization, scene analysis, simulation, manufacturing, remote operation, etc. For example, a demonstration entitled "Real-Time Functional MRI: Watching the Brain in Action" showed how one could remotely view brain activity while a patient was performing cognitive or sensory-motor tasks. The system could process functional MRI data in real time, though the data acquisition, main computations, and visualization all took place at different sites connected by advanced networks.
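The two Goal 2 testbed targets can be made concrete under an assumed baseline. The site counts and speedup factors are from the text; the 1 Mbit/s end-to-end rate for the late-1990s Internet is purely an illustrative assumption, not a figure from the source.

```python
# Sketch of the two NGI testbed targets.  Site counts and speedup factors
# come from the text; BASELINE_MBITS is an illustrative assumption only.

BASELINE_MBITS = 1.0  # assumed typical end-to-end rate, for illustration

testbeds = [
    ("testbed 1", 100, 100),   # >= 100 sites, >= 100x end-to-end speedup
    ("testbed 2", 10, 1000),   # ~ 10 sites, >= 1000x end-to-end speedup
]

for name, sites, factor in testbeds:
    print(f"{name}: {sites} sites at >= {BASELINE_MBITS * factor:.0f} Mbit/s")
```

Under that assumption the first testbed targets roughly 100 Mbit/s end-to-end across many sites, while the second targets gigabit-class end-to-end rates among a handful of sites.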
Another demonstration, entitled "Distributed Image Spreadsheet: Earth Data from Satellite to Desktop," showed how scientists could analyze, process, and visualize massive amounts of geologic, atmospheric, or oceanographic data transmitted to their workstations from Earth Observing System satellites.

5.4 Digital Libraries Initiative (DLI)

The original DLI, now referred to as DLI Phase 1, started as a joint venture of NSF, DARPA, and NASA. Now the initiative is in Phase 2, and includes as sponsors those agencies as well as the National Library of Medicine, the Library of Congress, and the National Endowment for the Humanities. The initiative seeks to advance the technologies needed to offer information about essentially anything, to anyone, located anywhere around the nation and the world. A digital library is intended to be a very large-scale storehouse of knowledge in multimedia form that is accessible over the net. The construction and operation of digital libraries requires developing technologies for acquiring information, organizing this information in distributed multimedia knowledge bases, extracting information based on requested criteria, and delivering it in the form appropriate for the user. Thus, the DLI promotes research on information collection, analysis, archiving, search, filtering, retrieval, semantic conversion, and communication. Phase 1 is supporting 6 large consortia consisting of academic and industrial partners.
Their main project themes and their lead institutions are: geographic information systems, maps and pictures, content-based retrieval (University of California-Santa Barbara); intelligent internet search, semantic retrieval, scientific journal publishing alternatives (University of Illinois); media integration and access, new models of "documents," natural language processing (University of California-Berkeley); digital video libraries, speech, image, and natural language technology integration (Carnegie Mellon University); intelligent agent architecture, resource federation, AI service market economies, educational impact (University of Michigan); uniform access, distributed object architectures, interface for distributed information retrieval (Stanford University). Phase 1 of the initiative was mainly concerned with learning, prototyping, and experimenting in the small. Phase 2 is expected to put this experience into actually building larger, operational, and usable systems and testbeds. There is emphasis on larger contents and collections, interoperability and technology integration, and expansion of domains and user communities for digital libraries. The supported activities are expected to range through the full spectrum of fundamental research, content and collections development, domain applications, testbeds, operational environments, and applications for developing educational resources and preserving the national cultural heritage.

5.5 Knowledge and Distributed Intelligence (KDI)

KDI is a new initiative that NSF established in 1998. The HPCC research has traditionally been concentrated in the NSF's Computer and Information Science and Engineering directorate. The KDI initiative stems from the realization that the advances in computing, communications, and information technologies provide unprecedented possibilities for accelerating progress in all spheres of human thought and action.
KDI stresses knowledge as opposed to information, but realizes, of course, that intelligent gathering of information is a prerequisite to creating knowledge. Thus, a goal of KDI is to improve the human ability to discover, collect, represent, store, apply, and transmit information. This is to lead to improvements in the ways to create knowledge and in the actual acquisition of new knowledge. The KDI research is classified into three components:
1. Knowledge Networking (KN)
2. Learning and Intelligent Systems (LIS)
3. New Computational Challenges (NCC)
The KN component aims at building an open and context-rich environment for online interactions among individuals as well as groups. For such an environment to arise, advances have to be made in the techniques for collecting and organizing information and discovering knowledge from it. The KN-enabled vast scale of information acquisition and the power to uncover knowledge buried in collected data have grave implications for privacy and other human interest matters. Hence, KN is also concerned with research on social, societal, ethical, and other aspects of networked information. The focus of the LIS component of KDI is to better understand the process of learning itself, as it occurs in humans, animals, and artificial systems. This understanding is to be used for improving our own learning skills, developing better teaching methods, and creating intelligent artifacts. The NCC component is in the spirit of NSF's "Challenges" programs, such as Grand Challenges, National Challenges, and Multidisciplinary Challenges. In [1], these programs were described, and their impact and some of their accomplishments were stated. The NCC component continues to seek solutions of very complex scientific and engineering problems, ones that are computationally expensive, data intensive, and require multidisciplinary team approaches.
The Challenges research and the advances in high-performance computing and communications systems have a mutually beneficial push-pull relationship: the former stress-tests the latter, and the latter helps the former grow in scale and scope. NCC research aims to improve our ability to model and simulate complex systems such as the oceans or the brain. In adopting the Challenges research, the KDI initiative sees it as another knowledge creation activity. In 1998, NSF made 40 awards for KDI research for a total funding of $51.5M. The awards span a broad range of topics, vast scopes of research, and investigators representing diverse disciplines and institutions. The 1999 KDI competition is in process.

6 HPCC Evaluation

General directions as well as clear objectives were defined for the HPCC program from the very beginning. Thus, some evaluation is built into the program. Some objectives naturally lead to quantifiable measures of progress, such as computation speeds in teraflops, communication bandwidth in gigabits, network extent in number of connected nodes, etc. On the other hand, there are qualitative aspects of progress, such as scientific breakthroughs, innovative industrial practices, societal penetration of knowledge and technology, quality of work force trained, etc. The evaluation of the STC and SC programs has already been mentioned. Other parts of the NSF HPCC program have also produced impressive results. For the effectiveness of the HPCC program as a whole, a number of evaluation studies have been done. The "Branscomb Report" [19] is devoted to studying the means for making the program more productive. A thorough assessment of the effectiveness of the program is undertaken in the "Brooks-Sutherland Report" [20]. The purpose of a more recent study [21] is to suggest the most important future HPCC applications, especially the ones with the highest national, societal, and economic impact.
There is consensus that the HPCC program has been successful on most fronts. Not only have the year-by-year milestones for quantifiable progress been met, but the activities undertaken by the program have led to several significant, unanticipated beneficial developments. The launch of important new HPCC-inspired initiatives witnesses the program's strong momentum. But as the next section shows, there is a perception that the HPCC program is underfunded and that the progress resulting from it is going to decelerate unless newer and larger investments are added to it.

7 President's Information Technology Advisory Committee (PITAC)

PITAC was established in February 1997 to provide advice to the Administration on all areas of computing, communications, and information technology. This committee at present consists of 26 research leaders representing academia and industry. It issued an interim report in August 1998 and a final one in February 1999 [22], after a series of meetings and broad consultations with the research community. This report examines the impact of R&D in Information Technology (IT) on US business and science, and makes a number of recommendations for further work. The PITAC report observes that the past IT R&D through HPCC and other programs is a significant factor in the nation's world leadership position in science, industry, business, and the general well-being of the citizenry. IT advances are responsible for a third of the US economic growth since 1992, and have created millions of high-paying new jobs. The computational approach to science, in conjunction with the HPCC algorithms, software, and infrastructure, has helped US scientists make new discoveries. The competitiveness of the US economy owes much to the efficiencies resulting from IT in engineering design, manufacturing, business, and commerce. If IT is the engine that is driving the economy, then obviously it needs to be kept running by further investment.
The PITAC report argues that the IT industry is spending the bulk of its own resources, financial and human, on near-term development of new products for an exploding market. The IT industry can contribute only a small fraction of the long-term R&D investment needed. Moreover, the industry does not see any immediate benefits of the scientific and social components of IT, and therefore has no interest in pursuing them. After estimating the total US R&D expenditure on IT, and the Federal and industrial shares of it, the PITAC conclusion is that the Federal support of IT R&D is grossly inadequate. Moreover, it is focused too much on near-term and applied research. PITAC has recommended increments of about $1.3 billion per year for the next 5 years. PITAC has also identified the following four high-priority areas as the main targets of increased investment.
Software: Software production methodologies have to be dramatically improved, by fundamental research, to deliver robust, usable, manageable, cost-effective software.
Scalable Information Infrastructure: With the ever increasing size, complexity, and sheer use of networks, research is needed on how to build networks that can be easily extended yet remain reliable, secure, and easy to use.
High-End Computing: Scientific research and engineering design are becoming more and more computational. The increasing complexity of problems demands ever faster computing and communications. Thus, sustained research is needed on high-performance architectures, networks, devices, and systems.
Socioeconomic Impact: Research is needed to exploit the IT advances to serve society and to spread its benefits to all citizens. The accompanying social, societal, ethical, and legal issues have to be studied, and ways have to be sought for mitigating any potential negative impact.
Based on the PITAC recommendations, a new Federal interagency initiative called Information Technology for the Twenty-first Century (IT2) is being developed as a possible successor to the HPCC program.

8 Conclusion

Scientific and engineering work is becoming more computational, because, increasingly, computation is replacing physical experimentation and the construction and testing of prototypes. (Indeed, the US Accelerated Strategic Computing Initiative plans to depend totally on computational simulation in its weapons research program for those weapons whose physical testing is banned by international treaties.) Several recent scientific discoveries have been possible because of computation. The HPCC program has played a key role in the rise of computational science and engineering. In [1], it was observed that collaboration and team work emerged as an important modality of HPCC research. In particular, the HPCC programs have emphasized 1) multi-disciplinary, multi-investigator, multi-institution teams, 2) partnerships among academia, business, and industry, and 3) cooperative, interagency sponsorship of research. In recent years, the collaboration has increased in intensity and scale. The transition from SCs to PACIs is a good example. The previous Challenge projects tended to be computation-intensive. In a number of NCC projects, the data-intensive aspect dominates the computation-intensive one. Because of this situation, data mining has emerged as a key solution strategy for many Challenge-scale problems. In practice, the HPCC program has so far been focused on applications and infrastructure development. Partly this is because most of the participating agencies in the HPCC program have special missions, and have rightly emphasized the fulfillment of their missions rather than basic research.
The development of high-performance computing infrastructure has also served some critical research needs. But there is need now to bolster fundamental research in order to stimulate further progress towards the original HPCC goals. The PITAC report urges this.

References

1. Abdali S.K.: High Performance Computing Research at NSF. In G. Cooperman, G. Michler and H. Vinck (Eds.), Proc. Workshop on High Performance Computation and Gigabit Local Area Networks, Lect. Notes in Control and Information Sci. #226, Springer-Verlag, Berlin, 1997.
2. A National Computing Initiative: The Agenda for Leadership, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1987.
3. Toward a National Research Network, National Academy Press, Washington, D.C., 1988.
4. Supercomputers: Directions in Technology and Applications, National Academy Press, Washington, D.C., 1989.
5. Keeping the U.S. Computer Industry Competitive: Defining the Agenda, National Academy Press, Washington, D.C., 1990.
6. Grand Challenges: High Performance Computing and Communications ("FY 1992 Blue Book"), Federal Coordinating Council for Science, Engineering, and Technology, c/o National Science Foundation, Washington, D.C., 1991.
7. Grand Challenges 1993: High Performance Computing and Communications ("FY 1993 Blue Book"), Federal Coordinating Council for Science, Engineering, and Technology, c/o National Science Foundation, Washington, D.C., 1992.
8. High Performance Computing and Communications: Toward a National Information Infrastructure ("FY 1994 Blue Book"), Office of Science and Technology Policy, Washington, D.C., 1993.
9. High Performance Computing and Communications: Technology for a National Information Infrastructure ("FY 1995 Blue Book"), National Science and Technology Council, Washington, D.C., 1994.
10. High Performance Computing and Communications: Foundation for America's Information Future ("FY 1996 Blue Book"), National Science and Technology Council, Washington, D.C., 1995.
11. High Performance Computing and Communications: Advancing the Frontiers of Information Technology ("FY 1997 Blue Book"), National Science and Technology Council, Washington, D.C., 1996.
12. Technologies for the 21st Century ("FY 1998 Blue Book"), National Science and Technology Council, Washington, D.C., 1997.
13. Networked Computing for the 21st Century ("FY 1999 Blue Book"), National Science and Technology Council, Arlington, VA, 1998.
14. National Science Foundation's Science and Technology Centers: Building an Interdisciplinary Research Program, National Academy of Public Administration, Washington, D.C., 1995.
15. An Assessment of the National Science Foundation's Science and Technology Centers Program, National Research Council, National Academy Press, Washington, D.C., 1996.
16. Report of the Task Force on the Future of the NSF Supercomputing Centers ("Hayes Report"), Pub. NSF 96-46, National Science Foundation, Arlington, VA.
17. The Unpredictable Certainty: Information Infrastructure through 2000, National Research Council, National Academy Press, Washington, D.C., 1996.
18. More Than Screen Deep: Toward Every-Citizen Interfaces to the Nation's Information Infrastructure ("Biermann Report"), National Research Council, National Academy Press, Washington, D.C., 1997.
19. From Desktop to Teraflop: Exploiting the U.S. Lead in High Performance Computing ("Branscomb Report"), Pub. NSB 93-205, National Science Foundation, Washington, D.C., August 1993.
20. Evolving the High Performance Computing and Communications Initiative to Support the Nation's Information Infrastructure ("Brooks-Sutherland Report"), National Research Council, National Academy Press, Washington, D.C., 1995.
21. Computing and Communications in the Extreme: Research for Crisis Management and Other Applications, National Research Council, National Academy Press, Washington, D.C., 1996.
22. Information Technology Research: Investing in Our Future, President's Information Technology Advisory Committee Report to the President, National Coordination Office, Arlington, VA, 1999.

SRP: a Scalable Resource Reservation Protocol for the Internet

Werner Almesberger (1), Tiziana Ferrari (2), and Jean-Yves Le Boudec (1)
(1) EPFL ICA, INN (Ecublens), CH-1015 Lausanne, Switzerland
(2) DEIS, University of Bologna, viale Risorgimento 2, I-40136 Bologna, Italy; and Italian National Inst. for Nuclear Physics/CNAF, viale Berti Pichat 6/2, I-40127 Bologna, Italy

Abstract. The Scalable Reservation Protocol (SRP) provides a light-weight reservation mechanism for adaptive multimedia applications. Our main focus is on good scalability to very large numbers of individual flows. End systems (i.e. senders and destinations) actively participate in maintaining reservations, but routers can still control their conformance. Routers aggregate flows and monitor the aggregate to estimate the local resources needed to support present and new reservations. There is neither explicit signaling of flow parameters, nor do routers maintain per-flow state.

1 Introduction

Many adaptive multimedia applications [1] require a well-defined fraction of their traffic to reach the destination and to do so in a timely way. We call this fraction the minimum rate these applications need in order to operate properly. SRP aims to allow such applications to make a dependable reservation of their minimum rate.
The sender can expect that, as long as it adheres to the agreed-upon profile, no reserved packets will be lost due to congestion. Furthermore, forwarding of reserved packets will have priority over best-effort traffic. Traditional resource reservation architectures that have been proposed for integrated service networks (RSVP [2], ST-2 [3], Tenet [4], ATM [5,6], etc.) all have in common that intermediate systems (routers or switches) need to store per-flow state information. The more recently designed Differentiated Services architecture [7] offers improved scalability by aggregating flows and by maintaining state information only for such aggregates. SRP extends upon simple aggregation by providing a means for reserving network resources in routers along the paths flows take. Recently, hybrid approaches combining RSVP and Differentiated Services have been proposed (e.g. [8]) to overcome the scalability problems of RSVP. Unlike SRP, which runs end-to-end, they require a mapping of the INTSERV services onto the underlying Differentiated Services network, and a means to tunnel RSVP signaling information through network regions where QoS is provided using Differentiated Services.

Reservation mechanism. In short, our reservation model works as follows. A source that wishes to make a reservation starts by sending data packets marked as request packets to the destination. Packets marked as request are subject to packet admission control by routers, based on the following principle. Routers monitor the aggregate flows of reserved packets and maintain a running estimate of the level of resources required to serve them with a good quality of service.
The resources are bandwidth and buffer on outgoing links, plus any internal resources as required by the router architecture. Quality of service is loss ratio and delay, and is defined statically. When receiving a request packet, a router determines whether hypothetically adding this packet to the flow of reserved packets would yield an acceptable value of the estimator. If so, the request packet is accepted and forwarded towards the destination, while still keeping the status of a request packet; the router must also update the estimator as if the packet had been received as reserved. In the opposite case, the request packet is degraded and forwarded towards the destination, and the estimator is not updated. Degrading a request packet means assigning it a lower traffic class, such as best-effort. A packet sent as request will reach the destination as request only if all routers along the path have accepted the packet as request. Note that the choice of an estimation method is local to a router, and actual estimators may differ in their principle of operation. The destination periodically sends feedback to the source indicating the rate at which request and reserved packets have been received. This feedback does not receive any special treatment in the network (except possibly for policing, see below). Upon reception of the feedback, the source can send packets marked as reserved according to a profile derived from the rate indicated in the feedback. If necessary, the source may continue to send more request packets in an attempt to increase the rate that will be indicated in subsequent feedback messages.
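The per-packet admission decision just described can be sketched as follows. This is an illustrative outline only: the estimator interface (`would_accept`, `add`) and the simple rate-capacity check are assumptions for the sketch, while the paper's actual estimation algorithm is the subject of section 3.

```python
# Illustrative sketch of SRP packet admission control at a router.
# The estimator interface is hypothetical; SRP leaves the estimation
# method local to each router.

REQUEST, RESERVED, BEST_EFFORT = "request", "reserved", "best-effort"

class Estimator:
    """Toy estimator: tracks an estimated reserved rate against a capacity."""
    def __init__(self, capacity_bps):
        self.capacity_bps = capacity_bps
        self.estimated_bps = 0.0

    def would_accept(self, packet_bps):
        # Hypothetically add this packet to the reserved aggregate.
        return self.estimated_bps + packet_bps <= self.capacity_bps

    def add(self, packet_bps):
        self.estimated_bps += packet_bps

def admit(estimator, packet_type, packet_bps):
    """Return the (possibly degraded) type used to forward the packet."""
    if packet_type == RESERVED:
        estimator.add(packet_bps)      # reserved traffic updates the estimate
        return RESERVED
    if packet_type == REQUEST:
        if estimator.would_accept(packet_bps):
            estimator.add(packet_bps)  # count it as if it had been reserved
            return REQUEST             # forwarded, still marked as request
        return BEST_EFFORT             # degraded; estimator not updated
    return BEST_EFFORT
```

With a capacity of 1000 bit/s, for example, a router following this sketch accepts request traffic until the estimate reaches capacity and degrades further requests to best-effort, exactly the accept-or-degrade behavior described above.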
Thus, in essence, a router accepting to forward a request packet as request allows the source to send more reserved packets in the future; it is thus a form of implicit reservation.

Aggregation. Routers aggregate flows on output ports, and possibly at any contention point as required by their internal architecture. They use estimator algorithms for each aggregated flow to determine their current reservation levels and to predict the impact of accepting request packets. The exact definition of what constitutes an aggregated flow is local to a router. Likewise, senders and destinations treat all flows between each pair of them as a single aggregate and use estimator algorithms for characterizing them. The estimator algorithms in routers and hosts do not need to be the same. In fact, we expect hosts to implement a fairly simple algorithm, while estimator algorithms in routers may evolve independently over time.

Fairness and security. Denial-of-service conditions may arise if flows can reserve disproportional amounts of resources or if flows can exceed their reservations. We presently consider fairness in accepting reservations a local policy issue (much like billing) which may be addressed at a future time. Sources violating the agreed-upon reservations are a real threat and need to be policed. A scalable policing mechanism to allow routers to identify non-conformant flows based on certain heuristics is the subject of ongoing research. Such a mechanism can be combined with more traditional approaches, e.g. policing of individual flows at locations where scalability is less important, such as network edges. The rest of this paper is organized as follows.
Section 2 provides a more detailed protocol overview. Section 3 describes a simple algorithm for the implementation of the traffic estimator. Finally, protocol operation is illustrated with some simulation results in section 4, and the paper concludes with section 5.

2 Architecture overview

The proposed architecture uses two protocols to manage reservations: a reservation protocol to establish and maintain them, and a feedback protocol to inform the sender about the reservation status.

Fig. 1. Overview of the components in SRP.

Figure 1 illustrates the operation of the two protocols:
• Data packets with reservation information are sent from the sender to the receiver. The reservation information consists of a packet type which can take three values, one of them being ordinary best-effort (section 2.2). It is processed by routers, and may be modified by routers. Routers may also discard packets (section 2.1).
• The receiver sends feedback information back to the sender. Routers only forward this information; they don't need to process it (section 2.3).
Routers monitor the reserved traffic which is effectively present and adjust their global state information accordingly. Sections 2.1 to 2.3 illustrate the reservation and feedback protocols.

2.1 Reservation protocol

The reservation protocol is used in the direction from the sender to the receiver. It is implemented by the sender, the receiver, and the routers between them. As mentioned earlier, the reservation information is a packet type which may take three values:
Request This packet is part of a flow which is trying to gain reserved status.
Routers may accept, degrade or reject such packets. When routers accept some request packets, they commit to accept in the future a flow of reserved packets at the same rate. The exact definition of the rate is part of the estimator module.
Reserved This label identifies packets which are inside the source's profile and are allowed to make use of the reservation previously established by request packets. Given a correct estimation, routers should never discard reserved packets because of resource shortage.
Best effort No reservation is attempted by this packet.
Packet types are initially assigned by the sender, as shown in figure 2. A traffic source (i.e. the application) specifies for each packet if that packet needs a reservation. If no reservation is necessary, the packet is simply sent as best-effort. If a reservation is needed, the protocol entity checks if an already established reservation at the source covers the current packet. If so, the packet is sent as reserved; otherwise an additional reservation is requested by sending the packet as request.

Fig. 2. Initial packet type assignment by sender.

Each router performs two processing steps (see also figure 3). First, for each request and reserved packet the estimator updates its current estimate of the resources used by the aggregate flows and decides whether to accept the packet (packet admission control). Then, packets are processed by various schedulers and queue managers inside the router.
• When a reserved packet is received, the estimator updates the resource estimation. The packet is automatically forwarded unchanged to the scheduler, where it will have priority over best-effort traffic and normally is not discarded.
• When a request packet is received, the estimator checks whether accepting the packet will not exceed the available resources. If the packet can be accepted, its request label is not modified. If the packet cannot be accepted, it is degraded to best-effort.
• If a scheduler or queue manager cannot accept a reserved or request packet, the packet is either discarded or downgraded to best-effort.

Fig. 3. Packet processing by routers.

Note that the reservation protocol may "tunnel" through routers that don't implement reservations. This allows the use of unmodified equipment in parts of the network which are dimensioned such that congestion is not a problem.

2.2 Packet type encoding

RFC 2474 [9] defines the use of an octet in the IPv4 and IPv6 header for Differentiated Services (DS). This field contains the DS Code Point (DSCP), which determines how the respective packet is to be treated by routers (Per-Hop Behaviour, PHB). Routers are allowed to change the content of a packet's DS field (e.g. to select a different PHB). As illustrated in figure 4, SRP packet types can be expressed by introducing two new PHBs (for request and for reserved), and by using the pre-defined DSCP value 0 for best-effort. DSCP values for request and reserved can be allocated locally in each DS domain.
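This encoding can be sketched as follows. The concrete code point values chosen for request and reserved are purely illustrative, since the paper leaves their allocation to each DS domain; only DSCP 0 (the default, best-effort code point) is fixed by RFC 2474.

```python
# Sketch of SRP packet-type encoding in the Differentiated Services
# field (RFC 2474). The DSCP occupies the upper 6 bits of the former
# IPv4 TOS octet. The request/reserved code points below are
# hypothetical local allocations, not values from the paper.

DSCP_BEST_EFFORT = 0b000000   # pre-defined default PHB (RFC 2474)
DSCP_REQUEST     = 0b101001   # locally allocated, illustrative value
DSCP_RESERVED    = 0b101101   # locally allocated, illustrative value

def set_packet_type(ds_octet, dscp):
    """Write a DSCP into the upper 6 bits, preserving the low 2 bits."""
    return (dscp << 2) | (ds_octet & 0b11)

def get_packet_type(ds_octet):
    """Extract the DSCP from the DS field."""
    return ds_octet >> 2

def degrade(ds_octet):
    """Degrading a request packet means relabeling it as best-effort."""
    if get_packet_type(ds_octet) == DSCP_REQUEST:
        return set_packet_type(ds_octet, DSCP_BEST_EFFORT)
    return ds_octet
```

Because the marking lives entirely in the DS field, a router degrading a request packet only rewrites this one octet; the packet is otherwise forwarded unchanged, which is what lets SRP tunnel through routers that do not implement reservations.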
2.3 Feedback protocol

The feedback protocol is used to convey information on the success of reservations and on the network status from the receiver to the sender.

Fig. 4. Packet type encoding using Differentiated Services (IPv4 example).

Unlike the reservation protocol, the feedback protocol does not need to be interpreted by routers, because they can determine the reservation status from the sender's choice of packet types. Feedback information is collected by the receiver and is periodically sent to the sender. The feedback consists of the number of bytes in request and reserved packets that have reached the receiver, and the local time at the receiver at which the feedback message was generated. Receivers collect feedback information independently for each sender, and senders maintain the reservation state independently for each receiver. Note that, if more than one flow to the same destination exists, attribution of reservations is a local decision at the source.

Fig. 5. Feedback message format.

Figure 5 illustrates the content of a feedback message: the time when the message was generated (t), and the number of bytes in request and reserved packets received at the destination (REQ and RSV). All counters wrap back to zero when they overflow. In order to improve tolerance to packet loss, the information sent in the previous feedback message (at time t0) is repeated. Portions of the message are reserved to allow for future extensions.

2.4 Shaping at the sender

The sender decides whether packets are sent as reserved or request based on its own estimate of the reservation it has requested and on the level of reservation along the path that has been confirmed via the feedback protocol. A source always uses the minimum of these two parameters to determine the appropriate output traffic profile. Furthermore, the sender needs to filter out small differences between the actual reservation and the feedback in order to keep reservations from drifting, and it must also ensure that request packets do not interfere with congestion-controlled traffic (e.g. TCP) in an unfair way [10].

2.5 Example

Figure 6 provides the overall picture of the reservation and feedback protocols for two end-systems connected through routers R1 and R2. The initial resource acquisition phase is followed by the generation of request packets after the first feedback message arrives. Dotted arrows correspond to degraded request packets, which passed the admission control test at router R1 but could not be accepted at router R2 because of resource shortage. Degradation of requests is taken into account by the feedback protocol. After receiving the feedback information the source sends reserved packets at an appropriate rate, which is smaller than the one at which request packets were generated.

Fig. 6. Reservation and feedback protocol diagram.

2.6 Multicast

In order to support multicast traffic, we have proposed a design that slightly extends the reservation mechanism described in this section. Refinement of this design is still the subject of ongoing work. A detailed description of the proposed mechanism can be found in [11].
3 Estimation modules

Fig. 7. Use of estimators at senders, routers, and receivers.

We call estimator the algorithm which attempts to calculate the amount of resources that need to be reserved. The estimation measures the number of requests sent by sources and the number of reserved packets which actually make use of the reservation. Estimators are used for several functions:

- Senders use the estimator for an optimistic prediction of the reservation the network will perform for the traffic they emit. This, in conjunction with feedback received from the receiver, is used to decide whether to send request or reserved packets.
- Routers use the estimator for packet-wise admission control and perhaps also to detect anomalies.
- In receivers, the estimator is fed with the received traffic and generates an estimate of the reservation at the last router. This is used to schedule the sending of feedback messages to the source.

Figure 7 shows how the estimator algorithm is used in all network elements.

As described in section 2.1, a sender keeps on sending requests until successful reservation setup is indicated with a feedback packet, i.e. even until after the desired amount of resources has been reserved in the network. It is the feedback returned to the sender which indicates the actual allocation obtained on the path. When the source is feedback-compliant, the routers on the path start releasing a part of the over-estimated reservation already allocated. The feedback that is returned to the sender may also show an increased number of requests.
The sender must not interpret those requests as a direct increase of the reservation. Instead, the sender estimator must correct the feedback information accordingly, which is achieved by computing the minimum of the feedback and of the resource amount requested by the source.

Our architecture is independent of the specific algorithm used to implement the estimator. Sections 3.1 and 3.2 describe two different solutions. The definition and evaluation of algorithms for reservation calculation in hosts and routers is still ongoing work. A detailed analysis of the estimation algorithms and additional improvements can be found in [12].

3.1 Basic estimation algorithm

The basic algorithm we present here is suitable for sources and destinations, and could be used as a rough estimator by routers. This estimator counts the number of requests it receives (and accepts) during a certain observation interval and uses this as an estimate for the bandwidth that will be used in future intervals of the same duration.

In addition to requests for new reservations, the use of existing reservations needs to be measured too. This way, reservations of sources that stop sending or that decrease their sending rate can automatically be removed. For this purpose the use of reservations can simply be measured by counting the number of reserved packets that are received in a certain interval. To compensate for deviations caused by delay variations, spurious packet loss (e.g. in a best-effort part of the network), etc., reservations can be "held" for more than one observation interval.
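The sender-side correction just described amounts to clamping the feedback by the amount the source actually asked for; a minimal sketch (function name and byte units are illustrative assumptions):

```python
def corrected_reservation(feedback_bytes, requested_bytes):
    """Sender-side correction: degraded-then-repeated requests reported in the
    feedback must never inflate the reservation beyond what was requested."""
    return min(feedback_bytes, requested_bytes)
```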
Holding can be accomplished by remembering the observed traffic over several intervals and using the maximum of these values (step 3 of the following algorithm). Given a hold time of h observation intervals, the maximum amount of resources which can be allocated Max, and res and req (the total number of reserved and request bytes received in a given observation interval), the reservation R (in bytes) is computed by a router as follows. Given a packet of n bytes:

    if (packet_type == REQ)
        if (R + req + n < Max) { accept; req = req + n; }   // step 1
        else degrade;
    if (packet_type == RES)
        if (res + n < R) { accept; res = res + n; }         // step 2
        else degrade;

where initially R, res, req = 0. At the end of each observation cycle the following steps are computed:

    for (i = h; i > 1; i--)
        R[i] = R[i-1];
    R[1] = res + req;
    R = max(R[h], R[h-1], ..., R[1]);                       // step 3
    res = req = 0;

The same algorithm can be run by the destination, with the only difference that no admission checks are needed. Examples of the operation of the basic algorithm are shown in section 4.1.

This simple algorithm presents several problems. First of all, the choice of the right value of the observation interval is critical and difficult. Small values make the estimation dependent on bursts of reserved or request packets and cause an overestimation of the resources needed. On the other hand, large intervals make the estimator react slowly to changes in the traffic profile.
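The per-packet checks and the end-of-interval update above can be collected into a small runnable sketch. The class structure is an illustrative assumption; the comparisons mirror steps 1-3 of the listing.

```python
# Runnable sketch of the basic estimator (router variant, steps 1-3 above).
class BasicEstimator:
    def __init__(self, max_bytes, hold=2):
        self.max_bytes = max_bytes      # Max: allocatable resources per interval
        self.hold = hold                # h: hold time in observation intervals
        self.history = [0] * hold       # R[1..h]: recent interval observations
        self.R = 0                      # current reservation estimate (bytes)
        self.res = self.req = 0

    def on_packet(self, ptype, n):
        """Per-packet admission; True means accept, False means degrade."""
        if ptype == "REQ":
            if self.R + self.req + n < self.max_bytes:      # step 1
                self.req += n
                return True
            return False
        if ptype == "RES":
            if self.res + n < self.R:                       # step 2
                self.res += n
                return True
            return False
        return True                     # best-effort packets are not checked

    def end_of_interval(self):
        """Step 3: shift the history and take the maximum as the reservation."""
        self.history = [self.res + self.req] + self.history[:-1]
        self.R = max(self.history)
        self.res = self.req = 0
```

With hold = 2, a source that falls silent keeps its reservation for one extra interval before R decays, which is exactly the tolerance to delay variation and spurious loss described above.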
A further problem is that the strictness of traffic acceptance control is fixed, while adaptivity would be highly desirable in order to make the allocation of new resources stricter as the amount of resources reserved gets closer to the maximum. These problems can be solved by devising an adaptive, enhanced algorithm like the one described in the following section.

3.2 Enhanced estimation algorithm

Instead of using the same estimator in every network component, we can enhance the previous approach so that senders and receivers still run the simple algorithm described above, while routers implement an improved estimator.

Fig. 8. Schematic design of an adaptive estimator.

We describe an example algorithm in detail in [11]. It consists of the principal components illustrated in figure 8: the effective bandwidth used by reserved and accepted request packets is measured and then smoothed by calculating an exponentially weighted average (gamma). This calculation is performed for every single packet. The estimate gamma is multiplied with a correction factor beta in order to correct for systematic errors in the estimation. Packets are added to a virtual queue (i.e. a counter), which is emptied at the estimated rate. If the estimate is too high, the virtual queue shrinks. If the estimate is too low, the virtual queue grows. Based on the size of the virtual queue, beta can be adjusted.
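A minimal sketch of that adaptive loop: per-packet EWMA smoothing, a virtual queue drained at the estimated rate, and a correction factor beta adjusted from the queue size. All constants and the update rule for beta are illustrative assumptions; the algorithm actually evaluated in [11] may differ in detail.

```python
class AdaptiveEstimator:
    """Hypothetical rendering of Fig. 8: EWMA estimate gamma, correction
    factor beta tuned from a virtual queue drained at the estimated rate."""
    def __init__(self, alpha=0.1, step=0.01):
        self.alpha = alpha        # EWMA smoothing weight (assumed constant)
        self.step = step          # how fast beta reacts (assumed constant)
        self.gamma = 0.0          # smoothed measured rate (bytes/s)
        self.beta = 1.0           # multiplicative correction factor
        self.vqueue = 0.0         # virtual queue length (bytes)

    def estimate(self):
        return self.gamma * self.beta

    def on_packet(self, n, dt):
        """Account one reserved or accepted-request packet of n bytes,
        arriving dt seconds after the previous one."""
        self.gamma += self.alpha * (n / dt - self.gamma)     # per-packet EWMA
        # Drain the virtual queue at the estimated rate, then add the packet.
        self.vqueue = max(0.0, self.vqueue - self.estimate() * dt) + n

    def adjust(self):
        """A growing queue means the estimate is too low: raise beta.
        An empty queue means it is too high: lower beta."""
        if self.vqueue > 0.0:
            self.beta += self.step
        else:
            self.beta = max(0.1, self.beta - self.step)
```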
4 Simulation

Section 4.1 provides a theoretical description of the behavior of the reservation mechanism in a very simple example, while section 4.2 shows the simulated behavior of the proposed architecture.

4.1 Reservation example

The network we use to illustrate the operation of the reservation mechanism is shown in figure 9: the sender sends over a delay-less link to the router, which performs the reservation and forwards the traffic over a link with a delay of two time units to the receiver. The receiver periodically returns feedback to the sender.

The sender and the receiver both use the basic estimator algorithm described in section 3.1. The router may - and typically will - use a different algorithm (e.g. the one described in section 3.2).

Fig. 9. Example network configuration.

The bandwidth estimate at the source and the reservation that has been acknowledged in a feedback message from the receiver are measured. In figure 10, they are shown with a thin continuous line and a thick dashed line, respectively. The packets emitted by the source are indicated by arrows on the reservation line. A full arrow head corresponds to request packets, an empty arrow head corresponds to reserved packets. For simplicity, the sender and the receiver use exactly the same observation interval in this example, and the feedback rate is constant. The source sends one packet per time unit.

First, the source can only send requests and the router reserves some resources for each of them.
At point (1), the estimator discovers that it has established a reservation for six packets in four time units, but that the source has only sent four packets in this interval. Therefore, it corrects its estimate and proceeds.

Fig. 10. Basic estimator example.

The first feedback message reaches the sender at point (2). It indicates a reservation level of five packets in four time units (i.e. the estimate at the receiver at the time when the feedback was sent), so the sender can now send reserved packets instead of requests. At point (3), the next observation interval ends and the estimate is corrected once more. Finally, the second feedback arrives at point (4), indicating the final rate of four packets in four time units. The reservation does not change after that.

4.2 Simulation results

The network configuration used for the simulation is shown in figure 11.¹ The grey paths mark flows we examine below.

Fig. 11. Configuration of the simulated network.

There are eight routers (labeled R1...R8) and 24 hosts (labeled 1...24). Each of the hosts 1...12 tries occasionally to send to any of the hosts 13...24. Connection parameters are chosen such that the average number of concurrently active sources sending via the R1-R2 link is approximately fifty. Flows have an on-off behaviour, where the on and off times are randomly chosen from the intervals [5, 15] and [0, 30] seconds, respectively. The bandwidth of a flow remains constant while the flow is active and is chosen randomly from the interval [1, 200] packets per second. All links in the network have a bandwidth of 4000 packets per second and a delay of 15 ms.² We allow up to 90% of the link capacity to be allocated to reserved traffic. The link between R1 and R2 is a bottleneck, which can only handle about 72% of the offered traffic. The delay objective D of each queue is 10 ms. The queue size per link is limited to 75 packets.

Fig. 12. Estimation and actual traffic at R1 towards R2.

Fig. 13. Queue length at R1 on the link towards R2.

Figure 12 shows the R1-R2 link as seen from R1. We show the total offered rate, the estimated reservation (gamma, beta) and the smoothed actual rates of request and reserved packets. Figure 13 shows the behaviour of the real queue. The system succeeds in limiting queuing delays to approximately the delay goal of 10 ms, which corresponds to a queue size of 40 packets. The queue limit of 75 packets is never reached.

Fig. 14. End-to-end reservation from host 4 to host 15.

Fig. 15. End-to-end reservation from host 4 to host 19.

Finally, we examine some end-to-end flows. Figure 14 shows a successful reservation of 84 packets per second from host 4 to 15. The requested rate, the estimation at the destination, and the (smoothed) rate of reserved packets are shown. Similarly, figure 15 shows the same data for a less successful reservation host 4 attempts later to 19, at a time when the offered traffic is almost twice as high as the bandwidth available at the bottleneck.³ During the entire simulated interval of 50 seconds, 3'368 request packets and 164'723 reserved packets were sent from R1 to R2. This is 83% of the bandwidth of that link.

¹ The programs and configuration files used for the simulation are available at http://ircwww.epfl.ch/srp/

² Small random variations were added to link bandwidth and delay to keep the entire network from being perfectly synchronized.

5 Conclusion

We have proposed a new scalable resource reservation architecture for the Internet. Our architecture achieves scalability for a large number of concurrent flows by aggregating flows at each link. This aggregation is made possible by delegating certain traffic control decisions to end systems - an idea borrowed from TCP. Reservations are controlled with estimation algorithms, which predict future resource usage based on previously observed traffic. Furthermore, protocol processing is simplified by attaching the reservation control information directly to data packets. We did not present a conclusive specification but rather described the general concepts, gave examples for implementations of core elements, including the design of estimator algorithms for sources, destinations and routers, and showed some illustrative simulation results.
Further work will focus on completing the specification, on evaluating and improving the algorithms described in this paper, and finally on the implementation of a prototype.

³ In this simulation, sources did not back off if a reservation progressed too slowly.

References

1. Diot, Christophe; Huitema, Christian; Turletti, Thierry. Multimedia Applications should be Adaptive, HPCS'95 Workshop, August 1995.
2. RFC 2205; Braden, Bob (Ed.); Zhang, Lixia; Berson, Steve; Herzog, Shai; Jamin, Sugih. Resource ReSerVation Protocol (RSVP) - Version 1 Functional Specification, IETF, September 1997.
3. RFC 1819; Delgrossi, Luca; Berger, Louis. ST2+ Protocol Specification, IETF, August 1995.
4. Ferrari, Domenico; Banerjea, Anindo; Zhang, Hui. Network Support for Multimedia - A Discussion of the Tenet Approach, Computer Networks and ISDN Systems, vol. 26, pp. 1267-1280, 1994.
5. The ATM Forum, Technical Committee. ATM User-Network Interface (UNI) Signalling Specification, Version 4.0, ftp://ftp.atmforum.com/pub/approved-specs/af-sig-0061.000.ps, The ATM Forum, July 1996.
6. The ATM Forum, Technical Committee. ATM Forum Traffic Management Specification, Version 4.0, ftp://ftp.atmforum.com/pub/approved-specs/af-tm-0056.000.ps, April 1996.
7. RFC 2475; Blake, Steven; Black, David; Carlson, Mark; Davies, Elwyn; Wang, Zheng; Weiss, Walter. An Architecture for Differentiated Services, IETF, December 1998.
8. Bernet, Yoram; Yavatkar, Raj; Ford, Peter; Baker, Fred; Zhang, Lixia; Speer, Michael; Braden, Bob; Davie, Bruce. Integrated Services Operation Over Diffserv Networks (work in progress), Internet Draft draft-ietf-issll-diffserv-rsvp-02.txt, June 1999.
9. RFC 2474; Nichols, Kathleen; Blake, Steven; Baker, Fred; Black, David. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers, IETF, December 1998.
10. Floyd, Sally; Mahdavi, Jamshid. TCP-Friendly Unicast Rate-Based Flow Control, http://www.psc.edu/networking/papers/tcp_friendly.html, Technical note, January 1997.
11. Almesberger, Werner; Ferrari, Tiziana; Le Boudec, Jean-Yves. SRP: a Scalable Resource Reservation Protocol for the Internet, Proceedings of IWQoS'98, pp. 107-116, IEEE, May 1998.
12. Ferrari, Tiziana. QoS Support for Integrated Networks, http://www.cnaf.infn.it/~ferrari/tesidot.html, Ph.D. thesis, November 1998.

Differentiated Internet Services

Florian Baumgartner, Torsten Braun, Hans Joachim Einsiedler and Ibrahim Khalil
Institute of Computer Science and Applied Mathematics
University of Berne, CH-3012 Bern, Switzerland
Tel +41 31 631 8681 / Fax +41 31 631 39 65, http://www.iam.unibe.ch/~rvs/

Abstract. With the growing popularity of the Internet and the increasing use of business and multimedia applications, the users' demand for higher and more predictable quality of service has risen. A first improvement to offer better-than-best-effort services was made by the development of the integrated services architecture and the RSVP protocol. But this approach proved only suitable for smaller IP networks and not for Internet backbone networks. In order to solve this problem the concept of differentiated services has been discussed in the IETF, which set up a working group in 1997. The Differentiated Services Working Group of the IETF has developed a new concept which scales better than the RSVP-based approach. Differentiated Services are based on service level agreements (SLAs) that are negotiated between users and Internet service providers.
With these SLAs users describe the packets which should be transferred over the Internet with higher priority than best-effort packets. The SLAs also define parameters such as the desired bandwidth for these higher-priority packets. The implementation of this concept requires additional functionality such as classification, metering, marking, shaping, policing etc. within routers at the domain boundaries. This paper describes the Differentiated Services architecture currently being defined by the IETF DiffServ working group and the required components to implement the DiffServ architecture.

1 Introduction

The Internet, currently based on the best-effort model, delivers only one type of service. With this model and FIFO queuing deployed in the network, any non-adaptive source can take advantage to grab high bandwidth while depriving others. One can always run multiple web browsers or start multiple FTP connections and grab a substantial amount of bandwidth by exploiting the best-effort model. The Internet is also unable to support real-time applications like audio or video. The incredibly rapid growth of the Internet has resulted in massive increases in demand for network bandwidth and performance guarantees to support both existing and new applications. In order to meet these demands, new Quality of Service (QoS) functionalities need to be introduced to satisfy customer requirements, including efficient handling of both mission-critical and bandwidth-hungry web applications. QoS, therefore, is needed for various reasons:

- Better control and more efficient use of network resources (e.g. bandwidth).
- Enabling users to enjoy multiple levels of service differentiation.
- Special treatment for mission-critical applications while letting others get fair treatment without interfering with mission-sensitive traffic.
- Business communication.
- Virtual Private Networks (VPN) over IP.
1.1 A Pragmatic Approach to QoS

A pragmatic approach to achieve good quality of service (QoS) is an adaptive design of the applications, reacting to changes of the network characteristics (e.g. congestion). Immediately after detecting a congestion situation, the transmission rate may be reduced by increasing the compression ratio or by modifying the A/V coding algorithm. For this purpose, functions to monitor quality of service are needed. For example, such functions are provided by the Real-Time Transport Protocol (RTP) [SCFJ96] and the Real-Time Control Protocol (RTCP). A receiver measures the delay and the rate of the packets received. This information is transmitted to the sender via RTCP. With this information the sender can detect if there is congestion in the network and adjust the transmission rate accordingly. This may affect the coding of the audio or video data. If only a low data rate is achieved, a coding algorithm with lower quality has to be chosen. Without adaptation the packet loss would increase, making the transmission completely useless. However, rate adaptation is limited since many applications need a minimum rate to work reasonably.

1.2 Reservation-based Approach

To achieve the QoS objectives mentioned in the earlier section, basically two approaches can be offered in a heterogeneous network like the Internet:

Integrated Service Approach: The Integrated Services Architecture based on the Resource Reservation Setup Protocol (RSVP) relies on absolute network reservation for specific flows. This can be supported in small LANs, where routers only need to store a small number of flow states. In the backbone, however, it would be extremely difficult, if not impossible, to store millions of flow states even with very powerful processors. Moreover, for short-lived HTTP connections, it is probably not practical to reserve resources in advance.
Differentiated Service (DiffServ): To avoid the scaling problem of RSVP, a differentiated service is provided for an aggregated stream of packets by marking the packets and invoking some differentiation mechanism (e.g. a forwarding treatment that treats packets differently) for each marked packet on the nodes along the stream's path. A very general approach of this mechanism is to define a service profile (a contract between a user and the ISP) for each user (or group of users), and to design other mechanisms in the router that favor traffic conforming to those service profiles. These mechanisms might be classification, prioritization and resource allocation, allowing the service provider to provision the network for each of the offered classes of service in order to meet the application (user) requirements.

2 DiffServ Basics and Terminology

The idea of differentiated services is based on the aggregation of flows, i.e. reservations have to be made for a set of related flows (e.g. for all flows between two subnets). Furthermore, these reservations are rather static since no dynamic reservations for a single connection are possible. Therefore, one reservation may exist for several, possibly consecutive connections. IP packets are marked with different priorities by the user (either in an end system or at a router) or by the service provider. According to the different priority classes the routers reserve corresponding shares of resources, in particular bandwidth. This concept enables a service provider to offer different classes of QoS at different costs to his customers. The differentiated services approach allows customers to set a fixed rate or a relative share of packets which have to be transmitted by the ISP with high priority. The probability of providing the requested quality of service depends essentially on the dimensioning and configuration of the network and its links, i.e.
whether individual links or routers can be overloaded by high-priority data traffic. Though this concept cannot guarantee any QoS parameters as a rule, it is more straightforward to implement than continuous resource reservations, and it offers better QoS than mere best-effort services.

2.1 Popular Services of the DiffServ Approach

At present, several proposals exist for the realization of differentiated services. Examples are:

Assured and Premium Services: The approach allowing the combination of different services like Premium and Assured Service seems to be very promising. In both approaches absolute bandwidth is allocated for aggregated flows. They are based on packet tagging indicating the service to be provided for a packet. Actually, assured service does not provide an absolute bandwidth guarantee but offers a soft guarantee that traffic marked with high-priority tagging will be transmitted with high probability.

User Share Differentiation and Olympic Service: An alternative approach called User-Share Differentiation (USD) assigns bandwidth proportionally to aggregated flows in the routers (for example all flows from or to an IP address or a set of addresses). A similar service is provided by the Olympic service. Here, three priority levels are distinguished, assigning different fractions of bandwidth to the three priority levels gold, silver and bronze, for example 60% for gold, 30% for silver and 10% for bronze.

2.2 DS byte marking

In differentiated services networks where service differentiation is the main objective, the differentiation mechanisms are triggered by the so-called DS byte (or ToS byte) marking of the IP packet header. Various service differentiation mechanisms (queuing disciplines), as we will study them in section 3, can be invoked dependent on the DS byte marking.
Therefore, marking is one of the most vital DS boundary enabling components, and all DS routers must implement this facility.

Fig. 1. DS byte in IPv4 [NBBB98]

In the latest proposal for packet marking, the first 6 bits, called the Differentiated Services Code Point (DSCP), are used to invoke PHBs (see Figure 1). Router implementations should support the recommended code-point-to-PHB mappings. The default PHB, for example, is 000000. Since the DSCP field has 6 bits, the number of code points that can be defined is 2^6 = 64. This proposal will be the basis of future DiffServ development. Many existing routers already use the IP precedence field to invoke various PHB treatments in a fashion similar to the DSCP. To remain compatible, routers can be configured to ignore bits 3, 4 and 5. Code points 101000 and 101010 would, therefore, map to the same PHB. Router designers must consider the semantics described above in their implementation and do the necessary and appropriate mapping in order to remain compatible with old systems.

2.3 Per Hop Behavior (PHB)

An introduction to PHBs has already been given while discussing DS byte marking in section 2.2. Further, [BW98] writes: "Every PHB is the externally observable forwarding behavior applied at a DS capable node to a stream of packets that have a particular value in the bits of the DS field (DS code point). PHBs can also be grouped when it is necessary to describe several forwarding behaviors simultaneously with respect to some common constraints." However, there is no rigid assignment of PHBs to DSCP bit patterns. This has several reasons:

- There are (or will be) a lot more PHBs defined than DSCPs available, making a static mapping impossible.
- The understanding of good choices of PHBs is only at its beginning.
- It is desirable to have complete flexibility in the correspondence of PHB values and behaviors.
- Every ISP shall be able to create/map PHBs in his DiffServ domain.

For these reasons there are no static mappings between DS code points and PHBs. The PHBs are enumerated as they become defined and can be mapped to any DSCP within a DiffServ domain. As long as the enumeration space contains a large number of values (2^32), there is no danger of running out of space to list the PHB values. This list can be made public for maximum interoperability. Because of this interoperability, mappings between PHBs and DSCPs are proposed, even though every ISP can choose other mappings for the PHBs in his DiffServ domain. Until now, two PHBs and corresponding DSCPs have been defined.

Table 1. The 12 different AF code points

                            Class 1   Class 2   Class 3   Class 4
  Low Drop Precedence       001010    010010    011010    100010
  Medium Drop Precedence    001100    010100    011100    100100
  High Drop Precedence      001110    010110    011110    100110

Assured Forwarding PHB: Based on the current Assured Forwarding PHB (AF) group [HBWW99], a provider can offer four independent AF classes, where each class can have one of three drop precedence values. These classes are not aggregated in a DS node, and Random Early Detection (RED) [FJ93] is considered to be the preferred discarding mechanism. This requires altogether 12 different AF code points, as given in table 1. In a Differentiated Services (DS) domain each AF class receives a certain amount of bandwidth and buffer space in each DS node. Drop precedence indicates the relative importance of the packet within an AF class. During congestion, packets with higher drop precedence values are discarded first
During congestion, packets with higher drop precedence values are discarded first to protect packets with lower drop precedence values. By having multiple classes and multiple drop precedences for each class, various levels of forwarding assurance can be offered. For example, an Olympic Service can be achieved by mapping three AF classes to its gold, silver and bronze classes. A low loss, low delay, low jitter service can also be achieved by using the AF PHB group if the packet arrival rate is known in advance. AF does not give any delay-related service guarantees. However, it is still possible to say that packets in one AF class have a smaller or larger probability of timely delivery than packets in another AF class. The Assured Service can be realized with AF PHBs.

Expedited Forwarding PHB: The forwarding treatment of the Expedited Forwarding (EF) PHB [JNP98] provides a departure rate higher than or equal to a configurable rate for aggregated traffic. Services which need end-to-end assured bandwidth with low loss, low latency and low jitter can use the EF PHB to meet these requirements. One good example is Premium Service (or virtual leased line), which has such requirements. Various mechanisms like Priority Queuing, Weighted Fair Queuing (WFQ) and Class Based Queuing (CBQ) are suggested to implement this PHB, since they can preempt other traffic and the queue serving EF packets can be allocated bandwidth equal to the configured rate. The recommended code point for the EF PHB is 101110.

2.4 Service Profile

A service profile expresses an expectation of the service received by a user, a group of users or a behavior aggregate from an ISP. It is, therefore, a contract between user and provider, and it also includes rules and regulations the user is supposed to obey. All these profile parameters are settled in an agreement called a Service Level Agreement (SLA).
It also contains a Traffic Conditioning Agreement (TCA) as a subset, which specifies traffic conditioning actions (described in the next subsection) and rules for traffic classification, re-marking, shaping, policing etc. In general, an SLA might include performance parameters like peak rate, burst size, average rate, delay and jitter parameters, drop probability and other throughput characteristics. An example is:

Service Profile 1: Code point: X, Peak rate = 2 Mbps, Burst size = 1200 bytes, Avg. rate = 1.8 Mbps

Only a static SLA, which usually changes weekly or monthly, is possible with today's router implementations. The profile parameters are set in the router manually to take the appropriate action. Dynamic SLAs change frequently and need to be deployed by some automated tool which can renegotiate resources between any two nodes.

2.5 Traffic Conditioner

Traffic conditioners [BBC+98] are required to instantiate services in DS capable routers and to enforce service allocation policies. These conditioners are, in general, composed of one or more of the following: classifiers, markers, meters, policers, and shapers. When a traffic stream at the input port of a router is classified, it might then have to travel through a meter (used where appropriate) to measure the traffic behavior against a traffic profile, which is a subset of the SLA. The meter classifies particular packets as IN- or OUT-of-profile depending on SLA conformance or violation. Based on the state of the meter, further marking, dropping, or shaping action is activated.

Fig. 2. DS Traffic Conditioning in Enterprise Network (as a set of queues)

Traffic conditioners can be applied at any congested network node (Figure 2) when the total amount of inbound traffic exceeds the output capacity of the switch (or router).
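A static service profile like "Service Profile 1" can be represented as a simple record. The field names are our own illustration, not from any standard, and the code point value stands in for the unspecified "X" of the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceProfile:
    """Static SLA parameters in the style of the 'Service Profile 1' example."""
    code_point: int        # DSCP the profile applies to
    peak_rate_bps: float   # peak rate in bits per second
    burst_bytes: int       # maximum burst size in bytes
    avg_rate_bps: float    # average (sustained) rate in bits per second

# The text leaves the code point as 'X'; 0b101110 here is purely illustrative.
profile_1 = ServiceProfile(code_point=0b101110,
                           peak_rate_bps=2_000_000,
                           burst_bytes=1200,
                           avg_rate_bps=1_800_000)
```

With a static SLA these values would be configured manually in the router; a dynamic SLA would renegotiate them through an automated tool.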
In Figure 2, routers between source and destination are modeled as queues in an enterprise network to show when and where traffic conditioners are needed. For example, routers may buffer traffic (i.e. shape it by delaying) or mark it to be discarded later during medium network congestion, but might have to discard packets (i.e. police traffic) during heavy network congestion when queue buffers fill up. As the number of routers grows in a network, congestion increases due to the expanded volume of traffic, and hence proper traffic conditioning becomes more important. Traffic conditioners might not need all four elements. If no traffic profile exists, packets may only pass through a classifier and a marker.

Classifier: Classifiers categorize packets from a traffic stream based on the content of some portion of the packet header. A classifier matches received packets to statically or dynamically allocated service profiles and passes them to an element of a traffic conditioner for further processing. Classifiers must be configured by some management procedure in accordance with the appropriate TCA. Two types of classifiers exist:

BA Classifier: classifies packets based on patterns of the DS byte (DS code point) only.

MF Classifier: classifies packets based on any combination of DS field, protocol ID, source address, destination address, source port, destination port or even application level protocol information.

Markers: Packet markers set the DS field of a packet to a particular code point, adding the marked packet to a particular DS behavior aggregate. The marker can (i) mark all packets which are mapped to a single code point, or (ii) mark a packet to one of a set of code points to select a PHB in a PHB group, according to the state of a meter.
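The BA/MF distinction can be sketched as follows. Packets are modeled as plain dictionaries, and the profile names and rule format are ours, purely for illustration.

```python
def ba_classify(packet: dict, dscp_profiles: dict) -> str:
    """Behavior-aggregate classification: key on the DS code point only."""
    return dscp_profiles.get(packet["dscp"], "best-effort")

def mf_classify(packet: dict, rules: list) -> str:
    """Multi-field classification: the first rule whose header fields all
    match the packet wins; any combination of fields may be used."""
    for fields, profile in rules:
        if all(packet.get(key) == value for key, value in fields.items()):
            return profile
    return "best-effort"

pkt = {"dscp": 0b101110, "proto": 17, "src": "10.0.0.1", "dport": 5004}
assert ba_classify(pkt, {0b101110: "EF"}) == "EF"
assert mf_classify(pkt, [({"proto": 17, "dport": 5004}, "AF1")]) == "AF1"
```

A BA classifier is cheap enough for backbone routers; the MF variant, inspecting several fields, fits the network edge where per-customer rules live.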
Meters: After being classified at the input of the boundary router, traffic from each class is typically passed to a meter. The meter is used to measure the rate (temporal properties) at which traffic of each class is being submitted for transmission, which is then compared against a traffic profile specified in the TCA (negotiated between the DiffServ provider and the DiffServ customer). Based on the comparison, particular packets are considered conforming to the negotiated profile (IN-profile) or non-conforming (OUT-of-profile). When a meter passes this state information to other conditioning functions, an appropriate action is triggered for each packet, which is either IN- or OUT-of-profile (see Table 1).

Shapers: Shapers delay some packets in a traffic stream using a token bucket in order to force the stream into compliance with a traffic profile. A shaper usually has a finite-size buffer, and packets are discarded if there is not sufficient buffer space to hold the delayed packets. Shapers are generally placed after either type of classifier. For example, shaping EF traffic at the interior nodes helps to improve end-to-end performance and also prevents the other classes from being starved by a big EF burst. Only either a policer or a shaper is supposed to appear in the same traffic conditioner.

Policer: When classified packets arrive at the policer, it monitors the dynamic behavior of the packets and discards or re-marks some or all of them in order to force the stream into compliance with a traffic profile (i.e. to force them to comply with configured properties like rate and burst size).
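The metering step described above is commonly realized with a token bucket; the following is a minimal sketch (class and method names are ours), where rate and depth would come from the TCA.

```python
class TokenBucketMeter:
    """Token-bucket meter: tokens accumulate at the contracted rate up to
    the bucket depth; a packet that finds enough tokens is IN-profile,
    otherwise OUT-of-profile."""

    def __init__(self, rate_bytes_per_s: float, depth_bytes: int):
        self.rate = rate_bytes_per_s
        self.depth = depth_bytes
        self.tokens = float(depth_bytes)   # bucket starts full
        self.last = 0.0

    def measure(self, size_bytes: int, now: float) -> str:
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return "IN"
        return "OUT"

meter = TokenBucketMeter(rate_bytes_per_s=1000, depth_bytes=1500)
assert meter.measure(1500, now=0.0) == "IN"    # burst fits the full bucket
assert meter.measure(1500, now=0.5) == "OUT"   # only 500 tokens refilled
assert meter.measure(500,  now=1.0) == "IN"    # 500 more tokens by t = 1 s
```

The IN/OUT verdict is exactly the state a meter hands to the marker, shaper or policer downstream.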
By setting the shaper buffer size to zero (or a few packets), a policer can be implemented as a special case of a shaper. Like shapers, policers can also be placed after either type of classifier. Policers, in general, are considered suitable to police traffic between a site and a provider (edge router) and after BA classifiers (backbone router). However, most researchers agree that policing should not be done at the interior nodes, since it unavoidably involves flow classification. Policers are usually present in ingress nodes and could be based on simple token bucket filters.

3 Realizing PHBs: The Queuing Components

Since differentiated service is a kind of service discrimination, some traffic needs to be handled with priority, some traffic needs to be discarded earlier than other traffic, some traffic needs to be serviced faster, and in general, one type of traffic always needs to be treated better than another. In earlier sections we discussed service profiles and PHBs. It was made clear that, in order to conform to the contracted profile and implement the PHBs, queuing disciplines play a crucial role. The queuing mechanisms typically need to be deployed at the output port of a router. Since we need different kinds of differentiation under specific situations, the right queuing component (i.e. PHB) needs to be invoked by the use of a particular code point. In this section, therefore, we describe some of the most promising mechanisms which have already been, or deserve to be, considered for implementation in varieties of DS routers.

3.1 Absolute Priority Queuing

In absolute priority queuing (Figure 3), the scheduler gives higher-priority queues absolute preferential treatment over lower-priority queues. Therefore, the highest priority queue receives the fastest service, and the lowest priority queue experiences the slowest service among the queues.
The basic working mechanism is as follows: the scheduler always scans the priority queues from highest to lowest to find the highest priority packet and then transmits it. When that packet has been completely served, the scheduler starts scanning again. If any of the queues overflows, packets are dropped and an indication is sent to the sender. While this queuing mechanism is useful for mission-critical traffic (since this kind of traffic is very delay sensitive), it would definitely starve the lower priority packets of the needed bandwidth.

3.2 WFQ

WFQ [Kes91] (Figure 4) is a discipline that assigns a queue to each flow. A weight can be assigned to each queue to give it a different proportion of the network capacity. As a result, WFQ can provide protection against other flows. WFQ can be configured to give low-volume traffic flows preferential treatment to reduce response time and to share the remaining bandwidth fairly between high-volume traffic flows. With this approach, bandwidth-hungry flows are prevented from consuming much of the network resources while depriving other, smaller flows. WFQ does the job of dynamic configuration since it adapts automatically to changing network conditions.

Fig. 3. Absolute Priority Queuing. The queue with the highest priority is served first

Fig. 4. Weighted Fair Queuing (WFQ)

TCP congestion control and slow-start features are also enhanced by WFQ, resulting in predictable throughput and response time for each active flow. The weighted aspect can be related to values in the DS byte of the IP header. A flow can be allocated more access to queue resources if it has a higher precedence value.
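The scan-from-highest mechanism of Section 3.1 can be sketched in a few lines; the function name is ours and the starvation property is visible directly in the loop.

```python
from collections import deque

def priority_dequeue(queues):
    """One scheduling step of absolute priority queuing: scan from highest
    to lowest priority and transmit the first packet found.  Lower
    priorities are served only when all higher queues are empty, so they
    can starve under sustained high-priority load."""
    for queue in queues:          # queues[0] = highest priority
        if queue:
            return queue.popleft()
    return None                   # all queues empty

high, low = deque(["h1", "h2"]), deque(["l1"])
order = [priority_dequeue([high, low]) for _ in range(3)]
assert order == ["h1", "h2", "l1"]   # low-priority packet waits until last
```

WFQ avoids exactly this starvation by giving every queue a weighted share instead of an absolute rank.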
3.3 Class Based Queuing (CBQ)

In an environment where bandwidth must be shared proportionally between users, CBQ [FJ95] (Figure 6) provides a very flexible and efficient approach: user traffic is first classified, then a specified amount of resources is assigned to each class of packets, and the queues are served in a round-robin fashion. A class can be an individual flow or an aggregation of flows representing different applications, users, departments, or servers. Each CBQ traffic class has a bandwidth allocation and a priority. In CBQ, a hierarchy of classes (Figure 5) is constructed for link sharing between organizations, protocol families, and traffic types. Different links in the network will have different link-sharing structures. The link-sharing goals are:

- Each interior or leaf class should receive roughly its allocated link-sharing bandwidth over appropriate time intervals, given sufficient demand.
- If all leaf and interior classes with sufficient demand have received at least their allocated link-sharing bandwidth, the distribution of any excess bandwidth should not be arbitrary, but should follow some set of reasonable guidelines.

Fig. 5. Hierarchical Link-Sharing

The granular level of control in CBQ can be used to manage the allocation of IP access bandwidth across the departments of an enterprise, or to provision bandwidth to the individual tenants of a multi-tenant facility. Other than the classifier that assigns arriving packets to an appropriate class, three other main components are needed in the CBQ mechanism: the scheduler, the rate-limiter (delayer) and the estimator.

Scheduler: In a CBQ implementation, the packet scheduler can be implemented with either a packet-by-packet round robin (PRR) or a weighted round robin (WRR) scheduler. Using priority scheduling, the scheduler first schedules packets from the highest priority level.
Round-robin scheduling is used to arbitrate between traffic classes within the same priority level. In weighted round robin scheduling, the scheduler uses weights proportional to a traffic class's bandwidth allocation. This weight determines the number of bytes a traffic class is allowed to send during a round of the scheduler. Each class in each round gets to send its weighted share in bytes, including finishing sending the current packet. That class's weighted share for the next round is decremented by the appropriate number of bytes. When a packet to be transmitted by a WRR traffic class is larger than the traffic class's weight but that class is underlimit¹, the packet is still sent, allowing the traffic class to borrow ahead from its weighted allotment for future rounds of the round-robin.

Fig. 6. Class Based Queuing: Main Components

Rate-Limiter: If a traffic class is overlimit² and is unable to borrow from its parent classes, the scheduler starts the overlimit action, which might include simply dropping arriving packets for such a class, or rate-limiting overlimit classes to their allocated bandwidth. The rate-limiter computes the next time that an overlimit class is allowed to send traffic. The class will not be allowed to send another packet until that future time has arrived.

Estimator: The estimator estimates the bandwidth used by each traffic class over the appropriate time interval and determines whether each class is over or under its allocated bandwidth.
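The byte-counted WRR round with borrow-ahead described above can be sketched as follows; queues hold packet lengths in bytes, and the function name and data layout are our own simplification.

```python
def wrr_round(queues, weights, credit):
    """One round of the WRR scheduler: each class adds its weight (its
    per-round byte share) to a running credit and keeps sending whole
    packets while the credit is positive.  Finishing the current packet
    may drive the credit negative, i.e. the class borrows ahead from its
    allotment for future rounds."""
    sent = []
    for i, queue in enumerate(queues):
        credit[i] += weights[i]
        while queue and credit[i] > 0:
            length = queue.pop(0)      # packet length in bytes
            credit[i] -= length
            sent.append((i, length))
    return sent

queues = [[300, 300], [300]]
credit = [0, 0]
assert wrr_round(queues, [400, 200], credit) == [(0, 300), (0, 300), (1, 300)]
assert credit == [-200, -100]          # both classes borrowed ahead
```

A negative credit is exactly the "weighted share for the next round is decremented" rule: the class must sit out until its weight refills the deficit.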
¹ A class is underlimit if it has used less than a specified fraction of its link-sharing bandwidth (in bytes/sec, as averaged over a specified time interval).
² A class is overlimit if it has recently used more than its allocated link-sharing bandwidth (in bytes/sec, as averaged over a specified time interval).

3.4 Random Early Detection (RED)

Random Early Detection (RED) [FJ93] is designed to avoid congestion by monitoring the traffic load at points in the network and stochastically discarding packets when congestion starts increasing. By dropping some packets early, rather than waiting until the buffer is full, RED keeps the average queue size low and avoids dropping large numbers of packets at once, minimizing the chances of global synchronization. Thus, RED reduces the chances of tail drop and allows the transmission line to be used fully at all times. This approach has certain advantages:

- Bursts can be handled better, as a certain queue capacity can always be reserved for incoming packets.
- Because of the lower average queue length, real-time applications are better supported.

The working mechanism of RED is quite simple. It has two thresholds, a minimum threshold X1 and a maximum threshold X2, for the packet discarding or admission decision, which is made by a dropper. Referring to Figure 7, when a packet arrives at the queue, the average queue size (av_queue) is computed. If av_queue < X1, the packet is admitted to the queue; if av_queue >= X2, the packet is dropped. In the case where the average queue size falls between the thresholds, X1 <= av_queue < X2, the arriving packet is either dropped or queued; mathematically speaking, it is dropped with linearly increasing probability. When congestion occurs, the probability that RED notifies a particular connection to reduce its window size is approximately proportional to that connection's share of the bandwidth.
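The two-threshold decision just described fits in one function; the linear interpolation between X1 and X2 and the maximum drop probability `p_max` follow the text, while the parameter names are ours.

```python
def red_decision(av_queue: float, x1: float, x2: float,
                 p_max: float, rand: float) -> str:
    """RED dropper: admit below X1, always drop at or above X2, and in
    between drop with probability rising linearly from 0 at X1 to p_max
    at X2.  `rand` is a uniform random number in [0, 1)."""
    if av_queue < x1:
        return "admit"
    if av_queue >= x2:
        return "drop"
    p_drop = p_max * (av_queue - x1) / (x2 - x1)
    return "drop" if rand < p_drop else "admit"

assert red_decision(3,  x1=5, x2=15, p_max=0.1, rand=0.99) == "admit"
assert red_decision(20, x1=5, x2=15, p_max=0.1, rand=0.99) == "drop"
assert red_decision(10, x1=5, x2=15, p_max=0.1, rand=0.01) == "drop"  # p = 0.05
```

Because heavier flows contribute more arrivals, they face more of these random drop decisions, which is why RED throttles large senders roughly in proportion to their bandwidth share.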
The RED congestion control mechanism monitors the average queue size for each output queue and uses randomization to choose which connections to notify of that congestion.

Fig. 7. Random Early Detection

RED is very useful to the network since it has the ability to flexibly specify traffic handling policies to maximize throughput under congestion conditions. RED is especially able to split bandwidth between TCP data flows in a fair way, as lost packets automatically cause a reduction of a TCP data flow's packet rate. More problematic is the situation when non-TCP-conforming data flows (e.g. UDP based real-time or multicast applications) are involved. Flows not reacting to packet loss have to be handled by reducing their data rate separately to avoid an overloading of the network. In general, RED statistically drops more packets from large users than from small ones. Therefore, traffic sources that generate the most traffic are more likely to be slowed down than traffic sources that generate little traffic.

3.5 RED with In and Out (RIO)

The queuing algorithm proposed for the Assured Service, RIO (RED with In and Out) [CW97], is an extension of the RED mechanism. This procedure shall make sure that during overload primarily packets with high drop precedence (e.g. best-effort instead of Assured Service packets) are dropped. A data flow can consist of packets with various drop precedences, which can arrive at a common output queue. Changes to the packet order can thus be avoided, which positively affects TCP performance. For IN- and OUT-of-profile packets a common queue is provided, using different dropping techniques for the different packet types. The dropper for OUT-of-profile packets discards packets much earlier (i.e. at a lower queue length) than the dropper for IN-profile packets. Furthermore, the dropping probability for OUT-of-profile packets increases faster than the probability for IN packets.
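The twin-dropper idea can be sketched as two RED curves over different queue measures: the IN-dropper sees only the IN-profile packet count with gentle parameters, the OUT-dropper sees the total queue length with aggressive ones. All threshold values below are illustrative assumptions, not prescribed by the RIO proposal.

```python
def linear_drop_prob(queue_len: float, x1: float, x2: float,
                     p_max: float) -> float:
    """RED-style drop probability rising linearly between two thresholds."""
    if queue_len < x1:
        return 0.0
    if queue_len >= x2:
        return 1.0
    return p_max * (queue_len - x1) / (x2 - x1)

def rio_drop_prob(total_queued: int, in_profile_queued: int,
                  is_in_profile: bool) -> float:
    """RIO dropper sketch: IN packets are judged against the IN-packet
    count only (high thresholds, small p_max); OUT packets against the
    total queue length (low thresholds, large p_max)."""
    if is_in_profile:
        return linear_drop_prob(in_profile_queued, x1=40, x2=70, p_max=0.02)
    return linear_drop_prob(total_queued, x1=10, x2=40, p_max=0.5)

# A queue holding 30 packets, 20 of them IN-profile:
assert rio_drop_prob(30, 20, is_in_profile=True) == 0.0   # IN still safe
assert rio_drop_prob(30, 20, is_in_profile=False) > 0.0   # OUT already at risk
```

Keeping both packet types in one queue preserves packet order; only their drop risk differs.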
In this way it shall be achieved that the probability of dropping IN-profile packets is kept very low. While the OUT-dropper uses the number of all packets in the queue for the calculation of its probability, the IN-dropper only uses the number of IN-profile packets (see Figure 8). Using the same queue, both types of packets will have the same delay. This might be a disadvantage of this concept. By dropping all OUT-of-profile packets at a quite small queue length this effect can be reduced, but not eliminated.

Fig. 8. RIO-Queuing

4 Differentiated Services in End-to-End Scenarios

4.1 Premium Service and Expedited Forwarding

With Premium Service the user negotiates with his ISP a maximum bandwidth for sending packets through the ISP network. Furthermore, the aggregated flow is described by the packets' source and destination addresses or address prefixes. In Figure 9, users and ISPs have agreed on a rate of three packets/s for traffic from A to B. The user configures the first-hop router in the individual subnet accordingly. In the example, a packet rate of two packets/s is allowed in every first-hop router, as it can be expected that no two end systems will use the full bandwidth of two packets/s at the same time.

Fig. 9. Premium Service

First-hop routers have the task of classifying the packets received from the end systems, i.e.
to analyze whether Premium Service shall be provided to the packets or not. If so, the packets are tagged as Premium Service and the data stream is shaped according to the maximum bandwidth. The user's border router re-shapes the stream (e.g. to three packets per second) and transmits the packets to the ISP's border router, which performs policing functions, i.e. it checks whether the user's border router remains below the negotiated bandwidth of three packets/s. If each of the two first-hop routers allows two packets/s, one packet per second will be dropped by shaping or policing at the border routers. All first-hop and border routers own two queues, one for EF packets and one for all others (see Figure 9). If the EF queue contains packets, these are transmitted prior to others. The implementation of two queues in every router of the network (ISP and user network) amounts to the realization of a virtual network for Premium Service traffic. Premium Service offers a service corresponding to a private leased line, with the advantage of making free network capacity available to other tasks, resulting in lower fees for the users.

4.2 Assured Service

A potential disadvantage of Premium Service is the weak support for bursts and the fact that a user has to pay even if he is not using the whole bandwidth. The Assured Service tries to offer a service which cannot guarantee bandwidth but provides a high probability that the ISP transfers high-priority-tagged packets reliably. Concrete services have not yet been defined, but it is natural to offer services similar to the IntServ controlled load service. The probability of packets being transported reliably depends on the network capacity. An ISP may choose the sum of all bandwidths for Assured Service to remain below the bandwidth of the weakest link. In this case, only a small portion of the available capacity may be allocated in the ISP network.
An advantage of the Assured Service is that users do not have to establish a reservation for a relatively long time. With ISDN or ATM, users might be unable to use the reserved bandwidth because of the burstiness of their traffic, whereas Assured Service allows the transmission of short-time bursts. With the Assured Service the user negotiates a service profile with his service provider, e.g. the maximum amount or rate of high-priority (i.e. Assured Service) packets. The user may then tag his packets as high priority within the end system or the first-hop router, i.e. assign them a tag for assured forwarding (AF) (see Figure 10). To avoid modifications in the end systems, the first-hop router may analyze the packets with respect to their IP addresses and UDP-/TCP-ports and then assign them the according priority, i.e. set the AF DSCP for conforming Assured Service packets. The maximum rate of high-priority (AF-DSCP) packets must not be exceeded. This is ensured by (re-)classification in the first-hop routers and in the user's border routers at the border to the ISP network. Nevertheless, the service provider has to check whether the user remains below the maximum rate for high-priority packets and apply corrective actions such as policing if necessary. For example, the border router at the network entrance will tag non-conforming packets as low priority (out of service, out of profile). An alternative would be for the ISP to charge higher fees for non-conforming packets. The tagging of low and high priority packets is done by use of the DS byte.

Fig. 10. Assured Service

Bursts are supported by making buffer capacity available for buffering bursty traffic. Inside the network, especially in backbone networks, bursts can be expected to be compensated statistically.
4.3 Traffic Conditioning for Assured and Premium Service

The implementation of Assured and Premium Service requires several modifications of the routers. Mainly classification, shaping, and policing functions have to be added to the router. These functions are necessary at the border between two networks, for example at the transition from the customer network to the ISP or between ISPs. Service profiles have to be negotiated between the ISPs, similar to the transition to the user.

First-hop router: Figure 11 shows the first-hop router functions for Premium and Assured Service. Received packets are classified, and accordingly the AF or EF DSCP is set if the packet should be supported with Assured or Premium Service. As parameters for the classification, source and destination addresses or information of higher protocols (e.g. port numbers) may be used. There are separate queues for each AF class, for EF and for best-effort traffic. A pure best-effort packet will be forwarded directly to the best-effort RED queue, and the Assured Service packets get to their RED queues. The Assured Service packets are checked for conformance to the service profile. The drop precedence will only be kept unchanged if the Assured Service bucket contains a token; otherwise the drop precedence will be increased. The RED-based queuing shall guarantee that AF packets with higher drop precedence are dropped prior to AF packets with lower drop precedence if the capacity is exceeded.

Fig. 11. First-hop router for Premium, Assured and best-effort services

Border router: Similar to the first-hop router, an intermediate router has to perform shaping functions in order to guarantee that no more than the allowed packet rate is transmitted to the ISP. This is important since the ISP will check whether the user remains within the negotiated service profile.
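The token-driven precedence decision of the first-hop router can be sketched as follows; the two-level precedence and all names are our simplification of the Figure 11 behavior.

```python
def condition_assured(packet_len: int, tokens: int):
    """First-hop conditioning for Assured Service: the packet keeps its
    low drop precedence only if the token bucket holds enough tokens;
    otherwise the drop precedence is increased.  Returns the remaining
    token count and the resulting marking."""
    if tokens >= packet_len:
        return tokens - packet_len, "low-drop-precedence"
    return tokens, "high-drop-precedence"

tokens = 1000
tokens, mark = condition_assured(600, tokens)
assert mark == "low-drop-precedence" and tokens == 400
tokens, mark = condition_assured(600, tokens)   # bucket now too small
assert mark == "high-drop-precedence" and tokens == 400
```

Downstream, the RED queues then discard the high-precedence markings first whenever capacity is exceeded.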
The border router in Figure 12 will therefore drop non-conforming Premium Service packets and increase the drop precedence of non-conforming Assured Service packets. Packets within an AF class but with different precedence values share the same queue, since both types of packets may belong to the same source. A common queue avoids re-ordering of packets. This is especially important for TCP performance reasons.

Fig. 12. Policing in a border router

First-Hop and Egress Border Routers: Figure 13 shows the working principle of a first-hop and an egress border router for Assured Service. An egress border router is the border router at which the packets leave the differentiated services domain. Received packets are classified and the AF DSCP is set if Assured Service should be given to the packet. Source and destination addresses and information of higher protocols (e.g. port numbers) may be used as classification parameters. A pure best-effort packet will be pushed directly to the output queue. The AF DSCP is set according to the availability of a token, and the packet is then written to the AF output queue. Normal best-effort traffic is pushed directly to the best-effort queue. The token buckets are configured according to the SLAs, consisting of bit rates and the burst parameter. A bucket may be capable of keeping several tokens to support short-time bursts. The bucket's depth depends on the arranged burst properties. The difference between a first-hop and an egress border router is that at the first-hop router a packet is classified for the first time, for which task information of higher protocols (TCP ports, type of the application) may be used, whereas the egress border router is capable of changing the drop precedence to meet the negotiated service profile.
Ingress Border Router: The ISP has to ensure that the user meets the negotiated traffic characteristics. To achieve this, the ISP has to check in his ingress border router, which transmits the packets into his DS domain, whether the user keeps to the SLA. The ingress border router of Figure 14 will therefore change the drop precedence of non-conforming packets.

Fig. 13. First-hop and egress border router for Assured Service

Fig. 14. Ingress border router with three drop precedences for Assured Service

4.4 User-Share Differentiation

Based upon packet tagging, the Premium and Assured Service models can fulfill the stipulated service parameters like bit rates with a high degree of probability only if the ISP network is dimensioned appropriately and non-best-effort traffic is transmitted between certain known networks only. If, for instance, two users have contracted a bit rate of 1 Mbps for Assured Service packets with an ISP and both wish to receive data simultaneously at a rate of 1 Mbps each from a WWW server which is connected to the network with a 1.5 Mbps link, the requested quality of service cannot be provided. The User-Share Differentiation approach [Wan97] avoids this problem by contracting not absolute bandwidth parameters but relative bandwidth shares.
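Contracting relative shares instead of absolute rates amounts to a proportional split of whatever capacity the bottleneck happens to offer. A minimal sketch, with illustrative share and capacity numbers of our own choosing:

```python
def usd_allocation(shares: dict, link_capacity: float) -> dict:
    """User-Share Differentiation: each user active on a bottleneck link
    receives bandwidth proportional to the contracted relative share,
    whatever the link's absolute capacity."""
    total = sum(shares.values())
    return {user: link_capacity * share / total
            for user, share in shares.items()}

# User A has contracted half of B's share (1 : 2) on a 30 kbps bottleneck:
assert usd_allocation({"A": 1, "B": 2}, 30.0) == {"A": 10.0, "B": 20.0}
# The same shares on a ten-times wider link yield ten times the bandwidth:
assert usd_allocation({"A": 1, "B": 2}, 300.0) == {"A": 100.0, "B": 200.0}
```

The absolute outcome thus floats with the bottleneck, which is exactly why USD cannot give absolute bandwidth guarantees.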
A user will be guaranteed only a certain relative amount of the available bandwidth in an ISP network. In practice, the size of this share will be in direct relation to the charged costs. In Figure 15, user A has allocated only half of the bandwidth of user B and one third of the bandwidth of user C. If A and B access the network on low-bandwidth links with a capacity of 30 kbps at the same time, user B will receive a bandwidth of 20 kbps while user A will get merely 10 kbps. If B and C access the same or possibly a different network via a common high-bandwidth link with a capacity of 25 Mbps, B will receive 10 Mbps and C 15 Mbps.

Fig. 15. User Share Differentiation (USD) bottleneck link

Simpler router configuration is an important advantage of the USD approach. However, absolute bandwidth guarantees cannot be supported. An additional drawback is that not only the edge routers must be configured (as in the case of Premium or Assured Service), but the interior routers must also be configured with the bandwidth shares.

5 Conclusion and Outlook

Standardization of Differentiated Services is still under discussion. So far, most discussions have centered around RED and the Assured Service. The Virtual Leased Line (or Premium Service) and its implementation by the EF PHB have recently been discussed in [JNP98], which would require implementations of Priority Queuing, WFQ, CBQ etc. It is not clear where the policing and shaping should take place. Although both the AF and EF PHBs have been proposed, the interaction between the two is a debatable issue. RED and its variants are complementary to different scheduling algorithms, and fit very nicely with CBQ. RED is designed to keep queue sizes small (smaller than their maximum in a given implementation), and thus to avoid tail drop and global TCP resynchronization.
It is, therefore, expected that in a router implementation all these service disciplines will need to coexist, and that some of them are complementary to each other. Nevertheless, new proposals for both the AF and EF PHBs strongly suggest that Class Based Queuing (CBQ), WFQ, and their variants will play stronger roles in the implementation of DiffServ.

Regarding interaction between the PHBs, the EF draft says that other PHBs can coexist at the same DS node given that the requirements of the AF classes are not violated. These requirements include timely forwarding, which is at the heart of EF. On the other end, the AF PHB group distinguishes between its classes based on timely forwarding. The AF draft also says that "any other PHB groups may coexist with the AF group within the same DS domain provided that the other PHB groups do not preempt the resources allocated to the AF classes". The question here is: if they coexist, should EF have more timely forwarding than the highest timely forwarded AF class by preempting any AF class, as the EF document basically states? What is needed here is that EF must leave AF whatever has been allocated for AF. This would mean EF can actually preempt forwarding resources from AF. For example, one could take a 1.5 Mbps link and allow for 64 Kbps of it to be available to EF, with the remaining capacity available to AF. One could also state that EF has absolute priority over AF (up to the 64 Kbps allocated). In this case, EF would preempt AF (so long as it conforms to the 64 Kbps limit) and AF would always be assured that it has 1.5 Mbps - 64 Kbps of the link throughput.

There are many more issues which are debatable and need attention in further research. However, we should always keep in mind that the whole point of DiffServ is to allow service providers to implement QoS pricing strategies in the first place.

References

[BBC+98] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss.
An architecture for differentiated services. Request for Comments 2475, December 1998.
[BW98] M. Borden and C. White. Management of PHBs. Internet Draft draft-ietf-diffserv-phb-mgmt-00.txt, August 1998. Work in progress.
[CW97] D. Clark and J. Wroclawski. An approach to service allocation in the Internet. Internet Draft draft-clark-diff-svc-alloc-00.txt, July 1997. Work in progress.
[FJ93] S. Floyd and V. Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, August 1993.
[FJ95] S. Floyd and V. Jacobson. Link-sharing and resource management models for packet networks. IEEE/ACM Transactions on Networking, 3(4), August 1995.
[HBWW99] J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski. Assured forwarding PHB group. Internet Draft draft-ietf-diffserv-af-06.txt, February 1999. Work in progress.
[JNP98] V. Jacobson, K. Nichols, and K. Poduri. An expedited forwarding PHB. Internet Draft draft-ietf-diffserv-phb-ef-02.txt, October 1998. Work in progress.
[Kes91] S. Keshav. Congestion Control in Computer Networks. PhD thesis, Berkeley, September 1991.
[NBBB98] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the differentiated services field (DS field) in the IPv4 and IPv6 headers. Request for Comments 2474, December 1998.
[SCFJ96] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: A transport protocol for real-time applications. Request for Comments 1889, January 1996.
[Wan97] Z. Wang. User-share differentiation (USD): scalable bandwidth allocation for differentiated services. Internet Draft draft-wang-diff-serv-usd-00.txt, November 1997. Work in progress.

A Portable Subroutine Library for Solving Linear Control Problems on Distributed Memory Computers*

Peter Benner¹, Enrique S.
Quintana-Ortí², and Gregorio Quintana-Ortí³

¹ Zentrum für Technomathematik, Fachbereich 3 - Mathematik und Informatik, Universität Bremen, D-28334 Bremen, Germany; benner@math.uni-bremen.de.
² Departamento de Informática, Universidad Jaime I, 12080 Castellón, Spain; quintana@inf.uji.es.
³ Same address as second author; gquintan@inf.uji.es.

Abstract. This paper describes the design of a software library for solving the basic computational problems that arise in analysis and synthesis of linear control systems. The library is intended for use in high performance computing environments based on parallel distributed memory architectures. The portability of the library is ensured by using the BLACS, PBLAS, and ScaLAPACK as the basic layer of communication and computational routines. Preliminary numerical results demonstrate the performance of the developed codes on parallel computers.

1 Introduction

In recent years, many new and reliable numerical methods have been developed for analysis and synthesis of moderate size linear time-invariant (LTI) systems. In generalized state-space form, such systems are described by the following models.

Continuous-time LTI system:

    E ẋ(t) = A x(t) + B u(t),   t > 0,   x(0) = x^(0),
    y(t) = C x(t),   t ≥ 0.                                            (1)

Discrete-time LTI system:

    E x_{k+1} = A x_k + B u_k,   k = 0, 1, 2, ...,   x_0 = x^(0),
    y_k = C x_k,   k = 0, 1, 2, ....                                   (2)

In both cases, A, E ∈ ℝ^{n×n}, B ∈ ℝ^{n×m}, and C ∈ ℝ^{p×n}. Here we assume that E is nonsingular. Descriptor systems with singular E also lead -- after

* Partially supported by the DAAD programme Acciones Integradas Hispano-Alemanas. Enrique S.
Quintana-Ortí and Gregorio Quintana-Ortí were also supported by the Spanish CICYT Project TIC96-1062-C03-03.

appropriate transformations -- to the above problem formulation with a nonsingular (and usually, diagonal or triangular) matrix E; see, e.g., [42,55].

The traditional approach to design a regulator for the above LTI systems involves the minimization of a cost functional of the form

    J_c(x_0, u) = (1/2) ∫_0^∞ ( y(t)^T Q y(t) + 2 y(t)^T S u(t) + u(t)^T R u(t) ) dt      (3)

in the continuous-time case, and

    J_d(x_0, u) = (1/2) Σ_{k=0}^∞ ( y_k^T Q y_k + 2 y_k^T S u_k + u_k^T R u_k )           (4)

in the discrete-time case. The matrices Q ∈ ℝ^{p×p}, S ∈ ℝ^{p×m}, and R ∈ ℝ^{m×m} are chosen in order to weight inputs and outputs of the system. The linear-quadratic regulator problem consists in minimizing (3) or (4) subject to the dynamics (1) or (2), respectively. It is well known that the solution of this optimal control problem is given by the closed-loop control

    u*(t) = -R^{-1} (B^T X_c E + S^T C) x(t) =: F_c x(t),   t > 0,                        (5)

in the continuous-time case, and

    u_k* = -(R + B^T X_d B)^{-1} (B^T X_d A + S^T C) x_k =: F_d x_k,   k = 0, 1, 2, ...,  (6)

in the discrete-time case. See, e.g., [1,42,38] for details and further references. The matrices X_c and X_d in (5) and (6) denote particular solutions of the (generalized) continuous-time algebraic Riccati equation (CARE)

    0 = R_c(X) := C^T Q C + A^T X E + E^T X A
        - (E^T X B + C^T S) R^{-1} (B^T X E + S^T C),                                     (7)

and the (generalized) discrete-time algebraic Riccati equation (DARE)

    0 = R_d(X) := C^T Q C + A^T X A - E^T X E
        - (A^T X B + C^T S) (R + B^T X B)^{-1} (B^T X A + S^T C).                         (8)

The optimal control in (5) and (6) is obtained from the stabilizing solutions of (7) and (8).
That is, we need to compute X_c and X_d such that the resulting closed-loop matrices

    A_c := E^{-1}(A + B F_c),   F_c := -R^{-1}(B^T X_c E + S^T C),                        (9)

and

    A_d := E^{-1}(A + B F_d),   F_d := -(R + B^T X_d B)^{-1}(B^T X_d A + S^T C)          (10)

have stable spectra. In the continuous-time case this means that A_c has all its eigenvalues in the open left half plane, while in the discrete-time case all the eigenvalues of A_d are of modulus less than one. We will call matrices (matrix pencils) with all eigenvalues in the open left half plane c-stable, and those with spectra inside the open unit disk will be called d-stable. The matrices F_c and F_d are called the optimal feedback gain matrices.

Under standard assumptions on LTI systems and the weighting matrices in the cost functionals J_c and J_d, the stabilizing solutions of the CARE and DARE exist, are unique, and symmetric. See [38] for a detailed account of conditions for existence and uniqueness of solutions to (7) and (8).

The algebraic Riccati equations in (7) and (8) can be formulated in the more general forms

    0 = R_c(X) = Q̂ + Â^T X E + E^T X Â - E^T X G X E                                     (11)

for the CARE, and

    0 = R_d(X) = Q̂ + Â^T X Â - E^T X E
        - (Â^T X B̂ + Ŝ)(R̂ + B̂^T X B̂)^{-1}(B̂^T X Â + Ŝ^T)                                (12)

for the DARE. Written in this form, they also include the algebraic Riccati equations arising in many areas of modern control theory like robust control, H_2- and H_∞-control, model reduction, etc.; see, e.g., [27,47,48,56]. The algorithms used here for the numerical solution of equations of the forms (11) and (12) do not depend on the particular form given in (7) and (8) and hence can be used to solve any algebraic Riccati equations given as in (11) and (12).
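As a small sequential illustration of (5) and (7), SciPy's dense CARE solver can act as a stand-in for the parallel routines this paper develops (here E = I_n and S = 0; the system matrices are toy data, not from the text):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Toy continuous-time LTI system with E = I and S = 0.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
C = np.eye(2)
Q = np.eye(2)          # output weighting
R = np.array([[1.0]])  # input weighting

# Stabilizing solution X_c of the CARE (7) and feedback gain F_c from (5).
Xc = solve_continuous_are(A, B, C.T @ Q @ C, R)
Fc = -np.linalg.solve(R, B.T @ Xc)

# The closed-loop matrix A + B F_c must be c-stable.
closed_loop = A + B @ Fc
```

SciPy's `solve_continuous_are` also accepts `e` and `s` arguments covering the generalized form (7).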
Throughout this paper we will assume that stabilizing solutions of the CARE (11) and the DARE (12) exist and hence that the closed-loop matrices Â_c and Â_d defined above are c-stable and d-stable, respectively.

In the course of solving the above nonlinear systems of equations via Newton's method, and in many other analysis and synthesis problems for LTI control problems, linear matrix equations of the form

    Â X B̂ + Ĉ X D̂ + Ê = 0                                             (13)

have to be solved. Here Â, Ĉ ∈ ℝ^{n×n}, B̂, D̂ ∈ ℝ^{m×m}, and Ê, X ∈ ℝ^{n×m}. Linear systems of equations as in (13) are called generalized Sylvester equations. Some particular instances of (13) are given below:

    Â X + X D̂ + Ê = 0,              (Sylvester equation)               (14)
    Â X B̂ - X + Ê = 0,              ("discrete" Sylvester equation)    (15)

and, for Ê = Ê^T,

    Â X + X Â^T + Ê = 0,            (Lyapunov equation)                (16)
    Â X Ĉ^T + Ĉ X Â^T + Ê = 0,      (generalized Lyapunov equation)    (17)
    Â X Â^T - X + Ê = 0,            (Stein equation)                   (18)
    Â X Â^T - Ĉ X Ĉ^T + Ê = 0.      (generalized Stein equation)       (19)

Stein equations are often also referred to as discrete Lyapunov equations.

In addition to the above, we will consider special cases of (generalized) Lyapunov and Stein equations where Ê is semidefinite and factored as Ê = ±Ê_1^T Ê_1. In this case, if Â - λĈ is a stable matrix pencil (the generalized eigenvalues of the matrix pencil are stable), then the solution of the corresponding Lyapunov or Stein equation is also semidefinite and can be factored as X = ±X_1^T X_1. This is the case, e.g., when computing the controllability Gramian W_c and the observability Gramian W_o of a continuous-time LTI system via the Lyapunov equations

    A W_c E^T + E W_c A^T + B B^T = 0,                                 (20)
    A^T W_o E + E^T W_o A + C^T C = 0.                                 (21)
In the discrete-time case these Gramians are given by the corresponding Stein equations

    A W_c A^T - E W_c E^T + B B^T = 0,                                 (22)
    A^T W_o A - E^T W_o E + C^T C = 0.                                 (23)

The Gramians of LTI systems play a fundamental role in many analysis and design problems of LTI systems, such as computing balanced, minimal, or partial realizations, the Hankel singular values and Hankel norm of LTI systems, and model reduction. Often, the Cholesky factors X_1 of the solutions to the above equations are needed. Hence, special algorithms are designed to compute these factors without ever forming the solution matrix explicitly. We consider special algorithms for all the above equations.

The subroutines resulting from implementing these algorithms will be used in order to tackle some computational problems for LTI systems:

C1 stabilize an LTI system, i.e., find F ∈ ℝ^{m×n} such that E - λ(A + BF) is a stable matrix pencil;
C2 model reduction, i.e., find low-order matrices (E_r, A_r, B_r, C_r) such that the LTI system defined by these matrices approximates the input-output behavior of the original system;
C3 solve the linear-quadratic optimization problems discussed above using (5) and (6);
C4 compute the optimal H_2 controller;
C5 compute a suboptimal H_∞ controller.

In addition to the computational subroutines provided by the PBLAS and ScaLAPACK [15] and the solvers for the above linear and nonlinear matrix equations, we will also need tools for the spectral decomposition of matrices and matrix pencils in order to accomplish Task C1.
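For E = I_n, (20) reduces to a standard Lyapunov equation that a dense sequential solver handles directly; a small sketch of computing the controllability Gramian (the matrices are made-up toy data):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 0.0], [1.0, -2.0]])   # c-stable
B = np.array([[1.0], [0.0]])

# Equation (20) with E = I:  A Wc + Wc A^T + B B^T = 0.
Wc = solve_continuous_lyapunov(A, -B @ B.T)
```

For a controllable pair (A, B) the computed Gramian is symmetric positive definite.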
The need for parallel computing in this area can be seen from the fact that already for a system with state-space dimension n = 1000, the corresponding Sylvester, Lyapunov, Stein, or Riccati equations represent a set of linear or nonlinear equations with one million unknowns. Systems of such a dimension driven by ordinary differential(-algebraic) equations are not uncommon in chemical engineering applications and are standard for second order systems arising from modeling mechanical multibody systems or large flexible space structures. We assume here that the coefficient matrices are dense and n ≤ 6000. Larger systems, such as those arising from the discretization of partial differential equations, usually involve sparse matrices. If sparsity is to be exploited, other computational techniques have to be employed [34,45].

The algorithms considered here are implemented in Fortran 77 using the kernels in the libraries BLACS, PBLAS, and ScaLAPACK. The resulting subroutines will form a subroutine library with tentative name PLILCO, Parallel Software Library for Linear Control theory.

This prospectus of the future PLILCO is organized as follows. In Section 2 we will review the basic numerical algorithms that can be employed in order to solve the computational problems needed to accomplish the required tasks. In order to obtain high portability of the subroutines to be implemented, we will follow the guidelines and computation model used in ScaLAPACK [15] as well as the implementation and documentation standards given in [16].
A short review of the parallel computing paradigms used and a survey of the design and contents of the prospective library will be given in Section 3. Preliminary results in Section 4 will demonstrate the performance of the developed subroutines in several parallel computing environments with shared/distributed memory. An outlook on future activities is given in Section 5.

2 Numerical Algorithms

2.1 The QR and QZ algorithms

The traditional approaches to solving the computational problems introduced in the preceding section involve the computation of invariant/deflating subspaces by means of the QR/QZ algorithms; see, e.g., [26,49].

The QR algorithm consists of an initial reduction step which transforms a given matrix A ∈ ℝ^{n×n} to upper Hessenberg form, i.e., A_0 := U_0^T A U_0, where U_0 is orthogonal. Afterwards, a sequence of similarity transformations A_{j+1} := U_{j+1}^T A_j U_{j+1} for j = 0, 1, 2, ... is performed. The transformation matrices U_j are chosen such that all iterates A_j are upper Hessenberg matrices and converge to upper quasi-triangular form. That is, if A_* = lim_{j→∞} A_j, then A_* is upper triangular with 1×1 and 2×2 blocks on the diagonal. The 1×1 blocks correspond to real eigenvalues of A while 2×2 blocks represent pairs of complex conjugate eigenvalues of A. Usually, convergence takes place in O(n) iterations. The similarity transformations with U_j can be implemented at a computational cost of O(n²) such that the overall computational cost of this algorithm is O(n³). If we set U := lim_{j→∞} Π_{k=0}^{j} U_k, then A_* = U^T A U. The upper quasi-triangular matrix A_*
is called the (real) Schur form of A. Applying a finite sequence of orthogonal similarity transformations to A_*, the diagonal blocks can be swapped such that the upper k×k block of the transformed matrix contains those eigenvalues of A that are inside some subset of the complex plane which is closed under complex conjugation. If we denote the accumulated transformation matrices that achieve this re-ordering by Û and replace U by U Û, then the first k columns of U span the A-invariant subspace corresponding to these eigenvalues.

The QZ algorithm applied to a matrix pencil A - λE computes orthogonal matrices Û, Ẑ ∈ ℝ^{n×n} such that

    Û^T (A - λE) Ẑ = A_* - λE_*,

where A_* is upper quasi-triangular and E_* is upper triangular. Again, the matrices Û, Ẑ can be chosen such that the first k columns of Ẑ span a particular right deflating subspace of A - λE corresponding to some desired subset of eigenvalues of A - λE. The QZ algorithm is equivalent to applying the QR algorithm to A E^{-1} without ever forming the product or the inverse explicitly. The matrix pencil A_* - λE_* is called the generalized (real) Schur form of A - λE.

In order to compute a spectral decomposition of a matrix or matrix pencil as required, e.g., in Task C1, the QR (QZ) algorithm can be applied to the matrix (pencil). The re-ordering must then be performed such that the spectrum of the leading k×k diagonal block of A_* (A_* - λE_*) corresponds to the eigenvalues on the one side of the line dividing the spectrum, while the trailing diagonal block corresponds to the eigenvalues on the other side of this line.
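In dense sequential form, this combination of Schur decomposition and eigenvalue re-ordering is available in SciPy; a sketch of spectral division along the imaginary axis (the 2×2 matrix is made up for illustration):

```python
import numpy as np
from scipy.linalg import schur

A = np.array([[1.0, 4.0], [2.0, -3.0]])  # one stable, one unstable eigenvalue

# Real Schur form with the left-half-plane eigenvalues ordered first;
# k counts the eigenvalues satisfying the sorting criterion.
T, U, k = schur(A, output="real", sort="lhp")

# The first k columns of U span the A-invariant subspace belonging
# to the eigenvalues with negative real part.
stable_basis = U[:, :k]
```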
For continuous-time systems, usually a spectral division along the imaginary axis is needed, while for discrete-time systems the usual spectral division line is the unit circle.

When solving the symmetric linear matrix equations (16)-(19) with the most widely used method, the Bartels-Stewart method, the QR and QZ algorithms are used for initial reductions of the involved matrix Â or matrix pencil Â - λĈ to upper quasi-triangular form. This initial stage is followed by a backsubstitution process in order to solve the resulting triangular systems. Note that the main computational work is done during the initial reduction. This approach is used, e.g., in [5,22,23,45] for the equations (16)-(19) and also when solving semidefinite Lyapunov and Stein equations of the form (20)-(23) in [30,54,45]. For the nonsymmetric equations (14) and (15), it is usually sufficient to transform one of the coefficient matrices to upper quasi-triangular form and the other one to Hessenberg form. This approach is called the Hessenberg-Schur method following [25] and is extended to (13) in [22,23].

The algebraic Riccati equations (11) and (12) can be solved via the QR/QZ algorithms using the relation to certain invariant or deflating subspaces of the corresponding matrices/matrix pencils. If the stable right deflating subspace of

    H - λK := [ Â     G   ]     [ E    0  ]
              [ Q̂    -Â^T ] - λ [ 0   E^T ]                            (25)

is spanned by [ Z_11 ; Z_21 ], Z_11, Z_21 ∈ ℝ^{n×n}, and Z_11 is invertible, the stabilizing solution of (11) is given by X_c = -Z_21 Z_11^{-1} E^{-1}.
Hence the CARE (11) can be solved by applying the QZ algorithm to H - λK and re-ordering the eigenvalues such that the stable eigenvalues (i.e., those with negative real parts) appear in the upper n×n diagonal block of the generalized Schur form of H - λK. Then the first n columns of the matrix Z computed by the QZ algorithm span the required stable right deflating subspace of H - λK. Note that the optimal control u*(t) can be computed using F_c = R^{-1}(B^T Z_21 Z_11^{-1} - S^T C) without solving the CARE explicitly. In case E = I_n, it is sufficient to apply the QR algorithm to the Hamiltonian matrix H from (25) and to order the Schur form of H accordingly. This approach was first suggested in [39] and outlined in [3] for E ≠ I_n. The resulting methods are called the (generalized) Schur vector methods.

Similar observations as in the continuous-time case lead to Schur vector methods for DAREs as given in (12). Here the QZ algorithm and an appropriate re-ordering are to be applied to the extended pencil

    M - λL = [ Â     0    B̂ ]     [ E     0    0 ]
             [ -Q̂    E^T  -Ŝ ] - λ [ 0    Â^T   0 ]                    (26)
             [ Ŝ^T   0    R̂ ]     [ 0   -B̂^T   0 ]

If the generalized Schur form of M - λL is ordered such that the leading n×n diagonal block contains the eigenvalues inside the unit disk, then the first n columns of the Z-matrix computed by the QZ algorithm span the stable (with respect to the unit circle) right deflating subspace of M - λL. Partitioning these n columns of Z as [ Z_11^T, Z_21^T, Z_31^T ]^T, where Z_11, Z_21 ∈ ℝ^{n×n}, Z_31 ∈ ℝ^{m×n}, and assuming Z_11 nonsingular, X_d = Z_21 Z_11^{-1} E^{-1} and F_d = Z_31 Z_11^{-1} [42]. Note that using this approach it is possible to compute the optimal control u_k* directly without solving the DARE explicitly. The computational cost of this approach can be lowered if R is invertible and well-conditioned by applying the QZ algorithm to

    M̂ - λL̂ = [ Â - B̂R̂^{-1}Ŝ^T       0  ]     [ E    B̂R̂^{-1}B̂^T        ]
             [ -(Q̂ - ŜR̂^{-1}Ŝ^T)   E^T ] - λ [ 0   (Â - B̂R̂^{-1}Ŝ^T)^T ]   (27)

If E = I_n, M̂ - λL̂ is a symplectic matrix pencil.
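The discrete-time counterpart can likewise be illustrated with a dense sequential DARE solver (E = I_n and Ŝ = 0; toy matrices, with the gain computed as in (6)):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.2], [0.0, 0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# Stabilizing DARE solution and feedback gain as in (6).
Xd = solve_discrete_are(A, B, Q, R)
Fd = -np.linalg.solve(R + B.T @ Xd @ B, B.T @ Xd @ A)

# Closed-loop matrix; all its eigenvalues must lie inside the unit disk.
Ad = A + B @ Fd
```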
These Schur vector methods for the discrete-time case have been proposed in [3,44,53].

If the standard approaches to the spectral division problem and to the solution of the linear and nonlinear matrix equations described above are to be used for computations on parallel distributed memory computers, we will need efficient implementations of the QR and QZ algorithms for these computing environments. In ScaLAPACK, only the QR algorithm is available so far. However, in order to solve the linear matrix equations considered here, the QR algorithm can only be used for (14)-(16) and (18). In all other cases, the QZ algorithm has to be employed in the initial stage when solving these equations via the most frequently used Hessenberg-Schur and Bartels-Stewart methods as described above. Solving (11) and (12) by the (generalized) Schur vector methods, again the QR algorithm can only be used in the CARE case with E = I_n; for all other cases, the QZ algorithm is needed.

A different approach to solving the algebraic Riccati equations (11) and (12) is to consider these equations as nonlinear sets of equations. From this perspective, the most obvious choice to solve algebraic Riccati equations is Newton's method. In each iteration step of Newton's method applied to CAREs or DAREs [3,33,37,38,42], a (generalized) Lyapunov or Stein equation of the form (16)-(19) has to be solved; see Section 2.4 below. Thus a parallel implementation of Newton's method also depends heavily on the parallel performance of the Lyapunov or Stein solver employed, i.e., if the Bartels-Stewart method is to be used, once more on the efficiency of the parallelized QR/QZ algorithms.
From the above considerations we can conclude that in order to use the traditional algorithms for solving linear and algebraic Riccati matrix equations, it is necessary to have efficient parallelizations of the QR and QZ algorithms. However, several experimental studies report the difficulties in parallelizing the double implicit shifted QR algorithm on parallel distributed multiprocessors (see, e.g., [17,24,31,51]). The algorithm presents a fine granularity which introduces performance losses due to communication start-up overhead (latency). Besides, traditional data layouts (column/row block scattered) lead to an unbalanced distribution of the computational load. A different approach relies on a block Hankel distribution, which improves the balancing of the computational load [31]. Attempts to increase the granularity by employing multishift techniques have recently been proposed in [32]. Nevertheless, the parallelism and scalability of these algorithms are still far from those of matrix multiplications, matrix factorizations, triangular linear system solvers, etc.; see, e.g., [15] and the references given therein.

Although the parallelization of the QR algorithm has been thoroughly studied, in contrast, the parallelization of the QZ algorithm remains unexplored to the best of our knowledge. Moreover, since both the QR and the QZ algorithms are composed of the same type of fine-grain computations, similar or even worse parallelism and scalability results are to be expected from the QZ algorithm.
In order to avoid the problems arising from the difficult parallelization of the QR and QZ algorithms, we will use a different computational approach here. It is well known that under suitable assumptions, the above matrix equations can be solved via the sign function method. It has long been acknowledged that algorithms based on the sign function are relatively easy to parallelize. The methods that will be employed in PLILCO will be considered in the next sections.

2.2 The Sign Function Method and the Smith Iteration

The sign function method was first introduced in 1971 by Roberts [46] for solving algebraic Riccati equations of the form (11) with E = I_n. Roberts also shows how to solve stable Sylvester and Lyapunov equations via the matrix sign function. The application to CAREs and DAREs with E ≠ I_n is investigated in [20,21], while the application to (16) with E ≠ I_n is examined in [13].

The computation of the sign function requires basic numerical linear algebra tools like matrix multiplication, inversion, and/or solving linear systems. These computations are implemented efficiently on most parallel architectures and, in particular, ScaLAPACK [15] provides easy-to-use and portable computational kernels for these operations. Hence, the sign function method is an appropriate tool to design and implement efficient and portable numerical software for distributed memory parallel computers.

Let Z ∈ ℝ^{n×n} have no eigenvalues on the imaginary axis and denote by

    Z = S [ J^-   0  ] S^{-1}
          [ 0    J^+ ]

its Jordan decomposition, with J^- ∈ ℂ^{k×k} and J^+ ∈ ℂ^{(n-k)×(n-k)}
containing the Jordan blocks corresponding to the eigenvalues in the open left and right half planes, respectively. Then the matrix sign function of Z is defined as

    sign(Z) := S [ -I_k   0       ] S^{-1}.                            (28)
                 [  0     I_{n-k} ]

Note that sign(Z) is unique and independent of the order of the eigenvalues in the Jordan decomposition of Z (see, e.g., [38, Section 22.1]). Many other equivalent definitions for sign(Z) can be given; see, e.g., the recent survey paper [35].

The application of the matrix sign function method to a matrix pencil Z - λY as given in [20], in case Z and Y are nonsingular, can be presented as

    Z_0 := Z,   Z_{k+1} := (1/(2 c_k)) ( Z_k + c_k² Y Z_k^{-1} Y ),   k = 0, 1, 2, ...,   (29)

where c_k is a scaling parameter. E.g., for determinantal scaling, c_k is given as c_k = ( |det(Z_k)| / |det(Y)| )^{1/n} [20]. This iteration is equivalent to computing the sign function of the matrix Y^{-1} Z via the standard Newton iteration as proposed in [46]. The property needed here is that if Z_∞ := lim_{k→∞} Z_k, then (Z_∞ - Y)/2 (or (Z_∞ + Y)/2) defines the skew projection onto the stable (or anti-stable) right deflating subspace of Z - λY parallel to the anti-stable (or stable) deflating subspace.

In [20] the iteration (29) is used to compute the stabilizing solution of the CARE (11) and the DARE (12) using the matrix pencils (25) and (27). The algebraic Riccati equation (11) can be solved by applying (29) to Z - λY = H - λK and then forming the resulting projector Z_∞ - Y onto the stable deflating subspace of H - λK. A basis of this subspace is then given by the range of that projector.
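Iteration (29) needs only determinants, linear system solves, and matrix sums, which is what makes it attractive for ScaLAPACK-style kernels; a sequential NumPy sketch with determinantal scaling (the tolerance and the test matrix are illustrative):

```python
import numpy as np

def sign_pencil(Z, Y, tol=1e-12, maxit=100):
    """Generalized Newton iteration (29) for the pencil Z - lambda*Y with
    determinantal scaling c_k = (|det Z_k| / |det Y|)^(1/n).
    Returns Z_inf = Y * sign(Y^{-1} Z)."""
    n = Z.shape[0]
    abs_det_Y = abs(np.linalg.det(Y))
    Zk = Z.astype(float)
    for _ in range(maxit):
        ck = (abs(np.linalg.det(Zk)) / abs_det_Y) ** (1.0 / n)
        Znext = (Zk / ck + ck * (Y @ np.linalg.solve(Zk, Y))) / 2.0
        if np.linalg.norm(Znext - Zk, 1) <= tol * np.linalg.norm(Znext, 1):
            return Znext
        Zk = Znext
    return Zk

# With Y = I the iteration reduces to Roberts' classical Newton iteration
# and the limit is sign(Z) itself.
Z = np.array([[-2.0, 1.0], [0.0, 3.0]])
S = sign_pencil(Z, np.eye(2))
```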
This subspace is usually not computed explicitly, as X_E := X_c E can be obtained by solving the overdetermined but consistent set of linear equations

    [ Z_12       ]        [ Z_11 + E ]
    [ Z_22 + E^T ] X_E =  [ Z_21     ],   where  Z_∞ = [ Z_11  Z_12 ; Z_21  Z_22 ];   (30)

see [18,20,38,46]. The matrix X_c can be obtained by solving X_E = X_c E, while the optimal gain matrix, and therefore the optimal control, is obtained directly using F_c = -R^{-1} (B^T X_E + S^T C).

The DARE (12) can not be solved directly using the sign function method, as we need the d-stable deflating subspace of M̂ - λL̂ from (27). One possibility to switch back and forth between c- and d-stable matrix pencils A - λB (or c- and d-stable deflating subspaces) is the Cayley transformation

    C_μ(A - λB) = (μA + B) - λ(A - μB),   |μ| = 1,   det(A - μB) ≠ 0.

In order to keep computations real, one has to choose μ = ±1; here we restrict ourselves to μ = 1. It is well known (see, e.g., [40,43]) that if A - λB is c-stable (d-stable), then C_μ(A - λB) is d-stable (c-stable), and the c-stable (d-stable) right deflating subspace of A - λB is the d-stable (c-stable) right deflating subspace of C_μ(A - λB). Hence, the DARE (12) can be solved with the sign function method applied to C_μ(M̂ - λL̂). The solution X_d is then obtained from (30), replacing X_c by X_d.

Note that none of the methods considered so far can be used to solve (12) via (26): as we need the d-stable deflating subspace of M - λL, the sign function method can not be applied directly.
Though this subspace is given by the c-stable right deflating subspace of the Cayley-transformed matrix pencil C_μ(M - λL), the sign function method can in general not be used here, as M + L and M - L may be singular. A different approach to solve the spectral division problem and the considered matrix equations is reviewed in Section 2.3. This approach will also overcome the problems for the DARE (12) mentioned above.

The (generalized) Lyapunov and Stein equations (16) and (17) are special instances of the CARE (11) and DARE (12), respectively. This implies that one can solve (16) and (17) by means of the sign function method applied to the matrix pencil in (25), which then takes the form

    H - λK = [ A  0 ; -Q  -A^T ] - λ [ C  0 ; 0  C^T ].                             (31)

For stable matrix pencils A - λC, H - λK is regular and has an n-dimensional stable deflating subspace such that the solution of (16) can be obtained analogously to that of (11). In [13] it is observed that, applying the generalized Newton iteration (29) to the matrix pencil H - λK in (31) and exploiting the block-triangular structure of all matrices involved, (29) boils down to

    A_0 := A,   A_{k+1} := (1/2) (A_k + C A_k^{-1} C),
    E_0 := Q,   E_{k+1} := (1/2) (E_k + C^T A_k^{-T} E_k A_k^{-1} C),      k = 0, 1, 2, ...,   (32)

and that X = (1/2) C^{-T} (lim_{k→∞} E_k) C^{-1}. In case C = I_n, the iteration in (32) has already been derived by Roberts [46]. The semidefinite Lyapunov equations as in (20)-(23) can be solved using a factored version of the iteration for the E_k's in (32), i.e., the iteration is performed starting with the factor of Q = F^T F. This iteration then converges to sqrt(2) X_1 C if the solution is factored as X = X_1^T X_1.
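A serial sketch of the coupled iteration (32) (our dense NumPy illustration, not a parallel library routine) reads:

```python
import numpy as np

def lyap_sign(A, Q, C=None, maxit=50, tol=1e-12):
    """Iteration (32) for the generalized Lyapunov equation
    A^T X C + C^T X A + Q = 0; the pencil A - lambda*C is assumed c-stable."""
    n = A.shape[0]
    C = np.eye(n) if C is None else C
    Ak, Ek = A.copy(), Q.copy()
    for _ in range(maxit):
        Ainv = np.linalg.inv(Ak)
        Ek = 0.5 * (Ek + C.T @ Ainv.T @ Ek @ Ainv @ C)
        Anew = 0.5 * (Ak + C @ Ainv @ C)
        done = np.linalg.norm(Anew - Ak, 1) <= tol * np.linalg.norm(Anew, 1)
        Ak = Anew
        if done:
            break
    Cinv = np.linalg.inv(C)
    return 0.5 * Cinv.T @ Ek @ Cinv   # X = (1/2) C^{-T} (lim E_k) C^{-1}
```

For C = I_n this is Roberts' iteration; the only kernels needed per step are an inversion and matrix products.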
Details of this algorithm can be found in [13], and its application to computing the system Gramians for continuous-time LTI systems as given in (20), (21) is described in [11].

In case the spectra of A - λC and D - λB satisfy σ(A, C) ⊂ C^- and σ(D, B) ⊂ C^-, the Sylvester equation (14) can also be solved using the sign function method applied to

    H - λK = [ A  E ; 0  -D ] - λ [ C  0 ; 0  B ].                                  (33)

Using again the block-triangular structure of the matrix pencil H - λK, the iteration can be performed on the blocks as follows:

    A_0 := A,   A_{k+1} := (1/2) (A_k + C A_k^{-1} C),
    D_0 := D,   D_{k+1} := (1/2) (D_k + B D_k^{-1} B),                  k = 0, 1, 2, ...,   (34)
    E_0 := E,   E_{k+1} := (1/2) (E_k + C A_k^{-1} E_k D_k^{-1} B).

The solution of (13) is then given by the solution of the linear system of equations 2 C X B = lim_{k→∞} E_k. In case C = I_n and B = I_m, other iterative schemes for computing the sign function, like the Newton-Schulz iteration or Halley's method, can also be implemented efficiently to solve the corresponding Lyapunov and Sylvester equations (16) and (14); details of the resulting algorithms will be reported in [14].

So far we have only considered the linear matrix equations for continuous-time control problems. That is, we have assumed stability with respect to the imaginary axis. In discrete-time control problems, stability properties are given with respect to the unit circle. The linear matrix equations encountered in discrete-time control problems are (15), (18), and (19). Let us first consider (15), of which (18) is a special instance. If we rewrite the equation in fixed point form, X = A X B + E, and form the fixed point iteration

    X_0 := E,   X_{k+1} := E + A X_k B,   k = 0, 1, 2, ...,
then this iteration converges to X if A and B are d-stable. The convergence rate of this iteration is linear. A quadratically convergent version of the fixed point iteration is suggested in [19,50]:

    A_0 := A,   B_0 := B,   X_0 := E,
    X_{k+1} := A_k X_k B_k + X_k,                                                   (35)
    A_{k+1} := A_k^2,   B_{k+1} := B_k^2,      k = 0, 1, 2, ....

The above iteration is referred to as the Smith iteration. We employ it to solve (18) and (15). In case (19) is to be solved with the Smith iteration, one has to apply (35) to

    (A C^{-1})^T X (A C^{-1}) - X + C^{-T} Q C^{-1} = 0.

This has the disadvantage that the iteration is started with data that is already corrupted by roundoff errors basically determined by cond(C), i.e., the condition of C with respect to matrix inversion, defined by cond(C) = ||C|| ||C^{-1}||. One possibility to avoid the initial inversion of C when solving (19) by the Smith iteration is to transform (19) into a generalized Lyapunov equation, without inverting any matrices, using the Cayley transformation, and then to apply (32) to the transformed equation

    (A + C)^T X (A - C) + (A - C)^T X (A + C) + 2 Q = 0,                            (36)

which has the same solution as (19). Of course, the same approach can be used for (18), setting C = I_n. But this yields a generalized Lyapunov equation. In order to obtain a standard Lyapunov equation of the form (16), one has to multiply (36) from the left by (A - C)^{-T} and from the right by (A - C)^{-1}. This introduces again unnecessary rounding errors, and we will therefore not follow this approach here.

2.3 The Disk Function Method

Let Z - λY, Z, Y ∈ R^{n x n}, be a regular matrix pencil having no eigenvalues on the unit circle.
Suppose the Weierstraß (Kronecker) canonical form of Z - λY is given by

    Z - λY = T [ J_0 - λI  0 ; 0  J_∞ - λN ] S,

where the Jordan blocks corresponding to eigenvalues inside the unit disk are collected in J_0, J_∞ corresponds to eigenvalues outside the unit disk, and N contains nilpotent blocks corresponding to infinite eigenvalues. The matrix pencil disk function is defined in [6] as

    disk(Z, Y) := S ( [ I_k  0 ; 0  0 ] - λ [ 0  0 ; 0  I_{n-k} ] ) S^{-1} =: D_Z - λ D_Y.   (37)

A matrix disk function was also introduced in [46] using a different approach. In [6] it is shown that this is a special case of the above definition using Y = I_n. From the disk function, we can obtain the d-stable deflating subspace of Z - λY, as D_Z is a skew projector onto this subspace. Hence, a basis for this subspace is given by a basis of the column space of D_Z. The disk function has received some interest in recent years as it provides the mathematical framework for an algorithm proposed in [41] and made feasible for practical computations in [4] for solving the spectral division problem. This inverse free spectral division algorithm can be given as follows:

    Z_0 := Z,   Y_0 := Y,
    [ Y_k ; -Z_k ] = [ U_11  U_12 ; U_21  U_22 ] [ R_k ; 0 ]   (QR decomposition),
    Z_{k+1} := U_12^T Z_k,   Y_{k+1} := U_22^T Y_k,      k = 0, 1, 2, ....          (38)

It follows that disk(Z, Y) = (Z_∞ + Y_∞)^{-1} (Y_∞ - λ Z_∞), where lim_{k→∞} (Z_k, Y_k) =: (Z_∞, Y_∞). Hence a basis for the d-stable right deflating subspace of Z - λY can be computed via a rank-revealing QR decomposition of (Z_∞ + Y_∞)^{-1} Y_∞. Note that this QR decomposition can be computed without explicitly inverting (Z_∞ + Y_∞); see [4] for details.
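The inverse free iteration (38) needs only one QR factorization of a stacked 2n x n matrix per step. A dense sketch (ours, illustrative; a production code would add a convergence test and the rank-revealing post-processing from [4]):

```python
import numpy as np

def disk_function(Z, Y, maxit=30):
    """Inverse free spectral division iteration (38); returns (Z_inf, Y_inf).
    D_Z = (Z_inf + Y_inf)^{-1} Y_inf is then the skew projector onto the
    d-stable deflating subspace of Z - lambda*Y."""
    n = Z.shape[0]
    Zk, Yk = Z.copy(), Y.copy()
    for _ in range(maxit):
        # full QR decomposition of the stacked matrix [Y_k; -Z_k]
        U, _ = np.linalg.qr(np.vstack((Yk, -Zk)), mode='complete')
        Zk, Yk = U[:n, n:].T @ Zk, U[n:, n:].T @ Yk   # U_12^T Z_k, U_22^T Y_k
    return Zk, Yk

# Example: Y = I, Z = diag(0.5, 2.0); the eigenvalue 0.5 lies inside the
# unit disk, so D_Z converges to the spectral projector diag(1, 0).
Zinf, Yinf = disk_function(np.diag([0.5, 2.0]), np.eye(2))
DZ = np.linalg.solve(Zinf + Yinf, Yinf)
```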
Moreover, a complete spectral decomposition of Z - λY along the unit circle can be computed using only one iteration of the form (38); see [52]. From the above considerations we can conclude that the DARE (12) can be solved by applying iteration (38) to M - λL from (26). It is shown in [6] that it is not necessary to compute a basis for the d-stable deflating subspace explicitly. Using the relation between the null spaces, Ker(D_Z) = Ker(Z_∞), and the fact that if the stabilizing solution X_d of (12) exists, then [ I_n  (X_d E)^T  F_d^T ]^T ∈ Ker(D_Z), one can show (see [6]) that the solution of the DARE and the optimal gain matrix of the discrete-time optimal control problem can be obtained from the solution of the overdetermined but consistent set of linear equations

    [ Z_12  Z_13 ; Z_22  Z_23 ; Z_32  Z_33 ] [ X_d E ; F_d ] = - [ Z_11 ; Z_21 ; Z_31 ].   (39)

Here, the Z_kj, k, j = 1, 2, 3, define a block partitioning of Z_∞ conformal to the partitioning in (26). Solving (12) with this approach will be referred to as the disk function method.

Note that the CARE (11) can also be solved with the disk function method by applying the iteration (38) to C_μ(H - λK) with H - λK as in (25). The solution X_c of (11) as well as the optimal gain matrix F_c can be obtained from the overdetermined but consistent set of linear equations

    [ Z_12 ; Z_22 ] X_E = - [ Z_11 ; Z_21 ],   Z_∞ =: [ Z_11  Z_12 ; Z_21  Z_22 ],

and X_c = X_E E^{-1}, F_c = -R^{-1} (B^T X_E + S^T). The linear matrix equations (18) and (19) can also be solved via the disk function method applied to (27), noting that (18), (19) are special instances of (12).
The corresponding matrix pencil then takes the form

    [ A  0 ; -Q  C^T ] - λ [ C  0 ; 0  A^T ].                                       (40)

Equations (16) and (17) are special instances of (11) and hence can be solved via the disk function method applied to C_μ(H - λK) for H - λK as in (31). Unfortunately, the iteration in (38) cannot be decomposed into iterations on the matrix blocks, so that no computational savings are obtained compared to the solution of the DARE. Hence, the computational cost for solving linear matrix equations with the disk function method is in general prohibitive. More details and computational aspects of the disk function method can be found in [4,6,7,52]. Though a general scaling strategy to accelerate convergence in (38) is not yet known, an initial scaling of H - λK in (25) can significantly improve the disk function method for CAREs; see [6].

2.4 Newton's Method

The methods presented so far have addressed the algebraic Riccati equations by their relation to eigenproblems. By nature, they are systems of nonlinear equations. It is therefore straightforward to apply methods for solving nonlinear equations. In [37], Kleinman shows that Newton's method, applied to the CARE (11) with E = I_n and properly initialized, converges to the desired stabilizing solution of the CARE. The application to the generalized equation (11) is considered in [3,42]. Given some initial guess X_0, the resulting algorithm can be stated in different ways. We have chosen here the variant that is most robust with respect to accumulation of rounding errors.

FOR k = 0, 1, 2, ... "until convergence"
1. A_k := A - G X_k E.
2. Solve for N_k in the generalized Lyapunov equation
       0 = R_c(X_k) + A_k^T N_k E + E^T N_k A_k.
3. X_{k+1} := X_k + N_k.

The main computational cost in this algorithm comes from the solution of the (generalized) Lyapunov equation in each iteration step. If Q is positive semidefinite and under the assumption used throughout this paper, i.e., X_c exists such that σ(E^{-1}(A - G X_c E)) ⊂ C^-, it can be shown that convergence to X_c is globally quadratic if X_0 is chosen such that E^{-1} A_0 is c-stable.

Remark 1. Convergence of Newton's method for algebraic Riccati equations can also be proved under slightly more general assumptions than used here [29,28]: suppose there exists a gain matrix F_c such that A_c in (9) is c-stable, that is, the underlying LTI system (1) is c-stabilizable. Furthermore, assume that G is positive semidefinite and there exists a maximal symmetric solution X_+ of (11), i.e., X_+ >= X for any other symmetric solution of (11). Then A_+ := E^{-1}(A - G X_+) has all its eigenvalues in the closed left half plane. The matrix X_+ is therefore called almost stabilizing. It is the unique solution of (11) with this property and coincides with X_c if the latter exists [38, Chapter 7]. Then the Newton iteration converges to X_+ from any stabilizing initial guess X_0 [29,38]. The convergence rate is usually linear if X_+ is not stabilizing, but quadratic convergence may still occur. A simple trick presented in [29] can improve the convergence in the linear case significantly. Analogous observations hold in the discrete-time case [28].

Finding a stabilizing X_0 usually is a difficult task and requires the stabilization of an LTI system, i.e., the solution of Task C1.
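For E = I_n, the algorithm above can be sketched as follows. The dense Kronecker-product Lyapunov solve is our simplification for clarity; the library routines described later would use the iterative schemes of Section 2.2 instead:

```python
import numpy as np

def care_newton(A, G, Q, X0=None, maxit=30, tol=1e-12):
    """Newton's (Kleinman's) method for the CARE
    A^T X + X A - X G X + Q = 0 with E = I_n. X0 must be stabilizing;
    the default X0 = 0 assumes A itself is c-stable."""
    n = A.shape[0]
    I = np.eye(n)
    X = np.zeros((n, n)) if X0 is None else X0.copy()
    for _ in range(maxit):
        R = A.T @ X + X @ A - X @ G @ X + Q       # residual R_c(X_k)
        if np.linalg.norm(R, 1) <= tol:
            break
        Ak = A - G @ X                            # Step 1: closed-loop matrix
        # Step 2: solve A_k^T N + N A_k = -R, here densely via Kronecker
        # products (row-major vec convention of NumPy)
        M = np.kron(Ak.T, I) + np.kron(I, Ak.T)
        N = np.linalg.solve(M, -R.reshape(-1)).reshape(n, n)
        X = X + N                                 # Step 3
    return X
```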
The computational cost is equivalent to one iteration step of Newton's method; see, e.g., [49] and the references therein. Moreover, an X_0 determined by a stabilization procedure may lie far from X_c. Though ultimately quadratically convergent, Newton's method may initially converge slowly. This can be due to a large error ||X_0 - X_c|| or to a disastrously bad first step leading to a large error ||X_1 - X_c||; see, e.g., [36,6,8]. Due to the initial slow convergence, Newton's method often requires too many iterations to be competitive with other Riccati solvers. Therefore it is most frequently used only to refine an approximate CARE solution computed by some other method.

Recently an exact line search procedure was suggested that accelerates the initial convergence and avoids "bad" first steps [6,8]. Specifically, Step 3 of Newton's method given above is modified to X_{k+1} = X_k + t_k N_k, where t_k is chosen in order to minimize the Frobenius norm of the residual R_c(X_k + t N_k). As computing the exact minimizer is very cheap compared to a Newton step and usually accelerates the initial convergence significantly, while benefiting from the quadratic convergence of Newton's method close to the solution, this method becomes attractive even as a solver for CAREs (at least in some cases); see [6,8,9] for details. Moreover, for some ill-conditioned CAREs, exact line search improves Newton's method also when used only for iterative refinement. Note that the line search strategy discussed in [6,8,9] also includes the trick described in Remark 1 for accelerating the linear convergence in case X_c does not exist.

Similarly, Newton's method can be applied to the DARE (12).
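The exact line search for the CARE exploits the identity R_c(X_k + t N_k) = (1 - t) R_c(X_k) - t^2 N_k G N_k along the Newton direction, so ||R_c(X_k + t N_k)||_F^2 is a quartic polynomial in t whose minimizer is a root of a cubic. A sketch (function name ours):

```python
import numpy as np

def exact_step(R, V):
    """Minimize f(t) = ||(1 - t) R - t^2 V||_F^2 over t, where R = R_c(X_k)
    and V = N_k G N_k."""
    a = np.sum(R * R)            # <R, R>_F
    b = np.sum(R * V)            # <R, V>_F
    c = np.sum(V * V)            # <V, V>_F
    # f(t)  = (1-t)^2 a - 2 (1-t) t^2 b + t^4 c
    # f'(t) = 4 c t^3 + 6 b t^2 + (2 a - 4 b) t - 2 a
    roots = np.roots([4.0 * c, 6.0 * b, 2.0 * a - 4.0 * b, -2.0 * a])
    ts = [r.real for r in roots if abs(r.imag) < 1e-10 and 0.0 < r.real <= 2.0]
    f = lambda t: (1 - t) ** 2 * a - 2 * (1 - t) * t ** 2 * b + t ** 4 * c
    return min(ts, key=f) if ts else 1.0   # fall back to the plain Newton step
```

For V = 0 (a linear residual) the minimizer is t = 1, i.e., the plain Newton step is recovered.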
The resulting algorithm is described in [33] for E = I_n and in [3,42] for E ≠ I_n. The main computation there is again the solution of a linear matrix equation, which is in this case a (generalized) Stein equation. Again, line searches can be employed to (partially) overcome the difficulties mentioned above, as they apply analogously to DAREs [6].

In both the continuous- and discrete-time case, a linear matrix equation has to be solved in each iteration step of Newton's method. Hence the key to an efficient parallelization of Newton's method is an efficient solver for the linear matrix equation in question, and we employ the iterative schemes discussed above (sign function method or Smith iteration). Note that all other computations required by Newton's method, apart from solving Lyapunov equations, basically consist of matrix multiplications and can therefore be implemented efficiently on parallel computers. The parallelization of Newton's method with exact line search, based upon solving the generalized Lyapunov equations via (32), is discussed in [9], where also several numerical experiments are reported.

3 Prospectus of the PLILCO

In this section we first describe the ScaLAPACK library [15], which is used as the parallel infrastructure for our PLILCO routines. We then describe the specific routines in PLILCO, including both the available routines and those that will be included in the near future to extend the functionality of the library.
3.1 The ScaLAPACK Library

The ScaLAPACK (Scalable LAPACK) library [15] is designed as an extension of the successful LAPACK library [2] for parallel distributed memory multiprocessors. ScaLAPACK mimics LAPACK, both in structure and notation. The parallel kernels in this library rely on the use of those in the PBLAS (Parallel BLAS) library and the BLACS (Basic Linear Algebra Communication Subroutines). The serial computations are performed by calls to routines from the BLAS and LAPACK libraries; the communication routines in BLACS are usually implemented on top of a standard communication library such as MPI or PVM.

This structured hierarchy of dependences (see Figure 2) enhances the portability of the codes. Basically, a parallel algorithm that uses ScaLAPACK routines can be migrated to any vector processor, superscalar processor, shared memory multiprocessor, or distributed memory multicomputer where the BLAS and MPI (or PVM) are available.

ScaLAPACK implements parallel routines for solving linear systems, linear least squares problems, eigenvalue problems, and singular value problems. The performance of these routines depends on that of the serial BLAS and the communication library (MPI or PVM).

ScaLAPACK employs the so-called message-passing paradigm. That is, the processes collaborate in solving the problem, and explicit communication requests are performed whenever a process requires a datum that is not stored in its local memory. In ScaLAPACK the computations are performed by a logical grid of P_r x P_c processes. The processes are mapped onto the physical processors, depending on the available number of these.
All data (matrices) have to be distributed among the process grid prior to the invocation of a ScaLAPACK routine. It is the user's responsibility to perform this data distribution. Specifically, in ScaLAPACK the matrices are partitioned into nb x nb square blocks, and these blocks are distributed (and stored) among the processes in column-major order. A graphical representation of the data layout is given in Figure 1 for a logical grid of 2 x 3 processes.

Fig. 1. Data layout in a logical grid of 2 x 3 processes.

Although not strictly part of ScaLAPACK, the library also provides routines for distributing a matrix among the process grid. The communication overhead of this initial distribution is well balanced in most medium and large-scale applications by the improvements in performance achieved with parallel computation.

3.2 Structure of PLILCO

PLILCO heavily relies on the use of the available parallel infrastructure in ScaLAPACK (see Figure 2). Although ScaLAPACK is incomplete, the kernels available in the current version (1.6) allow us to implement most of our PLILCO routines. PLILCO will benefit from future extensions and developments in the ScaLAPACK project. Improvements in the performance of the PBLAS kernels will also be especially welcome.

Fig. 2. Structure of PLILCO.

In PLILCO the routines are named, following the convention in LAPACK and ScaLAPACK, as PDxxyyzz. The PD prefix in each name indicates that this is a Parallel routine with Double-precision arithmetic. The following two letters, xx, indicate the type of LTI system addressed by the routine. Thus, GE or GG indicate, respectively, a standard LTI system (E = I_n) or a generalized LTI system (E ≠ I_n).
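The block-to-process mapping underlying Figure 1 can be stated in two lines; the following sketch (ours, illustrative of the 2D block-cyclic idea rather than a ScaLAPACK API) returns the process coordinates owning a given nb x nb block:

```python
def block_owner(ib, jb, Pr, Pc):
    """Process (row, column) owning block (ib, jb) of an nb x nb-blocked
    matrix on a Pr x Pc logical process grid: blocks wrap around the grid
    cyclically in both dimensions."""
    return (ib % Pr, jb % Pc)

# On the 2 x 3 grid of Figure 1, block (0, 0) lives on process P00,
# and block (0, 3) wraps around to P00 again.
```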
The last four letters in the name indicate the specific problem (yy) and the method employed for that problem (zz). In most cases, a sequence Cyzz indicates that the routine deals with a continuous-time problem, while a sequence Dyzz indicates a discrete-time problem.

The PLILCO routines can be classified into 4 groups according to their functionality (computation of basic matrix functions, linear matrix equation solvers, optimal control, and feedback stabilization). Two more groups will be included in the near future. We next review the routines in each of these 4 groups.

Group MTF: Basic matrix functions.
The routines in this group implement iterative schemes to compute functions of matrices or matrix pencils. For instance, three routines are available for computing the sign function of a matrix. These routines employ different variants of the matrix sign function iteration:

- PDGESGNW. The Newton iteration.
- PDGESGNS. The Newton-Schulz iteration.
- PDGESGHA. The Halley iteration.

Two more routines are designed for computing the sign function or the disk function of matrix pencils:

- PDGGSGNW. The generalized Newton iteration for the matrix sign function.
- PDGGDKMA. The iteration (38) for computing the disk function.

Note that the iteration (38) for the disk function only deals with matrix pencils, and therefore PLILCO does not provide any routine for the standard problem. The disk function of a matrix Z is obtained by applying routine PDGGDKMA to Z - λI.

Table 1 lists the PLILCO routines in the MTF group.

    Problem                        Standard      Generalized
    Matrix sign function           PDGESGNW      PDGGSGNW
                                   PDGESGNS
                                   PDGESGHA
    Matrix pencil disk function                  PDGGDKMA

Table 1. PLILCO routines in the MTF group.

Group LME: Linear matrix equation solvers.
The routines in this group are solvers for several particular instances of generalized Sylvester matrix equations; see (13)-(19).
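The PDxxyyzz convention described above makes the routine names machine-decodable; a small illustrative parser (ours, covering only the fields described in the text):

```python
def parse_plilco_name(name):
    """Split a PLILCO routine name PDxxyyzz into its fields: PD (parallel,
    double precision), xx (type of LTI system), yy (problem), zz (method)."""
    if len(name) != 8 or not name.startswith("PD"):
        raise ValueError("not a PDxxyyzz routine name: " + name)
    system = {"GE": "standard (E = I)", "GG": "generalized (E != I)"}[name[2:4]]
    return {"system": system, "problem": name[4:6], "method": name[6:8]}

# parse_plilco_name("PDGECLNW") -> standard system, problem CL
# (continuous-time Lyapunov), method NW (Newton iteration)
```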
All the solvers for those equations arising in continuous-time LTI systems are based on the matrix sign function and require that the coefficient matrices of the equation are stable. PLILCO includes three solvers for stable Sylvester equations that differ in the iteration used for the computation of the matrix sign function:

- PDGECSNW. The Newton iteration.
- PDGECSNS. The Newton-Schulz iteration.
- PDGECSHA. The Halley iteration.

For the generalized problem, a solver for generalized Sylvester equations is also included:

- PDGGCSNW. The generalized Newton iteration as in (34).

All these solvers have their analogous routines for stable Lyapunov equations (three routines) and stable generalized Lyapunov equations (one routine):

- PDGECLNW. The Newton iteration.
- PDGECLNS. The Newton-Schulz iteration.
- PDGECLHA. The Halley iteration.
- PDGGCLNW. The generalized Newton iteration as in (32).

Furthermore, in case the constant term Q is semidefinite, it is also possible to obtain the Cholesky factor of the solution directly by means of the routines

- PDGECLNC. The Newton iteration for the Cholesky factor.
- PDGGCLNC. The generalized Newton iteration for the Cholesky factor.

In the discrete-time case, the iterative solvers in PLILCO are based on the Smith iteration and require the coefficient matrices to be stable (in the discrete-time sense). So far PLILCO only includes two solvers, for the discrete-time Sylvester equation (15) and the Stein (or discrete-time Lyapunov) equation (18), respectively:

- PDGEDSSM. The Smith iteration for (15).
- PDGEDLSM. The Smith iteration for (18).
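A serial sketch of the squared Smith iteration (35) underlying these two discrete-time solvers (our dense, illustrative code) solves X = A X B + E for d-stable A and B; the Stein equation (18), A^T X A - X + Q = 0, corresponds to the call smith(A.T, A, Q):

```python
import numpy as np

def smith(A, B, E, maxit=60, tol=1e-12):
    """Squared Smith iteration (35) for X = A X B + E with d-stable A, B;
    quadratically convergent, X = lim X_k = sum_k A^k E B^k."""
    Ak, Bk, Xk = A.copy(), B.copy(), E.copy()
    for _ in range(maxit):
        Xnew = Ak @ Xk @ Bk + Xk
        Ak, Bk = Ak @ Ak, Bk @ Bk                 # square the iterates
        if np.linalg.norm(Xnew - Xk, 1) <= tol * np.linalg.norm(Xnew, 1):
            return Xnew
        Xk = Xnew
    return Xk
```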
Special versions of the Smith iteration for computing the Cholesky factors of semidefinite Stein equations as in (22) and (23) are to be developed in the future. The solution of generalized discrete-time linear equations can be obtained by transforming the equation into a standard one, as there is no generalized version of the Smith iteration. Note that this transformation involves explicit inversion of the coefficient matrices in the generalized equation. Table 2 summarizes the PLILCO routines in the LME group.

    Type of equation           Standard                        Generalized
    Sylvester                  PDGECSNW, PDGECSNS, PDGECSHA    PDGGCSNW
    Lyapunov                   PDGECLNW, PDGECLNS, PDGECLHA,   PDGGCLNW,
                               PDGECLNC                        PDGGCLNC
    Discrete-time Sylvester    PDGEDSSM
    Stein                      PDGEDLSM

Table 2. PLILCO routines in the LME group.

Group RIC: Riccati matrix equation solvers.
We include in this group solvers both for the CARE and the DARE. In the continuous-time case, PLILCO provides solvers based on three different methods, namely Newton's method, the matrix sign function, and the matrix disk function. Moreover, in the standard case, three different variants are provided for Newton's method, depending on the Lyapunov solver that is employed. Thus, we have the following CARE solvers:

- PDGECRNW. Newton's method with the Newton iteration for solving the Lyapunov equations.
- PDGECRNS. Newton's method with the Newton-Schulz iteration for solving the Lyapunov equations.
- PDGECRHA. Newton's method with the Halley iteration for solving the Lyapunov equations.
- PDGECRSG. The matrix sign function method.
- PDGECRDK. The matrix disk function method.

Similarly, we have the following generalized CARE solvers:

- PDGGCRNW.
Newton's method with the generalized Newton iterative scheme for solving the generalized Lyapunov equations.
- PDGGCRSG. The generalized matrix sign function method.
- PDGGCRDK. The matrix disk function method.

PLILCO also includes the following solvers for the DARE (two routines) and the generalized DARE (two routines):

- PDGEDRSM. Newton's method with the Smith iteration for solving the discrete-time Lyapunov equations.
- PDGEDRDK. The matrix disk function method for the DARE.
- PDGGDRSM. Newton's method with the Smith iteration for solving the discrete-time generalized Lyapunov equations.
- PDGGDRDK. The matrix disk function method for the generalized DARE.

Table 3 lists the PLILCO routines in the RIC group.

    Type of equation    Standard                Generalized
    CARE                PDGECRNW, PDGECRNS,     PDGGCRNW,
                        PDGECRHA, PDGECRSG,     PDGGCRSG,
                        PDGECRDK                PDGGCRDK
    DARE                PDGEDRSM, PDGEDRDK      PDGGDRSM, PDGGDRDK

Table 3. PLILCO routines in the RIC group.

Group STF: Feedback stabilization of LTI systems.
This group includes routines for partial and complete (state feedback) stabilization of LTI systems. The routines in this group use the linear matrix equation solvers in group LME to deal with the different equations arising in standard and generalized, continuous-time and discrete-time LTI systems. PLILCO thus includes several state feedback stabilizing routines, which differ in the linear matrix equation that has to be solved and, therefore, the iteration employed. The feedback stabilization of continuous-time LTI systems can be obtained by means of the routines:

- PDGECFNW. The Newton iteration.
- PDGECFNS. The Newton-Schulz iteration.
- PDGECFHA. The Halley iteration.
- PDGGCFNW. The generalized Newton iteration.

In the discrete-time case, the unique routine available so far is the following:

- PDGEDFSM. The Smith iteration.
Table 4 lists the names of the routines in group STF.

    Type of LTI system    Standard                        Generalized
    continuous-time       PDGECFNW, PDGECFNS, PDGECFHA    PDGGCFNW
    discrete-time         PDGEDFSM

Table 4. PLILCO routines in the STF group.

Future extensions of PLILCO will include at least two more groups:

Group MRD: Model reduction of LTI systems.
Group H2I: Computation of H2- and H∞-controllers.

4 Preliminary Results

In this section we present some of the preliminary results obtained with the PLILCO routines on different parallel architectures. Specifically, we report results for the Lyapunov equation solver PDGECLNW and the generalized Lyapunov equation solver PDGGCLNW.

As target parallel distributed memory architectures we evaluate our algorithms on an IBM SP2 and a Cray T3E. In both cases we use the native BLAS, the MPI communication library, and the LAPACK, BLACS, and ScaLAPACK libraries [2,15] to ensure the portability of the algorithms. The IBM SP2 that we used consists of 80 RS/6000 nodes at 120 MHz, with 256 MBytes RAM per processor. Internally, the nodes are connected by a TB3 high performance switch. The Cray T3E-600 has 60 DEC Alpha EV5 nodes at 300 MHz, with 128 MBytes RAM per processor. The communication network has a bidimensional torus topology. Table 5 reports the performance of the Level-3 BLAS matrix product DGEMM (in Mflops, or millions of floating-point arithmetic operations per second), and the latency and bandwidth of the communication system of each platform.

                             IBM SP2        Cray T3E
    DGEMM (Mflops)           200            400
    Latency (sec.)           30 x 10^-6     50 x 10^-6
    Bandwidth (Mbit/sec.)    90             166

Table 5. Basic performance parameters of the parallel architectures.

In both matrix equations, the coefficient matrix A is generated with random uniform entries.
This matrix is stabilized by a shift of the eigenvalues (A := A - ||A||_F I_n in the continuous-time case and A := A / ||A||_F in the discrete-time case). In case an LTI system is required, A - λE is obtained by setting A = R, E = Q^H, where Q and R are obtained from a QR factorization of A. The solution matrix X is set to a matrix with all entries equal to one, and the matrix Q is then chosen to satisfy the corresponding linear matrix equation. All experiments were performed using Fortran 77 and IEEE double-precision arithmetic (machine epsilon approx. 2.2 x 10^-16).

In our examples, the solution is obtained with the accuracy that could be expected from the conditioning of the problem. A more detailed study of the accuracy of these solvers is beyond the scope of this paper. For details and numerical examples demonstrating the performance and numerical reliability of the proposed equation solvers, see [9-14].

The figures show the Mflops ratio per node when the number of nodes is increased and the ratio n/p is kept constant. Thus, we are measuring the scalability of our parallel routines. The results in the figures are averaged over 5 executions on different randomly generated matrices. In these figures the solid line indicates the maximum attainable real performance (that of DGEMM) and the dashed line represents the performance of the corresponding linear matrix equation solver. Figure 3 reports the Mflops ratio per node for routine PDGECLNW on the Cray T3E platform and routine PDGGCLNW on the IBM SP2 platform.

Fig. 3. Mflop ratio for routine PDGECLNW on the Cray T3E with n/p = 750 (left), and routine PDGGCLNW on the IBM SP2 with n/p = 1000 (right).

Both figures show similar results.
The performance per node of the algorithms decreases when the number of processors is increased from 1 to 4 due to the communication overhead of the parallel algorithm. However, as the number of processors is further increased, the performance only decreases slightly, showing the scalability of the solvers.

5 Concluding Remarks

We have described the development of a software library for solving the computational problems that arise in the analysis and synthesis of linear control systems. The library is intended for solving medium-size and large-scale problems, and the numerical results demonstrate its performance on shared and distributed memory parallel architectures. The portability of the library is ensured by using the PBLAS, BLACS, and ScaLAPACK. It is hoped that this high-performance computing approach will enable users to deal with large-scale problems in linear control theory.

References

1. B.D.O. Anderson and J.B. Moore. Optimal Control - Linear Quadratic Methods. Prentice-Hall, Englewood Cliffs, NJ, 1990.
2. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, second edition, 1995.
3. W.F. Arnold, III and A.J. Laub. Generalized eigenproblem algorithms and software for algebraic Riccati equations. Proc. IEEE, 72:1746-1754, 1984.
4. Z. Bai, J. Demmel, and M. Gu. An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems. Numer. Math., 76(3):279-308, 1997.
5. R.H. Bartels and G.W. Stewart. Solution of the matrix equation AX + XB = C: Algorithm 432. Comm. ACM, 15:820-826, 1972.
6. P.
Benner. Contributions to the Numerical Solution of Algebraic Riccati Equations and Related Eigenvalue Problems. Logos Verlag, Berlin, Germany, 1997. Also: Dissertation, Fakultät für Mathematik, TU Chemnitz-Zwickau, 1997.
7. P. Benner and R. Byers. Disk functions and their relationship to the matrix sign function. In Proc. European Control Conf. ECC 97, Paper 936. BELWARE Information Technology, Waterloo, Belgium, 1997. CD-ROM.
8. P. Benner and R. Byers. An exact line search method for solving generalized continuous-time algebraic Riccati equations. IEEE Trans. Automat. Control, 43(1):101-107, 1998.
9. P. Benner, R. Byers, E.S. Quintana-Ortí, and G. Quintana-Ortí. Solving algebraic Riccati equations on parallel computers using Newton's method with exact line search. Berichte aus der Technomathematik, Report 98-05, Universität Bremen, August 1998. Available from http://www.math.uni-bremen.de/zetem/berichte.html.
10. P. Benner, M. Castillo, V. Hernández, and E.S. Quintana-Ortí. Parallel partial stabilizing algorithms for large linear control systems. J. Supercomputing, to appear.
11. P. Benner, J.M. Claver, and E.S. Quintana-Ortí. Efficient solution of coupled Lyapunov equations via matrix sign function iteration. In A. Dourado et al., editors, Proc. 3rd Portuguese Conf. on Automatic Control CONTROLO'98, Coimbra, pages 205-210, 1998.
12. P. Benner, J.M. Claver, and E.S. Quintana-Ortí. Parallel distributed solvers for large stable generalized Lyapunov equations. Parallel Processing Letters, to appear.
13. P. Benner and E.S. Quintana-Ortí. Solving stable generalized Lyapunov equations with the matrix sign function. Numer. Algorithms, to appear.
14. P.
Benner, E.S. Quintana-Ortí, and G. Quintana-Ortí. Solving linear matrix equations via rational iterative schemes. In preparation.
15. L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.
16. I. Blanquer, D. Guerrero, V. Hernández, E. Quintana-Ortí, and P. Ruiz. Parallel-SLICOT implementation and documentation standards. SLICOT Working Note 1998-1, http://www.win.tue.nl/niconet/, September 1998.
17. D. Boley and R. Maier. A parallel QR algorithm for the unsymmetric eigenvalue problem. Technical Report TR-88-12, University of Minnesota at Minneapolis, Department of Computer Science, Minneapolis, MN, 1988.
18. R. Byers. Solving the algebraic Riccati equation with the matrix sign function. Linear Algebra Appl., 85:267-279, 1987.
19. E.J. Davison and F.T. Man. The numerical solution of A'Q + QA = -C. IEEE Trans. Automat. Control, AC-13:448-449, 1968.
20. J.D. Gardiner and A.J. Laub. A generalization of the matrix-sign-function solution for algebraic Riccati equations. Internat. J. Control, 44:823-832, 1986.
21. J.D. Gardiner and A.J. Laub. Parallel algorithms for algebraic Riccati equations. Internat. J. Control, 54:1317-1333, 1991.
22. J.D. Gardiner, A.J. Laub, J.J. Amato, and C.B. Moler. Solution of the Sylvester matrix equation AXB + CXD = E. ACM Trans. Math. Software, 18:223-231, 1992.
23. J.D. Gardiner, M.R. Wette, A.J. Laub, J.J. Amato, and C.B. Moler. Algorithm 705: A Fortran-77 software package for solving the Sylvester matrix equation AXB^T + CXD^T = E. ACM Trans.
Math. Software, 18:232-238, 1992.
24. G.A. Geist, R.C. Ward, G.J. Davis, and R.E. Funderlic. Finding eigenvalues and eigenvectors of unsymmetric matrices using a hypercube multiprocessor. In G. Fox, editor, Proc. 3rd Conference on Hypercube Concurrent Computers and Appl., pages 1577-1582, 1988.
25. G.H. Golub, S. Nash, and C.F. Van Loan. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Trans. Automat. Control, AC-24:909-913, 1979.
26. G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, third edition, 1996.
27. M. Green and D.J.N. Limebeer. Linear Robust Control. Prentice-Hall, Englewood Cliffs, NJ, 1995.
28. C.-H. Guo. Newton's method for discrete algebraic Riccati equations when the closed-loop matrix has eigenvalues on the unit circle. SIAM J. Matrix Anal. Appl., 20:279-294, 1998.
29. C.-H. Guo and P. Lancaster. Analysis and modification of Newton's method for algebraic Riccati equations. Math. Comp., 67:1089-1105, 1998.
30. S.J. Hammarling. Numerical solution of the stable, non-negative definite Lyapunov equation. IMA J. Numer. Anal., 2:303-323, 1982.
31. G. Henry and R. van de Geijn. Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: myths and reality. SIAM J. Sci. Comput., 17:870-883, 1997.
32. G. Henry, D.S. Watkins, and J.J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. LAPACK Working Note 121, University of Tennessee at Knoxville, 1997.
33. G.A. Hewer. An iterative technique for the computation of steady state gains for the discrete optimal regulator. IEEE Trans. Automat. Control, AC-16:382-384, 1971.
34. A.S. Hodel and K.R. Polla.
Heuristic approaches to the solution of very large sparse Lyapunov and algebraic Riccati equations. In Proc. 27th IEEE Conf. Decis. Cont., Austin, TX, pages 2217-2222, 1988.
35. C. Kenney and A.J. Laub. The matrix sign function. IEEE Trans. Automat. Control, 40(8):1330-1348, 1995.
36. C. Kenney, A.J. Laub, and M. Wette. A stability-enhancing scaling procedure for Schur-Riccati solvers. Sys. Control Lett., 12:241-250, 1989.
37. D.L. Kleinman. On an iterative technique for Riccati equation computations. IEEE Trans. Automat. Control, AC-13:114-115, 1968.
38. P. Lancaster and L. Rodman. The Algebraic Riccati Equation. Oxford University Press, Oxford, 1995.
39. A.J. Laub. A Schur method for solving algebraic Riccati equations. IEEE Trans. Automat. Control, AC-24:913-921, 1979.
40. A.J. Laub. Algebraic aspects of generalized eigenvalue problems for solving Riccati equations. In C.I. Byrnes and A. Lindquist, editors, Computational and Combinatorial Methods in Systems Theory, pages 213-227. Elsevier (North-Holland), 1986.
41. A.N. Malyshev. Parallel algorithm for solving some spectral problems of linear algebra. Linear Algebra Appl., 188/189:489-520, 1993.
42. V. Mehrmann. The Autonomous Linear Quadratic Control Problem, Theory and Numerical Solution. Number 163 in Lecture Notes in Control and Information Sciences. Springer-Verlag, Heidelberg, July 1991.
43. V. Mehrmann. A step toward a unified treatment of continuous and discrete time control problems. Linear Algebra Appl., 241-243:749-779, 1996.
44. T. Pappas, A.J. Laub, and N.R. Sandell. On the numerical solution of the discrete-time algebraic Riccati equation. IEEE Trans. Automat. Control, AC-25:631-641, 1980.
45. T. Penzl.
Numerical solution of generalized Lyapunov equations. Adv. Comp. Math., 8:33-48, 1997.
46. J.D. Roberts. Linear model reduction and solution of the algebraic Riccati equation by use of the sign function. Internat. J. Control, 32:677-687, 1980. (Reprint of Technical Report No. TR-13, CUED/B-Control, Cambridge University, Engineering Department, 1971).
47. A. Saberi, P. Sannuti, and B.M. Chen. H2 Optimal Control. Prentice-Hall, Hertfordshire, UK, 1995.
48. G. Schelfhout. Model Reduction for Control Design. PhD thesis, Dept. Electrical Engineering, KU Leuven, 3001 Leuven-Heverlee, Belgium, 1996.
49. V. Sima. Algorithms for Linear-Quadratic Optimization, volume 200 of Pure and Applied Mathematics. Marcel Dekker, Inc., New York, NY, 1996.
50. R.A. Smith. Matrix equation XA + BX = C. SIAM J. Appl. Math., 16(1):198-201, 1968.
51. G.W. Stewart. A parallel implementation of the QR algorithm. Parallel Computing, 5:187-196, 1987.
52. X. Sun and E.S. Quintana-Ortí. Spectral division methods for block generalized Schur decompositions. PRISM Working Note #32, 1996. Available from http://www-c.mcs.anl.gov/Projects/PRISM.
53. P. Van Dooren. A generalized eigenvalue approach for solving Riccati equations. SIAM J. Sci. Statist. Comput., 2:121-135, 1981.
54. A. Varga. A note on Hammarling's algorithm for the discrete Lyapunov equation. Sys. Control Lett., 15(3):273-275, 1990.
55. A. Varga. Computation of Kronecker-like forms of a system pencil: Applications, algorithms and software. In Proc. CACSD'96 Symposium, Dearborn, MI, pages 77-82, 1996.
56. K. Zhou, J.C. Doyle, and K. Glover. Robust and Optimal Control. Prentice-Hall, Upper Saddle River, NJ, 1995.
ParaStation User Level Communication

Joachim M. Blum, Thomas M. Warschko, and Walter F. Tichy

Institut für Programmstrukturen und Datenorganisation, Fakultät für Informatik, Am Fasanengarten 5, Universität Karlsruhe, D-76128 Karlsruhe, Germany

Summary. PULC (ParaStation User Level Communication) is a user-level communication library for workstation clusters. PULC provides a multi-user, multi-programming communication library for user-level communication on top of high-speed communication hardware. This paper describes the design of the communication subsystem, a first implementation on top of the ParaStation communication adapter, and benchmark results of this first implementation. PULC removes the operating system from the communication path and offers a multi-process environment with user-space communication. Additionally, it moves some operating system functionality to the user level to provide higher efficiency and flexibility. Message demultiplexing, protocol processing, hardware interfacing, and mutual exclusion of critical sections are all implemented in user level. PULC offers the programmer multiple interfaces including TCP user-level sockets, MPI [CGH94], PVM [BDG+93], and Active Messages [CCHvE96]. Throughput and latency are close to the hardware performance (e.g., the TCP socket protocol has a latency of less than 9 microseconds).

Keywords: Workstation Cluster, Parallel and Distributed Computing, User-Level Communication, High-Speed Interconnects.

1. Introduction

Common network protocols are designed for general purpose communication in a LAN/WAN environment. These protocols reside in the kernel of an operating system and are built to interact with diverse communication hardware. To handle this diversity, many standardised layers exist.
Each layer offers an interface through which the other layers can access its services. This layered architecture is useful for supporting diverse hardware, but leads to high and inefficient protocol stacks. Protocols which use standardised interfaces of the operating system are unaware of superior hardware functionality and often reimplement features in software even if the hardware already provides them.

Another inefficiency is due to copy operations between kernel- and user-space and within the kernel itself. To transmit a message the kernel has to copy the data from or to user-space. The copying between protected address space boundaries often adds more latency than the physical transmission of a message. In addition, the kernel copies the data several times from one buffer to another while traversing the layers of the protocol stack. On the positive side, the traditional communication path with the kernel as single point of access to the hardware ensures correct interaction with the hardware and mutual exclusion of competing processes.

For parallel computation on clusters of workstations, many of the protocols which are designed for wide area networks are too inefficient. Therefore, cluster computing must take new approaches. The most promising technique is to move protocol processing to user level. This technique opens up the opportunity to investigate optimised protocols for parallel processing. With user-level protocols there is no need to use the standardised interfaces between the operating system and the device driver. Thus, the reimplementation of services in software which are already provided by the hardware can be avoided.
Fig. 1.1. User-level communication highway: the application bypasses the system library, the TCP/IP protocol stack, and the device driver, accessing the network hardware directly from user space.

User-level communication removes the kernel from the critical path of data transmission. Figure 1.1 shows how user-level communication shortcuts the access to the communication hardware. High-performance communication protocols are based on superior hardware features to speed up communication. Copying data between kernel- and user-space is avoided and the implementation of true zero-copy protocols is possible. These key issues minimise latency and lead to high throughput.

But user-level communication also has its drawbacks, because now the single point of access to the communication hardware, namely the kernel, is missing. Therefore many user-level communication libraries restrict the number of processes on a node to a single process. Enabling multiple processes on one node in user level raises difficulties, but also offers a lot of benefits. Once problems such as demultiplexing of messages and ensuring correct interaction between multiple processes are solved, the high-speed communication network can be used similar to a cluster with regular communication channels such as Unix sockets.

The goal of PULC is to provide a multi-user, multi-programming communication library for user-level communication on top of high-speed communication hardware. The first implementation of PULC uses the ParaStation communication adapter, which is described in section 3. Section 4 presents design alternatives and the optimisation techniques used. In section 5, this paper describes the implementation of PULC on top of ParaStation.
Performance figures for two different hardware platforms are presented in section 6. The last two sections present the conclusion and the plans for future work.

2. Related Work

There are several approaches targeting efficient parallel computing on workstation clusters. Some of them use custom hardware which supports memory-mapped communication. SHRIMP [DBDF97] builds a client-server computing environment on top of a virtual shared memory. Similar to PULC, SHRIMP offers standardised interfaces such as Unix sockets. Digital's Memory Channel [FG97] is proprietary to DEC Alphas and uses address space mapping to transfer data from one process to another. On top of this low-level mechanism Memory Channel offers MPI and PVM. Many recent parallel machines, e.g. the IBM SP2, are a collection of regular workstations connected with a high-speed interconnect.

Others use commodity hardware to implement communication subsystems. OSCAR (e.g. [JR97]) implements MPI on top of SCI cards. Fast Messages [CPL+97] and Active Messages [CCHvE96] are approaches for MPP systems ported to workstation clusters. Both offer low-latency protocols which can be used to build other communication libraries on top. As an example, the Berkeley Fast Sockets protocol [SR97] is built on top of Active Messages. Similar to PULC, it provides an object-code compatible socket interface. Its latency is about 75 microseconds and its throughput reaches 33 MByte/s on Myrinet. But in contrast to PULC it has some restrictions in the use of fork() and exec() calls.
Differently from the current PULC implementation, it provides interoperability between Fast Sockets and other applications on the same cluster, whereas PULC only provides it for out-of-cluster communication. BIP [PT97] and Myricom GM [myr] implement low-level interfaces to the Myrinet hardware. They are comparable with the PULC hardware abstraction layer but lack higher-level protocols. Gamma [CC97] builds Active Messages on top of Fast Ethernet cards and gets nearly full performance by adding a system call and building a special protocol in the Linux kernel. U-Net [WBvE97] uses Fast Ethernet and ATM to build an abstraction of the network interface. Depending on the hardware support, they use kernel or user-level communication. They have even built a memory management system to enable DMA transfer to previously unpinned pages. Such a memory management is not implemented in PULC, but could be added as soon as hardware with DMA transfer and on-board processors is used.

In the Berkeley NOW project [ACP95], GLUnix offers a transparent global view of a cluster. As in PULC, the network of workstations can be used similar to a single parallel machine. Their main focus is on Active Messages and therefore no other protocols are implemented.

3. ParaStation Hardware

The first implementation of PULC uses the ParaStation high-speed communication card as communication hardware. ParaStation is the reengineered MPP network of Triton/1 [HWTP93], an MPP system built at the University of Karlsruhe. Within a workstation cluster the ParaStation hardware is dedicated to parallel applications while the operating system continues to use standard hardware (e.g., Ethernet).
The network topology is based on a two-dimensional toroidal mesh. Table-based, self-routing packet switching transports data using virtual cut-through routing. The size of a packet can vary from 4 to 508 bytes. Packets are delivered in order and no packets are lost. Flow control is provided at link level and the unit of flow control is one packet. These features enable the software to use a simple fragmentation/defragmentation scheme. The communication processor used involves a routing delay of about 250 ns per node and offers a maximum throughput of 16 MByte/s per link.

The ParaStation hardware resides on an interface card which plugs into the PCI bus of the host system. Thus, it is possible to use ParaStation on a wide range of machines from different vendors. A more detailed description of the hardware is given in [WBT97].

4. Design of PULC

A new communication subsystem has to fulfil several requirements to be helpful for parallel computing. First, parallel computing is highly dependent on very low latency and high throughput. The performance available to the user has to be close to the hardware limits. Therefore, deep protocol stacks are deadly for parallel computing.

Second, communication hardware is getting faster and more intelligent. New approaches, such as DMA transfers and communication processors on the interface cards, enable high performance and flexible protocol processing. A new communication protocol has to be well suited for these technologies.

Third, communication libraries offer different interfaces and semantics to the programmer. Not each communication library is well suited for all users of a cluster of workstations.
Therefore, a new communication subsystem has to offer different interfaces (communication libraries). It should also be extensible for new approaches in this field.

Fourth, workstation clusters are often used by several people for parallel computing. Having user-level access to the hardware usually prohibits simultaneous use of one node by several processes. A new approach should support a multi-process environment.

Therefore the main goal was that PULC supports fine-grained parallel programming on workstation clusters while still providing the benefits of multi-process environments.

The most challenging problem in a multi-process environment is the demultiplexing of incoming messages. Generally there are three possible places where message demultiplexing can take place:

1. In the operating system: The operating system either checks the hardware periodically for pending messages or it is interrupted by the hardware when a message has arrived. The operating system unpacks the message header and stores the message data in a corresponding queue in kernel space. From the viewpoint of the kernel it doesn't matter if the message is for the currently running process or for any other process.

2. In the communication processor: Each communicating process has a memory area which is accessible by the communication hardware. The communication processor checks the header and decides where the message fragment should be stored. The number of accessible memory areas is limited, however.
To solve this problem the communication system can either limit the number of communicating processes, or it buffers the message intermediately, where the processes can access the data (in kernel space, a common message area, or a trusted process' address space).

3. In the low-level communication software in user-space: A user process periodically checks the hardware (or gets interrupted), and receives the message. If the message is not addressed to the receiving process, the process stores the message in a message pool accessible by the destination process.

In all cases the destination process executes a receive call, gets the data from the intermediate storage, and stores it into the final destination. If the final destination is known and accessible at the time of message demultiplexing, the message can be stored directly in this area. This is known as true zero-copy message reception [BBVvE95].

PULC divides message demultiplexing and message reception into two different modules. The PULC message handler demultiplexes incoming messages. This message handler can either run on the communication processor or it can be linked to each user process. The PULC interface receives the message for the process. It always runs in the address space of the communicating process. Both modules communicate by calling each other or by updating queues in a shared message area.

Another challenging task is resource management. Resources (buffers, sockets, etc.) are usually managed by the operating system. When moving the communication out of the kernel, this task can be accomplished by a regular user process. The resource manager has to control access to the hardware and clean up after application shutdowns.
In PULC, this task is performed by the PULC resource manager. Figure 4.1 gives an overview of the major parts of PULC.

Fig. 4.1. PULC architecture: the PULC interface and protocol switch, the PULC message handler, and the PULC resource manager (PSID).

PULC Programming Interface: This module acts as programming interface for any application. The design is not restricted to a particular interface definition such as Unix sockets. It is possible and reasonable to have several interfaces (or protocols) residing side by side, each accessible through its own API. Thus, different APIs and protocols can be implemented to support a different quality of service, ranging from standardised interfaces (i.e. TCP or UDP sockets) and widely used programming environments (i.e. MPI or PVM) to specialised and proprietary APIs (ParaStation ports and a true zero-copy protocol called Rawdata). All in all, the PULC interface is the programmer-visible interface to all implemented protocols.

PULC Message Handler: The message handler is responsible for handling all kinds of (low-level) data transfer, especially incoming and outgoing messages, and is the only part to interact directly with the hardware. It consists of a protocol-independent part and a specific implementation for each protocol defined within PULC. The protocol-independent part is the protocol switch which dispatches incoming messages and demultiplexes them to protocol-specific receive handlers. To get high-speed communication, the protocols have to be as lean as possible. Thus, PULC protocols are not layered on top of each other; they reside side by side. Sending a message avoids any intermediate buffering. After checking the data buffer, the sender directly transfers the data to the hardware.
The specific protocols inside the message handler are responsible for the coding of the protocol header information.

PULC Resource Manager: This module is implemented as a Unix daemon process (PSID) and supervises allocated resources, cleans up after application shutdowns, and controls access to common resources. Thus, it takes care of tasks usually managed by the operating system.

To be portable among different hardware platforms and operating systems, PULC implements all hardware and operating system specific parts in a module called the hardware abstraction layer (HAL). Choosing an interconnection network with a different quality of service would force the adaptation of the PULC message handler to the services the communication hardware provides. E.g., if the hardware doesn't provide in-order delivery, the message handler has to use the PULC functions which provide a reordering of fragments.

4.1 Resources provided by PULC

PULC supports the implementation of different protocols by offering a variety of resources together with associated interfaces to access them. The protocol-independent resources are message fragments, communication ports, semaphores, and process control blocks. A message fragment consists of a fragment control block and the message data, and several fragments are concatenated to form a message. Fragmentation is essential, because the underlying hardware has a limited packet size. Therefore, PULC fragments have fixed sizes in memory and fragments are allocated as fixed-sized memory blocks. This may waste memory, but allocating and managing variable-sized chunks of memory is time consuming. Several messages together form the message queue of a port.
The port is the basic addressable element in PULC communication. Different protocols use the ports as the channels to their communication partners. The resource manager frees a port and all fragments inside its message queue when no process is using it anymore. For the TCP/UDP protocol, another resource called socket is provided. A socket uses a port as its communication channel and stores additional socket-specific information. To know about all the resources which are allocated by a specific process, PULC keeps information about a process in a process control block. This information is used to clean up the allocated resources when the process exits.

If the PULC message handler runs on the host processor, several processes can access common resources. To ensure mutual exclusion of processes and to protect critical sections (manipulating queues or other resources), PULC provides user-level semaphores. Processor-specific atomic operations, such as test-and-set or load/store locked, are used to implement them.

For an easy implementation of the protocols, PULC offers support functions to access the resources. E.g., PULC provides routines to store fragments into the message queue of a port. There are only three different strategies to store fragments in a message queue. PULC classifies the ports and the protocol calls its appropriate routine. In general, message queues of a port can be classified in the following way:

- Single stream: All fragments are stored in a single queue disregarding any message boundaries or message sources.
- Multiple stream: All fragments of the same source are stored in a queue. Fragments of different sources are stored in different queues.
- Datagrams: Fragments of different messages and different sources are stored in different queues. Each message has its own queue.
In addition to this classification, these routines have to know if the hardware delivers the fragments of a message in order or if a reordering of the fragments is necessary. Fortunately, the ParaStation hardware provides in-order delivery. The same holds for our HAL implementation for the Myrinet card.

4.2 PSID: The PULC Coordinator

Since PULC is fully implemented in user-space, the operating system does not manage the resources. This task is done by a resource manager (PSID: ParaStation Daemon). It cleans up resources of dead processes and organises access to the message area. Before a process can communicate with PULC, the process has to register with the PSID. The PSID can grant or deny access to the message area and the hardware. The PSID also checks if the versions used by the PULC interface and the PULC message handler are compatible. The version check makes corruption of data impossible. The PSID can restrict access to the communication subsystem to a specific user or to a maximum number of processes. This enables the cluster to run in an optimised way, since multiple processes slow down application execution due to scheduling overhead.

All PSIDs are connected to each other. They exchange local information and transmit demands of local processes to the PSID of the destination node. With this cooperation, PULC offers a distributed resource management. The single system semantics of PULC is ensured by the PSIDs. They spawn and kill client processes on demand of other processes. PULC transfers remote spawning or killing requests to the PSID of the destination node. PULC uses operating system functionality to spawn and kill the processes on the local node. The spawned process runs with the same user id as the spawning process. PULC redirects the output of a spawned process to the terminal of the mother process. Therefore it offers a transparent view of the cluster.
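The registration handshake with its version check and process limit can be sketched as follows. The state layout, version constants, and function names here are invented for illustration, not PSID's actual interface:

```c
#include <assert.h>

#define PULC_VERSION_MAJOR 2   /* hypothetical; must match between interface and handler */
#define PULC_VERSION_MINOR 1

typedef struct {
    int major, minor;          /* version the daemon's message handler speaks */
    int max_procs, cur_procs;  /* configured process limit and current count */
} psid_state_t;

/* Returns 1 and counts the process if access is granted, 0 otherwise. */
int psid_register(psid_state_t *d, int cli_major, int cli_minor) {
    if (cli_major != d->major || cli_minor != d->minor)
        return 0;              /* version mismatch: refuse rather than risk data corruption */
    if (d->cur_procs >= d->max_procs)
        return 0;              /* process limit keeps scheduling overhead down */
    d->cur_procs++;            /* grant access to message area and hardware */
    return 1;
}
```

Refusing on any mismatch is the simple policy implied above: incompatible interface and handler versions never touch the shared message area.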
The PSIDs periodically exchange load information. With this information, PULC provides load balancing when spawning new tasks. Several spawning strategies are possible:
- Spawn a new task on the specified node: No selection is done by PULC. The spawn request is transferred to the remote PSID, which creates the new task. A new task identifier is returned in the result.
- Spawn a task on the next node: PULC keeps track of the node which was used to spawn the last task. This strategy selects the next node by incrementing the node number.
- Spawn a task on an unloaded node: Before spawning, PULC orders the available nodes by their load. After that, PULC spawns on the nodes with the least heavy load.

These strategies allow a PULC cluster to run in a balanced fashion, while still allowing the programmer to specify the exact node when the problem to be solved requires a specific communication pattern.

4.3 The PULC Message Handler

The PULC message handler is responsible for receiving and sending messages.

4.3.1 Sending messages. Sending a message avoids any intermediate buffering. After checking the buffer, the sender directly transfers the data to the hardware. The specific protocols inside the message handler are responsible for the coding of the protocol header information. PULC doesn't restrict the length or form of the header. PULC just specifies the form of the hardware header with its protocol id. The rest of the message header must be interpretable by the protocol specific receive handler.
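The split between the fixed hardware header and the opaque protocol header can be sketched as below. The field layout (dest_node, proto_id) is an assumption for illustration; PULC only fixes that the hardware header carries the protocol id, and leaves everything after it to the protocol:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t dest_node;   /* routing information for the interconnect */
    uint16_t proto_id;    /* selects the receive handler at the destination */
} hw_header_t;

/* Assemble hardware header + opaque protocol header + payload in one buffer,
 * as the sender would push it directly to the hardware. */
size_t build_packet(char *out, uint16_t dest, uint16_t proto,
                    const void *proto_hdr, size_t hdr_len,
                    const void *payload, size_t len) {
    hw_header_t hw = { dest, proto };
    memcpy(out, &hw, sizeof hw);                       /* the only part PULC fixes */
    memcpy(out + sizeof hw, proto_hdr, hdr_len);       /* protocol decides its own form */
    memcpy(out + sizeof hw + hdr_len, payload, len);   /* no intermediate buffering */
    return sizeof hw + hdr_len + len;
}
```

Only the receive handler registered for proto_id ever needs to interpret the bytes after the hardware header.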
If the receiver is on the local node, the message handler optimises message transfer by directly calling the appropriate receive handler of the protocol.

4.3.2 Receiving a message. If the hardware supports demultiplexing of messages, the PULC message handler runs on the communication processor of the hardware. It has some memory in common with each receiving process. The data can directly be transferred to this memory area. The first generation of the ParaStation card does not support any message demultiplexing at hardware level, and so the PULC message handler has to be part of a process and runs in the address space of its host process. During reception of a message, the PULC message handler can detect that it is not addressed to its own host process. It then has to store the message in a commonly accessible message area (SHM) where the destination process can read the message. Whether a message is received with true zero copy, or is stored intermediately, depends on the protocol used.

The PULC protocol switch reads only the hardware header of the message and the protocol identifier. After decoding the id, the protocol switch directly transfers control to the receive handler of the protocol, which reads the rest of the message. This header forwarding is extremely fast and does not make any unnecessary copy of the data. The protocols can store the data directly in user data structures, as is done in the rawdata protocol, or queue the data in a message queue (TCP, UDP, PORT-M/S/D). Other protocols can do it in their specific way.

PULC allows multiple processes to communicate concurrently, since different processes can use different communication ports.
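The protocol switch described above is essentially a dispatch table indexed by the protocol id. A minimal sketch, with invented table size, ids, and handler signature:

```c
#include <assert.h>

#define MAX_PROTOS 8

/* A receive handler consumes the rest of the message after the hardware
 * header; it may place data directly into user structures (zero copy). */
typedef int (*recv_handler_t)(const char *rest, int len);

static recv_handler_t proto_table[MAX_PROTOS];

void proto_register(int id, recv_handler_t h) {
    proto_table[id] = h;
}

/* The switch decodes only the id, then forwards the header: control moves
 * straight to the protocol's handler without copying the data. */
int proto_switch(int proto_id, const char *rest, int len) {
    if (proto_id < 0 || proto_id >= MAX_PROTOS || !proto_table[proto_id])
        return -1;            /* unknown protocol: drop */
    return proto_table[proto_id](rest, len);
}

/* Example handler in the spirit of the rawdata protocol: report the number
 * of bytes delivered directly to the user buffer. */
static int rawdata_handler(const char *rest, int len) {
    (void)rest;
    return len;
}
```

Because the switch never touches the payload, adding a protocol is just registering one more handler.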
The protocol interface and the protocol receive handler have to ensure correct cooperation while receiving a message.

In a hardware-supported PULC message handler, a shared port must reside in an area where both processes can access it. If both processes trust each other, the port can reside in a message area which is mapped into both processes. If they do not trust each other, the message handler has to protect the port in its own memory area. Both processes would then have to access the messages in the port through the message handler API. This is much slower than the solution with direct access.

4.4 PULC Interface

Each protocol in the message handler can have its own interface. The interface is the counterpart of the message handler. The message handler receives a message and puts it in the message area, whereas the interface functions get these messages as soon as they are received completely. The cooperation between the interface functions and the receive handler of the protocol includes correct locking of the port and its message queues. Correct interaction is necessary since PULC doesn't have control over the scheduling decisions of the operating system. Thus the receive handler could be in a critical section while the operating system switches to a process which conflicts with this critical section. This could destroy consistency.

A process can use several interfaces at the same time. E.g., it can use the sockets for regular communication and PULC's ability to spawn processes through the Port-M interface.

The socket interface to PULC is the same as for BSD Unix sockets. This interface allows easy porting of applications and libraries to the fast communication protocols. Destinations which are not reachable inside the PULC cluster are redirected to regular operating system calls. All communication in Unix is based on the socket interface.
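The redirection decision behind the BSD-compatible interface can be sketched as a simple routing test: destinations inside the cluster take the user-level fast path, everything else falls back to the regular kernel socket. The node-id encoding and names are invented for illustration:

```c
#include <assert.h>

typedef enum { VIA_PULC, VIA_OS } route_t;

/* cluster_nodes[] lists the node ids reachable over the high-speed network. */
route_t route_connect(const int *cluster_nodes, int n, int dest_node) {
    for (int i = 0; i < n; i++)
        if (cluster_nodes[i] == dest_node)
            return VIA_PULC;   /* fast path: user-level protocol */
    return VIA_OS;             /* transparent fallback: operating system call */
}
```

Since the decision is hidden behind the unchanged socket API, applications need not know whether a peer is inside or outside the cluster.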
By providing a compatible interface, porting applications to PULC is just a relinking. PULC sockets use specially tuned methods with caching of recently used structures. This allows extremely fast communication with minimal protocol overhead. Each socket has a port as its communication channel. The socket receive handler only knows about the ports and uses different enqueuing strategies for UDP (datagram ports) and TCP sockets (single stream ports). The socket interface provides the interaction between the communication ports and the socket descriptor. Sockets can be shared among different processes due to a fork() call and can be inherited by an exec() call. During fork(), the socket is duplicated, but both sockets share the same communication port (the count attribute of the port is incremented). Thus, both processes have access to the message queue of the socket. After an exec() and a reconnection to PULC, the sockets and the ports of the message area are inserted into the private socket and port descriptor tables. Therefore the process has access to these abstractions again.

4.4.1 Communication Libraries on Top of PULC. There are several communication libraries built on top of PULC. Most of them are just the standard Unix distributions on top of sockets. The applications which use these libraries just have to be linked with the PULC sockets. These libraries include P4 [BL92] and tcgmsg [Har91]. Others, such as PVM [BDG+93], have been changed [BWT96] in a way that they can be used simultaneously with the standard sockets. This enables a direct comparison of the operating system communication and PULC.
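The port sharing across fork() described above reduces to reference counting on the port. A simplified sketch with stand-in structures (the real socket and port records hold far more state):

```c
#include <assert.h>

typedef struct { int count; }      port_t;        /* the port's count attribute */
typedef struct { port_t *port; }   pulc_socket_t; /* each socket points at one port */

/* fork(): duplicate the socket, share the communication port. */
void socket_fork_dup(const pulc_socket_t *parent, pulc_socket_t *child) {
    child->port = parent->port;   /* same port, hence same message queue */
    child->port->count++;         /* one more user of the port */
}

/* close(): returns 1 if the port became free, so the resource manager may
 * reclaim it together with the fragments in its message queue. */
int socket_close(pulc_socket_t *s) {
    int free_now = (--s->port->count == 0);
    s->port = 0;
    return free_now;
}
```

This mirrors the rule from section 4.1 that the resource manager frees a port only when no process is using it anymore.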
The implementation shows that PVM adds a significant overhead to the regular socket communication. This isn't obvious when PVM is used with regular sockets (see section 6.). This led to a new approach [OBWT97], which optimised PVM on top of the port-D interface. PULC already provides efficient and flexible buffer management, and therefore this functionality could be eliminated in the PVM source. This PSPVM2 is still interoperable with other PVMs running on any other cluster or supercomputer. PSPVM2 views the whole PULC cluster as one single parallel system.

The PULC MPI implementation is based on MPICH. MPICH provides a channel interface which hardware manufacturers can use to port MPICH to their own communication subsystem. This channel interface is implemented on top of PULC's port-D protocol. MPICH on PULC uses PULC's dynamic process creation at startup. The implementation is well-suited for MPI-2, which supports dynamic process creation at run-time. It is possible to support MPI directly as an interface to PULC. Most of the functionality is already provided in the Port protocol.

5. Implementation

There exist two implementations of PULC, one for Intel PCs running Linux and the other for DEC Alpha workstations running Digital Unix. Both of them use the ParaStation high-speed communication card as communication hardware. As described in section 3., ParaStation offers many useful services to the software protocols, but unfortunately, it has no communication processor on board.
Thus, the implementation uses a commonly accessible shared memory area (see figure 5.1) to store messages and control information. The PULC library itself, in particular the PULC message handler, acts as the trusted base within the whole system. The library is statically linked to each application and ensures correct interaction between all parts of the system. The operating system is only invoked at system and application startup.

Fig. 5.1. ParaStation User-Level Communication

Operating system and hardware specific parts of the library are placed in a separate module (the HAL). Therefore only this module has to be changed when porting PULC to another platform. This is currently being done for the Myrinet communication card.

Since the message handler is part of each process, the message area is mapped into each communicating process. This enables the message handler to receive messages for different processes and to demultiplex them to the correct receiving port. The multi-process ability of this solution is quite expensive due to the locking of ports, as well as locking data transmission to and from the hardware.

Using a commonly accessible message area suffers from a (minimal) lack of protection. The implemented message demultiplexing implies that all communicating processes trust each other. A malfunctioning process accessing the common message area directly is able to corrupt data owned by another process and can possibly crash the system. But the risk is minimal, since the address space of Alpha processors (64 bit addresses) is approximately 2^52 times larger than the size of the message area (configuration dependent).
If a wrong address were produced once a second, a corruption of data in the message area would happen approximately every 2^27 years. On the other hand, the trusted system is open to malicious hackers with access to the cluster, but this is a tolerable disadvantage when compared to the performance benefits gained from this policy. If this lack of protection is considered harmful, PULC can be configured to allow only a specific number of processes or only a specific user access to the communication system concurrently.

6. Performance Evaluation

This section shows the efficiency of the PULC implementation. The performance of the different protocols is presented and the results are explained. Performance is measured on a cluster where each node is a fully configured workstation.

6.1 Communication Benchmark

Communication subsystems can be compared by evaluating the latency and throughput of the systems. PULC offers several interfaces and runs on several hardware/operating system environments. Our test clusters consist of two Pentium PCs (166 MHz) running Linux 2.0 and two Alpha 21164 workstations (500 MHz) running Digital Unix 4.0. PULC's results are compared with the operating system performance whenever possible.

The test consists of a pairwise exchange program to measure the throughput and a ping-pong test to measure the latency, which is calculated as the round trip time divided by two.
/* Sender code */
StartTimer()
for (i = 0; i < LOOPS; i++)
    SendMessage(buffer, size)
    ReceiveMessage(buffer, size)
end
EndTimer()

/* Receiver code */
StartTimer()
for (i = 0; i < LOOPS; i++)
    SendMessage(buffer, size)
    ReceiveMessage(buffer, size)
end
EndTimer()

Fig. 6.1. Pairwise exchange test program

In the exchange program (see figure 6.1), both processes send a message to the other and wait for the receive from the other. Therefore both processes always execute the same command. In the ping-pong test (see figure 6.2), one process sends and the other receives; after receiving the message, the receiver sends the message back to the sender.

/* Sender code */
StartTimer()
for (i = 0; i < LOOPS; i++)
    SendMessage(buffer, size)
    ReceiveMessage(buffer, size)
end
EndTimer()

/* Receiver code */
StartTimer()
for (i = 0; i < LOOPS; i++)
    ReceiveMessage(buffer, size)
    SendMessage(buffer, size)
end
EndTimer()

Fig. 6.2. Pingpong test program

Surprisingly, the slower Pentium system performs better than the Alpha system in both latency and throughput at the lower layers (see figure 6.3).

                        Alpha 21164, 500 MHz                Pentium, 166 MHz
                    ParaStation     OS/Ethernet       ParaStation     OS/Ethernet
protocol layer      lat.   bandw.   lat.   bandw.     lat.   bandw.   lat.   bandw.
                    [µs]   [MB/s]   [µs]   [MB/s]     [µs]   [MB/s]   [µs]   [MB/s]
hardware             4.2   12.4      -      -           3.4   15.6     -      -
rawdata              5.1   11.9      -      -           6.4   14.8     -      -
port-M               8.9    9.5      -      -          14.6   11.8     -      -
socket               9.0    9.6     115    1.1         13.9   11.9    308    1.0
PVM                 78.0    8.7     289    1.0        158     7.8     776    0.8
PVM (port-M)        11.5    9.4      -      -          27.2   11.5     -      -
socket (self)         -      -       -      -            -     -       57.8  30.0

Fig. 6.3. Communication Performance of the PULC system

This is due to the architectural differences between the two systems. In particular, the Alpha's capability to combine writes to the same memory location requires additional synchronisation. As the ParaStation communication interface is implemented as a FIFO buffer, memory barrier instructions (MB) are inserted after each write to the FIFO. The MB instruction itself waits for all outstanding read and write operations and thus limits the performance. In addition to the write combining bottleneck, the semaphore mechanism which is used on the Alphas is not as fast as the semaphores on the Pentium. A lock operation on the Alphas takes about 1 µs, whereas a Pentium provides mutual exclusion within 200 ns. The semaphore bottleneck is also visible in multi-process protocols.

The line titled hardware in the table above shows the performance of the hardware abstraction layer described in section 5. and reflects the maximum performance one can get using ParaStation on the stated workstation. The additional latency of 0.9 µs on the Alpha (3 µs on the Pentium) introduced by the rawdata protocol is due to guaranteeing mutual exclusion and correct interaction between concurrent processes. Multiple ports are addressed by the port protocol. This multi-programming environment adds an additional 3.8 µs (8.2 µs on the Pentium) to the rawdata protocol.
Providing full TCP socket functionality within 9 µs opens up a wide range of fine grained parallel programs on top of sockets. As reported in [BWT96], standard programming environments, such as PVM, add a huge amount of latency to the sockets. This is not noticeable when slow operating system sockets are used. When running PVM on top of PULC sockets, 89 % (91 % on a Pentium) of the latency is caused by these packages. These numbers show that these standard environments do not adapt well to high speed protocols. This led to an optimisation of PVM on top of ports. As reported, the port-M protocol already provides most of the functionality that PVM has to implement on top of sockets, e.g. a very inefficiently implemented buffer management. Using the whole functionality of PULC, PVM only adds 2.5 µs (13.4 µs on the Pentium) to the port-M protocol latency. This shows that even with standardised interfaces, PULC offers great performance.

6.2 Application Benchmark

A user doesn't focus on the pure message passing numbers. The more important fact is how the system behaves with real applications. This section presents performance measurements of the system in two different areas with two different communication libraries. First, the PVM implementation is measured with a widely used linear algebra package, and second, the NAS parallel benchmark is used to compare the system to the Cray T3E, a dedicated parallel system.

6.2.1 Linear Algebra Package on top of PVM. This test uses a linear equation solver for dense systems, called xslu, which is part of ScaLAPACK [CDD+95], a popular linear algebra package.
ScaLAPACK uses BLACS as a communication interface to different communication libraries such as MPI or PVM. In this test, PVM acts as the underlying subsystem. The test is run on up to 8 Alphas (500 MHz, 256 MB RAM, Digital Unix 4.0b) connected with the ParaStation hardware.

ScaLAPACK on 160 MBit ParaStation with PSPVM2

Problem    1 workstation   2 workstations    4 workstations    8 workstations
size (n)   MFlop           Speedup  MFlop    Speedup  MFlop    Speedup  MFlop
 3000      443             1.71      759     2.18      966     2.62     1161
 4000                      1.70      753     2.47     1093     3.23     1431
 5000                      1.85      821     2.68     1187     3.74     1656
 6000                                        2.90     1285     4.04     1789
 7000                                        3.11     1379     4.37     1939
 8000                                                          4.61     2044
 9000                                                          5.07     2247
10000                                                          5.22     2312

The table shows that on top of ParaStation the application scales well in terms of problem size and number of processors. A maximum performance of 2.3 GFlops is achieved, which compares quite well to dedicated parallel machines. Unfortunately, xslu depends on high bandwidth, and thus ParaStation with about 10 MByte/s throughput is the real bottleneck.

6.2.2 NAS Parallel Benchmark on top of MPI. The second test measures the performance of the system with the NAS Parallel Benchmark suite. This suite is widely used to compare different parallel platforms. It is based on top of MPI and runs without any source code modifications. Some tests require a power-of-two and others a square number of processors. Therefore not all columns are filled in each test.

The FT benchmark is a 3-D FFT application. MG is a multigrid benchmark. The LU benchmark does a matrix decomposition. It is the only benchmark in the NPB 2.0 suite that sends large numbers of very small (40 byte) messages.
Therefore it shows the performance of the communication subsystem for fine-grained applications. EP (embarrassingly parallel) usually shows the performance of a single node. The communication subsystem is not used frequently. IS (integer sort) sorts a number of integers in parallel. CG (conjugate gradient) and IS exchange a lot of data in huge data chunks. All of these codes require a power-of-two number of processors. The SP (pentadiagonal solver) and BT (block diagonal solver) algorithms are more coarse grained implementations. They solve three sets of uncoupled systems of equations using multipartition schemes. Both the SP and BT codes require a square number of processors.¹

As a comparison, the numbers achieved by a Cray T3E-900 are presented, which has similar processors per node. Its communication subsystem is a highly optimised three dimensional torus. The third level cache is eliminated, and therefore tests which are memory intensive run well on the T3E, while tests which can mostly run in the cache perform worse than on regular workstations. The Cray T3E provides a bandwidth of about 300 MB/s and a latency of about 2 µs at hardware level. Therefore one could expect a comparable performance for tests which do not depend on bandwidth.

¹ For a detailed description of the tests please refer to http://science.nas.nasa.gov/Software/NPB/

NAS Parallel Benchmark on ParaStation and T3E
[Table omitted: NAS Parallel Benchmark rates (Class A) for BT, CG, EP, IS, LU, MG, FT, and SP on 1, 2, 4, and 8 nodes of ParaStation and of a Cray T3E-900; BT and SP run only on square numbers of nodes.]

The table shows the results measured on ParaStation and the results taken from the NAS homepage for the T3E. Higher numbers mean better performance.

In some tests ParaStation behaves very well compared to the expensive dedicated system. Unsurprisingly, these are the test with minimal communication (EP) and the test with many small messages (LU), because the MPI latency is about the same on both systems. During the other tests, which depend on high throughput, the ParaStation system began to swap received messages to user-space due to an overflow of the message storage. This effect limited the performance, and a new version of PULC will optimise this swapping. But even with this swapping effect, the resulting numbers are better than those of other, much more expensive dedicated machines, such as the IBM SP/2, SGI Origin, and Cray T3D.²

7. Conclusion

PULC shows extremely good performance on all protocols. Many programs benefit from the high speed of the PULC library. PULC's design offers nearly the raw performance of high-speed communication cards to the user while still providing standardised interfaces. The design goal of a multi-user/multi-programming environment at full speed was reached. PULC is also easily adapted to new hardware and brings efficient parallel processing to workstation clusters.

² See the performance numbers at http://science.nas.nasa.gov/Software/NPB/NPB2Results/index.html
The presented performance results compare well with parallel systems. PULC is included in the ParaStation system, which was introduced to the market in 1996³ and is currently being ported to the Myrinet communication adapter. First results show that TCP throughput will rise to 60 MB/s, while latency will increase to about 20 µs. These are first numbers, where the message handler still runs in low level software.

The pure user-level approach in the ParaStation system showed many drawbacks which could only be resolved by introducing some security holes and performance limitations. Especially the performance of multiple processes on one node depends on the coscheduling strategies used. Unfortunately, I couldn't find a coscheduling strategy which is good for multi-threaded and interprocess communication at the same time. More research has to be done in this area.

8. Future Work

In the future, the ParaStation team will work on next-generation ParaStation hardware. Current issues for a new network design are fiber optic links, optimised packet switching, and flexible DMA engines to reach an application-to-application bandwidth of about 100 MByte/s. Similar to Myrinet, the new hardware will be able to run the message handler on board. Therefore any security hole will be eliminated.

PULC and the full ParaStation environment are going to be ported to other systems with PCI bus (e.g., Sun/Solaris, IBM-PowerPC/AIX, SGI/IRIX). PULC itself will be ported to other communication hardware. Additional interfaces and protocols, such as Active Messages, are considered for implementation as protocols inside of PULC.

³ For further information, see http://wwwipd.ira.uka.de/ParaStation.
This would give them a performance boost over the current implementations, which are built on top of sockets or ports. Furthermore, the analysis of the message demultiplexing showed that this task can be done in the OS, in the communication hardware, or in the low level software. All three cases will be implemented and evaluated.

References

[ACP95] Thomas E. Anderson, David E. Culler, and David A. Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, 15(1):54-64, February 1995.
[BBVvE95] Anindya Basu, Vineet Buch, Werner Vogels, and Thorsten von Eicken. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995.
[BDG+93] A. Beguelin, J. Dongarra, Al Geist, W. Jiang, R. Manchek, and V. Sunderam. PVM 3 User's Guide and Reference Manual. ORNL/TM-12187, Oak Ridge National Laboratory, May 1993.
[BL92] Ralph Butler and Ewing Lusk. User's Guide to the P4 Parallel Programming System. ANL-92/17, Argonne National Laboratory, October 1992.
[BWT96] Joachim M. Blum, Thomas M. Warschko, and Walter F. Tichy. PSPVM: Implementing PVM on a High-Speed Interconnect for Workstation Clusters. In Proc. of 3rd Euro PVM Users' Group Meeting, Munich, Germany, Oct. 7-9, 1996.
[CC97] G. Chiola and G. Ciaccio. GAMMA: A Low-Cost Network of Workstations Based on Active Messages. In 5th EUROMICRO Workshop on Parallel and Distributed Processing, 1997.
[CCHHvE96] Chi-Chao Chang, Grzegorz Czajkowski, Chris Hawblitzel, and Thorsten von Eicken. Low-Latency Communication on the IBM RISC System/6000 SP. In ACM/IEEE Supercomputing '96, Pittsburgh, PA, November 1996.
[CDD+95] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley.
ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance. Technical Report UT CS-95-283, LAPACK Working Note #95, University of Tennessee, 1995.
[CGH94] Lyndon Clarke, Ian Glendinning, and Rolf Hempel. The MPI Message Passing Interface Standard. Technical report, March 1994.
[CPL+97] Chien, Pakin, Lauria, Buchanan, Hane, Giannini, and Prusakova. High Performance Virtual Machines (HPVM): Clusters with Supercomputing APIs and Performance. In Eighth SIAM Conference on Parallel Processing for Scientific Computing (PP97), 1997.
[DBDF97] Stefanos N. Damianakis, Angelos Bilas, Cezary Dubnicki, and Edward W. Felten. Client-Server Computing on Shrimp. IEEE Micro, pages 8-17, January/February 1997.
[FG97] Marco Fillo and Richard B. Gillett. Architecture and Implementation of Memory Channel 2. Technical report, Digital Equipment Corporation, September 1997.
[Har91] R. J. Harrison. Portable Tools and Applications for Parallel Computers. International Journal on Quantum Chemistry, 40:847-863, 1991.
[HWTP93] Christian G. Herter, Thomas M. Warschko, Walter F. Tichy, and Michael Philippsen. Triton/1: A Massively-Parallel Mixed-Mode Computer Designed to Support High Level Languages. In 7th International Parallel Processing Symposium, Proc. of 2nd Workshop on Heterogeneous Processing, pages 65-70, Newport Beach, CA, April 13-16, 1993.
[JR97] H. Jin and W. Rehm. Performance of Message Passing and Shared Memory on SCI-Based SMP Clusters. In Proceedings of the Fifth High Performance Computing Symposium, Atlanta, Georgia, April 6-10, 1997.
[myr] The GM API.
[OBWT97] Patrick Ohly, Joachim M. Blum, Thomas M. Warschko, and Walter F. Tichy. PSPVM2: PVM for ParaStation. In Proc. of 1st Workshop on Cluster Computing, Chemnitz, Germany, Nov. 6-7, 1997.
[PT97] Loic Prylli and Bernard Tourancheau. New Protocol Design for High Performance Networking. Technical report, LIP-ENS Lyon, 69364 Lyon, France, 1997.
TOP-C: Task-Oriented Parallel C for Distributed and Shared Memory

Gene Cooperman*
College of Computer Science
Northeastern University
Boston, MA 02115
gene@ccs.neu.edu

Summary. The "holy grail" of parallel software systems is a parallel programming language that will be as easy to use as a sequential one, while maintaining most of the potential efficiency of the underlying parallel hardware. TOP-C (Task-Oriented Parallel C) attempts such a model by presenting a task abstraction that hides many of the details of the underlying hardware. DSM (Distributed Shared Memory) also attempts such a model, but along an orthogonal direction: by presenting a shared-memory model, it hides many of the details of the message passing required by the underlying hardware. This article reviews the TOP-C model and then presents ongoing research on combining the advantages of both models in a single system.

1. Introduction

This paper proposes the TOP-C model as a way to easily organize computations on DSM systems with many processors, while maintaining high concurrency. The proposed model allows the application writer to implicitly declare segments of the environment that correspond to the program objects being used.
The segments are implicit in that the application writer need only declare to TOP-C which segments are modified by a given routine. TOP-C has been successful in executing many large parallel applications [4, 8, 10, 11, 12, 17]. TOP-C is implemented as a C library, and does not require a modification of the programming language of the application. As with any C library, the TOP-C library can also be used by a C++ program. One can choose among three TOP-C libraries: for SMP (Symmetric MultiProcessing, or shared memory) architectures, for distributed memory architectures, and for a sequential architecture. The application writer may continue to use his or her favorite programming language as long as that language has an interface to C libraries. It should be noted that current high-end SMP architectures (many processors) are quite similar to DSM systems with hardware support. Hence there appears to be a gradual progression from low-latency SMP through medium-latency DSM systems, with no sharp dividing line. Accordingly, we talk about the SMP version of the TOP-C model with the intention that this also applies to DSM.

* Supported in part by NSF Grant CCR-9732330.

Section 2 describes the TOP-C model. Section 3 then motivates why the model needs to be extended when the environment uses a lot of memory. Section 4 then describes a natural way to enhance the TOP-C model by providing an application abstraction of segments. If the application program is an object-oriented C++ program, then each segment will often correspond to an object. Section 5 then describes how the enhanced TOP-C model maps onto a DSM architecture.
In particular, there is an important issue of how the multiple segments of the TOP-C environment map onto the multiple pages of a DSM system. We are still in the process of obtaining a suitable DSM, and so we have not had the opportunity to test TOP-C in this environment. Nevertheless, a paper analysis describes many of the DSM features that we expect will be necessary for TOP-C to run efficiently on top of DSM.

2. The TOP-C Model

The TOP-C model has been described in [7]. The model is sufficiently flexible to also be easily ported to interactive languages [5, 6]. The model has also been applied to metacomputing [9], due to the ease of checkpointing the current state and sending a copy of that state to a new process joining the computation. The model has been successfully used in a variety of applications [4, 8, 10, 11, 12, 17]. The model allows a single file of application code to be executed as a sequential, SMP, or distributed-memory application, simply by linking with a different library. Portability is emphasized by building on top of a POSIX threads library (for SMP) or MPI [14] (for distributed memory). MPI was chosen as a widely available message-passing standard with good efficiency. The TOP-C distribution also contains its own small, unoptimized subset implementation of MPI, allowing one to quickly set up a small, self-contained application. Further, the portability of TOP-C makes it easy to re-target to another message-passing platform, such as PVM. TOP-C is freely distributed at ftp://ftp.ccs.neu.edu/pub/people/gene/top-c/. The programming style is SPMD (Single Program, Multiple Data).
This is executed in the context of a master-slave architecture and an environment, or global state. This environment receives lazy, incremental updates, in a fashion that will be made clear later. The user interface has purposely been kept simple by restricting it to a single primary system call: masterslave(). That function requires as parameters four application functions declared by the user: set_task_input(), do_task(), get_task_output(), and update_environment(). The philosophy is to present the higher-level task abstraction to the application. This should be contrasted with lower-level interfaces that present either a message-passing abstraction or a shared-memory abstraction.

The task is the first abstraction. The first two application-defined functions, set_task_input() and do_task(), implicitly define the input-output behavior of the task. The third function, get_task_output(), returns an action to be taken, based upon the task output. The three primary actions are NO_ACTION, REDO, and UPDATE. When the application specifies the UPDATE action, the application-specific function update_environment() is called on each process (including the master). The routine update_environment() uses the task output to introduce an incremental update. Figure 2.1 illustrates the flow of control between the master and each of several slaves for a task.

[Figure 2.1. TOP-C Programmer's Model: the master generates a task with set_task_input() and sends it to a slave; the slave runs do_task() and returns the output; the master's get_task_output() then decides on an action, and an UPDATE leads to update_environment(input, output) on every process.]

A process always completes its current operation before reading a pending message for the next operation. A message from the master to a slave requesting an update to the slave's copy of the environment always takes precedence over a message specifying a new task.
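To make the four-callback interface concrete, here is a self-contained sketch of a tiny TOP-C-style application (finding the maximum of an array). The real masterslave() call, its argument marshalling, and the exact callback signatures are not reproduced here; masterslave_sketch() below is only a sequential stand-in for the library, so treat every signature as illustrative.

```c
/* Stand-ins for the TOP-C actions (illustrative, not the real header). */
enum action { NO_ACTION, REDO, UPDATE };

#define NTASKS 4
#define CHUNK  5
static int data[NTASKS * CHUNK] = { 3, 1, 4, 1, 5,  9, 2, 6, 5, 3,
                                    5, 8, 9, 7, 9,  3, 2, 3, 8, 4 };
static int env_max = -1;     /* the shared "environment" (global state) */
static int next_task = 0;

/* set_task_input(): the master hands out the next chunk index. */
static int set_task_input(int *input) {
    if (next_task == NTASKS) return 0;   /* no tasks left */
    *input = next_task++;
    return 1;
}

/* do_task(): a slave computes the local maximum of its chunk.
 * It may read the environment, but never write it. */
static int do_task(int input) {
    int m = data[input * CHUNK];
    for (int i = 1; i < CHUNK; i++)
        if (data[input * CHUNK + i] > m)
            m = data[input * CHUNK + i];
    return m;
}

/* get_task_output(): choose an action based on the task output. */
static enum action get_task_output(int output) {
    return output > env_max ? UPDATE : NO_ACTION;
}

/* update_environment(): the only routine allowed to modify the
 * environment; under TOP-C it runs on every process. */
static void update_environment(int output) { env_max = output; }

/* Sequential stand-in for TOP-C's masterslave() driver. */
static void masterslave_sketch(void) {
    int input;
    while (set_task_input(&input)) {
        int output = do_task(input);
        if (get_task_output(output) == UPDATE)
            update_environment(output);
    }
}
```

Linking the same four callbacks against the sequential, SMP, or distributed-memory TOP-C library, rather than this stand-in loop, is precisely what the model promises.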
A REDO action results in the original task input being sent back to the same slave, typically after a message to update the environment.

In addition to the task, the second key to the TOP-C model is the environment (global state). The environment, like the task, is not explicitly declared by the application. Rather, it is implicitly defined by the application routines. Each of the four application routines may read the most recent local environment. However, only update_environment() may modify the data in the environment. The environment is read and written only by the application routines, and not by any TOP-C system routine.

The most important issue for TOP-C is to allow tasks to concurrently read and make a request to modify the environment. As seen in Figure 2.1, a decision to modify the environment can only happen if get_task_output() returns an UPDATE action. This action both allows TOP-C to record at what "time" the environment was last modified, and to then call update_environment(). In the case of distributed memory, update_environment() is called on each process, including the master. In the case of shared memory or sequential code, update_environment() is called only on the master.

3. Concurrency Issues for Shared Memory

Note that for any shared-memory system (not just TOP-C), there is an inherent reader-writer problem when one thread (in this case the master) writes to a region of memory while another thread is reading the same region. The TOP-C methodology reduces this to a single-writer, multiple-reader problem. The TOP-C solution is to allow both memory operations to proceed, but to later detect the memory collision and account for it. The method is analogous to the method of "optimistic concurrency" in distributed databases. Concurrency is maintained in TOP-C in an application-specific manner.
The system provides a utility, is_up_to_date(), callable from within the application routine update_environment(). This routine determines whether the environment was modified on the master after the task input under consideration was generated on the master, and before the task output was received by the master. Any memory collisions are a special case of this more general situation, and so will also be detected. If the environment was not modified, then the application trivially attains perfect concurrency. If the environment was modified, then the application routine get_task_output() may either return a REDO action, or employ an application-specific technique to "patch" the task output to take account of the modified environment. The get_task_output() routine receives the task input, in addition to the task output, precisely to make it easier to patch the output.

The effect of this concurrency strategy is that the environment acts as a single large "page" of memory. If any task causes the page to be "touched", then all processes may have to read an update to the page. The page update is handled in a lazy manner, providing a type of latency hiding. However, the presence of only a single, atomic environment effectively means that false sharing of data is widespread within the system. This is the current state of TOP-C.

The issue of false sharing of a single monolithic environment tends to especially hurt TOP-C applications that require a shared-memory model. This occurs because of a natural dichotomy in TOP-C applications. Applications that require only a smaller amount of memory for the environment tend to run comfortably in the distributed-memory model, in which the environment is replicated among many processes.
However, applications requiring a large amount of memory for the environment will prefer a shared-memory environment. Otherwise, the cost of physical memory often makes it uneconomic to find a site with sufficient memory on each processor to allow the replication of a large environment within each process. Thus, large environments favor a shared-memory model. This software view of memory can be achieved either by an SMP architecture or by a DSM architecture on top of many workstations. The next section discusses an experimental version of TOP-C that better accommodates a shared view of memory by providing multiple pages, or segments, within the environment.

4. Multiple Segments within an Environment

In the experimental TOP-C model, the environment is replaced by multiple segments. The use of multiple segments forces us to change one command and one action in the TOP-C model: is_up_to_date() and UPDATE. All other aspects of TOP-C retain the same simplicity. Recall that the TOP-C environment is never explicitly declared. Rather, it is implicitly defined by the application programmer as those portions of memory within a slave process that are read by do_task() and that are read or written by update_environment(). (In addition, the master routines set_task_input() and get_task_output() may also read the environment.) In our implementation of segments, we retain this idea that segments are implicitly referenced, but never explicitly declared.
Since the environment is replaced by segments, the utility is_up_to_date() must be extended to include a single parameter specifying for which segments the query is being made. Currently, this parameter is specified as a string representing a set of numbers. For example, "1,3,5-7" represents segments 1, 3, and 5 through 7. Second, the command update_environment() is now used to update one or more segments. It would be possible to add an additional requirement for the application programmer to have update_environment() return a string, such as "4-8", indicating which segments are being updated. This would allow TOP-C to maintain an internal table that updates a timestamp for each segment, and then answer any application queries of the form is_up_to_date("1,3,5-7"). However, it was felt to be a simpler syntax to instead extend the UPDATE action returned by get_task_output(). Since the application programmer already must return the action UPDATE (implemented as a C constant), we now require the application programmer to instead return a parametrized action such as UPDATE("4-8") (implemented as a C function macro).

It is clear that the internal table of timestamps for each segment can be maintained only on the master process, since queries of the form is_up_to_date() and updates of the form UPDATE() both originate on the master process. As each new task originates on the master, a new task ID is issued from a monotonically increasing sequence. The timestamps for each segment are then implemented as task IDs. So an is_up_to_date() query can be answered by TOP-C simply by determining the task ID of the current task being processed by get_task_output(). That current task ID is compared with the maximum of the timestamps for each segment being queried by is_up_to_date().
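The segment-set strings and the master-side timestamp table described above can be reconstructed in a few lines of C. The string format ("1,3,5-7") and the task-ID comparison are taken from the text; the function names and internal layout are guesses at the internals, not TOP-C's actual implementation.

```c
#include <stdlib.h>

#define MAX_SEGMENTS 64

/* Per-segment timestamps, stored as task IDs (master-side table). */
static long seg_stamp[MAX_SEGMENTS];

/* Record that update_environment() ran for segment `seg` as part of
 * the task with ID `task_id` (the UPDATE("...") bookkeeping). */
static void stamp_segment(int seg, long task_id) {
    seg_stamp[seg] = task_id;
}

/* Parse a segment-set string such as "1,3,5-7" and return the
 * maximum timestamp over the named segments. */
static long max_stamp(const char *set) {
    long best = 0;
    const char *p = set;
    while (*p) {
        char *end;
        int lo = (int)strtol(p, &end, 10), hi = lo;
        if (*end == '-')                       /* a range like "5-7" */
            hi = (int)strtol(end + 1, &end, 10);
        for (int s = lo; s <= hi; s++)
            if (seg_stamp[s] > best)
                best = seg_stamp[s];
        p = (*end == ',') ? end + 1 : end;     /* skip the separator */
    }
    return best;
}

/* is_up_to_date(): true iff the current task is newer than every
 * queried segment's last update, as described in the text. */
static int is_up_to_date(const char *set, long current_task_id) {
    return current_task_id > max_stamp(set);
}
```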
Those timestamps are maintained by TOP-C in its internal table, and are the task IDs corresponding to the last update_environment() for each queried segment. If the current task ID is "newer" (larger), then TOP-C returns true. Otherwise, it returns false.

Thus, the extensions to is_up_to_date() and UPDATE() impose a minimal additional burden on the TOP-C application programmer, while providing strong benefits in the form of higher concurrency. The partition of the environment memory into segments by the application will often be a natural extension of the application. For example, large application tables or other arrays can be subdivided by partitioning the index set into equal subintervals. Object-oriented applications will often partition their environment by associating an object ID with each object, and associating a TOP-C segment with the memory used by an object. The object ID can then also be used as a segment number.

5. TOP-C over Distributed Shared Memory

Existing DSM systems primarily provide physical memory management and memory consistency. TOP-C provides memory management in the form of implicitly specified TOP-C segments, where the user is responsible for the memory organization, and the TOP-C framework provides consistency management for this memory. Therefore the functionality of TOP-C and of a DSM system intersect in the area of memory management. This section discusses the possible benefits and design of a combined system. There is not yet an implementation of the ideas in this section.

The introduction of shared memory to TOP-C introduces a new problem that was not present in the distributed-memory version of TOP-C. When the master calls update_environment(), writes on the master take effect immediately on the slave, due to the shared memory.
This is handled in SMP through a standard single-writer, multiple-reader solution by which readers may later re-read any modified segment through a REDO action. Nevertheless, this strategy also imposes a burden on the application writer in that do_task() may return a wrong answer after reading inconsistent data, but it must be guaranteed never to hang due to inconsistent data.

DSM systems can emulate the lazy updates of TOP-C under distributed memory by implementing lazy release consistency. Many DSM systems, such as TreadMarks [1], Quarks [16], and the earlier Munin [2] system, support release consistency. Release consistency allows for a weaker memory model in which an acquire operation is required before reading or writing a shared variable, and a release operation is required before another processor can acquire a shared variable. Release consistency allows initiation of a new acquire operation without waiting for pending reads to complete, and it allows a new write without waiting for pending release operations to complete. A typical implementation of release consistency is to provide two library routines, acquire and release (Tmk_lock_acquire(lock_handle) and Tmk_lock_release(lock_handle) in the case of TreadMarks), which operate on a lock handle (an integer in the case of TreadMarks). After an acquire operation, all writes by the application are noted by the DSM system until a corresponding release operation. (Interception of writes can be implemented by the UNIX system call mprotect().) If a second process acquires the same lock, then all of the modified pages will be replicated on the second process.
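The parenthetical remark about mprotect() can be made concrete with a minimal POSIX sketch of page-based write interception, the mechanism many page-based DSM systems use to note writes between acquire and release: the page starts write-protected, the first store faults, and the SIGSEGV handler records the page as dirty before unprotecting it. This is a single-process toy; a real DSM would record the faulting page in a write notice and later diff or ship it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;                     /* the "shared" page          */
static volatile sig_atomic_t dirty;    /* set when a write is caught */

/* Fault handler: note the write, then re-enable write access so the
 * faulting store can be restarted.  (mprotect() in a handler is not
 * strictly async-signal-safe, but this is the usual DSM trick.) */
static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    dirty = 1;
    uintptr_t base = (uintptr_t)si->si_addr
                     & ~(uintptr_t)(getpagesize() - 1);
    mprotect((void *)base, getpagesize(), PROT_READ | PROT_WRITE);
}

static void setup(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    sigaction(SIGSEGV, &sa, NULL);

    page = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(page, getpagesize(), PROT_READ);  /* start write-protected */
}
```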
Release consistency is typically implemented in one of two variations, which differ in how they handle write updates. The first variation is lazy release consistency. In lazy release consistency, a write update occurs only after the call to release() by the writing process, and when a second process then calls acquire() in an attempt to access the same page of memory. The second variation is eager release consistency. In this variation, modified pages are updated for all processes holding a copy of the page at the time of the call to release(). This update can be "batched" for efficiency, but the original call to release() may not be seen to complete by a second process until the second process has received the "eager" write updates.

The preferred DSM policy for TOP-C is one of lazy release consistency, in which there are no page updates seen by other processes and no page invalidations until after the call to release() and at the time of a second call to acquire(). This mimics the TOP-C memory model of lazy, incremental updates. It fits well with the TOP-C methodology, in which writes to any one TOP-C segment are likely to be infrequent.

If TOP-C were implemented on top of a DSM system, this would require appropriate calls to acquire() and release() by TOP-C to the underlying DSM system. One would call acquire() before a call to update_environment() and release() after the call. Before a call to do_task() (on a slave), one would call acquire() immediately followed by release() in order to receive the modified pages.
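The call placement just described can be rehearsed against a mock DSM. acquire() and release() below are in-process stand-ins (a real system exposes e.g. Tmk_lock_acquire()/Tmk_lock_release() over distributed pages); the mock captures only the lazy behavior: a slave sees the master's update only once it performs its own acquire().

```c
#include <string.h>

#define PAGE  64
#define NPROC 3

/* Mock DSM state: one committed page plus per-process cached copies. */
static char backing[PAGE];
static long backing_version;
static char cache[NPROC][PAGE];
static long cache_version[NPROC];

/* acquire(): pull the committed page if it is newer than our copy.
 * Lazy: propagation happens only here, on demand. */
static void acquire(int proc) {
    if (cache_version[proc] < backing_version) {
        memcpy(cache[proc], backing, PAGE);
        cache_version[proc] = backing_version;
    }
}

/* release(): publish our copy as the new committed page.
 * (Simplified: a real DSM would publish only the write notices.) */
static void release(int proc) {
    memcpy(backing, cache[proc], PAGE);
    backing_version++;
    cache_version[proc] = backing_version;
}

/* The call placement suggested in the text: */
static void master_update_environment(int master, char value) {
    acquire(master);
    cache[master][0] = value;     /* update_environment() body */
    release(master);
}

static char slave_do_task(int slave) {
    acquire(slave);               /* receive modified pages ...   */
    release(slave);               /* ... then immediately release */
    return cache[slave][0];       /* do_task() reads the env      */
}
```

The propagation cost is paid at acquire(), which is what makes this policy match TOP-C's lazy, incremental updates.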
If one has implemented multiple segments of the environment in TOP-C, one would invoke a different lock handle for each segment. It might become necessary for update_environment() to take an additional argument specifying which segment to update. TOP-C would then guarantee to call update_environment() repeatedly, once for each segment that needs to be updated.

Plans are under way to test TOP-C on top of a DSM system. The experimental version of TOP-C (using shared memory) will be tested. This will provide important feedback about merging the TOP-C shared-memory model with the shared-memory model used by DSM.

References

1. C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Vol. 29, No. 2, pp. 18-28, February 1996.
2. J. Carter, J. Bennett, and W. Zwaenepoel, "Implementation and Performance of Munin", Proc. 13th ACM Symp. on Operating System Principles, 1991, pp. 152-164.
3. R. Chow and T. Johnson, Distributed Operating Systems and Algorithms, Addison Wesley Longman, 1997.
4. G. Cooperman, "Practical Task-Oriented Parallelism for Gaussian Elimination in Distributed Memory", Linear Algebra and its Applications 275-276, 1998, pp. 107-120.
5. G. Cooperman, "GAP/MPI: Facilitating Parallelism", Proc. of DIMACS Workshop on Groups and Computation II, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 28, L. Finkelstein and W.M. Kantor (eds.), AMS, Providence, RI, 1997, pp. 69-84.
6. G. Cooperman, "STAR/MPI: Binding a Parallel Library to Interactive Symbolic Algebra Systems", Proc. of International Symposium on Symbolic and Algebraic Computation (ISSAC '95), ACM Press, pp. 126-132.
7. G.
Cooperman, "TOP-C: A Task-Oriented Parallel C Interface", 5th International Symposium on High Performance Distributed Computing (HPDC-5), IEEE Press, 1996, pp. 141-150 (software at ftp://ftp.ccs.neu.edu/pub/people/gene/top-c/).
8. G. Cooperman, L. Finkelstein, M. Tselman, and B. York, "Constructing Permutation Representations for Matrix Groups", J. Symbolic Computation 24, 1997, pp. 1-18.
9. G. Cooperman and V. Grinberg, "TOP-WEB: Task-Oriented Metacomputing on the Web", International Journal of Parallel and Distributed Systems and Networks 1, 1998, pp. 184-192; a shorter version appears in Proceedings of the Ninth IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS-97), IASTED/Acta Press, Anaheim, 1997, pp. 279-286.
10. G. Cooperman and G. Havas, "Practical Parallel Coset Enumeration", Proc. of Workshop on High Performance Computation and Gigabit Local Area Networks, G. Cooperman, G. Michler and H. Vinck (eds.), Lecture Notes in Control and Information Sciences 226, Springer Verlag, pp. 15-27.
11. G. Cooperman, G. Hiss, K. Lux, and Jürgen Müller, "The Brauer Tree of the Principal 19-Block of the Sporadic Simple Thompson Group", J. of Experimental Mathematics 6(4), 1997, pp. 293-300.
12. G. Cooperman and M. Tselman, "New Sequential and Parallel Algorithms for Generating High Dimension Hecke Algebras Using the Condensation Technique", Proc. of International Symposium on Symbolic and Algebraic Computation (ISSAC '96), ACM Press, pp. 155-160.
13. G.C. Fox, W. Furmanski, M. Chen, C. Rebbi and J. Cowie, "WebWork: Integrated Programming Environment Tools for National and Grand Challenges", Proc. of Supercomputing '95.
14. W. Gropp, E. Lusk and A. Skjellum, Using MPI, MIT Press, 1994.
15. J. Protić, M. Tomašević, V.
Milutinović, Distributed Shared Memory: Concepts and Systems, IEEE Computer Society Press, 1998.
16. M. Swanson, L. Stoller, J. Carter, "Making Distributed Shared Memory Simple, Yet Efficient", Proc. of the 3rd Int'l Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'98), pp. 2-13, March 1998.
17. M. Tselman, "Computing Permutation Representations for Matrix Groups in a Distributed Environment", Proc. of DIMACS Workshop on Groups and Computation II, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 28, L. Finkelstein and W.M. Kantor (eds.), AMS, Providence, RI, 1997, pp. 371-382.

Metacomputing in the Gigabit Testbed West

Thomas Eickermann¹ and Ferdinand Hommes²
¹ Forschungszentrum Jülich, Germany
² GMD - Forschungszentrum Informationstechnik, Sankt Augustin, Germany

Abstract. The 'Gigabit Testbed West' is one of two testbeds for the upgrade of the German scientific network to Gigabit capacity, which is planned for the year 2000. It currently uses a 2.4 Gigabit/second ATM link to connect the Research Centre Jülich and the GMD - National Research Center for Information Technology in Sankt Augustin. The testbed is the basis for several application projects ranging from metacomputing to multimedia. This contribution gives an overview of the infrastructure of the testbed and of its applications.

1 Introduction

A common definition of metacomputing - the shared use of distributed supercomputing resources - contains different topics, such as unified access to the batch systems of different computing centers [1],[2] or the simultaneous use of several supercomputers by a single application. The first approach aims to simplify access to supercomputers; the second should allow the solution of problems that could not be treated so far, or solve problems more efficiently.
The coupling of supercomputers offers a way to increase the peak CPU performance and main memory accessible by a single application. This allows e.g. particle simulations with large numbers of particles, where main memory is often the limiting resource. Even more appealing is the so-called 'heterogeneous metacomputing', which combines computers of different architecture: massively parallel computers, vector computers, or special-purpose machines like visualization servers [3].

A serious drawback is that the bandwidth and latency which are achievable over an external network - no matter if local or wide-area - can usually not compete with the performance of the internal network of a massively parallel computer. Because of that, only certain classes of applications can benefit from metacomputing. One such class is represented by so-called 'coupled fields' applications. Here, two or more space- and time-dependent fields interact with each other. An implementation of such applications can make explicit use of the performance hierarchy of the networks in the following way. The fields are distributed over the machines of the metacomputer and, for each field, a parallelization via domain decomposition can be performed. Typically, the fields have to be exchanged over the network once per simulation timestep, while the calculation of each field often requires several iterations per timestep, and communication within each iteration. This means that although the requirements for the external network can be quite high, they are usually small compared to the internal communication needs. A second class of applications benefits from being distributed over supercomputers of different architecture, because they contain partial problems that can each best be solved on massively parallel or vector supercomputers.
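The communication hierarchy argued for above can be illustrated by counting messages in a schematic coupled-fields loop. The routines below are empty stand-ins for a halo exchange on a machine's internal network and a field exchange over the wide-area link; only the loop structure is taken from the text.

```c
/* Counters standing in for actual network traffic. */
static long internal_msgs;   /* within one machine's fast network */
static long external_msgs;   /* over the wide-area testbed link   */

static void relax_locally(void)   { internal_msgs++; } /* halo swap  */
static void exchange_fields(void) { external_msgs++; } /* WAN traffic */

/* One coupled-fields run: the external link is used once per
 * timestep, the internal network once per inner iteration. */
static void simulate(int timesteps, int inner_iterations) {
    for (int t = 0; t < timesteps; t++) {
        for (int i = 0; i < inner_iterations; i++)
            relax_locally();
        exchange_fields();
    }
}
```

With, say, 8 inner iterations per timestep, the external link carries an eighth of the message count of the internal networks, which is why such applications tolerate a comparatively slow wide-area link.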
For other applications, real-time requirements are the reason to connect several machines.

2 The Gigabit Testbed West

In Germany, the network that connects research, science, and educational institutions with each other and with the rest of the Internet is operated by the DFN-Verein, an association of these institutions founded in 1984. Since 1996 this network has been based on ATM technology and allows for access capacities of up to 155 Mbit/s. An extension into the Gbit/s range on a national basis is planned for the year 2000. To prepare this transition, two testbeds have been set up in the western and southern parts of Germany. They will serve to evaluate new network technology as well as to gain experience with applications requiring bandwidths beyond the currently available 155 Mbit/s. In the area of scientific computation, such applications can be found e.g. in multimedia, in distributed access to huge amounts of data, and of course in metacomputing, which is the subject of this article.

The Gigabit Testbed West started as the first of the two German testbeds in August 1997. It is a joint project of the Research Centre Jülich and the GMD - National Research Center for Information Technology in Sankt Augustin, close to Bonn. In the first year of operation the two locations - which are approximately 100 km apart - were connected by an OC-12 ATM link (622 Mbit/s) based upon Synchronous Digital Hierarchy (SDH/STM-4) technology. In August 1998 this link was upgraded to OC-48 (2.4 Gbit/s). The connection is provided by o.tel.o Service GmbH and uses the optical fiber infrastructure inside the power lines of the German power supplier RWE AG.
In the framework of a beta test, Fore Systems ATM switches (ASX-4000) were used to connect the local networks of the research centers to the OC-48 line. Initial stability problems that were observed during the test turned out to be related to signal attenuation and timing. Those problems have been solved, and both the SDH link and the switches are in stable operation now.

The application projects that use the testbed can rely on a solid base of installed supercomputer capacity. Jülich is equipped with 512-node Cray T3E-600 and 256-node Cray T3E-900 massively parallel computers and a 16-processor Cray T90 vector computer. An IBM SP2 and a 12-processor SGI Onyx2 visualization server are installed at the GMD. Besides several institutes in the research centers in Jülich and Sankt Augustin, other institutions participate in the testbed with their applications. These are the Alfred Wegener Institute for Polar and Marine Research (AWI), the German Climate Computing Center (DKRZ), the Universities of Cologne and Bonn, the National German Aerospace Research Center (DLR) in Cologne, the Academy of Media Arts in Cologne, and the industrial partners Pallas GmbH and echtzeit GmbH.

3 Supercomputer connectivity

A key factor for the success of metacomputing activities are communication networks that provide high-bandwidth and low-latency connections between the components of the metacomputer. Compared to the networking equipment for WAN backbones, ATM connectivity for supercomputers has evolved quite slowly.
While 622 Mbit/s interfaces are now available for all common workstation platforms, solutions are still outstanding for the major supercomputers used in the Testbed West. For the Cray T3E, the Cray T90, and the IBM SP2, only 155 Mbit/s interfaces are available (and will be for the foreseeable future). For the SGI Onyx 2, a 622 Mbit/s ATM interface is expected to be available in early 1999. Therefore a different solution had to be found to connect the supercomputers in Jülich and Sankt Augustin to the testbed.

The best performing network connection of the Cray supercomputers is the 'High Performance Parallel Interface' (HiPPI), which offers a peak performance of 800 Mbit/s when a low-level protocol and large transfer blocks (1 MByte or more) are used. Even with TCP/IP communication, transfer rates of more than 400 Mbit/s can be achieved within the local Cray complex in Jülich. This is mainly due to the fact that HiPPI networks allow IP packets of up to 64 KByte (MTU size). One way to interconnect IP networks based on HiPPI and ATM technology is to use the ATM/HiPPI gateway by Ascend Communications. A serious limitation of this solution is that on the HiPPI side only MTU sizes up to 9182 Byte are supported. Therefore we followed a different approach. A workstation is equipped with a Fore Systems 622 Mbit/s ATM interface and a HiPPI interface and acts as an IP router between the HiPPI and the ATM network. Since the Fore ATM adapter supports large MTU sizes, IP packet sizes of 64 KByte are possible on each part of the network. We are currently using an SGI O200 and a Sun Ultra 30 as dedicated routers for the Cray systems in Jülich.
A similar solution was chosen to connect the IBM SP2 in Sankt Augustin to the testbed. Eight SP nodes are equipped with 155 Mbit/s ATM adapters and one with a HiPPI interface. The ATM adapters are connected to the testbed via a Fore ASX-1000. The HiPPI network is routed by a Sun E5000, which also has a Fore 622 Mbit/s ATM adapter. Preliminary measurements show a throughput of more than 370 Mbit/s between the Cray T3E in Jülich and the IBM SP2 in Sankt Augustin.

The layout of the network as of September 1998 is depicted in Figure 1. Throughput values that were measured in that network with various hardware are shown in Table 1. The delay that is introduced by the 100 km SDH/ATM line is about 0.9 msec. This value is still below the delay introduced by the operating systems of the T3E (~3 msec) and the SP2 (~2 msec), which were measured with ping-pong tests in local networks.

Fig. 1. Configuration of the Gigabit Testbed West in summer 1998. Jülich and Sankt Augustin are connected via a 2.4 Gbit/s ATM link. The supercomputers are attached to the testbed via HiPPI-ATM gateways, several workstations via 622 or 155 Mbit/s ATM interfaces. [Network diagram not reproduced.]

Table 1.
TCP throughput in ATM classical IP networks

                             adapter [Mbit/s]   throughput [Mbit/s]
Sun Ultra 60, Solaris 2.6         622                530
Sun E5000, Solaris 2.6            622                501
SP2, Thin node, AIX 4.1.5         155                118
T3E-900                           155                115
Onyx 2, IRIX 6.4                  155                126

Fig. 2. VAMPIR timeline display of a metacomputing application running on two SP2 and two T3E nodes. The horizontal axis is the execution time; each horizontal bar represents a processor. Light parts of the bars depict calculations, dark parts MPI communication. The black lines represent MPI messages. [Screenshot not reproduced.]

4 Tools

To make metacomputing usable for a broader range of users, the availability of at least a minimum set of tools is mandatory. Most important is a metacomputing-aware communication library. In the Gigabit Testbed West it was decided to rely mainly on MPI [4], which has become the de-facto standard on distributed-memory parallel computers. A couple of features that are useful for metacomputing applications are part of the MPI-2 [5] definition: dynamic process creation and attachment can, e.g., be used for realtime visualization or computational steering; language interoperability between C and FORTRAN is needed to couple applications that are implemented in different programming languages. When the project started, no metacomputing-aware MPI-2 implementation was available (this is still true today, except for the LAM implementation, which implements the dynamic features of MPI-2 on workstation clusters [6]). Therefore such a development was assigned to Pallas GmbH. A first prototype was finished in September 1998. Until then, the PACX-MPI library developed by the University of Stuttgart was used [7]. It supports a subset of MPI-1 and allows the coupling of Cray T3Es.
This library has been ported to the IBM SP2 and optimized for high-speed networks by the project partners in Jülich and Sankt Augustin. For MPI point-to-point communication, throughput values of 73 Mbit/s with a latency of 6 msec have been observed between the Cray T3E in Jülich and the IBM SP2 in Sankt Augustin. For those measurements the 155 Mbit/s ATM interfaces were used. First experiments with the HiPPI/ATM gateway show significant improvements compared to that value.

Also important are tools for performance evaluation and tuning. For message-passing applications, VAMPIR [8] is a well-known product. It was developed at the Research Centre Jülich and is now distributed by Pallas GmbH. For use in this project, VAMPIR has been extended by some metacomputing features. Tracefiles that have been created on the different machines of the metacomputer can be synchronized, merged and visualized in the timeline display. Figure 2 shows an example. A wrapper library for the instrumentation of PACX-MPI applications for use with VAMPIR was also developed.

No attempt has been made to develop a meta-debugger. With PACX-MPI, messages that are exchanged between the machines can be traced. For other problems, parallel debuggers like Totalview have to be used separately on each machine.

5 Applications

A couple of application subprojects that touch different aspects of metacomputing have been defined within the Gigabit Testbed West. In the following, the aims and the status in summer 1998 of each application are described briefly. More details will be presented in separate publications.
5.1 Solute transport in ground water

A typical 'coupled fields' scenario is the transport of solutants in ground water. The interacting fields are the velocity of the ground water flow and the concentrations of the solutants. Two independent programs that perform this kind of 3-D simulation have been developed at the Institute for Petroleum and Organic Geochemistry at the Research Centre Jülich. The program TRACE (Transport of Contaminants in Environmental Systems) simulates the flow of water in variably saturated, porous, heterogeneous media. It uses a finite-element discretization of the model equations and has been parallelized at the Central Institute for Applied Mathematics in Jülich based on a domain decomposition [9]. It is coded in FORTRAN 90 and uses MPI. The C++ program PARTRACE (PARticle TRACE) performs the simulation of the solutants using a Monte Carlo method.

In their original versions, the programs could only 'communicate' via files. TRACE simulates the water flow until a stationary flow evolves and writes the resulting fields into a file, which is then used as input for the particle simulation done by PARTRACE. It was considered a serious restriction of this approach that the simulation of particle transport is limited to stationary flows. To resolve this limitation, the applications were coupled using PACX-MPI. Each of them now runs in its own MPI communicator, and the water flow fields are exchanged via message passing. Currently, TRACE is run on the T3E in Jülich and PARTRACE on the SP2 in Sankt Augustin. In a typical run, 10 MBytes are transferred over the testbed at the beginning of each timestep.
With one timestep taking approximately 2 seconds, this results in a moderate average network load. Nevertheless, the peak rates are much higher, since all data are transferred in a single burst. Currently, work is under way to improve the performance and scalability of both applications. This will also result in increasing network requirements. Furthermore, it is planned to implement an online visualization of the computation.

5.2 MEG analysis

An application that can benefit from heterogeneous metacomputing emerges from the analysis of magnetoencephalography data. The magnetic field around a human head is measured with an array of superconducting quantum interference devices (SQUIDs). From these data, the distribution of electric currents in the brain can be reconstructed by solving an inverse problem. In Jülich, this is done with the 'Multiple Signal Classification' (MUSIC) algorithm [10]. With MUSIC, parameters of a finite number of current dipoles are obtained in three phases [11]:

- The number of dipoles is estimated using statistical methods that separate signal from noise.
- The positions of the dipoles are calculated. This is done by finding the extrema of a function that measures how well a dipole placed at a given location is able to reproduce the signal estimated in the former step.
- In the last step, the time evolution of dipole strength and orientation are calculated.

The second phase is the most time consuming but can be implemented very efficiently on a massively parallel computer. The first phase is better suited for a vector computer.
The reason for this is that it involves operations on matrices that are too small to be parallelized efficiently (typically 360x360). Separate measurements of a parallel program that implements the MUSIC algorithm on the T3E and the T90 confirm this. As soon as our MPI-2 implementation is able to couple those machines, a distributed version of the program should be able to achieve an overall execution time that is below the time needed on either the T3E or the T90.

5.3 Realtime fMRI

Another experiment in Jülich that deals with brain activity is based on functional Magnetic Resonance Tomography (fMRI) [12]. Here a test person is exposed to, e.g., periodic visual or acoustic stimulations. The areas of brain activity are identified by fitting the parameters of a model of the expected response of the brain to the MRI data. This not only improves the sensitivity of the measurement compared to simpler correlation methods but also allows those models to be checked.

Head movements of the test person tend to produce artefacts in the detected activity. Therefore it is essential to correct for those movements. In order to allow interactive response by the experimentalist, all this should be done and visualized in real time. It is planned to implement this with the following setup. The raw data is transferred from the MRI scanner to the T3E, where it is processed. The resulting functional data is handed over to an SGI Onyx 2 at the GMD in Sankt Augustin. This machine creates an interactive 3-D representation of the brain on a Responsive Workbench that is again located in Jülich. For that purpose, two stereo images have to be transferred over the gigabit testbed.
In order to allow interactive movement and slicing by a person operating the workbench, these images have to be updated several times a second. Currently only a simple 2-D visualization of the processed data is implemented. This setup is sketched in Figure 3. It should be noted that a similar application has recently been demonstrated by the Pittsburgh Supercomputing Center [13].

Fig. 3. Setup of the fMRI experiment. The raw scanner data are transferred through a front-end workstation to the T3E, where they are processed. From there, anatomical and functional brain images are transferred either to a workstation with a simple 2-D display or over the testbed to an Onyx 2 at the GMD. The rendered images are sent back over the testbed to a Responsive Workbench in Jülich. [Diagram not reproduced.]

5.4 Distributed climate and weather models

A second 'coupled fields' application in the gigabit testbed deals with the distributed calculation of climate and weather models. Here, the Alfred Wegener Institute (AWI), the German Climate Computing Center (DKRZ) and the GMD will use the supercomputers in Jülich and Sankt Augustin for a coupled simulation of atmospheric processes and the ocean-ice system. There are two main differences to the ground water scenario. One is that here the fields interact only at a 2-D interface, the ocean surface, whereas water and solutants interact in the full 3-D simulation domain. This reduces the amount of data to be exchanged. Nevertheless, shorter simulation times for a single timestep and higher model resolution lead to similar total bandwidth requirements.
The second difference is that in the ground water case data flows in one direction only; there is no feedback from the solutants to the ground water flow. In contrast to that, both the ocean and the atmosphere models need the fields from the other model as boundary conditions. Because of that, peak bandwidth and latency of the network are much more critical here than in the ground water problem.

5.5 Distributed fluid-structure interaction

A more general approach to 'coupled fields' type problems is pursued in the EC-funded project CISPAR. The idea there is to use well-established commercial computational fluid dynamics (STAR-CD) and structural mechanics codes (PAM-SOLID, PERMAS) for problems that involve the interaction of a fluid with flexible structures. Examples of such problems are artificial heart valves, torque converters or ships. A standard interface for those codes as well as a coupling library (COCOLIB) have been developed by the GMD and industrial project partners. Within the Gigabit Testbed West, COCOLIB will be ported to the metacomputer after the end of the CISPAR project in 1999.

5.6 New networks and applications

The testbed is currently being extended by connecting new sites to the original link between Jülich and Sankt Augustin and by defining new applications that use those extensions. A dark fibre that links the National German Aerospace Research Center (DLR) and the University of Cologne to the GMD has just been set up. This line will be used for projects that range from distributed traffic simulation and visualization to distributed virtual TV production (in cooperation between GMD, DLR, the Academy of Media Arts in Cologne, and echtzeit GmbH). The latter relies on the results of a multimedia project that evaluates components for studio-quality digital video transmission over ATM in the testbed.
A new 622 Mbit/s ATM link between the University of Bonn and the GMD will be the basis for metacomputing projects that deal with multiscale molecular dynamics and lithospheric fluids. Here the PARNASS cluster [14] of the Institute for Applied Mathematics of the University of Bonn is connected to the IBM SP2 and the Cray T3E.

6 Conclusion

This contribution gave an overview of the metacomputing activities in the Gigabit Testbed West. The underlying 2.4 Gbit/s SDH and ATM technology for the wide area backbone seems to be mature, a necessary condition for the upgrade of the German scientific network that is planned for the year 2000. In contrast to that, the networking capabilities of the supercomputers that are attached to the testbed have to be improved. The concept of a HiPPI/ATM gateway seems to be promising. A couple of applications that deal with various aspects of metacomputing are using the infrastructure of the testbed. Their results should enhance our understanding of the conditions under which distributed high-performance computing is feasible.

7 Acknowledgements

Most of the activities that are reported in this contribution are not the work of the authors but of several persons in the institutions that participate in the Gigabit Testbed West project. The authors wish to thank D. Conrads, W. Frings, D. Gembris, T. Graf, R. Niederberger, S. Posse, M. Sczimarowski, and H. Vereecken from the Research Centre Jülich; U. Eisenblätter, H. Grund, W. Joppich, G. Göbbels, M. Göbel, M. Kaul, E. Pless, R. Völpel, K. Wolf, P. Wunderling, and L. Zier at the GMD; W. Hiller and T. Störtkuhl at the AWI; V. Gülzow at the DKRZ; and J. Henrichs and K. Solchenbach at Pallas GmbH, to mention but a few. We also wish to thank the BMBF for partially funding the Gigabit Testbed West and the DFN for its support. Special thanks to the University of Stuttgart for the PACX/MPI library.

References
1. Erwin, D., The UNICORE Architecture and Project Plan, Workshop on Seamless Computing, ECMWF, Reading, September 16-17, 1997.
2. Sander, V., High Performance Computer Management, Workshop Hypercomputing, Rostock, September 8-11, 1997.
3. Eickermann, Th., Henrichs, J., Resch, M., Stoy, R., and Völpel, R., Metacomputing in gigabit environments: Networks, tools, and applications, Parallel Computing 24, p. 1847-1872, 1998.
4. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, University of Tennessee, http://www.mcs.anl.gov/mpi/index.html, 1995.
5. Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, University of Tennessee, http://www.mcs.anl.gov/mpi/index.html, 1997.
6. Burns, G.D., Daoud, R.B., Vaigl, J.R., LAM: An Open Cluster Environment for MPI, Supercomputing Symposium '94, Toronto, Canada, June 1994.
7. Beisel, T., Gabriel, E., Resch, M., An Extension to MPI for Distributed Computing on MPPs, in Marian Bubak, Jack Dongarra, Jerzy Wasniewski, Eds., Recent Advances in Parallel Virtual Machine and Message Passing Interface, p. 75-83, Springer-Verlag Berlin Heidelberg, 1997.
8. Nagel, W.E., Arnold, A., Weber, M., Hoppe, H.C., Solchenbach, K., VAMPIR: Visualization and analysis of MPI resources, Supercomputer 63, Vol. XII, no. 1, p. 69-80, 1996.
9. Wimmershoff, R., Entwicklung und Implementierung einer dreidimensionalen Partitionierungsstrategie für das Programm TRACE auf einem massiv parallelen Rechner. Technical Report Forschungszentrum Jülich, Jül-3157, 1995, in German.
10. Mosher, J.C., Lewis, P.S., and Leahy, R.M., Multiple Dipole Modeling and Localization from Spatio-Temporal MEG Data. IEEE Trans. Biomed. Eng. 39, p. 541-557, 1992.
11. Beucker, R. and Schlitt, H.A., Objective Signal Subspace Determination for MEG, Forschungszentrum Jülich, ZAM, FZJ-ZAM-IB-9715, 1997.
12.
Ogawa, S., Lee, T.M., Kay, A.R., Tank, D.W., Brain magnetic resonance imaging with contrast depending on blood oxygenation. Proc. Natl. Acad. Sci. USA 87, p. 9868-9872, 1990.
13. Goddard, N.H., Hood, G., Cohen, J.D., Eddy, W.F., Genovese, C.R., Noll, D.C., and Nystrom, L.E., Online Analysis of Functional MRI Datasets on Parallel Platforms. Journal of Supercomputing, 11, p. 295-318, 1997.
14. Griebel, M., Zumbusch, G., Parnass: Porting gigabit-LAN components to a workstation cluster, in W. Rehm, Ed., Proceedings of the 1st Workshop Cluster-Computing, held November 6-7, 1997, in Chemnitz, Chemnitzer Informatik Berichte, CSR-97-05, p. 101-124, 1997.

High Performance Metacomputing in a Transatlantic Wide Area Application Testbed

Edgar Gabriel, Michael Resch, Paul Christ, Alfred Geiger, and Ulrich Lang

High Performance Computing Center Stuttgart
Allmandring 30, D-70550 Stuttgart, Germany
{gabriel, resch}@hlrs.de

Abstract. During the last couple of years, a wide variety of tools and libraries have been developed to enable distributed computing and visualisation. This paper presents the technical background and the results of such a project, meant to couple different computational resources. A metacomputing implementation of MPI called PACX-MPI was used to make the applications run on such a cluster. Three applications were used for demonstration purposes. These applications had to be adapted for metacomputing to make them more latency tolerant.

1 Introduction

In 1997 the HLRS was involved in two transatlantic projects in the frame of the G7 Global Information Society Initiative "Global Interoperability of Broadband Networks" (GIBN). One, by PSC and HLRS, was focusing on the application aspect of metacomputing. The other, by SNL and HLRS, was concentrating on distributed visualization in a virtual laboratory.
During the first project phase it became clear that the projects should be merged into a Global Wide Area Application Test-bed (G-WAAT). This would allow simulation and visualization to be coupled in a metacomputing scenario. The main targets of the merger were:

- To set up a production test-bed for metacomputing applications and distributed visualization.
- To combine supercomputing forces in order to solve much larger problems than any of the partners could solve on their own resources.
- To integrate software components in order to establish a collaborative simulation and visualization environment.

In a first step this meant setting up a network connection fast enough to allow distributed simulation and visualization. Second, it was necessary to find a communication software that enables metacomputing for one single application. Third, applications had to be adapted to be able to fully exploit the provided metacomputer. Fourth, distributed visualization software had to be adapted and extended. In response to these needs a transatlantic network connection was set up. The communication issue was resolved by implementing a completely new communication library based on the MPI standard. An existing collaborative visualisation software was extended and improved [11].

The concept of the paper is as follows. The technical details of the test-bed are described in section 2. Section 3 presents a library which enables message passing even between different Massively Parallel Processing systems (MPPs) or Parallel Vector Processors (PVPs). The results achieved during Supercomputing '97 in San Jose and Supercomputing '98 in Orlando using several applications are presented in section 4. A brief overview of future work in this field is given in section 6.
2 A Transatlantic Network Connection

For sufficient network throughput for metacomputing applications and collaborative working, the most relevant network Quality of Service (QoS) requirements are small and constant delays and nearly no packet losses.

Measurements taken by HLRS on the standard Internet connection which is provided and shared by the German DFN community, including a transatlantic link of 2*45 Mbps shared bandwidth, showed that the available QoS between HLRS and the US had a strong variance. During the working hours in Europe and the eastern part of the USA, the packet losses varied between 10% and 40%, resulting in varying TCP throughputs which were not sufficient for effective metacomputing and cooperative work.

Therefore a dedicated transatlantic test-bed was established connecting the two Cray T3Es at HLRS and PSC, based on a dedicated 2 Mbps ATM channel. For the Supercomputing '97 event, this network was extended to Sandia National Laboratories, Albuquerque, New Mexico, and to San Jose. For the Supercomputing '98 event, the dedicated transatlantic ATM was rebuilt again based on a dedicated 10 Mbps ATM channel. Figure 1 shows the geographic extension and the participating network providers of the transatlantic metacomputing environment during SC'98 (see also http://www.hlrs.de/news/events/1998/sc98/).

2.1 Network Performance Measurements

With respect to latency, comparing the cross-atlantic standard path provided by DFN and the dedicated ATM link, it was interesting to note the effect due to the number of routers involved and the translation of packet losses into additional delays.
The results achieved on the network connection between a test workstation at HLRS and the Cray T3E in Pittsburgh over the standard path and the dedicated link are depicted in the following table.

Fig. 1. Network for transatlantic metacomputing demonstrations during SC'98. [Network map not reproduced; it shows the SC'98 booths in Orlando (HLRS, European, iGRID and NEC booths), the transatlantic ATM PVC and ATM OC-3 lines via StarTAP Chicago, an IP tunnel, and sites in Pittsburgh, Manchester, Stuttgart and Jülich, with a planned link to CERFACS, Toulouse.]

Table 1. TCP throughput and network QoS between HLRS and PSC on standard Internet and dedicated ATM link, Summer 1997.

Connection  Bandwidth  No. of    TCP throughput(3)  Packet losses   Delay [ms]
            [Mb/s]     routers   [kB/s] day/night   [%] day/night   day/night
DFN         2*45       15        50 / 300           30 / 3          180(1) / 160
ATM link    2          4         200 / -(4)         0 / 0           150(2) / -(4)

(1) Average value (variation between 160 and 300 ms). (2) Variation between 150 and 155 ms. (3) The socket buffer used was 64 kB. (4) No tests done.

As already mentioned, the network performance of the standard path is strongly influenced by the European and US working hours. During a small time window in the early European morning hours, the packet loss and packet round-trip time were acceptable and a TCP throughput of approx. 300 kByte/s was achievable. However, during the daytime, the IP packet losses (measured with a packet size of 1 kByte) degraded the TCP throughput to less than 50 kByte/s. The mean packet round-trip time on the standard path ranged from 160 to 300 ms.
On the dedicated ATM link there were practically no packet losses (during SC'97 a small number of packet losses appeared during the changeover from CANARIE's ATM network to CA*Net II), with a nearly constant round-trip time of 150 ms. This good link performance resulted in a constant TCP throughput of 200 kByte/s, which is the maximum throughput available on an ATM link with 2 Mbit/s bandwidth. The higher number of routers on the standard path introduced a relatively small latency, so in the case of a small load, as seen during European night-time hours, the round-trip time on the standard path is comparable to that of the direct ATM link. Figure 2 shows a comparison of the network delay and packet losses during a 24-hour period over the standard path and the dedicated ATM link. The data on the dedicated ATM link was captured during SC'97, the data on the standard path some time after SC'97.

Fig. 2. Round trip time and packet loss on the standard path (DFN) and the direct ATM link during a 24-hour period. [Plots not reproduced: packet loss [%] and mean RTT [ms] vs. time of day, measured on 18.11.97 and 21.11.97.]
As is well known, the TCP performance on links with large bandwidth-delay products is strongly dependent upon the TCP window size, which is configured on the end systems through the TCP socket buffer sizes. On the 2 Mbit/s ATM link a socket buffer size of 64 kByte was sufficient for maximum TCP throughput.

3 Interoperable MPI

For metacomputing the question of communication is a crucial one. The library should be able to fully exploit the fast network of each single machine in the metacomputing scenario. At the same time it should be able to support the full communication functionality between different machines that an application requires. PACX-MPI (PArallel Computer eXtension MPI) was designed to enable message passing inside and across the boundaries of an MPP. To realize this goal, PACX-MPI has to distinguish messages which remain inside a machine, in this context called internal communication, and messages which have to be transferred to another MPP. The latter will be called external communication.

For the internal communication, PACX-MPI uses the vendor-implemented MPI library, since this is nowadays the only optimized and portable protocol which is available on every system and which can fully exploit the capabilities of the underlying network. For the external communication PACX-MPI should use a standard protocol, and the decision was to implement TCP/IP as a first protocol.

To avoid each application node having to open a socket connection to a node on a different machine when communicating, two so-called daemon nodes have been introduced. These two nodes take care of outgoing and incoming messages, respectively, and are therefore transparent for the application.
Since PACX-MPI has the goal of supporting the whole MPI 1.2 standard, problems like the configuration of a global communicator had to be solved. Figure 3 explains the global configuration of MPI_COMM_WORLD on a metacomputer consisting of two machines.

[Diagram omitted: two machines with daemon nodes, showing global and local node numbers within MPI_COMM_WORLD.]

Fig. 3. MPI_COMM_WORLD on a metacomputer consisting of two MPPs

On the left machine, which shall be the machine with the number one, the first two nodes with ranks 0 and 1 are not part of MPI_COMM_WORLD, since these are the daemon nodes. The next node with the rank 2 is therefore the first node in our global communicator and gets the global rank number 0. All other application nodes get a global number according to their local ranks minus two; the last node on this machine has the global rank 3. On the next machine, the daemon nodes again are not considered in the global MPI_COMM_WORLD. The node with the local rank 3 is number 4 in the global communicator, since the numbering on this machine starts with the last global rank on the previous machine plus one. Introducing this renumbering and mapping of local pids to global ones, one gets a global MPI_COMM_WORLD without losing the local information.

3.1 Point-to-point operations in PACX-MPI

A point-to-point operation in PACX-MPI can be briefly described as follows. The sender first has to check whether the receiving node is on the same machine or not. If it is on the same machine, it can directly send the message to the receiving node using native MPI commands. If it is on a different machine, as in the example of Figure 4, it has to create a header first, which contains all the information needed to identify a message, and then a data package. Both packages are sent to a daemon node.
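The renumbering described above can be sketched as a small mapping function (our illustration, not the actual PACX-MPI source): on every machine the first two local ranks are daemon nodes and are skipped, and application nodes are numbered globally in machine order.

```python
# Sketch of the local-to-global rank renumbering described above
# (illustrative; names and layout are our own invention).

N_DAEMONS = 2   # first two local ranks on each machine are daemon nodes

def global_rank(local_rank, offset):
    """Map a machine-local rank to its global MPI_COMM_WORLD rank.

    offset is the number of application nodes on all previous machines;
    daemon nodes are excluded from MPI_COMM_WORLD and yield None."""
    if local_rank < N_DAEMONS:
        return None
    return offset + local_rank - N_DAEMONS

# Machine 1 (offset 0): local rank 2 is the first application node.
print(global_rank(2, 0))   # → 0
# Machine 2, if machine 1 holds four application nodes (offset 4):
print(global_rank(2, 4))   # → 4
```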
The daemon node transfers both packages to the destination machine, where another daemon node receives the message and hands it over to the destination node.

[Diagram omitted: command package, data package and optional return value passed between nodes, labelled with global and local node numbers.]

Fig. 4. Point-to-point operation from global node 2 to global node 7

The receiver also has to check whether the communication is internal or external. For an internal communication it can directly execute an MPI_Recv command. The only additional work which has to be done in this case is that the MPI_Status has to be adapted, since global and local numbers are not identical (see Figure 3). If the communication is external, the receiving node first checks whether the expected message is already in the buffer. If not, it has to receive the header and the data packets from a daemon node.

3.2 Global operations

Roughly speaking, global operations in MPI can be split into two groups. The first group of operations has a root node, which has to distribute (e.g. broadcast) or to receive (e.g. reduce) some global data. The second group has no such root node; all nodes have the same status (e.g. barrier) or data (e.g. allreduce, all-to-all) after the global operation. The first group of global operations, which have such a root node, is split into two parts in PACX-MPI. One part is to distribute/collect the data between the machines, and a second part is a local operation inside the machine. The sequence of these two parts depends on the operation. For a broadcast, data will first be distributed to all machines and afterwards the local broadcast will be performed.
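The send-side decision described in Section 3.1 can be sketched schematically (our illustration; the function and structure names are invented, not PACX-MPI's API):

```python
# Schematic send-side logic of a PACX-MPI point-to-point operation as
# described above: internal messages go through the native MPI library,
# external ones are wrapped in a header and handed to the daemon node.

def pacx_send(payload, src_machine, dst_machine, dst_global_rank, tag):
    """Decide between internal (native MPI) and external (daemon) delivery."""
    if src_machine == dst_machine:
        # internal communication: hand the message to the vendor MPI library
        return ("native_mpi_send", dst_global_rank, tag, payload)
    # external communication: build a header identifying the message, then
    # forward header and data package via the local daemon node over TCP/IP
    header = {"machine": dst_machine, "rank": dst_global_rank,
              "tag": tag, "size": len(payload)}
    return ("send_to_daemon", header, payload)

print(pacx_send(b"data", 1, 1, 3, 0)[0])   # → native_mpi_send
print(pacx_send(b"data", 1, 2, 7, 0)[0])   # → send_to_daemon
```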
For a reduce operation PACX-MPI has to perform the local operation first, and only in the second step will the global collecting of data be performed. For the second class of global operations, without a root node, there are several possibilities. The main difference between the algorithms is whether we execute an all-to-all exchange of data between the machines, or whether we collect the global result on a dedicated node and distribute the global result in a second step. Let us regard this situation using MPI_Allreduce. In the first algorithm each machine would execute a local MPI_Reduce, using a local root node. In a second step, each machine would send its result to all other machines, calculate the global result locally and distribute it to all nodes on its machine. In this case we will have N·(N−1) messages which have to be exchanged between all machines, with N being the number of coupled machines. The advantage of this algorithm is that all external communication steps can theoretically be performed in parallel. In the second algorithm each machine again executes a local reduce, but in the second step they all send their local result to a dedicated node, which calculates the global result and then distributes this result to all other machines. In this case we will have 2·N external communication steps, but only N communications can be performed in parallel at the same time. Which of these two algorithms performs faster is a subject of current investigation and is strongly dependent on the network configuration between the machines.

3.3 Related work

There are several works related to this theme, each taking a somewhat different approach.
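The external message counts of the two MPI_Allreduce variants above can be compared directly:

```python
# External message counts of the two MPI_Allreduce variants discussed
# above, for N coupled machines.

def all_to_all_messages(n):
    # variant 1: every machine sends its local result to every other one
    return n * (n - 1)

def dedicated_node_messages(n):
    # variant 2: N results collected on one node, then N distributions back
    return 2 * n

for n in (2, 4, 8):
    print(n, all_to_all_messages(n), dedicated_node_messages(n))
# Beyond N = 3 the dedicated-node variant sends fewer external messages,
# but only N of them can run in parallel, so which variant is faster
# depends on the network between the machines.
```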
The well-known Globus project [7] tries to build up a whole set of metacomputing services, including distributed computing for MPI applications based on MPICH [6] and the NEXUS communication library. A disadvantage of this approach is that every external communication step is done by a direct node-to-node connection. For really big configurations this can lead to problems because of too many open ports/sockets. Additionally, the underlying NEXUS library has no support for global operations. Therefore the execution time, for example of a broadcast operation, is strongly dependent on the distribution of nodes on the different MPPs. The MagPIe project [9] was set up to solve this problem. This project implemented global operations for clusters of machines for MPICH, but on the other hand it still does not solve the problem of the direct point-to-point operations. PVMPI [4] makes MPI applications run on a cluster of machines by using PVM for the communication between the different machines. Unfortunately the user can use only point-to-point operations and has to add some non-MPI-conformant calls. The subsequent project, MPI_Connect, uses the same ideas but replaced PVM by a library called SNIPE [5], and, in contrast to PVMPI, now supports global operations too. A similar approach has been taken by PLUS [1]. This library additionally supports communication between different message-passing libraries, e.g. PARMACS, PVM and MPI. But again the user has to add some calls to his application. Another project called Stampi [8] has been presented recently. This project already uses the MPI-2 process model, but focuses mainly on local area computing. On the other hand it distinguishes between one-, two- and three-hop communication, and therefore a metacomputer need not perform direct node-to-node communication but can use some kind of daemon for the external communication.
4 Applications and Results

During Supercomputing '97 in San Jose and Supercomputing '98 in Orlando a couple of demonstrations were done using PACX-MPI. In this section we will briefly describe the applications used in the metacomputing environment and we will also present some of the results achieved.

4.1 URANUS

The first application is a CFD code called URANUS (Upwind Relaxation Algorithm for Nonequilibrium Flows of the University of Stuttgart) [2]. This program has been developed for simulating the reentry of a space vehicle in a wide altitude-velocity range. The reason why URANUS was tested in such an environment is that soon two additional components of URANUS will have a great demand on memory: the nonequilibrium part has been finished in the sequential code and will be parallelized soon. Furthermore we will simulate the Crew Rescue Vehicle (X-38) of the new international space station with more than 3 million cells. Both components together require memory in the range of hundreds of Gigabytes, which cannot be provided by a single machine today. During SC98 we simulated the European space vehicle HERMES with 1.7 million cells using 992 CPUs on two Cray T3Es. The code is based on a regular grid decomposition, which leads to a very good load balancing and a simple communication pattern. In the following we give the overall time it takes to simulate a medium-size problem with 880,000 grid cells. For the tests we simulated 10 iterations. We compared a single machine with 128 nodes and two machines with 2 times 64 nodes. Obviously the unchanged code is much slower on two machines.

Method             128 nodes using MPI   2*64 nodes using PACX-MPI
URANUS unchanged   102.4                 156.7
URANUS modified     91.2                 150.5
URANUS pipelined     -                   116.7

Table 2. Comparison of timing results (sec) in metacomputing for URANUS

However, the overhead of 50% is relatively small with respect to the slow network. Modification of the pre-processing does not improve the situation much. A lot more can be gained by fully asynchronous message passing. Using so-called 'message pipelining' [2], messages are only received if available. The receiving node may continue the iteration process without having the most recent data in that case. This helped to reduce the computing time significantly. The implication of this method is that for convergence the number of iterations has to be increased by about 10 percent. Additionally one has to take care that the messages are not older than two iterations, since this may prevent convergence altogether. Tests for one single machine were not run because results are no longer comparable with respect to numerical convergence.

4.2 P3T-DSMC

The second application is P3T-DSMC. This is an object-oriented Direct Simulation Monte Carlo code which was developed at the Institute for Computer Applications (ICA I) of Stuttgart University for general particle tracking problems [10]. Since Monte Carlo methods are well suited for metacomputing, this application gives a very good performance on the transatlantic connection.

Particles/CPU   60 nodes using MPI   2*30 nodes using PACX-MPI
1935             0.05                 0.28
3906             0.1                  0.31
7812             0.2                  0.31
15625            0.4                  0.4
31250            0.81                 0.81
125000           3.27                 3.3
500000          13.04                13.4

Table 3. Comparison of timing results (sec) in metacomputing for P3T-DSMC

For small numbers of particles the metacomputing shows some overhead, but from 125,000 particles upward the timings for one time step are identical on one machine and in the metacomputing testbed.
This excellent behaviour is due to two basic features of the code. First, the computation-to-communication ratio becomes better as more particles are simulated per process. Second, latency can be hidden more easily as the number of particles increases.

4.3 P3T-MD

The third application is also based on the P3T toolkit, but instead of a Monte Carlo code this program solves the molecular-dynamics equations to simulate the interactions between the particles. The code is therefore more strongly coupled than P3T-DSMC. During the SC98 event, a lot of tests have been performed with both P3T applications using up to 1024 processors. The results could not yet be fully evaluated. Often molecular-dynamics simulations generate large outputs. It makes no sense to store a complete configuration of such a simulation in a metacomputing environment, since this would require a lot of additional time. For big simulations P3T-MD does a distributed postprocessing of the data. For example, to calculate the force distribution between the particles, each node does its own statistical analysis and only the result of this analysis is stored instead of the raw data.

5 Conclusions

The last two chapters have pointed out that one has to invest a lot of work until an application performs on a cluster of MPPs. An MPI implementation suitable for such a cluster of machines has completely different requirements than an MPI library working on a single machine. Optimizing point-to-point operations by dealing with different protocols is required but still not enough. The global operations have to be adapted to algorithms dealing with latencies of different ranges.
Additionally, applications have to be adapted to become more latency-tolerant and to use less bandwidth. A CFD application like URANUS is much more difficult to adapt to such an environment, since it is strongly coupled and it is not simple to save communication without losing numerical performance. The key question in any case is whether one succeeds in overlapping communication and computation. A Monte Carlo method like P3T-DSMC communicates less than the application above, and is therefore automatically better suited to metacomputing. Problems may still arise in dealing with huge amounts of data, which have to be transferred to a single machine for the final output. Thus some distributed postprocessing operations are indispensable in order to transfer and save only the really important data. These points may be summed up by considering the costs of metacomputing. Since the costs for the networks depend on the reserved bandwidth and the time for which the network is used, algorithms should be developed that consider economic aspects as well. Nevertheless, there are some applications for which metacomputing may nowadays be the only method to get results, and for which it is worth doing all this work.

6 Outlook

The future metacomputing activities of the High Performance Computing Center Stuttgart will be focused in a project called METODIS (MEtacomputing TOols for DIstributed Systems). This project is supported by the European Community and has the major goal of creating a set of tools for metacomputing. This will include a MetaMPI based on PACX-MPI, a general ATM interface that will be used by PACX-MPI, and a metacomputing version of the performance analysis tool VAMPIR, which will be coupled to PACX-MPI. To achieve full support for MPI 1.2, PACX-MPI has to be extended to support more functions. Up to now we have implemented mainly the MPI calls needed by our applications.
Additionally, PACX-MPI will be extended to support not only TCP/IP for the external communication, but also other protocols such as ATM or HiPPI.

Acknowledgement

The authors would like to thank the networking organizations and groups, especially German Telekom, Teleglobe, CANARIE, STAR TAP, vBNS and ESNet, for their helpful support. In addition we would like to thank Pittsburgh Supercomputing Center and the High Performance Computing Center Stuttgart for providing their machines for our tests and demonstrations.

References

1. Matthias Brune, Jörn Gehring and Alexander Reinefeld (1997), Heterogeneous Message Passing and a Link to Resource Management, Journal of Supercomputing, Vol. 1, 1-17
2. Thomas Bönisch and Roland Rühle (1998) Adapting a CFD Code for Metacomputing, 10th International Conference on Parallel CFD, Hsinchu/Taiwan, May 11-14.
3. Th. Eickermann, J. Heinrichs, M. Resch, R. Stoy, R. Völpel (1998) Metacomputing in gigabit environments: Networks, tools and applications, Parallel Computing 24, 1847-1872.
4. Graham E. Fagg, Jack J. Dongarra and Al Geist (1997) Heterogeneous MPI Application Interoperation and Process Management under PVMPI, in Marian Bubak, Jack Dongarra, Jerzy Wasniewski (Eds.) 'Recent Advances in Parallel Virtual Machine and Message Passing Interface', 91-98, Springer.
5. Graham E. Fagg, Keith Moore, Jack J. Dongarra, Al Geist (1997) Scalable Networked Information Processing Environment (SNIPE), Technical Paper, Supercomputing 1997.
6. Ian Foster, Jonathan Geisler, William Gropp, Nicholas Karonis, Ewing Lusk, George Thiruvathukal, Steven Tuecke (1998) Wide-Area Implementation of the Message Passing Standard, Parallel Computing 24.
7. Ian Foster, Carl Kesselman (1998) The Globus Project: A Status Report, Proc. IPPS/SPDP '98 Heterogeneous Computing Workshop, pg. 4-18, 1998.
8.
Toshiya Kimura, Hiroshi Takemiya (1998) Local Area Metacomputing for Multidisciplinary Problems: A Case Study for Fluid/Structure Coupled Simulation, 12th ACM International Conference on Supercomputing, Melbourne, July 13-17.
9. Thilo Kielmann, Rutger F.H. Hofman, Henri E. Bal, Aske Plaat, Raoul A.F. Bhoedjang (1998), MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems, to appear at PPoPP'99, online version available at http://www.cs.vu.nl/albatross
10. Matthias Müller and Hans J. Herrmann (1998) DSMC - a stochastic algorithm for granular matter, in Hans J. Herrmann, J.-P. Hovi and Stefan Luding (Eds.) 'Physics of dry granular media', Kluwer Academic Publishers.
11. Andreas Wierse (1995) Performance of the COVISE visualization system under different conditions, in Visual Data Exploration and Analysis II, Georges G. Grinstein, Robert F. Erbacher (Eds.), Proc. SPIE 2410, pages 218-229, San Jose.

MILESS - A Learning and Teaching Server for Multi-Media Documents

Holger Gollan, Frank Lützenkirchen, and Dieter Nastoll

Computer Center, Essen University, Schützenbahn 70, 45117 Essen, Germany

Abstract. MILESS [7] is a joint project between the Computer Center and the Central Library of Essen University, together with the two pilot departments of linguistics and physics. The main purpose is to provide students and faculty of Essen University with a library server that supports several different functions that are needed within a digital library. Based on the IBM DB2 Digital Library product [4], MILESS can store and retrieve digital documents in any given format; moreover, searching is possible in a very elaborate way, and access control is supported as well.
In this article we will first discuss why there is a growing need for digital library servers, followed by a description of how MILESS is built on top of the IBM DB2 Digital Library. We will describe the software techniques that are used to build the system, and we give a test case for the use of MILESS when referencing different articles within a mathematical journal.

1 The Need for a Digital Library

With the evolving web technologies, the amount of digital data that is accessible via the internet is growing in a dramatic way. Usually, these data will appear on certain websites, either personal or institutional. In addition, there might be commercial places on the web that hold lots of information in different digital formats. But this huge set of information leads to several problems:

• It is sometimes hard to find.
• It has no systematic order.
• It might vanish without further notice.

On the other hand, classical libraries have to find new ways to enable their customers to work with this new material in addition to the well-known books and journals on the shelves. Moreover, university students and faculty want to use digital and especially multimedia material in learning, teaching and research. To face these problems and to meet these needs, a digital version of the classical library services is needed. It should support the use of such material by providing reliable, permanent, and systematically ordered access to it.

2 MILESS and the IBM DB2 Digital Library

In late 1997, the Computer Center and the Central Library of Essen University started the MILESS project, which was funded by the local state ministry and the university.
The idea was to install a digital library server that could solve the problems mentioned in the previous section. While the Computer Center brought in its know-how in information technology and software development, the Central Library started to redefine the classical library techniques and services for the new types of digital and multimedia objects. In addition, two pilot departments (linguistics and physics) started to fill the digital library server with appropriate material. To store and archive the digital documents, MILESS uses the IBM DB2 Digital Library product [4]. Its main parts consist of one library server and several object servers, where the object servers are responsible for the actual storage of the documents, but access is only possible via the library server that controls and manages the documents that are put into the digital library. Using this control mechanism, it is impossible to delete any documents in the object servers without notification of the library server, hence there can be no dead links within the system. The library server itself runs on top of a DB2 database, and the object servers can be connected to an ADSTAR Distributed Storage Manager (ADSM) [1] that handles the storage and archiving problems. E.g., documents can be archived when they haven't been used for a longer time. The IBM DB2 Digital Library product offers a lot of features, including several services that are useful for a digital library server. It handles storage and management of the documents via the object servers, and it enables access control via a rights management. Moreover, it has sophisticated search techniques, e.g. text mining and Query By Image Content (QBIC), to enable the user to find what he (she) is looking for within the stored digital documents. While the IBM DB2 Digital Library product does an excellent job when it comes to the storage and retrieval problem, it is of no great help for the implementation of the MILESS data model.
Since we wanted to have great flexibility in this respect, we adopted the Dublin Core [2] standard for the description of electronic resources, adding some additional features like contact information for the creators and contributors of the documents. Thus a document within MILESS can e.g. have several titles, several creators and contributors, several types and formats, etc. In particular, documents in MILESS can have several derivates in different formats; no standard format is required. Moreover, MILESS can handle hierarchical classifications that are widely used in science to help capture the subjects of a document in a standardized way. This very complex and yet flexible data model for the metadata of electronic documents enables the librarians in the project to extend their classical library services to the digital material within the MILESS system. Since the IBM DB2 Digital Library product cannot handle such complex data models, additional software had to be written to enable MILESS to work with the Dublin Core standard. We will take a closer look at the new software in the next section.

[Diagram omitted: the parts of the IBM DB2 Digital Library product — a web server / Java servlet engine running the MILESS server components (HTML/XML to web browsers, Java applets for the MILESS author GUI), the library server holding the metadata (title, author, ...), central or distributed object servers holding the files (PS, PDF, ...), a text search server for fulltext queries, a VideoCharger server for streaming audio/video data (MPEG, ...), and an ADSM server for archiving and backup, on top of IBM DB2 / Oracle.]

3 MILESS - A Closer Look

Besides the need for special software because of the complex data model, the additional MILESS software is divided into different layers, depending on their functionality. To be platform-independent, a design decision was made to use JAVA as the programming language, made possible by a JAVA API for the IBM DB2 Digital Library that can be used to connect JAVA code with the Digital Library product. Thus the bottom layer of the system is given by this programming interface that connects the Digital Library product with the outside world. This API is used by a so-called Persistency Layer that is responsible for the storing and retrieving of documents. On top of that there is a collection of JAVA classes that implement the functionalities for documents, legal entities (creators and contributors), etc. Another part of the inner system is using JAVA servlets. MILESS is running inside a web server that is capable of using servlets, and these servlets are used for the connection and communication between the user and the system, e.g.

• A DocumentServlet is used to present the metadata of a document on a page within the web server.
• The DerivateServlet is needed to access a certain format of a specified document.
• The SearchServlet takes the user queries, connects to the Digital Library product to do the search, and presents the results on a page within the web server.

[Diagram omitted: the software layers of the MILESS system — HTML pages and the author GUI on the client side, a Servlet Communicator, the Dublin Core data model package (Java class library), and the MILESS Persistency Layer (Java class library) to create, retrieve, update, delete and search MILESS data items.]

Besides the inner parts of the system that run on the server side, there are other parts that run on the side of the user client. Basically, the only thing the user needs is a web browser that connects him to the MILESS homepage at http://miless.uni-essen.de. From here, the user has access to the full functionality of the system; e.g. the search facilities can be reached just via normal HTML pages. In addition, there is the possibility for an author to create or change documents inside the MILESS system. To do this, he (she) can call a graphical user interface (GUI) that runs as a JAVA applet inside the web browser. This GUI can be used to create new documents for the system, or to change already existing documents or personal data of creators and contributors. Moreover it helps navigating through hierarchical classifications to find the correct subjects for the documents. The communication between this GUI on the client side and the inner part of the system on the server side is handled via a Servlet Communicator; the data exchange is done via XML [3]. In the near future contributors will be able to use XML directly to put material into the system. Thus the application on the client side doesn't have to know anything about the internal representations on the server side; in particular the IBM DB2 Digital Library product and its internal structure is totally invisible to the outside world.

4 MILESS - From the user point of view

There are a lot of possible scenarios for a user of the MILESS system.
Most people will use it just the way they use a classical library, but with enhanced features. One of these is the search facility. By making use of the search techniques of the underlying IBM DB2 Digital Library product, the user can not only search for certain authors or for words and phrases within the title or the keywords, but it is also possible to search for words and phrases within the texts of the documents. Moreover, because of the way hierarchical classifications are implemented inside MILESS, he (she) can navigate through the hierarchy of such classifications, looking e.g. at all documents at a specific level. Because of the capability of the system to handle documents in any given format, the user might have a problem with the actual format of a retrieved document. To overcome this difficulty, MILESS has a plug-in collection that could help the user and his browser to understand a strange format. Moreover, we are collecting different converters that could be used to create new formats from existing ones when putting new material into the system. Creating new material is another scenario for the use of the MILESS system. With the help of the user GUI, anybody can create new documents in the system by providing the necessary metadata within the GUI and uploading the data files of the document into the system. This can be used e.g. by lecturers to put their lectures and exercises into the system, enabling students to work with this material online whenever they want. Another scenario sees a lecturer preparing his next talk and searching the system for certain multimedia material he can use in the class.
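The hierarchical-classification navigation mentioned above can be sketched with a toy subject tree (invented data and layout for illustration only, not the MILESS data model): each level holds its own documents plus named sub-levels, and listing the documents at a specific level is a simple path lookup.

```python
# Illustrative sketch of navigating a hierarchical subject classification:
# every level stores its documents under "_docs" and its sub-levels by name.
# The data below is a made-up example, not real MILESS content.

classification = {
    "Mathematics": {
        "_docs": ["Archiv der Mathematik, Vol. 64"],
        "Algebra": {
            "_docs": ["Tameness of biserial algebras"],
        },
    },
}

def documents_at(tree, *path):
    """List the documents stored at one specific classification level."""
    node = tree
    for level in path:
        node = node[level]
    return node.get("_docs", [])

print(documents_at(classification, "Mathematics", "Algebra"))
# → ['Tameness of biserial algebras']
```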
Such mat er i al can be JAVA appl et s or s i mul at i ons / ani mat i ons , audi o- / vi deo mat er i al , et c. To access vi deo mat er i al , a vi deo server [5] is i ncl uded in t he s ys t em t ha t uses st r eami ng t echni ques t o del i ver mul t i medi a mat er i al in real t i me t o mul t i pl e users. Yet anot her use of MI LESS is t he linking bet ween di fferent art i cl es of a mat hemat i cal j our nal ; a first t est case for t hi s will be pr es ent ed in t he ne xt section. 5 MI L E S S a n d t h e " Ar c h i v d e r Ma t h e ma t i k " - A T e s t Ca s e Thi s final sect i on present s a col l abor at i on wi t h t he I ns t i t ut e for Expe r i me nt a l Mat hemat i cs a l. Essen Uni versi t y, wher e six old vol umes of t he ma t h e ma t - 148 ical journal "Archiv der Mathematik" will be retrodigitized to make them available on the web in digital form (see [6]). One small piece in this project is the automatic extraction of the biblio- graphic data like title, authors, author address, journal name, volume num- ber, etc. These data, stored in an XML-format [3], can be used to fill the metadata fields of the Dublin Core standard automatically and to put the retrodigitized articles into the MILESS system, with restricted and controlled access because of copyright issues. Another piece in this project is the automatic recognition of the cited ref- erences at the end of any article. Using Optical Character Recognition (OCR) and heuristics, an HTML-page is produced t hat contains the references and tries to link, where possible, to an online copy of the cited article. To do this, some standardization is needed to produce a correct link. To install a prototype for such a referencing functionality, we have put two articles of the "Archiv der Mathematik" into the MILESS system, namely 9 William Crawley-Boevey, Tameness of biserial algebras, Arch. Math. 65, 399-407. 9 Christof Geiss, On degenerations of tame and wild algebras, Arch. Math. 64, 11-16. 
where the second one is a cited reference in the first article. After the automatic recognition of the references of the first article, the link to the second article will be produced automatically as

http://miless.uni-essen.de/iem/Archiv_der_Mathematik/64/11

where the volume number and the first page are used to uniquely identify the cited article. Upon this request the MILESS system starts an internal search to retrieve the referenced article. With this feature the reader can view the first article, notice the citation, and follow the link by just clicking on it. This can easily be extended to referenced articles published in other journals, once these journals are available online and a unique way to reach their articles is realizable just from the bibliographic data, as in the example above. An extension of the prototype in this direction is planned for the near future. Such an extension can lead to a distributed library for scientific journals, not necessarily restricted to mathematics, adding new features and functionalities for the user, but creating some demands on the underlying networks as well.

6 Acknowledgement

The MILESS project is financially supported by the local state government of Northrhine-Westfalia, Germany, and Essen University. Many people have been involved in the design and implementation of the system, including, but not restricted to, D. Azkan, A. Bilo, E. Coelfen, B. Lix, V. Nordmeier, B. Schlesiona, A. Sprick.

References

1. ADSTAR Distributed Storage Management, http://www.storage.ibm.com/software/adsm/
2. The Dublin Core Standard, http://purl.oclc.org/dc/
3. Extensible Markup Language, http://www.w3.org/TR/REC-xml
4. IBM DB2 Digital Library, http://www.software.ibm.com/is/dig-lib/
5. IBM DB2 Digital Library Video Charger, http://www.software.ibm.com/data/videocharger/
6. G. O. Michler, "A Prototype of a Combined Digital and Retrodigitized Searchable Mathematical Journal", Preprint.
7. MILESS - Multimedialer Lehr- und Lernserver Essen, http://miless.uni-essen.de

Rural Educational System Network (RESNET): Design and Deployment

Salim Hariri, Wang Wei, Sung-Yong Park, and Harvey Janelli

Center for Advanced TeleSysMatics (CAT), University of Arizona, Tucson, AZ 85721, {hariri, wang}@ece.arizona.edu, www.ece.arizona.edu/~hpdc
Department of Computer Science, Sogang University, Seoul, Korea, sypark@ieee.org
Interactive Media Group, Inc., 14817 Sopras Circle, Addison, TX 75224, janelli_img@worldnet.att.net

ABSTRACT: The main objective of this project is to design and deploy the initial infrastructure of the Rural Education System Network (RESNET) in eastern Texas. We have selected the Asynchronous Transfer Mode (ATM) and single-mode fiber to build the RESNET infrastructure. The RESNET network operated initially at a backbone speed of OC-3c (155 Mbit/s), with the goal of upgrading to OC-12c (622 Mbit/s). The RESNET backbone connected the following sites: Tyler County Courthouse, Woodville ISD High School, Tyler County Hospital in the city of Woodville, Alabama & Coushatta Indian Reservation, Big Sandy ISD, Livingston High School, the Polk County Courthouse, the Polk County Hospital, and the main campus of Sam Houston State University in Huntsville, TX. The intended applications for RESNET are classified into three types: 1) Telecommunication Services, 2) Interactive Multimedia Services, and 3) Multimedia Services. The telecommunication services include switched data, voice, and video ATM service at 25 Mbps and 155 Mbps. In addition, it is the goal of the Tribes to establish a Call Center and Network Control Center. The interactive multimedia services include: Virtual Classroom, Virtual Courtroom, Virtual County, Virtual Clinic, Teacher Network, Parent Network.
The multimedia services include: Video-On-Demand (MPEG I & II), Education-On-Demand, Training-On-Demand, Multimedia Publishing, Electronic Publishing, and Intra/Internet Broadcasting. In this paper, we give an overview of the RESNET history, goals, design and technology adopted for RESNET, and conclude with future RESNET activities.

1. Overview of RESNET - Historical Perspective

The Rural Education System Network or RESNET was founded in 1992. It is an educational and public service partnership of private industry, local government, universities, hospitals, rural independent school districts and the Alabama-Coushatta Tribes. It was incorporated in the State of Texas on November 4, 1993. In 1994 the Tribes entered into the RESNET partnership to seek funding for the Tribal Technology project. In 1996 RESNET was successful in securing initial funding of $750,000 from the Houston Endowment (a private foundation). Money was provided for the construction of an 80-mile fiber optic network with an OC-3c (155 Mbps) Asynchronous Transfer Mode (ATM) backbone. This network backbone connects three independent school districts (ISD) of Livingston, Big Sandy, and Woodville; two hospitals, the Indian Health Service (IHS) clinic and Tyler County hospital with UTMB Galveston; Sam Houston State University (a regional university); and the Polk and Tyler county governments. In addition, the Department of Agriculture provided $340,000 of funding for the base line ATM electronics that connects these locations with the Alabama-Coushatta Tribes. It was the consortium's intent that RESNET becomes a low cost community resource, operated and supported by tribal members. It provides a high-speed interface to the Internet, and serves as an Intranet for the Tribes, hospitals, local governments, and educational components. RESNET has assembled a solid partnership of local business, international technology providers, and world class academics.
All of the partnerships have been consummated with the use of a formal teaming agreement. This binding agreement clearly defines the commitment of all the parties. The majority of the partners have been involved for over three years and continue to get more and more involved. Current business and technology partners include: Lucent Technologies (4 years); Entergy Corp. (3 years); Sam Houston Electric Cooperative (SHECO) (2.5 years); the CASE Center at Syracuse University (2 years); and the newest partner, the IBM Corporation, Network Hardware Division. All of the current RESNET ISDs mentioned have been committed for over four years and have played an active role in the definition of end-user needs. Sam Houston State University, Rice University, Texas A&M University, University of Texas, and the University of Houston are committed to providing distance learning programs and higher education courseware for the network. This has been a "grass-roots" effort, primarily in support of the Tribes and their children. Unlike other programs, the Alabama-Coushatta Tribes will continue to share resources with the surrounding community. RESNET recently entered into a special partnership with SHECO. Both parties have a joint sheath agreement whereby SHECO, in addition to giving RESNET pro-bono pole contacts, matches mile for mile additional fiber installation. For every mile of fiber RESNET installs, SHECO installs a mile and both parties exchange fibers to form a common network. In addition, SHECO maintains the entire fiber optic system upon completion of installation by RESNET.

2. RESNET Goals and Strategies

The main goal of RESNET is the development of a community networking environment to cooperatively develop applications that utilize advanced communication technologies.
The RESNET will provide high-speed network connectivity to all the RESNET sites (e.g., schools, courthouses, hospitals, and Indian reservations) and state-of-the-art network-based applications (e.g., switched telecommunication services, interactive multimedia services and multimedia services). The RESNET also provides a high-speed connection to the Internet, including NGI (Next Generation Internet), and the NII (National Information Infrastructure). This federally mandated program was established by the High Speed Computing Act of 1991 (a.k.a. "NREN" - the National Research and Education Network). Its original goal was to establish a multimedia broadband fiber-optic network, connecting over 1200 national universities and research facilities in the U.S. and eventually overseas. The NII has now evolved into the Internet. It provides access to a myriad of information sources. It is, in effect, an extension of thousands of electronic assets worldwide, with very high-bandwidth requirements. Very large databases, multimedia databases with very large files, and high-bandwidth applications (e.g., the ability to log on to the Hubble telescope and concurrently view on-going experiments) are all part of this "mega-network". It is RESNET's goal to see that the children in rural east Texas are not left behind while this Information Superhighway (NII) comes to fruition. The population base of rural America is highly economically disadvantaged and highly populated by minorities. It is, in effect, a mirror image of the inner city, with a more favorable environment and lower population density. The RESNET technology goal is to install Fiber-to-the-Schools (FTTS) using a single-mode fiber backbone, with ATM transport at OC-12. There will be ATM switches at each ISD campus and ATM network interface cards (25 Mbps) installed in each student workstation. The technology involved in our migration toward NII/NGI compatibility is fairly simple.
We must work with what is presently installed, wherever possible, and upgrade to the compatible technology whenever possible. Our goal in the pilot is to extend the fiber installed at the University of Houston, and their full services, into the three ISD campuses of the RESNET pilot. RESNET takes care of the solutions to ensure the security and privacy of the data and network between the two sites.

3. RESNET Design and Technology

The RESNET is a broadband fiber-optic based, private wide-area ATM network, whose scope, by the year 2002, will include ten counties in the eastern Texas area. The first segment of this consortium-owned, not-for-profit managed network was started around the Alabama-Coushatta Indian Reservation in Polk County, Texas, and it was designed in a hierarchical way such that each town has a backbone switch, several small-to-medium sized ATM switches, and customer premise equipment. Therefore, the initial design of RESNET (Phase I), as we can see from Figure 1, has been focused on building a high-speed ATM backbone (OC-3) by connecting the Indian Reservation to Livingston, Woodville and Dallardsville (Big Sandy ISD), and on connecting the backbone switch to the community facilities such as schools and hospitals in each town. The next phases of the RESNET design (Phase II, III, and IV) will be extended to cover other ISDs located in the west, north, and south regions as shown in Figure 1.

[Figure 1 shows the overall RESNET design: the IBM 8265 backbone switch at the Alabama-Coushatta Indian Reservation connected over OC-3c single-mode fiber to Livingston, Woodville and Big Sandy ISD, with west, north, east and south terms planned to reach surrounding ISDs and an OC-3c link toward the Houston GigaPop (NGI/vBNS).]

Figure 1: Overview of RESNET Design

The RESNET design is implemented mainly by using ATM products from IBM.
For example, in each RESNET site, we have installed a backbone switch (IBM 8265 or IBM 8260 Nways ATM switch), and a combination of IBM 8260 Nways Multiprotocol Switching Hubs and IBM 8285 workgroup ATM switches to build ATM clusters operating at 155 and 25 Mbps speeds. Each ATM switch runs PNNI version 1.0 and UNI 3.0/3.1 to connect to other ATM switches and workstations. Each 8265 backbone switch has been equipped with a Multiprotocol Switched Services (MSS) module that provides various routing and switching services. The MSS is a key component of IBM's Switched Virtual Network architecture and supports various features such as Classical IP over ATM (CLIP) [1], LAN Emulation (LANE) [2] services and the Next Hop Resolution Protocol (NHRP) [3]. It also supports various routing (e.g., RIP, OSPF, etc.) and bridging protocols. For administrative purposes, each town is configured with a different IP subnet and each subnet runs either CLIP or LANE services based on its requirements. In this case, the MSS module installed in each backbone switch provides the necessary functions such as the Address Resolution Protocol (ARP) server and the LANE servers (LECS, LES, and BUS). Since cut-through routing (e.g., NHRP) capabilities have not been configured at the time of this writing, the communication between any two towns still passes through two routers, which may become a performance bottleneck as the inter-town traffic grows. However, we envision that this performance bottleneck can be resolved when standards-based short-cut routing schemes such as Multiprotocol over ATM (MPOA) [4] or Multiprotocol Label Switching (MPLS) [5] are implemented within the MSS module. The current RESNET backbone operates at the speed of 155 Mbps (OC-3) and ATM is the main transport mechanism over the RESNET.
However, with the advances in high-speed interfaces (e.g., OC-12, OC-48, and OC-192) and optical technologies (e.g., Dense Wavelength Division Multiplexing (DWDM)), the future RESNET backbone is expected to be upgraded to support gigabit or terabit applications. RESNET also allows the sharing of networked resources within a school district, town, county or region. A high-speed connection (DS-3 or OC-3) to the Houston GigaPop is being sought by RESNET in order to access Rice University, Texas A&M, University of Texas, and the University of Houston, most of which have a connection to the vBNS (very-high-speed Backbone Network Service) national ATM network. Sharing of the computational resources and the cooperative educational programs (both K-12 and adult education) permits RESNET participants to access resources that they could not afford on an individual basis. Figures 2 through 5 show the actual designs for the RESNET sites Livingston, Alabama-Coushatta, Woodville, and Big Sandy ISD, respectively.

[Figures 2 through 5 show the detailed site designs: in each case the local IBM 8260 hubs and IBM 8285 workgroup switches connect classroom and office PCs at 25 Mbps and servers at 155 Mbps to the site's IBM 8265 backbone switch over OC-3c single-mode fiber.]

Figure 2: Detailed View of Livingston
Figure 3: Detailed View of Alabama-Coushatta Indian Reservation
Figure 4: Detailed View of Woodville
Figure 5: Detailed View of Big Sandy ISD

4. RESNET Demonstration

One of the design goals of RESNET is to develop and deploy network-based multimedia applications that can fully utilize the high-speed ATM network connectivity across the RESNET sites. For this purpose, we have created three proof-of-concept scenarios and demonstrated them at the RESNET opening ceremony in November 1997. The demonstrations were presented at the Alabama-Coushatta Indian Reservation and three scenarios were demonstrated: 1) a video-conferencing demonstration, 2) a video on demand demonstration, and 3) a Voice over IP demonstration. These three applications were selected based on the current needs of the different RESNET sites and on future business plans using the RESNET infrastructure. In what follows, we briefly review the demonstrations and some of the experience gained from them.

4.1 Video-conferencing Demonstration

Video-conferencing is one of the most important applications that provide users with advanced video collaboration solutions for both education and business applications. In this demonstration, First Virtual's ATM-based video-conferencing solutions [6] have been installed both at the Alabama-Coushatta Indian Reservation and at Big Sandy ISD. First Virtual's video-conferencing solution consists of a PC equipped with a plug-and-play 25 Mbps ATM NIC card from First Virtual (VC-NIC) and MVIP-capable video-conferencing equipment from PictureTel. Unlike other ATM NIC cards, the VC-NIC card is specially designed to support video-networking applications and includes an industry standard MVIP interface on the board. This onboard MVIP interface provides a direct connection to MVIP-capable multi-vendor video-conferencing equipment and allows the video traffic to bypass the system bus. The VC-NIC card is fully compliant with the standard UNI 3.x signaling protocol and the LANE 1.0 protocol. The video data is transmitted at 384 Kbps.
Although 128 Kbps video-conferencing based on a single ISDN line meets the needs of face-to-face conferencing, the ATM-based 384 Kbps conferencing provides video quality high enough to support most business applications and distance learning applications.

4.2 Video on Demand Demonstration

In business and educational applications, it is common to record video presentations, classes, and movies, and store them in a centralized storage system in order to eliminate the need for the replication and distribution of tapes. Retrieving this information from remote PCs on demand is an important application since most of the ISDs in RESNET can share a wealth of course material and educational movies. In order to implement this scenario, a PC server (IBM 330) with a 155 Mbps ATM interface card (LANE interface) has been installed and connected to the backbone ATM switch at the Alabama-Coushatta Indian Reservation. This PC server runs the Windows NT Server 4.0 operating system and is equipped with a RAID-5 disk array to provide a high level of fault tolerance and better performance. Several DVD movies were downloaded onto the disk array so that client PCs in remote sites can access the movies simultaneously over the network. On the other hand, we have installed IBM Turboways ATM 25 Mbps NICs into several client PCs located at different RESNET sites (e.g., Livingston, Woodville, Big Sandy) and connected them to their local IBM 8285 workgroup ATM switches. The client PCs were also equipped with an MPEG-2 decoder board and a software DVD player to play the DVD movies stored on the PC server. The public domain Network File System (NFS) software (NFS server and NFS client) was used to implement the communication between the client and server. For example, the PC server exports its file system and the client PCs mount the remote file system as a network drive.
Once the mount operations are properly executed, each client PC opens the network drive and plays the DVD movies stored on the PC server over the ATM network.

4.3 Voice over IP Demonstration

One of the main advantages of using ATM is that any type of data (voice, video, data) can be mixed and transferred over the same network infrastructure. With the explosive growth of the Internet and the increasing interest in building the Next Generation Network (NGN) (NGN is a future communication infrastructure that integrates voice, data, and video traffic into a single common packet network), Voice over IP has been gaining popularity among researchers and creating a lot of opportunities for business and educational applications. In our demonstration, we have installed two Tempest Data Voice Gateways (DVG) from Franklin Telecom [7] at the Alabama-Coushatta Indian Reservation and Woodville High School. The two places are 15 miles apart and belong to different LATAs. The Tempest DVG is a self-contained, PC-based standalone box with three interface cards (DSP board, telephone interface board, LAN interface board). This box runs the Linux operating system and contains system software from Franklin Telecom. One of the problems we met was that the data interface provided by the Tempest DVG was Ethernet only, so a direct connection to the RESNET ATM backbone was not an option. In order to solve this problem, two local Ethernet subnets were created, one at the Alabama-Coushatta Indian Reservation and one at Woodville High School. In each subnet, we also installed a Windows NT-based router (a PC with dual data interfaces, Ethernet and ATM) so that the Ethernet traffic generated by the Tempest DVG is routed and transmitted across the RESNET ATM backbone to Woodville High School. The router at Woodville High School in turn routes the data to the local Ethernet subnet.
Although each voice packet has to pass through two routers, the quality of the voice was quite impressive. As we add more simultaneous voice sessions, the quality of voice might be degraded due to the nature of Ethernet and the two intermediate routers. Installing a T1 board from Franklin Telecom and connecting it directly to the backbone switch (the IBM 8265/8260 has an interface module for T1/E1) is another option to improve the throughput and guarantee the quality of simultaneous voice sessions. Also, in a real environment, we can create an ATM PVC between the two Ethernet subnets and bridge (or tunnel) the voice traffic to improve the performance.

5. Summary and Concluding Remarks

In this paper, we presented the design and deployment of the Rural Educational System Network (RESNET) in eastern Texas. We reviewed how this project was started and funded, and the steps involved in implementing the RESNET backbone network. We also reviewed in further detail the technology adopted to design each RESNET site. We are currently working with Texas A&M University to take over the responsibility of managing all the RESNET services. In addition, we are currently aggressively pursuing initiatives to provide high-speed connectivity to the national high-speed backbone (vBNS). Once this connection is established, we will work with the researchers at the Center for Advanced TeleSysMatics (CAT) at the University of Arizona and Texas A&M to establish an Adaptive Distributed Virtual Computing Environment (ADVICE) [8] on RESNET.

References

[1] M. Laubach, "Classical IP and ARP over ATM", RFC 1577, January 1994.
[2] ATM Forum, "LAN Emulation over ATM Specification - ver 1.0", February 1994.
[3] D. Katz, D. Piscitello and B. Cole, "NBMA Next Hop Resolution Protocol", Internet Draft, December 1995.
[4] ATM Forum, "Multiprotocol over ATM - ver 1.0", July 1997.
[5] R. Callon, P. Doolan, N. Feldman, A. Fredette, G. Swallow and A. Viswanathan, "A Framework for Multiprotocol Label Switching", Internet Draft, November 1997.
[6] http://www.fvc.com
[7] http://www.ftel.com
[8] Salim Hariri et al., "The design and evaluation of a virtual distributed computing environment", Cluster Computing, Vol. 1, May 1998, pp. 81-93.

Some Performance Studies in Exact Linear Algebra

George Havas¹ and Clemens Wagner²

¹ Centre for Discrete Mathematics and Computing, Department of Computer Science and Electrical Engineering, The University of Queensland, Queensland 4072, Australia, havas@csee.uq.edu.au, http://www.it.uq.edu.au/~havas/
² Fachgruppe Praktische Informatik, Fachbereich Elektrotechnik und Informatik, Universität-GHS Siegen, D-57078 Siegen, Germany, wagner@informatik.uni-siegen.de, http://pi.informatik.uni-siegen.de/clemens/

Abstract. We consider parallel algorithms for computing the Hermite normal form of matrices over Euclidean rings. We use standard types of reduction methods which are the basis of many algorithms for determining canonical forms of matrices over various computational domains. Our implementations take advantage of well-performing sequential code and give very good performance.

1 Introduction

Algorithms for exact linear algebra have been much studied. Many different strategies for calculation of canonical forms of matrices have been proposed. A comprehensive bibliography and a number of earlier methods for integer matrices are examined in [8], including references to various polynomial time algorithms. Some parallel and some more recent methods are described in [16,9,6,22,20,11,12,21]. Reduction methods for general Euclidean rings are studied in detail in [23]. We concentrate on algorithms which use reduction as their underlying principle.
In spite of the fact that the worst case performance of reduction methods can be exponentially bad (see [4] and [13]), such techniques provide the basis for many sequential implementations. We do not consider modular methods in this paper. Rather we study other recent implementations which focus on finding well-performing algorithms and good heuristics for reduction methods. We start by outlining the mathematical background. We then show how to extend sequential algorithms to a parallel environment. We finish by presenting some sample performance figures and outlining recommendations for choosing appropriate parallel algorithms.

2 Mathematical background

A commutative ring R with identity 1 is Euclidean if there is a value function φ : R* → N₀ (where R* = R \ {0} and N₀ is the set of nonnegative integers) such that the following properties hold for a ∈ R and b ∈ R*.

1. For a ≠ 0, φ(ab) ≥ φ(a).
2. There exist q, r ∈ R with a = qb + r, such that either r = 0 or φ(b) > φ(r).

Paradigm examples of Euclidean rings are Z (the ring of integers, with absolute value as value function) and F[x] (the ring of univariate polynomials with coefficients in a field F, with degree as value function). An element a ∈ R is a unit if it has an inverse a⁻¹ ∈ R such that aa⁻¹ = a⁻¹a = 1. The set U(R) of all units of R is a multiplicative group. Elements a, b ∈ R are associates if there exists a unit c ∈ R such that a = bc, and we write a ~ b. Association is an equivalence relation with equivalence classes [a] := {b ∈ R | a ~ b}. A subset R̂ ⊆ R is a representative set for R if {[a] | a ∈ R̂} = R/~ and, for all a, b ∈ R̂ with a ≠ b, a ≁ b.
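As a concrete illustration of the division property for R = Z with φ(a) = |a|, the following Python sketch computes q and r with a = qb + r and φ(r) < φ(b). It is our illustration only: the least-absolute-remainder convention used here is just one possible residue choice, not necessarily the one used in the algorithms below.

```python
def euclidean_div(a, b):
    """Return (q, r) with a = q*b + r and |r| < |b|, for integers a, b != 0.

    Illustrates the defining Euclidean property for R = Z with
    phi = absolute value.  The least-absolute-remainder convention
    (|r| <= |b|/2) is one possible choice of residue.
    """
    q, r = divmod(a, b)            # floor division: r has the sign of b
    if 2 * abs(r) > abs(b):        # shift to the least absolute remainder
        r -= b                     # keep a = q*b + r invariant:
        q += 1                     # (q+1)*b + (r-b) = q*b + r
    return q, r

# The defining property a = q*b + r with phi(r) < phi(b):
for a, b in [(17, 5), (-7, 3), (4, -9)]:
    q, r = euclidean_div(a, b)
    assert a == q * b + r and abs(r) < abs(b)
```

For F[x] the same property holds with polynomial long division and φ = degree.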
For a Euclidean ring R a function ρ : R × R* → R is called a residue class system if for all a, a′ ∈ R and b ∈ R*:

ρ(a, b) ∈ {a − qb | q ∈ R}, with either ρ(a, b) = 0 or φ(ρ(a, b)) < φ(b), and

ρ(a, b) = ρ(a′, b) ⟺ ∃t ∈ R with a′ = a + tb.

Let M ⊆ R with M ≠ {0} be a finite, nonempty subset of R. The greatest common divisor of M, gcd(M), is the equivalence class [g] such that ∀a ∈ [g], a | M, and ∀b ∈ R with b | M, b | [g]. If we have a representative set R̂ for R and d ∈ gcd(M) ∩ R̂ then d is uniquely determined. Further background material on Euclidean rings is given in [18,7,5].

Matrices A and B with entries in a Euclidean ring R are column equivalent if there exists a unimodular matrix V such that A = BV. The matrix V corresponds to a sequence of elementary column operations: multiplying a column by a unit of R; adding any multiple by a ring element of one column to another; or interchanging two columns. For any matrix B over a Euclidean ring R with representative system R̂ there exists a unique lower triangular matrix H which is column equivalent to B and which satisfies the following conditions. Let r be the rank of B.

1. The first r columns of H are nonzero and the remaining columns are zero.
2. For 1 ≤ j ≤ r let H_{i_j,j} be the first nonzero entry in column j. Then i₁ < i₂ < … < i_r.
3. H_{i_j,j} ∈ R̂ for 1 ≤ j ≤ r.
4. For 1 ≤ k < j ≤ r, H_{i_j,k} = ρ(H_{i_j,k}, H_{i_j,j}).

This matrix is called the column Hermite normal form (HNF) of the given matrix B and has many important applications. As already mentioned, there are many algorithms based on reduction methods for computing the HNF.
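To make the reduction principle concrete, here is a minimal sequential Python sketch for R = Z (our illustration, not one of the optimized implementations studied in this paper). It brings an integer matrix into column HNF using only the three elementary column operations listed above, with positive pivots and residues in [0, pivot) as the representative choices.

```python
def column_hnf(A):
    """Column Hermite normal form of an integer matrix A (list of rows),
    computed purely by unimodular column operations: column swaps,
    negating a column (unit multiplication), and adding an integer
    multiple of one column to another.
    """
    H = [row[:] for row in A]        # work on a copy of A
    m, n = len(H), len(H[0])
    r = 0                            # next pivot column
    for i in range(m):               # sweep rows top to bottom
        if r == n:
            break
        # gcd-style elimination: clear row i to the right of column r
        while any(H[i][j] != 0 for j in range(r + 1, n)):
            piv = min((j for j in range(r, n) if H[i][j] != 0),
                      key=lambda j: abs(H[i][j]))
            for row in H:            # swap smallest entry into column r
                row[r], row[piv] = row[piv], row[r]
            for j in range(r + 1, n):
                q = H[i][j] // H[i][r]
                for k in range(m):   # column_j -= q * column_r
                    H[k][j] -= q * H[k][r]
        if H[i][r] == 0:
            continue                 # no pivot in this row
        if H[i][r] < 0:              # normalize the pivot to be positive
            for k in range(m):
                H[k][r] = -H[k][r]
        for j in range(r):           # reduce earlier columns mod the pivot
            q = H[i][j] // H[i][r]
            for k in range(m):
                H[k][j] -= q * H[k][r]
        r += 1
    return H

# The columns (2,3) and (4,5) span the same lattice as (2,0) and (0,1):
print(column_hnf([[2, 4], [3, 5]]))   # -> [[2, 0], [0, 1]]
```

Applied to the working matrix W obtained by stacking A on top of the identity matrix, as described in Section 4, the same reduction also yields the unimodular multiplier V.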
Descriptions of such methods for canonical form computation in Euclidean rings (sometimes specialized to the integers) in the literature include [18,7,5,19].

3 Sequential algorithms

Deterministic polynomial-time HNF algorithms (non-modular) include those of Kannan and Bachem [14], Chou and Collins [3], and Havas, Majewski and Matthews [12] for the integers; of Kannan [15] for Q[x]; and of Wagner [23] for F_q[x]. Heuristic algorithms (often faster and/or "better") include those of Havas and Majewski [9] for the integers and of Wagner [23] for F_q[x]. All in all, even the sequential algorithm situation is a quite complicated story, which is addressed in much more detail in [23]. We do not go into this further here, but rather build parallel algorithms based upon effective sequential ones.

4 Parallel implementations

The problem we consider is: given A ∈ Mat_{m×n}(R), compute in parallel H, the HNF of A, together with a unimodular matrix V such that H = AV. To unify the computation we actually compute the HNF of a working matrix

W = ( A  )
    ( I_n )

obtained by stacking A on top of the n × n identity matrix. Let K be the Hermite normal form of W. The first m rows of K are the Hermite normal form of A and the last n rows of K give a unimodular transformation matrix V.

A parallel computer P := {π₀, …, π_{N−1}} consists of N processors with distinct memory and a communication network. For each 0 ≤ τ ≤ N − 1 let

ξ(l, τ) := ⌊l/N⌋ + 1 if τ < l mod N, and ξ(l, τ) := ⌊l/N⌋ otherwise.

We use the matrix distribution model from [24].
Each pr ocessor 7r~ on t he paral l el comput er st ores par t of t he worki ng ma t r i x W cor r es pondi ng t o a (~(m, T) + 1) X n mat r i x A (~) and a ~(n, r ) x n ma t r i x V (' ) whi ch are submat r i ces of t he worki ng versi ons of A and t he mul t i pl i er V E GL n ( ~ x ] ) , respect i vel y. An ext r a row, row ~( m, r ) + 1 of A (r) is used t o cont r ol t he comput at i ons. We call t hi s t he comput at i on row. The ma t r i x is di st r i but ed rowwi se t o processors, but in st r i pes n o t in blocks. For each oper at i onal row we also do a br oadcast . Thus, we di s t r i but e t he i nput ma t r i x A by st or i ng t he i t h row of .4 in row i ~ of ma t r i x W (~) on pr ocessor 7r, wher e r : = (i -- 1) mod N and i ' : = L( i - 1) / N] + 1 for 1 < i < m. The paral l el , hybr i d HNF al gor i t hm PARALLEL-HNF is gi ven by ps eudocode in Fi gur e 1. I t uses t he s t andar d DI V oper at or for t he a ppr opr i a t e Eucl i dean ri ng and calls t wo subpr ocedur es: COMPUTE- GCD and PARALLEL-ROD. 164 PARALLEL-HNF ( W (r) , o~,/3) i n p u t r : pr oc e s s or i nde x W( r 7r e- par t of a full c o l u mn - r a n k , d i s t r i b u t e d ma t r i x W a, / 3: n o n - n e g a t i v e i nt eger s k + - I ( i ~ , . . . , i , , ) + o w h i l e k < n d o r + - - 0 i + - - 1 s r mi n { k + a - 1, n} f + - - k w h i l e r < s d o f +-- m a . x { r + 2, f } + - (i - I) m o d N if # = r t h e n j +-- L(i - W N J + I b r o a d c a s t W(,;)+I, W: , ) ) , . . . , W(, ; ) t o a l l o t h e r e l s e j +- g( m, r ) + 1 ~ I ndex of c omput at i on row r e c e i v e lJ:(~) T~:(~) W: , ] ) f r o m # ' ' j , r +l , ' ' j , f ~' ' " , i f i ---- i~+1 t h e n f o r l +- - f t o s d o q +--- DI V( W! ~ ), W ( ; ) I ) 9 . , o - ) + _ _ , . , d : ) : ' . , < - > w . : w. , t - q w ; : + l W (r) +-- COMPUTE- GCD( W (r), j , r + 1, f , s) W (~) i f j , r +l ~ 0 t h e n ( i l , . . . , i , , ) +- ( i l , . . . i r , i , i ~ + l , . . . 
A call COMPUTE-GCD(B, i, j0, j1, j2) takes a matrix B with n columns, an integer i ≥ 1 and 1 ≤ j0 < j1 ≤ j2 ≤ n as input, where i is a valid row index of B. It produces a right equivalent matrix B′ as output where B′_{i,j0} = gcd(B_{i,j0}, B_{i,j1}, ..., B_{i,j2}) and B′_{i,j} = 0 for j1 ≤ j ≤ j2. This algorithm computes B′ from B by unimodular column operations (swapping two columns, multiplying a column by a unit, and adding a multiple of one column to another column). There are various different methods to obtain B′ from B (e.g. [1, 2, 10, 23]). By using gcd algorithms whose execution depends only on the entries in the operational row, we need no additional communication for this purpose.

The function PARALLEL-ROD reduces the off-diagonal entries. It is a hybrid algorithm controlled by a parameter β ∈ N_0. For β + 1 greater than or equal to the rank of the input matrix it is a parallel variant of the standard reduction algorithm. For β = 1 this algorithm is a parallel version of Chou-Collins' reduction method. If we choose β equal to zero, the algorithm does not change the input matrix. A more detailed description of the PARALLEL-ROD algorithm can be found in [24].

Theorem 1. Let A ∈ Mat_{m×n}(R) with rank r be in echelon form. Let 1 ≤ β ≤ r and q := ⌊r/β⌋.
Then the PARALLEL-ROD algorithm uses (q(q + 1)/2)β + r broadcasts to transfer at most (r² + r + q²β + qβ)/2 ring elements.

A proof is given in [24]. The PARALLEL-HNF algorithm divides the input matrix into vertical blocks of width α. For k := α, 2α, ..., (p − 1)α, n (where p := ⌈n/α⌉) the HNF of the leading k-column submatrix is computed. This is shown in Figure 2.

Fig. 2. Computing the HNF of the leading k columns, for k = α, 2α, ...

Theorem 2. Let A ∈ Mat_{m×n}(R) with rank r and 1 ≤ α, β ≤ n. Then, with the distributed W := (A over I_n) and α and β as input, the PARALLEL-HNF algorithm uses O(n(m + n − r)/α + n³/(αβ)) broadcasts and transfers O(n² + nr²/α) ring elements. For R = F[x] the procedure uses O((m + n − r) n⁵ α β² ||A||²) field operations. At most O(n⁴ α² β ||A||) field elements are transferred via broadcasts.

Proof. Transforming an (m + n) × s principal submatrix of W with s ∈ {α, 2α, ..., (p − 1)α, n} into echelon form requires at most m + n − r broadcasts. Transforming the echelon form into HNF requires, by Theorem 1, (q(q + 1)/2)β broadcasts with q := ⌊s/β⌋. Summing over all p blocks yields the stated broadcast bound. To compute the echelon form we do not need to broadcast the linearly dependent rows. Thus the number of broadcast ring elements can be majorized by Σ_{i=1}^{s} (1 + (α − 1)) = sα for an (m + n) × s principal submatrix of W. Transforming the echelon form to Hermite normal form requires broadcasting at most O(r²) ring elements. Computing the Hermite normal form of all (m + n) × s principal submatrices of W with s ∈ {α, 2α, ..., (p − 1)α, n}, we need O(nα + Σ_{i=1}^{p−1} iα²) + pO(r²) ⊆ O(nα + (p(p − 1)/2)α²) + O(pr²) ring elements to be broadcast.
The proofs of the other two estimates are quite lengthy. They can be found in [23].

5 Performance examples

We have implemented this and related algorithms in C/C++ on the IBM SP2 at the GMD in Sankt Augustin. We have used the xlC compiler and the message passing library MPL (both IBM products). We have used the Sorting-GCD algorithm, due to Majewski-Havas [17], for the implementation of the COMPUTE-GCD function, where we used a heap for determining the polynomial with the largest degree or the integer with the largest absolute value, respectively, in a subvector.

We have done many practical studies with these algorithms. In this paper we give some details of the behavior of PARALLEL-HNF for some random matrices over F2[x] and Z. Thus we used an input matrix over F2[x] which is a random 80 × 80 matrix, where the degree of each entry is less than or equal to 80. The rank of this matrix is r = 80. The input matrix over Z is a random 100 × 100 matrix. The absolute value of each entry is less than or equal to 64. Table 1 and Figures 3 and 4 show the results of experiments in which we varied α and β. We used 16 nodes of the SP2. The first row of each measurement is the total running time (minutes:seconds.hundredths). The second row gives the maximum degree (F2[x]), or the number of bits in the largest absolute value (Z), which arose during the computation. In Figure 3 the x-axis shows α while the y-axes show run times and maximum degrees. In Figure 4 the x-axis shows α while the y-axes show run times and maximum number of bits.

F2[x] (running time / maximum degree):

α     β = n+1−α         β = 1             β = α
1     06:53.08  12708   06:57.10  12708   06:52.68  12717
3     05:34.19  12708   07:14.12  15073   10:28.21  18646
5     05:45.01  13701   06:41.58  15016   10:51.69  23377
8     04:43.10  12232   05:06.79  19614   10:02.18  26823
20    04:11.13  12384   04:40.08  27387   09:05.69  33377
40    04:28.23  11270   05:13.12  29951   08:40.16  35593
60    03:49.84   9982   06:12.17  37468   06:02.98  37883
80    03:32.68  11270   08:11.38  35593   05:00.42  30719
100   01:33.32   9955   07:01.55  43486   01:32.47   9955

Z (running time / bits in largest absolute value):

α     β = n+1−α         β = 1             β = α
1     00:20.35   1401   00:20.28   1401   00:18.92   1412
3     00:23.25   1698   00:22.91   1698   00:21.99   2138
5     00:19.68   1598   00:19.59   1744   00:22.39   2639
8     00:20.57   1641   00:20.35   2374   00:21.78   3146
16    00:21.01   1263   00:22.65   5851   00:27.84   6506
20    00:31.24   1464   00:40.86  13977   00:46.43  14266
40    00:43.38   2043   01:14.43  20222   01:11.70  20222
60    01:37.72   4544   03:09.35  64587   01:49.23  51291
80    11:36.60  20987   11:36.20  20987   11:38.67  20987

Table 1. Effect of varying α and β

Fig. 3. Effect of varying α and β for a matrix over F2[x]: (a) running times, (b) maximum degree

6 Concluding Remarks

The reuse of sequential code for parallel implementations is well supported by the operation row concept. We have used earlier sequential GCD algorithms for our parallel implementations. This leads to parallel implementations for Hermite normal form computation with good speed-up. In the integer case the hybrid algorithm gives best results for small α and β = n. For R = F2[x] we get the best performance for α = n and small β.

For integer matrices with large rank, the new procedures are fastest. For distributed computation it is good to use the parallel, hybrid procedure with small α (< 10) and β := n. These procedures produce very good transformation matrices (i.e., entries with small absolute value) if the rank of the matrix is less than the number of columns.
The transformation matrices are almost as good as ones obtained using LLL lattice basis reduction methods, which are orders of magnitude slower. For matrices over F2[x] and for integer matrices with small rank, "Gaussian elimination" type procedures (α := n) are fastest. For distributed computation use small β (∈ {1, 2}).

Fig. 4. Effect of varying α and β for an integer matrix: (a) running times, (b) maximum number of bits needed

Acknowledgements

The first author was partially supported by the Australian Research Council.

References

1. W. A. Blankinship. A new version of the Euclidean algorithm. Amer. Math. Monthly 70 (1963) 742-745.
2. G. H. Bradley. Algorithm and bound for the greatest common divisor of n integers. Comm. ACM 13 (1970) 433-436.
3. T-W. J. Chou and G. E. Collins. Algorithms for the solution of systems of linear Diophantine equations. SIAM J. Comput. 11 (1982) 687-708.
4. X. G. Fang and G. Havas. On the worst-case complexity of integer Gaussian elimination. ISSAC'97 (Proc. 1997 Internat. Sympos. Symbolic Algebraic Comput.), ACM Press (1997) 28-31.
5. K. O. Geddes, S. R. Czapor and G. Labahn. Algorithms for Computer Algebra. Kluwer Academic Publishers, 1992.
6. M. Giesbrecht. Fast computation of the Smith normal form of an integer matrix. ISSAC'95 (Proc. 1995 Internat. Sympos. Symbolic Algebraic Comput.), ACM Press (1995) 110-118.
7. B. Hartley and T. O. Hawkes. Rings, Modules and Linear Algebra. Chapman and Hall, 1976.
8. G. Havas, D. F. Holt and S. Rees.
Recognizing badly presented Z-modules. Linear Algebra Appl. 192 (1993) 137-163.
9. G. Havas and B. S. Majewski. Hermite normal form computation for integer matrices. Congressus Numerantium 105 (1994) 87-96.
10. G. Havas and B. S. Majewski. Extended gcd calculation. Congressus Numerantium 111 (1995) 104-114.
11. G. Havas and B. S. Majewski. Integer matrix diagonalization. J. Symbolic Computation 24 (1997) 399-408.
12. G. Havas, B. S. Majewski and K. R. Matthews. Extended gcd and Hermite normal form algorithms via lattice basis reduction. Experimental Mathematics 7 (1998) 125-135.
13. G. Havas and C. Wagner. Matrix reduction algorithms for Euclidean rings. Proc. 1998 Asian Symposium on Computer Mathematics, Lanzhou University Press (1998) 65-70.
14. R. Kannan and A. Bachem. Polynomial algorithms for computing Smith and Hermite normal forms of an integer matrix. SIAM J. Comput. 8 (1979) 499-507.
15. R. Kannan. Solving systems of linear equations over polynomials. Theoretical Computer Science 39 (1985) 69-88.
16. E. Kaltofen, M. S. Krishnamoorthy, and B. D. Saunders. Parallel algorithms for matrix normal forms. Linear Algebra Appl. 136 (1990) 189-208.
17. B. S. Majewski and G. Havas. A solution to the extended gcd problem. ISSAC'95 (Proc. 1995 Internat. Sympos. Symbolic Algebraic Comput.), ACM Press (1995) 248-253.
18. M. Newman. Integral Matrices. Academic Press, 1972.
19. C. C. Sims. Computation with finitely presented groups. Cambridge University Press, 1994.
20. A. Storjohann. Near optimal algorithms for computing Smith normal forms of integer matrices. ISSAC'96 (Proc. 1996 Internat. Sympos. Symbolic Algebraic Comput.), ACM Press (1996) 267-274.
21. A. Storjohann. Computing Hermite and Smith normal forms of triangular integer matrices. Linear Algebra Appl. 282 (1998) 25-45.
22. A. Storjohann and G. Labahn. Asymptotically fast computation of Hermite normal forms of integer matrices. ISSAC'96 (Proc.
1996 Internat. Sympos. Symbolic Algebraic Comput.), ACM Press (1996) 259-266.
23. C. Wagner. Normalformberechnung von Matrizen über euklidischen Ringen. PhD thesis, Institut für Experimentelle Mathematik, Universität-GH Essen, 1997. Published by Shaker-Verlag, 52013 Aachen/Germany, 1998.
24. C. Wagner. Fast parallel Hermite normal form computation of matrices over F[x]. Euro-Par'98 Parallel Processing, Lecture Notes Comput. Sci. 1470 (1998) 821-830.

Performance Analysis of Wavefront Algorithms on Very-Large Scale Distributed Systems

Adolfy Hoisie, Olaf Lubeck and Harvey Wasserman
<hoisie, oml, hjw>@lanl.gov
Scientific Computing Group
Los Alamos National Laboratory
Los Alamos, NM 87545

Abstract. We present a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message passing environment. The model combines the separate contributions of computation and communication wavefronts. We validate the model on three important supercomputer systems, on up to 500 processors. We use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. We also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100-TFLOPS computer systems expected to be in existence within the next decade as part of the ASCI program and elsewhere. On such machines our analysis shows that, contrary to conventional wisdom, inter-processor communication performance is not the bottleneck. Single-node efficiency is the dominant factor.

1. Introduction

Wavefront techniques are used to enable parallelism in algorithms that have recurrences by breaking the computation into segments and pipelining the segments through multiple processors [1].
First described as "hyperplane" methods by Lamport [2], wavefront methods now find application in several important areas including particle physics simulations [3], parallel iterative solvers [4], and parallel solution of triangular systems of linear equations [5-7].

Wavefront computations present interesting implementation and performance modeling challenges on distributed memory machines because they exhibit a subtle balance between processor utilization and communication cost. Optimal task granularity is a function of machine parameters such as raw computational speed, and inter-processor communication latency and bandwidth. Although it is simple to model the computation-only portion of a single wavefront, it is considerably more complicated to model multiple wavefronts existing simultaneously, due to potential overlap of computation and communication and/or overlap of different communication or computation operations individually. Moreover, specific message passing synchronization methods impose constraints that can further limit the available parallelism in the algorithm. A realistic scalability analysis must take these constraints into consideration.

Much of the previous parallel performance modeling of software-pipelined applications has involved algorithms with one-dimensional recurrences and/or one-dimensional processor decompositions [5-7]. A key contribution of this paper is the development of an analytic performance model of wavefront algorithms that have recurrences in multiple dimensions and that have been partitioned and pipelined on multidimensional processor grids. We use a "compact application" called SWEEP3D, a time-independent, Cartesian-grid, single-group, "discrete ordinates" deterministic particle transport code taken from the DOE Accelerated Strategic Computing Initiative (ASCI) workload.
Estimates are that deterministic particle transport accounts for 50-80% of the execution time of many realistic simulations on current DOE systems; this percentage may expand on future 100-TFLOPS systems. Thus, an equally important contribution of this work is the use of our model to explore SWEEP3D scalability and to show the sensitivity of SWEEP3D to per-processor sustained speed, and to MPI latency and bandwidth, on future-generation systems.

Efforts devoted to improving performance of discrete ordinates particle transport codes date back many years and have extended recently to massively-parallel systems [8-12]. Research has included models of performance as a function of problem and machine size, as well as other characteristics of both the simulation and the computer system under study. For example, Koch, Baker, and Alcouffe [3] developed a parallel efficiency formula that considered computation only, while Baker and Alcouffe [9] developed a model specific to CRAY T3D put/get communication. However, these previous models had limiting assumptions about the computation and/or the target machines. In this work, we model parallel discrete ordinates transport and account for both computation and communication. We validate the model on several architectures within the realistic limits of all parameters appearing in the model.

Sections 2 and 3 of the paper briefly describe the algorithm and its implementation. Sections 4 and 5 derive the performance model and give validation results. In the final sections of the paper, the model is used to estimate SWEEP3D performance on future-generation parallel systems, showing the sensitivity of this application to system computation and communication parameters. Note that although we present results for three different parallel systems, no comparison of achieved system performance or scalability is intended.
Rather, measurements from the three systems are presented in an effort to demonstrate generality of the performance model and sensitivity of application performance to machine parameters.

2. Description of Discrete Ordinates Transport

Although much more complete treatments of discrete ordinates neutron transport have appeared elsewhere [12-14], we include a brief explanation here to make clear the origin of the wavefront process in SWEEP3D. The basis for neutron transport simulation is the time-independent, multigroup, inhomogeneous Boltzmann transport equation, which is formulated as

Ω·∇Ψ(r,E,Ω) + σ(r,E)Ψ(r,E,Ω) = ∫dE′dΩ′ σ_s(r, E′→E, Ω·Ω′) Ψ(r,E′,Ω′) + (1/4π) ∫dE′dΩ′ χ(r, E′→E) νσ_f(r,E′) Ψ(r,E′,Ω′) + Q(r,E,Ω).

The unknown quantity is Ψ, which represents the flux of particles at the spatial point r with energy E traveling in direction Ω. Numerical solution involves complete discretization of the multi-dimensional phase space defined by r, Ω, and E. Discretization of energy uses a "multigroup" treatment, in which the energy domain is partitioned into subintervals in which the dependence on energy is known. In the discrete ordinates approximation, the angular direction Ω is discretized into a set of quadrature points. This is also referred to as the SN method, where (in 1D) N represents the number of angular ordinates used. The discretization is completed by differencing the spatial domain of the problem onto a grid of cells.

The numerical solution to the transport equation involves an iterative procedure called a "source iteration" (see Ref. 13). The most time-consuming portion is the "source correction scheme," which involves a transport sweep through the entire grid-angle space in the direction of particle travel. A lower triangular matrix is obtained; as such, one needs to go through the grid only once in inverting the iteration matrix.
In Cartesian geometries, each octant of angles has a different sweep direction through the mesh, and all angles in a given octant sweep the same way.

For a given discrete angle, each grid cell has a spatially-exact particle "balance equation" with seven unknowns. The unknowns are the particle fluxes on the six cell faces and the flux within the cell. Boundary conditions and the spatial differencing approximation are used to provide closure to the system. Boundary conditions (typically vacuum or reflective) allow the sweep to be initiated at the object's exterior. Thereafter, for any given cell, the fluxes on the three incoming cell planes for particles traveling in a given discrete angle are known and are used to solve for the cell center and the three cell faces through which particles leave the cell. Thus, each interior cell requires in advance the solution of its three upstream neighboring cells - a three-dimensional recursion. This is illustrated in Figure 1 for a 1-D arrangement of cells and in Figure 2 for a 2-D grid.

Figure 1. Dependences for a 1-D Transport Sweep.

Figure 2. 2-D Transport Sweep along a Diagonal Wavefront.

3. Parallelism in Discrete Ordinates Transport

The only inherent parallelism is related to the discretization over angles. However, reflective boundary conditions limit this parallelism to, at most, angles within a single octant.

The two-dimensional recurrence may be partially eliminated because solutions for cells within a diagonal are independent of each other (as shown in Figure 2). The success of this "diagonal sweep" scheme on SIMD computers such as single-processor vector systems (using 2-D plane diagonals) and the Thinking Machines, Inc. Connection Machine (using 3-D body diagonals) has been demonstrated [3].
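The diagonal concurrency can be sketched with a toy 2-D recurrence (ours, for illustration; the real code solves a seven-unknown balance equation per cell and angle, and the averaging update and boundary values below are invented): cell (i, j) needs only its north neighbor (i−1, j) and west neighbor (i, j−1), so every cell on an anti-diagonal i + j = d is independent and can be computed in the same step.

```python
# Toy diagonal-wavefront sweep from the top-left corner of an nx-by-ny grid.
def wavefront_sweep(nx, ny, west_bc=1.0, north_bc=0.0):
    flux = [[0.0] * ny for _ in range(nx)]
    for d in range(nx + ny - 1):               # nx + ny - 1 diagonal steps
        for i in range(max(0, d - ny + 1), min(nx, d + 1)):
            j = d - i                          # all (i, j) with i + j = d
            west = flux[i][j - 1] if j > 0 else west_bc
            north = flux[i - 1][j] if i > 0 else north_bc
            flux[i][j] = 0.5 * (west + north)  # invented "balance equation"
    return flux
```

The outer loop makes the sequential structure explicit: nx + ny − 1 steps, with all cells inside one step mutually independent; on a SIMD or vector machine each inner loop runs concurrently.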
Diagonal concurrency can also be the basis for implementation of a transport sweep using a decomposition of the mesh into subdomains, using message passing to communicate the boundaries between processors, as described in [12] and shown in Figure 3. The transport sweep is performed subdomain by subdomain in a given angular direction. Each processor's exterior surfaces are computed by, and received in a message from, "upstream" processors owning the subdomains sharing these surfaces.

However, as pointed out by Baker [9] and Koch [3], the dimensionality of the SN parallelism is always one order lower than the spatial dimensionality because recursion in one spatial direction cannot be eliminated. Because of this, parallelization of the 3-D SN transport in SWEEP3D uses a 2-D processor decomposition of the spatial domain. Parallel efficiency would be limited if each processor computed its entire local domain before communicating information to its neighbors. A strategy in which blocks of planes in one direction (k, in the current implementation) and angles are pipelined through this 2-D processor array improves the efficiency, as shown in Figure 3. Varying the k- and angle-block sizes changes the balance between parallel utilization and communication time.

Figure 3. Illustration of the 2-D domain decomposition on eight processors with 2 k-planes per block. The transport sweep has started at the top of the processor in the foreground. Concurrently-computed cells are shaded.

4. A Performance Model for Parallel Wavefronts

This section describes a performance model of a message passing implementation of SWEEP3D. Our model uses a pipelined wavefront as the basic abstraction and predicts the execution time of the transport sweep as a function of primary computation and communication parameters. We use a two-parameter (latency/bandwidth) linear model for communication performance, which is equivalent to the LogGP model [15].
We use the term latency to mean the sum of L and o in the LogGP framework, and bandwidth to mean the inverse of G. Since different implementations of MPI use different buffering strategies as a function of message size, a single set of latency/bandwidth parameters describes a limited range of message sizes. Consequently, multiple sets are used to describe the entire range. Computation time is parameterized by problem size, the number of floating-point calculations per grid point, and a characteristic single-CPU floating-point speed.

4.1 Pipelined Wavefront Abstraction

An abstraction of the SWEEP3D algorithm partitioned for message passing on a 2-D processor domain (ij plane) is described in Figure 4. The inner-loop body of this algorithm describes a wavefront calculation with recurrences in two dimensions. Each processor must wait for boundary information from neighboring processors to the north and west before computing on its subdomain. For convenience, we assume that the implementation uses MPI with synchronous, blocking sends/receives. There is little loss of generality in this assumption since the subdomain computation must wait for message receipt. Multiple waves initiated by the octant, angle-block and k-block loops are pipelined one after another as shown in Figure 5, in which two inner loop bodies (or "sweeps") are executing
The pipeline consists of both computation and communication stages. The num- ber of stages of each kind and the repetition del ay per wavefront need to be de- termined as a function of the number of processors and shape of the processor grid. The cost of each individual computation/communication stage is dependent on problem size, processor speed and communication parameters. FOR EACH OCTANT DO FOR EACH ANGLE-BLOCK IN OCTANT DO FOR EACH K-BLOCK DO IF (NEIGHBOR ON WEST) RECEIVE FROM WEST (BOUNDARY DATA) IF (NEIGHBOR_ON _NORTH) RECEIVE FROM NORTH (BOUNDARY) COMPUTE_MESH (EVERY I,J DIAGONAL; EVERY K IN K-BLOCK; EVERY ANGLE IN ANGLE-BLOCK) IF (NEIGHBOR_ON_EAST) SEND TO EAST(BOUNDARY DATA) IF (NEIGHBOR_ON_SOUTH) SEND TO SOUTH(BOUNDARY DATA) END FOR END FOR END FOR Figure 4. Pseudo Code for the wavefront Algorithm 4.2 Computation Stages Figure 5 shows that the number of computation stages is simply the number of diagonals in the grid. A different number of processors is empl oyed at each stage but all stages take the same amount of time since processors on a diagonal are executing concurrently. The cost of one computational stage is thus the time to compl et e one COMPUTE_MESH function (see algorithm abstraction above) on a processor' s subdomain. The discussion can be summarized with two equations. Equation (2) gives the number of computation steps in the pipeline, N ~ = P~ + Py- ! (2) and Equation 3 gives the cost of each step, 177 T~p. ( Nx NY+ Nz Na N'a~ = + + ) ( 3 ) Px Py Kb Ab Rpops where Nx, Ny, and Nz are the number of grid points in each direction; Kb is the size of the k-plane block; Ab is the size of the angular block; Nflop, is the number of floating-point operations per gridpoint; and Rflops is a characteristic floating-point rate for the processor. The next sweep can begin as soon as the first processor completes its computation so the repetition delay, d ~ is 1 computational step (i.e., the time for completing one diagonal in the sweep). 
4.3 Communication Stages

The number and cost of communication stages are dependent on specific characteristics of the communication system. The effect of blocking synchronous communications is that messages initiated by the same processor occur sequentially in time and messages must be received in the same order that they are sent. As implemented, the order of receives is first from the west, then from the north, and the order of sends is first to the east and then to the south. These rules lead to the ordering (and concurrency) of the communications for a 4 x 4 processor grid as shown in Figure 6, for a sweep that starts in the upper-left quadrant.

Figure 5. Multidimensional Pipelined Wavefronts

Figure 6. Communication Pipeline.

In Figure 6, edges labeled with the same number are executed simultaneously, and the graph shows that it takes 12 steps to complete one communication sweep on a 4 x 4 processor grid. We assume that a logical processor mesh can be embedded into the machine topology such that each mesh node maps to a unique processor and each mesh edge maps to a unique router link. One can generalize the number of stages to a grid of Px by Py processors by observing that communication for each row of processors is initiated by a message from a north neighbor in the first column of processors. South-going messages in the first column of processors occur on every other step, since each processor in the column a) has no west neighbor, and b) must send east before sending south. Thus the last processor in the first column receives a message on step 2(Py - 1).
This initiates a string of west-going messages along the last row that are also sent on every other step, and the number of stages in the communication pipeline is given by

N_s^comm = 2(Py - 1) + 2(Px - 1).     (4)

Analogous to the computational pipeline, different stages of the communication pipeline have different numbers of point-to-point communications. However, since these occur simultaneously, the cost of any single communication stage is the time of a one-way, nearest-neighbor communication. This time is given by

T_msg = t_0 + N_msg / B,     (5)

where latency + overhead (t_0) and bandwidth (B) are defined in LogGP as noted above. The repetition delay for the communication pipeline, d^comm, is 4 because a message sent from the top-left processor (processor 0) to its east neighbor (processor 1) on the second sweep cannot be initiated until processor 1 completes its communication with its south neighbor from the first sweep (Figure 6).

4.4 Combining Computation and Communication Stages

In the previous two sections, we derived formulas for the modeling of SWEEP3D that are general for any pipelined wavefront computation. We can summarize the discussion in two equations that give the separate contributions of computation and communication:

T^comp = [(Px + Py - 1) + (Nsweep - 1)] * T_cpu     (6)
T^comm = [2(Px + Py - 2) + 4(Nsweep - 1)] * T_msg     (7)

The major remaining question is whether the separate contributions, T^comp and T^comm, can be summed to derive the total time. They would not be additive if there were any additional overlap of communication with computation not already accounted for in each term. To see that this is not the case, consider the task graph for an execution consisting of two wavefronts on a 3 x 3 processor grid (Figure 7). This graph shows communication tasks (circles numbered with a send/receive processor pair) and computation tasks (squares numbered by a computing processor).
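Taken together, eqns. (5)-(7) give the complete elapsed-time estimate. A hedged sketch of our reading of the model (not the authors' code; the numerical inputs below are invented for illustration):

```python
def t_msg(n_bytes, t0, bandwidth):
    # Eqn. (5): one-way nearest-neighbor message cost,
    # t0 = latency + overhead, bandwidth = 1/G in LogGP terms.
    return t0 + n_bytes / bandwidth

def sweep_time(px, py, n_sweep, t_cpu, t_msg_cost):
    t_comp = ((px + py - 1) + (n_sweep - 1)) * t_cpu               # eqn. (6)
    t_comm = (2 * (px + py - 2) + 4 * (n_sweep - 1)) * t_msg_cost  # eqn. (7)
    return t_comp + t_comm      # additive, as argued from the task graph

# Single sweep (n_sweep = 1) on a 4 x 4 grid:
print(sweep_time(4, 4, 1, 1.0e-3, 5.0e-5))   # 7*1e-3 + 12*5e-5 ~ 0.0076
```

The two bracketed coefficients are exactly the pipeline-fill terms (Px + Py − 1 computation stages, 2(Px + Py − 2) communication stages) plus the per-sweep repetition delays (1 and 4 steps, respectively).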
The total number of stages in the combined communication/computation pipeline is equal to the number of nodes (of each type) in the longest path through the graph (the critical path), shown in dotted boxes in the figure. The critical path for the first sweep can be counted from Figure 7: 5 computational tasks and 8 communication tasks. This result is exactly the number given by eqns. (2) and (4). One can further verify that there is no further overlap between two pipelined sweeps other than the predicted sum of eqns. (6) and (7). The second sweep completes exactly 1 computation and 4 communication steps after the first.

Figure 7. Pipelined Wavefront Task Graph.

In summary, total time for the sweep algorithm is the sum of eqns. (6) and (7), where Tcpu is given by eqn. (3) and Tmsg is given by eqn. (5). The validation of the model against experiment involves the measurement and/or modeling of Tmsg and Tcpu. We take Tmsg to be the time needed for the completion of a send/receive pair of an appropriate size, and Tcpu to be the computational work associated with the subgrid computation on each processor.

5. Validation of the Model

In this section, we present results that validate the model with performance data from SWEEP3D on three different machines, with up to 500 processors, over the entire range of the various model parameters. Inspection of eqns. (6) and (7) leads to identification of the following validation regimes:

Nsweep = 1: This case validates the number of pipeline stages in T^comp and T^comm as
(6) only); Tcomp = 0 (validate eqn. (7) only); Tcomp ≈ Tcomm (validate the sum of eqns. (6) and (7)).

5.1 Nsweep = 1

For a single sweep, the coefficients of Tcpu and Tmsg in equations (6) and (7) represent the number of computation and communication stages in the pipeline, respectively. Any overlap in communication or computation during the single sweep of the mesh is encapsulated in the respective coefficients. In hypothetical problems with Tmsg ≈ Tcpu, and in the limit of large processor configurations (large Px + Py), equations (6) and (7) show that the communication component of the elapsed time would be twice as large as the contribution of the computation time. In reality, for reasonably designed problem sizes and partitionings (small subgrid surface-to-volume ratio), Tcpu is considerably larger than Tmsg. Computation is the dominant component of the elapsed time. This is apparent in Figure 8, which presents the model-experiment comparison for a weak scalability analysis of a 16 x 16 x 1000 subgrid size, sweeping only one octant. This size was chosen to reflect an estimate of the subgrid size for a 1-billion-cell problem running on a machine with about 4,000 processors; the former is a canonical goal of ASCI and the latter is simply an estimate of the machine size that might satisfy a 3-TFLOPS peak performance requirement. In a "weak scalability" analysis, the problem size scales with the processor configuration so that the computational load per processor stays constant. This experiment shows that the contribution of communication is small (in fact, the model shows that it is about 150 times smaller than computation), and the model is in very good agreement with the experiment.
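The stage counts in eqns. (6) and (7) are simple enough to evaluate directly. The following minimal sketch (our own function names, with Tcpu and Tmsg treated as measured inputs, since eqns. (3) and (5) depend on machine parameters) reproduces the critical-path counts quoted for Figure 7:

```python
def message_time(t_o, n_bytes, bandwidth):
    """Eqn (5): one nearest-neighbor message, with t_o = latency + overhead
    and bandwidth B, as in LogGP."""
    return t_o + n_bytes / bandwidth

def sweep_time(px, py, n_sweep, t_cpu, t_msg):
    """Total SWEEP3D time as the sum of eqns (6) and (7)."""
    t_comp = ((px + py - 1) + (n_sweep - 1)) * t_cpu          # eqn (6)
    t_comm = (2 * (px + py - 2) + 4 * (n_sweep - 1)) * t_msg  # eqn (7)
    return t_comp + t_comm

# Critical-path check for one sweep on a 3 x 3 grid (Figure 7):
# 5 computation stages and 8 communication stages.
assert sweep_time(3, 3, 1, t_cpu=1.0, t_msg=0.0) == 5.0
assert sweep_time(3, 3, 1, t_cpu=0.0, t_msg=1.0) == 8.0
```

Setting one of the two unit costs to zero isolates each pipeline, which is exactly the validation strategy used in the experiments below.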
We note that in the absence of communication our model reduces to the linear "parallel computational efficiency" models used by Baker [9] and Koch [3] for SN performance, in which parallel computational efficiency is defined as the fraction of time a processor is doing useful work. To validate the case with Nsweep = 1 and "comparable" contributions of communication and computation, we had to use a subgrid size that is probably unrealistic for actual production simulation purposes (5 x 5 x 1). Even with this size, computation outweighs communication by about a factor of 6. Figure 9 depicts a weak scalability analysis on the SGI Origin 2000 for this size. The model-experiment agreement is again very good.

Figure 8. Tcomp dominant. Nsweep = 1. IBM RS/6000.

Figure 9. Tcomp ≈ Tcomm. Nsweep = 1. SGI Origin.

Validation of cases where Tcomp = 0 involved the development of a new code to simulate the communication pattern in SWEEP3D in the absence of computation. The code developed for this purpose simply implements a receive-west, receive-north, send-south, send-east communication pattern enclosed in loops that initiate multiple waves. Figure 10 shows very good agreement of the model with the measured data from this code.

Figure 10. Tcomm only (Tcomp = 0). Nsweep = 1. SGI Origin.

Figure 11. Tcomp dominant. Nsweep = 10. SGI Origin.
5.2 Nsweep ≈ (Px + Py)

As described in Section 4, sweeps of the domain generated by successive octants, angle blocks, and k-plane blocks are pipelined, with the depth of the pipeline, Nsweep, given by the product of the number of octants, angle blocks, and k-plane blocks. We can select k-plane and angle block sizes so that Nsweep = 10, which, in turn, balances the contributions of the Nsweep and (Px + Py) terms for the processor configurations used in this work. Figure 11 presents the comparison using a data size for which Tcomp is dominant, showing excellent agreement with the measured elapsed time. The case with no computation is in fact a succession of 10 sweeps of the domain, with the communication overlap described by equation (7). Figure 12 shows very good agreement with experimental data for this case. An excellent model-experiment agreement is similarly shown in Figure 13, for a subgrid size of 5 x 5 x 1, which leads to balanced contributions of the communication and computation terms to the total elapsed time of SWEEP3D.

Figure 12. Tcomp = 0. Nsweep = 10. CRAY T3E.

Figure 13. Tcomp ≈ Tcomm. Nsweep = 10. SGI Origin.

5.3 Nsweep >> (Px + Py)

We present model-data comparisons using weak scalability experiments for cases in which Nsweep is large compared with (Px + Py) in Figure 14 (6 x 6 x 360 subgrid; Tcomp ≈ Tcomm) and in Figure 15 (16 x 16 x 1000 subgrid; Tcomp dominant). The model is in good agreement with the measured execution times of SWEEP3D in both cases.

5.4 Strong Scalability

In a "strong scalability" analysis, the overall problem size remains constant as the processor configuration increases. Therefore, Tmsg and Tcpu vary from run to run as the problem size per processor decreases.
Figure 16 shows the comparison between measured and modeled time for the strong scalability analysis, out to nearly 500 processors, on the problem size 50 x 50 x 50. The agreement is excellent.

Figure 14. Tcomp ≈ Tcomm. 6 x 6 x 360 subgrid. Nsweep large. CRAY T3E. Kb = 10.

Figure 15. Tcomp dominant. 16 x 16 x 1000 subgrid. Nsweep large. IBM RS/6000 SP.

Figure 16. Strong scalability. CRAY T3E.

6. Applications of the Model: Scalability Predictions

Performance models of applications are important to computer designers trying to achieve a proper balance between the performance of different system components. ASCI is targeting a 100-TFLOPS system in the year 2004, with a workload defined by specific engineering needs. For particle transport, the ASCI target involves O(10^9) mesh points, 30 energy groups, O(10^4) time steps, and a runtime goal of about 30 hours. With 5,000 unknowns per grid point, this requires about 40 TBytes of total memory. In this section we apply our model to understanding the conditions under which the runtime goal might be met. Two sources of difficulty with such a prognosis are (1) making reasonable estimates of machine performance parameters for future systems, and (2) managing the SWEEP3D parameter space (i.e., block sizes). We handle the first by studying a range of values covering both conservative and optimistic changes in technology. We handle the second by reporting results that correspond to the shortest execution time (i.e., we use block sizes that minimize runtime). We assume a 100-TFLOPS-peak system composed of about 20,000 processors (5 GFLOPS peak per processor, an extrapolation of Moore's law).
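The memory and processor-count figures above follow from simple arithmetic; a quick check (assuming 8-byte unknowns, which the text does not state explicitly):

```python
mesh_points = 1e9            # O(10^9) mesh points
unknowns_per_point = 5_000
bytes_per_unknown = 8        # assumption: double precision (not stated in the text)

total_tbytes = mesh_points * unknowns_per_point * bytes_per_unknown / 1e12
print(total_tbytes)          # 40.0 -> "about 40 TBytes of total memory"

peak_flops = 100e12          # 100-TFLOPS-peak system
per_processor = 5e9          # 5 GFLOPS peak per processor
processors = peak_flops / per_processor
print(int(processors))       # 20000 -> "about 20,000 processors"
```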
With this processor configuration, given the proposed size of the global problem, the resulting subgrid size is approximately 6 x 6 x 1000. Plots showing the dependence of runtime on sustained processor speed and on latency for MPI communications are shown in Figures 17 and 18 for several k-plane block sizes, using optimal values for the angle-block size. Table 1 collects some of the modeled runtime data for a few important points: sustained processor speeds of 10% and 50% of peak, and MPI latencies of 0.1, 1, and 10 microseconds. Our model shows that the dependence on bandwidth (1/G in LogGP) is small, and as such no sensitivity plot based on ranges for bandwidth is presented. The Table 1 data assume a bandwidth of 400 Mbytes/s. One immediate observation is that the runtime under the most optimistic technological estimates in Table 1 is still larger than the 30-hour goal by a factor of two. The execution time goal could be met if, in addition to these values of processor speed and MPI latency (L+o in LogGP), we used what we believe to be an unrealistically high bandwidth value of 4 GBytes/s. Assuming a more realistic sustained processor speed of 10% of peak (based on data from today's systems), Table 1 shows that we miss the goal by about a factor of six, even when using 0.1 μs MPI latency. With the same assumption for processor speed, but with a more conservative value for latency (1 μs), the model predicts that we are a factor of 6.6 off. In fact, our results show that the best way to decrease runtime is to achieve better sustained per-processor performance. Changing the sustained processor rate by a factor of five decreases the runtime by a factor of three, while decreasing the MPI latency by a factor of 100 reduces runtime by less than a factor of two. This is a result of the relatively low communication/computation ratio that our model predicts.
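These factor claims can be checked directly against the Table 1 runtimes (a quick sanity check; the hours are taken from Table 1 and the 30-hour goal from the text):

```python
GOAL = 30.0  # runtime goal in hours

# Table 1 runtimes in hours, keyed by (sustained rate, MPI latency in microseconds)
runtime = {('10%', 0.1): 180, ('10%', 1.0): 198, ('10%', 10.0): 291,
           ('50%', 0.1): 56,  ('50%', 1.0): 74,  ('50%', 10.0): 102}

print(runtime[('10%', 0.1)] / GOAL)   # 6.0  -> "miss the goal by about a factor of six"
print(runtime[('10%', 1.0)] / GOAL)   # 6.6  -> "a factor of 6.6 off"
print(runtime[('50%', 0.1)] / GOAL)   # ~1.9 -> "larger than the 30-hour goal by a factor of two"

# Speeding the CPU up 5x (10% -> 50% of peak) at fixed 1.0 us latency:
print(runtime[('10%', 1.0)] / runtime[('50%', 1.0)])   # ~2.7, "a factor of three"
# Cutting MPI latency 100x (10 us -> 0.1 us) at 10% of peak:
print(runtime[('10%', 10.0)] / runtime[('10%', 0.1)])  # ~1.6, "less than a factor of two"
```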
For example, using values of 1 μs and 400 MB/sec for the communication latency and bandwidth, and a sustained processor speed of 0.5 GFLOPS, the communication time will only be 20% of the total runtime.

Figure 17. Sensitivity of the billion-point transport sweep time to sustained per-processor CPU speed on a hypothetical 100-TFLOPS system, as projected by the model for several k-plane block sizes, with MPI latency = 15 μs and bandwidth = 400 Mbytes/s.

Figure 18. Sensitivity of the billion-point transport sweep time to MPI latency on a hypothetical 100-TFLOPS system, as projected by the model for several k-plane block sizes, with sustained per-processor CPU speed = 500 MFLOPS and bandwidth = 400 Mbytes/s.

Table 1. Estimates of SWEEP3D Performance on a Future-Generation System as a Function of MPI Latency and Sustained Per-Processor Computing Rate

                     10% of peak                       50% of peak
MPI latency   Runtime (hours)  Communication   Runtime (hours)  Communication
0.1 μs        180              16%             56               52%
1.0 μs        198              20%             74               54%
10 μs         291              20%             102              58%

7. Conclusions

A scalability model for parallel, multidimensional, wavefront calculations has been proposed, with machine performance characterized using the LogGP framework. The model accounts for overlap in the communication and computation components. The agreement with experimental data is very good under a variety of model sizes, data partitionings, and blocking strategies, and on three different parallel architectures.
Using the proposed model, the performance of deterministic transport codes on future-generation parallel architectures of interest to ASCI has been analyzed. Our analysis showed that, contrary to conventional wisdom, interprocessor communication performance is not the bottleneck. Single-node efficiency is the dominant factor.

8. Acknowledgements

We would like to thank Ken Koch and Randy Baker of LANL Groups X-CM and X-TM for many helpful discussions and for providing several versions of the SWEEP3D benchmark. We thank Vance Faber and Madhav Marathe of LANL Group CIC-3 for interesting discussions regarding the mapping of problem meshes to processor meshes. We acknowledge the use of computational resources at the Advanced Computing Laboratory, Los Alamos National Laboratory, and support from the U.S. Department of Energy under Contract No. W-7405-ENG-36. We also thank Pat Fay of Intel Corporation for help running SWEEP3D on the Sandia National Laboratory ASCI Red TFLOPS system, and SGI/CRAY for a generous grant of computer time on the CRAY T3E system. We also acknowledge the use of the IBM SP2 at the Lawrence Livermore National Laboratory.

References

1. G. F. Pfister, In Search of Clusters - The Coming Battle in Lowly Parallel Computing, Prentice Hall PTR, Upper Saddle River, NJ, 1995, pages 219-223.
2. L. Lamport, "The Parallel Execution of DO Loops," Communications of the ACM, 17(2):83-93, February 1974.
3. K. R. Koch, R. S. Baker and R. E. Alcouffe, "Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor," Trans. of the Amer. Nuc. Soc., 65, 198, 1992.
4. W. D. Joubert, T. Oppe, R. Janardhan, and W. Dearholt, "Fully Parallel Global M/ILU Preconditioning for 3-D Structured Problems," to be submitted to SIAM J. Sci. Comp.
5. J. Qin and T. Chan, "Performance Analysis in Parallel Triangular Solve," In Proc.
of the 1996 IEEE Second International Conference on Algorithms & Architectures for Parallel Processing, pages 405-412, June 1996.
6. M. T. Heath and C. H. Romine, "Parallel Solution of Triangular Systems on Distributed Memory Multiprocessors," SIAM J. Sci. Statist. Comput., Vol. 9, No. 3, May 1988.
7. R. F. Van der Wijngaart, S. R. Sarukkai, and P. Mehra, "Analysis and Optimization of Software Pipeline Performance on MIMD Parallel Computers," Technical Report NAS-97-003, NASA Ames Research Center, Moffett Field, CA, February 1997.
8. R. E. Alcouffe, "Diffusion Acceleration Methods for the Diamond-Difference Discrete-Ordinates Equations," Nucl. Sci. Eng., 64, 344, 1977.
9. R. S. Baker and R. E. Alcouffe, "Parallel 3-D SN Performance for DANTSYS/MPI on the CRAY T3D," Proc. of the Joint Int'l Conf. on Mathematical Methods and Supercomputing for Nuclear Applications, Vol. 1, page 377, 1997.
10. M. R. Dorr and E. M. Salo, "Performance of a Neutron Transport Code with Full Phase Space Decomposition and the CRAY Research T3D," ???
11. R. S. Baker, C. Asano, and D. N. Shirley, "Implementation of the First-Order Form of the 3-D Discrete Ordinates Equations on a T3D," Technical Report LA-UR-95-1925, Los Alamos National Laboratory, Los Alamos, NM, 1995; 1995 American Nuclear Society Meeting, San Francisco, CA, 10/29-11/2/95.
12. M. R. Dorr and C. H. Still, "Concurrent Source Iteration in the Solution of Three-Dimensional Multigroup Discrete Ordinates Neutron Transport Equations," Technical Report UCRL-JC-116694, Rev 1, Lawrence Livermore National Laboratory, Livermore, CA, May 1995.
13. E. E. Lewis and W. F. Miller, Computational Methods of Neutron Transport, American Nuclear Society, Inc., LaGrange Park, IL, 1993.
14. R. E. Alcouffe, R. Baker, F. W. Brinkley, D. Marr, R. D. O'Dell and W. Walters, "DANTSYS: A Diffusion Accelerated Neutral Particle Transport Code," Technical Report LA-12969-M, Los Alamos National Laboratory, Los Alamos, NM, 1995.
15.
D. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. Schauser, R. Subramonian, and T. von Eicken, "LogP: A Practical Model of Parallel Computation," Communications of the ACM, 39(11):79-85, Nov. 1996.
16. H. J. Wasserman, O. M. Lubeck, Y. Luo and F. Bassetti, "Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications," Proceedings of SC97, IEEE Computer Society, November 1997.
17. C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. L. Hennessy, "The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors," Stanford University Computer Science Report CSL-TR-95-660, January 1995.

The DFN Gigabitwissenschaftsnetz G-WiN

Eike Jessen, DFN/TU München

Abstract. The German national scientific networking association, DFN, will provide a gigabit network (G-WiN) in spring 2000. The paper analyzes the history and trend of DFN network throughput, bandwidth, and cost, and the traditional and innovative load to be carried by the network is evaluated and forecast. Testbeds, which promote gigabit applications and pilot technology, are described. The current status of the G-WiN specification is compared to US projects.

1 Deutsches Forschungsnetz

G-WiN is the abbreviation for Gigabit-Wissenschaftsnetz. It will go into operation in spring 2000. It is provided to research and education in Germany by the Verein zur Förderung eines Deutschen Forschungsnetzes (DFN), an association of 400 members, mainly universities and polytechnics, research institutes and other technical and scientific institutions.
The purpose of DFN is to
- provide networking for research and education in Germany; DFN does so by specifying and procuring network services, and acts as a cooperative in the interest of its members
- promote efficiency and quality of research and education by innovative network usage

In 1998, DFN had a turnover of 174 Mio DM, of which 110 Mio DM were spent on data networking services, including the links from Germany to foreign countries. The cost of networking services is paid by the participating institutions, except for transient funding by the Federal Ministry of Education and Science, Research and Technology (BMBF) for the introduction of new network generations.

2 Evolution of Network Throughput and Cost

The first DFN Wissenschaftsnetz, the X.25 WiN, began its operation in 1990. Since then, the original network has been upgraded to access rates of 2 Mb/s (1992) and been widely replaced by the Breitbandwissenschaftsnetz B-WiN (broadband science network), with access rates of up to 155 Mb/s. These networks have been specified by DFN and are operated by Deutsche Telekom. The G-WiN will offer access rates of up to 2488 Mb/s. Fig. 1 shows the evolution of the annual average throughput and the bandwidth of the DFN networks. The bandwidth is the maximum throughput that can be carried by the network, given the existing configuration and routing and a traffic distribution as observed in current operation. The bandwidth has to surpass the average throughput by the peak-hour factor (1.8) and a burstiness surcharge factor (at least 2 to 3). In 1995/96 the X.25 WiN suffered from heavy congestion, as the average throughput approached the network bandwidth. Network average throughput has grown by factors of 2 to 5 per year; there was only one year, that of the introduction of B-WiN, with the high factor. For the future, one can expect a growth of 2 to 2.5 per year for the traditional network load. In Fig.
1, this leads to the forecast area marked by the two diverging straight lines, roughly a factor of 5 from summer 1998 to summer 2000. Besides, the network will enable its users to move to new forms of communication, principally characterized by high-bandwidth realtime applications. With large uncertainty, a further factor of 1.5 reflects this innovative usage (i.e. 2.5 times the capacity of the B-WiN for today). Summing up, G-WiN should have approximately 8 times the throughput in 2000, compared to B-WiN in mid-1998, or 4.5 Gb/s. Our estimates are conservative compared to estimates for future US nets. The diagram of Fig. 1 describes the cost of science nets in Germany (core network connecting 600 institutions) by lines of constant cost. The lines rise over time by a factor of 1.4 per year, i.e. networks of this class have, at constant bandwidth, become cheaper by a factor of 1.4 (or 30% lower cost) per year. This trend will accelerate in the years to come, and an additional factor of 2 per year seems credible in the next 3 years, because of excess fiber capacities in Germany (investment of the competing line and network providers), multiplex usage of existing fibers (wavelength division multiplexing, WDM), and effective competition on the market, which in the high-speed data communication area is only beginning to work. This specific forecast makes the iso-cost lines bend in 1998. The distance of the iso-cost lines describes the cost of bandwidth at constant time. It is slightly better (from the point of view of the network customer) than Grosch's law in the first two decades of the computer market: the customer pays only the square root of n times the price for an n-fold increase in power. This is an "economy of scale" effect.
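The throughput forecast above is plain compounding; a sketch of the arithmetic (the 2x to 2.5x annual growth and the 1.5x innovative-usage factor are the figures given in the text):

```python
years = 2  # mid-1998 to 2000

# Traditional load: 2x to 2.5x per year, compounded over two years
trad_low, trad_high = 2.0 ** years, 2.5 ** years  # 4.0 .. 6.25, "roughly a factor of 5"

innovative = 1.5  # extra factor for high-bandwidth realtime applications

total_low, total_high = trad_low * innovative, trad_high * innovative  # 6.0 .. ~9.4
# midpoint is close to the "approximately 8 times" cited, i.e. 4.5 Gb/s in 2000

# implied B-WiN throughput in mid-1998, ~0.56 Gb/s
bwin_mid98_gbps = 4.5 / 8
```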
During their history, the DFN science networks have moved up the slope of cost: for the 400-fold increase in bandwidth, only 5 times the cost is paid (core net only!), as a consequence of the trend over time and the progress to higher bandwidth, using the economy of scale. Extrapolating the analysis to the year 2000 and to the G-WiN with 8 times the bandwidth of B-WiN, the cost of G-WiN should be well under 2 times that of B-WiN, maybe 1.4 times. By the way, this means that the price per bit x kilometer will be 6 times lower, recommending G-WiN not only as a high-performance network for innovative usage, but also as a cheap medium for mass data transport.

Fig. 1: Evolution of annual average throughput, bandwidth, and iso-cost lines of the DFN science networks.

3 Network Load

Traditional network load will grow by a factor of 4 to 6 between spring 1998 and spring 2000. This seems to be well established by the experience of the past; see Fig. 1. The network will, however, enable its users to put a far more demanding load on the net, in particular high-bandwidth communication under realtime constraints. Fig. 2 gives a survey of the types and limiting factors of innovative load, of the resulting traffic volume in the network, and of the peak bandwidth per connection. Among the realtime communication types, metacomputing (distributed supercomputing) has the highest requirements for peak bandwidth per connection. Data may be fed into the network at speeds of 0.8 and 1.6 Gb/s. The number of concurrent computations in the network, however, will be small, perhaps 5, and the repetition frequency and volume of the communication bursts will be small, resulting in a medium throughput. There is competition with local computing; realistic bandwidth cost and security arguments will limit the usage.
4 Innovative Load

Visualisation and control of remote computing processes, executed on special and high-performance computers, will be widely used, but, under adequate compression techniques, will not produce high data throughput per connection. Even virtual reality can be reduced to multiple video and graphics channels with only moderate bandwidth (some Mb/s each) per connection. Interpersonal communication by network phone/video, as well as interpersonal cooperation, will be very frequent and therefore result in a high bandwidth demand, though the demand per connection is also moderate. Media servers, for instance for the distribution of distance learning and teaching multimedia material, will be few in number, but with high throughput, though widely not in realtime, in which case they do not belong in our survey of realtime innovative load. There is, however, a visible class of realtime high-bandwidth media communication, where studio media data are transported to remote processing systems and retransmitted in realtime, so that during the recording the processed signal can be checked and used for the control of the recording.

Type (realtime) | Factors limiting the usage | Network volume (total) | Peak bandwidth per connection
Distributed computation | Algorithms, economy, security | medium | high
Remote visualisation & control of computation | Competition with workstations | low | low ... high
Interpersonal communication & cooperation (phone, video, CSCW, virtual reality) | Terminal devices | medium | low ... medium
Media servers | Server technology, portable media | low | low ... medium
Signal processing | Competition with local processing | medium | low ... high

Type (non-realtime) | Factors limiting the usage | Network volume (total) | Peak bandwidth per connection
Update, distribution, backups, non-interactive visualisation, experiments | Competition with portable media | low | medium

Fig. 2: Classes of innovative network load and probable volume (in total) and peak bandwidth (per connection)

As the network has much lower costs for the volume x distance product, non-realtime mass data transports may be attractive, compared to the shipment of physical media. This will bring software and data distribution onto the network, as backups, experimental data and non-interactive visualisation output of remote computations, to an extent far greater than today. The limiting factor will be the progress in the technology of portable media. This kind of load may be considerable in volume; it does not define network peak performance, as it may be delayed.

5 Gigabit Testbeds (1997-2000)

DFN runs two gigabit testbeds (see Fig. 3) for application promotion and for technology piloting. The testbeds are to deliver experience for the specification of the Gigabitwissenschaftsnetz G-WiN. The first testbed was established in 1997 and is situated in North Rhine-Westphalia. It connects institutions in Jülich, Bonn, Cologne and Essen. It has been upgraded to 2.5 Gb/s and studies applications in metacomputing, remote visualisation, media processing and simulation. The problems come from physics, material sciences, geosciences, traffic control, and studio data processing. The second testbed was established in 1998 and connects Munich, Erlangen, and Berlin. It pilots wavelength division multiplexing (3 x 2488 Mb/s per fiber) and optical amplification. Its applications come from the same areas and from medicine and distance teaching.

6 G-WiN Specification

The general requirements for G-WiN are, as of late autumn 1998:
- a data communication infrastructure for German science, as the successor of B-WiN, i.e.
a full-scale network, comprising high-performance and commodity traffic
- built with components from the market (this seems achievable)
- a production network (not merely a research object), though probably beyond the state of the market in 2000
- roughly 10-fold bandwidth, compared to B-WiN 1998
- more configuration flexibility, to use advantageous regional opportunities and to offer more flexible conditions of usage to the participating institutions
- efficient IP-based communication; it is not clear in 1998 whether the network will, in the long run, base IP on ATM and SDH, as B-WiN (and vBNS, e.g.) has done with very good operational results, or base IP directly on SDH, as the Abilene network in the USA will do, or base the IP protocol stack directly on the "black" fiber, multiplexed by WDM (which can be combined with the first two options as well). The decision reflects mainly the tradeoff between bandwidth consumption, complexity, and bandwidth allocation granularity. In any case, the network is to cooperate with ATM-based access systems, and it is likely that G-WiN will start with ATM and skip it when the competing architectures are mature.
- guaranteed quality of service; it is, however, as of autumn 1998, by no means clear how this can be achieved; the solution is closely related to the ATM/SDH variants. B-WiN offers ATM permanent virtual channels as the only means of guaranteed service. ATM traffic classes still lack the embedding in application-oriented quality of service requirements. Besides, ATM will not generally be available on an end-to-end basis between the host computers. So there is much debate on schemes based fully on IP, such as RSVP and (more recently) MPLS (multi-protocol label switching). These techniques are, however, not yet well understood in actual operation. So, at least for now, the way of providing adequate quality of service to the flows in the network remains open.
- specification for potential line and network providers until the end of 1998; start of G-WiN operation in spring 2000

7 Comparison with other Scientific Networks

Fig. 4 puts the G-WiN into perspective with similar projects of the preceding generation and that of tomorrow. vBNS is structurally very similar to B-WiN, but operates on a higher level of bandwidth and a lower level of load; the distance between G-WiN and Abilene (one year ahead) will be smaller. Both US scientific nets are mainly (80%) funded by sponsors; the participating institutions see only a small percentage of the actual network costs. vBNS and Abilene are defined not to carry the lower-bandwidth commodity traffic, which is deliberately left to commercial providers, though it probably could be carried by the broadband networks at lower cost. B-WiN connects a much larger number of institutions (even neglecting the 4000 users of the B-WiN dial-up service WiNShuttle).

Gigabit Testbeds (1997-2000): technology piloting and application promotion

              West                                    South
Sites:        Jülich, Bonn, Cologne, Essen ... ?      Erlangen, München, Berlin, Stuttgart (?)
Start:        August 97                               July 98
Lines:        0.6, 2.5 Gb/s                           0.6, 2.5 Gb/s WDM
Provider:     o.tel.o                                 DTAG
Technology:   HiPPI, ATM/SDH (2 Gb/s ATM achieved!)   ATM/SDH
Applications: Metacomputing + Visualisation           Distrib. TV Prod./Server
              (Molecular Dynamics, Earth Shell);      Virtual Laboratory
              Media Processing                        Distance Learning
              (Distrib. virtual TV Prod.);            Medicine
              Simulation + Visualisation
              (Traffic, Black Holes, Surface Effects)

Fig. 3: DFN Gigabit Testbeds (1997-2000)

Network            vBNS             B-WiN          Abilene                 G-WiN
Operational since  1996             1996           1999?                   2000?
Ordered by         NSF              DFN            UCAID                   DFN
Provider           MCI              DTAG e.a.      Qwest, UCAID            to be decided
Paid by            MCI/NSF/instit.  instit./BMBF   Qwest/instit.           instit./BMBF
Protocols          IP/ATM/Sonet     IP/ATM/SDH     IP/Sonet                IP/?
Sites              71               600            120                     600
Usage restriction  "projects"       (science)      (no commodity traffic)  (science)
Trunks             622 Mb/s         41...94 Mb/s   2.5 Gb/s                0.6/2.5 Gb/s
Access rates       155/622 Mb/s     2...155 Mb/s   155/622 Mb/s            2...622 Mb/s
QoS                SVCs, PVCs       PVCs           MPLS                    SVCs, PVCs, MPLS?

Fig. 4: Comparison between G-WiN and other scientific networks.

Abbreviations:
ATM: Asynchronous Transfer Mode
B-WiN: Breitbandwissenschaftsnetz
DTAG: Deutsche Telekom AG
MPLS: Multiprotocol Label Switching
NSF: National Science Foundation
SDH: Synchronous Digital Hierarchy
S/PVC: Switched/Permanent Virtual Circuit
UCAID: University Corporation for Advanced Internet Development
vBNS: very high-speed Backbone Network Services

On Network Resource Management for End-to-End QoS*

Ibrahim Matta
Computer Science Department
Boston University
Boston, MA 02215, USA
matta@cs.bu.edu

Abstract. This article examines issues and challenges in building distributed Quality-of-Service (QoS) architectures. We consider architectures that facilitate cooperation between the applications and the network so as to achieve the following tasks. Applications are allowed to express in their own language their QoS (performance and cost) requirements. These application-specific requirements are communicated to the network in a language the network understands. Resources are appropriately allocated within the network so as to satisfy these requirements in an efficient, scalable, and reliable manner. Furthermore, the applications and the network have to cooperate "actively" (or continually) to be able to achieve these tasks under time-varying conditions.

1 Introduction

Advanced network-based applications have received a lot of attention recently. An example of such applications is depicted in Figure 1. Here, a number of scientists are collaborating in a videoconference to control a distributed simulation. This application requires distributing simulator output to the scientists, the exchange of audio and video signals among them, etc. Such an application demands some high-quality services from the network.
For example, the simulator data should be distributed to the scientists quickly and without loss. The audio and video signals should be transmitted to the other participants in a timely and regular fashion to ensure interactivity. The network in turn should provide these services while utilizing the network resources efficiently (to maximize revenue). Also, the network should deliver these services in a scalable and reliable way.

In this article, we examine architectures designed to provide such advanced applications with varying QoS. We start by discussing in Section 2 traditional network services that do not provide QoS support, as well as traditional applications that do not express their QoS or are not aware of the QoS that the network is providing them. We argue why the network and applications need to be aware of (and sensitive to) QoS, and we discuss in Section 3 how QoS support is achieved. Then, for these QoS-aware applications and network to cooperate, we present in Section 4 a generic integrated architecture, describe its components, and discuss its main features. These features include the scalability of the architecture to large networks and to large numbers of applications or users, as well as its robustness to various dynamics. We conclude in Section 5.

* This work was done while the author was with the College of Computer Science of Northeastern University. The work was supported in part by NSF grants CAREER ANIR-9701988 and MRI EIA-9871022.

Fig. 1. Example of an advanced network-based application.

2 QoS-oblivious Architectures

In this section, we discuss traditional network architectures and their shortcomings due to their lack of QoS support.
2.1 QoS-oblivious Network Services

Traditionally, a network such as the current Internet provides only a best-effort service, which means that the data which applications send can experience arbitrary amounts of loss or delay. On the positive side, the network need only implement simple mechanisms for traffic control. For example, any application can access (or send data over) the network at any time. Network devices (switches or routers) can serve packets carrying application data in a simple first-come-first-served fashion. Packets can be routed to their destination over paths that are optimal with respect to a single metric, for example paths that have the minimum number of links (or that traverse the least number of routers). With these mechanisms, however, the application's QoS may not be met. For example, shortest-distance paths may be loaded and not provide the least delay, and so may not satisfy the requirements of delay-critical applications. This kind of best-effort network has served traditional data applications (e.g. Telnet, FTP, E-mail), which have relatively modest QoS requirements, quite well.

More advanced applications need additional support, for example for multicasting a data packet to many destinations. The network should then be able to establish a delivery tree rooted at the source and whose branches lead to the various destinations. In the case of many senders, the network can build one tree for each sender. This is what current multicast protocols like DVMRP (Distance Vector Multicast Routing Protocol) [27] and MOSPF (Multicast Open Shortest Path First) [23] do. The tree usually consists of paths with minimum delay from the source to each destination. Figure 2 shows an example three-node network with two senders S1 and S2 and with a receiver at each node.
A packet from S1 is replicated on both links to reach the two other nodes/receivers; similarly for S2. So this multicast uses 4 links, i.e. the total cost (in number of links) is 4. This cost may reflect the overhead of replication or the amount of bandwidth consumed by this communication group. Also, each packet experiences a maximum delay of traversing 1 link.

Fig. 2. Sender-based multicast trees. Cost = 4, Delay = 1.

Other multicast routing protocols would build a single shared tree that spans all members of the group. A major objective of such a protocol is to minimize the sum of the link costs, i.e. to build a so-called "Steiner tree." Since it is well known that finding such a minimum-cost tree is computationally expensive (NP-complete) [11], protocols usually implement heuristics that are less expensive and give close-to-optimal trees. With such trees, the goal is to minimize the cost of replication and bandwidth, possibly at the expense of higher delays from a source to a destination. Figure 3 shows a shared tree. Here, the data packet from S2 needs to be generated only once, as opposed to being replicated in Figure 2 when source-rooted trees are used. This shared tree uses only 3 links, as opposed to 4 links with source-rooted trees. However, the packet from S2 experiences a maximum delay of traversing 2 links, as opposed to 1 link with source-rooted trees. Some protocols like PIM-sparse [8] and CBT (Core Based Tree) [3] try to achieve a balance between cost and delay by having a single shared tree with the root at a center node and minimum-delay paths built from the center to the members of the multicast group.

Fig. 3. A shared multicast tree. Cost = 3, Delay = 2.

One can easily see that, regardless of what type of tree a best-effort multicast routing protocol builds, this tree may not be appropriate.
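The trade-off between per-sender trees and a single shared tree can be reproduced in a small sketch. The topology and trees below are illustrative (a triangle with unit link costs), not the exact networks of the figures:

```python
# Illustrative sketch: triangle network A-B-C, unit link costs,
# senders A and B, a receiver at every node.
from collections import deque

def max_delay(root, edges, receivers):
    """Max hop count from root to any receiver along the tree edges."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, []).append(v)
        nbrs.setdefault(v, []).append(u)
    dist, q = {root: 0}, deque([root])
    while q:                      # BFS over the tree
        u = q.popleft()
        for v in nbrs.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist[r] for r in receivers)

# One shortest-path tree per sender: each sender reaches the others directly.
source_trees = {"A": [("A", "B"), ("A", "C")],
                "B": [("B", "A"), ("B", "C")]}
# One shared tree (a path through C) used by both senders.
shared_tree = [("A", "C"), ("C", "B")]

cost_source = sum(len(t) for t in source_trees.values())   # 4 links occupied
cost_shared = len(shared_tree)                             # 2 links occupied
delay_source = max(max_delay(s, t, set("ABC") - {s})
                   for s, t in source_trees.items())       # 1 hop
delay_shared = max(max_delay(s, shared_tree, set("ABC") - {s})
                   for s in ("A", "B"))                    # 2 hops
```

As in the figures, the shared tree occupies fewer links (less replication and bandwidth), at the price of a higher worst-case delay.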
This depends on several factors, such as the location of the group members, the layout (topology) of the network, the requested QoS, etc. Figure 4(a) shows a case where a source-based tree should be built rather than a shared tree (with boldfaced links), as it costs less (using 2 links as opposed to 3). Figure 4(b) shows a case where a shared tree achieves lower cost (using 2 links as opposed to 3). This illustrates the need for adaptive network services that establish appropriate structures so as to efficiently utilize the network resources as well as satisfy the QoS requirements of applications. We later discuss such QoS-sensitive multicast routing protocols.

Fig. 4. (a) A source-based tree costs less; (b) a shared tree costs less. S: source; R1, R2: receivers.

2.2 QoS-oblivious Applications

Just as we traditionally had networks that are insensitive to various parameters of the state of the network and applications, we traditionally had applications that are insensitive to the state of the network and to the kind of QoS they are getting from the network. A major consequence of this is that the application can experience "arbitrary" QoS. For example, under loaded network conditions, a video application can start losing its frames and suffering arbitrary degradation in quality. If the application had instead adapted its coding strategy to further compress its data and thus sent fewer frames, this may have increased the likelihood that the transmitted frames make it through the network. As a result, the application would get a consistent service (although of lesser quality due to compression). Another example of a QoS-oblivious application is one that "arbitrarily" assigns a distributed computation over the network. Figure 5 shows a video example where the decoder (receiver) can decode either MPEG- or JPEG-coded streams.
MPEG is more expensive to decode than JPEG. Thus, if there are enough computation cycles and communication is expensive, we could just send MPEG-coded video and have a matching MPEG decoder, with the video traffic routed over the shortest (one-link) path. On the other hand, if there are not enough cycles and communication is cheap, we could use the decoder in JPEG mode to reduce computation cost and transparently insert an intermediate hardware-based (computationally inexpensive) transcoder that translates from MPEG format to JPEG format. Here, video traffic is routed through the transcoder over a longer (two-link) path, which is acceptable since communication is assumed to be cheap. This example illustrates the advantages of applications that can adapt their operation mode based on the state of the system.

Fig. 5. Video transport example: an MPEG source, a display decoding MPEG or JPEG, and an optional MPEG-to-JPEG transcoder, for the cases of cheap computation with costly communication and costly computation with cheap communication.

3 QoS-sensitive Architectures

Now that we have argued for the flexibility and potential benefits of QoS-sensitive networks and applications over traditional ones, we next discuss various issues and challenges that must be addressed to implement them.

The network has to distinguish among traffic streams (or flows) requiring different QoS. A flow generally defines a stream of data packets that belong to the same application or to a pre-defined aggregated traffic class. The work of several standardization groups, such as the IETF integrated-services [6] and differentiated-services [4] working groups and the ATM Forum Traffic Management working group [14], is based on allocating each traffic flow a given (absolute or relative) share of resources (bandwidth, buffers, etc.).
This provides different flows with different services: better service (but typically more expensive) to higher-priority flows, at the expense of worse service (and usually cheaper) to lower-priority flows.

Figure 6 shows a general architecture of a QoS network. A QoS manager is responsible for receiving requests for some QoS through some signaling protocol, such as the Internet RSVP protocol [7] or the ATM signaling protocol. The manager communicates with a routing component to find the outgoing link(s) or path that can likely satisfy the request and over which the flow would be routed. This path selection is typically based on an outdated view of the state of the network. The manager then communicates with an admission control component, which decides whether there are indeed enough resources on the selected link(s) to satisfy the given QoS without violating the QoS already promised to existing flows. If the flow request is accepted (admitted), the QoS manager installs appropriate state in other components: a classifier that recognizes packets belonging to the flow; a route lookup component that forwards the flow's packets over the selected path; and a shaper (or dropper) that shapes the flow (or drops excess traffic) according to the initial traffic specifications declared by the request and based on which the resource allocations have been made. Finally, the manager has to set the parameters of the scheduler to allocate to the flow the resources that are needed to satisfy the requested QoS.

Fig. 6. A general architecture of a QoS network (QoS manager, routing, admission control, classifier, route lookup, shaper/dropper, scheduler).

Figure 7 shows the architecture of a general scheduler and a traffic shaper. The scheduler isolates different traffic flows by allocating them fractions of the link's resources (bandwidth, buffer space).
The traffic of a flow is shaped before entering the scheduler to compete for resources. The shaper shown is called a "token bucket shaper." Tokens accumulate (fill the bucket) at some specified rate. A packet is allowed to enter the scheduler only if there is one or more tokens to drain. The depth of the bucket allows the flow to burst (i.e. packets can enter back-to-back), but then the rate at which data enters so as to contend for resources is bounded by the rate at which tokens are generated. This bounded traffic specification makes it possible to test whether a new flow can be supported by considering its worst-case behavior.

Fig. 7. A general scheduler with a traffic shaper.

As for routing (multicast) traffic, the routing protocol should choose a path that satisfies the QoS requirements of (all) the receiver(s). Thus, in multicasting, the type of multicast tree that should be constructed depends on the state of the network. Consider the example in Figure 8, where the numbers shown reflect the maximum link delays in the presence of a three-participant application, two participants of which send and receive at nodes A and C. Assume the application's QoS requirements are a maximum end-to-end delay of 13 from a sender to any receiver, and a jitter (maximum difference between individual end-to-end delays) of 7. A shared tree would violate the application's delay and jitter requirements, since the maximum end-to-end delay is 20 and the jitter is 10. Thus, under this network state, source-based trees should be constructed. Figure 9 shows the network in a different state, where a shared tree should be constructed.

Fig. 8. (a) Original network, (b) shared tree, (c) source-based trees. Example where source-based trees should be constructed.
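Returning to the token-bucket shaper of Fig. 7, its admission rule can be sketched as follows (a minimal sketch driven by explicit timestamps; the rate and depth values are illustrative):

```python
class TokenBucket:
    """Token-bucket policing: tokens accumulate at `rate` per second,
    up to `depth`; a packet enters the scheduler only if it can drain
    a token, so bursts are bounded by the bucket depth."""
    def __init__(self, rate, depth, t0=0.0):
        self.rate, self.depth = rate, depth
        self.tokens, self.last = depth, t0

    def admit(self, now):
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # packet may contend for resources
        return False       # out of profile: delay or drop the packet

tb = TokenBucket(rate=10, depth=3)             # 10 tokens/s, burst of 3
burst = [tb.admit(now=0.0) for _ in range(4)]  # [True, True, True, False]
later = tb.admit(now=0.1)                      # one token regenerated: True
```

The resulting worst-case arrival over any interval t is bounded by depth + rate * t, which is exactly the bound admission control can test a new flow against.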
Clearly, to be QoS-sensitive, protocols have to account for the various dynamics at all levels: application, host, and network. Protocols have to be aware of the current participants in the application. Who communicates with whom? Is a participant aware of the quality of her transmission so she can adapt to varying qualities? How is data generated during this transmission? What are the delivery requirements? How many resources are available at the hosts (or end-systems) and in the network (switches/routers)? What does the layout (topology) of the network look like now? By accounting for all these dynamics, we would have a system that is capable of satisfying a wide variety of QoS requirements for diverse applications at all times, while operating efficiently.

Fig. 9. (a) Original network, (b) source-based trees, (c) shared tree. Example where a shared tree should be constructed.

Going back to multicast routing as an example of network control, the system may migrate from one multicast tree to another in response to changes at the application, host, or network level. Figure 10 shows an example: if the multicast protocol builds a single shared tree, then 5 links are used, and only S2's packet needs to be replicated. Also, if, say, each source transmits at x bits/sec, then to guarantee the rate at which R2 receives from both senders, we need to reserve 2x bits/sec on link (S2, R2). If the available bandwidth on (S2, R2) were less than 2x, then a QoS network would not admit this application.
However, if the multicast routing protocol adapts its tree construction mechanism and switches to source-based trees, this application can be admitted, since the traffic is now better distributed over the network. Thus, it is sometimes beneficial to switch to building a new type of tree. Building source-based trees in the previous configuration allowed the application to be admitted, although at the expense of more replication and more routing state information, as we have to maintain 2 trees (one for each source) as opposed to 1 tree. This example also contradicts the common view that a minimum-cost shared tree reduces the amount of bandwidth consumed compared to sender-based trees; especially when we have multiple senders, the path from some sender to a receiver may turn out to be very long. Thus, a QoS system must employ a more intelligent tree construction strategy that adapts the shape of the tree dynamically, where we can trade off between the revenue from QoS support and the cost of the overhead to provide this support. We present an example of such a strategy later.

Fig. 10. (a) Original network, (b) shared tree (cost = 5), (c) source-based trees (cost = 4). Example illustrating the benefits of adaptive multicast tree construction.

4 Integrated QoS Architecture

The objective is to build an architecture that allows protocols to adapt their behavior so as to account for various parameters and dynamics. The architecture should facilitate the exchange of information between the applications and the network; the application should express an acceptable QoS region and be able to adapt its behavior to one that matches the level of QoS that the network currently delivers.
The network should of course communicate that QoS level to the application. The goal is to efficiently utilize the network (or maximize its revenue) under the QoS constraints imposed by existing applications.

Figure 11 shows a generic integrated QoS architecture. In this architecture, applications express their QoS requirements in application-specific terms to an application-specific QoS manager. This manager understands the various attributes of a specific type of application, and maps the application-dependent requirements into application-independent and implementation-independent requirements. A host QoS manager maps these in turn into implementation-dependent requirements, so that the necessary resources are allocated to the application by the operating system and network subsystem at the hosts as well as by router QoS managers within the network. QoS managers communicate with their peers or directly with their neighbor managers to coordinate the allocation of resources. Router QoS managers control the allocation of paths within the network by communicating with routing managers, which may use different types of routing protocols to locally or globally build paths that are capable of satisfying the QoS requirements of applications.

Fig. 11. Integrated QoS architecture.

Such an architecture allows QoS to be supported between the endpoints of communication (i.e. the applications) in the presence of various dynamics. The QoS is controlled in a manner sensitive to the specifics of the application, so that the system can successfully and efficiently deliver the targeted QoS. The applications and the network are allowed to exchange information so as to adapt mutually to system dynamics.
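The layered mapping performed by these managers can be sketched as follows. The manager functions and all numeric mappings below are invented for illustration; they are not taken from the article:

```python
def video_qos_manager(frames_per_s, bits_per_frame, interactive):
    """Application-specific manager (hypothetical): translates the
    application's own vocabulary into application-independent,
    implementation-independent QoS requirements."""
    return {"bandwidth_bps": frames_per_s * bits_per_frame,
            "max_delay_ms": 100 if interactive else 500}

def host_qos_manager(qos):
    """Host manager (hypothetical): maps QoS requirements onto
    implementation-dependent parameters, e.g. token-bucket settings
    and a scheduler priority level."""
    return {"token_rate_bps": qos["bandwidth_bps"],
            "bucket_depth_bits": qos["bandwidth_bps"] // 10,  # ~100 ms burst
            "priority": 0 if qos["max_delay_ms"] <= 100 else 1}

# 25 frames/s of 40 kbit frames, interactive -> 1 Mb/s, 100 ms bound.
req = video_qos_manager(frames_per_s=25, bits_per_frame=40_000,
                        interactive=True)
params = host_qos_manager(req)   # high-priority class, 1 Mb/s token rate
```

The point of the indirection is that only the first function needs to know it is dealing with video; everything below it works with generic bandwidth/delay requirements.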
In the following subsections, we elaborate on the delivery of targeted (application-oriented) QoS and on the application's capability to be aware of (and adaptive to) the system state; as an example of network control, we also elaborate on building multicast routing trees that are sensitive to QoS. We finally discuss mechanisms that should be in place in order to scale to large networks and to provide stability and reliability in the presence of changes in the state of the system.

4.1 Application-oriented QoS Mappings

An end-to-end QoS architecture has to deal with mapping the logical view of the applications to a physical allocation of resources. This involves application-specific QoS managers that take as input abstract QoS requests, such as an application input graph with various tasks and the interdependencies between them. Through a protocol to discover the state and location of physical resources (this knowledge is typically outdated), the QoS manager produces as output the needed physical resources that are likely to satisfy the application's QoS in an efficient manner. See Figure 12. This in turn involves host QoS managers that communicate with other control entities at hosts and within the network, using some signaling protocol, to finally allocate the physical resources.

Fig. 12. Application-oriented QoS mapping (resource request handling, resource discovery, resource selection and optimization).

Application-oriented Resource Selection: Given a (typically outdated) view of the system, which physical resources should be selected to satisfy an application-level QoS request? The answer depends on the nature of the application. For example, consider an application that is real-time in nature and requires the scheduling of real-time tasks on a set of hosts. A real-time task needs to be reserved some amount of CPU cycles to meet a deadline.
The host that is selected may end up rejecting the assigned task if it finds that it does not currently have enough cycles (capacity). For this application, the major objective is to minimize the task rejection rate. A traditional load distribution scheme is "load balancing," whose goal is to equalize the load over the candidate hosts. However, as shown in Figure 13, this load-balancing strategy may result in capacity fragmentation and thus higher rejection. So, although load balancing is adequate for providing best-effort QoS (optimizing average measures), it is not adequate for providing guaranteed QoS (optimizing real-time measures). A more appropriate scheme is load profiling [20]. The idea here is to maintain a more diverse profile of available capacity on the candidate hosts, so that we increase the likelihood of finding a feasible host for future requests.

Fig. 13. Load balancing versus load profiling.

Figure 14 shows how, after choosing the least-loaded host (with idle capacity of 15) for a class-1 task, we can then only accept 4 consecutive class-2 tasks. On the other hand, if we choose the most-loaded host (with idle capacity of 11), we can then accept 5 consecutive class-2 tasks, as we will not have fragmented capacity in the system. Choosing the most-loaded candidate is a "load packing" strategy, which is only asymptotically optimal for large systems with accurate feedback about the system state. In a distributed system with delayed (inaccurate) feedback, a strategy that has the same effect but operates probabilistically is less sensitive to the inaccuracies in the feedback information and is more appropriate. We call it the "load profiling" strategy.

Fig. 14.
Example illustrating the difference between (a) load balancing and (b) load packing/profiling.

The main idea behind load profiling is illustrated in Figure 15: the probabilities of selecting each candidate resource are adjusted so as to bring the distribution of QoS requests as close as possible to the distribution of available capacity. This is the well-known supply-demand matching problem.

The gain from load profiling (due to reduced fragmentation) is more significant when we have large requests, which is especially the case when a request represents the aggregate of many micro-requests. This gain over load balancing is also more pronounced as the system becomes more loaded. Extended models that consider the lifetimes of tasks, the costs of migrating tasks, etc. require more careful and more complicated analysis. In summary, resource selection is an important and difficult problem: how to select resources so as to optimize some application-oriented measure(s) subject to QoS constraints and possibly other constraints on the type of resources, the interdependencies between tasks to be assigned, etc. This is a multi-constrained optimization problem that needs fast heuristics that can produce high-quality solutions.

Fig. 15. Maintaining a resource availability profile that matches the characteristics of QoS requests.

4.2 Network-aware Applications

To make resource selection easier, applications should specify a range of acceptable QoS if possible. This has many other benefits, most importantly that if the requested QoS cannot be delivered by the network in a strictly guaranteed manner, then the application can adapt to the currently delivered QoS in a controlled way.
For example, the application could send less data, which then has a better chance of making it through the network, so the application gets a consistent (although lesser-quality) service. To do this, application-specific QoS managers need to maintain for applications the QoS measures of interest to them, and inform applications about the current QoS operating point so that applications can adapt accordingly. Applications could also maintain quality by compensating for QoS violations. For example, knowing the loss rate at the receiver, a video source could adjust its FEC (Forward Error Correction) error recovery scheme to compensate for errors. QoS managers could also try to hide QoS violations from applications, for example by reducing latency through caching or prefetching requested data, by overlapping communication and computation, by migrating processes to where the data resides, etc.

4.3 QoS Multicast Tree Construction

Another important component of an end-to-end QoS architecture is efficient network services that are sensitive to the QoS requested by applications. One important service is multicast routing. A major goal here is to build a multicast delivery tree of minimum cost which satisfies the QoS delivery constraints. Again, this is a multi-constrained optimization problem, and we need good and fast heuristics.

An example heuristic is QDMR (QoS Dependent Multicast Routing) [15]. A nice feature of this heuristic is that it constructs a low-cost tree using a greedy strategy that augments the partially constructed tree with nodes of minimum cost. However, since this can lead to paths that violate the QoS delay bound, the tree construction policy is adapted on the fly to give up some cost savings so as to increase the likelihood of satisfying the QoS delay bound. Figure 16 illustrates the idea.
Fig. 16. QoS-aware multicast tree construction (delay bound = 6; each link is labeled c/d for cost/delay; legend: source node, non-destination nodes, destination nodes, network links, tree links, removed links, added least-delay links).
(a) Low-cost tree construction: Cost(v) = C(u,v) if u is a receiver, and Cost(v) = Cost(u) + C(u,v) otherwise; TreeCost = 11.
(d) QDMR: Cost(v) = (Delay(u)/DelayBound) * Cost(u) + C(u,v) if u is a receiver, and Cost(v) = Cost(u) + C(u,v) otherwise; TreeCost = 8.

Figure 16(a) shows a low-cost tree construction policy. The cost of a new node v, denoted Cost(v), is defined in terms of Cost(u), the cost of the node u that is already on the tree, and C(u,v), the cost of the link from u to v. Cost(u) does not contribute to Cost(v) if u is a receiver. The idea is to give priority to tree paths going through destination nodes, so that they are extended to add new nodes (as they would likely have lower cost). By leveraging the cost of reaching one destination to reach other destinations, the total cost of the tree is lowered. This, however, may violate the requested delay bound (which is the case for destination nodes D3 and D4). In Figure 16(b), the (sender-based) tree of least-delay paths is shown, which would satisfy the delay bound if this were indeed feasible; however, the tree cost is high. In Figure 16(c), we modify the tree in (a) by replacing the infeasible paths for destinations D3 and D4 with the corresponding least-delay paths, so as to obtain a tree that satisfies the delay bound. However, QDMR can generate a lower-cost feasible tree, as shown in Figure 16(d).
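The two node-cost rules of Fig. 16 can be written out as follows. This is a sketch of the cost functions only, after [15]; the full algorithm greedily adds the node of minimum cost to the partial tree:

```python
def cost_low(cost_u, c_uv, u_is_receiver):
    """Fig. 16(a): extending a path through a receiver is cheap,
    which lowers tree cost but can stretch delays."""
    return c_uv if u_is_receiver else cost_u + c_uv

def cost_qdmr(cost_u, c_uv, u_is_receiver, delay_u, delay_bound):
    """Fig. 16(d), QDMR: the receiver discount fades as the path's
    accumulated delay approaches the bound, yielding a bushier tree
    that meets the bound at lower cost."""
    if u_is_receiver:
        return (delay_u / delay_bound) * cost_u + c_uv
    return cost_u + c_uv

# Far from the delay bound, QDMR behaves like the low-cost rule:
near = cost_qdmr(cost_u=5, c_uv=2, u_is_receiver=True,
                 delay_u=0, delay_bound=6)   # 2.0, same as cost_low
# Close to the bound, the discount disappears (plain accumulated cost):
far = cost_qdmr(cost_u=5, c_uv=2, u_is_receiver=True,
                delay_u=6, delay_bound=6)    # 7.0
```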
The cost of a new node to be added to the current tree depends on how far we are from violating the delay bound: paths through destination nodes are no longer given priority as we get closer to violating the bound. This makes the tree "bushier," and the delay bound is satisfied at a lower cost.

4.4 Scalability

Scalability is another important aspect of an end-to-end QoS architecture, especially for large wide-area systems. One main goal is to reduce the view a QoS manager has of the state of the system, based on which it schedules resources (hosts, paths, etc.). This manager could be a sender, a receiver, or an agent acting on behalf of the application, depending on where resource selection/allocation is done. A key to scalability is to separate the way the view is collected from how resources are selected. One way is to have pre-defined classes of applications and to collect class statistics to attach to the view, as opposed to statistics about individual applications (for example, the total capacity used by a class rather than the individual capacities used by each application). Another approach is to have special control entities, called view-servers [1], where each view-server maintains only a small view of its surrounding area, as opposed to a full view of the whole system. If a larger area is needed, more than one view-server can be queried and their views merged. Figure 17 illustrates the idea of view-servers.

Another, more traditional, approach is area-based, where nodes are grouped into level-1 areas, level-1 areas are grouped into level-2 areas, etc. See Figure 18.
This is, for example, the scaling approach used in PNNI ATM routing [13]. The idea is that a node has only a detailed view of its own area, and less detailed (summarized or aggregated) views of remote areas, i.e. summarized views of the level-1 areas in the same level-2 area, of the level-2 areas in the same level-3 area, and so on. Different schemes can be used to aggregate an area.

Fig. 17. Viewserver hierarchy.

For example, an area may be represented by a fully connected logical graph connecting all B border nodes (those nodes connecting the area to other areas), by a logical star graph with a virtual node in the center, or by a logical single node. The accuracy of the view, which is presented to the resource selection process, decreases with more aggressive aggregation, at the benefit of less overhead.

4.5 Robustness

Reliability is another important aspect of an end-to-end QoS architecture. It involves replication of important control entities to survive their failures. It involves avoiding oscillations between alternative configurations that would arise because of the performance interdependencies among the various applications; this could happen if we blindly honor QoS requests, violating existing QoS promises which are then reinstated by again violating other promises. Robustness also involves reliable switchover to new configurations. For example, switching to a new multicast tree may require keeping transmission over the old tree until the new tree is fully established and the new QoS can be reliably delivered.

5 Conclusion

This article surveys some of the grand challenges in building integrated end-to-end QoS architectures. As we have seen, this involves a plethora of issues: finding fast and good heuristics, defining secure interfaces between different components, investigating the interactions between these components horizontally and vertically, etc.
Another important issue is how to develop such complex software in an easy and reusable way. One recent approach is aspect-oriented programming [18], which differs from the traditional object-oriented approach in that different aspects of the application or protocol, such as communication, core behavior, structure, etc., are not tangled together, which makes maintenance much easier.

Fig. 18. (a) Area hierarchy; (b) aggregation of area C: full-mesh (O(B²) overhead), star (O(B) overhead), and simple-node (O(1) overhead) representations of the B border nodes.

Research and development efforts to build an end-to-end QoS global system are necessarily multi-disciplinary. For such a system to become reality, solutions to different problems have to be integrated into a flexible QoS architecture, and the overall performance and cost be evaluated. Initiatives such as Internet2 [17] and NGI (Next Generation Internet) [16] are providing the infrastructure to deploy and test such advanced architectures.

References

1. C. Alaettinoglu, I. Matta, and A.U. Shankar. A Scalable Virtual Circuit Routing Scheme for ATM Networks. In Proc. International Conference on Computer Communications and Networks (ICCCN '95), pages 630-637, Las Vegas, Nevada, September 1995.
2. C. Aurrecoechea, A. Campbell, and L. Hauw. A Survey of QoS Architectures. ACM/Springer Verlag Multimedia Systems, Special Issue on QoS Architecture, May 1998.
3. A. Ballardie, P. Francis, and J. Crowcroft. Core Based Trees. In Proc. SIGCOMM '93, San Francisco, California, September 1993.
4. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for Differentiated Services. RFC 2475, December 1998.
5. J-C. Bolot. Adaptive Applications Tutorial. ADAPTS BOF, IETF Meeting, Washington DC, December 1997.
6. B. Braden, D. Clark, and S. Shenker.
Integrated Services in the Internet Architecture: An Overview. Internet Draft, October 1993.
7. B. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin. Resource ReSerVation Protocol (RSVP) - Version 1 Functional Specification. Internet Draft, March 1996.
8. S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, and L. Wei. Protocol Independent Multicast (PIM): Protocol Specification. Internet Draft, 1995.
9. S. Fischer, A. Hafid, G. Bochmann, and H. de Meer. Cooperative QoS Management for Multimedia Applications. In Proc. Fourth IEEE International Conference on Multimedia Computing and Systems (ICMCS '97), pages 303-310, June 1997.
10. I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Intl. J. Supercomputing Applications, 11(2):115-128, 1997.
11. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York, 1979.
12. A. Grimshaw, W. Wulf, and the Legion Team. The Legion Vision of a Worldwide Virtual Computer. Communications of the ACM, 40(1), January 1997.
13. ATM Forum PNNI Subworking Group. Private Network-Network Interface Specification v1.0 (PNNI 1.0). Technical report, March 1996.
14. ATM Forum Traffic Management Working Group. ATM Forum Traffic Management Specification v4.0. Technical report, 1996.
15. L. Guo and I. Matta. QDMR: An Efficient QoS Dependent Multicast Routing Algorithm. Technical Report NU-CCS-98-05, College of Computer Science, Northeastern University, Boston, MA 02115, August 1998. To appear in IEEE RTAS '99.
16. NGI (Next Generation Internet). http://www.ngi.gov.
17. Internet2. http://www.internet2.edu/.
18. G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J-M. Loingtier, and J. Irwin. Aspect-Oriented Programming. In Proc. European Conference on Object-Oriented Programming, pages 220-242. Springer Verlag, 1997.
19. J. Kurose.
Open Issues and Challenges in Providing Quality of Service Guarantees in High-Speed Networks. ACM Computer Communication Review, January 1993.
20. I. Matta and A. Bestavros. A Load Profiling Approach to Routing Guaranteed Bandwidth Flows. In Proc. IEEE INFOCOM, March 1998. Extended version in European Transactions on Telecommunications, Special Issue on Architectures, Protocols and Quality of Service for the Internet of the Future, February-March 1999.
21. I. Matta and M. Eltoweissy. A Scalable QoS Routing Architecture for Real-Time CSCW Applications. In Proc. Fourth IEEE Real-Time Technology and Applications Symposium (RTAS '98), June 1998.
22. I. Matta, M. Eltoweissy, and K. Lieberherr. From CSCW Applications to Multicast Routing: An Integrated QoS Architecture. In Proc. IEEE International Conference on Communications (ICC '98), June 1998.
23. J. Moy. Multicast Extensions to OSPF. Internet draft, Network Working Group, September 1992.
24. K. Nahrstedt and J. Smith. The QoS Broker. IEEE Multimedia, 2(1):53-67, Spring 1995.
25. P. Steenkiste, A. Fisher, and H. Zhang. Darwin: Resource Management in Application-Aware Networks. Technical Report CMU-CS-97-195, Carnegie Mellon University, December 1997.
26. H. Topcuoglu, S. Hariri, D. Kim, Y. Kim, X. Bing, B. Ye, I. Ra, and J. Valente. The Design and Evaluation of a Virtual Distributed Computing Environment. Cluster Computing, 1(1):81-93, 1998.
27. D. Waitzman, C. Partridge, and S. Deering. Distance Vector Multicast Routing Protocol. RFC 1075, November 1988.
28. J. Zinky, D. Bakken, and R. Schantz. Architectural Support for Quality of Service for CORBA Objects. Theory and Practice of Object Systems, 3(1):55-73, 1997.

A Prototype of a Combined Digital and Retrodigitized Searchable Mathematical Journal

Gerhard O. Michler
Institute for Experimental Mathematics, Essen University, Ellernstr.
29, 45326 Essen, Germany

1 Introduction

Many mathematical journals now appear both in digital and in paper form. The digital version offers many advantages. It is searchable, and in due course it will be technically possible for a quoted article to be retrieved and partially shown in a second window on screen, provided it is part of a searchable digital library system. Over scientific wide area networks like the German B-WiN, all the digital libraries of the connected universities can be combined into a national distributed research library; this is about to happen in Germany and in many other countries. Once the legal problems concerning the authentication of the subscribers to a distributed on-line digital library system are solved, the authorized members of a German university or research institute will be able to view, read and print the desired text of digital issues of an available scientific journal at their personal computers. In the future they will also want to search the whole distributed research library, and not only the recent articles. Therefore the Deutsche Forschungsgemeinschaft (DFG) has provided financial support for the establishment of two Centers for Retrospective Digitization, one at the State and University Library at Göttingen and the other at the Bavarian State Library Munich. The mathematicians are lucky, because the Göttingen library has long been the DFG-Sammelstelle for Mathematics, which means that almost all essential mathematical journals and books have been collected there with financial support of the Deutsche Forschungsgemeinschaft.
If all the digital versions of the mathematical journals were collected at the digital library in Göttingen, and the document management system AGORA [2] were implemented at the Göttingen Center for Retrospective Digitization, then authorized German mathematicians could use this searchable digital library from their workstations at their university or research institute.

Of course there is a long way to go to achieve this ideal situation. Besides financial and difficult legal problems there are also many technical problems which still have to be solved. It is the latter point of view which is addressed in this article.

2 A prototype mathematical text recognition system

Since 1997 my study group has been cooperating with the publisher Birkhäuser (Basel) in order to retrodigitize 6 volumes of the Archiv der Mathematik published by Birkhäuser in the years 1993 to 1996. This is a widely distributed mathematical journal which publishes short articles from all major areas of mathematics. Furthermore, its typesetting is excellent. This offers a chance for the use of optical character recognition (OCR) systems for the retrospective digitization task.

Many ordinary texts can automatically be retrodigitized by means of commercial OCR systems. In [1] one can read: "With Adobe Acrobat Capture 2.01 software, you can easily turn volumes of paper into searchable Portable Document Format (PDF) libraries. It's perfect for forms, manuals, specifications, books - any important document you need to make accessible on your Web site or intranet." Besides Adobe [1], there are also many other commercial OCR systems, like the one of the World Scientific Publishing Company [4], and FineReader [8].

What is so special about mathematical texts?
A convincing answer is given by R. Fateman in his article [5]. There he writes in the introduction: "Conventional OCR programs have low accuracy for mathematics for several reasons. The very sensible heuristics typically used for text recognition include computing the locations of text lines and estimating character sizes using global statistics as well as local processing. These programs may also use language-based statistics (perhaps a spelling dictionary) as tools to improve recognition rates. By contrast, mathematics is not necessarily arranged on lines, its character sizes vary, the letter and symbol frequencies are distinct from normal text, and many other text-oriented heuristics are directly counter-productive. Additionally, even if the mathematics were somehow recognized, conventional OCR programs, whose traditional output is (say) ASCII text, need to be substantially augmented with some meta-level language before they can express 'math results' as their output. Although most advanced word-processing programs have some escape mechanism for 'doing' mathematics, there is still no uniform standard for expressing two-dimensional layouts, subscript positioning, variable-sized characters, unusual math operators, etc."

In 1997 the Deutsche Forschungsgemeinschaft agreed to support the research project "Retrodigitization of the mathematical journal Archiv der Mathematik" for one and then for two more years. Its purpose is to retrodigitize the 6 volumes of this journal which appeared from 1993 to 1996. Without the legal support of the publisher Birkhäuser this project could not have been started.
The outcome will be a prototype of a text recognition system which allows one to search in the ordinary text of an originally printed article. Furthermore, it produces a version of the mathematical formulas and symbols that reflects the semantics of the mathematical part of the text, such that the retrodigitized formulas are written in TeX format. Thus they can be incorporated into new mathematical manuscripts.

Many computer algebra systems like AXIOM, MAPLE, MATHEMATICA or MAGMA allow symbolic formula manipulation and can display the calculated formulas as typeset expressions. Unfortunately, at the moment the computer algebra systems are not able to read in digitized mathematical formulas. It is hoped that this deficiency will be overcome in the future. Then the retrospective digitization text systems will become even more important for mathematical research.

2.1 Recognition of mathematical expressions and formulas

In [3], [6] and [7] R. Fateman, T. Tokuyasu and coauthors have described a package of LISP programs for optical mathematical formula recognition and translation of a scanned mathematical text into digital LISP format. My former collaborator Dr. J. Rosenboom has trained this special OCR system so that it became acquainted with the special typesetting of mathematical formulas used by the printers of the Archiv der Mathematik. Furthermore, he has extended the recognition algorithms for special mathematical symbols and the geometry of combined formulas. Inspired by R. Fateman's suggestions, Rosenboom has written a program to parse the digital mathematical LISP formulas and then to produce an output in TeX format. In particular, he incorporated procedures enabling the LISP program to recognize different types of fonts like normal, italics, Greek, bold face, etc.
In the LISP code there are several procedures for understanding the layout of the printed pages of a journal. So it is necessary to have segmentation procedures for dissecting a page into lines, lines into words, words into letters, or a mathematical formula into mathematical symbols.

Inspired by R. Fateman's article [5], Rosenboom has written a program which separates the ordinary text of a scanned page from areas of the page consisting only of mathematical formulas. Such a separation leads to a substantial improvement of the retrospective digitization procedure. This was demonstrated by J. Rosenboom in his lecture [14] at the international workshop on "Retrodigitalization of mathematical journals and automated formula recognition" organized by R. Fateman (Berkeley), E. Mittler (Göttingen) and the author at the Institute for Experimental Mathematics of Essen University in December 1997.

The idea of separating the ordinary text of a scanned page from the remainder led us to use a commercial OCR system for the recognition of this part of a scanned page. In our project we use FineReader [8], because it is very reliable and has a very good application programming interface for C++ and other modern programming languages. This is important, because we do not have access to the source code of the commercial product FineReader.

The following copy of a Tiff file of a scanned page of an article of the "Archiv der Mathematik" will be used to explain the different retrodigitizing procedures.

[Scanned first page of W. Crawley-Boevey, "Tameness of biserial algebras", Arch. Math. 65, 399-407 (1995), Birkhäuser Verlag, Basel: title, author, introduction, and the statements of Theorems A and B, followed by the beginning of the proof of Theorem B.]

2.2 Recognition of ordinary mathematical texts

FineReader can recognize articles written in different languages like English, French, German and Russian. However, it does not like mixtures of languages in a given paper. It achieves recognition by checking its dictionaries of words in a chosen language. For the retrospective digitization of mathematical articles these dictionaries do not suffice, because they do not contain the special mathematical terms, abbreviations, names of authors, quoted scientists or mathematical journals. Therefore additional mathematically oriented dictionaries have been written in C++. They can be read by FineReader over its application interface. Thus the ordinary text of a scanned page of a mathematical article is recognized by FineReader almost perfectly. It can be read on screen and its digital version is written in ASCII. Therefore it is possible to search in this digitized text for words.
Those parts of the scanned page which have not been recognized by FineReader, like mathematical formulas, geometric pictures or diagrams, are defined by our retrospective digitization system to be mathematical formulas. These sections of the scanned page are sent to the LISP program. Its output is an ASCII text in TeX format. This mathematical content of the scanned page can also be viewed on screen. However, this is done in a different window than the ordinary FineReader text. In section 2.5 it is described how these two parts of the text of a scanned article are linked to each other.

2.3 Getting bibliographic data of an article

In order to recognize the special layout of the first page of a scanned article, another special program has been written. It reads any scanned page. By analyzing the first recognized letters of it, it is able to decide whether this page is the first page of the article. If so, then the program recognizes all bibliographic data about this paper: name of the journal, number of the volume containing the article, first and last page, year of appearance, the International Standard Serial Number (ISSN), owner of the copyright (Birkhäuser Verlag), and the town of the publisher (Basel). Moreover, the program recognizes the title, the number of authors and their names together with their first names and initials. From the Tiff file of the example given in section 2.1 it gives the following output:

William Crawley-Boevey, Tameness of biserial algebras, Arch. Math.
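The division of labor just described, where recognized regions feed the searchable ASCII channel and everything the OCR engine could not recognize is declared a formula and forwarded to the LISP program, can be sketched as a simple dispatcher. The region representation and names below are hypothetical illustrations, not the project's C++ code.

```python
# Hypothetical sketch: route page regions either to the OCR text channel
# or to the formula channel, following the rule described in the text:
# whatever the OCR engine could not recognize is treated as a formula.

def split_page(regions):
    """regions: list of (content, recognized_flag) pairs for one scanned page."""
    ascii_text, formula_regions = [], []
    for content, recognized in regions:
        if recognized:
            ascii_text.append(content)        # searchable ASCII layer
        else:
            formula_regions.append(content)   # sent on to the LISP/TeX stage
    return ascii_text, formula_regions

page = [("In this paper we study", True),
        ("<bitmap of a formula>", False),
        ("over an algebraically closed field", True)]
text, formulas = split_page(page)
```

The two output streams correspond to the two windows mentioned above: one for the ordinary FineReader text, one for the TeX rendering of the formula regions.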
65, 399-407 (1995)

From these data another program produces the following bibliographic data in SGML format:

(journal) Arch. Math. (/journal)
(volume) 65 (/volume)
(firstPage) 399 (/firstPage)
(lastPage) 407 (/lastPage)
(year) 1995 (/year)
(authors)
(author)
(lastName) Crawley-Boevey (/lastName)
(firstName) William (/firstName)
(firstInitial) (/firstInitial)
(secondInitial) (/secondInitial)
(/author)
(/authors)
(title) Tameness of biserial algebras (/title)

These data, written in the standard generalized markup language (SGML), will enable other digital library document management systems like AGORA [2] or MILESS [11] to retrieve these bibliographic records. Such an example is described in [10].

The Archiv der Mathematik always prints the complete addresses of all authors at the end of an article. In recent volumes their e-mail addresses are also given. Both are recognized by our retrodigitizing programs. If necessary this information can also be produced in SGML format. The end of the last address is also used to mark the end of the retrodigitized article.

2.4 Recognizing the references of an article

My collaborators Dr. G. Hennecke and Dr. H. Gollan have written another special program for the recognition of the references. It reads any scanned page and decides by itself whether or not this page contains the beginning or the remaining part of the references. Each reference is digitized in full text, including the abbreviations of the cited journals, volumes, years and page numbers. In the example of section 2.1 the reference [12] mentioned on the first page is recognized as follows:

[12] B. WALD and J. WASCHBÜSCH, Tame biserial algebras, J. Algebra 95, 480-500 (1985).

From these data the program produces the following SGML file:

(referenceNumber) [12]
(authors)
(author)
(lastName) Wald (/lastName)
(firstName) (/firstName)
(firstInitial) B. (/firstInitial)
(secondInitial) (/secondInitial)
(/author)
(author)
(lastName) Waschbüsch (/lastName)
(firstName) (/firstName)
(firstInitial) J. (/firstInitial)
(secondInitial) (/secondInitial)
(/author)
(/authors)
(title) Tame biserial algebras (/title)
(journal) Journal of Algebra (/journal)
(series) (/series)
(volume) 95 (/volume)
(firstPage) 480 (/firstPage)
(lastPage) 500 (/lastPage)
(year) 1985 (/year)
(/referenceNumber)

Using the functions of a distributed digital library document management system, these data will later allow one to search in the digitized volumes of the mathematical journal and to retrieve the quoted digital articles. Such an example is described in [10].

2.5 Incorporation of the different digitized texts into a multivalent document system

Many mathematical articles of the Archiv der Mathematik contain pictures, complicated diagrams and tables which cannot be recognized by any OCR system. These parts of a scanned page have to be stored as images. However, they do not contain any information which is necessary for the searchability of a mathematical article. In order to enable the reader to view the complete content of an article of the retrodigitized mathematical journal on screen, each page is scanned with 600 dots per inch, and a TIFF file of the whole page is produced. Applying then the procedures described in sections 2.1, 2.2, 2.3 and 2.4 we obtain the following separate files:

1) TIFF file of the whole scanned page,
2) ASCII file of its ordinary text,
3) TeX file of its mathematical formulas text,
4) Text and SGML files of the quoted references,
5) SGML file of its bibliographic data.

These different files are only useful for the scientists if they can be linked to each other. This is done by the multivalent document system designed by T.A. Phelps and R. Wilensky [13].
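A record in this (tag) value (/tag) style can be produced mechanically once the fields have been recognized. A minimal sketch in Python; the actual project code is written in C++, and the field list and function names here are only an illustration of the emitted format.

```python
def sgml_field(tag, value=""):
    """Emit one field in the (tag) value (/tag) style used above."""
    return f"({tag}) {value} (/{tag})".replace("  ", " ")

def bibliographic_record(data):
    """Assemble a bibliographic record from a dict of recognized fields."""
    lines = [sgml_field(t, data.get(t, "")) for t in
             ("journal", "volume", "firstPage", "lastPage", "year")]
    lines.append("(authors)")
    for a in data.get("authors", []):
        lines.append("(author)")
        for t in ("lastName", "firstName", "firstInitial", "secondInitial"):
            lines.append(sgml_field(t, a.get(t, "")))
        lines.append("(/author)")
    lines.append("(/authors)")
    lines.append(sgml_field("title", data.get("title", "")))
    return "\n".join(lines)

record = bibliographic_record({
    "journal": "Arch. Math.", "volume": "65",
    "firstPage": "399", "lastPage": "407", "year": "1995",
    "authors": [{"lastName": "Crawley-Boevey", "firstName": "William"}],
    "title": "Tameness of biserial algebras",
})
```

Emitting every tag, even with an empty value, matches the records shown above, where fields such as (firstInitial) appear whether or not the scanned page supplied them.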
This software system has recently been produced by T.A. Phelps [12]. It is a new general paradigm that regards complex documents as multivalent documents comprising multiple layers of distinct but intimately related content. Phelps and Wilensky write in [13]: "Small, dynamically-loaded program objects, or 'behaviors', activate the content and work in concert with each other and layers of content to support arbitrarily specialized document types. Behaviors bind together the disparate pieces of a multivalent document to present the user with a single unified conceptual document. Examples of the diverse functionality in multivalent documents include 'OCR select and paste', where the user describes a geometric region on the scanned image of a printed page and the corresponding text characters are copied out."

Therefore this multivalent document system is very useful for our retrospective digitization project. The following diagram describes the different layers containing the separate files 1) to 5), and the layer for user annotations:

Future contents
Bibliographic data
User annotations
Cited references
Mathematical formulas
Ordinary text parts
Tiff file of a whole scanned page of a printed mathematical article

Multiple active semantic layers of the contents of a scanned page

The program of Phelps' multivalent document system (MVD) can automatically read the TIFF files 1) of the scanned pages of an article and show them on screen. Thus the user can have a printout of the whole manuscript. The TIFF file of the mathematical article is contained in layer 1 of its MVD system.

In order to incorporate the ordinary text file 2) into this system, another program has been written which enables FineReader to produce the digitized ordinary text in Xdoc format. This Xdoc file describes, besides the ASCII text, also the coordinates of each of its letters. The MVD system now allows one to search for a word in the second layer and to show the result on the first layer on screen. Thus the system provides full searchability in the ordinary text of a retrodigitized page of a mathematical article. The ordinary text parts of this mathematical article are contained in layer 2 of its MVD system.

As an example we now present an excerpt of the Xdoc file of the ordinary text of W. Crawley-Boevey's article "Tameness of biserial algebras" corresponding to the Tiff file given in section 2.1. It is free of mathematical formulas or expressions.

[a;"XDOC.10.0";E;"FrEngine40-CB1"]
[d;"65_5_399.xdc"]
[p;1;P;83;S;0;1666;0;0;3010;4618]
[t;1;1;0;0;A;"";"";"";0;0;0;0;1]
[f;0;"<DEFAULT>";R;q;10;V;0;0;0;10;100]
[f;1;"Courier";R;q;10;V;60;50;10;15;100]
[s;1;88;0;70;p;1]Arch.[h;196;32]Math.,[h;426;28]Vol.[h;569;25]65,[h;677;30]399-407[h;965;29](1995)[h;1177;802]0003-889X/95/6505-0399[h;2719;28]$[h;2778;23]3.30/0[y;2978;0;70;0;H]
[s;1;1980;0;166;p;1][h;2044;34]1995[h;2209;32]Birkh\"auser[h;2565;25]Verlag,[h;2795;32]Basel[y;2978;0;166;1;S]
[s;1;844;0;774;p;1]Tameness[h;1271;41]of[h;1411;29]biserial[h;1756;42]algebras[y;2165;0;774;1;S]
[s;1;1461;0;1009;p;1]By[y;1537;0;1009;1;S]
[s;1;1092;0;1148;p;1]WILLIAM[h;1351;28]CRAWLEY-BOEVEY[y;1914;0;1148;1;S]
[s;1;108;0;1441;p;1]In[h;176;34]this[h;327;33]paper[h;548;32]we[h;667;32]study[h;879;31]finite[h;1076;31]dimensional[h;1511;33]associative[h;1902;32]algebras[h;2209;31](with[h;2407;37]1)[h;2496;33]over[h;2674;31]an[h;2783;34]alge-[y;2973;0;1441;0;H]
[s;1;28;0;1539;p;1]braically[h;315;27]closed[h;545;30]field[h;712;36]K.

(The records continue in the same format through the rest of the page, ending with the words "... we vary the relations instead of the structure coefficients of the algebra.")
[ h ; 2 2 3 5 ; 2 2 ] I f i s [ h ; 2 8 0 4 ; 2 1 ] t h e [ h ; 2 7 2 5 ; 1 9 ] v a r i e t y [ y ; 2 9 7 1 ; O ; 2 8 6 0 ; O ; H ] [ s ; 1 ; 2 6 ; 0 ; 2 9 5 8 ; p ; 1 ] o f [ h ; 9 5 ; 2 6 ] a s s o c i a t i v e [ h ; 4 7 8 ; 3 6 ] u n i t a l [ h ; 7 0 4 ; 3 5 ] a l g e b r a [ h ; 9 8 4 ; 3 3 ] s t r u c t u r e s [ h ; 1 3 4 3 ; 3 6 ] o n [ h ; 1 4 6 1 ; 3 6 ] a n [ h ; 1 5 7 5 ; 3 4 ] v e c t o r [ h ; 2 3 2 1 ; 3 4 ] s p a c e , [ h ; 2 5 5 0 ; 3 5 ] t h e n [ h ; 2 7 2 8 ; 3 5 ] G e i ' s [ y ; 2 9 7 1 ; 0 ; 2 9 5 8 ; 0 ; H ] [ s ; 1 ; 2 6 ; 0 ; 3 0 5 4 ; p ; 1 ] T h e o r e m [ h ; 3 2 5 ; 2 6 ] [ [ 6 ] [ h ; 4 2 6 ; 2 5 ] s t a t e s [ h ; 6 3 5 ; 2 4 ] t h a t [ h ; 7 8 9 ; 2 5 ] i f [ h ; 8 6 1 ; 1 7 ] t h e [ h ; 9 7 7 ; 2 3 ] c l o s u r e [ h ; 1 2 3 6 ; 2 3 ] o f [ h ; 1 3 2 8 ; 1 5 ] t h e [ h ; 1 4 4 3 ; 2 5 ] o f c o n t a i n s [ h ; 2 7 2 6 ; 2 4 ] a [ h ; 2 7 8 6 ; 2 3 ] t a m e [ y ; 2 9 7 0 ; 0 ; 3 0 5 4 ; 0 ; H] [ s ; 1 ; 2 6 ; 0 ; 3 1 5 2 ; p ; 1 ] a l g e b r a [ h ; 2 7 0 ; 3 0 ] t h e n [ h ; 4 4 3 ; 3 6 ] A [ h ; 5 3 0 ; 2 6 ] i s [ h ; 6 0 4 ; 3 2 ] t a m e . [ h ; 8 1 2 ; 3 3 ] I n [ h ; 9 1 4 ; 3 1 ] o u r [ h ; 1 0 6 0 ; 3 0 ] v e r s i o n [ h ; 1 3 2 8 ; 3 2 ] t h e [ h ; 1 4 5 9 ; 3 1 ] a l g e b r a s [ h ; 1 7 6 4 ; 3 0 ] m a y [ h ; 1 9 3 6 ; 3 0 ] h a v e [ h ; 2 1 1 9 ; 3 0 ] d i f f e r e n t [ h ; 2 4 2 3 ; 2 8 ] d i m e n s i o n s : [ y ; 2 8 5 0 ; 0 ; 3 1 5 2 ; 1 ; S ] [ s ; 1 ; 1 0 2 ; 0 ; 3 3 4 8 ; p ; 1 ] T h e o r e m [ h ; 3 9 4 ; 2 5 ] B . 
[ h ; 4 8 4 ; 3 7 ] L e t [ h ; 6 2 7 ; 3 8 ] b e [ h ; 8 1 3 ; 3 4 ] a [ h ; 8 8 4 ; 2 0 ] f i n i t e [ h ; 1 0 7 9 ; 3 5 ] d i m e n s i o n a l [ h ; 1 4 9 4 ; 3 0 ] a l g e b r a , [ h ; 1 7 8 1 ; 3 4 ] l e t [ h ; 1 8 9 1 ; 3 7 ] [ h ; 1 9 8 0 ; 3 8 ] b e [ h ; 2 0 8 6 ; 3 3 ] a n [ h ; 2 1 9 5 ; 3 0 ] i r r e d u c i b l e [ h ; 2 5 6 7 ; 3 3 ] v a r i e t y [ h ; 2 8 1 9 ; 3 5 ] a n d [ y ; 2 9 7 1 ; 0 ; 3 3 4 8 ; 0 ; H ] [ s ; 1 ; 2 7 ; O ; 3 4 4 5 ; p ; 1 ] l e t [ h ; 1 0 1 ; 3 4 ] b e [ h ; 8 0 2 ; 4 4 ] m o r p h i s m s [ h ; 1 1 8 9 ; 4 7 ] o f [ h ; 1 3 0 2 ; 3 9 ] v a r i e t i e s [ h ; 1 6 0 9 ; 4 4 ] ( w h e r e [ h ; 1 8 6 6 ; 5 3 ] A [ h ; 1 9 7 0 ; 4 6 ] h a s [ h ; 2 1 2 0 ; 4 7 ] i t s [ h ; 2 2 4 0 ; 4 6 ] n a t u r a l [ h ; 2 5 2 2 ; 4 6 ] s t r u c t u r e [ h ; 2 8 5 4 ; 4 8 ] a s [ y ; 2 9 6 8 ; 0 ; 3 4 4 5 ; 0 ; H ] [ s ; 1 ; 2 4 ; 0 ; 3 5 4 3 ; p ; 1 ] a f f i n e [ h ; 2 0 9 ; 5 5 ] s p a c e ) . [ h ; 4 7 5 ; 6 0 ] F o r [ h ; 6 4 7 ; 5 7 ] w r i t e [ h ; l l O l ; 5 8 ] L e t [ h ; 1 8 1 9 ; 5 5 ] I f [ h ; 2 3 3 8 ; 4 9 ] i s [ h ; 2 5 9 4 ; 5 4 ] t a m e [ h ; 2 7 9 8 ; 5 6 ] a n d [ y ; 2 9 7 0 ; 0 ; 3 5 4 3 ; 0 ; H ] [ s ; i ; 2 8 ; 0 ; 3 6 4 1 ; p ; 1 ] f o r [ h ; 4 4 7 ; 2 7 ] g e n e r a l [ h ; 7 1 2 ; 2 2 ] i . e . [ h ; 1 0 4 4 ; 1 8 ] f o r [ h ; 1 1 5 7 ; 2 5 ] a l l [ h ; 1 2 6 2 ; 2 3 ] i n [ h ; 1 4 0 8 ; 2 3 ] a [ h ; 1 4 6 8 ; 2 1 ] n o n - e m p t y [ h ; 1 8 2 5 ; 2 7 ] o p e n [ h ; 2 0 0 3 ; 2 5 ] s u b s e t [ h ; 2 2 2 4 ; 2 3 ] t h e n [ h ; 2 5 6 8 ; 2 3 ] i s [ h ; 2 7 7 3 ; 2 5 ] t a m e . 
[ y ; 2 9 6 6 ; 0 ; 3 6 4 1 ; 1 ; S ] [ s ; i ; i 0 1 ; 0 ; 3 7 8 7 ; p ; i ] I n [ h i 1 6 9 ; 3 3 ] p r a c t i c e [ h ; 4 6 3 ; 3 1 ] G e l ' s [ h ; 7 0 2 ; 3 3 ] T h e o r e m [ h ; 1 0 3 4 ; 3 2 ] h a s [ h ; 1 1 7 5 ; 3 4 ] u s u a l l y [ h ; 1 4 4 3 ; 3 2 ] b e e n [ h ; 1 6 2 6 ; 3 4 ] u s e d [ h i 1 8 0 6 ; 3 3 ] i n [ h ; 1 9 0 0 ; 3 2 ] t h e [ h ; 2 0 3 2 ; 3 0 ] f o r m [ h ; 2 2 1 8 ; 3 4 ] o f [ h ; 2 3 2 1 ; 2 2 ] T h e o r e m [ h ; 2 6 4 4 ; 2 5 ] B , [ h ; 2 7 3 2 ; 3 3 ] b u t [ h ; 2 8 7 4 ; 3 2 ] i n [ y ; 2 9 6 7 ; 0 ; 3 7 8 7 ; 0 ; H ] [ s ; 1 ; 2 2 ; 0 ; 3 8 8 4 ; p ; i ] t h i s [ h ; 1 3 9 ; 1 8 ] c a s e [ h ; 2 9 3 ; 2 2 ] o n e [ h ; 4 3 2 ; 1 9 ] a l s o [ h ; 5 8 1 ; 2 1 ] h a s [ h ; 7 1 1 ; 2 1 ] t o [ h ; 7 9 6 ; 2 0 ] c h e c k [ h ; I 0 0 4 ; 2 1 ] t h a t 230 [ h; 1155; 20] a l l [ h; 1253 ; 22] t h e [ h; 1375; 18] a l g e b r a s [ h; 1669 ; 25] A ^ [ h; 1772 ; 20] have [h; 1946 ; 20] t h e [ h; 2066 ; 20] same [ h; 2251 ; 20] d i me n s i o n . [ h; 2630; 22] 0f [ h; 2742 ; 11] c o u r s e [ y; 2967; 0; 3884; 0; H] [ s ; 1 ; 20; 0; 3982; p; 1] Theor em [ h; 319 ; 28] B [h ; 394; 25] and [ h; 541 ; 26] Gei ' s [h; 775; 26] t he or e m [h; 1073 ; 27] have [h; 1253 ; 25] a [ h; 1314; 24] common [ h; 1630; 26] r e f i n e me n t [ h; 2008; 25] i n [ h; 2093; 26] whi ch [ h; 2315; 25] b o t h [h; 2493; 26] t h e [ h; 2620; 24] a l g e b r a [h ; 2888 ; 31] A[ y; 2971; O; 3982; O; H] I s ; 1 ; 20 ; 0 ; 4080 ; p ; 1] and [h; 141 ; 39] r e l a t i ons [h; 467 ; 38] a r e [h ; 606 ; 38] a l l o we d [h; 900 ; 38] t o [h; 1003; 38] v a r y . [ h; 1201 ; 39] The [ h; 1367 ; 38] a u t h o r [h; 1630 ; 36] i s [h; 1714 ; 38] s u p p o r t e d [ h; 2087 ; 39] by [h; 2205; 37] t h e [ h; 2342 ; 39] EPSRC [h; 2638 ; 39] of [ h; 2746 ; 29] Gr e a t [y ; 2966 ; 0; 4080 ; 0 ; H] [ s ; 1 ; 2 1 ; O; 4 1 7 8 ; p ; 1 ] Br i t a i n . [ y ; 2 6 6 ; 0 ; 4 1 7 8 ; 1; S] [ s ; 1; 1 0 3 ; 0 ; 4 3 7 5 ; p ; 1 ] 1. 
[h; 151; 29] Pr o o f [h ; 372 ; 21] of [h ; 462 ; 20] Theor em [ h ; 7 7 5 ; 2 5 ] B. [ h; 865; 29] I f [ h ; 9 5 0 ; 2 0 ] a n[ h; 1048; 30] a l g e b r a i c [ h; 1379 ; 27] gr oup [h; 1604; 31] g [h; 1685 ; 29] a c t s [h ; 1841; 29] on [ h; 1952; 29] a [h; 2017 ; 27] ( not [h; 2176; 29] n e c e s s a r i l y [h ; 2562 ; 27] i r r e d u c i b l e ) [ y; 2967; 0 ; 4375 ; 0 ; HI [s ; 1 ; 20; 0 ; 4 4 7 2 ; p ; 1] v a r i e t y [h; 246 ; 31] Y [h; 324; 24] t h e n [h; 491 ; 27] t h e [h; 618 ; 25] number [h ; 879 ; 26] of [ h; 972; 15] p a r a me t e r s [ h; 1346 ; 26] of [h ; 1441; 20] G [h ; 1510 ; 27] on [h ; 1619; 31] Y[ h; 1698; 22] i s [h ; 1769 ; 24] [ s ; 1 ; 19 ; 0 ; 4570 ; p ; 1] wher e [h; 215 ; 43] i s [h; 436 ; 38] t h e [h ; 574; 40] uni on [ h; 804 ; 40] of [ h; 913 ; 30] t h e [ h; 1042 ; 41] o r b i t s [ h; 1273 ; 40] of [h ; 1381 ; 29] di me ns i on [h ; 1764; 39] s . [h; 1835 ; 42] By [ h; 1964 ; 38] a [h; 2038 ; 40] G- s t a b l e [h ; 2345 ; 40] s u b s e t [ h; 2590 ; 37] we [y ; 2965 ; 0; 4570; 1 ; S] [g; 1 6 6 6 ; 0 ; 0 ; 3 0 1 0 ; 4 6 1 8 ] In or der t o i ncor por at e t he mat hemat i cal f or mul a l at ex file 3) i nt o t he MVD syst em t he soft ware of t he MVD has been ext ended such t ha t i t can r ead in file 3) as its t hi r d layer. To each mat hemat i cal f or mul a or expr essi on its coor di nat es ar e at t ached. Thi s is necessar y in or der t o keep t he connect i on bet ween t he or di nar y t ext of t he second l ayer and t he mat hemat i cal t ext of t he t hi r d layer. Her e t he r eader can mar k a r el evant ma t he ma t i c a l f or mul a and show it on screen. Anot her pr ogr am will be wr i t t en enabl i ng t he vi ewer t o t r ansf er t he obt ai ned mat hemat i cal f or mul a i nt o a new mat hemat i cal l at ex manuscr i pt . As an exampl e we now pr i nt out t he mat hemat i cal f or mul a t ext of Cr awl ey- Boeyey' s art i cl e "Tameness of bi seri al al gebras". 
[Printout of the mathematical formula text (third layer), again interleaved with positional markers; the formulas it records for this excerpt include:]

Alg(d); d-dimensional; the GL(d,K)-orbit of A ∈ Alg(d); Theorem B. Let A; X; f_1, ..., f_r : X → A; x ∈ X; A_x = A/(f_i(x)); x_0, x_1 ∈ X; A_{x_0}; A_x ≅ A_{x_1}; dim_G Y = max{dim Y(s) − s | s ≥ 0}; Y(s); Z ⊆ Y

We remark that Theorem B has only been put into this printout of the mathematical formula text in order to help the reader find the corresponding text in the Xdoc file of the ordinary text and in the Tiff file. The fourth layer contains file 4) with the bibliographic data of the quoted articles and their SGML formats. The layer with the user annotations allows communication between the author or library and the possible readers of a retrodigitized article. So misprints, comments and suggestions for further improvement of the article and its presentation in the digital library can be mentioned here. The sixth layer is used for the preparation of the searchability between the different articles of the retrodigitized volumes of the mathematical journal. Therefore it contains all the bibliographic data about the given article in SGML format.
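The coordinate-annotated layer files shown above could be read with a small parser along the following lines. This is only a sketch: the exact field semantics of the MVD/Xdoc markers are not documented here, so treating `[h;<pos>;<gap>]` as a horizontal position record that precedes the next word, and skipping the other record types (`y`, `s`, `g`, ...) as line/page structure, are assumptions.

```python
import re

# Sketch of a reader for the MVD layer files shown above. Assumption:
# '[h;<pos>;<gap>]' is a horizontal position record preceding the next
# word; other record types ('y', 's', 'g', ...) encode line and page
# structure and are skipped.
MARKER = re.compile(r"\[\s*([A-Za-z])((?:\s*;\s*[-\w]+)+)\s*\]")

def extract_words(layer_text):
    """Return (word, horizontal_position) pairs with all markers stripped."""
    words, pos, last_h = [], 0, None
    for m in MARKER.finditer(layer_text):
        fragment = layer_text[pos:m.start()].strip()
        if fragment:
            words.append((fragment, last_h))
        fields = [f.strip() for f in m.group(2).split(";") if f.strip()]
        if m.group(1).lower() == "h" and fields and fields[0].lstrip("-").isdigit():
            last_h = int(fields[0])
        else:
            last_h = None
        pos = m.end()
    tail = layer_text[pos:].strip()
    if tail:
        words.append((tail, last_h))
    return words

sample = "[h;818;31]An [h;944;29]algebra [h;1217;29]is [h;1293;30]biserial"
print(" ".join(w for w, _ in extract_words(sample)))  # An algebra is biserial
```

Keeping the position attached to each word is what would let a viewer connect a marked region of the ordinary-text layer to the corresponding entry in the formula layer, as the text above describes.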
Since our retrospective digitization programs can recognize the special layout of the first page and of the references, the MVD system of the prototype allows the user to mark and call a reference of a retrodigitized article contained in the digital library within the MVD format.

3 Combining retrodigitized and recent digital volumes

All the programs described in the second section will be applied to retrodigitize the six volumes of the Archiv der Mathematik published in the years 1993 to 1996. The result can be considered a prototype for retrospective digitization of general mathematical journals. Of course, for each other periodical the training systems of the OCR programs have to be modified. However, the developed and applied technologies can be adjusted.

It is therefore even more important to combine the retrodigitized volumes of the Archiv der Mathematik with the recent digital issues of this periodical. Since 1997 this journal has been published both on paper and in digital form. The digital articles can be retrieved from the Springer LINK in Heidelberg over the internet by the authorized members of the universities and research institutes, because the Springer database contains them in portable document format (PDF). This format allows the reader to view each page of the received article on screen. Furthermore, the reader can produce a printout of it. Also, the full text can be searched for words, but not for mathematical formulas or expressions. At the Springer LINK the bibliographic records and abstracts of the articles are stored in SGML format.

The publisher Birkhäuser has agreed to extend our present collaboration in order to produce software that connects the retrodigitized and the recent digitized volumes of the Archiv der Mathematik in one experimental digital library system. To that end the publisher will provide the digital texts of the recent six volumes in PDF format together with bibliographic records about their articles in SGML format.
Since we will then have the bibliographic data of both the retrodigitized and the digital volumes in SGML format, we can use the MILESS system [11] to produce a common platform for both parts of the digital articles of the Archiv der Mathematik. MILESS is a library server developed at Essen University in a joint research project of the Computer Center and the Central Library. It uses the IBM DB2 Digital Library product [9] to provide access to digital and multimedia documents in a reliable and systematic way. As described in [10], MILESS allows the storage of such material in any format, such as audio, video, HTML, XML, and PDF. Since both the retrodigitized and the new digital articles of the Archiv der Mathematik have an SGML description of their bibliographic data, we can use these to provide MILESS with the necessary information about any given article. For the old retrodigitized articles we will use the MVD system described in the second section as the storage format. The new articles will be stored as PDF files, as is done in the Springer LINK. So both parts of the digital journal can be linked within the MILESS library system. Furthermore, the search functions of MILESS now allow searching in the bibliographic data and also for words in ASCII versions of the articles of both parts of the Archiv der Mathematik. This part of the project will be pursued in due course.

The director of the Essen Computer Center has agreed to provide access to the MILESS digital library program. We do not need any access to the source code of the IBM DB2 Digital Library product, which is called by the useful public domain programs of MILESS. There is no doubt that this experiment will be successful. It is the last cornerstone for building a prototype of a combined digital and retrodigitized searchable mathematical journal. Instead of MILESS, the publisher could also use the digital library database of the Springer LINK.
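The common-platform idea described above can be sketched as a toy data model: one catalogue record per article, keyed by its SGML-derived bibliographic data and pointing at both renditions. All names, paths and bibliographic values below are invented for illustration; the real MILESS and DB2 Digital Library interfaces are not shown.

```python
from dataclasses import dataclass, field

# Toy model only: one catalogue entry per article, resolving to both the
# retrodigitized (MVD) and the born-digital (PDF) rendition. Field names,
# paths and bibliographic values are invented for illustration.
@dataclass
class Rendition:
    kind: str      # "MVD" for retrodigitized issues, "PDF" for recent ones
    location: str  # storage path or URL in the library server

@dataclass
class Article:
    journal: str
    volume: int
    pages: str
    title: str
    renditions: list = field(default_factory=list)

def register(catalogue, article):
    """Index an article under a (journal, volume, pages) key derived from
    its SGML bibliographic record, so both renditions share one entry."""
    key = (article.journal, article.volume, article.pages)
    catalogue[key] = article
    return key

catalogue = {}
key = register(catalogue, Article(
    journal="Archiv der Mathematik", volume=65, pages="399-407",
    title="Tameness of biserial algebras",
    renditions=[Rendition("MVD", "mvd/adm-65-399.mvd"),
                Rendition("PDF", "link/adm-65-399.pdf")]))
print(sorted(r.kind for r in catalogue[key].renditions))
```

The point of the design is that the shared bibliographic key, not the storage format, links the two halves of the journal, which is exactly what the SGML records make possible.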
Since the Center for Retrospective Digitization of the Göttingen State and University Library is about to introduce AGORA, a new document management system [2], it is very likely that retrodigitized and recent digital volumes of a mathematical journal can also be linked by means of that library system. The director of the Göttingen State and University Library has agreed to support such experiments with our prototype in another joint project.

4 Future improvements

As the reader will have observed so far, the multivalent document system (MVD) does not produce the ordinary mathematical text parts of a scanned mathematical article in PDF format. On the other hand, the programs used at the Springer LINK do not provide any access to the mathematical formulas. They are also not able to call the text of a quoted article in the references which is available in the publisher's digital library database. In order to show that referencing within the digital articles of the Archiv der Mathematik is possible, one could also retrodigitize the recent digital texts and put the obtained ordinary text part of a page into the multivalent document system (MVD) in Xdoc format. Such an experiment is planned, and it will have the support of the publisher Birkhäuser. However, it will only prove that it is very promising to start a new joint project in which the software of the present prototype is extended in such a way that the retrodigitized and the recent digital volumes are stored in a common PDF or PostScript format. Furthermore, the bibliographic and referencing records have to be written in a common SGML or XML format. Such a future project requires joint efforts of the software developers for the publisher, for the MILESS project and for the Essen retrodigitization project.
If successful, the resulting new prototype of one combined retrodigitized and digital mathematical journal will show that such a software system can also be used for retrospective digitization and for combining its results with the digital volumes of other mathematical journals contained in a distributed digital research library. In particular, it will then be possible to retrieve and read quoted articles of different journals as long as they are part of a digital library system. Furthermore, such software will allow the user to mark a retrodigitized or recent digital mathematical formula on screen and have it transferred into a new mathematical manuscript.

Acknowledgements

The author kindly acknowledges financial support by DFG grant III N 2 - 542 81(1) Essen BIB45 ENug 01-02. He is very grateful to the Birkhäuser Verlag for its generous technical and legal support. The author also thanks Professor R. Fateman, T. A. Phelps Ph.D. and Professor R. Wilensky of the University of California (Berkeley) for their advice and their programs. Finally, he owes his thanks to his former collaborators Dr. G. Hennecke, Dr. J. Rosenboom, and present collaborators C. Begall, Dr. H. Gollan and Dr. R. Staszewski, who have done or do all the programming and hard work.

References

1. Adobe Acrobat Capture 2.01 for Windows 95 and Windows NT(R), http://www.adobe.com/prodindex/acrobat/capture.htm
2. Agora - Digitales Dokumentenmanagementsystem für die Inhalte Ihrer Bibliothek, http://www.agora.de
3. Benjamin P. Berman, Richard J. Fateman, Nicholas Mitchell and Taku Tokuyasu, "Optical character recognition and parsing of typeset mathematics", Journal of Visual Communication and Image Representation, vol. 7 (1996), 2-15.
4. D. Blostein and A.
Grbavec, "Recognition of Mathematical Notation", Chapter 22 in P.S.P. Wang and H. Bunke (eds.), Handbook on Optical Character Recognition and Document Analysis, World Scientific Publishing Company, 1996.
5. R. J. Fateman, "How to find mathematics on a scanned page", Preprint 1997, Univ. Calif. Berkeley.
6. R. Fateman and T. Tokuyasu, "A suite of programs for document structuring and image analysis using Lisp", UC Berkeley, technical report, 1996.
7. R. Fateman and T. Tokuyasu, "Progress in recognizing typeset mathematics", 1997, Univ. California Berkeley.
8. FineReader OCR Engine, Developer's Guide, ABBYY Software House (BIT Software), Moscow, 1993-1997.
9. IBM DB2 Digital Library, http://www.software.ibm.com/is/dig-lib/
10. H. Gollan, F. Lützenkirchen, D. Nastoll, "MILESS - a learning and teaching server for multi-media documents", Preprint.
11. MILESS - Multimedialer Lehr- und Lernserver Essen, http://miless.uni-essen.de
12. T. A. Phelps, "Multivalent Documents: Anytime, Anywhere, Any Type, Every Way User-Improvable Digital Documents and Systems", dissertation, UC Berkeley, 1998.
13. T. A. Phelps, R. Wilensky, "Multivalent Documents: Inducing Structure and Behaviors in Online Digital Documents", in Proceedings of the 29th Hawaii International Conference on System Sciences, Maui, Hawaii, January 3-6, 1996.
14. J. Rosenboom, "A prototype mathematical text recognition system", lecture at the international workshop on "Retrodigitalization of mathematical journals and automated formula recognition", Institute for Experimental Mathematics, Essen University, 10-12 December 1997.

Gigabit Networking in Norway
Infrastructure, Applications and Projects

Thomas Plagemann
UniK - Center for Technology at Kjeller
University of Oslo
http://www.unik.no/~plageman

Abstract. Norway is a country with large geographical dimensions and a very low number of inhabitants.
This combination makes advanced telecommunication services and distributed multimedia applications, like telemedicine and distance education, very important for the Norwegian society. Obviously, an appropriate networking infrastructure is necessary to enable such services and applications. In order to cover all important locations in Norway, this network represents a very large Wide-Area Network (WAN) within a single nation. This paper describes the Norwegian academic networking infrastructure, and gives an overview of Norwegian research institutions, programs, and projects. Furthermore, we describe in two case studies one exemplary multimedia application and one ongoing research project in the area of gigabit networking and multimedia middleware.

1 Introduction

Traditionally, Norway is a very advanced country in the area of networking and data communications. For example, in 1975 Kjeller (which is situated 30 kilometers north-east of Oslo) and London were the only European nodes in the former internet, called ARPANET. Today, Norway is one of the leading countries in the world with respect to access and usage of the internet. According to the Norwegian Gallup institute, which is specialized in interview-based market analysis, 46% of all Norwegians have access to the internet, 33% are regularly using the internet, and 24% of all Norwegian households are connected to the internet [8]. There are two further facts about Norway that make a study of advanced networking in Norway quite interesting:
• Norway has approximately 4.3 million inhabitants. Consequently, there are only a few universities and research institutions.
• Norway stretches over 2000 kilometers from south to north.
This geographical dimension combined with the low number of inhabitants makes advanced telecommunication services and distributed multimedia applications, like telemedicine and distance education, very important for the Norwegian society. Obviously, an appropriate networking infrastructure is necessary to enable such services and applications. In order to cover all important locations in Norway, this network represents a very large Wide-Area Network (WAN) within a single nation.

This paper has two main goals: (1) to give a general overview of Norwegian research activities in the area of gigabit networking and to provide appropriate references; and (2) to give a more detailed description of two typical examples of research projects and the usage of the Norwegian networking infrastructure. In this survey, we also consider distributed multimedia systems and applications. These systems typically operate with only several Mbit/s of bandwidth per user, but the potentially large number of concurrent users imposes considerable requirements on gigabit networks.

The first part of this paper describes the Norwegian academic research network infrastructure, the connected research institutions, research programs, and relevant projects. In the second part, we present two case studies. The first case study describes the electronic classroom system that is used for teaching regular university courses. The second case study presents the ongoing MULTE (Multimedia Middleware for Low-Latency High-Throughput Environments) project.

2 National Networking Infrastructure

Since 1987, the organization UNINETT has had the responsibility for the academic networking infrastructure in Norway.
This includes [17]:
• to develop and maintain the national data network for research and higher education,
• to propagate the usage of open standards, and
• to stimulate research and development that is important in the context of UNINETT's activities.

It is a strategic goal for Norway to keep up with research and development on new network services, like Internet 2 and Next Generation Internet in the USA, and to actively participate in the 5th Framework Programme of the European Union. In this context, multimedia and real-time services, like IP telephony, digital libraries, distance education, and virtual reality, play an important role and require a considerable amount of bandwidth. Table 1 summarizes the bandwidth requirements UNINETT estimates for the period 1998 - 2003 for the Norwegian backbone network, regional networks, access networks, and internal educational networks [18].

In order to meet the current and future requirements, UNINETT, the Norwegian Research Council, and Telenor officially opened the National Research Network in September 1998. This network comprises two parts: the research network and the test network. The research network is a stable network to be used for productive services and for new (multimedia) applications. All Norwegian universities, four engineering schools (Mo i Rana, Stavanger, Grimstad, and Halden) and the research institutions at Kjeller are connected by the research network. Additionally, Lillehammer will be connected during 1999.

Table 1. Estimated bandwidth requirements in Mbit/s [18]

                                   1998      1999      2000       2003
  Backbone network               40-150   100-300   300-600  2000-4000
  Regional networks             0.25-30     10-60    20-150    150-600
  Access networks                0.1-10    0.5-20     2-150    150-300
  Internal educational networks    2-10     10-40    10-100     80-300

[Figure omitted: map of the test network linking Tromsø, Trondheim, Bergen, Stavanger, Grimstad, Halden, Lillehammer, Oslo and Kjeller, with link bandwidths between 15 and 70 Mb/s.]

Fig. 1.
Topology of the Norwegian test network (according to [19])

Figure 1 specifies the links and the available amount of bandwidth between these institutions. In contrast, the test network enables academic research institutions to experiment with new network protocols, e.g., IPv6 and RSVP, and applications in a Wide-Area Network (WAN) without interfering with the productive services in the research network. Only the leading (academic) research institutions in the area of networking and distributed systems, i.e., the four universities, UniK - Center for Technology at Kjeller, and Telenor Research and Development, have access to the test network.¹ In particular, the test network offers an infrastructure to [18]:
• realize the national IPv6 infrastructure,
• experiment with protocol mechanisms to support QoS on top of ATM, IPv6, or other internet based services,
• introduce reliable multicast, and
• perform experimental research with new protocols and services.

Both networks - the research network and the test network - are based on the commercial 155 Mbit/s ATM/SDH WAN from Telenor, called Nordicom. In order to manage and control the access to the research network, UNINETT connects each node² in the National Research Network to Nordicom with an ATM switch (Cisco LightStream A1010). Based on Virtual Paths (VPs) in Nordicom, these ATM switches establish Virtual Circuits (VCs) by using the Private Network-to-Network Interface (PNNI) signalling protocol. In addition, the Cisco switch supports the User-Network Interface (UNI) signalling protocol.
Additionally, the ATM switches from UNINETT are connected to IP routers that route IPv4 packets in the research network (and IPv6 packets in the test network) over the VCs towards their destination. Research institutions with local ATM networks can choose whether they want to use IP services or ATM services directly. Figure 2 illustrates the basic architecture at a node.

Fig. 2. Node architecture

3 Overview of Research Activities

3.1 Research Institutions

The main academic research institutions in the area of gigabit networking and related areas are:

¹ Commercial institutions can apply for access to the test network on a per-project basis.
² A node corresponds to a Gigabit Point of Presence (GigaPOP) in the Internet 2 terminology.

• At the University of Bergen, the Department of Information Science does research in the area of information systems, and the Department of Informatics in the areas of algorithms, bioinformatics, code theory, numerical analysis, program development, and optimization.
• The Department of Informatics at the University of Oslo is actively working in the areas computer science, microelectronics, mathematical modeling, systems development, and image processing. Relevant activities include: Swipp (Switched Interconnection of Parallel Processors), SCI (Scalable Coherent Interface), the Multimedia Communication Laboratory (MMCL), and ENNCE (Enhanced Next Generation Networked Computing Environment).
• At the Norwegian University of Science and Technology in Trondheim (NTNU), the Department of Telematics is working in the areas of distributed systems, traffic analysis, and reliability.
The Department of Computer Science and Information Science is doing research in the areas of artificial intelligence, image processing, human-computer interfaces and systems development, information management, algorithms, and database systems. Important projects include the PaP (plug-and-play) project and WIRAC (Wideband Radio Access).
• UniK is a foundation at which faculty members are either affiliated with the University of Oslo or the NTNU. Areas of research interest at UniK are: distributed multimedia systems, telecommunications, opto-electronics, and mathematical modeling. Relevant activities include: OMODIS (Object-Oriented Modeling and Database Support in Distributed Systems), INSTANCE (Intermediate Storage Node Concept), and ENNCE/MULTE (Multimedia Middleware for Low-Latency High-Throughput Environments).
• The Department of Computer Science at the University of Tromsø is focussing its activities on distributed operating systems and open distributed systems. Relevant activities include: TACOMA, MacroScope, Vortex, and ENNCE/MULTE.
• The Norwegian Computing Center performs applied research in the fields of information technology. Selected activities include: LAVA (delivery of video over ATM), IMiS (Infrastructure for Multimedia Services in Seamless Networks), and ENNCE.
• SINTEF Telecom and Informatics performs research in the areas of computer science, telecommunications, electronics, and acoustics. Relevant activities include IMiS and OMODIS.
• NORUT IT is working in the areas of earth observation and information and communication technology. Selected activities include NorTelemed (telemedicine applications) and LAVA.
• The Norwegian Defence Research Establishment (FFI) is a state operated, civilian research establishment reporting directly to the Norwegian Ministry of Defence.
FFI is an interdisciplinary establishment representing most of the engineering fields, as well as biology, medicine, political science, and economics. Relevant, non-classified research activities include ENNCE/MULTE.

The leading commercial research institution is Telenor Research and Development (R&D), a part of the former national PTT. Main areas of interest include service development and network solutions. Interesting projects include: Project I (next-generation IP protocols and applications) and DOVRE (Distributed Object-oriented Virtual Reality Environment). Further companies that perform research in the scope of this paper are Ericsson, Alcatel, and Thomson-CSF Norcom.

3.2 Research Programs and Projects

National research projects are mainly funded by the Norwegian Research Council (NFR). NFR is organized in areas, like the area of Natural Sciences and Technology (NT), which in turn are organized in research programs and activities. Ongoing research programs of interest in NT include:

- The Distributed IT Systems (DITS) program lasts from 1996 to 2000 and has a budget of 70 million Norwegian kroner (MNOK). The program supports basic research within the following three main areas: (1) construction and usage of distributed IT systems, (2) methods for construction and maintenance of systems and applications for distributed information handling, and (3) basic software and hardware technology for distributed IT systems.
- The main goal of the Basic Telecommunication Research Program (GT) is to support strategic and basic telecom research at universities and research institutes in the following four areas: (1) mobile systems, (2) broadband systems, (3) transport networks and end-systems, and (4) telecommunication systems for people with special needs. The program has a budget of 78 MNOK for the period 1997-2001.
- The main goal of the Super Computing program is to provide scientific research projects with access to national supercomputer resources. These resources include: one Cray J90 and one Cray T3E at NTNU, a Silicon Graphics Origin2000 at the University of Bergen, and an IBM SP2 at the University of Oslo. NFR contributes 110 MNOK to the budget in the program period 1999-2003.

Furthermore, universities and industry are financially supporting research projects. In the following, we briefly describe some research projects in three prominent areas: cluster technology, middleware, and multimedia applications. A list of these and further research projects, with references to online documentation, can be found in the Appendix.

Cluster technology: The SCI (Scalable Coherent Interface) research group at the University of Oslo studies how cluster software and hardware can be created, analyzed, efficiently utilized, and maintained. In particular, I/O and network access within SCI and ATM, and performance studies of SCI, are of interest in this context [13]. At the University of Tromsø, two projects are concerned with multicomputer systems, clusters, and distributed operating systems. The primary goal of the MacroScope project is to design and build a multicomputer via a distributed operating system based on distributed shared memory. The experimental hardware development platform consists of eight Hewlett-Packard NetServers, each equipped with four Intel Pentium Pro processors running at 166 MHz. Each NetServer has 128 MB memory and two peer PCI buses. The NetServers are connected via a Myrinet interconnect from Myricom. The Myrinet has a peak bandwidth of 2x1.28 Gb/s and hosts one megabyte of SRAM on the network interface. The Vortex operating system is currently running on uniprocessors and on 2-, 4-, and 8-way Pentium II/Pentium Pro based multiprocessors.
The current implementation includes support for multithreaded processes, virtual memory, network communication over UDP/IP (100 Mbit/s Ethernet), gigabit network communication (Myrinet), a RAM-based file system, basic synchronization (mutexes, semaphores, etc.), APIC symmetric I/O mode, a Plan9-like namespace for resources, and other features.

Middleware: The TACOMA project (Tromsø And COrnell Moving Agents) focuses on operating system support for agents and on how agents can be used to solve problems traditionally addressed by other distributed computing paradigms, e.g., the client/server model [10]. The plug-and-play project is financed by GT and will specify and explore aspects of a self-configuring and self-adapting architecture with plug-and-play functionality for transport and teleservice components. The goal is to develop the technology and to demonstrate the ideas central to this plug-and-play concept. The objective of OMODIS is to create basic research results within the domain of modeling for distributed multimedia systems, with emphasis on object-oriented modeling and Quality of Service (QoS) modeling, based on a distributed persistent object architecture [9]. OMODIS is financed by DITS in the period 1996-2001. The ENNCE/MULTE project is described in more detail in Section 5.

Applications: The NorTelemed project finishes in 1999. New services for telemedicine, like remote diagnostics and remote consultation, are developed, tested, and evaluated in this project. The focus of the main LAVA project (1995-1996) was the delivery of video and audio over ATM, including video compression technology, transport protocols for ATM, and multimedia databases. Results of LAVA include an MPEG System Stream player application and a server for delivery of streams [5]. In 1998, a LAVA extension called LAVA Education was started, which focuses on the use of interactive multimedia systems for educational purposes. DOVRE is a project at Telenor R&D.
DOVRE is a software platform for developing networked real-time 3D applications. The primary goal of DOVRE is to provide a platform for work, education, entertainment, and co-operation in distributed digital 3D worlds [3]. The Electronic Classroom is described as a case study in the following section.

4 Case Study 1: The Electronic Classroom

The first case study we present in this paper discusses an example of an advanced application that is used on top of the research network to provide a reliable service. In the MUNIN project [4] and the MultiTeam project [1], the Center for Information Technology Services at the University of Oslo (USIT), Telenor R&D, and the Center for Technology at Kjeller (UniK) have developed the so-called electronic classroom for distance education. Since 1993, the two electronic classrooms at the University of Oslo have been used for teaching regular courses, overcoming separation in space by exchanging digital audio, video, and whiteboard information. Currently, four electronic classrooms are established in Norway: two at the University of Oslo, one at the University of Bergen, and one at the Engineering School of Hedmark. Since 1997, the electronic classroom system has been commercially available from New Learning AS. The following sections give an overview of the application and describe the system architecture. A detailed analysis of QoS aspects in this system can be found in [15].

4.1 Application

The main goal of the distributed electronic classroom is to make the teaching situation in a distributed classroom as similar as possible to that of an ordinary classroom. Thus, the number of seats for students is limited to a maximum of 20 in each classroom. During a lecture, at least two electronic classrooms are connected. Teacher and students can freely interact with each other, regardless of whether they are in the same or in different classrooms.
This interactivity is achieved through the three main parts of each electronic classroom: electronic whiteboard, audio system, and video system. All participants can see each other, can talk to each other, and may use the shared whiteboard to write, draw, and present prepared material from each site. The electronic whiteboard, audio, and video systems in turn consist of several components. Figure 3 shows the students in the classroom with the teacher, and Figure 4 shows the two whiteboards (one with the picture of the teacher) in the remote classroom. In addition to the ordinary classroom structure that is visible in these pictures, i.e., student and teacher area, a technical back room is located behind the classroom. Figure 5 illustrates the basic layout of an electronic classroom.

Fig. 3. Electronic classroom with teacher

Fig. 4. Remote electronic classroom with students only

The electronic whiteboard is a synonym for a collection of software and hardware elements to display and edit lecture notes and transparencies that are written in Hypertext Markup Language (HTML). The whiteboard itself is a 10-foot semi-transparent shield that is used together with a video projector and a mirror as a second monitor of an HP 725/50 workstation in the back room. A light-pen is the input device for the whiteboard. A distributed application has been developed that can be characterized as a World-Wide Web (WWW) browser with editing and scanning features. When a WWW page is displayed, lecturer and students in all connected classrooms can concurrently write, draw, and erase comments on it by using the light-pen. Thus, floor control is achieved through the social protocol, as in an ordinary classroom, and is not enforced by the system.

Fig. 5. Basic layout of an electronic classroom (student area, teacher area, and technical back room)
Furthermore, a scanner can be used to scan new material, like a page from a book, and display it on the fly on the shared whiteboard. The entire application can be managed from a workstation in the classroom.

The video system comprises three cameras, a video switch, a set of monitors, and an H.261 coding/decoding device (codec) to generate a compressed digital video stream. One camera automatically follows and focuses on the lecturer. The other two cameras capture all events happening in the two slightly overlapping parts of the student area in the classroom. The audio system detects the location in the classroom with the loudest audio source, i.e., a student or the teacher who is talking. A video switch selects the one of the three cameras that captures this location and person in order to produce the outgoing video signal. Two control monitors are placed in the back of each classroom. The upper monitor displays the incoming video stream, i.e., pictures from the remote classroom, and the lower monitor displays the outgoing video stream, i.e., video information from the local classroom. Thus, the teacher can see the students in the remote classroom and can control the outgoing video information while facing the local students. The students in turn can see the remote classroom on a second large screen, which is also assembled out of a whiteboard, a video projector that is connected to the output of the H.261 codec, and a mirror in the back room. The audio system includes a set of microphones that are mounted on the ceiling.
The microphones are evenly distributed in order to capture the voices of all participants and to identify the location of the loudest audio signal in the classroom. Furthermore, the teacher is equipped with a wireless microphone. To generate a digital audio stream, two codecs are available: the audio codec of the workstation and the audio codec in the H.261 codec. Thus, one of three coding schemes can be selected: 8-bit 8 kHz PCM coding (64 kbit/s), 8-bit 16 kHz PCM coding (128 kbit/s), and 16-bit 16 kHz linear coding (256 kbit/s). Speakers are mounted at the ceiling to reproduce the audio stream from the remote site.

4.2 Platform

The aim of the electronic classroom system is to be an open system. Therefore, standardized Internet protocols have been used as far as possible (see Figure 6). There are four streams, which use IPv4 as the network protocol: management, audio, video, and whiteboard streams. The management part of the classroom, e.g., setting up a session, is performed in a point-to-point manner and utilizes the reliable TCP protocol. The data exchange (audio, video, and whiteboard streams) during a lecture requires multicast-capable protocols, because more than two classrooms can be interconnected. Therefore, UDP is used on top of IP multicast. The audio and video streams have stringent timing requirements, and audio and video packets are time-stamped with the RTP protocol. For both streams, software modules are used in the application to adapt the streams from the codecs to the RTP protocol, i.e., to fill a certain number of samples or parts of video frames into a protocol data unit (PDU). In contrast to audio and video, which tolerate errors, the whiteboard application cannot tolerate errors and is therefore placed on top of a proprietary multicast error control protocol (based on retransmissions) on top of UDP.
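The three audio bitrates above follow directly from sample size times sample rate. The sketch below checks these figures and, under the assumption of a 20 ms packetization interval (a common choice for RTP audio, not stated in the paper), computes how many bytes of audio each PDU would carry.

```python
# Audio bitrates for the three classroom coding schemes and, assuming a
# hypothetical 20 ms packetization interval, the resulting PDU payload sizes.

SCHEMES = {
    # name: (bits per sample, sample rate in Hz)
    "8-bit 8 kHz PCM":      (8, 8000),
    "8-bit 16 kHz PCM":     (8, 16000),
    "16-bit 16 kHz linear": (16, 16000),
}

PDU_INTERVAL_S = 0.020  # assumed packetization interval (20 ms)

def bitrate_kbit(bits, rate_hz):
    """Uncompressed mono bitrate in kbit/s: sample size times sample rate."""
    return bits * rate_hz / 1000

def pdu_payload_bytes(bits, rate_hz, interval_s=PDU_INTERVAL_S):
    """Bytes of audio carried in one PDU covering the given interval."""
    return int(rate_hz * interval_s) * bits // 8

for name, (bits, rate) in SCHEMES.items():
    print(f"{name}: {bitrate_kbit(bits, rate):.0f} kbit/s, "
          f"{pdu_payload_bytes(bits, rate)} bytes per 20 ms PDU")
```

With the assumed interval, the 64 kbit/s scheme yields 160-byte PDU payloads and the 256 kbit/s scheme 640-byte payloads, both comfortably below a typical Ethernet MTU.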
Fig. 6. Protocol stacks used in the electronic classroom (whiteboard application over a multicast error control protocol over UDP; audio and video over RTP over UDP and IP multicast; management over TCP/IP)

The network topology between the electronic classrooms is basically defined by the research network. The classrooms are connected to the research network either via a local ATM switch (ForeRunner 200) or via dedicated Ethernet, routers, and an FDDI ring. Addressing and routing of traffic on the IP layer is mainly performed by the workstations in the back rooms and the routers that are directly attached to the research network. As a backup solution, six ISDN lines can be used to interconnect two classrooms.

5 Case Study 2: The ENNCE/MULTE Project

In this section, we discuss the ENNCE/MULTE project, because it establishes a Metropolitan Area Network (MAN) with Gigabit/s capacity and utilizes the particular features of the test network.

5.1 Overview

The ENNCE/MULTE project is a collaboration between the University of Tromsø, the University of Oslo, FFI, Thomson-CSF Norcom, and UniK. The project is funded by the Norwegian Research Council under the Basic Telecommunication Research Program in the period 1997-2001.

The need for QoS, real-time behavior, and high performance in distributed multimedia applications, like the electronic classroom or command-and-control systems, is the starting point for the ENNCE/MULTE project. At first glance, it seems that the necessary technology to build an appropriate system platform for such applications is already commercially available: ATM networks that offer high bandwidth and (guaranteed) QoS to higher-layer protocols, and implementations of the distributed object computing middleware standard Common Object Request Broker Architecture (CORBA) from the Object Management Group (OMG).
Object Request Brokers (ORBs) represent the heart of CORBA and enable the invocation of methods on remote objects, regardless of their location, underlying network and transport protocols, and end-system heterogeneity [20]. However, nearly all CORBA implementations (e.g., IONA's Orbix and Visigenic's VisiBroker) are based on the communication protocols TCP/IP. It is well known that TCP/IP is not able to support the wide range of multimedia requirements, even if it runs over high-speed networks like ATM. Furthermore, CORBA itself is not well suited for performance-sensitive multimedia and real-time applications, because it lacks streams support, standard QoS policies and mechanisms, real-time features, and performance optimizations [16].

The main hypothesis of the ENNCE/MULTE project is that satisfying the broad range of requirements of current and future distributed multimedia applications requires flexible and adaptable middleware that can be dynamically tailored to specific application needs. A further hypothesis is that a flexible protocol system is an appropriate basis for the construction of flexible and adaptable middleware. Based on these hypotheses, the MULTE project breaks down into the following areas of concern:

- Analysis of application requirements, based on multimedia applications that are developed by the project partners: video journaling at the University of Tromsø, command-and-control systems on naval vehicles at FFI, and the electronic classroom at UniK [15].
- Low-latency, high-throughput transmission, based on the Gigabit service offered by the Gigabit ATM Network Kit from Washington University in St. Louis, an SCI network, and Gigabit Ethernet.
- Stream binding and enhanced interoperable multicast for heterogeneous environments, which requires appropriate abstractions at the upper API and configuration and management of filters at intermediate and end-systems [7].
- Flexible connection management, comprising mechanisms to adapt connection set-up and release mechanisms to QoS requirements and to make them independent of the particular protocol functionality.
- Flexible protocol systems that perform the communication tasks of the ORB core.

In the following subsections, we describe the architecture of the first prototype of a flexible multimedia ORB and the MULTE Gigabit network over which the ORB will be used.

5.2 Flexible Multimedia ORB

A flexible protocol system allows dynamic selection, configuration, and reconfiguration of protocol modules to dynamically shape the functionality of a protocol to satisfy specific application requirements and/or to adapt to changing service properties of the underlying network. The basic idea of flexible end-to-end protocols is that they are configured to include only the functionality required to satisfy the application for the particular connection. This might even include filter modules to resolve incompatibilities among stream flow endpoints and/or to scale stream flows due to different network technologies in intermediate networks. The goal of a particular configuration of protocol modules is to support the required QoS for requested connections. This includes point-to-point, point-to-multipoint, and multipoint-to-multipoint connections. As a starting point, we use the Da CaPo (Dynamic Configuration of Protocols) system [14] to build a flexible multimedia ORB.

Overview of Da CaPo: Da CaPo splits communication systems into three layers denoted A, C, and T.
End-systems communicate via the transport infrastructure (layer T), representing the available communication infrastructure with end-to-end connectivity (i.e., T services are generic). In layer C, the end-to-end communication support adds functionality to T services such that at the AC interface, services are provided to run distributed applications (layer A). Layer C is decomposed into protocol functions instead of sublayers. Each protocol function encapsulates a typical protocol task like error detection, acknowledgment, flow control, encryption/decryption, etc. Data dependencies between protocol functions are specified in a protocol graph. T-layer modules and A-layer modules terminate the module graph of a module configuration. T modules realize access points to T services, and A modules realize access points to layer C services. Both module types "consume" or "produce" packets. For example, in a distributed video application, a frame grabber and compression board produces video data. Applications specify their requirements within a service request, and Da CaPo configures in real time layer C protocols that are optimally adapted to application requirements, network services, and available resources. This includes determining appropriate protocol configurations and QoS at runtime, ensuring through peer negotiations that communicating peers use the same protocol for a layer C connection, initiating connection establishment and release, and handling errors that cannot be treated inside single modules. Furthermore, Da CaPo coordinates the reconfiguration of a protocol if the application requirements are no longer fulfilled. The main focus of the Da CaPo prototype is on the relationship between functionality and QoS of end-to-end protocols, as well as the corresponding resource utilization. Applications specify their QoS requirements in a service request in the form of an objective function.
On the basis of this specification, the most appropriate modules from a functional and resource-utilization point of view are selected. Furthermore, it is ensured that sufficient resources are available to support the requested QoS without decreasing the QoS of already established connections (i.e., admission control within layer C).

Integration of Da CaPo in COOL: At UniK, we develop a new multithreaded version of Da CaPo on top of the real-time micro-kernel operating system Chorus that takes full advantage of the real-time support of Chorus [11]. Furthermore, we integrate Da CaPo into the CORBA implementation COOL such that the COOL ORB is able to negotiate QoS and utilizes optimized protocol configurations instead of TCP/IP [2]. Figure 7 illustrates the architecture of the extended COOL ORB on top of Chorus.

The COOL communication subsystem is split in two parts to separate the message protocol, i.e., the Inter-ORB Protocol (IIOP) and the proprietary COOL Protocol, from the underlying transport protocols, i.e., TCP/IP and Chorus Inter-Process Communication (IPC). A generic message protocol provides a common interface upwards; thus, generated IDL stubs and skeletons are protocol independent. A generic transport protocol provides a common interface for the different transport implementations. There are two alternatives to integrate Da CaPo in this architecture: (1) Da CaPo simply represents another transport protocol. This alternative is our first prototype implementation of Da CaPo in COOL, accompanied by an extended version of IIOP called QoS-IIOP, or QIOP. QIOP encapsulates QoS information from application-level IDL interfaces, conveys this information down to the transport layer, and performs the reverse operation at the peer system. Da CaPo uses this information for the configuration of protocols.
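The selection step described above, choosing one module per protocol task so as to maximize an application-supplied objective function within a resource budget, can be sketched as a greedy selection. All module names, quality scores, and costs below are invented for illustration; the real Da CaPo system operates on protocol graphs and additionally negotiates the configuration with the peer, which this sketch omits.

```python
# Hypothetical sketch of Da CaPo-style protocol configuration: pick one
# module per protocol task according to weighted quality (the "objective
# function"), with a simple admission-control check on a resource budget.

# Candidate modules per protocol task: (name, quality score, resource cost).
# All figures are invented for illustration.
CANDIDATES = {
    "error_control": [("none", 0.0, 0), ("checksum", 0.5, 2), ("retransmit", 1.0, 5)],
    "flow_control":  [("none", 0.0, 0), ("window", 1.0, 3)],
    "encryption":    [("none", 0.0, 0), ("cipher", 1.0, 6)],
}

def configure(weights, budget):
    """Greedily pick one module per task within the resource budget.

    weights: dict task -> importance in the objective function
    budget:  total resource units available for this connection
    Returns (configuration dict, remaining budget).
    """
    config, remaining = {}, budget
    # Serve the most important tasks first so they get resources.
    for task in sorted(weights, key=weights.get, reverse=True):
        best = None
        for name, quality, cost in CANDIDATES[task]:
            # Only consider modules that fit the remaining budget,
            # and keep the one with the highest weighted quality.
            if cost <= remaining and (best is None or
                    weights[task] * quality > weights[task] * best[1]):
                best = (name, quality, cost)
        config[task] = best[0]
        remaining -= best[2]
    return config, remaining

# A whiteboard-like stream: errors are intolerable, encryption unimportant.
cfg, left = configure({"error_control": 3, "flow_control": 1, "encryption": 0},
                      budget=8)
```

Here the zero-weighted encryption task falls back to the "none" module, which is in the spirit of flexible end-to-end protocols: only the functionality the application actually needs is configured in.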
Fig. 7. Integration of Da CaPo into COOL and Chorus (the generic message protocol covers IIOP, the COOL Protocol, and QIOP; the generic transport protocol covers TCP/IP, Chorus IPC, and Da CaPo in alternatives (i) and (ii); all on top of the Chorus operating system)

The next step is to implement the second alternative, where Da CaPo additionally configures a message protocol. The message protocols are then Da CaPo modules that format requests for marshaling and demarshaling in stubs and skeletons.

5.3 MULTE Gigabit Network

At FFI and UniK, we are currently establishing a metropolitan area network that combines traditional network technologies, like 100 Mbit/s Ethernet and 155 Mbit/s ATM, with the following Gigabit network technologies [12]:

- The Gigabit ATM Network Kits are based on technology that has been developed at Washington University in St. Louis [6]. The ATM switches support several different link speeds up to 2.4 Gb/s. The ATM Network Interface Cards (NICs) operate at up to 1.2 Gb/s.
- SCI is standardized by ANSI-IEEE (Std 1596-1992). SCI provides distributed shared memory to a cluster of nodes, e.g., workstations, memory, disks, high-speed network interfaces, etc. Hardware-supported shared memory can be used in various applications, ranging from closely synchronized parallel programming to LAN support. The aggregated bandwidth of the SCI ring used at FFI is 1.2 Gbit/s.
- Gigabit Ethernet supports data transfer rates of 1 Gbit/s and is standardized in the IEEE 802.3z standard.

Figure 8 illustrates the network topology and infrastructure. Two Gigabit ATM switches, one at FFI and one at UniK, connected with a 1.2 Gigabit/s link, build the core of the network. At FFI, five PCs are connected with PCI cards to this switch and additionally to an SCI ring. At UniK, we connect SunUltra workstations and PCs to the Gigabit switch, to Gigabit Ethernet, and to the local 155 Mbit/s ATM network from ForeSystems.
The available access to different types of gigabit networks in one end-system enables us to compare these technologies directly. In particular, we will experimentally evaluate with the flexible multimedia ORB the possibility of selecting an appropriate network service on the fly. The gigabit network is directly connected to the test network via the Fore ATM switch. This enables the University of Tromsø to access the gigabit network at Kjeller, and we are currently using the flexibility of Da CaPo and the possibilities of the test network to study the influence of various protocol configurations, combined with different network reservations, on the QoS of streamed video transfer between Kjeller and Tromsø (a distance of approx. 1500 km).

Fig. 8. Gigabit network at Kjeller (Gigabit ATM switches at UniK and FFI; Fore ATM and Gigabit Ethernet attached at UniK)

6 Concluding Remarks

The aim of this paper is twofold. On the one hand, we intend to give an overview of gigabit networking and related areas in Norway; thus, the first part describes the Norwegian academic networking infrastructure, research institutions, programs, and projects. On the other hand, we want to provide a more detailed description of two exemplary activities, the electronic classroom and the ENNCE/MULTE project. In this context, the definition of relevant, interesting, and important activities is always based to a certain extent on subjective measures, even if it is intended to present an objective selection of activities and projects. Therefore, we apologize if we have missed important activities.

Acknowledgements: I wish to thank Tom Kristensen for a lot of help in preparing the paper. Furthermore, I would like to acknowledge Petter Kongshaug and Olav Kvittem from UNINETT for providing details about the research and test network.

References

1. Bakke, J. W., Hestnes, B., Martinsen, H.
(1994) Distance Education in the Electronic Classroom. Technical Report Telenor Research and Development, TF R 20/94, Kjeller (Norway)
2. Blindheim, R. (1999) Extending the Object Request Broker COOL with Flexible Quality of Service Support. Master Thesis at the University of Oslo, Department of Informatics, February 1999
3. Bottar, E. (1997) Telepresence through Distributed Augmented Reality. Scientific Report R&D 44/97, Telenor
4. Bringsrud, K., Pedersen, G. (1993) Distributed Electronic Classrooms with Large Electronic White Boards. Proceedings of the 4th Joint European Networking Conference (JENC 4), Trondheim (Norway), May 1993, 132-144
5. Bryhni, H., Lovett, H., Maartmann-Moe, E., Solvoll, T., Sorensen, T. (1996) On-demand Regional Television over the Internet. Proceedings of the ACM Multimedia 96 Conference, Boston, November 1996, 99-108
6. Chaney, T., Fingerhut, A., Flucke, M., Turner, J. (1997) Design of a Gigabit ATM Switch. Proceedings of IEEE Infocom, April 1997
7. Eliassen, F., Mehus, S. (1998) Type Checking Stream Flow Endpoints. Proceedings of Middleware'98, Chapman & Hall, September 1998
8. Gallup (1998) Intertrack December 1998 (in Norwegian). Available at: http://www.gallup.no/menu/internett/default.htm
9. Goebel, V., Plagemann, T., Berre, A.-J., Nygård, M. (1996) OMODIS - Object-Oriented Modeling and Database Support for Distributed Systems. Proceedings of Norsk Informatikk Konferanse (NIK'96), Alta (Norway), November 1996, 7-18
10. Johansen, D., Schneider, F. B., van Renesse, R. (1998) What TACOMA Taught Us. To appear in: Mobility, Mobile Agents and Process Migration - An Edited Collection, Milojicic, D., Douglis, F., Wheeler, R. (Eds.), Addison Wesley Publishing Company
11. Kristensen, T. (1999) Extending the Object Request Broker COOL with Flexible Quality of Service Support (in Norwegian). Master Thesis at the University of Oslo, Department of Informatics, in progress
12. Macdonald, R.
(1998) End-to-end Quality of Service Architecture for the TDF. ENNCE/WP2 Technical Report TR02/98F
13. Omang, K. (1998) Performance of a Cluster of PCI Based UltraSparc Workstations Interconnected with SCI. Proceedings of Network-Based Parallel Computing, Communication, Architecture, and Applications, CANPC'98, Las Vegas, Nevada, Jan/Feb 1998, Lecture Notes in Computer Science, No. 1362, 232-246
14. Plagemann, T. (1994) A Framework for Dynamic Protocol Configuration. PhD Thesis, Swiss Federal Institute of Technology Zurich (Diss. ETH No. 10830), Zurich, Switzerland, September 1994
15. Plagemann, T., Goebel, V. (1999) Analysis of Quality-of-Service in a Wide-Area Interactive Distance Learning System. To appear in Telecommunication Systems Journal, Baltzer Science Publishers
16. Schmidt, D. C., Gokhale, A. S., Harrison, T. H., Parulkar, G. (1997) A High-Performance End System Architecture for Real-Time CORBA. IEEE Communications Magazine, Vol. 35, No. 2, February 1997, 72-77
17. UNINETT (1993) Research Network and Internet2 (in Norwegian). UNINyTT No. 1 1993, electronically available at: http://www.uninett.no/UNINyTT-1-93.html
18. UNINETT (1998) Research Network and Internet2 (in Norwegian). UNINyTT No. 1/2 1998, electronically available at: http://www.uninett.no/UNINyTT/1-98
19. UNINETT (1998) Research Network Established! (in Norwegian). UNINyTT No. 3 1998, electronically available at: http://www.uninett.no/UNINyTT/3-98
20. Vinoski, S. (1997) CORBA: Integrating Diverse Applications Within Distributed Heterogeneous Environments. IEEE Communications Magazine, Vol. 35, No. 2, February 1997, 46-55

A Appendix

Table 2.
Institutions

University of Bergen, Department of Information Science         http://www.ifi.uib.no/index.html
University of Bergen, Department of Informatics                 http://www.ii.uib.no/index_e.shtml
University of Oslo, Department of Informatics                   http://www.ifi.uio.no/
UniK - Center for Technology at Kjeller                         http://www.unik.no
University of Tromsø, Department of Computer Science            http://www.cs.uit.no/EN/
NTNU, Department of Telematics                                  http://www.item.ntnu.no/index-e.html
NTNU, Department of Computer Science and Information Science    http://www.idi.ntnu.no/
Norwegian Defence Research Establishment                        http://www.ffi.no
Norwegian Computing Center                                      http://www.nr.no/ekstern/engelsk/www.generelt.html
SINTEF Telecom and Informatics                                  http://www.informatics.sintef.no/
NORUT IT                                                        http://www.norut.no/itek/
Telenor Research and Development                                http://www.fou.telenor.no/english/
Ericsson                                                        http://www.ericsson.no
Alcatel                                                         http://www.alcatel.no/telecom/
Thomson CSF Norcom                                              http://www.thomson-csf.no/
UNINETT                                                         http://www.uninett.no/index.en.html

Table 3.
Programs, projects, and activities

DITS                                http://www.ifi.uio.no/dits/translate/index.html
GT                                  http://www.sol.no/forskningsradet/program/profil/gtele/
Supercomputing program              http://www.sol.no/forskningsradet/program/tungregn/index.htm
Telecom 2005 - Mobile communication http://www.item.ntnu.no/~tc2005/
ADAPT-FT                            http://www.ifi.uio.no/~adapt/
CAGIS                               http://www.idi.ntnu.no/~cagis/
ClustRa                             http://www.clustra.com/
DOVRE                               http://televr.fou.telenor.no/html/dovre.html
ENNCE                               http://www.unik.no/~paal/ennce.html
ENNCE/MULTE                         http://www.unik.no/~plageman/multe.html
GDD                                 http://www.gdd.cs.uit.no/gdd/
GOODS                               http://www.ifi.uio.no/~goods/
INSTANCE                            http://www.unik.no/~plageman/instance.html
IMiS                                http://www.informatics.sintef.no/www/prosj/imis/imis.htm
LAVA                                http://www.nr.no/lava/
MacroScope                          http://www.cs.uit.no/forskning/DOS/MacroScope
Mice-nsc                            http://sauce.uio.no/mice-nsc/
MMCL                                http://www.ifi.uio.no/~mmcl/
Multimedia Databases                http://www.idi.ntnu.no/grupper/DB-grp/projects/multimedia.html
Network Management                  http://www.item.ntnu.no/~netman/
Plug and Play (PaP)                 http://www.item.ntnu.no/~plugandplay/
Project I                           http://pi.nta.no/indexe.html
SCI activities at the University of Oslo                    http://www.ifi.uio.no/~sci
Swipp (Switched Interconnection of Parallel Processors)     http://www.ifi.uio.no/~swipp/
TACOMA                              http://www.tacoma.cs.uit.no/
Vortex                              http://www.vortex.cs.uit.no/vortex.html
Wirac                               http://www.tele.ntnu.no/wirac/

Low Speed ATM over ADSL and the Need for High Speed Networks

A case study in Göttingen

Gerhard J. A. Schneider

Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, Am Fassberg, D-37077 Göttingen, Germany

Abstract. The use of modern technology from non-standard sources allows IT centres to find temporary solutions for operational needs, for example to provide access quickly to a networking infrastructure for the local scientific community.
This paper describes the experiences of GWDG with ADSL equipment. Although originally a consumer technology, it can also be used in a scientific environment. Despite some clear limitations, ADSL technology works quite well if other means are not readily available. In addition, various networking issues that arise in a scientific environment are discussed, using the situation in Göttingen as an example.

1 Introduction

GWDG, the Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, is the joint IT centre of the University of Göttingen and the Max Planck Society. Five major research institutes of the Society are situated in the Göttingen area. While four of them are within the city boundaries, the fifth is some 30 km away. In order to provide adequate network access for this institute as well, a dark fibre link has been installed in cooperation with the local water supply company. It turned out that this solution was cheaper than a 3-year lease of a high speed link from any of the telecommunication carriers. Apart from doing research in applied computer science, the major tasks of GWDG are to provide strategic services for its customers, as well as the operation of local midrange parallel computers and of the high speed data backbone GöNET in the Göttingen area.

2 The WAN infrastructure

Internet connectivity for the German science community is provided by the Deutsches Forschungsnetz DFN ([1]). It operates B-WiN, a nation-wide ATM backbone (which is physically part of the network of Deutsche Telekom) with access points of 34 Mbit/s and 155 Mbit/s.
This network will migrate to gigabit speed in the year 2000 and will then offer access speeds of up to 622 Mbit/s and later even more. The main B-WiN nodes are currently housed on the premises of Deutsche Telekom. Customers either have their own access points (pricing is independent of the location and ranges from EUR 400,000 p.a. for 34 Mbit/s to EUR 600,000 p.a. for 155 Mbit/s) or connect to the nearest access point via leased or private lines. While the second option allows the sharing of an access point, it may add the cost of the leased lines. Although prices for leased lines have started to fall since the liberalisation of the telecommunication market in early 1998, they still pose a problem for remote sites.

The ATM backbone is primarily used to transport IP traffic between member sites. Thus PVCs exist between the various routers in the main nodes. For detailed information about the current infrastructure of the network, including an up-to-date map, see [2]. The map in Fig. 1 reflects the situation in April 1999.

Fig. 1. High-speed network B-WiN of the Deutsches Forschungsnetz

In addition to plain IP traffic, it is also possible to order PVCs and, quite recently, SVCs between individual sites. Prices for such connections are very moderate and therefore not prohibitive. The reasons for these quality-of-service connections can be special demands from research groups, such as priority access to supercomputers, or video conferences. It should be added that the majority of such individual PVCs also carry only IP traffic, but in a guaranteed environment.
Thus many aspects of the current discussion on quality-of-service connections over the Internet have already been solved within the German science network by using appropriate transport technologies. While ATM to the desktop may still be debated and perhaps never arrive, ATM is certainly well suited as a backbone technology, especially with respect to quality of service.

The B-WiN also offers connectivity to the US networks via its Hannover node, currently at 155 Mbit/s, with another upgrade to 310 Mbit/s due in July 1999.

The US connectivity highlights various political problems many European network providers for the scientific communities are facing: although there is a significant flow of data from Europe to the US, commercial US providers reject the idea of co-funding transatlantic lines. In addition, the diverse provider infrastructure in the United States basically forces Europeans to build their own distribution network infrastructure in the US to allow for adequate connectivity with the different leading IP subnetworks, in order to achieve decent throughput rates to US universities and other research partners. So in essence, while US sites are benefitting from European data sources, the European networks have to pay for this.

DFN's B-WiN is also part of the European ATM network TEN-155, which provides interconnectivity between the different European science networks. Rather than relying on obscure peering via CIXes, the ATM network allows for bilateral agreements between the various institutions.
It seems that this model is superior to the US model with its shortcomings described above, at least from the point of view of international access.

Exchange with the commercial Internet in Germany is ensured via a 34 Mbit/s link(1) to the DE-CIX in Frankfurt. Although the load on this link is heavy, many commercial providers currently seem unwilling or unable to allow an increase of the link speed, as their networks may be unable to cope with the flow of data requested by their users from servers in the science community. As a result, some commercial Internet providers have ordered direct links to the German science network of the DFN.

(1) This link will be upgraded to 68 Mbit/s in May 1999.

3 The Landeswissenschaftsnetz Nord

In past centuries it was customary to found universities away from the main political centres, to keep the influence of riotous students away from politics. In fact, Göttingen is a perfect example, since it was located at the far southern end of the Kingdom of Hannover in 1737. Similarly, in the mid-seventies of this century many newly founded universities and polytechnics were placed in remote areas, mainly to provide some local infrastructure in otherwise poor regions.

Smaller sites which do not have the bandwidth requirements or the financial power to justify a B-WiN node now face the additional cost of leased lines to access the nearest B-WiN site and the Internet. Thus the two German states of Lower Saxony (Niedersachsen) and Bremen joined forces to improve the situation for such remote sites by providing a statewide infrastructure for teleteaching and video conferences as well as telemedicine.
It still seems unclear whether teleseminars will be the choice of the future for university education. In any case it is necessary to train students to use these tools, which will no doubt become part of their working life later on. In addition, the export of knowledge from universities to companies may become an additional challenge. Teleteaching methods could be used to enable the direct transfer of ideas and modern developments to industry in order to provide a competitive advantage. Similar arguments hold for the medical sector. The dense population in Germany may not create a need for teleoperations, but developing and providing the necessary tools and methods may eventually be important for the export-oriented medical industry.

The new state network LWN (LandesWissenschaftsnetz Nord) became operational in March 1999. It consists of a 155 Mbit/s ATM ring (see Fig. 2) connecting the major institutions, as well as access lines for smaller remote sites operating at a minimum of 2 Mbit/s. Thus access to Internet technology at rates required by modern developments is now guaranteed for all state institutions in higher education. Since more and more local schools connect to the nearest university or polytechnic (at their own cost), the availability of appropriate connectivity also has a positive effect on secondary education.

The LWN is fully compatible with DFN's science network B-WiN and interconnects via the three sites in Bremen, Hannover and Göttingen. Each interconnect operates at 155 Mbit/s.
The price structure of the DFN meant that part of the funding of the state network came from merging various 34 Mbit/s access points, benefitting from the economy of scale. Thanks to this compatibility, PVCs via LWN and B-WiN do not present any problems. In particular, this ensures that participants of the LWN are not cut off from forthcoming developments but on the contrary can participate much better than before.

The ring infrastructure allows better use of the available resources, as there are always two paths between any two institutions. In addition, it provides an obvious fault tolerance in accessing the three interconnected sites. In particular, in Göttingen the lines for LWN and B-WiN physically arrive at different locations on campus.

Fig. 2. Landeswissenschaftsnetz Nord

4 Dial-in

The classical approach to providing dial-in support for users required universities to purchase appropriate equipment and to lease access lines from Deutsche Telekom. The ongoing liberalisation forces carriers to generate traffic to compensate for declining revenue. As a result, Deutsche Telekom is now placing routers on university premises and is providing the necessary lines at no cost to the scientific institutions. Thus the dial-in capacity has been boosted significantly in the past months. As the Telekom infrastructure is very modern, this means that rather than large modem banks, a few S2m ISDN trunk lines provide the required capacity both for ISDN and for analogue connections, including V.90. Since connection charges are identical for ISDN and analogue users, more and more users switch to ISDN because of its faster and more reliable performance. Connecting to the university for one hour at 64 kbit/s currently costs less than EUR 1 during off-peak hours.
Since the basic ISDN S0 subscriber access offers two (virtual) 64 kbit/s lines, connections at 128 kbit/s are also possible, but cost twice as much. Since connections are charged based on time intervals (which may be up to 4 minutes long), demand-driven automatic dialing of the second channel of the S0 trunk may not be wise in all cases.

Although 64 kbit/s does not seem much in these days of high speed networking, especially compared with the 155 Mbit/s WAN technology, the number of connections may present a challenge. GWDG is currently operating some 10 S2m dial-in trunks, which may result in a demand of up to 20 Mbit/s of WAN capacity, in addition to the LAN traffic. Fortunately, dial-in demand is typically at its peak in the evening and night hours and thus compensates nicely for the daytime LAN demand. The dial-in characteristics are best shown by the diagram in Fig. 3, which seems typical for a German science institution, but also reflects the pricing structure of the carrier (currently rates for local calls are cheaper after 18:00 and cheapest after 21:00). It also shows that scientists and students tend to work late.

Fig. 3. Usage of dial-in lines over 24 hours

5 LAN

While access to WAN technologies can now be acquired at relatively short notice due to the abundance of fibre optic cables owned by the long distance carriers, bringing the local LAN onto modern technologies is a time- and money-consuming exercise. Shortage of funds in the public sector means that many construction plans have to be postponed.

Göttingen is a nice in-town university, with many departments housed in old and picturesque buildings, including C. F. Gauss' original observatory. Although this boosts the academic atmosphere and makes the university very attractive, it turns into a nightmare for networkers.
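The "up to 20 Mbit/s" figure for the dial-in trunks follows directly from standard ISDN arithmetic; a minimal sketch (the per-trunk channel count and B-channel rate are standard S2m/E1 values, assumed here rather than quoted in the paper):

```python
# Back-of-the-envelope check of the dial-in capacity quoted above.
# An S2m (E1 primary rate) trunk carries 30 usable B-channels of
# 64 kbit/s each; GWDG operates some 10 such trunks.

B_CHANNEL_KBITS = 64          # one ISDN B-channel
CHANNELS_PER_S2M = 30         # usable B-channels on an E1/S2m trunk
TRUNKS = 10                   # S2m dial-in trunks operated by GWDG

trunk_mbits = B_CHANNEL_KBITS * CHANNELS_PER_S2M / 1000   # ~1.92 Mbit/s per trunk
total_mbits = trunk_mbits * TRUNKS                        # ~19.2 Mbit/s overall

print(f"{trunk_mbits:.2f} Mbit/s per trunk, {total_mbits:.1f} Mbit/s in total")
# prints "1.92 Mbit/s per trunk, 19.2 Mbit/s in total"
```

With every channel busy, the ten trunks saturate at just under the 20 Mbit/s the paper cites as the worst-case WAN demand.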
Connecting a building to the backbone not only means installing the cabling locally while conforming to the requirements of conservation laws, but also digging across roads and public premises. Only recently has the legal framework been liberalised in this respect. As a result, the backbone infrastructure in Göttingen is up to date, with a 622 Mbit/s ATM backbone as well as an FDDI ring connecting the various central points of the university. Most science faculties now have either 100 Mbit/s or 10 Mbit/s access to the backbone.

Arts and social sciences are typically not placed high on the list of priorities, since the need for networking was not obvious for these disciplines when priorities were set 10 years ago. Thus GWDG is now faced with rising demands from a new group of users but with little or no extra funds to meet this specific demand.

However, the university owns a large and extensive copper network which was installed together with the PABX in the late 1960s. Basically, each office is on this telephone network and there are plenty of spare wires into each building. While the PABX itself is now more of historic interest and up for replacement, the copper network still seems to be in excellent condition. Although classical modem connections provide a first way to access the backbone over these wires, the speed is not adequate even for modest requirements from science.

6 ADSL

In early 1998, GWDG teamed up with Ericsson to investigate the possibility of providing higher speeds over this copper network. The then newly released Ericsson ANxDSL equipment was to be at the centre of the investigation.
Analysis of the equipment showed at a very early stage that it offered some interesting advantages over other solutions in the ADSL sector which made it particularly interesting for deployment in a LAN environment. The most appealing feature is that Ericsson's ANxDSL delivers native ATM to the customer premises. The network terminating equipment offers two native ATM plugs (ATM at 25 Mbit/s, as a matter of fact) as well as an Ethernet port to carry IP LAN traffic over ATM. This port is bridge-tunneled according to RFC 1483. Therefore it is very easy to transport at least two different LANs, e.g. a VLAN for administrative purposes as well as the standard LAN infrastructure for science. Thus the typical paranoia of the administrative sector with respect to IP traffic can be overcome at no extra cost. Other institutions in the state were forced to install a separate administrative LAN, and since funds are available only once, this meant that essentially scientific needs had to be sacrificed to accommodate administrative demands.

The main ATM hub for the ADSL equipment was placed next to a main ATM switch on the campus network. Thus a seamless integration became possible. Since copper lines are readily available, it became possible to deliver a connection to the GöNET to many sites almost immediately. The waiting time for a LAN connection was reduced from several years to several days.
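The RFC 1483 bridge-tunneling mentioned above wraps each Ethernet frame in an LLC/SNAP header before it crosses the ATM link. A minimal sketch of that encapsulation (the header bytes are taken from RFC 1483; the example MAC frame is made up for illustration):

```python
# Sketch of RFC 1483 LLC/SNAP encapsulation for bridged Ethernet over
# AAL5, as used by the ANxDSL Ethernet port described above. Only the
# payload framing is shown; the AAL5 trailer is added by the ATM layer.

def encapsulate_bridged_ethernet(mac_frame: bytes, keep_fcs: bool = False) -> bytes:
    llc = bytes([0xAA, 0xAA, 0x03])                  # LLC header announcing SNAP
    oui = bytes([0x00, 0x80, 0xC2])                  # IEEE 802.1 organisation code
    pid = bytes([0x00, 0x01 if keep_fcs else 0x07])  # bridged 802.3, with/without FCS
    pad = bytes(2)                                   # two pad octets before the frame
    return llc + oui + pid + pad + mac_frame

# Hypothetical broadcast frame: destination, source, dummy body.
frame = bytes.fromhex("ffffffffffff00a0c9000001") + b"payload"
payload = encapsulate_bridged_ethernet(frame)
print(payload[:10].hex())   # prints "aaaa030080c200070000"
```

Because the ten-byte prefix identifies the traffic as a bridged PDU, several independent LANs (e.g. the administrative VLAN and the science LAN) can share the same ATM path on separate virtual circuits.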
Although ADSL is primarily a consumer technology, offering high bandwidth to the customer site and comparatively little bandwidth in the opposite direction, it turns out that it offers a fully functional solution also in a high performance LAN environment. It enables users to experience the possibilities of high speed Internet, so that the necessary skills will already have been developed when the proper connection is installed in the future.

The actual ADSL system consists of two parts:

- At the customer site a network terminating device ANxDSL-NT (Fig. 4) is installed. This device has 3 ports: two ports offering native 25.6 Mbit/s ATM access and a twisted pair port for direct Ethernet access.

Fig. 4. Network termination at the customer site

A filter in the NT allows the splitting of plain old telephony service (POTS) and ADSL on the customer premises. A similar splitter for the European ISDN infrastructure will soon be available. Thus POTS and ISDN are not affected by a power failure, while the NT requires electrical power to operate.

- At the central site a line terminating device ANxDSL-LT (Fig. 5) is installed. This consists of a shelf holding up to 15 cards (two ports each; at the time of writing, 4-port cards were about to be released), connected via the backplane to a 155 Mbit/s STM-1 interface.

Fig. 5. ANxDSL equipment overview

Ericsson also provides a concentrator allowing the connection of 16 such shelves onto one 155 Mbit/s STM-1 interface. In principle, up to 480 ADSL lines can thus be connected to the WAN, depending on traffic and performance requirements. The original setup at GWDG consisted of one ANxDSL-LT located close to the central PABX of the university and 30 ANxDSL-NT devices. Now almost 50 lines are in operation. The telephone functionality was not tested.
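The 480-line figure, and the degree to which the single STM-1 uplink is oversubscribed, can be checked with a little arithmetic; a minimal sketch (the 8 Mbit/s per-line downstream rate is inferred from the trial numbers reported in this paper, 30 lines yielding 240 Mbit/s theoretical bandwidth):

```python
# Oversubscription check for the ANxDSL concentrator described above:
# 16 shelves x 15 cards x 2 ports feed a single 155 Mbit/s STM-1 uplink.

SHELVES = 16
CARDS_PER_SHELF = 15
PORTS_PER_CARD = 2
STM1_MBITS = 155
LINE_MBITS = 8                # nominal downstream per line, inferred from the trial

lines = SHELVES * CARDS_PER_SHELF * PORTS_PER_CARD   # 480 lines
aggregate = lines * LINE_MBITS                       # 3840 Mbit/s worst case
ratio = aggregate / STM1_MBITS                       # ~24.8x oversubscribed

print(lines, aggregate, round(ratio, 1))   # prints "480 3840 24.8"
```

The roughly 25-fold oversubscription explains the paper's caveat "depending on traffic and performance requirements": the design presumes bursty, dial-in-like usage rather than continuously loaded LAN couplings.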
It is clear, however, that current ADSL technology cannot be seen as a replacement for traditional LAN technology, especially with respect to multimedia applications and high-end central services, such as a centralized backup for large data volumes. In particular, the core system is not designed to handle massive LAN traffic but rather to support the occasional dial-in from many users at different times, with a resulting moderate demand on networking. The coupling of LANs via ADSL, however, tends to create a continuous demand for bandwidth, especially when student dorms are on the net. In our trial, the 30 lines generated a theoretical bandwidth of 240 Mbit/s and a practical peak demand of almost 100 Mbit/s.

6.1 Experiences

Provided the available copper wire is of reasonable quality, the ADSL technology works amazingly well and is very easy to set up. Under reasonably good conditions, the bandwidth achieved with this technology is indeed up to the promises made in the manuals. Yet ADSL puts a serious demand on a network, and it seems that carriers who are thinking of offering the technology to consumers may underestimate the necessary upgrades in the backbone. Most of the performance problems observed during the trials originated from previously unnoticed defects in the copper wires.

Apart from private copper lines, GWDG also experimented with a 64 kbit/s leased line from Deutsche Telekom (in fact a cheap copper wire) to run ADSL, which at the beginning was not (officially) known to Deutsche Telekom. The results were just as promising. As a consequence, Deutsche Telekom is now monitoring the progress in Göttingen and is about to sign a contract with GWDG.
This contract will enable GWDG to rent additional copper lines at a very moderate cost (well below the traditional cost of leased lines) to connect remote sites as well as the homes of some university staff members to GöNET. Although Deutsche Telekom is about to launch its own ADSL pilots in various German cities, this contract will allow for a special infrastructure in Göttingen, offering more functionality and options because of the scientific interest behind the setup.

To highlight the experiences, Table 1 gives an overview of some of the speeds obtained over the network. Length of cable is certainly one limiting factor, but there are obviously others which we could not discover due to the lack of measuring equipment.

Table 1. Line lengths and link speeds of ADSL

Location                         line length (m)  downlink (kbit/s)  uplink (kbit/s)
Kolosseum                        1700             7968               640
Studentendorf                    1900             7968               640
Neuer botanischer Garten         2750             7712               832
Gerichtsmedizin                  5400             2304               640
Studienzentrum Subtropen         2000             7968               800
Studienzentrum U. of California  3600             4320               608
Medizinische Physik              2000             7456               640
Botanischer Garten               2900             6688               608
ZENS                             2000             7264               736
Volkskunde                       3500             6592               544
Völkerkundemuseum                3200             6560               224
Sprachlehrzentrum                2900             5824               704
Anthropologie                    5100             3872               640
Ibero-Amerika-Institut           2600             6720               672
Umweltgeschichte                 2700             7968               160
Heizkraftwerk                    800              4352               448
Akademie der Wissenschaften      3650             5536               544
Restaurierungsstelle             3500             4736               832

7 City of Göttingen interconnected

In another attempt to speed up the building of the GöNET, talks started with the City of Göttingen with the aim of jointly using the available infrastructure.
Both ATM and secure end-to-end encryption provide the technology to run LANs with contradicting security requirements over the same cable infrastructure, thus providing the potential for a cost-effective sharing of resources. The City of Göttingen also owns a fibre optic network (used to control traffic lights and to connect sites like public swimming pools to the central administration in the City Hall) as well as a copper network. The copper network is of interest for offering permanent Internet connections to primary and secondary schools, currently at modem speed. It turned out that at one specific location GöNET and the city network are just 30 m apart. As a result, the university, the city and GWDG signed a contract in late 1998 and decided to join forces.

After the networks became physically connected in March 1999, various university buildings can now easily be reached via existing fibres of the city network, as some of them are close to public sites or traffic lights. The positive experiences at GWDG with the ADSL equipment have triggered the decision to connect all local schools in Göttingen via ADSL to a central site in the City Hall and from there via a 2 Mbit/s PVC over ATM directly to the German science network. Access to GöNET will be at a higher speed, so that local schools may also gain insight into the paradigm changes caused by high speed networking.

8 Summary

Modern telecommunication systems allow the rapid deployment of currently adequate bandwidth to a large number of sites. Protocols like ATM as well as encryption permit the operation of different LANs over the same infrastructure. In addition, issues concerning quality of service, like guaranteed or restricted bandwidth, can be solved easily with ATM, both locally and on a nationwide basis.
The sudden decline in prices for WAN connections leads to an inverse networking pyramid: while backbone and WAN are capable of delivering the bandwidth required by modern communication, the local infrastructure, both to and in the buildings, does not keep up with this development, due to funding issues. ADSL provides a way to quickly connect sites at reasonable speed and to bridge the time gap until fibre is installed. In fact, the Deutsche Forschungsgemeinschaft (DFG) has acknowledged this inverted networking phenomenon and is working on a memorandum to highlight the need for additional resources for local networks.

References

1. Verein zur Förderung eines deutschen Forschungsnetzes DFN, Berlin, http://www.dfn.de
2. B-WiN-Karte, http://www.dfn.de/b-winkarte.html
3. ADSL-Projekt der GWDG, http://www.gwdg.de/adsl

The NRW Metacomputing Initiative*

Uwe Schwiegelshohn and Ramin Yahyapour
Computer Engineering Institute, University Dortmund, 44221 Dortmund, Germany

Abstract. In this paper the Northrhine-Westphalian metacomputing initiative is described. We start by discussing various general aspects of metacomputing and explain the reasons for founding the initiative with the goal to build a metacomputer pilot. The initiative consists of several subprojects that address metacomputing applications and the generation of a suitable infrastructure. The latter includes the components user interface, security, distributed file systems and the management of a metacomputer. Finally, we specifically discuss the aspect of job scheduling in a metacomputer and present an approach that is based on a brokerage and trading concept.

1 The Need for High Performance Computing

High Performance Computing (HPC) has become an important tool for research and development in many different areas [2,14].
Originally, supercomputers were mainly used to address problems in physics. Then the term Grand Challenges was introduced in the eighties to describe a variety of technical problems which require the availability of significant computing power. Today the number of fields which are not in need of computers is rapidly decreasing. In addition to the core areas of physics and engineering, high performance computer equipment is essential for e.g. the design of new drugs, accurate weather forecasts [13] or the creation of new movies. Other new applications, especially in the field of education, are currently under development.

But HPC is not just a necessity for a few companies or institutions. For instance, many companies in various areas of engineering are faced today with the task of constantly designing new complex systems and bringing them into the production line as soon as possible. Time to market has become an important parameter for high tech companies trying to grow in the global market. Therefore, many of those companies use system simulation as a key element of rapid prototyping to reduce development cycles. On the other hand, the complexity of many products in the fields of telecommunication and information technology makes the availability of large computing resources indispensable. Hence, access to high performance computing for a broad range of users may be a key factor to further stimulate innovation [10].

* Supported by the NRW Metacomputing grant

In recent years the computer industry has constantly increased the computing power of its products.
While a top-of-the-line PC was equipped with a 33 MHz processor and 4 MB DRAM in 1991 [7], a similar computer in 1998 included a 450 MHz processor and 128 MB memory [8]. Note that this does not even consider the additional technical advances in the architecture of the processor. But the demand is growing at an even faster pace. For instance, the number of computers in an average company has increased significantly from 1991 to 1998. The same development can also be observed for HPC. We claim that no matter how much technology advances, there will always be a non-negligible number of applications which require the fastest computing equipment available.

Unfortunately, the mentioned short development cycles in information technology equipment result in severe drawbacks for an HPC provider. By definition, HPC uses technology at the leading edge. Therefore, HPC components are expensive. In addition, today's HPC equipment will certainly not fall into this category five years from now unless it is frequently upgraded during this time span [19]. This results in high costs and a significant maintenance effort. In order to balance those costs, a high degree of utilization is a must for HPC resources. For instance, few commercial owners of supercomputers can afford to see their machines idle during the night or the weekend, as is commonplace with most PCs.

Small companies may therefore face a dilemma. While access to HPC resources is necessary for the development of new products, they do not have enough applications to use such equipment efficiently. It certainly does not pay off to run secretarial programs like word processing on a supercomputer.
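The 1991-to-1998 PC figures quoted above translate into steep annual growth rates; a quick check, using nothing but the numbers in the text:

```python
# Annual growth implied by the figures above: 33 MHz / 4 MB DRAM in 1991
# versus 450 MHz / 128 MB in 1998, i.e. a span of 7 years.

years = 1998 - 1991

clock_factor = 450 / 33        # ~13.6x faster clock
mem_factor = 128 / 4           # 32x more memory

clock_cagr = clock_factor ** (1 / years) - 1   # compound annual growth, clock
mem_cagr = mem_factor ** (1 / years) - 1       # compound annual growth, memory

print(f"clock: {clock_cagr:.0%}/year, memory: {mem_cagr:.0%}/year")
# prints "clock: 45%/year, memory: 64%/year"
```

Sustained growth of roughly 45 to 64 percent per year underlines the paper's point: equipment bought at the leading edge falls out of the HPC category within a few years unless it is upgraded.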
Those users would need a company or an institution where they can easily get access to HPC resources when needed. In the academic environment, the computer centers assume such a role for the various research labs within a university. As it cannot be expected that all potential users live in close proximity to each other and to HPC resources, powerful networks are an essential component of a suitable HPC infrastructure.

Further, there are significant architectural differences between today's high performance computers [12], like parallel vector computers (e.g. Fujitsu VPP700 [20]), large symmetric multiprocessors (e.g. SGI Power Challenge [16]), tightly coupled parallel computers with distributed memory (e.g. IBM RS/6000 SP [3]) and large clusters of workstations (e.g. Beowulf [4,17]). Also, machines from different manufacturers typically require or support different software. On the other hand, some HPC applications need a specific architecture as

- they are not portable for historic reasons,
- they are optimized for this machine or
- they can make best use of the available architectural properties.

It is therefore unlikely that a single supercomputer will be sufficient for all potential users in a region. But if no other HPC resources are locally available, some users face the choice of running their application on the local equipment or asking for an account at another location. The first approach results in decreased efficiency, while the second approach is typically quite cumbersome.
2 Metacomputing Infrastructure

Such an HPC infrastructure may be based upon a single HPC center which provides all required resources, that is, all HPC equipment is concentrated and maintained at one location. Access to resources is rented to users for their applications. Therefore, HPC users only pay for their actual usage while they are not forced to care for support and maintenance of the system. Unfortunately, this approach also has a few drawbacks:

- All HPC use needs network bandwidth. Therefore, large investments in a dedicated network structure are necessary.
- The center may either be a potential single point of failure, or special care must be taken to prevent situations like the disruption of the whole infrastructure by a single power failure.
- The center is completely decoupled from the applications. This may be a disadvantage for some users, e.g. those designing new applications.

In addition, a single HPC center requires central planning and may show little flexibility. Alternatively, the concept of a distributed heterogeneous supercomputer can be used. Such an infrastructure is also called a metacomputer. It consists of geographically distributed HPC installations which are linked by an efficient network. The location of an HPC component will depend on the demand of local users. A suitable distribution of HPC resources allows a significant reduction of the network load in comparison to the central approach. Further, HPC resources from different providers may be included into the infrastructure and can compete for customers. This absence of a single institution controlling all HPC resources may be a significant advantage, especially for commercial users. In addition, the failure of any single component will not lead to a breakdown of the whole metacomputer. While metacomputing offers a variety of promising prospects, it is not clear whether this concept is actually feasible.
To this end several questions must be addressed:

- What are the technological requirements for metacomputing?
- Will this concept find acceptance in the user community, including potentially new users from industry?
- Which problems will arise in the management of a metacomputer?
- What will be the performance of a metacomputer in comparison to a large installation of a supercomputer?
- What are the costs for building and maintaining a metacomputer?

2.1 Metacomputing Scenarios

Before finding a method to answer those questions it is necessary to precisely define the use of a metacomputer. In general there are three scenarios with different degrees of user involvement and with different system requirements.

Single Site Application

In this scenario each job is executed on a single HPC component in the metacomputer. If any component does not have enough resources, e.g. processors, to execute a job completely, it will also not run parts of that job. Of course, a job may be assigned in parallel to several HPC components for reasons of performance, that is, to increase the probability that the job will be completed at a given deadline. But in this case all copies of the job are independent from each other. For single site applications the maximum job size for the metacomputer is determined by the size of the largest component. The user need not modify any of his applications. It is only necessary to specify the execution requirements of his job, e.g. the amount of memory, the minimal number of processors, or the necessary software.
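Such a requirement specification, and the selection it enables, can be sketched as follows. This is only an illustration of the idea; all names, fields, and the "best fit" rule are invented for the example and are not taken from the initiative's software:

```python
# Hypothetical sketch: a user states only the execution requirements of a
# job; a selection routine picks a component that satisfies them.
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    arch: str
    free_cpus: int
    memory_gb: int
    software: set

@dataclass
class JobRequest:
    min_cpus: int
    memory_gb: int
    software: set          # required packages, e.g. {"pvm"}

def pick_component(job, components):
    """Return a suitable component for the job, or None if none fits."""
    candidates = [c for c in components
                  if c.free_cpus >= job.min_cpus
                  and c.memory_gb >= job.memory_gb
                  and job.software <= c.software]
    # One possible notion of "best suited": leave the fewest CPUs idle.
    return min(candidates,
               key=lambda c: c.free_cpus - job.min_cpus,
               default=None)
```

For example, a job requesting 16 processors, 8 GB of memory and PVM would be routed to a 32-processor machine offering PVM rather than to a larger machine without it; if no component fits, the request has to be handled elsewhere.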
Taking the requirements and possibly additional restrictions into account, the metacomputer picks its best suited component for the execution of the job (location transparency). Even if all HPC resources in the metacomputer are working at full load, the metacomputer can increase overall efficiency by running jobs on the HPC component best suited for them.

Homogeneous Multi Site Applications

In addition to single site applications, some jobs may also be executed in parallel on different HPC components of the same type, e.g. several IBM RS/6000 computers are combined to jointly run a large job. In a large metacomputer this scenario significantly expands the number of HPC resources which are potentially available to a single job by forming a virtual supercomputer. As the cost for most types of supercomputers grows superlinearly with the size, this approach may be an interesting option for all cases where such big jobs must only be executed once in a while. However, multi site applications require the concurrent availability of several HPC components. This includes the network that links the compute components. Therefore, management of such a system becomes more difficult. In addition, the user will not receive the same communication performance as on a single large supercomputer. Hence, she must design her applications accordingly. Further, some problems with large random communication patterns may not run as a multi site application or may take a huge performance hit.
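The superlinear cost argument above can be made concrete with a toy calculation (the cost exponent 1.5 is purely illustrative, not a measured value):

```python
# If supercomputer price grows superlinearly with machine size, several
# midsize machines joined into a virtual supercomputer are cheaper than a
# single large machine of the same total size.
def price(processors, exponent=1.5):   # exponent > 1: superlinear growth
    return processors ** exponent      # arbitrary cost units

one_big = price(1024)          # 32768.0
four_midsize = 4 * price(256)  # 4 * 4096.0 = 16384.0
print(four_midsize < one_big)  # True
```

Of course, as noted, the combined machines deliver lower communication performance, so the saving is only realized for jobs that tolerate it.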
Nevertheless, there are numerous applications that require limited communication overhead and can therefore benefit from a multi site execution. This is especially true when applications are developed with a metacomputing system in mind.

Heterogeneous Multi Site Applications

This scenario further expands the homogeneous multi site concept by allowing that potentially all HPC resources of a metacomputer are used for the execution of a single job. However, it is not necessary that all those components are actually running the same executable. It is also possible that a job is automatically piped from one set of HPC components to the next. Nevertheless, this will result in a substantial coordination effort. Further, the workflow of the job must be carefully planned, taking into consideration various resource constraints like network bandwidth or the size of different machines in the metacomputer. This requires a new programming paradigm and significant additional user effort. On the other hand, the prospect of a gigantic virtual supercomputer may be well worth the work.

2.2 Requirements for a Metacomputing Pilot Project

The best approach to answer the previously posed questions is the establishment of a pilot project including some applications. This will help to determine the actual technical problems and suitable solutions for them. Early user participation will provide helpful feedback for the system designers. Such a close cooperation between developers and users of a metacomputer is an essential element of a pilot project. Unfortunately, any new HPC installation leads to high initial costs.
It is therefore highly advisable to select locations for this pilot study where most if not all of the required HPC equipment is already in place. Taking all these requirements into consideration, the German state of Northrhine-Westphalia (NRW) offers an excellent basis for the realization of the project. It hosts the largest concentration of universities and other research institutions in Germany. Most of these institutions already own HPC equipment in their computing centers, which operate independently. This equipment includes almost all common HPC platforms, e.g. Sun Enterprise 10000, IBM RS/6000 SP, Cray T3E, SGI Origin and others. The inclusion of many different platforms guarantees the desired degree of flexibility. It also makes the new metacomputer attractive for a wide range of users, as almost everyone can find her favorite HPC hardware in the system. In addition, the state has a powerful network infrastructure which links these institutions. It is presently based on an ATM backbone which allows Quality-of-Service features and virtual channels (PVC/SVC). All this together constitutes a suitable system infrastructure for metacomputing. Note further that the large number of research institutions from many areas also guarantees a diversity of research projects requiring HPC. Finally, several technology centers with small high-tech companies are also located in NRW, resulting in a large pool of potential users. Therefore, all requirements for a metacomputing pilot project are met in Northrhine-Westphalia.
On the other hand, the setup of a metacomputer may provide significant benefits for the economy and research projects in Northrhine-Westphalia.

3 The NRW Metacomputing Initiative

Based on these thoughts, the NRW Metacomputing Initiative was proposed by A. Bachem, B. Monien and F. Ramme in 1996 [1]. It started in July 1996 and is planned to conclude in June 1999. The project is coordinated by B. Monien of the University of Paderborn. It is jointly funded by the state of Northrhine-Westphalia and the participating research institutions, which are named below:

- Paderborn Center for Parallel Computing (PC2, University of Paderborn)
- University of Cologne
- University of Dortmund
- Technical University (RWTH) Aachen
- Central Institute for Applied Mathematics (ZAM), Forschungszentrum Jülich
- GMD National Research Center for Information Technology, Bonn

Besides generating a working metacomputer, the initiative has the goal to find answers to the following specific questions:

- What are the system requirements for HPC components in a metacomputer?
- Does the metacomputer generate a need for a new type of HPC component or for significant modifications of the existing ones?
- Which applications can benefit most from a metacomputer?

The initiative consists of several system and application projects that work on different aspects of metacomputing, see Fig. 1.

3.1 Application Projects

The inclusion of application projects from the beginning had the goal of supporting a constant communication process between users and system designers. These applications can further be used for test and evaluation of the metacomputer pilot.
This includes functionality and performance aspects. The first user projects may also give an indication about the characteristic properties of future multi site applications regarding

- communication patterns,
- network requirements, and
- software adaptations.

The subjects of the application projects are Molecular Dynamics Simulation, Traffic Simulation, and Weather Forecast. However, in this description we will primarily focus on the system design of the metacomputer and therefore not go into the details of those projects.

Fig. 1. Projects of the Initiative

3.2 System Projects

As already mentioned, the metacomputer uses existing HPC installations. This includes both hardware and system software (operating systems, local management software). In order to combine those resources into a working metacomputer, the following problems must be addressed:

- Coordinated management
- Interfaces
- Security

These problems became the subject of several projects in the initiative. In the next sections all those system projects are briefly described, while the project Schedule is discussed in more detail.

4 Data Distribution and Authentication with DCE/DFS

As metacomputing in this initiative is done over the public Internet, insecure channels are used for communication. Also, computers of different political administration domains are part of the metacomputer. This requires authentication of remote users. Finally, hardware and software of the HPC components must be protected from unauthorized access. Hence, there is a need for secure authentication and secure communication. On the other hand, it is important to limit the resulting overhead for users and administrators to achieve a high degree of acceptance and participation.
In the initiative, it was decided to use the standardized Distributed Computing Environment (DCE) as an existing and proven software solution. DCE allows secure authentication and communication as well as cross authentication between cells, that is, separate administrative domains. Therefore, user login or job startup is possible without the need to supply a password for every machine. Furthermore, the Distributed File System (DFS) is used to generate a shared file system that provides a dedicated home or project directory on every platform. As DFS uses DCE features for encryption and authentication, system and user files are secured. DCE/DFS has the further advantage of being available for most common platforms. This system project has the goal to set up DCE/DFS cells for various NRW institutions and to provide cross authentication between them for metacomputing users. Further, mechanisms are developed to enter the authentication cells from outside the DCE/DFS framework. This allows job submission from machines that are not using DCE. The project further includes performance measurements for DFS and the available network infrastructure. The results show a significant speedup in comparison to NFS. Nevertheless, data prefetching is still beneficial for data intensive applications.

5 Metacomputing User Interface

This project deals with the development of a user interface to the metacomputer [21]. To achieve transparency and a high degree of usability, the interface should be unique and available for all platforms.
Thus, the interface is written in Java and is able to run over the net on all common Java Virtual Machines, e.g. in web browsers. Therefore, new versions of the interface are instantly available to all users who download it from the web on startup as a Java applet. The interface allows the setting of mandatory and voluntary parameters for a job. It provides status information about jobs and available machines. To maintain security for passwords and jobs, the communication is encrypted via third-party software (Cryptix). Signed applets ensure that only the authorized applet from the original site is used. The Java user interface connects to the HPCM management of the NRW metacomputer. It transmits job requests and authentication information. If the user is not working from a DFS enabled host, the applet can upload application data into the DFS cell.

6 Management Architecture HPCM

The HPCM project provides the infrastructure for the metacomputing management. It consists of a management daemon and several coupling modules which communicate platform specific information to the HPCM layer. The management daemon executes on the HPCM server machine and receives requests from the Java user interface.

Fig. 2. Screenshot of the Java User Interface for HPCM
It is the administrative instance that generates the global view of the metacomputer with its components and the participating users. Note that similar multi-tier architectures can be found in other metacomputing projects, e.g. in Globus [6]. The coupling module is an interface to various computing platforms. Besides abstracting the available information and access methods from the management, it interacts with the available local management of the HPC component. The current implementations of coupling modules in the initiative range from NQE [5] and CCS [9] to LoadLeveler. Additional modules, e.g. for LSF [22], can easily be derived from the existing implementations.

7 Metacomputing Scheduling

Typically, owners of HPC installations are only willing to include their resources into a metacomputer if the performance of their components will not degrade in the new environment. This is especially true for commercial owners. Similarly, users expect a better performance for their jobs. Note that the expression performance has not been defined, as different people may attach a different meaning to it. In most cases, however, an owner wants a high system load for her machine while a user is interested in a short response time for his job or at least a fair resource allocation. Therefore, job scheduling and resource allocation are among the core problems in the metacomputing architecture. As existing system software is used, the metacomputer scheduler must interact with the local schedulers on all HPC components. To avoid the bottleneck of a centralized scheduler and to increase flexibility, a distributed approach is employed. Therefore, the paradigms for job schedulers of parallel computers and for metacomputer schedulers differ significantly, see Table 1.

Table 1. Different Scheduling Paradigms

Parallel Computer Scheduling              Metacomputer Scheduling
The network is ignored.                   The network is a resource.
Load information is instantly available.  Load information must be collected.
Homogeneous system environment            Heterogeneous system environment
Mostly first-come-first-serve scheduling  Resource reservation for future time frames
Central scheduler                         Distributed scheduler

To implement a distributed metacomputing scheduler we use an architecture which is based upon so-called MetaDomains. All MetaDomains of a metacomputer form a redundant network. Typically, a MetaDomain is associated with local HPC resources. That is, all HPC resources at one site are connected to a single MetaDomain. For local HPC access any user can choose to either submit her job directly to the local component or to use the local MetaDomain. Therefore, the metacomputer does not require exclusive access to an HPC resource. The logical structure of such a scheduler is described in Fig. 3. This network can be dynamically extended or altered. Such a property is advantageous as, for instance, individual HPC components may be temporarily unavailable due to maintenance, or new HPC resources may be introduced into the metacomputer. The presented architecture guarantees a high degree of flexibility. MetaDomains communicate among one another by transmitting or requesting information about resources and jobs. To this end a MetaDomain inquires local schedulers about system load and job status. A MetaDomain can also allocate local HPC resources to requests. The distributed scheduling itself is based upon a brokerage and trading concept which is executed between the MetaDomains. In detail, a MetaDomain tries to

- satisfy local demand if possible,
- ask other MetaDomains for resources if the local demand cannot be satisfied,
- offer local HPC resources to other MetaDomains for suitable remote jobs, and
- act as an intermediary for remote requests.

Fig. 3. Logical Infrastructure

Note that we did not address the actual job submission. This process is not necessarily a task of the scheduler. Once a suitable allocation of HPC resources (including network resources) to a job has been found, the actual submission is independent of the scheduler. Also, the scheduling objectives are not specified. As already mentioned, there may not be a single scheduling objective in a metacomputer. Each HPC component can define its specific objectives. Similarly, each user may associate specific constraints with his job, like a deadline or a cost limit. It is the task of the trading system to find matches between requests and offers. This way not all users and all components are forced to fit into a single framework, as is usually done in conventional scheduling. Instead, it is their responsibility to define their own objectives. The implementation of the metacomputing scheduler need only provide the framework for such a definition, and it must be able to compare any request with any offer to find a match. In our metacomputer scheduling concept only the local HPC scheduler is responsible for the load distribution on the corresponding HPC resource. Therefore, it can also accept jobs from sources other than the metacomputer. The metacomputer scheduler only addresses the load imbalance between different HPC resources. To execute multi site applications, however, the concurrent availability of different HPC resources and sufficient network bandwidth between them becomes necessary, as already described in Sec. 2.1.
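The brokerage and trading concept, in which each side brings its own objectives and the system merely looks for matches, might be sketched as follows. This is a hypothetical illustration; the predicates and field names are invented for the example and are not the project's actual interfaces:

```python
# Each request and offer carries its own acceptance predicate; the trading
# system only checks whether both sides are satisfied.
def matches(request, offer):
    return request["accepts"](offer) and offer["accepts"](request)

# A user constrains cost and start time (cost limit, deadline) ...
request = {
    "cpus": 64,
    "accepts": lambda o: o["price_per_hour"] <= 50 and o["start_hour"] <= 8,
}
# ... while an owner, aiming at a high system load, only accepts big jobs.
offer = {
    "price_per_hour": 40,
    "start_hour": 6,
    "accepts": lambda r: r["cpus"] >= 32,
}

print(matches(request, offer))  # True
```

A MetaDomain that cannot satisfy a request locally would then forward it and test it against the offers of other MetaDomains.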
For reasons of efficiency, such multi site execution requires resource reservation for future time frames and the concept of guaranteed availability. Although most HPC schedulers do not presently support such an approach, it can be implemented by using preemption (a checkpoint-restart facility) while still maintaining a high system load. In the project SCHEDULE [18] of the initiative a metacomputer scheduler was designed using CORBA [15] to allow transparent and language independent access to distributed management instances. For the evaluation of different scheduling methods a simulation framework has further been implemented. It is used to compare different scheduling algorithms regarding their applicability for a metacomputing network. The benefit of possible technology enhancements, like for example preemption, to the quality of the schedule is also determined with the help of the simulator. As already mentioned, communication between resources during a multi site job execution must be taken into account as well. To this end the available network must be considered a limited resource that is managed by the schedulers in the MetaDomains. The inclusion of this objective into the scheduler is part of future work.

8 Status

The NRW Metacomputing Initiative has developed a functioning management system that has been deployed in a pilot installation to connect the parallel computers of the participating institutions, namely a Cray T3E in Jülich, an IBM RS6000/SP in Dortmund, a Sun Enterprise 10000 in Cologne and a Parsytec CC in Paderborn. In this test phase the application projects of the initiative have been used to show the benefits of such a metacomputer.
These projects represent typical examples of problems that are well suited for metacomputing. They can easily be ported to different architectures and have a small network communication footprint. The developed HPCM management software is going into production use in 1999, providing public access to all users. The Scheduling project provides a working interface to the mentioned system types. Currently only single site applications are supported, as the present implementation does not include reservation. However, simulations have been executed to evaluate whether the backfilling strategy [11] can be used to consider reservations. These simulations have yielded promising results. As many commercial local schedulers are already using backfilling, only small changes to these schedulers are required.

9 Conclusion

It is the primary goal of metacomputing to provide users with easy access to more HPC resources. This includes a platform independent, simple user interface. The user herself has the ability to exploit the flexibility of the system to her advantage by clearly specifying the resource requirements of her job. In order to benefit from multi site computing she may need to apply new programming paradigms. The owner of HPC components in a metacomputer need only focus on the maintenance of a single platform. He does not have to strive to satisfy all local users with a limited budget, as some specific demands can be forwarded to other HPC installations within the network. Also, there is no need to maintain a separate user interface.
With an independent user interface the integration of new resources will become easier. On the other hand, the owner may face some pressure to increase standardization. Although the approach guarantees a high degree of flexibility, he may also lose some control over the allocation of local resources. The manufacturers of HPC resources may see a decrease in sales of really big machines, while it will become more common to buy midsize systems and integrate them into an infrastructure of existing resources. The overall user community will grow as more users gain access to these resources. In today's systems, manageability and open interfaces to other management systems are not a strong selling argument. This may change if more large systems become part of a heterogeneous metacomputing environment.

References

1. A. Bachem, B. Monien, F. Ramme. Der Forschungsverbund NRW-Metacomputing - verteiltes Höchstleistungsrechnen (1996), http://www.uni-paderborn.de/pc2/nrw-mc/html_rep/html_rep.html
2. High Performance Computing and Communication (1997), NSTC, http://www.hpcc.gov/pubs/blue97
3. IBM RS/6000 SP Product Line, http://www.rs6000.ibm.com/hardware/largescale/
4. D. Becker, T. Sterling, D. Savarese, J. Dorband, U. Ranawake, C. Packer. Beowulf: A parallel workstation for scientific computation (1995), Proceedings, International Conference on Parallel Processing
5. Introducing NQE (1998), Cray Research Publication, Silicon Graphics, Inc.
6. I. Foster, C. Kesselman.
Globus: A metacomputing infrastructure toolkit (1997), The International Journal of Supercomputer Applications and High Performance Computing, 11(2), pp. 115-128
7. Intel Microprocessors, Volume II (1991), Handbook, Intel Corporation
8. Pentium II Xeon[tm] Processor Technology Brief (1998), Intel Corporation
9. A. Keller, A. Reinefeld. CCS Resource Management in Networked HPC Systems (1998), In Proceedings Heterogeneous Computing Workshop (HCW) at IPPS/SPDP'98
10. S. Karin, S. Graham. The High Performance Continuum (Nov 1998), Communications of the ACM, pp. 32-35
11. D. A. Lifka. The ANL/IBM SP Scheduling System, Springer LNCS 949, Proceedings of the Job Scheduling Strategies for Parallel Processing Workshop, IPPS'95, pp. 295-303
12. P. Messina, D. Culler, W. Pfeiffer, W. Martin, J. Oden, G. Smith. Architecture (Nov 1998), Communications of the ACM, pp. 36-44
13. Robert C. Malone, Richard D. Smith, and John K. Dukowicz. Climate, the Ocean, and Parallel Computing (1993), Los Alamos Science, No. 21
14. Grand Challenges, National Challenges, and Multidisciplinary Challenges (1998), NSF Report, http://www.cise.nsf.gov/general/workshops/nsf_gc.html
15. Object Management Group Document. The Common Object Request Broker: Architecture and Specification (1998), Revision 2.2
16. SGI PowerChallenge XL Product Line, http://www.sgi.com/remanufactured/challenge, SGI
17. T. Sterling. Applications and Scaling of Beowulf-class Clusters (1998), Workshop on Personal Computers based Networks Of Workstations, IPPS'98
18. U. Schwiegelshohn, R. Yahyapour. Resource Allocation and Scheduling in Metasystems, Springer LNCS 1593, Proceedings of the Distributed Computing and Metacomputing Workshop, HPCN'99, Amsterdam, pp. 851-860
19. J. Dongarra, H. Meuer, E. Strohmaier. TOP500 Supercomputing Sites (Nov. 1998), http://www.top500.org
20.
Fujitsu VPP700E, http://www.fujitsu.co.jp/index-e.html
21. V. Sander, D. Erwin, V. Huber. High-Performance Computer Management Based on Java (1998), Proceedings of High Performance Computing and Networking Europe (HPCN), Amsterdam, pp. 526-534
22. S. Zhou. LSF: load sharing in large-scale heterogeneous distributed systems (1992), In Proceedings Workshop on Cluster Computing

Design and Evaluation of ParaStation2

Thomas M. Warschko, Joachim M. Blum and Walter F. Tichy

Institut für Programmstrukturen und Datenorganisation, Fakultät für Informatik, Am Fasanengarten 5, Universität Karlsruhe, D-76128 Karlsruhe, Germany

Summary. ParaStation is a communications fabric to connect off-the-shelf workstations into a supercomputer. This paper presents ParaStation2, an adaptation of the ParaStation system (which was built on top of our own hardware) to the Myrinet hardware. The main focus lies on the design and implementation of ParaStation2's flow control protocol to ensure reliable data transmission at network interface level, which is different from most other projects using Myrinet. One-way latency is 14.5 µs to 18 µs (depending on the hardware platform) and throughput is 50 MByte/s to 65 MByte/s, which compares well to other approaches. At application level, we were able to achieve a performance of 5.3 GFLOPS running a matrix multiplication on 8 DEC Alpha machines (21164A, 500 MHz).

1. Introduction

ParaStation2 is a communication subsystem on top of Myricom's Myrinet hardware [BCF+95] to connect off-the-shelf workstations and PCs into a parallel supercomputer. The approach is to combine the benefits of a high-speed MPP network with the excellent price/performance ratio and the standardized programming interfaces (e.g. Unix sockets, PVM, MPI) of conventional workstations.
Well-known programming interfaces ensure portability over a wide range of different systems. The integration of a high-speed MPP network opens up the opportunity to eliminate most of the communication overhead.

ParaStation was originally developed for the ParaStation hardware, a self-routing network with autonomous distributed switching, hardware flow control at link level combined with a back-pressure mechanism, and reliable, deadlock-free transmission of variable-sized packets (up to 512 byte). This base system is now being adapted to the Myrinet hardware, which has a fully programmable network interface and a much better base performance than the classic ParaStation hardware (see section 2.). The major difference is the absence of reliable data transmission, which has to be implemented at network interface level on the Myrinet hardware (see sections 3. and 4.).

ParaStation offers as programming interfaces well-known and standardized protocols (e.g. TCP and UDP Unix sockets) and programming environments (e.g. MPI and PVM) at a reasonable and acceptable performance level (see section 5.), rather than squeezing the most out of the hardware using a communication layer with nonstandard semantics.

2. ParaStation vs. Myrinet

Table 2.1 presents a brief comparison between the ParaStation [WBT96] and the Myrinet [BCF+95] hardware.
Table 2.1. Comparison between ParaStation and Myrinet

                      ParaStation        Myrinet
Technology            PCI-Bus adapter    PCI-Bus adapter & switches
Topology              2D torus           hierarchical crossbar
Bandwidth             128 Mbit/s         1.28 Gbit/s
Flow control          link level         link level
Flow control policy   back-blocking      back-blocking & discard
Error detection       parity             CRC
Error management      fatal              implementation dependent
Interface             FIFO               SRAM
Processor             none (FPGA)        32-bit RISC (LanAI)

ParaStation, with its two incoming and two outgoing links, naturally uses a 2D torus as its network topology. The necessary switching elements (between the X and Y dimension) are located on each ParaStation adapter, so no central switch is needed. Myrinet instead uses cascadable switching elements (8- or 16-way crossbars) and therefore imposes no restrictions on the network topology. In terms of transmission speed there is also a clear advantage for Myrinet.

Both systems implement flow control at link level, but with different policies. Whereas ParaStation implements a strict back-blocking mechanism between all nodes, Myrinet only blocks for a while and then starts discarding packets. This behaviour helps to keep the network alive even in the presence of faulty components, but it also forces the implementation of a higher-level flow control protocol to guarantee reliable transmission. ParaStation simply blocks if there is not enough buffer space in the next node on the way to the final destination and waits until the receiver starts accepting messages. Reference [War98] proves that this behaviour is deadlock-free and reliable as long as the receiver keeps consuming packets. As a consequence, ParaStation does not need any higher-level flow control mechanism.
Another major difference between ParaStation and Myrinet is the programming interface. ParaStation provides a simple FIFO interface to send and receive messages, along with some flags describing the status of the incoming and outgoing FIFO. Prior to a send operation the sender checks the flags to ensure that the sender FIFO can accept a complete packet¹. If there is enough space, it writes the complete packet into the FIFO, and ParaStation's flow control mechanism ensures that the packet will eventually make its way to the receiver. On the receiving side the status flags indicate whether a complete packet has arrived in the receiver FIFO. Thus the receiver is able to receive the whole packet at once rather than polling for individual flits. Writing to and reading from the transmission FIFO is done by the CPU (PIO²) rather than using the DMA³ engines.

The Myrinet board uses a 32-bit RISC CPU called LanAI, fast SRAM memory (up to 1 MByte) and three programmable DMA engines: two on the network side to send and receive packets, and one as interface to the host. The LanAI is fully programmable (in C/C++), and the necessary development kit (especially a modified gcc compiler) is available from Myricom. The kit opens up the opportunity to implement and test a much broader design space for high-speed transmission protocols than with the ParaStation system. In fact, this capability, in addition to the high performance of the Myrinet hardware, was the main criterion for choosing Myrinet as the hardware platform for ParaStation2.

¹ A packet is up to 512 byte long.
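The flag-guarded FIFO access pattern described above can be sketched as follows. This is a minimal Python model for illustration only; the class and method names are invented, not ParaStation's actual API.

```python
class FifoInterface:
    """Illustrative model of ParaStation's flag-guarded transmission FIFO."""
    MAX_PACKET = 512  # ParaStation packets are at most 512 bytes

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.fifo = bytearray()  # stand-in for the hardware FIFO

    def can_send(self, packet):
        # Status-flag check: the whole packet must fit before writing starts.
        return (len(packet) <= self.MAX_PACKET
                and self.capacity - len(self.fifo) >= len(packet))

    def send(self, packet):
        if not self.can_send(packet):
            return False          # caller retries later
        self.fifo += packet       # write the complete packet at once (PIO)
        return True               # hardware flow control does the rest

    def packet_available(self, size):
        # Receiver-side flag: a complete packet has arrived.
        return len(self.fifo) >= size

    def receive(self, size):
        # Whole-packet receive; no polling for individual flits.
        assert self.packet_available(size)
        pkt, self.fifo = bytes(self.fifo[:size]), self.fifo[size:]
        return pkt
```

The key property mirrored here is that a send either transfers a complete packet or does nothing, so the hardware never sees a partially written packet.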
3. Design considerations for ParaStation2

The major question to answer is how to interface the Myrinet hardware to the rest of the ParaStation software, especially the upper layers with their variety of implemented protocols (Ports, Sockets, Active Messages, MPI, PVM, FastRPC, Java Sockets and RMI). There are three different approaches:

1. Emulating ParaStation on the Myrinet adapter: Simulating ParaStation's transmission FIFO with a small LanAI program running on the Myrinet adapter would not be a problem. But as ParaStation uses programmed I/O to receive incoming packets, this approach would lead to unacceptable performance (see [BRB98b]).
2. Emulating ParaStation at software level: As the ParaStation system already has a small hardware-dependent software layer called HAL (hardware abstraction layer), this approach allows the use of all Myrinet-specific communication features as well as a simple interface to the upper protocol layers of the ParaStation system.
3. Designing a new system: This approach would lead to an ideal system and probably the best performance, but we would have to rewrite or redesign most parts of the ParaStation system.

Because of its simplicity, we chose the second strategy to interface the existing ParaStation software to the Myrinet hardware.

The second question to answer is how to guarantee reliable transmission of packets with the Myrinet hardware. As said before, the original ParaStation hardware offers reliable and deadlock-free packet transmission as long as the receiver keeps accepting packets. Myrinet instead discards packets (after blocking for a certain amount of time), which may happen when the receiver is running out of resources or is unable to receive packets fast enough. Additionally, the Myrinet hardware seems to lose packets under certain circumstances, e.g. in heavy bidirectional traffic with small packets.

² Programmed I/O
³ Direct Memory Access
The upper layers of the ParaStation system rely on reliable data transmission, so a low-level flow control mechanism, either within the Myrinet control program running on the LanAI processor or as part of the HAL interface, is required.

4. Implementation of ParaStation2

The goal of this section is to explain the basics of the ParaStation2 protocol. Most parts of the protocol are implemented in a Myrinet control program (MCP) running on the Myrinet adapter. The protocol guarantees reliable data transmission, so that only minor changes to the HAL have to be made and all upper layers of the ParaStation system can be used without any changes.

4.1 Basic operation

Figure 4.1 shows the basic operation during message transmission of the ParaStation2 protocol. The basic protocol has four independent parts: (a) the interaction between the sending application and the sender network interface (NI), (b) the interaction between the sending and the receiving NI, (c) the interaction between the receiving NI and the receiving host, and (d) the interaction between the receiving application and the host.

First, the sender checks if there is a free send buffer (step 1). This is accomplished by a simple table lookup in host memory, which reflects the status of the buffers of the send ring located in the fast SRAM of the network interface (Myrinet adapter). If buffer space is available, the sender copies the data (step 2) to a free slot of the circular send buffer located in the network interface (NI) using programmed I/O. Afterwards the NI is notified (a descriptor is written) that the used slot in the send ring is ready for transmission, and the buffer in host memory is marked as in transit. A detailed description of the buffer handling is given in section 4.2. In step (3), the NI sends the data to the network using its DMA engines.
When the NI receives a packet (step 4), it stores the packet in a free slot of the receive ring using its receive DMA engine. The flow control protocol ensures that there is at least one free slot in the receive ring to store the incoming packet. Once the packet is received completely, and if there is another free slot in the receive ring, the flow control protocol acknowledges the received packet (step 5). The flow control mechanism is discussed in section 4.3. As soon as the sender receives the ACK (step 6), it releases the slot in the send ring and the host is notified (step 7) to update the status of the send ring.

In the receiving NI the process of reading data from the network is completely decoupled from the transmission of data to the host memory.

Fig. 4.1. Data transmission in ParaStation2

When a complete packet has been received from the network, the NI checks for a free receive buffer in the host memory (step A). If there is no buffer space available, the packet will stay in the NI until a host buffer becomes available. Otherwise the NI copies the data into host memory using DMA and notifies the host about the reception of a new packet by writing a packet descriptor (step B). Concurrently, the application software checks (polls) for new packets (step C) and eventually, after a packet descriptor has been written in step (B), the data is copied to application memory (step D). Obviously, the data transmission phases in the basic protocol (steps 2, 3, 4, and B) can be pipelined between consecutive packets.
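The host side of this send path (the status-table lookup of step 1, the copy into the circular send ring of step 2, and the release on ACK in steps 6/7) can be modelled as a short sketch. All names are invented for illustration; this is not the ParaStation2 source.

```python
IDLE, INTRANSIT = "IDLE", "INTRANSIT"

class SendRing:
    """Host-side model of the circular send ring in the NI's SRAM."""
    def __init__(self, slots=8):
        self.status = [IDLE] * slots   # host-memory table mirroring slot states
        self.slots = [None] * slots    # stand-in for the NI SRAM slots
        self.next_slot = 0

    def try_send(self, data):
        s = self.next_slot
        if self.status[s] != IDLE:     # step 1: table lookup in host memory
            return None                # no free send buffer: caller retries
        self.slots[s] = data           # step 2: PIO copy into the NI slot
        self.status[s] = INTRANSIT     # descriptor written, marked in transit
        self.next_slot = (s + 1) % len(self.status)
        return s

    def on_ack(self, slot):
        # Steps 6/7: ACK received, slot released, host table updated.
        self.slots[slot] = None
        self.status[slot] = IDLE
```

Because the status table lives in host memory, the free-buffer check never crosses the I/O bus, which is one reason the send path stays cheap.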
The ring buffers in the NI (sender and receiver) are used to decouple the NI from the host processor. At the sender, the host is able to copy packets to the NI as long as buffer space is available, although the NI itself might be waiting for acknowledgements. The NI uses a transmission window to allow a certain number of outstanding acknowledgements, which need not equal the size of the send ring. At the receiver, the NI receive ring is used to temporarily store packets if the host is not able to process the incoming packets fast enough.

4.2 Buffer handling

Each buffer or slot in one of the send or receive rings can be in one of the following states:

IDLE: The buffer is empty and can be used to store a packet (up to 4096 byte).
INTRANSIT: The buffer is currently involved in a send or receive operation, which has been started but is still active.
READY: The buffer is ready for further operation, either a send to the receiver NI (if it is a send buffer) or a transfer to host memory (if it is a receive buffer).
RETRANSMIT: The buffer is marked for retransmission because of a negative acknowledgement or a timeout (send buffer only).

Figure 4.2 shows the state transition diagrams for both send and receive buffers in the network interface.

Fig. 4.2. Buffer handling in sender and receiver

At the sender, the NI waits until a send buffer becomes READY, which is accomplished by the host after it has copied the data and the packet descriptor to the NI (step 2 in figure 4.1). After the buffer becomes READY, the NI starts a send operation (network DMA) and marks the buffer INTRANSIT. When an acknowledgement (ACK) for this buffer arrives (step 6 in figure 4.1), the buffer is released (step 7) and marked IDLE. If a negative acknowledgement (NACK) arrives, or the ACK does not arrive in time (or gets lost), the buffer is marked for retransmission (RETRANSMIT). The next time the NI tries to send a packet it sees the RETRANSMIT buffer and resends it, changing the state to INTRANSIT again. This RETRANSMIT-INTRANSIT cycle may happen several times until an ACK arrives and the buffer is marked IDLE.

At the receiver the buffer handling is quite similar (see figure 4.2). As soon as the NI sees an incoming packet, it starts a receive DMA operation and the state of the associated buffer changes from IDLE to INTRANSIT (see step 4 in figure 4.1). Assuming that the received packet contains user data, is not corrupted, and has a valid sequence number⁴, the NI checks for another free buffer in the receive ring. If there is another free buffer, it sends an ACK back to the sender and the buffer is marked READY. Otherwise a NACK is sent, the packet is discarded and the buffer released immediately (marked IDLE). The check for a second free buffer in the receive ring ensures that there is always at least one free buffer to receive incoming packets, because any packet eating up the last buffer will be discarded. When the received packet contains protocol data (ACK or NACK), the NI processes the packet and releases the buffer. In case of a CRC error the buffer is marked IDLE immediately without further processing. If the received data packet does not have a valid sequence number, the packet is discarded and the sender is notified by sending a NACK back.

⁴ For a discussion of the ACK/NACK protocol see section 4.3.
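The send-buffer transitions of figure 4.2 amount to a small state machine. The following sketch encodes them as a transition table; the event names are invented for illustration, and this is not the MCP source code.

```python
# Send-buffer state machine after Fig. 4.2 (illustrative sketch).
IDLE, READY, INTRANSIT, RETRANSMIT = "IDLE", "READY", "INTRANSIT", "RETRANSMIT"

TRANSITIONS = {
    (IDLE, "host_copied"): READY,        # host wrote data + descriptor to NI
    (READY, "send"): INTRANSIT,          # NI started the network DMA
    (INTRANSIT, "ack"): IDLE,            # acknowledged: release the slot
    (INTRANSIT, "nack"): RETRANSMIT,     # negative acknowledgement
    (INTRANSIT, "timeout"): RETRANSMIT,  # ACK lost or late
    (RETRANSMIT, "send"): INTRANSIT,     # resend; may cycle several times
}

def step(state, event):
    """Advance a send buffer; events not listed leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Walking a buffer through host_copied, send, nack, send, ack brings it back to IDLE, which is exactly the RETRANSMIT-INTRANSIT cycle described above.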
Thus the receiver refuses to accept data out of sequence and waits until the sender resends the missing packet.

4.3 Flow control protocol

ParaStation2 uses a flow control protocol with a fixed-size transmission window and 8-bit sequence numbers (related to individual sender/receiver pairs), where each packet has to be acknowledged (either with a positive or a negative acknowledgement), in combination with a timeout and retransmission mechanism in case an acknowledgement gets lost or does not arrive within a certain amount of time. The protocol assumes the hardware to be unreliable and is able to deal with any number of corrupted or lost packets (containing either user data or protocol information). Table 4.1 gives an overview of the possible cases within the protocol, an explanation of each case, and the resulting action initiated.

Table 4.1. Packet processing within the receiver

packet type  sequence check  explanation        resulting action
DATA         <               lost ACK           resend ACK
DATA         =               ok                 check buffer space (see fig. 4.2)
DATA         >               lost data          ignore & send NACK
ACK          <               duplicate ACK      ignore packet
ACK          =               ok                 release buffer
ACK          >               previous ACK lost  ignore packet
NACK         none                               mark buffer for retransmission
CRC          none            error detected     ignore packet

When a data packet is received, the NI compares the sequence number of the packet with the assumed sequence number for the sending node. If the numbers are equal, the received packet is the one that is expected and the NI continues with its regular operation. A received sequence number smaller than expected indicates a duplicated data packet caused by a lost or late ACK. Thus the correct action to take is to resend the ACK, because the sender expects one.
If the received sequence number is larger than expected, the packet with the correct sequence number has been corrupted (CRC) or lost. As the protocol does not have a selective retransmission mechanism, the packet is simply discarded and the sender is notified with a negative acknowledgement (NACK). This packet will then be retransmitted later, either because the sender got the NACK or because of a timeout. As the missing packet also causes a timeout at the sending side, the packets will eventually arrive in the correct order.

On reception of an ACK packet, the NI also checks the sequence number, and if it is ok it continues processing and releases the acknowledged buffer. If the received sequence number is smaller than assumed, we have received a duplicated ACK: the sender ran into a transmission timeout before the correct ACK arrived, and the receiver has resent an ACK upon the arrival of an already acknowledged data packet⁵. The response in this case is simply to ignore the ACK. A received sequence number larger than expected indicates that the correct ACK has been corrupted or lost. The action taken is to ignore the ACK, but the associated buffer is marked for retransmission to force the receiver to resend the ACK. The buffer associated with the assumed (and missing) ACK will time out and be resent, which also forces the receiver to resend the ACK.

A received NACK packet does not need sequence checking; the associated buffer is marked for retransmission as long as it is in the INTRANSIT state. Otherwise the NACK is ignored (the buffer is in RETRANSMIT state anyway).
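Table 4.1 together with the discussion above can be condensed into a small dispatch function. With 8-bit sequence numbers, "smaller" and "larger" must be decided modulo 256; the sketch assumes the usual convention of treating differences below 128 as "larger". This is an illustration, not the actual MCP code.

```python
def seq_cmp(received, expected):
    """Compare 8-bit sequence numbers with wraparound (sketch).
    Returns '<', '=' or '>' relative to the expected number."""
    diff = (received - expected) & 0xFF
    if diff == 0:
        return "="
    return ">" if diff < 128 else "<"

def process(ptype, received=None, expected=None):
    """Receiver action dispatch following Table 4.1 (illustrative)."""
    if ptype == "CRC_ERROR":
        return "ignore packet"            # header may be corrupt: no action
    if ptype == "NACK":
        return "mark buffer for retransmission"
    rel = seq_cmp(received, expected)
    if ptype == "DATA":
        return {"<": "resend ACK",        # duplicate data, our ACK was lost
                "=": "check buffer space",
                ">": "ignore & send NACK"}[rel]
    if ptype == "ACK":
        return {"<": "ignore packet",     # duplicate ACK
                "=": "release buffer",
                ">": "ignore packet"}[rel]
```

For example, sequence number 0 arriving while 255 is expected counts as "larger", so an 8-bit counter can wrap indefinitely without confusing the receiver.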
In case of a CRC error the packet is dropped immediately and no further action is initiated, because the protocol is unable to detect errors in the protocol header. The resulting protocol is able to handle any number of corrupted or lost packets containing either user data or protocol information, as long as the NI and the connection between the incorporated nodes are working. The protocol was developed to ensure reliability of data transmission at NI level, not to handle hardware failures in terms of fault tolerance. The protocol itself can be optimized in some cases (e.g. better handling of ACKs with a larger sequence number), but this is left to future implementations. In comparison to existing protocols, this protocol can roughly be classified as a variation of the TCP protocol.

5. Basic performance of the protocol hierarchy

Table 5.1 presents performance figures for all software layers in the ParaStation2 system⁶. The levels presented are the hardware abstraction layer (HAL), which is the lowest layer of the hierarchy, the so-called ports and TCP layers, which are built on top of the HAL, and standardized communication libraries such as MPI and PVM, which are optimized for ParaStation2 and built on top of the ports layer. Latency is calculated as round-trip/2 for a 4 byte ping-pong, and throughput is measured using a pairwise exchange of large messages (up to 32K).

⁵ This case may sound strange, but we have seen this behaviour several times.
⁶ For a detailed discussion of the ParaStation2 protocol hierarchy, see our paper ParaStation User Level Communication in these proceedings.
N/2 denotes the packet size in bytes at which half of the maximum throughput is reached. The performance data is given for three different host systems, namely a 350 MHz Pentium II running Linux (2.0.35), and 500 MHz and 600 MHz Alpha 21164 systems running Digital Unix (4.0D).

Table 5.1. Basic performance parameters of ParaStation2

System                 Measurement           HAL   Ports  TCP   MPI   PVM
Pentium II, 350 MHz    Latency [µs]          14.5  18.7   20.2  25    -
                       Throughput [MByte/s]  56    48     51    43    -
                       N/2 [byte]            256   500    500   400   -
Alpha 21164, 500 MHz   Latency [µs]          17.5  24     24    30    29
                       Throughput [MByte/s]  65    55     57    50    49
                       N/2 [byte]            512   500    500   500   1000
Alpha 21164, 600 MHz   Latency [µs]          18.0  24     25    25    28
                       Throughput [MByte/s]  64    56     59    51    48
                       N/2 [byte]            350   700    700   500   700

The latency at HAL level of 14.5 µs to 18 µs is somewhat higher than for comparable systems such as LFC (11.9 µs) or FM (13.2 µs) [BRB98a]. This is because, first, neither LFC nor FM copies the data it receives to the application and, second, both LFC and FM incorrectly assume Myrinet to be reliable. The maximum throughput of ParaStation2, 56 MByte/s to 65 MByte/s, lies between the throughput of FM (40.5 MByte/s) and LFC (up to 70 MByte/s). If LFC or FM starts copying the received data to the application (as ParaStation2 does), their throughput for large messages decreases to 30 - 35 MByte/s [BRB98a], whereas ParaStation2's throughput stays quite stable, close to its maximum level (about 50 MByte/s).

Switching from a single-programming environment (HAL) to multi-programming environments (the upper layers) results in a slight performance degradation in latency as well as throughput. The reason for the increased latencies is the locking overhead needed to ensure correct interaction between competing applications.
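The N/2 metric can be related to a simple first-order cost model T(n) = t0 + n/B, under which the throughput n/T(n) reaches half of the peak rate B exactly at n = t0 * B. This model is an idealization for intuition only; it does not reproduce the measured N/2 values in Table 5.1, which are smaller than the model predicts.

```python
def throughput(n, t0_us, peak_MBps):
    """Throughput in MByte/s for an n-byte message under the simple
    latency + bandwidth model (1 MByte/s equals 1 byte/us)."""
    t_us = t0_us + n / peak_MBps
    return n / t_us

def n_half(t0_us, peak_MBps):
    """Message size reaching half the peak rate: solving
    n / (t0 + n/B) = B/2 gives n = t0 * B."""
    return t0_us * peak_MBps
```

With the Pentium II HAL figures (t0 = 14.5 µs, B = 56 MByte/s) the model gives 812 bytes, against a measured N/2 of 256 bytes, showing that short messages perform better than a single fixed-overhead term would suggest.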
The decreased throughput is caused by additional buffering and a complex buffer management.

6. Performance at application level

Focusing only on latency and throughput is too narrow for a complete evaluation. It is necessary to show that a low-latency, high-throughput communication subsystem also achieves a reasonable application efficiency. For this reason we installed the widely used and publicly available ScaLAPACK⁷ library [CDD+95], which uses both the BLACS⁸ [DW95] and MPI as communication subsystems, on ParaStation2. The benchmark we use is the parallel matrix multiplication for general dense matrices from the PBLAS library, which is part of ScaLAPACK. Table 6.1 shows the performance in MFLOPs running on our 8-processor DEC Alpha cluster (500 MHz, 21164A).

Table 6.1. Parallel matrix multiplication on ParaStation2; performance in MFlop (efficiency in parentheses)

Problem size (n)  Uniprocessor  1 node        2 nodes        4 nodes        6 nodes        8 nodes
1000              782 (100%)    731 (93.5%)   1276 (81.6%)   2304 (73.6%)   3243 (69.1%)   3871 (61.9%)
2000              785 (100%)    743 (94.6%)   1359 (86.6%)   2546 (81.1%)   3582 (76.1%)   4683 (74.6%)
3000              790 (100%)    755 (95.6%)   1396 (88.4%)   2700 (85.4%)   3908 (82.4%)   4887 (77.3%)
4000              772 (100%)    -             1398 (90.5%)   2694 (87.2%)   4044 (87.3%)   5337 (86.4%)

First, we measured the uniprocessor performance of a highly optimized matrix multiplication (cache-aware assembler code), which acts as a reference to calculate the efficiency of the parallel versions. A uniprocessor performance of 772 to 790 MFLOP on a 500 MHz processor proves that the program is highly optimized (IPC of more than 1.5). Obviously the parallel version executed on a uniprocessor has to be somewhat slower, but the measured efficiency of 93.5% to 95.6% is very high.
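The efficiencies in Table 6.1 follow from dividing the parallel MFLOP rate by the node count times the uniprocessor rate for the same problem size, as this small check illustrates:

```python
def efficiency(parallel_mflop, nodes, uniprocessor_mflop):
    """Parallel efficiency in percent, as reported in Table 6.1."""
    return 100.0 * parallel_mflop / (nodes * uniprocessor_mflop)

# n = 1000: 782 MFLOP uniprocessor, 3871 MFLOP on 8 nodes -> 61.9 %
# n = 4000: 772 MFLOP uniprocessor, 5337 MFLOP on 8 nodes -> 86.4 %
```

The check also makes the trend visible: for a fixed node count, larger problems amortize the communication better and efficiency rises.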
Using more nodes, the absolute performance in MFLOP increases steadily while the efficiency decreases smoothly. The maximum performance achieved was 5.3 GFLOP using 8 nodes, which is quite good compared to the 10.1 GFLOP of the 100-node Berkeley NOW cluster⁹.

⁷ Scalable Linear Algebra Package
⁸ Basic Linear Algebra Communication Subroutines
⁹ see http://now.cs.berkeley.edu

7. Related work

There are several approaches which use Myrinet as a hardware interconnect to build parallel systems: Active Messages and the Berkeley NOW cluster, especially Active Messages-II [CMC97], Illinois Fast Messages (FM) [PLC95], the basic interface for parallelism (BIP) from the University of Lyon [PT97], the link-level flow control protocol (LFC) [BRB98a] from the distributed ASCI supercomputer, PM [TOHI98] from the Real World Computing Partnership in Japan, the virtual memory-mapped communication systems VMMC and VMMC-II [DBLP97] from Princeton University, Hamlyn [BJM+96], the user-level network interface U-Net [vEB+95], and Trapeze [YCGL97].

The major difference between these projects and ParaStation2 is twofold. First, ParaStation2 focuses on a variety of standardized programming interfaces, such as UNIX sockets (TCP and UDP), MPI, PVM, and Java sockets and RMI, with a reasonable performance at each level, rather than a single-purpose, nonstandard, proprietary interface which squeezes the most out of the hardware for a specific application. The second difference is due to the reliability assumptions about the Myrinet hardware (see figure 7.1).

Fig. 7.1. Myrinet and Reliability (from [BRB98b])

Most approaches assume Myrinet to be reliable or pass the unreliability on to the application layer.
Only AM-II, VMMC-2 and ParaStation2 accept the unreliability of Myrinet and provide mechanisms to ensure reliable data transmission. The reason why most projects assume Myrinet to be reliable is mainly the rather low error rate at hardware level. We have observed that the link-level flow control mechanism seems to fail by overwriting or dropping complete packets under certain circumstances. The only way to detect this behaviour is to count packets or to use sequence numbers within packets, because the hardware neither blocks the transmission nor signals any error. Furthermore, the hardware does not distinguish between data and control packets when dropping one of them. Thus, a simple flow control protocol to prevent buffer overflow, which assumes that control packets will be delivered reliably, is not sufficient to ensure reliable transmission. Although AM-II [CMC97] and VMMC-2 [DBLP97] do not explicitly state problems with the Myrinet hardware, they introduced a protocol to ensure reliable communication when they switched from AM to AM-II and from VMMC to VMMC-2, respectively. The same holds for ParaStation2, which also started out using strict back-blocking until serious problems arose.

8. Conclusion and further work

In this paper we have presented the design of ParaStation2, especially the ACK/NACK retransmission protocol that ensures reliable data transmission at network interface level. The advantage of this approach was that we could reuse the ParaStation code with minor changes, getting the complete functionality of the ParaStation system (especially the variety of standardized and well-known interfaces) for free. The evaluation shows that ParaStation2 compares well with other approaches in the cluster community using Myrinet.
ParaStation2 is not the fastest system in terms of pure latency and throughput, but in contrast to most other approaches it offers a reliable interface, which is, in our experience, more important to the user than an ultra high-speed but unreliable interface. At the level of application performance, the 5.3 GFLOPs had not been achieved before with this small a number of nodes. The future plans for ParaStation2 are to optimize the interface between the software and the Myrinet hardware to get even more performance out of the system. Second, ports to other platforms such as Sparc/Solaris and Alpha/Linux are on the way.

References

[BCF+95] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29-36, February 1995.
[BJM+96] G. Buzzard, D. Jacobson, M. MacKey, S. Marovich, and J. Wilkes. An Implementation of the Hamlyn Sender-Managed Interface Architecture. In The 2nd USENIX Symp. on Operating Systems Design and Implementation, pages 245-259, Seattle, WA, October 1996.
[BRB98a] Raoul A. F. Bhoedjang, Tim Rühl, and Henri E. Bal. LFC: A Communication Substrate for Myrinet. Fourth Annual Conference of the Advanced School for Computing and Imaging, June 1998, Lommel, Belgium.
[BRB98b] Raoul A. F. Bhoedjang, Tim Rühl, and Henri E. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11), pp. 52-60, November 1998.
[CDD+95] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. Technical Report UT CS-95-283, LAPACK Working Note #95, University of Tennessee, 1995.
[CMC97] B. Chung, A. Mainwaring, and D. Culler.
Virtual Network Transport Protocols for Myrinet. In Hot Interconnects '97, Stanford, CA, April 1997.
[DBLP97] C. Dubnicki, A. Bilas, K. Li, and J. Philbin. Design and Implementation of Virtual Memory-Mapped Communication on Myrinet. In 11th Int. Parallel Processing Symposium, pages 388-396, Geneva, Switzerland, April 1997.
[DW95] J. Dongarra and R. C. Whaley. A user's guide to the BLACS v1.0. Technical Report UT CS-95-281, LAPACK Working Note #94, University of Tennessee, 1995.
[PLC95] S. Pakin, M. Lauria, and A. Chien. High Performance Messages on Workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing '95, San Diego, CA, December 1995.
[PT97] L. Prylli and B. Tourancheau. Protocol Design for High Performance Networking: A Myrinet Experience. Technical Report 97-22, LIP-ENS Lyon, July 1997.
[TOHI98] H. Tezuka, F. O'Carroll, A. Hori, and Y. Ishikawa. Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication. In 12th Int. Parallel Processing Symposium, pages 308-314, Orlando, FL, March 1998.
[YCGL97] K. Yocum, J. Chase, A. Gallatin, and A. Lebeck. Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging. In The 6th Int. Symp. on High Performance Distributed Computing, Portland, OR, August 1997.
[vEB+95] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proc. of the 15th Symp. on Operating System Principles, pages 303-316, Copper Mountain, CO, December 1995.
[WBT96] Thomas M. Warschko, Joachim M. Blum, and Walter F. Tichy. The ParaStation Project: Using Workstations as Building Blocks for Parallel Computing. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96), pages 375-386, Sunnyvale, CA, August 9-11, 1996.
[War98] Thomas M. Warschko. Effiziente Kommunikation in Parallelrechnerarchitekturen.
Broadcast Communication in ATM Computer Networks and Mathematical Algorithm Development

Michael Weller

Institute for Experimental Mathematics, Ellernstr. 29, 45326 Essen, Germany

Abstract. This article emphasizes the importance of collective communication, and especially broadcasts, in mathematical algorithms. It points out that algorithms in discrete mathematics usually transmit larger amounts of data faster than typical floating point algorithms. It describes the o.tel.o ATM testbed of the GMD National Research Center for Information Technology, St. Augustin, and the Institute of Experimental Mathematics, Essen, and the experiences with distributed computing in this network. It turns out that the current implementations of IP over ATM and libraries for distributed computing are not yet suited for high performance computing. However, ATM itself is able to perform fast broadcasts or multicasts. Hence it might be worthwhile to design a message passing library based on native ATM.

1 Broadcasts in mathematical algorithms

Distributed and parallel computing plays an important role in chemistry, physics, engineering and thus numerical mathematics. However, there are parts of pure and discrete mathematics, like cryptography, computational group theory and representation theory, which can also benefit from using a computer. In general, the algorithms transform the original problem into a huge problem in linear algebra over a finite field [2].
Dense equation systems in 300,000 or more variables [3,4], or the computation and enumeration of hundreds of millions of vectors or even subspaces [5,6], show up easily. However, an entry of a matrix or vector can often be realized by a small number of bits. Operations on the machine words representing small vectors of such entries are either done by integer additions, logical bit operations, or (in the more complex cases) by table lookups.

For the hardware involved in solving such problems this means that there are many trivial (hence fast) arithmetical operations to perform. In return, the problems are big themselves, and parallelization only makes sense if each computing node receives a substantial amount of data to deal with. For the communication in a distributed application this means that it usually has to transfer a huge amount of data fast. Therefore the computing nodes must have large amounts of memory with a high memory bandwidth, to satisfy the speed requirements of a CPU handling many trivial integer operations. If table lookups are involved in the program, a cache is often unable to hide the lack of actual memory bandwidth from the application program.

Numerical applications typically involve slower operations on floating point numbers and no table lookups, which reduces the imbalance between memory, communication and CPU speed. On the other hand, there are no problems with numerical stability in discrete mathematics.

There are many interesting computational problems in discrete mathematics requiring access to parallel supercomputers.
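The word-packed arithmetic described above can be made concrete with a short Python sketch (mine, not from the article). Over GF(2), a whole machine word of matrix entries is added with a single XOR; Python's unbounded ints stand in for fixed-width machine words, and the 4-bit example rows are invented for illustration.

```python
# Sketch (not from the article): entries over a small finite field packed
# into machine words.  Over GF(2), adding a whole word of entries is a
# single XOR; Python ints stand in for fixed-width machine words.

def gf2_add(u, v):
    """Vector addition over GF(2) is bitwise XOR."""
    return u ^ v

def eliminate(rows, mask):
    """One Gaussian elimination step: add the pivot row to every other
    row that has a 1 in the pivot column (selected by `mask`)."""
    p = next(i for i, r in enumerate(rows) if r & mask)
    return [gf2_add(r, rows[p]) if i != p and r & mask else r
            for i, r in enumerate(rows)]

rows = [0b1011, 0b1101, 0b0110]    # three 4-bit rows of a GF(2) matrix
rows = eliminate(rows, 0b1000)     # clear the leading column
print([f"{r:04b}" for r in rows])  # -> ['1011', '0110', '0110']
```

For larger small fields the same word-parallel layout applies, with table lookups replacing the plain XOR; that is exactly where the memory-bandwidth pressure described above comes from.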
Since they are rare, it seems interesting to couple smaller parallel machines and workstation clusters of different institutions over a WAN to perform these computational tasks. Many of the linear algebra algorithms use broadcasts and other group communications, like the algorithm sketched in Fig. 1.

Fig. 1. Data flow of parallel Gaussian elimination over a finite field: One of the processes solves a few columns of the equation system residing in its memory. It broadcasts instructions to the other nodes, which perform the same operations as the sender on the remaining columns. Then control passes to the next node. The processes perform the pre-solving and broadcasting in a round-robin fashion to achieve a better load distribution. The algorithm is described in more detail in [4] and was also used in a world record computation achieved in [3].

Special networks designed for parallel computers, like the CM-5 by Thinking Machines, are able to perform broadcasts and reduction operations in hardware. Currently it seems not possible to perform reductions using standard network hardware. Also, broadcasts appear more often in our problems and are even used during the initialization phase of algorithms which do not use them afterwards.

2 Implementation of Broadcasts

Ideally, the application programmer should not care about the implementation of broadcasts. This ought to be done by a communication library. Typical implementations are shown in Fig. 2. Often a tree-like implementation is used: each node which already received a copy of the data helps distributing it.
While this reduces the broadcast to log2(n) steps of point-to-point communications, more than one node is sending at the same time. This causes collisions in any non-switched network connecting the nodes, and leads to network congestion if, for example, several hosts in the same cluster send data to hosts in a cluster at another location over a WAN connection.

Fig. 2. Different implementations of broadcasts. From left to right: elaborate implementation using a tree-like distribution of data in log2(n) steps, but causing collisions in non-switched networks (some MPI, POE on old IBM SP); simple implementation for non-switched networks (PVM); cyclic communication done by the application.

Just sending the data in a loop to all n - 1 recipients avoids such congestion, but is much slower. It is worth noting that this way is still faster than using the tree broadcast on a shared-media Ethernet. Not only must the tree broadcast degenerate to a sequential scheme there, but the collisions reduce the efficiency even more. We tried several types of broadcast in our experiment shown in Fig. 1. Using sequential broadcast from the beginning on shared-media 10 Mbps Ethernet required 8h 43m, whereas the attempt to use a tree broadcast required 11h 44m.

In that experiment we obtained good results using a cyclic pattern for the sequential broadcast. The master only sends the data to a successor, then resumes its computational work. This has benefits, as the master already had to do some precomputation to find the data to be broadcast. That is, in a sense it is already behind the other nodes and must not be delayed any further.
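The trade-off between these broadcast schemes can be counted out in a small Python sketch (mine, not from the article), under the simplifying assumptions that every point-to-point transfer costs one time step and that in the tree scheme all current holders of the data may send in parallel.

```python
# Sketch (not from the article): step counts and master load for the
# three broadcast schemes of Fig. 2, on n nodes.
import math

def tree_broadcast(n):
    """Binomial tree: the set of nodes holding the data doubles each
    step, so everyone is reached in ceil(log2 n) steps; the master
    sends once per step."""
    steps = math.ceil(math.log2(n))
    return {"steps": steps, "master_sends": steps}

def sequential_broadcast(n):
    """The master sends a separate copy to each of the n - 1 others."""
    return {"steps": n - 1, "master_sends": n - 1}

def cyclic_broadcast(n):
    """The master hands the data to its successor and resumes computing;
    each node forwards once, so the master is loaded least."""
    return {"steps": n - 1, "master_sends": 1}

for scheme in (tree_broadcast, sequential_broadcast, cyclic_broadcast):
    print(scheme.__name__, scheme(8))
```

On shared-media Ethernet the log2(n) advantage of the tree disappears, because the concurrent sends it relies on collide; that matches the 8h 43m versus 11h 44m observation above, and the cyclic variant additionally keeps the (already busy) master's send count at one.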
Its successor then sends the data to the next node, and so on, until all nodes have received the data. As we are using a round-robin method to move the master node in this algorithm anyway, this interacts nicely with the algorithm. This observation was also made with numerical solvers of floating point equation systems.

3 ATM and distributed computing

Fig. 3. The Essen - St. Augustin o.tel.o ATM testbed.

The Institute of Experimental Mathematics (IEM) is connected to an ATM testbed as shown in Fig. 3. The carrier o.tel.o provides a 100 km 155 Mbps connection to the GMD in St. Augustin, and a short 622 Mbps connection through the city of Essen into the building of the main computer centre (HRZ) of the University of Essen. Classical IP over ATM is used to run TCP/IP over this network. It turned out that, to achieve sensible performance, any routers in this network have to be avoided and LAN emulation must not be used. This way we can reach up to 80 Mbps file transfers (each node has a 155 Mbps adapter).

It is now possible to use the standard libraries for distributed computing over TCP/IP for parallel computing in such an environment. However, we only obtained bad performance this way. Only a peak bandwidth of 7.5 Mbps could be measured when using PVM. MPICH using P4 for communication only achieved 0.6 Mbps of maximum transmission bandwidth. There also exists a package PLUS [1] which interconnects MPI implementations of different workstation clusters or parallel computers over TCP/IP. It could not be tested in the experiment of Fig. 3, as it is not yet available for AIX, but its creators already tested it in a 34 Mbps ATM environment.
Scaling the performance there to 155 Mbps, we would achieve only 11.45 Mbps, which is still less than a plain TCP/IP file transfer can achieve.

After the period of low-level ATM tests in the joint project with the lab of our carrier o.tel.o, with support of Siemens and GN Nettest, described below, the dark fibre to the main computer centre of Essen University was built and we did our experiments again, this time running them on several nodes of all three institutions of Fig. 3 at the same time. Doing so, we found no significant technical difference in running our application on nodes of one, two or three institutions. However, as more people and institutions were involved, it became much more difficult to ensure a correct setup of routing, host names and network for the computations.

Also, as some time had passed, firmware upgrades took place on the switch of the institute (V.3.1.0 to V.3.2.1), and the GMD machines were upgraded from AIX 4.2 to AIX 4.3. Other ATM-related fixes of IBM were applied to the AIX 4.2 machines in Essen. This way, we were able to measure a peak bandwidth under MPICH using P4 of 93 Mbps during the initialization of a parallel application actually performing an MPI_Bcast. We used the same program, even the same binary, used for the former tests. Only switch firmware and the operating system on some of the nodes had changed. This program used to need 47 minutes (!) to broadcast 75 MB of data to six to eight nodes. After the upgrades this time was reduced to 30 seconds. Scattering an equation system of 800 MB on 6 nodes still required more than 8000 seconds.
But this data is scattered manually by the program, in mostly small chunks. Hence it might not profit from the advantages of the newer firmware.

On the other hand, using native ATM communication, we were able to transfer up to 123 Mbps (AAL and other overhead is already subtracted) from point to point in this network. The highest performance was achieved using large packets of 40 KB, which was the largest size the AIX operating system was able to handle.

Fig. 4. Single to multipoint ATM communication: Based on a point-to-point connection (solid line) from the lower left to the upper left, ATM allows to add further recipients (dashed lines). Still, data is only transmitted once over the WAN lines between the three switches. The receiving switches to the right and upper left distribute the data locally to all receiving parties.

In addition, there is another interesting feature of ATM. As sketched in Fig. 4, ATM is able to broadcast. Although ATM per se is a strictly point-to-point, connection oriented protocol, it is possible to specify multiple receivers of the same data stream, which splits at the ATM switches as close to the receivers as possible. When connecting two clusters of workstations over an ATM WAN link, this means that data will only be transmitted once through the WAN link and then distributed automatically by the switch in the remote ATM cluster.

In our experiments, we were able to achieve the same peak transmission rate of 123 Mbps (not counting any ATM or SDH overhead) even when distributing data to the GMD and the IEM. Unfortunately we experienced a very high rate of 1-2% data loss. Latency or delay variation could not be measured reliably with the workstations, as they had no exact synchronous time source. In a joint project (Fig.
5) with the lab of our carrier o.tel.o, and with support of Siemens and GN Nettest, we found that there is actually no data loss in the network or the switches.

Fig. 5. Testing a single to multipoint connection (GMD St. Augustin and IEM Essen over a single 155 Mbps physical link): The data was sent through a unidirectional PVC to a loop on the remote switch and from there back to points B & C at once (using a single to multipoint connection). As the bandwidth on the IEM-GMD link was limited, it is guaranteed that the data was duplicated at the switch at the IEM, not at the other side.

It appears that the losses are due either to the failure of the workstations to accept the data fast enough, or to their failure not to overcommit a specified link bandwidth. As the data loss was even higher for very slow links, it appears that the latter might be the reason.

With GN Nettest equipment provided by the carrier and Siemens, we could measure network latencies and delay variations, which we found to be very small and below a millisecond even in the WAN segment (see also Fig. 6). Definitely these should be of no relevance for distributed computing. However, ATM is connection oriented, and the time required to set up connections cannot be ignored. We found that a typical ATM LAN switch can stand at most bursts of about 100 connection setups per second (assuming there is no other traffic). Therefore, for a typical distributed application, it will be too slow to initiate the necessary connections when a data packet is to be sent. One can consider initiating all required connections at the beginning of the program, but there will be many such connections, and the resources of the ATM switches are usually limited to a few thousand connections per interface. Thus, an interesting approach could be to initiate each connection in advance, before the data to be sent is actually available. Of course,

Fig. 6.
Delay variation and throughput measured on one leaf of a single to multipoint connection. The GN Nettest Interwatch does not allow to measure absolute delays in this configuration, as the data was not received on the card which generated the traffic. Hence there are no sensible cell delay values reported. The values for delay variation and throughput are reported correctly. We had no access to a synchronous clock driving the switches, SDH equipment and traffic generators; this might result in increased jitter effects. The values measured on the other leaves did not differ significantly.

this requires that the application can foresee the recipient at such an early point, and that the application programming interface of the message passing library allows to prepare sending messages in such a way.

The tests also included the use of network management systems. These are an indispensable tool to administer any large ATM network. Without them, a connection has to be configured separately on each switch it crosses. Using such a management system, it is possible just to specify endpoints and have the system find a route through the network. However, distributed computing would require switched virtual connections, which are normally not handled by the management system and were not a major part of this project. Still, the management system in question allows a network carrier to assign some backbone bandwidth to a customer, who can then use this bandwidth at his own disposal. He can also book ATM links in advance, which are then made available automatically at a later point (for example at night, when the traffic is low and the ATM connection is cheaper), maybe for a batch computation.

In conclusion, it appears to be worthwhile to perform distributed high performance computing over an ATM WAN network.
However, currently a message passing library utilizing native ATM connections is not available, but appears to be necessary to achieve the required data rates. In addition, such a library must be able to deal with the data loss which cannot be avoided when using ATM, although the network itself appears to be very reliable in this respect, as long as the ATM interfaces of the computing nodes are fast enough and do not exceed the traffic contracts.

4 Acknowledgements

The author kindly acknowledges financial support by the Ministry of Science and Education North Rhine-Westphalia, Essen University and o.tel.o Düsseldorf. His work was also supported by the DFG-NSF exchange program, DFG grant # Mi-89/24-1. He is also grateful to Siemens Essen for providing access to a wide area switch and a management system, and to GN Nettest Munich for technical support. Finally, he would like to thank the GMD National Research Center for Information Technology, St. Augustin, and the Computer Center of the University of Essen for the permission to use their resources, and the help of their maintenance staff in setting up the systems for the wide-area distributed computations.

References

1. Matthias Brune, Jörn Gehring, and Alexander Reinefeld. Communicating across parallel message-passing environments. Preprint submitted to Elsevier, February 1998.
2. P. Fleischmann, G. O. Michler, P. Roelse, J. Rosenboom, R. Staszewski, C. Wagner, and M. Weller. Linear Algebra over Small Finite Fields on Parallel Machines, volume 23 of Vorlesungen aus dem Fachbereich Mathematik. University of Essen, 1995.
3. Peter Roelse.
Factoring high-degree polynomials over F2 with Niederreiter's algorithm on the IBM SP/2. Math. Comp., to appear.
4. M. Weller. Parallel Gaussian elimination over small finite fields. In Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems. ISCA, September 1996.
5. Michael Weller. Construction of large permutation representations for matrix groups. In E. Krause and W. Jäger, editors, High Performance Computing in Science and Engineering '98, pages 430-452. HLRS Stuttgart, Springer-Verlag Berlin, Heidelberg, New York, 1998.
6. Michael Weller. Construction of large permutation representations for matrix groups II. Submitted, 1999.

Highly Available Distributed Storage Systems

Lihao Xu (1) and Jehoshua Bruck (2)

(1) Department of Computer Science, Washington University, Campus Box 1045, Saint Louis, MO 63130, USA. Email: lihao@cs.wustl.edu. This work was done while this author was at the California Institute of Technology.
(2) Department of Electrical Engineering, California Institute of Technology, Mail Stop 136-93, Pasadena, CA 91125, USA. Email: bruck@paradise.caltech.edu.

1. Introduction

Information is generated, processed, transmitted and stored in various forms: text, voice, image, video and multimedia types. Here all these forms will be treated as general data. As the need for data increases exponentially with the passage of time and the increase of computing power, data storage becomes more and more important. From scientific computing to business transactions, data is the most precious part. How to store the data reliably and efficiently is the essential issue, and that is the focus of this chapter.
As with distributed computing, distributed storage is coming of age as a good solution to achieve scalability, fault tolerance and efficiency. Historically, since the speed of storage devices, such as tapes and disks, is much slower than the speed of computing devices, e.g. CPUs, I/O is a bottleneck in computing systems. To improve the data throughput of storage devices, RAID (Redundant Array of Independent Disks) was proposed [14][16] to store data over multiple storage devices in a distributed way, so that the total I/O bandwidth is the sum of the bandwidths of the individual storage devices. That was the start of distributed (networked) storage.

Since then, storage technologies have been advancing rapidly; the capacity of magnetic devices continuously increases and access speed constantly improves. But as with CPUs, there are physical limits to the density of disks, seek time and rotational speed of the disk drives. These limits mean that the capacity and access speed of a single storage device cannot be improved infinitely. The need for storage capacity and access speed can be met by improving storage systems at the architectural level, i.e., using multiple distributed storage devices connected via a fast network, such as Fibre Channel, which reduces data latency incurred over the network to much less than the latency time of a single storage device. A distributed structure not only can increase the capacity and speed of storage systems, but also can bring fault tolerance and scalability.

As with computing, fault tolerance (or reliability) is increasingly important in storage systems.
Some critical data should be available and some services should be provided even when faults occur in storage units. Besides, a storage system in which faulty units can be replaced on-the-fly would have great value for business transactions, such as airport management, banking systems, and internet service provider systems. Naturally, reliability of storage systems can be achieved more easily using a distributed structure. Scalability is another natural feature of distributed systems: addition or replacement of components is much more flexible in a distributed system than in a central system. Thus distributed storage systems can adapt better to dynamic and growing data demands.

In this chapter, the reliability, efficiency and scalability of distributed storage systems are all considered aspects of availability. A highly available storage system has high reliability (or can tolerate more faults), high efficiency (or performance) and scalability. Achieving high availability in distributed storage systems is the main topic of this chapter.

This chapter mainly consists of two parts. The first part discusses the reliability issue. Reliability is usually achieved by introducing data redundancy into a storage system. The second part shows that the efficiency of a data storage system can be improved by properly using the data redundancy in the system. So the approaches of introducing data redundancy are very important to a storage system, for both reliability and efficiency.
This chapter will describe a few MDS array codes, a class of error-control codes that are very suitable for introducing data redundancy in storage systems.

2. MDS Array Codes for Reliability

2.1 Array Codes

Reliability of storage systems is often achieved by storing redundant data in the systems using error-control codes. Usually in storage systems, the failure of a single storage unit can be detected by the storage controllers and then can be masked. Thus erasure-correcting codes are often used, since the device failures can be marked as erasures. Erasure-correcting codes are a mathematical means of representing data so that lost information can be recovered. With an (n, k) erasure-correcting code, we represent k symbols of the original data with n symbols of encoded data (n - k is called the amount of redundancy or parity). With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost [13], and the distance d of this code is defined to be d = m + 1. A code is said to be Maximum Distance Separable (MDS) if m = n - k. An MDS code is optimal in terms of the amount of redundancy versus the erasure recovery capability. The Reed-Solomon code [13] is a well-known example of an MDS code.

The complexity of the computations needed to construct the encoded data (a process called encoding) and to recover the original data (a process called decoding) is an important consideration for practical systems. Array codes are ideal in this respect. Array codes have been studied extensively [2][3][4][8].
A common property of these codes is that the encoding and decoding procedures use only simple binary exclusive-or (XOR) operations, which can be implemented easily in hardware and/or software; thus these codes are much more efficient than Reed-Solomon codes in terms of computation complexity, and are very suitable for use in storage systems, for both reliability and efficiency. In an array code, the information (original) and parity (redundant) bits are placed in a 2-dimensional array of size l x n. In a distributed storage system, the bits in the same column are stored on the same disk. If any bit on a disk is corrupted, then the disk is considered to be a failed disk and needs repair, i.e., the corresponding column of the code is considered to be an erasure.

Current RAID (Redundant Array of Independent Disks) systems can tolerate at most one disk failure at a time, i.e., the code used is only a 1-erasure-correcting code. In more and more applications, fault tolerance of only one single disk is not enough. A system that can tolerate more than one failure at the same time would be more robust and flexible. For example, when one disk fails, the system can still have some non-stop fault-tolerance capability while the bad disk is being replaced by a good one. This level of fault tolerance requires codes with higher erasure-correcting capability. A 2-erasure-correcting code can provide a much longer non-stop functioning time to a distributed storage system than a 1-erasure-correcting code.
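To make the erasure-correcting terminology concrete, here is a minimal Python sketch (mine, not from the chapter) of the single-parity scheme behind current RAID systems: k data columns plus one XOR parity column form a 1-erasure-correcting (k + 1, k) MDS code. Each int models the bits of one disk (column); the example values are invented.

```python
# Sketch (not from the chapter): a 1-erasure-correcting (k + 1, k) code.
# One int per disk (column); the parity column is the XOR of the others.
from functools import reduce

def encode(data_cols):
    """Append one parity column: the XOR of all k data columns."""
    return data_cols + [reduce(lambda a, b: a ^ b, data_cols)]

def recover(cols, erased):
    """Any single erased column is the XOR of the n - 1 survivors."""
    return reduce(lambda a, b: a ^ b,
                  [c for i, c in enumerate(cols) if i != erased])

cols = encode([0b1011, 0b0110, 0b1101])   # three data disks, one parity
assert recover(cols, erased=1) == 0b0110  # a failed data disk is rebuilt
assert recover(cols, erased=3) == 0b0000  # the parity disk likewise
# With two disks erased there is one XOR equation but two unknowns, so a
# 1-erasure-correcting code cannot help; hence the interest in 2-erasure
# MDS array codes.
```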
Considering all the above factors, i.e., computation complexity, optimal redundancy and a high level of fault tolerance, we will focus on three classes of 2-erasure-correcting MDS array codes: the EVENODD code [2][3], the X-Code [21] and the B-Code [22]. These codes can be used effectively to achieve reliability and efficiency in storage systems.

2.2 EVENODD Code

The EVENODD code has a very simple structure: all bits are placed in an array of size (p - 1) x (p + 2), where p is a prime number, i.e., it has p + 2 columns. All the information bits are placed in the first p columns, and the last 2 columns contain all parity bits. The 2 parity columns are constructed using the diagonals of slope 0 and slope -1, respectively. Details about the construction of the EVENODD code can be found in [2]. The following example shows a construction for a (7,5) EVENODD code:

Example 2.1. A (7,5) EVENODD code

Table 2.1 shows an encoding rule of a (7,5) EVENODD code, and Table 2.2 is a numerical example of Table 2.1.

Table 2.1. Encoding of a (7,5) EVENODD code, where s = a5 + b4 + c3 + d2

a1 a2 a3 a4 a5 | a1 + a2 + a3 + a4 + a5 | s + a1 + b5 + c4 + d3
b1 b2 b3 b4 b5 | b1 + b2 + b3 + b4 + b5 | s + a2 + b1 + c5 + d4
c1 c2 c3 c4 c5 | c1 + c2 + c3 + c4 + c5 | s + a3 + b2 + c1 + d5
d1 d2 d3 d4 d5 | d1 + d2 + d3 + d4 + d5 | s + a4 + b3 + c2 + d1

Table 2.2. Numerical example of a (7,5) EVENODD code

1 0 1 1 0 | 1 | 0
0 1 1 0 0 | 0 | 0
1 1 0 0 0 | 0 | 1
0 1 0 1 1 | 1 | 0

It was proven in [2] that the EVENODD code is a 2-erasure-correcting array code, i.e., it is MDS. An algorithm for recovering 2 erased columns of the EVENODD code, and other details, can be found in [2]. A generalization of the EVENODD code to recover more erasures, while still maintaining the MDS property, is described in [3].

2.3 Update Complexity

One important parameter of array codes is the average number of parity bits affected by a change of a single information bit; this parameter is called the update complexity here. The update complexity is particularly crucial when the codes are used in storage applications that update information frequently. It is also a measure of the encoding complexity of the code: the lower this parameter is, the simpler the encoding operations are. If a code is described by a parity check matrix [13], then this parameter is the average row density (the number of nonzero entries in a row) of the parity check matrix. Research has been done to reduce this parameter, i.e., to make the density of the parity check matrix of codes as low as possible [9][17].

The obvious lower bound on the update complexity of any 2-erasure-correcting code is 2. The update complexity of EVENODD codes approaches 2 as the length (number of columns) of the codes increases. But it was proven in [3] that for any linear array code with separate information and parity columns, the update complexity is always strictly larger than 2. Then a natural question is whether an update complexity of 2 is achievable for general array codes. The answer is, fortunately, yes. The next two subsections will describe two classes of codes, called the X-Code and the B-Code respectively, whose update complexity is exactly 2.

2.4 X-Code

The X-Code is a class of 2-erasure-correcting MDS array codes.
Its update complexity is exactly 2, i.e., it has the optimal encoding (update) property. It has a very simple geometrical construction structure.

2.4.1 Structure of the X-Code. In the X-Code, information bits are placed in an array of size (n-2) x n. Like other array codes [2][3][5][11], parity bits are constructed by adding the information bits along several parity check lines or diagonals of given slopes. The addition operation is just the binary XOR. But instead of being put in separate columns, the parity bits of the X-Code are placed in two additional rows. So the coded array is of size n x n, with the first n-2 rows containing information bits, and the last two rows containing parity bits. Notice that each column has information bits as well as parity bits, i.e., information bits and parity bits are mixed in each column. By the structure of the code, if two columns are erased, the number of remaining bits is n(n-2), which is equal to the number of original information bits, making it possible to recover the two erased columns.

2.4.2 Encoding Rules. The encoding rule of the X-Code is simple: let C_{i,j} be the bit at the i-th row and j-th column; then the parity bits are constructed according to the following rules:

  C_{n-2,i} = sum_{k=0}^{n-3} C_{k,(i+k+2)_n}
                                                  (2.1)
  C_{n-1,i} = sum_{k=0}^{n-3} C_{k,(i-k-2)_n}

where i = 0, 1, ..., n-1, the sums are binary XORs, and (x)_n = x mod n. Equivalently, the first parity row is calculated along the diagonals of slope 1, and the second parity row along the diagonals of slope -1, in each case with the last row treated as an imaginary 0-row.
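Both the EVENODD rule of Table 2.1 and the X-Code rule (2.1) are only a few lines of XOR arithmetic. The following sketch (an illustration following the layouts described above, not the authors' reference implementation) encodes both codes, using the information bits of the (7,5) numerical examples:

```python
from functools import reduce
from operator import xor

def evenodd_encode(D, p):
    """Encode a (p-1) x p information array D into a (p-1) x (p+2) EVENODD array."""
    rows = p - 1
    C = [row[:] + [0, 0] for row in D]
    # Column p: row parities (the slope-0 diagonals).
    for i in range(rows):
        C[i][p] = reduce(xor, D[i])
    # S is the XOR along the diagonal r + j = p - 1 (the "s" of Table 2.1).
    S = reduce(xor, (D[r][p - 1 - r] for r in range(rows)))
    # Column p+1: S XORed with the slope -1 diagonal r + j = i (mod p);
    # the imaginary row p-1 is all zeros and so drops out of the sum.
    for i in range(rows):
        C[i][p + 1] = S ^ reduce(xor, (D[r][(i - r) % p] for r in range(rows)))
    return C

def xcode_encode(D, n):
    """Encode an (n-2) x n information array D into an n x n X-Code array, Eq. (2.1)."""
    C = [row[:] for row in D]
    C.append([reduce(xor, (D[k][(i + k + 2) % n] for k in range(n - 2))) for i in range(n)])
    C.append([reduce(xor, (D[k][(i - k - 2) % n] for k in range(n - 2))) for i in range(n)])
    return C

# Information bits of the (7,5) EVENODD example (Table 2.2, first 5 columns):
D5 = [[1, 0, 1, 1, 0],
      [0, 1, 1, 0, 0],
      [1, 1, 0, 0, 0],
      [0, 1, 0, 1, 1]]

# Information rows of the (7,5) X-Code numerical example:
D7 = [[1, 0, 1, 1, 0, 1, 0],
      [0, 1, 1, 0, 0, 0, 0],
      [1, 1, 0, 0, 0, 0, 1],
      [0, 1, 0, 1, 1, 1, 0],
      [1, 0, 0, 1, 0, 1, 0]]

EO = evenodd_encode(D5, 5)
XC = xcode_encode(D7, 7)
```

`evenodd_encode(D5, 5)` reproduces the two parity columns of Table 2.2, and `xcode_encode(D7, 7)` appends the two parity rows that Eq. (2.1) prescribes for these information bits.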
Geometrically speaking, the two parity rows are just the checksums along the diagonals of slopes 1 and -1, respectively. The following example shows the encoding of a (7,5) X-Code:

Example 2.2. A (7,5) X-Code. The first parity row is calculated along the diagonals of slope 1, and the second parity row along the diagonals of slope -1, in each case with the last row being an imaginary 0-row. Table 2.3 shows a numerical example of a (7,5) X-Code. []

Table 2.3. Numerical example of a (7,5) X-Code

  1 0 1 1 0 1 0
  0 1 1 0 0 0 0
  1 1 0 0 0 0 1
  0 1 0 1 1 1 0
  1 0 0 1 0 1 0
  -------------
  0 0 1 1 0 1 1
  1 1 1 0 0 1 0

(the first five rows are the information bits; the last two rows are the parity rows given by Eq. (2.1))

From the construction of the X-Code, it is easy to see that the two parity rows are obtained independently; more specifically, each information bit affects exactly one parity bit in each parity row. All parity bits depend only on information bits, not on each other. So updating a single information bit results in updating only two parity bits. Thus the X-Code has the optimal encoding (or update) property, i.e., its update complexity of 2 matches the lower bound for any 2-erasure-correcting code. In addition, notice that each column has two parity bits, each of which is the checksum of n-2 information bits. Thus computing the parity bits of each column needs 2(n-3) XORs. This balanced computation property of the X-Code is very useful in applications that require evenly distributed computations.

It was proven in [21] that

Theorem 2.1. (MDS Property of the X-Code) The X-Code can recover up to two erased columns, i.e., it is MDS, if and only if n is a prime number.

The procedure to recover up to two erased columns is called erasure-correcting or erasure-decoding.
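Theorem 2.1 can be checked mechanically for small n without implementing the decoder: write every coded bit as a GF(2) linear combination of the n(n-2) information bits, erase two columns, and test whether the surviving bits still determine all information bits. A small sketch (an independent brute-force check, not the proof of [21]):

```python
from itertools import combinations

def xcode_bit_masks(n):
    """Each coded bit of the n x n X-Code as a bitmask over the n*(n-2) info bits."""
    info = lambda r, c: 1 << (r * n + c)
    M = [[info(r, c) for c in range(n)] for r in range(n - 2)]
    M.append([0] * n)
    M.append([0] * n)
    for i in range(n):
        for k in range(n - 2):
            M[n - 2][i] ^= info(k, (i + k + 2) % n)  # Eq. (2.1), first parity row
            M[n - 1][i] ^= info(k, (i - k - 2) % n)  # Eq. (2.1), second parity row
    return M

def gf2_rank(vectors):
    """Rank over GF(2) of integer bitmask vectors, by Gaussian elimination."""
    basis, rank = {}, 0
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top in basis:
                v ^= basis[top]
            else:
                basis[top] = v
                rank += 1
                break
    return rank

def is_two_erasure_mds(n):
    """True iff every pair of erased columns leaves full information rank."""
    M = xcode_bit_masks(n)
    dim = n * (n - 2)
    for i, j in combinations(range(n), 2):
        surviving = [M[r][c] for r in range(n) for c in range(n) if c not in (i, j)]
        if gf2_rank(surviving) < dim:
            return False
    return True
```

For prime n (e.g. 5 and 7) every pair of erased columns is recoverable, while for a composite n such as 9 some pair is not, matching the "if and only if" of Theorem 2.1.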
A formal description and correctness proof of the erasure-decoding algorithm for the X-Code can be found in [21]. A pseudo-code description of the algorithm can also be found in [23]. More details about the X-Code, including examples of its erasure-decoding algorithm, are discussed in [21] and [23].

2.5 B-Code

Now we describe another class of 2-erasure-correcting MDS array codes, called the B-Code. Its update complexity is also exactly 2, i.e., it has the optimal encoding (update) property. The construction of the B-Code is not as direct as that of the X-Code: it is related to a classical graph theory problem.

2.5.1 Structure of the B-Code. The B-Code is of size n x l, where l = 2n or 2n+1. Denote such a B-Code by B_l. As in the X-Code, parity bits are placed in a row rather than in columns. For B_{2n}, the first n-1 rows are information rows, and the last row is a parity row, i.e., all the bits in the first n-1 rows are information bits, while the 2n bits in the last row are parity bits. The structure of B_{2n+1} can be derived from that of B_{2n} simply by adding one more information column as the last column. Their structures are shown in Figure 2.1.

Fig. 2.1. Structures of (a) B_{2n} and (b) B_{2n+1}

Intuitively, if the roles of the information and parity bits of the B-Code are exchanged, i.e., the parity bits are placed in the entries which originally held information bits and vice versa, then we get the dual code of the B-Code. Denote the dual B-Code of length l by B'_l. A more rigorous definition of the dual code for general array codes can be found in [22].
It was also proven in [22] that

Theorem 2.2. The dual code of an MDS array code is also MDS.

So the dual B-Code is also an MDS array code; it has distance l-1, i.e., the dual B-Code can be recovered from any two of its columns. Figure 2.2 shows the structures of B'_{2n} and B'_{2n+1}.

Fig. 2.2. Structures of (a) B'_{2n} and (b) B'_{2n+1}

2.5.2 A New Graph Description of the B-Code. Typically, an array code is described by its geometrical construction lines or diagonals [2][3][5][11], as is the X-Code. Constructions of array codes are difficult to obtain using this description. Here we describe the B-Code and its dual using a new graph approach, which allows us to obtain the construction of the B-Code easily.

For any array code, each parity bit is the sum of some information bits. For binary codes, the addition is just the simple XOR (binary exclusive OR) operation. Now consider the dual B-Code B'_l. Simple counting [22] shows that each parity bit must be the sum of exactly 2 information bits. Thus if we represent an information bit as a vertex, then a parity bit can be represented by an edge, where the parity bit is the sum of the two information bits whose vertices form the edge. This is the key idea of describing the B-Code and its dual with graphs. Since the constructions of B'_{2n}, B_{2n} and B_{2n+1} can easily be obtained from that of B'_{2n+1}, here we give a graph description of B'_{2n+1}. Detailed justifications of this description can be found in [22].

Description 2.1.
Graph Description of B'_{2n+1}. Given a complete graph K_{2n} with 2n vertices, which are labeled with the integers from 1 to 2n, find an edge labeling scheme such that:
1) each edge is labeled exactly once by an integer from 1 to 2n+1;
2) for any pair of vertices (i,j) and any other vertex k, where i, j, k in [1, 2n], there is always a path to k from either i or j, using only the edges labeled with i or j;
3) for any vertex i and any other vertex k, where i, k in [1, 2n], there is always a path from i to k, using only the edges labeled with i or 2n+1.

With the above description, it is easy to see that the vertex and the edges with label i in K_{2n} represent the information bit and the parity bits in the i-th column of B'_{2n+1}. Properties 2) and 3) ensure that any two columns of the code can recover the information bits in all the other columns; thus the code has column distance 2n. Figure 2.3 shows such a labeling of K_4 and the corresponding B'_5, where a1 through a4 are the information bits.

Fig. 2.3. (a) graph and (b) array representations of B'_5

2.5.3 Construction of the B-Code. As already described above, constructing the B-Code amounts to the same problem as designing an edge labeling scheme as in Description 2.1 for a complete graph K_{2n}. Fortunately this can be related to another graph theory problem, namely the perfect one-factorization (P1F) problem.

Definition 2.1. [19] Let G = (V,E) be a graph. A factor, or spanning subgraph, of G is a subgraph with vertex set V. In particular, a one-factor is a factor which is a regular graph of degree 1.
A factorization of G is a set of factors of G which are pairwise edge-disjoint and whose union is all of G. A one-factorization of G is a factorization of G whose factors are all one-factors. In particular, a one-factorization is perfect if the union of any pair of its one-factors is a Hamilton cycle, i.e., a cycle that passes through every vertex of G. Figure 2.4 shows a perfect one-factorization of K_4.

Fig. 2.4. (a)(b)(c) are 3 one-factors that together form a perfect one-factorization of K_4

The perfect one-factorization of complete graphs has been studied for many years since its introduction in the 1960's in [12]. It is now known that [19]:

Theorem 2.3. If p is an odd prime, then K_{p+1} and K_{2p} have perfect one-factorizations.

Constructions of P1F for K_{p+1} and K_{2p} can be found in [1] and [18]. Additionally, constructions of P1F for K_{2n}'s whose n's are some other sporadic integers have also been found [18][19]. However, it still remains a conjecture [18][19] that:

Conjecture 2.1. For any positive integer n, K_{2n} has (a) perfect one-factorization(s).

It was proven in [22] that:

Theorem 2.4. Let P_m be a P1F for K_m. Constructing B'_{2n+1} (or equivalently B_{2n+1}) is equivalent to constructing P_{2n+2}.

Theorem 2.4 was proven constructively in [22]. Combining Theorem 2.3 and Theorem 2.4, we get:

Theorem 2.5. For any odd prime p, a B-Code and its dual code of size n x l can be constructed, where n is either (p-1)/2 or p-1, and l = 2n or 2n+1.

2.5.4 Erasure Recovery. Recall that the dual B-Code can recover all information bits from any two columns.
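The one-factorization machinery just introduced can be made concrete on the K_4 example of Figure 2.4. K_4 has exactly three perfect matchings, so (with the vertex labels 1-4 of the figure) its one-factors are forced, and perfection can be checked directly against Definition 2.1; a small sketch:

```python
from collections import defaultdict
from itertools import combinations

# The three one-factors of K4 (Figure 2.4): K4 has exactly these three
# perfect matchings, so its one-factorization is unique.
one_factors = [
    [(1, 2), (3, 4)],
    [(1, 3), (2, 4)],
    [(1, 4), (2, 3)],
]
vertices = [1, 2, 3, 4]

def is_hamilton_cycle(edges, vertices):
    """Check that the given edges form one cycle through every vertex."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if any(len(adj[v]) != 2 for v in vertices):
        return False
    # Walk the cycle from an arbitrary start; it must close after |V| steps.
    prev, cur, steps = None, vertices[0], 0
    while True:
        nxt = adj[cur][0] if adj[cur][0] != prev else adj[cur][1]
        prev, cur, steps = cur, nxt, steps + 1
        if cur == vertices[0]:
            return steps == len(vertices)

# Perfection: the union of ANY pair of one-factors is a Hamilton cycle.
perfect = all(is_hamilton_cycle(f1 + f2, vertices)
              for f1, f2 in combinations(one_factors, 2))
```

Each pairwise union is a 4-cycle, so `perfect` is True; this P1F condition is exactly what Theorem 2.4 ties to the B-Code construction.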
Erasure decoding for the dual B-Code is almost obvious from its graph description (Description 2.1). The two paths, starting from i and j and leading to all the other vertices in the graph, give the decoding chain used in recovering a B'-Code from its i-th and j-th columns. Figure 2.5 shows the decoding chains used in recovering B'_5 from its 1st column together with its 2nd, 3rd and 5th columns, respectively.

Fig. 2.5. Erasure decoding of B'_5: recovering from its 1st and (a) 2nd, (b) 3rd and (c) 5th columns. The decoding chains for each case are (a) 1,2 -> 3 -> 4, (b) 1 -> 4, 3 -> 2, and (c) 1 -> 3 -> 2 -> 4. 1 through 4 are the information bits in the corresponding columns.

Formal erasure recovery algorithms for the B-Code and its dual code, and more details about the B-Code, can be found in [22] and [23].

2.6 Comparisons of Array Codes

As already seen above, the X-Code and the B-Code have the optimal update property, i.e., their update complexity is exactly 2. The B-Code also achieves the maximum length possible for MDS codes with optimal update; thus the B-Code has optimal length, twice that of the X-Code with the same column size. In addition, the parity bits are evenly distributed over all columns, and each parity bit requires the same number of XOR operations. Consequently, the computational complexity for computing parity bits is balanced, i.e., the X-Code and the B-Code feature balanced computation as well. This property is quite useful in distributed storage systems, since the computational loads are naturally distributed to all disks evenly, eliminating another bottleneck.
The properties of the X-Code and the B-Code are summarized in Table 2.4, together with a comparison with Reed-Solomon and EVENODD codes.

Table 2.4. X-Code, B-Code vs. Reed-Solomon and EVENODD

  Codes \ Properties | MDS | XOR | Optimal Update | Optimal Length | Balanced Computation
  Reed-Solomon       | Yes | No  | No             | Yes            | No
  EVENODD            | Yes | Yes | No             | No             | No
  X-Code             | Yes | Yes | Yes            | No             | Yes
  B-Code             | Yes | Yes | Yes            | Yes            | Yes

3. Efficiency through Redundancy

While it is conventional wisdom that redundancy is necessary for fault tolerance, redundancy is in general regarded as a passive cost (overhead) to achieve reliability. In this section, however, it will be shown that in a distributed storage system, redundancy is an active part of the system, in the sense that proper data redundancy can help to improve the performance (data throughput) of storage systems. Thus data redundancy improves not only the reliability of a system, but also its efficiency. A similar idea was first shown in [7], namely that redundant data can make packet routing more efficient by reducing the mean and variance of the routing delay. Recently, more scalable and efficient reliable multicast schemes have been proposed, based on data redundancy in the messages to be multicast [10]. We will show here a more systematic way of using proper redundancy, based on error-correcting codes (particularly the MDS array codes described above), to improve the performance of data server systems, which are a superset of storage systems.

Our data server system setup is shown in Figure 3.1: a cluster of servers is connected via some reliable communication network.
In addition, broadcast is supported over the network, so that a client can broadcast its request for certain data to some or all of the n servers in the system. The data is distributed over the servers in such a way that a client can recover the complete requested data after it gets data from at least k of the n servers, and this is true for any k servers. Such a distributed data server system is called an (n, k) server system. Again, such (n, k) systems can be implemented by using error-correcting codes, particularly MDS array codes.

For the above data server system, there are a couple of problems to solve: (1) What is the proper redundancy when the total number of servers is given? Or, how should k be determined when n is given, in order to achieve the best system performance? This is the so-called data distribution problem at the server side. (2) Once data redundancy is properly distributed among the servers, how should matching read approaches be chosen to optimize mean service time? This is the problem called data acquisition at the client side. Both problems will be explored in this section, mostly theoretically.

Fig. 3.1. An (n, k) server system

3.1 Preliminary Analysis

Before we seek solutions to the above problems, we first define the server system model we will be using, based on probability analysis. Then we give some basic analytical results that can be used further to solve the data distribution and the data acquisition problems.

3.1.1 System Model.
Define the service time T_i of server i (1 <= i <= n) to be the elapsed time from when the client sends its request to server i to when it receives the data from server i. Notice that T_i does not include the time needed at the client side to do any necessary computations to recover the final data, since here we assume that the computations are rather simple and thus take much less time than the data delivery through the communication media. We model T_i as a continuous random variable with probability density function (pdf) f_i(t) [15]. For simplicity of analysis, we assume that all T_i's are i.i.d. (independent, identically distributed) random variables, i.e., f_i(t) = f(t), 1 <= i <= n.

3.1.2 Analysis Results. Let F_i(t) be the cumulative distribution function (cdf) of T_i, i.e. [15],

  F_i(t) = Probability(T_i <= t) = integral_0^t f_i(x) dx

Now let T(n,k) be the elapsed time from when the client broadcasts its data request to the servers to when it has received data from at least k of the n servers. Then T(n,k) is another random variable and a simple function of all the T_i's:

  T(n,k) = min { t : ||{ i : T_i <= t }|| >= k }

i.e., T(n,k) is the k-th smallest of T_1, ..., T_n. In the above equation, ||S|| is the number of elements in the set S.

Let f_{(n,k)}(t) and F_{(n,k)}(t) be the pdf and cdf of T(n,k) respectively; then it is easy to relate F_{(n,k)}(t) and f_{(n,k)}(t) to F(t) and f(t) [7]:

  F_{(n,k)}(t) = sum_{i=k}^{n} C(n,i) F(t)^i [1 - F(t)]^{n-i}    (3.1)

or [7][20]:

  f_{(n,k)}(t) = dF_{(n,k)}(t)/dt = k C(n,k) F(t)^{k-1} [1 - F(t)]^{n-k} f(t)    (3.2)

The mean of T(n,k), E[T(n,k)], is a good measurement of the server system's performance. It can be calculated once f_{(n,k)}(t) is known:

  E[T(n,k)] = integral of t f_{(n,k)}(t) dt    (3.3)
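Eqs. (3.2) and (3.3) can be evaluated numerically for any given f and F. As a sanity check, for the uniform distribution on [0,1] the k-th smallest of n i.i.d. service times has the known mean k/(n+1); the sketch below (an illustration with arbitrarily chosen step counts) integrates Eq. (3.3) with the midpoint rule:

```python
from math import comb

def mean_service_time(n, k, f, F, a, b, steps=20000):
    """E[T(n,k)] = integral of t * f_(n,k)(t) dt, with f_(n,k) from Eq. (3.2)."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        t = a + (i + 0.5) * h
        density = k * comb(n, k) * F(t) ** (k - 1) * (1.0 - F(t)) ** (n - k) * f(t)
        total += t * density * h
    return total

# Time to hear from the 2 fastest of 5 servers with Uniform[0,1] service times:
# the k-th order statistic of n uniforms has mean k/(n+1) = 2/6 here.
e = mean_service_time(5, 2, f=lambda t: 1.0, F=lambda t: t, a=0.0, b=1.0)
```

The same routine works for any pdf/cdf pair, which is how the triangular model of Section 3.2 is handled later.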
3.1.3 Properties of Mean Service Time. Though it is usually hard to get a clean closed form of E[T(n,k)] for a general pdf f(t), it is still possible to get some of its properties with respect to n and k. Intuitively, for a fixed pdf f(t), a bigger n and/or a smaller k leads to a smaller E[T(n,k)], and this can be proven mathematically [23]:

Theorem 3.1. For a random variable T with a fixed pdf f(t), the following inequalities hold for 1 <= k <= n:
1. E[T(n,k)] > E[T(n+m,k)], for m >= 1;
2. E[T(n,k)] < E[T(n,k+m)], for m >= 1;
3. E[T(n,k)] < E[T(n+m,k+m)], for m >= 1;
4. E[T(i,j)] >= E[T(n,k)], if n >= i and k <= j; equality holds only when n = i and k = j;
5. E[T(i,j)] < E[T(n,k)], if n >= i, k > j and n - k < i - j. []

We will use the properties above as guidelines for the data distribution and the data acquisition problems. One would hope that the variances of the random variables had similar properties. Unfortunately, however, the above properties do not hold for the variances; a counterexample is shown in [23].

3.2 Server Performance Model

From Eq. (3.2) and Eq. (3.3), E[T(n,k)] is a function of the pdf f(t) of an individual server's data service time. The goal of the data distribution and the data acquisition problems is to reduce E[T(n,k)] under various conditions. Before we analyze these two problems, it is necessary to establish some model of f(t).

3.2.1 Abstraction from Experiments.
The data service time T depends on many factors in a practical server system, such as the computing power (i.e., CPU speed) of the servers and the client, the local disk I/O speed of the servers, and the bandwidth and latency of the communication medium (usually including a reliable communication software layer) connecting the servers and the client. A model considering all these factors would be fairly complex. In this section, we will try to model the data service time as a simple probability distribution that can be analyzed rather easily, and yet approximates the real data service time closely. Such a model will be abstracted from experimental results on a real data server system.

Our experimental server system consists of several servers, which are PCs running Linux. Each server has data stored on its local hard disk. Data is accessed via the Linux file system. The client is also a PC running the same Linux. The nodes are connected via Myrinet switches. A sliding window protocol is used to ensure reliable communication. Experiments are conducted in this real system to measure the service time for data of different sizes.
The procedure of the experiment is as follows: (1) the client sends a request for a certain amount of data to a server; (2) the server reads the data from its local disk and sends it to the client through the reliable communication layer; (3) the data is delivered to the client through the reliable communication layer. The data service time is measured from the instant that the client finishes sending its request to the instant that the client gets the data. We run the above procedure a few thousand times for data of a given size, and obtain the service time pdf according to the observed frequencies of the different ranges of service time. Figure 3.2 shows empirical service time pdfs for data sizes of (a) 32 Kbytes, (b) 320 Kbytes and (c) 3200 Kbytes.

Fig. 3.2. Empirical pdfs of service time for data of different sizes

The effective data bandwidths in these experiments are quite low, since they are the concatenation of the local disk bandwidth and the reliable communication layer bandwidth. But the shape of the bandwidth pdfs is more interesting. The experimental results show that the shapes of the empirical pdfs for different data sizes can be approximated by the same distribution.
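This "same shape at every size" behavior is what the packet-level explanation given below for the triangular model predicts: a service time is a setup delay plus a sum of i.i.d. per-packet delays, which produces a bell-shaped (Gaussian-like) empirical pdf. A quick simulation sketch, with made-up setup and per-packet time ranges (all parameter values here are hypothetical, not measured):

```python
import random

def service_time(num_packets, rng,
                 setup=(0.1, 0.3),          # hypothetical setup-time range
                 per_packet=(0.01, 0.03)):  # hypothetical per-packet delivery range
    """Service time = uniform setup time + sum of i.i.d. uniform per-packet times."""
    return rng.uniform(*setup) + sum(rng.uniform(*per_packet)
                                     for _ in range(num_packets))

rng = random.Random(0)
samples = [service_time(100, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
```

Histogramming `samples` gives the single-peaked, roughly symmetric shape that the triangular distribution of the next subsection approximates; the sample mean sits near the setup mean plus 100 times the per-packet mean.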
A closer look shows that the width of the distribution base is approximately proportional to the data size. More complex distributions, such as the Gamma distribution or the Beta distribution, might be more accurate. But to simplify the analysis that follows, we will regard the data service time T as a random variable defined on [a, b] (a and b are two parameters of a real system) which follows a symmetric triangular distribution, denoted Tr[a, b]:

  f(t) = { 4(t-a)/(b-a)^2,  a <= t <= (a+b)/2
         { 4(b-t)/(b-a)^2,  (a+b)/2 < t <= b        (3.4)

Its cdf (cumulative distribution function) is

  F(t) = { 2(t-a)^2/(b-a)^2,      a <= t <= (a+b)/2
         { 1 - 2(b-t)^2/(b-a)^2,  (a+b)/2 < t <= b   (3.5)

One explanation for this model is as follows: in a real system, data is delivered in packets of some small size. The delivery time of the i-th packet is a random variable t_i, whose probability distribution can be characterized by a uniform distribution over some time span; the t_i's are assumed to be i.i.d. random variables. Then the service time T of the whole data is T = s + sum_i t_i, where s is another uniform random variable describing the setup (or overhead) time for sending a certain amount of data. Thus the pdf of T is a Gaussian-like function, whose base width is approximately proportional to the number of packets in the data, which in turn is proportional to the data size. For simplicity, we approximate the Gaussian-like function by a suitable triangular function. The distributions are shown in Figure 3.3.

3.2.2 Verification with T(n,1).
Intuitively, having more servers should provide better performance when the amount of data stored on each server is fixed, i.e., E[T(n,k)] decreases as n increases and/or k decreases. We can get the pdfs of T(n,k) for a data server system by evaluating Eq. (3.2) for the service time distribution in Eq. (3.4) and Eq. (3.5). Figure 3.4(a) shows the pdfs of T(n,1), where 1 <= n <= 3 and T has the triangular distribution Tr[1,2]. Here we can see that the pdf of T(n,k) shifts left as n increases, which indicates that the average of the random variable T(n,k) decreases as n increases.

Fig. 3.3. Probability distributions of data service time of (a) a single packet, (b) the whole data, (c) the approximation with Tr[a,b]

To further verify the properties of E[T(n,k)], simple experiments to measure T(n,1) were done on the experimental server system described in the previous subsection. The system consists of three servers. In order to remove other factors that also affect data service time, such as contention in the communication medium (including the reliable communication layer, which is a bottleneck if we use a single client that communicates with the three servers), we use three clients, each of which is served by a separate server. Conceptually the three clients are regarded as a single client; thus the whole data service time is the minimum of the three individual service times of the server-client pairs. Figure 3.4(b) shows the service times (T1, T2 and T3) of the three individual server-client pairs for 3200 Kbytes of data each.
Since the variance among the three pairs is bigger than the variance within each pair, the whole service time (Tmin), which is the minimum of the three, is determined by the service time of the best client-server pair, as can be seen in the experimental results. In this case, the pdf of Tmin is very close to that of T1. To make the experimental results more interesting, some random loads are added to each server, so that the variance among the three client-server pairs is less than the variance within each pair, i.e., each pair behaves more similarly. The service times of the three individual pairs (T1, T2 and T3) and the whole service time (Tmin) are shown in Figure 3.4(c). Of these four pdfs (Tmin, T1, T2 and T3), that of Tmin is the leftmost, which supports the analytical properties of T(n,k) and the pdf model of T.

3.3 Data Distribution Scheme

Now let us turn to the data distribution problem: in a server system with a given total number of servers, n, we need to determine the number k of servers which store the raw data, in order to maximize the performance of the whole system (i.e., to minimize the mean service time of a client's data request); given k, the rest of the servers store the redundant data. When n and the pdf f(t) are fixed, E[T(n,k)] decreases monotonically as k decreases.
Fig. 3.4. pdfs of T(n,1): (a) analytical result, where the pdf of T is Tr[1,2], and experimental service times for data of size 3200 Kbytes, with (b) no other loads on the servers, and (c) other random loads on the servers

This means that in order to make E[T(n,k)] small, k should be as small as possible. On the other hand, however, the smaller k is, the more data needs to be stored on each server, since the total amount of data a client needs is always fixed; this means a higher service time from each server. Our goal is to find the k that minimizes E[T(n,k)] when both sides of the problem are considered.

After the parameter k is determined, in order to achieve optimal performance in terms of E[T(n,k)], we can use MDS array codes to distribute the redundant data so that data from any k servers can be assembled to form the whole of the requested data, as was shown in the previous section. The only remaining problem is to determine k so as to minimize E[T(n,k)].

Applying the pdf model of each server's service time T, and using MDS codes for distributing the redundant data, we get that if the pdf of T is Tr[a,b] when k = 1, then for general k the corresponding pdf is Tr[a/k, b/k], since the base width of the pdf is proportional to the data size. Theoretically, the optimal k can be calculated as follows:

  k_min = argmin_k integral of k C(n,k) F(t)^{k-1} [1 - F(t)]^{n-k} t f(t) dt    (3.6)

where f(t) and F(t) are as in Eq. (3.4) and Eq. (3.5), except that a and b are replaced by a/k and b/k respectively.
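Eq. (3.6) is straightforward to evaluate numerically. The sketch below (using the Tr[a/k, b/k] scaling just described and a simple midpoint integration; the step count is arbitrary) searches for k_min:

```python
from math import comb

def tri_pdf(t, a, b):
    """Symmetric triangular pdf Tr[a,b], Eq. (3.4)."""
    m = (a + b) / 2
    if a <= t <= m:
        return 4 * (t - a) / (b - a) ** 2
    if m < t <= b:
        return 4 * (b - t) / (b - a) ** 2
    return 0.0

def tri_cdf(t, a, b):
    """Symmetric triangular cdf, Eq. (3.5)."""
    m = (a + b) / 2
    if t <= a:
        return 0.0
    if t <= m:
        return 2 * (t - a) ** 2 / (b - a) ** 2
    if t <= b:
        return 1.0 - 2 * (b - t) ** 2 / (b - a) ** 2
    return 1.0

def expected_service(n, k, a, b, steps=4000):
    """E[T(n,k)] for per-server service times Tr[a/k, b/k], via Eq. (3.2)/(3.3)."""
    ak, bk = a / k, b / k   # each server holds a 1/k share of the data
    h = (bk - ak) / steps
    e = 0.0
    for i in range(steps):
        t = ak + (i + 0.5) * h
        Ft = tri_cdf(t, ak, bk)
        e += t * k * comb(n, k) * Ft ** (k - 1) * (1 - Ft) ** (n - k) * tri_pdf(t, ak, bk) * h
    return e

def k_min(n, a, b):
    """The k in [1, n] minimizing E[T(n,k)], Eq. (3.6)."""
    return min(range(1, n + 1), key=lambda k: expected_service(n, k, a, b))
```

With a = 1 and b = 5 this kind of numerical search reproduces, for instance, k_min = 10 for n = 10 (the n = 10 case of Figure 3.5).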
Notice that k_min is a function of the entire pdf f(t), not only of the mean E[T] and the variance Var[T]. Even for a simple pdf such as Tr[a, b], the above equation cannot be solved in closed form. But in practice, the system parameters a and b can be determined by experiments; then the above equation can be solved numerically. Figure 3.5 gives several examples of solving the above equation. In the examples, a = 1 and b = 5. For n = 10, 20, and 40, E[T(n, k)] is calculated for 1 <= k <= n. The results are shown in Figure 3.5(a)(b)(c), where (b) and (c) only show the last few values of k, since for small k, E[T(n, k)] decreases monotonically as k increases. From the results, we can see that (a) k_min = 10 when n = 10, (b) k_min = 19 when n = 20, and (c) k_min = 37 when n = 40.

Even though the above examples use specific pdfs, the same method also applies to other pdfs by plugging a suitable f(t) into Eq. (3.6). Thus, for a given server system, such a k_min can always be found. Proper MDS array codes can then be used based on the (n, k) pair. Thus we get an optimal data distribution scheme for a given server system.

3.4 Data Acquisition Scheme

Once the data distribution scheme is set, i.e., k is determined and the proper MDS array code is chosen, the client needs to decide how to request (or read) data. In general, a client should send its request to as many servers as possible and also make the amount of data it needs from each server as small as possible, since the properties of E[T(n, k)] show that more redundancy brings better performance.

Fig. 3.5. E[T(n, k)] vs. k for different n, where a = 1 and b = 5.

For a specific distribution scheme, the client needs to calculate the pdfs of all possible data read schemes, and then choose an optimal read scheme. Since the read schemes are closely related to the MDS array code being used, here we will give an example using a specific code to show the guidelines for choosing an optimal read scheme.

In this example, the server system has 2n servers, and the data that the client requests can be assembled from any 2n - 2 servers, i.e., this is a (2n, 2n - 2) system. The B-Code can be used to implement this system. The data distribution using the B-Code is as follows: (1) the whole raw (information) data is partitioned into 2n(n - 1) blocks of equal size (some padding is added if necessary); (2) each of the 2n servers stores n - 1 blocks of the data; (3) 2n blocks of redundant (or parity) data are calculated according to the encoding rules of the B-Code, i.e., each parity block is an XOR of suitable 2n - 2 raw data blocks, and then each server stores 1 parity block. The structure of the B-Code is shown in Figure 2.1.
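The three distribution steps can be sketched as plain bookkeeping. In the sketch below, `distribute` is a hypothetical helper, and the parity block XORs an arbitrary choice of 2n - 2 raw blocks purely as a placeholder; the actual B-Code encoding rules (cf. [22]) prescribe exactly which raw blocks feed each parity block:

```python
def distribute(data: bytes, n: int):
    """Lay out raw data on 2n servers for a (2n, 2n-2) B-Code-style system.

    Step (1): pad and split the raw data into 2n(n-1) equal blocks.
    Step (2): give each of the 2n servers n-1 raw blocks.
    Step (3): add one parity block per server.  NOTE: the XOR below covers
    an arbitrary 2n-2 raw blocks as a stand-in; the real B-Code dictates
    which blocks each parity block must cover.
    """
    m = 2 * n * (n - 1)                      # total number of raw blocks
    size = -(-len(data) // m)                # block size, rounded up
    data = data.ljust(m * size, b"\0")       # step (1): padding
    blocks = [data[i * size:(i + 1) * size] for i in range(m)]

    servers = []
    for s in range(2 * n):                   # step (2): n-1 raw blocks each
        raw = blocks[s * (n - 1):(s + 1) * (n - 1)]
        others = blocks[:s * (n - 1)] + blocks[(s + 1) * (n - 1):]
        parity = bytes(size)                 # step (3): XOR of 2n-2 raw blocks
        for blk in others[:2 * n - 2]:       # placeholder choice of blocks
            parity = bytes(x ^ y for x, y in zip(parity, blk))
        servers.append({"raw": raw, "parity": parity})
    return servers
```

Each of the 2n servers ends up holding exactly n blocks, n - 1 raw and 1 parity, matching steps (1)-(3) above.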
The MDS property of the B-Code gives 3 schemes for reconstructing the whole raw data from the data stored on the 2n servers, each of which holds n - 1 blocks of raw data and 1 block of parity data: (1) read from all of the 2n servers, each of which sends its n - 1 blocks of raw data; (2) read from any 2n - 2 servers, each of which sends all of its n blocks of data (including raw and parity data); (3) read from all of the 2n servers, each of which sends all of its n blocks of data. The 3 schemes are shown in Figure 3.6, where the shaded parts are the data to be read.

Fig. 3.6. Three read schemes using the B-Code.

Notice that there is no redundant data in scheme (1) or scheme (2), so the client must wait until it receives all the data from all the servers. But in scheme (3) there is redundant data, so the client only needs to receive data from any 2n - 2 of the 2n requested servers. Let E[T(2n, 2n)]_{n-1}, E[T(2n - 2, 2n - 2)]_n and E[T(2n, 2n - 2)]_n denote the mean data service times of the three schemes respectively, where the subscript indicates the number of blocks each server sends. From Property 1 of Theorem 3.1, E[T(2n - 2, 2n - 2)]_n > E[T(2n, 2n - 2)]_n. But the relation between E[T(2n, 2n)]_{n-1} and either E[T(2n - 2, 2n - 2)]_n or E[T(2n, 2n - 2)]_n is not so obvious, since in scheme (1) the client needs to wait for more servers, but needs less data (and thus less service time) from each server. So to determine which scheme is best for a given system, we need to calculate the pdf of the whole service time for all the possible schemes, which are scheme (1) and scheme (3) in this case.
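As a quick sanity check on the read volume (a trivial sketch, block counts only; `blocks_requested` is a hypothetical helper): schemes (1) and (2) transfer exactly the same total number of blocks, since 2n(n - 1) = (2n - 2)n, so they differ only in how many servers must be awaited, while scheme (3) requests more blocks but may ignore the two slowest servers:

```python
def blocks_requested(n):
    """Total blocks read by each B-Code read scheme in a (2n, 2n-2) system."""
    scheme1 = 2 * n * (n - 1)       # all 2n servers, n-1 raw blocks each
    scheme2 = (2 * n - 2) * n       # any 2n-2 servers, all n blocks each
    scheme3 = 2 * n * n             # all 2n servers, all n blocks each
    return scheme1, scheme2, scheme3
```

For n = 3 this gives 12, 12 and 18 blocks respectively; scheme (3) always requests exactly 2n extra parity blocks.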
Assume that the pdf of the time T for each server to send n blocks of data to the client is Tr[a, b]; then the pdf of T in scheme (1) is Tr[(n-1)a/n, (n-1)b/n], since each server only needs to send n - 1 blocks of data, and the pdf of T in scheme (2) or (3) is Tr[a, b]. Now the pdfs of the whole service time in the different schemes can be calculated according to Eq. (3.2), Eq. (3.4) and Eq. (3.5). Figure 3.7 shows the pdfs for different values of n, where a = 1 and b = 10. Using Eq. (3.3), the mean of the whole service time of the different schemes can be calculated. These means are listed in Table 3.1, for a = 1 and b = 10.

Table 3.1. Mean service time of different data read schemes, where a = 1 and b = 10

    n                         3        7        10
    E[T(2n, 2n)]_{n-1}        5.2195   7.3128   7.8857
    E[T(2n-2, 2n-2)]_n        7.4089   8.4207   8.6976
    E[T(2n, 2n-2)]_n          5.8910   7.2466   7.6786

The above calculations show that the performance of the three schemes depends on the system parameter n (when a and b are fixed). In a small server system, scheme (1) is the best; as n increases, scheme (3) becomes better. For a system of 6 servers (n = 3), scheme (1) is the best, but for systems of 14 servers (n = 7) and 20 servers (n = 10), scheme (3) is the best.

Though quite simple, the above example shows that after the data distribution is set at the server side, the client has different ways of reading data from the servers. For a given system (i.e., a certain pdf of T, a fixed (n, k) pair and a particular code), there always exists an optimal read scheme for the client. Finding this scheme requires careful calculation. Since the read schemes are highly related to the codes used, exploring codes that offer more read choices is an interesting research problem.
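The qualitative content of Table 3.1 can be reproduced by simulation. The sketch below is an illustration only, again assuming Tr[a, b] is a symmetric triangular pdf on [a, b] (the chapter's Eq. (3.4)/(3.5) may define it differently), with `scheme_means` a hypothetical helper: scheme (1) waits for the slowest of 2n servers that each send only n - 1 blocks, scheme (2) for the slowest of 2n - 2 servers, and scheme (3) for the (2n - 2)-th fastest of all 2n servers:

```python
import random

def scheme_means(n, a=1.0, b=10.0, trials=20000, seed=7):
    """Monte-Carlo means of the three read schemes for a (2n, 2n-2) system."""
    rng = random.Random(seed)
    scale = (n - 1) / n           # scheme (1): each server sends n-1 of n blocks
    s1 = s2 = s3 = 0.0
    for _ in range(trials):
        times = [rng.triangular(a, b) for _ in range(2 * n)]
        s1 += max(rng.triangular(scale * a, scale * b) for _ in range(2 * n))
        s2 += max(times[:2 * n - 2])       # wait for all of 2n-2 servers
        s3 += sorted(times)[2 * n - 3]     # wait for the 2n-2 fastest of 2n
    return s1 / trials, s2 / trials, s3 / trials
```

Under this triangular assumption the simulated means follow the pattern of Table 3.1: scheme (2) is always the worst, scheme (1) wins for n = 3, and scheme (3) tends to overtake it as n grows.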
It is conjectured in [23] that all MDS codes have a so-called strong MDS property, which provides the flexibility of reading schemes.

4. Summary

This chapter deals with two issues in highly available distributed storage systems: reliability and efficiency. To achieve reliability, three classes of MDS array codes are described. They are suitable for storage applications because of their simple computations for encoding and decoding, their MDS property and their low (or optimal) update complexity. Two problems, namely the data distribution problem and the data acquisition problem, are then studied, and solutions are proposed that use the redundancy in storage systems properly to improve their performance.

A practical distributed storage system has been implemented as part of the RAIN (Reliable Array of Independent Nodes) system, a reliable and efficient computing environment at the Parallel and Distributed Computing Lab of Caltech, using the approaches discussed in this chapter. A detailed description of the RAIN system can be found in [6].

Fig. 3.7. PDFs of different data read schemes, where a = 1, b = 10; 1, 2 and 3 represent schemes (1), (2) and (3) respectively.

References

1. B. A. Anderson, "Symmetry Groups of Some Perfect 1-Factorizations of Complete Graphs," Discrete Mathematics, 18, 227-234, 1977.
2. M. Blaum, J. Brady, J. Bruck and J. Menon, "EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures," IEEE Trans. on Computers, 44(2), 192-202, Feb. 1995.
3. M. Blaum, J. Bruck and A. Vardy, "MDS Array Codes with Independent Parity Symbols," IEEE Trans. on Information Theory, 42(2), 529-542, March 1996.
4. M. Blaum, P. G. Farrell and H. C. A. van Tilborg, "Chapter on Array Codes," Handbook of Coding Theory, edited by V. S. Pless and W. C. Huffman, to appear.
5. M. Blaum and R. M. Roth, "New Array Codes for Multiple Phased Burst Correction," IEEE Trans. on Information Theory, 39(1), 66-77, Jan. 1993.
6. V. Bohossian, C. Fan, P. LeMahieu, M. Riedel, L. Xu and J. Bruck, "Computing in the RAIN: A Reliable Array of Independent Nodes," Caltech Technical Report, 1998. Available at: http://paradise.caltech.edu/papers/etr029.ps.
7. M. N. Frank, "Dispersity Routing in Store-and-Forward Networks," Ph.D. thesis, University of Pennsylvania, 1975.
8. P. G. Farrell, "A Survey of Array Error Control Codes," ETT, 3(5), 441-454, 1992.
9. R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, Massachusetts, 1963.
10. J. Gemmell, "Scalable Reliable Multicast Using Erasure-Correcting Re-Sends," Technical Report MSR-TR-97-20, Microsoft Research, June 1997.
11. R. M. Goodman, R. J. McEliece and M. Sayano, "Phased Burst Error Correcting Array Codes," IEEE Trans. on Information Theory, 39, 684-693, 1993.
12. A. Kotzig, "Hamilton Graphs and Hamilton Circuits," Theory of Graphs and Its Applications (Proc. Sympos. Smolenice), 63-82, 1963.
13. F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting Codes, Amsterdam: North-Holland, 1977.
14. Norman K. Ouchi, "System for Recovering Data Stored in Failed Memory Unit," US Patent 4092732, May 30, 1978.
15. A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd Edition, McGraw-Hill, Inc., 1984.
16. D. A. Patterson, G. A. Gibson and R. H. Katz, "A Case for Redundant Arrays of Inexpensive Disks," Proc. SIGMOD Int. Conf. Data Management, 109-116, Chicago, IL, 1988.
17. R. M. Tanner, "A Recursive Approach to Low Complexity Codes," IEEE Trans. on Information Theory, 27(5), 533-547, Sep. 1981.
18. D. G. Wagner, "On the Perfect One-Factorization Conjecture," Discrete Mathematics, 104, 211-215, 1992.
19. W. D. Wallis, One-Factorizations, Kluwer Academic Publishers, 1997.
20. Samuel S. Wilks, Mathematical Statistics, John Wiley & Sons, Inc., 1963.
21. L. Xu and J. Bruck, "X-Code: MDS Array Codes with Optimal Encoding," IEEE Trans. on Information Theory, 45(1), 272-276, Jan. 1999. Also available at: http://paradise.caltech.edu/papers/etr020.ps.
22. L. Xu, V. Bohossian, J. Bruck and D. Wagner, "Low Density MDS Codes and Factors of Complete Graphs," Proceedings of 1998 IEEE Symposium on Information Theory, Aug. 1998; revised version to appear in IEEE Trans. on Information Theory, Sep. 1999. Also available at: http://paradise.caltech.edu/papers/etr025.ps.
23. L. Xu, "Highly Available Distributed Storage Systems," Ph.D. thesis, California Institute of Technology, 1998. Also available at: http://paradise.caltech.edu/~lihao/thesis.html.

List of Lectures

K. Abdali: Advanced Computing and Communication Research under NSF Support
W. Almesberger: SRP - a Scalable Resource Reservation Protocol for the Internet
J. Blum: Para-Station: High Performance Environment for Clusters
T. Braun: Differentiated Internet Services
J. Bruck: Reliable Distributed High Performance Computing
H. Busch: BRAIN - Berlin Research Area Information Network
G. Cooperman: Parallel TOP-C and Scaling Up with DSM
T. Eickermann: Metacomputing in the Gigabit Testbed West
L. Finkelstein: Experiences at Northeastern University Connecting to a High Performance National Network (presented by G. Cooperman)
E. Gabriel: High Performance Metacomputing in a Transatlantic Wide Area Application Testbed
G. Havas: Some Performance Studies in Exact Linear Algebra
A. Hoisie: Performance and Scalability Analysis of Applications on Teraflop-class Distributed Architectures
P. Holleczek: Controlling the Quality of Service in Wide Area ATM Networks
E. Jessen: The Gigabitwissenschaftsnetz of DFN
M. Köster: High Performance Computing across the ATM-WAN Essen-Bonn
H. Lederer: Visual Supercomputing and Metacomputing - Gigabit Testbed Projects with Contributions of the Max Planck Society
M. Mähler: Value-added Services Based on Virtual LANs
I. Matta: Quality of Service in Wide Area Networks: Issues and Protocols
G. Michler: The Monster: A Challenge for High Performance Computing
D. Nastoll and H. Gollan: MILESS - A Learning and Teaching Server for Multi-Media Documents
T. Plagemann: Gigabit Networking in Norway - Infrastructure, Applications and Projects
E. Quintana-Orti: A Portable Subroutine Library for Solving Linear Control Problems on Distributed Memory Computers
E. Rathgeb: Gigabit Wide Area Networks - Options and Trends
A. Rieke: Encryption in ATM Systems
G. Schneider: Low-Speed ATM over ADSL and the Need for High-Speed Networks
U. Schwiegelshohn: The NRW Metacomputing Initiative
R. Staszewski: Retrodigitalization and Multivalent Document Systems
T. Warschko: ParaStation 2: Efficient Parallel Computing in Workstation Clusters
M. Weller: Multi-Broadcast Communication in ATM Computer Networks and Mathematical Algorithm Development

List of Registered Participants

Dr. Kamal Abdali, National Science Foundation, 1800 G Street N.W., Washington DC 20550, USA, kabdali@nsf.gov
Dr. W. Almesberger, Département d'informatique, Laboratoire de réseaux de communication (LRC), IN-Ecublens, CH-1015 Lausanne, werner.almesberger@di.epfl.ch
I. Blum, Fakultät für Informatik, Universität Karlsruhe, Am Fasanengarten 5, 76131 Karlsruhe, blum@ira.uka.de
Prof. Dr. T. Braun, Universität Bern, Institut für Informatik und angewandte Mathematik, Neubrückstr. 10, 3012 Bern, braun@iam.unibe.ch
Prof. J. Bruck, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125, USA, bruck@vangogh.paradise.caltech.edu
H. Busch, Konrad-Zuse-Zentrum für Informationstechnik Berlin, Bereich Rechenzentren, Abteilung Höchstleistungsrechner (Leiter), Takustr. 7, 14195 Berlin-Dahlem, busch@zib.de
E. Gabriel, Rechenzentrum Universität Stuttgart, Abteilung Paralleles Rechnen, Allmandring 30, 70550 Stuttgart, gabriel@hlrs.de
H. Gollan, Institut für Experimentelle Mathematik, Universität GH Essen, Ellernstr. 29, 45326 Essen, holger@exp-math.uni-essen.de
Prof. George Havas, Dept.
of Computer Science, University of Queensland, Queensland 4072, Australia, havas@cs.uq.edu.au
Dr. Adolfy Hoisie, Scientific Computing, CIC-19, MS B256, Los Alamos National Laboratory, Los Alamos, NM 87545, hoisie@lanl.gov
Dr. P. Holleczek, Abt. Kommunikationssysteme, Regionales Rechenzentrum Erlangen, Martensstr. 1, 91058 Erlangen, peter.holleczek@rrze.uni-erlangen.de
Prof. Dr.-Ing. E. Jessen, Institut für Informatik, TU München, Augustenstr. 77, 80290 München, jessen@informatik.tu-muenchen.de
Prof. Gene Cooperman, College of Computer Science, Northeastern University, M/S 215 CN, Boston, MA 02115, gene@ccs.neu.edu
Dr. T. Eickermann, Forschungszentrum Jülich GmbH, ZAM, Leo-Brandt-Straße, 52428 Jülich, th.eickermann@fz-juelich.de
Dr. B. Lix, Hochschulrechenzentrum, Universität GH Essen, Schützenbahn 70, 45141 Essen, lix@hrz.uni-essen.de
Dr. M. Mähler, IBM Deutschland GmbH, European Network Center, Heidelberg, Vangerowstr. 18, 69115 Heidelberg, maehler@heidelbg.ibm.com
Prof. Dr. P. Martini, Rheinische Friedrich-Wilhelms-Universität Bonn, Institut für Informatik IV, Römerstrasse 164, D-53117 Bonn, peter.martini@cs.uni-bonn.de
Prof. Dr. G. Michler, Institut für Experimentelle Mathematik, Universität GH Essen, Ellernstr. 29, 45326 Essen, archiv@exp-math.uni-essen.de
Dipl.-Inf. M. Köster, Rheinische Friedrich-Wilhelms-Universität Bonn, Institut für Informatik IV, Römerstrasse 164, D-53117 Bonn, koester@cs.uni-bonn.de
Dr. H. Lederer, Rechenzentrum Garching der Max-Planck-Gesellschaft, Max-Planck-Institut für Plasmaphysik, Boltzmannstr. 2, 85748 Garching, lederer@rzg.mpg.de
Prof. E. Quintana-Orti, Departamento de Informatica, Universidad Jaime I, Campus Penyeta Roja, E-12071 Castellon, Spain, quintana@nuvol.uji.es
Prof. Dr.-Ing. E. P. Rathgeb, Institut für Experimentelle Mathematik, Universität GH Essen, Ellernstr. 29, 45326 Essen, erwin.rathgeb@exp-math.uni-essen.de
Andreas Rieke, Lehrstuhl für Kommunikationssysteme, Fachbereich Elektrotechnik, Fernuniversität Hagen, Feithstr.
142, 58084 Hagen, andreas.rieke@fernuni-hagen.de
Prof. Dr. G. Schneider, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), Am Faßberg, 37077 Göttingen, gschnei2@gwdg.de
D. Nastoll, Hochschulrechenzentrum, Universität GH Essen, Schützenbahn 70, 45141 Essen, nastoll@hrz.uni-essen.de
Prof. Dr. H. Obrecht, Lehrstuhl für Baumechanik-Statik, Universität Dortmund, Fakultät Bauwesen, August-Schmidt-Straße 8, 44221 Dortmund, msobr@busch.bauwesen.uni-dortmund.de
Dr. D. V. Pasechnik, Faculty of Technical Mathematics and Informatics, Department of Statistics, Probability and Operations, Mekelweg 4, NL-2628 CD Delft, d.pasechnik@twi.tudelft.nl
Dr. T. Plagemann, University of Oslo, UNIK, Granveien 33, P.O. Box 70, N-2007 Kjeller, plagemann@unik.no
Dr. T. Warschko, Fakultät für Informatik, Universität Karlsruhe, Am Fasanengarten 5, 76131 Karlsruhe, warschko@ira.uka.de
Dr. M. Weller, Institut für Experimentelle Mathematik, Universität GH Essen, Ellernstr. 29, 45326 Essen, eowmob@exp-math.uni-essen.de
Prof. Dr. U. Schwiegelshohn, Universität Dortmund, Lehrstuhl Datenverarbeitungssysteme, Otto-Hahn-Str. 4, 44221 Dortmund, uwe@ds.e-technik.uni-dortmund.de
Prof. Dr. U. Stammbach, ETH Zürich, Forschungsinstitut für Mathematik, ETH Zentrum, CH-8092 Zürich, stammb@math.ethz.ch
Dr. R. Staszewski, Institut für Experimentelle Mathematik, Universität GH Essen, Ellernstr. 29, 45326 Essen, reiner@exp-math.uni-essen.de
Dr. R. Völpel, GMD SCAI (Institut für Wissenschaftliches Rechnen), Schloss Birlinghoven, D-53754 Sankt Augustin, roland.voelpel@gmd.de
P. Wunderling, GMD IMK (Institut für Medienkommunikation), Schloss Birlinghoven, D-53754 Sankt Augustin, wunderling@gmd.de
R. Yahyapour, Fakultät für Elektrotechnik, Lehrstuhl für Datenverarbeitungssysteme, Otto-Hahn-Str. 4, 44221 Dortmund, yahya@peggy.e-technik.uni-dortmund.de