
Parallel Computer Architectures

Dieter an Mey, Center for Computing and Communication, RWTH Aachen University, Germany


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500




Single Processor System

- Main memory stores the data and the program.
- The processor fetches the program from memory and executes its instructions: it loads data from memory, processes the data, and writes results back to memory.
- Input/output is not covered here.

(Diagram: Memory - Proc)



Caches

[Figure omitted; source: Marc Tremblay, Sun]


Single Processor System

Caches are smaller than main memory, but much faster. They are employed to bridge the gap between the big, slow main memory and the much faster processor. The cache is invisible to the programmer; only when measuring the runtime does the effect of caches become apparent.

(Diagram: Memory - Cache - Proc)
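The cache effect can be made visible with a small runtime experiment. A minimal C sketch (illustrative only, not from the slides; array sizes and repetition counts are chosen arbitrarily): sweeping arrays of growing size at constant total work shows a jump in runtime once the working set no longer fits into the cache.

```c
/* Minimal sketch: measuring the cache effect.
 * Repeatedly sweeping over arrays of growing size shows a jump in the
 * runtime once the working set no longer fits into the cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* working sets from 4 KB up to 64 MB (in doubles) */
    for (size_t n = 512; n <= 8 * 1024 * 1024; n *= 2) {
        double *a = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        clock_t t0 = clock();
        double sum = 0.0;
        /* keep the total number of accesses constant so timings are comparable */
        for (size_t rep = 0; rep < (64 * 1024 * 1024) / n; rep++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("working set %8zu KB: %.2f s (sum=%g)\n",
               n * sizeof(double) / 1024, sec, sum);
        free(a);
    }
    return 0;
}
```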


Single Processor System


I am ignoring instruction caches, address caches (TLB), write buffers and prefetch buffers here, as data caches are most important for HPC applications.

With a growing number of transistors on each chip over time, caches can be put on the same piece of silicon: an L1 cache on chip, and an L2 cache off chip (later on chip as well).

(Diagram: Memory - L2 cache (off-chip) - L1 cache (on-chip) - Proc)


In 2005 Intel cancelled the 4 GHz Chip


Higher clock rates make processor chips more expensive, hotter and more power-hungry.


The Impact of Moore's Law


The number of transistors on a chip is still doubling every 18 months, but the clock speed is no longer growing that fast. Higher clock speed causes higher temperature and higher power consumption.

Instead, we'll see many more cores per chip!

[Chart omitted: Intel processors, clock speed (MHz) and transistor count (x1000) over time. Source: Herb Sutter, www.gotw.ca/publications/concurrency-ddj.htm]


Dual-Core Processors

Since 2005/06, Intel and AMD have been producing dual-core processors for the mass market. In 2006/07 Intel and AMD introduced quad-core processors. By 2008 it will be hard to buy a PC without a dual-core processor. Your future PC or laptop will be a parallel computer!

(Diagram: Memory - shared L2 cache - two cores, each with its own L1 cache)


Dual-Core Processors: Intel Woodcrest

Here: 4 MB shared L2 cache on chip, 2 cores with local L1 caches, and a socket for a second processor chip.

(Diagram: memory shared by two dual-core chips, each chip with its own shared L2 cache and per-core L1 caches)


Multi-Core Processors

- UltraSPARC IV: 1.2 GHz, 130 nm, ~66 million transistors, 108 W; 2 cores with 64 KB L1 each and 2 x 8 MB off-chip L2
- UltraSPARC IV+: 1.5 GHz, 90 nm, 295 million transistors, 90 W (?); 2 cores with 64 KB L1 each, 2 MB on-chip L2 and 32 MB off-chip L3
- Opteron 875: 2.2 GHz, 90 nm, 199 mm2, 233 million transistors, 95 W; 2 cores with 64 KB L1 and 1 MB L2 each
- UltraSPARC T1: 1.0 GHz, 90 nm, 378 mm2, 300 million transistors, 72 W; 8 cores with 8 KB L1 each and 3 MB shared L2


What to do with all these Threads?

Waiting in parallel.

[Figure omitted; source: Marc Tremblay, Sun]


What to do with all these Threads? (continued)

[Figure omitted; source: Marc Tremblay, Sun]


Sun Fire T2000 at Aachen

- 1 x UltraSPARC T1 @ 1 GHz with 8 cores, each with an 8 KB L1 cache; 1 FPU shared by all cores
- 4 x 0.75 MB L2 cache banks, connected to the cores by an internal crossbar (134 GB/s)
- 4 DDR2 memory controllers on chip, 25.6 GB/s memory bandwidth


Sun T5120 Eight Cores x Eight Threads


1 x UltraSPARC T2 (Niagara 2) @ 1.4 GHz

- 8 cores, each with an 8 KB L1 cache; 8 threads per core; 1 FPU per core
- 8 x 0.5 MB L2 cache banks, connected to the cores by an internal crossbar
- 4 FB DRAM memory controllers on chip, 42.7 GB/s memory bandwidth


Chip Level Parallelism

- UltraSPARC III: superscalar, single core; 4 SPARC V9 instructions/cycle; 1 active thread per core; cycle time = 1.11 ns
- UltraSPARC IV+: superscalar, dual core; 2 x 4 SPARC V9 instructions/cycle; 1 active thread per core; cycle time = 0.66 ns
- Opteron 875: superscalar, dual core; 2 x 3 x86 instructions/cycle; 1 active thread per core; cycle time = 0.45 ns
- UltraSPARC T1: single issue, 8 cores; 8 x 1 SPARC V9 instructions/cycle; 4 active threads per core, and the context switch between threads comes for free; cycle time = 1.0 ns (see the check below)
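The cycle times listed above follow directly from the clock frequencies (a quick check, not on the original slide; the 1.11 ns corresponds to an assumed 900 MHz UltraSPARC III):

\[
T_{\mathrm{cycle}} = \frac{1}{f}:\qquad
f = 1.5\,\mathrm{GHz} \Rightarrow T \approx 0.66\,\mathrm{ns},\qquad
f = 2.2\,\mathrm{GHz} \Rightarrow T \approx 0.45\,\mathrm{ns},\qquad
f = 1.0\,\mathrm{GHz} \Rightarrow T = 1.0\,\mathrm{ns}.
\]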


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


Shared Memory Parallel Computers: Uniform Memory Access (UMA)


In a shared memory parallel computer, multiple processors have access to the same main memory. (Yes, a dual-core / multi-core processor based machine is a parallel computer on a chip.)

- The crossbar adds latency.
- The architecture is not scalable.

(Diagram: one shared memory connected through a crossbar/bus to four processors, each with its own cache)


Shared Memory Parallel Computers: Non-Uniform Memory Access (NUMA)


- Faster local memory access, slower remote memory access: each processor has its own locally attached memory, and accesses to another processor's memory go over the interconnect and take longer.

(Diagram: four processors, each with its own cache and locally attached memory, connected by a crossbar/bus)


Sun Fire E2900 at Aachen


- Simplistic view, from the programmer's perspective: rather uniform memory access.
- 12 dual-core UltraSPARC IV processors @ 1.2 GHz, each with 2 x 64 KB L1 and 2 x 8 MB L2 caches and an on-chip memory controller (2.4 GB/s).
- Crossbar with 9.6 GB/s total peak memory bandwidth.


Sun Fire V40z at Aachen


- Simplistic view, from the programmer's perspective: non-uniform memory access.
- 4 dual-core Opteron processors @ 2.2 GHz, each with 2 x 64 KB L1 and 2 x 1 MB L2 caches and an on-chip DDR-400 memory controller (6.4 GB/s to the local memory).
- The processors are connected to each other by point-to-point links of 8 GB/s each.


Sun T5120 Eight Cores x Eight Threads

1 x UltraSPARC T2 (Niagara 2) @ 1.4 GHz, as shown in the processor architecture section: 8 cores x 8 threads with an 8 KB L1 cache per core, 8 x 0.5 MB L2 banks behind an internal crossbar, and 4 FB DRAM memory controllers on chip (42.7 GB/s).


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


Distributed Memory Parallel Computer / Cluster


In a distributed memory parallel computer, each processor only has access to its own main memory. Programs have to use an external network for communication and cooperation: they have to exchange messages.

(Diagram: several nodes, each with its own processor, cache and memory, connected by an external network)


MPI-Paradigm: Send - Receive

(Diagram: a send on one processor transfers data from its memory over the network to another processor, which posts a matching receive into its own memory)
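A minimal sketch of this send/receive paradigm in C with MPI (illustrative only, not taken from the course material; it assumes two processes, e.g. started with mpirun -np 2):

```c
/* Minimal MPI send/receive sketch: rank 0 sends one number, rank 1 receives it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double value = 3.14;
        /* send one double to rank 1, message tag 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double value;
        /* receive one double from rank 0, message tag 0 */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```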


MPI on Distributed Memory Parallel Computers


Typically, when using message passing with MPI, one MPI process runs on each processor (core).

MPI is a program library plus a mechanism to launch multiple cooperating executable programs. Typically it is the same binary which is started on multiple processors (SPMD = single program multiple data paradigm). MPI is the de-facto standard for message passing.

(Diagram: one MPI task per node, each with its own memory, connected by an external network)


MPI on Shared Memory Parallel Computers


MPI can be used on shared memory systems as well; the shared memory then serves as the network. Again, typically one MPI process runs on each processor (core).

MPI is formally specified for C, C++ and Fortran. All major vendors provide an MPI library for their machines, and there are free versions available. Java implementations are available too, but they are not widely used and not standardized.

(Diagram: several MPI tasks on one shared-memory machine; processors with caches connected to an interleaved memory via a crossbar/bus)


OpenMP on Shared Memory Parallel Computers


On shared memory systems, shared memory programming (e.g. with OpenMP) can be used, where typically one lightweight process (= thread) runs on each processor (core).

(Diagram: one OpenMP thread per processor; all threads share the interleaved memory behind a crossbar/bus)
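A minimal OpenMP sketch in C (illustrative only; array names and sizes are made up, and the number of threads is typically set via the OMP_NUM_THREADS environment variable):

```c
/* Minimal OpenMP sketch: the loop iterations are distributed among the
 * threads, which all work on the same shared arrays. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];   /* shared memory: no messages needed */

    printf("up to %d threads available\n", omp_get_max_threads());
    return 0;
}
```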


MPI on SMP-Clusters
Today, most clusters have SMP nodes, and MPI is well suited for this architecture.

(Diagram: two SMP nodes connected by an external network; on each node several MPI tasks run on the processors, which share the node's interleaved memory via a crossbar/bus)


Hybrid Parallelization on SMP-Clusters (MPI+OpenMP)

Hybrid parallelization combines both models: MPI is used for communication between the nodes, while OpenMP threads work within each shared-memory node (see the sketch below).

(Diagram: two SMP nodes connected by an external network; each node typically runs one MPI task whose OpenMP threads share the node's interleaved memory via a crossbar/bus)
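A minimal hybrid MPI + OpenMP sketch in C (illustrative only, not from the course material): each MPI process spawns OpenMP threads, and MPI_Init_thread (MPI-2) requests the required level of thread support.

```c
/* Minimal hybrid MPI + OpenMP sketch:
 * one MPI process per node, several OpenMP threads inside each process. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```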


Innovative: Cluster OpenMP (DSM=distributed shared memory system)


(Diagram: two SMP nodes connected by an external network; Cluster OpenMP spans OpenMP threads across both nodes, so that all threads work on one distributed shared memory)


Nodes of Today's Clusters are Shared Memory Machines with Multicore Processors
(Diagram: nodes connected by an external network; each node contains several multi-core processor chips, each with an L2 cache and per-core L1 caches, sharing the node's memory)


Networks and Topologies


Networks:
- Fast / Gigabit Ethernet
- Myrinet
- SCI
- QsNet (Quadrics)
- Infiniband
- Proprietary networks

Topologies:
- Bus
- Tree
- Fat tree
- 2D / 3D torus
- Hypercube
- Crossbar / switch


Modern Parallel Computer Architectures


- COTS (= commercial off-the-shelf) / COW (= cluster of workstations): self-made clusters with 1 or 2 dual-core processor chips per node and a cheap network (Gigabit Ethernet)
- Clusters of rack-mounted pizza boxes with 1-4 dual/quad-core processor chips and a fast network (Infiniband)
- SMP clusters with standard SMP servers and proprietary or multi-rail networks: Sun Fire Cluster, SGI Columbia (Altix nodes), ASC Purple (IBM p575 nodes)
- Supercomputers designed for high-end computing: Cray XT3, IBM BlueGene/L, Earth Simulator (NEC SX6)


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


HPC @ RZ.RWTH-AACHEN.DE
(Overview picture of the cluster: Sun Fire V40z cluster, Sun Fire T2000 systems, Xeon cluster, Sun Fire E25K cluster)


RWTH Aachen Compute Cluster


Compute cluster of RWTH Aachen (status: Feb 08)

model | processor type | #nodes | #procs | #cores | #threads | clock [MHz] | memory [GB] | accumulated performance [TFLOPS] | accumulated memory [TB]
SF E25K | UltraSPARC IV | 2 | 72 | 144 | 144 | 1050 | 288 | 0.60 | 0.58
SF E6900 | UltraSPARC IV | 8 | 24 | 48 | 48 | 1200 | 96 | 0.92 | 0.77
SF T2000 | UltraSPARC T1 | 20 | 1 | 8 | 64 | 1400 | 32 | 0.2240 | 0.64
SF T2000 | UltraSPARC T1 | 1 | 1 | 8 | 32 | 1000 | 8 | 0.0001 | 0.01
SF V40z | Opteron 848 | 64 | 4 | 4 | 4 | 2200 | 8 | 1.13 | 0.51
SF V40z | Opteron 875 | 4 | 4 | 8 | 8 | 2200 | 16 | 0.14 | 0.06
SF X4600 | Opteron 885 | 2 | 8 | 16 | 16 | 2600 | 32 | 0.17 | 0.06
Dell 1950 | Xeon 5160 (Woodcrest) | 7 | 2 | 4 | 4 | 3000 | 8 | 0.17 | 0.06
Dell 1950 | Xeon 5160 (Woodcrest) | 2 | 2 | 4 | 4 | 3000 | 16 | 0.05 | 0.03
Dell 1950 | Xeon 5160 (Woodcrest) | 4 | 2 | 8 | 8 | 2667 | 16 | 0.17 | 0.06
FujitsuSiemens RX200 | Xeon E5450 (Harpertown) | 55 | 2 | 8 | 8 | 3000 | 16 | 2.64 | 0.88
FujitsuSiemens RX200 | Xeon E5450 (Harpertown) | 5 | 2 | 8 | 8 | 3000 | 32 | 0.24 | 0.16
FujitsuSiemens RX600 | Xeon X7350 (Tigerton) | 2 | | 16 | 16 | 2930 | 64 | 0.19 | 0.13
sum | | 176 | | 1740 | 2884 | | | 6.64 | 3.95

#procs, #cores, #threads and memory [GB] are per node. Networks: Gigabit Ethernet throughout; the SF X4600 and most of the Xeon-based systems additionally use Infiniband.

System Management
Frontend nodes for interactive work, program development and testing, GUIs
- cluster.rz.RWTH-Aachen.DE = cluster-solaris.rz.RWTH-Aachen.DE = cluster-solaris-sparc.rz.RWTH-Aachen.DE
- cluster-solaris-opteron.rz.RWTH-Aachen.DE
- cluster-linux.rz.RWTH-Aachen.DE = cluster-linux-opteron.rz.RWTH-Aachen.DE
- cluster-linux-xeon.rz.RWTH-Aachen.DE
- cluster-windows.rz.RWTH-Aachen.DE = cluster-windows-xeon.rz.RWTH-Aachen.DE

Abbreviations: cl[uster], sol[aris], lin[ux], win[dows], x[eon], o[pteron], s[parc]

Batch systems: Sun Grid Engine for jobs (> 20 min), and Microsoft Compute Cluster on the Windows part, respectively.

Overview of HPC Tools


Current program development environment for HPC on the Sun SPARC, AMD Opteron and Intel Xeon systems at the RWTH.

4 platforms:
1. SPARC/Solaris 10, 64 bit
2. Opteron/Solaris, 64 bit
3. Opteron/Linux and Xeon/Linux, 64 bit
4. Opteron/Windows and Xeon/Windows, 64 bit

Covering serial programming, shared memory parallelization and message passing: compilers, MPI libraries, debugging tools, performance analysis tools.


Programming Environment: Compilers + Debugging Tools


Company | Compiler / Version | Languages | Debugger | Runtime analysis
Sun | Studio 12 | F95/C/C++ | dbx, Sun Studio thread analyzer | analyzer, collect, er_print, gprof
Intel | V10.0 | F95/C++ | idb, Threading Tools | VTune
GNU | V4.0 / V4.2 | F95/C++ | gdb | gprof
PGI | V7.1 | F77/F90/C/C++ | pgdbg | pgprof
Microsoft | Visual Studio 2003 / Visual Studio 2005 | C++ | Visual Studio |
Etnus | TotalView 8.3 | | TotalView |

The compilers support OpenMP and automatic parallelization to varying degrees and are available on the Sparc (Solaris), Opteron (Solaris/Linux/Windows) and Xeon (Linux/Windows) platforms.


Sun Fire Cluster Programming Environment: MPI Implementations and Tools


Provider | Version | MPI-2 support | Debugger | Runtime analysis | Platform | Network
Sun | HPC ClusterTools 6 | yes | TotalView | analyzer, mpprof | Solaris 10 (Opteron + Sparc) | tcp, shm
Sun | HPC ClusterTools 7.1 (based on Open MPI) | yes | TotalView | | Solaris 10 (Opteron + Sparc) | tcp, shm, ib
Intel | Version 3.1 (based on mpich2) | (yes) | TotalView | TraceCollector & Analyzer (former Vampir) | Linux | tcp, shm, ib
ANL | mpich 1.2.6 | no | TotalView | jumpshot | Sol, Lin, Win | tcp, shm
ANL | mpich2 1.0.x | yes (tcp) | TotalView | jumpshot | Sol, Lin, Win | tcp, shm
Open MPI (public domain) | OpenMPI 1.2.5 (based on FT-MPI, LA-MPI, LAM, PACX) | yes | TotalView | | ? | tcp, myr, Infiniband
Microsoft | CCS V1 (based on mpich2) | (yes) | Visual Studio w/ MS Compute Cluster Pack | | Windows | tcp, shm, (Infiniband)
Univ Dresden | Vampir-NG | ? | | Vampir-NG trace analysis | | any
VI-HPS | multiple research tools | | | | | any


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


Measuring Performance: the Linpack Benchmark


- The theoretical peak performance is determined by the clock rate and the number of floating point operations per cycle.
- The actual floating point performance can be determined by the LINPACK benchmark (www.top500.org), which solves a linear equation system with a full coefficient matrix of arbitrary size.
- The unit of measurement is M[ega]flops = million floating point operations per second; G[iga]flops, T[era]flops, P[eta]flops (= 10^9, 10^12, 10^15 flops).
- The Top500 list of the fastest supercomputers is updated twice per year.
- Latest No. 1 (28th list, Nov. 2006): IBM BlueGene/L at Lawrence Livermore National Laboratory (LLNL), 131072 processors, 32 TB total memory; peak: 367 Tflops = 367000 Gflops; Linpack: 280 Tflops = 76% of peak; matrix size N = 1,769,471.
- For comparison, a dual-core Intel Xeon 5160 (Woodcrest) at 3 GHz: 2 cores * 4 flops/cycle (SSE) * 3 GHz = 24 Gflops (see the worked formula below).
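As a worked check of that last number (my notation, not from the slide):

\[
R_{\mathrm{peak}} = n_{\mathrm{cores}} \times \frac{\mathrm{flops}}{\mathrm{cycle}} \times f
= 2 \times 4 \times 3\,\mathrm{GHz} = 24\,\mathrm{Gflop/s}.
\]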

The TOP500 List


[Chart omitted: Linpack performance in Gflops (logarithmic scale, 1 to 1,000,000) from Jun 1993 to Jun 2006 for TOP500 ranks 1, 50, 200 and 500, together with the RWTH systems' peak and Linpack performance (Fujitsu VPP300, later the Sun Fire Cluster at Aachen University); the growth follows Moore's Law / PC technology.]


The current Top 20 (Nov 07)


Rank | Site (Country) | System | Interconnect | Procs | Rmax [Gflops] | % of peak | Processor (MHz)
1 | LLNL (USA) | IBM Blue Gene/L | proprietary | 212992 | 478200 | 80.18 | PowerPC 440 (700)
2 | FZ Juelich (Germany) | IBM Blue Gene/P | proprietary | 65536 | 167300 | 75.08 | PowerPC 450 (850)
3 | SGI/NMC (USA) | SGI Altix | Infiniband | 14336 | 126900 | 73.77 | Xeon 53xx Clovertown (3000)
4 | TATA SONS (India) | HP Cluster Platform 3000 | Infiniband DDR | 14240 | 117900 | 69.00 | Xeon 53xx Clovertown (3000)
5 | Government Agency (Sweden) | HP Cluster Platform 3000 | Infiniband DDR | 13728 | 102800 | 70.20 | Xeon 53xx Clovertown (2667)
6 | Sandia National Laboratories (USA) | Cray XT3 | XT3 proprietary | 26569 | 102200 | 80.14 | Opteron dual-core (2400)
7 | Oak Ridge National Laboratory (USA) | Cray XT | XT3 proprietary | 23016 | 101700 | 85.21 | Opteron dual-core (2600)
8 | IBM Watson (USA) | IBM Blue Gene/L | proprietary | 40960 | 91290 | 79.60 | PowerPC 440 (700)
9 | NERSC/LBNL (USA) | Cray XT | XT3 proprietary | 19320 | 85368 | 84.97 | Opteron dual-core (2600)
10 | Stony Brook (USA) | IBM Blue Gene/L | proprietary | 36864 | 82161 | 79.60 | PowerPC 440 (700)
11 | LLNL (USA) | IBM pSeries | Federation | 12208 | 75760 | 81.65 | POWER5 (1900)
12 | Rensselaer (USA) | IBM Blue Gene/L | proprietary | 32768 | 73032 | 79.60 | PowerPC 440 (700)
13 | Barcelona (Spain) | IBM BladeCenter cluster | Myrinet | 10240 | 63830 | 67.75 | PowerPC 970 (2300)
14 | NCSA (USA) | Dell PowerEdge cluster | Infiniband SDR | 9600 | 62680 | 69.97 | Xeon 53xx Clovertown (2333)
15 | Leibniz Rechenzentrum (Germany) | SGI Altix 4700 | NUMAlink | 9728 | 56520 | 90.78 | Itanium 2 (1600)
16 | GSIC, Tokyo Institute of Technology (Japan) | NEC/Sun Sun Fire cluster | Infiniband | 11664 | 56430 | 55.31 | Opteron dual-core (2400)
17 | Univ Edinburgh (UK) | Cray XT | XT3 proprietary | 11328 | 54648 | 86.15 | Opteron dual-core (2800)
18 | Sandia National Laboratories (USA) | Dell PowerEdge cluster | Infiniband | 9024 | 53000 | 81.57 | Xeon EM64T (3600)
19 | CEA (France) | Bull NovaScale SMP cluster | Quadrics | 9968 | 52840 | 82.83 | Itanium 2 (1600)
20 | NASA/Ames (USA) | SGI Altix | NUMAlink/IB | 10160 | 51870 | 85.09 | Itanium 2 (1500)

Aachen on Rank 180 in June 2005 http://www.rz.rwth-aachen.de/hpc/sun/


Over 2 TeraFlop/s Linpack performance (April 2005): the upgrade from UltraSPARC III to UltraSPARC IV, together with an increase of the main memory capacity, more than doubled our Linpack performance! A linear system with 499,200 unknowns was solved in 11:12:48.8 hours at an average speed of 2054.4 billion floating point operations per second (GFlop/s). The program had a total memory footprint of 2 Terabyte; 1276 processor cores were kept busy with roughly 8.3 x 10^16 floating point operations.
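A quick consistency check of these numbers (my arithmetic, using the standard LU factorization operation count, not from the slide):

\[
\tfrac{2}{3} N^3 \approx \tfrac{2}{3}\,(499{,}200)^3 \approx 8.3\times 10^{16}\ \mathrm{flop},
\qquad
\frac{8.3\times 10^{16}\ \mathrm{flop}}{2054.4\times 10^{9}\ \mathrm{flop/s}} \approx 40{,}400\ \mathrm{s} \approx 11.2\ \mathrm{h}.
\]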


Future Parallel Computers


- With a growing number of cores per chip, the SMP box is shrinking. A few years from now, many or all applications will be multi-threaded.
- SMP boxes with a small footprint will be the building blocks of large systems.
- Memory hierarchies will grow (L3 caches ...).
- Network latency will be close to 1 µs, network bandwidth several GB/s.
- Current research: Distributed Shared Memory (Cluster OpenMP ...), combining the advantage of SMP with the scalability of DMP.
- In 2008/9: Petaflop/s systems by IBM, Cray, NEC, ...
  Woodcrest ~ 24 Gflop/s @ ~100 W => 1 PFlop/s @ ~4 MW (see the estimate below).
  Main problems: power supply and cooling.
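The 4 MW figure follows from a simple scaling estimate (my arithmetic, based only on the numbers above):

\[
\frac{10^{15}\ \mathrm{flop/s}}{24\times 10^{9}\ \mathrm{flop/s\ per\ chip}} \approx 42{,}000\ \mathrm{chips},
\qquad
42{,}000 \times 100\,\mathrm{W} \approx 4.2\,\mathrm{MW}.
\]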

Some Web-links
- Information related to HPC at the RWTH: http://www.rz.rwth-aachen.de/hpc/
- Information related to MPI at the RWTH: http://www.rz.rwth-aachen.de/mpi/
- Sun Fire SMP Cluster Primer: http://www.rz.rwth-aachen.de/hpc/primer
- Web page of the SunHPC and VI-HPS workshops with more links and information: http://www.rz.rwth-aachen.de/sunhpc

Joint SunHPC Seminar (March 3-4) and VI-HPS Tuning Workshop (March 5-7)
