
Parallel Computer Architectures

Dieter an Mey, Center for Computing and Communication, RWTH Aachen University, Germany


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500




Single Processor System

- Main memory stores the data and the program.
- The processor fetches the program from memory and executes its instructions: it loads data from memory, processes the data, and writes results back to memory.
- Input/output is not covered here.

(Diagram: Memory - Proc)



Caches

[Figure omitted; source: Marc Tremblay, Sun]


Single Processor System

Caches are smaller than main memory, but much faster. They are employed to bridge the gap between the big, slow main memory and the much faster processor. The cache is invisible to the programmer; only when measuring the runtime does the effect of caches become apparent.

(Diagram: Memory - Cache - Proc)
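The cache effect can be made visible with a small runtime experiment. A minimal C sketch (illustrative only, not from the slides; array sizes and repetition counts are chosen arbitrarily): sweeping arrays of growing size at constant total work shows a jump in runtime once the working set no longer fits into the cache.

```c
/* Minimal sketch: measuring the cache effect.
 * Repeatedly sweeping over arrays of growing size shows a jump in the
 * runtime once the working set no longer fits into the cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* working sets from 4 KB up to 64 MB (in doubles) */
    for (size_t n = 512; n <= 8 * 1024 * 1024; n *= 2) {
        double *a = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        clock_t t0 = clock();
        double sum = 0.0;
        /* keep the total number of accesses constant so timings are comparable */
        for (size_t rep = 0; rep < (64 * 1024 * 1024) / n; rep++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("working set %8zu KB: %.2f s (sum=%g)\n",
               n * sizeof(double) / 1024, sec, sum);
        free(a);
    }
    return 0;
}
```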


Single Processor System


I am ignoring instruction caches, address caches (TLB), write buffers and prefetch buffers here, as data caches are most important for HPC applications.

With a growing number of transistors on each chip over time, caches can be put on the same piece of silicon: an L1 cache on chip, and an L2 cache off chip (later on chip as well).

(Diagram: Memory - L2 cache (off-chip) - L1 cache (on-chip) - Proc)


In 2005 Intel cancelled the 4 GHz Chip


Higher clock rates make processor chips more expensive, hotter and more power-hungry.


The Impact of Moore's Law


The number of transistors on a chip is still doubling every 18 months, but the clock speed is no longer growing that fast. Higher clock speed causes higher temperature and higher power consumption.

Instead, we'll see many more cores per chip!

[Chart omitted: Intel processors, clock speed (MHz) and transistor count (x1000) over time. Source: Herb Sutter, www.gotw.ca/publications/concurrency-ddj.htm]


Dual-Core Processors

Since 2005/06, Intel and AMD have been producing dual-core processors for the mass market. In 2006/07 Intel and AMD introduced quad-core processors. By 2008 it will be hard to buy a PC without a dual-core processor. Your future PC or laptop will be a parallel computer!

(Diagram: Memory - shared L2 cache - two cores, each with its own L1 cache)


Dual-Core Processors: Intel Woodcrest

Here: 4 MB shared L2 cache on chip, 2 cores with local L1 caches, and a socket for a second processor chip.

(Diagram: memory shared by two dual-core chips, each chip with its own shared L2 cache and per-core L1 caches)


Multi-Core Processors

- UltraSPARC IV: 1.2 GHz, 130 nm, ~66 million transistors, 108 W; 2 cores with 64 KB L1 each and 2 x 8 MB off-chip L2
- UltraSPARC IV+: 1.5 GHz, 90 nm, 295 million transistors, 90 W (?); 2 cores with 64 KB L1 each, 2 MB on-chip L2 and 32 MB off-chip L3
- Opteron 875: 2.2 GHz, 90 nm, 199 mm2, 233 million transistors, 95 W; 2 cores with 64 KB L1 and 1 MB L2 each
- UltraSPARC T1: 1.0 GHz, 90 nm, 378 mm2, 300 million transistors, 72 W; 8 cores with 8 KB L1 each and 3 MB shared L2


What to do with all these Threads?

Waiting in parallel.

[Figure omitted; source: Marc Tremblay, Sun]


What to do with all these Threads? (continued)

[Figure omitted; source: Marc Tremblay, Sun]


Sun Fire T2000 at Aachen

- 1 x UltraSPARC T1 @ 1 GHz with 8 cores, each with an 8 KB L1 cache; 1 FPU shared by all cores
- 4 x 0.75 MB L2 cache banks, connected to the cores by an internal crossbar (134 GB/s)
- 4 DDR2 memory controllers on chip, 25.6 GB/s memory bandwidth


Sun T5120 Eight Cores x Eight Threads


1 x UltraSPARC T2 (Niagara 2) @ 1.4 GHz

- 8 cores, each with an 8 KB L1 cache; 8 threads per core; 1 FPU per core
- 8 x 0.5 MB L2 cache banks, connected to the cores by an internal crossbar
- 4 FB DRAM memory controllers on chip, 42.7 GB/s memory bandwidth


Chip Level Parallelism

- UltraSPARC III: superscalar, single core; 4 SPARC V9 instructions/cycle; 1 active thread per core; cycle time = 1.11 ns
- UltraSPARC IV+: superscalar, dual core; 2 x 4 SPARC V9 instructions/cycle; 1 active thread per core; cycle time = 0.66 ns
- Opteron 875: superscalar, dual core; 2 x 3 x86 instructions/cycle; 1 active thread per core; cycle time = 0.45 ns
- UltraSPARC T1: single issue, 8 cores; 8 x 1 SPARC V9 instructions/cycle; 4 active threads per core, and the context switch between threads comes for free; cycle time = 1.0 ns (see the check below)
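The cycle times listed above follow directly from the clock frequencies (a quick check, not on the original slide; the 1.11 ns corresponds to an assumed 900 MHz UltraSPARC III):

\[
T_{\mathrm{cycle}} = \frac{1}{f}:\qquad
f = 1.5\,\mathrm{GHz} \Rightarrow T \approx 0.66\,\mathrm{ns},\qquad
f = 2.2\,\mathrm{GHz} \Rightarrow T \approx 0.45\,\mathrm{ns},\qquad
f = 1.0\,\mathrm{GHz} \Rightarrow T = 1.0\,\mathrm{ns}.
\]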


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


Shared Memory Parallel Computers: Uniform Memory Access (UMA)


In a shared memory parallel computer, multiple processors have access to the same main memory. (Yes, a dual-core / multi-core processor based machine is a parallel computer on a chip.)

- The crossbar adds latency.
- The architecture is not scalable.

(Diagram: one shared memory connected through a crossbar/bus to four processors, each with its own cache)


Shared Memory Parallel Computers: Non-Uniform Memory Access (NUMA)


- Faster local memory access, slower remote memory access: each processor has its own locally attached memory, and accesses to another processor's memory go over the interconnect and take longer.

(Diagram: four processors, each with its own cache and locally attached memory, connected by a crossbar/bus)


Sun Fire E2900 at Aachen


- Simplistic view, from the programmer's perspective: rather uniform memory access.
- 12 dual-core UltraSPARC IV processors @ 1.2 GHz, each with 2 x 64 KB L1 and 2 x 8 MB L2 caches and an on-chip memory controller (2.4 GB/s).
- Crossbar with 9.6 GB/s total peak memory bandwidth.


Sun Fire V40z at Aachen


- Simplistic view, from the programmer's perspective: non-uniform memory access.
- 4 dual-core Opteron processors @ 2.2 GHz, each with 2 x 64 KB L1 and 2 x 1 MB L2 caches and an on-chip DDR-400 memory controller (6.4 GB/s to the local memory).
- The processors are connected to each other by point-to-point links of 8 GB/s each.


Sun T5120 Eight Cores x Eight Threads

1 x UltraSPARC T2 (Niagara 2) @ 1.4 GHz, as shown in the processor architecture section: 8 cores x 8 threads with an 8 KB L1 cache per core, 8 x 0.5 MB L2 banks behind an internal crossbar, and 4 FB DRAM memory controllers on chip (42.7 GB/s).


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


Distributed Memory Parallel Computer / Cluster


In a distributed memory parallel computer, each processor only has access to its own main memory. Programs have to use an external network for communication and cooperation: they have to exchange messages.

(Diagram: several nodes, each with its own processor, cache and memory, connected by an external network)


MPI-Paradigm: Send - Receive

(Diagram: a send on one processor transfers data from its memory over the network to another processor, which posts a matching receive into its own memory)
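A minimal sketch of this send/receive paradigm in C with MPI (illustrative only, not taken from the course material; it assumes two processes, e.g. started with mpirun -np 2):

```c
/* Minimal MPI send/receive sketch: rank 0 sends one number, rank 1 receives it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double value = 3.14;
        /* send one double to rank 1, message tag 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double value;
        /* receive one double from rank 0, message tag 0 */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```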


MPI on Distributed Memory Parallel Computers


Typically, when using message passing with MPI, one MPI process runs on each processor (core).

MPI is a program library plus a mechanism to launch multiple cooperating executable programs. Typically it is the same binary which is started on multiple processors (SPMD = single program multiple data paradigm). MPI is the de-facto standard for message passing.

(Diagram: one MPI task per node, each with its own memory, connected by an external network)


MPI on Shared Memory Parallel Computers


MPI can be used on shared memory systems as well; the shared memory then serves as the network. Again, typically one MPI process runs on each processor (core).

MPI is formally specified for C, C++ and Fortran. All major vendors provide an MPI library for their machines, and there are free versions available. Java implementations are available too, but they are not widely used and not standardized.

(Diagram: several MPI tasks on one shared-memory machine; processors with caches connected to an interleaved memory via a crossbar/bus)


OpenMP on Shared Memory Parallel Computers


On shared memory systems, shared memory programming (e.g. with OpenMP) can be used, where typically one lightweight process (= thread) runs on each processor (core).

(Diagram: one OpenMP thread per processor; all threads share the interleaved memory behind a crossbar/bus)
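A minimal OpenMP sketch in C (illustrative only; array names and sizes are made up, and the number of threads is typically set via the OMP_NUM_THREADS environment variable):

```c
/* Minimal OpenMP sketch: the loop iterations are distributed among the
 * threads, which all work on the same shared arrays. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];   /* shared memory: no messages needed */

    printf("up to %d threads available\n", omp_get_max_threads());
    return 0;
}
```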


MPI on SMP-Clusters
Today, most clusters have SMP nodes, and MPI is well suited for this architecture.

(Diagram: two SMP nodes connected by an external network; on each node several MPI tasks run on the processors, which share the node's interleaved memory via a crossbar/bus)


Hybrid Parallelization on SMP-Clusters (MPI+OpenMP)

Hybrid parallelization combines both models: MPI is used for communication between the nodes, while OpenMP threads work within each shared-memory node (see the sketch below).

(Diagram: two SMP nodes connected by an external network; each node typically runs one MPI task whose OpenMP threads share the node's interleaved memory via a crossbar/bus)
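A minimal hybrid MPI + OpenMP sketch in C (illustrative only, not from the course material): each MPI process spawns OpenMP threads, and MPI_Init_thread (MPI-2) requests the required level of thread support.

```c
/* Minimal hybrid MPI + OpenMP sketch:
 * one MPI process per node, several OpenMP threads inside each process. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```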


Innovative: Cluster OpenMP (DSM=distributed shared memory system)


(Diagram: two SMP nodes connected by an external network; Cluster OpenMP spans OpenMP threads across both nodes, so that all threads work on one distributed shared memory)


Nodes of Today's Clusters are Shared Memory Machines with Multicore Processors
(Diagram: nodes connected by an external network; each node contains several multi-core processor chips, each with an L2 cache and per-core L1 caches, sharing the node's memory)


Networks and Topologies


Networks:
- Fast / Gigabit Ethernet
- Myrinet
- SCI
- QsNet (Quadrics)
- Infiniband
- Proprietary networks

Topologies:
- Bus
- Tree
- Fat tree
- 2D / 3D torus
- Hypercube
- Crossbar / switch


Modern Parallel Computer Architectures


- COTS (= commercial off-the-shelf) / COW (= cluster of workstations): self-made clusters with 1 or 2 dual-core processor chips per node and a cheap network (Gigabit Ethernet)
- Clusters of rack-mounted pizza boxes with 1-4 dual/quad-core processor chips and a fast network (Infiniband)
- SMP clusters with standard SMP servers and proprietary or multi-rail networks: Sun Fire Cluster, SGI Columbia (Altix nodes), ASC Purple (IBM p575 nodes)
- Supercomputers designed for high-end computing: Cray XT3, IBM BlueGene/L, Earth Simulator (NEC SX6)


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


HPC @ RZ.RWTH-AACHEN.DE
(Overview picture of the cluster: Sun Fire V40z cluster, Sun Fire T2000 systems, Xeon cluster, Sun Fire E25K cluster)


RWTH Aachen Compute Cluster


Compute cluster of RWTH Aachen (status: Feb 08)

model | processor type | #nodes | #procs | #cores | #threads | clock [MHz] | memory [GB] | accumulated performance [TFLOPS] | accumulated memory [TB]
SF E25K | UltraSPARC IV | 2 | 72 | 144 | 144 | 1050 | 288 | 0.60 | 0.58
SF E6900 | UltraSPARC IV | 8 | 24 | 48 | 48 | 1200 | 96 | 0.92 | 0.77
SF T2000 | UltraSPARC T1 | 20 | 1 | 8 | 64 | 1400 | 32 | 0.2240 | 0.64
SF T2000 | UltraSPARC T1 | 1 | 1 | 8 | 32 | 1000 | 8 | 0.0001 | 0.01
SF V40z | Opteron 848 | 64 | 4 | 4 | 4 | 2200 | 8 | 1.13 | 0.51
SF V40z | Opteron 875 | 4 | 4 | 8 | 8 | 2200 | 16 | 0.14 | 0.06
SF X4600 | Opteron 885 | 2 | 8 | 16 | 16 | 2600 | 32 | 0.17 | 0.06
Dell 1950 | Xeon 5160 (Woodcrest) | 7 | 2 | 4 | 4 | 3000 | 8 | 0.17 | 0.06
Dell 1950 | Xeon 5160 (Woodcrest) | 2 | 2 | 4 | 4 | 3000 | 16 | 0.05 | 0.03
Dell 1950 | Xeon 5160 (Woodcrest) | 4 | 2 | 8 | 8 | 2667 | 16 | 0.17 | 0.06
FujitsuSiemens RX200 | Xeon E5450 (Harpertown) | 55 | 2 | 8 | 8 | 3000 | 16 | 2.64 | 0.88
FujitsuSiemens RX200 | Xeon E5450 (Harpertown) | 5 | 2 | 8 | 8 | 3000 | 32 | 0.24 | 0.16
FujitsuSiemens RX600 | Xeon X7350 (Tigerton) | 2 | | 16 | 16 | 2930 | 64 | 0.19 | 0.13
sum | | 176 | | 1740 | 2884 | | | 6.64 | 3.95

#procs, #cores, #threads and memory [GB] are per node. Networks: Gigabit Ethernet throughout; the SF X4600 and most of the Xeon-based systems additionally use Infiniband.

System Management
Frontend nodes for interactive work, program development and testing, GUIs
- cluster.rz.RWTH-Aachen.DE = cluster-solaris.rz.RWTH-Aachen.DE = cluster-solaris-sparc.rz.RWTH-Aachen.DE
- cluster-solaris-opteron.rz.RWTH-Aachen.DE
- cluster-linux.rz.RWTH-Aachen.DE = cluster-linux-opteron.rz.RWTH-Aachen.DE
- cluster-linux-xeon.rz.RWTH-Aachen.DE
- cluster-windows.rz.RWTH-Aachen.DE = cluster-windows-xeon.rz.RWTH-Aachen.DE

Abbreviations: cl[uster], sol[aris], lin[ux], win[dows], x[eon], o[pteron], s[parc]

Batch systems: Sun Grid Engine for jobs (> 20 min), and Microsoft Compute Cluster on the Windows part, respectively.

Overview of HPC Tools


Current program development environment for HPC on the Sun SPARC, AMD Opteron and Intel Xeon systems at the RWTH.

4 platforms:
1. SPARC/Solaris 10, 64 bit
2. Opteron/Solaris, 64 bit
3. Opteron/Linux and Xeon/Linux, 64 bit
4. Opteron/Windows and Xeon/Windows, 64 bit

Covering serial programming, shared memory parallelization and message passing: compilers, MPI libraries, debugging tools, performance analysis tools.


Programming Environment: Compilers + Debugging Tools


Company | Compiler / Version | Languages | Debugger | Runtime analysis
Sun | Studio 12 | F95/C/C++ | dbx, Sun Studio thread analyzer | analyzer, collect, er_print, gprof
Intel | V10.0 | F95/C++ | idb, Threading Tools | VTune
GNU | V4.0 / V4.2 | F95/C++ | gdb | gprof
PGI | V7.1 | F77/F90/C/C++ | pgdbg | pgprof
Microsoft | Visual Studio 2003 / Visual Studio 2005 | C++ | Visual Studio |
Etnus | TotalView 8.3 | | TotalView |

The compilers support OpenMP and automatic parallelization to varying degrees and are available on the Sparc (Solaris), Opteron (Solaris/Linux/Windows) and Xeon (Linux/Windows) platforms.


Sun Fire Cluster Programming Environment: MPI Implementations and Tools


Provider | Version | MPI-2 support | Debugger | Runtime analysis | Platform | Network
Sun | HPC ClusterTools 6 | yes | TotalView | analyzer, mpprof | Solaris 10 (Opteron + Sparc) | tcp, shm
Sun | HPC ClusterTools 7.1 (based on Open MPI) | yes | TotalView | | Solaris 10 (Opteron + Sparc) | tcp, shm, ib
Intel | Version 3.1 (based on mpich2) | (yes) | TotalView | TraceCollector & Analyzer (former Vampir) | Linux | tcp, shm, ib
ANL | mpich 1.2.6 | no | TotalView | jumpshot | Sol, Lin, Win | tcp, shm
ANL | mpich2 1.0.x | yes (tcp) | TotalView | jumpshot | Sol, Lin, Win | tcp, shm
Open MPI (public domain) | OpenMPI 1.2.5 (based on FT-MPI, LA-MPI, LAM, PACX) | yes | TotalView | | ? | tcp, myr, Infiniband
Microsoft | CCS V1 (based on mpich2) | (yes) | Visual Studio w/ MS Compute Cluster Pack | | Windows | tcp, shm, (Infiniband)
Univ Dresden | Vampir-NG | ? | | Vampir-NG trace analysis | | any
VI-HPS | multiple research tools | | | | | any


Overview

- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500


Measuring Performance: the Linpack Benchmark


- The theoretical peak performance is determined by the clock rate and the number of floating point operations per cycle.
- The actual floating point performance can be determined by the LINPACK benchmark (www.top500.org), which solves a linear equation system with a full coefficient matrix of arbitrary size.
- The unit of measurement is M[ega]flops = million floating point operations per second; G[iga]flops, T[era]flops, P[eta]flops (= 10^9, 10^12, 10^15 flops).
- The Top500 list of the fastest supercomputers is updated twice per year.
- Latest No. 1 (28th list, Nov. 2006): IBM BlueGene/L at Lawrence Livermore National Laboratory (LLNL), 131072 processors, 32 TB total memory; peak: 367 Tflops = 367000 Gflops; Linpack: 280 Tflops = 76% of peak; matrix size N = 1,769,471.
- For comparison, a dual-core Intel Xeon 5160 (Woodcrest) at 3 GHz: 2 cores * 4 flops/cycle (SSE) * 3 GHz = 24 Gflops (see the worked formula below).
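As a worked check of that last number (my notation, not from the slide):

\[
R_{\mathrm{peak}} = n_{\mathrm{cores}} \times \frac{\mathrm{flops}}{\mathrm{cycle}} \times f
= 2 \times 4 \times 3\,\mathrm{GHz} = 24\,\mathrm{Gflop/s}.
\]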

The TOP500 List


[Chart omitted: Linpack performance in Gflops (logarithmic scale, 1 to 1,000,000) from Jun 1993 to Jun 2006 for TOP500 ranks 1, 50, 200 and 500, together with the RWTH systems' peak and Linpack performance (Fujitsu VPP300, later the Sun Fire Cluster at Aachen University); the growth follows Moore's Law / PC technology.]


The current Top 20 (Nov 07)


Rank | Site (Country) | System | Interconnect | Procs | Rmax [Gflops] | % of peak | Processor (MHz)
1 | LLNL (USA) | IBM Blue Gene/L | proprietary | 212992 | 478200 | 80.18 | PowerPC 440 (700)
2 | FZ Juelich (Germany) | IBM Blue Gene/P | proprietary | 65536 | 167300 | 75.08 | PowerPC 450 (850)
3 | SGI/NMC (USA) | SGI Altix | Infiniband | 14336 | 126900 | 73.77 | Xeon 53xx Clovertown (3000)
4 | TATA SONS (India) | HP Cluster Platform 3000 | Infiniband DDR | 14240 | 117900 | 69.00 | Xeon 53xx Clovertown (3000)
5 | Government Agency (Sweden) | HP Cluster Platform 3000 | Infiniband DDR | 13728 | 102800 | 70.20 | Xeon 53xx Clovertown (2667)
6 | Sandia National Laboratories (USA) | Cray XT3 | XT3 proprietary | 26569 | 102200 | 80.14 | Opteron dual-core (2400)
7 | Oak Ridge National Laboratory (USA) | Cray XT | XT3 proprietary | 23016 | 101700 | 85.21 | Opteron dual-core (2600)
8 | IBM Watson (USA) | IBM Blue Gene/L | proprietary | 40960 | 91290 | 79.60 | PowerPC 440 (700)
9 | NERSC/LBNL (USA) | Cray XT | XT3 proprietary | 19320 | 85368 | 84.97 | Opteron dual-core (2600)
10 | Stony Brook (USA) | IBM Blue Gene/L | proprietary | 36864 | 82161 | 79.60 | PowerPC 440 (700)
11 | LLNL (USA) | IBM pSeries | Federation | 12208 | 75760 | 81.65 | POWER5 (1900)
12 | Rensselaer (USA) | IBM Blue Gene/L | proprietary | 32768 | 73032 | 79.60 | PowerPC 440 (700)
13 | Barcelona (Spain) | IBM BladeCenter cluster | Myrinet | 10240 | 63830 | 67.75 | PowerPC 970 (2300)
14 | NCSA (USA) | Dell PowerEdge cluster | Infiniband SDR | 9600 | 62680 | 69.97 | Xeon 53xx Clovertown (2333)
15 | Leibniz Rechenzentrum (Germany) | SGI Altix 4700 | NUMAlink | 9728 | 56520 | 90.78 | Itanium 2 (1600)
16 | GSIC, Tokyo Institute of Technology (Japan) | NEC/Sun Sun Fire cluster | Infiniband | 11664 | 56430 | 55.31 | Opteron dual-core (2400)
17 | Univ Edinburgh (UK) | Cray XT | XT3 proprietary | 11328 | 54648 | 86.15 | Opteron dual-core (2800)
18 | Sandia National Laboratories (USA) | Dell PowerEdge cluster | Infiniband | 9024 | 53000 | 81.57 | Xeon EM64T (3600)
19 | CEA (France) | Bull NovaScale SMP cluster | Quadrics | 9968 | 52840 | 82.83 | Itanium 2 (1600)
20 | NASA/Ames (USA) | SGI Altix | NUMAlink/IB | 10160 | 51870 | 85.09 | Itanium 2 (1500)

Aachen on Rank 180 in June 2005 http://www.rz.rwth-aachen.de/hpc/sun/


Over 2 TeraFlop/s Linpack performance (April 2005): the upgrade from UltraSPARC III to UltraSPARC IV, together with an increase of the main memory capacity, more than doubled our Linpack performance! A linear system with 499,200 unknowns was solved in 11:12:48.8 hours at an average speed of 2054.4 billion floating point operations per second (GFlop/s). The program had a total memory footprint of 2 Terabyte; 1276 processor cores were kept busy with roughly 8.3 x 10^16 floating point operations.
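A quick consistency check of these numbers (my arithmetic, using the standard LU factorization operation count, not from the slide):

\[
\tfrac{2}{3} N^3 \approx \tfrac{2}{3}\,(499{,}200)^3 \approx 8.3\times 10^{16}\ \mathrm{flop},
\qquad
\frac{8.3\times 10^{16}\ \mathrm{flop}}{2054.4\times 10^{9}\ \mathrm{flop/s}} \approx 40{,}400\ \mathrm{s} \approx 11.2\ \mathrm{h}.
\]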


Future Parallel Computers


- With a growing number of cores per chip, the SMP box is shrinking. A few years from now, many or all applications will be multi-threaded.
- SMP boxes with a small footprint will be the building blocks of large systems.
- Memory hierarchies will grow (L3 caches ...).
- Network latency will be close to 1 µs, network bandwidth several GB/s.
- Current research: Distributed Shared Memory (Cluster OpenMP ...), combining the advantage of SMP with the scalability of DMP.
- In 2008/9: Petaflop/s systems by IBM, Cray, NEC, ...
  Woodcrest ~ 24 Gflop/s @ ~100 W => 1 PFlop/s @ ~4 MW (see the estimate below).
  Main problems: power supply and cooling.
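The 4 MW figure follows from a simple scaling estimate (my arithmetic, based only on the numbers above):

\[
\frac{10^{15}\ \mathrm{flop/s}}{24\times 10^{9}\ \mathrm{flop/s\ per\ chip}} \approx 42{,}000\ \mathrm{chips},
\qquad
42{,}000 \times 100\,\mathrm{W} \approx 4.2\,\mathrm{MW}.
\]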

Some Web-links
- Information related to HPC at the RWTH: http://www.rz.rwth-aachen.de/hpc/
- Information related to MPI at the RWTH: http://www.rz.rwth-aachen.de/mpi/
- Sun Fire SMP Cluster Primer: http://www.rz.rwth-aachen.de/hpc/primer
- Web page of the SunHPC and VI-HPS workshops with more links and information: http://www.rz.rwth-aachen.de/sunhpc

Joint SunHPC Seminar (March 3-4) and VI-HPS Tuning Workshop (March 5-7)
