Dieter an Mey Center for Computing and Communication RWTH Aachen University, Germany
Overview
[Figure: a processor (Proc) connected to memory]
The processor fetches the program from memory and executes its instructions: it loads data from memory, processes the data, and writes the results back to memory.
Caches
Caches are smaller than main memory, but much faster. They bridge the gap between the large, slow main memory and the much faster processor. The cache is invisible to the programmer: its effect becomes apparent only when measuring the runtime.
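The runtime effect can be made visible without timing anything, by counting misses in a toy cache model. The following is a minimal sketch, not a model of any real processor: the direct-mapped organisation, the 32 KB capacity, the 64-byte line size and the name `count_misses` are assumptions chosen for illustration.

```python
def count_misses(addresses, cache_bytes=32 * 1024, line_bytes=64):
    """Count misses of a toy direct-mapped cache for a sequence of byte addresses."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines            # memory block currently held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes       # memory block that contains this address
        index = block % num_lines        # the only cache line this block may occupy
        if tags[index] != block:         # not in the cache: fetch from main memory
            tags[index] = block
            misses += 1
    return misses


n = 16 * 1024                            # 8-byte elements, 128 KB in total
# Walk through the array element by element.
sequential = [8 * i for i in range(n)]
# Visit the same elements once each, but jump 64 bytes (one cache line) per step.
strided = [8 * (8 * m + k) for k in range(8) for m in range(n // 8)]

print(count_misses(sequential))          # 2048: one miss per cache line
print(count_misses(strided))             # 16384: every access misses
```

Both loops touch exactly the same data. Only the access order differs, which is why the programmer sees the cache only through the runtime.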
[Figure: a cache placed between memory and the processor (Proc)]
The number of transistors on a chip is still doubling every 18 months, but the clock speed is no longer growing that fast. Higher clock speeds cause higher temperatures and higher power consumption.
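As a worked example of the doubling rule, the growth factor after a given number of years is 2 raised to (months elapsed / 18). A small sketch; the function name `growth_factor` is made up for illustration:

```python
def growth_factor(years, doubling_months=18):
    """Factor by which a quantity grows if it doubles every `doubling_months` months."""
    return 2 ** (years * 12 / doubling_months)


print(growth_factor(3))    # 4.0: two doublings in three years
print(growth_factor(15))   # 1024.0: ten doublings in fifteen years
```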
[Figure: clock speed (MHz) and transistors (x1000)]
Dual-Core Processors
Since 2005/6, Intel and AMD have been producing dual-core processors for the mass market. In 2006/7, Intel and AMD introduced quad-core processors. By 2008 it will be hard to buy a PC without a dual-core processor. Your future PC or laptop will be a parallel computer!
[Figure: memory attached to a dual-core chip; two cores, each with its own L1 cache, share an L2 cache]
[Figure: memory attached to two processor chips; each chip has two cores with L1 caches and an L2 cache]
Here: 4 MB shared cache on chip, 2 cores with local L1 cache, and a socket for a second processor chip.
Multi-Core Processors
Processor        Clock     Process   Die area   Transistors    Power
UltraSPARC IV    1.2 GHz   130 nm    -          ~66 million    108 Watt
UltraSPARC IV+   1.5 GHz   90 nm     -          295 million    90 Watt (?)
Opteron 875      2.2 GHz   90 nm     199 mm²    233 million    95 Watt
UltraSPARC T1    1.0 GHz   90 nm     378 mm²    300 million    72 Watt
[Figure: memory and cache hierarchy of each processor; labels: 2x8MB, 32 MB, 2 MB L2, 64KB 64KB, 8 cores]
Waiting in parallel
[Figure: processor at 1 GHz with 8 cores, each with an 8 KB L1 cache, an FPU, and four memory blocks]
[Figure: processor at 1.4 GHz with an internal crossbar and 8 cores; 8 threads per core, 8 KB L1 per core, 1 FPU per core]
= 0.66 ns
= 0.45 ns
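The two values above match, to two digits, the cycle times of the 1.5 GHz and 2.2 GHz processors listed earlier, using cycle time = 1 / clock frequency. That this is what the original slide computed is an assumption. A minimal sketch of the arithmetic, with `cycle_time_ns` as a hypothetical helper name:

```python
def cycle_time_ns(clock_ghz):
    """Duration of one clock cycle in nanoseconds for a clock rate given in GHz."""
    return 1.0 / clock_ghz


print(f"{cycle_time_ns(1.5):.2f} ns")   # 0.67 ns (0.66 ns when truncated instead of rounded)
print(f"{cycle_time_ns(2.2):.2f} ns")   # 0.45 ns
```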
[Figure: processors (Proc), each with its own cache, connected by a crossbar / bus]
[Figure: memory blocks attached to dual-core chips; labels: 2x8MB, 64KB 64KB]
- simplistic view
- programmer's perspective
- non-uniform memory access
[Figure: four dual-core chips at 2.2 GHz, 1 MB cache per core; DDR 400 memory, memory controller on chip; labels: 8 GB/s, 64KB 64KB]
[Figure: three processors, each with its own cache and its own memory, connected by an external network]
[Figure: one processor sends from its memory, another processor receives into its memory; both are connected by a network]
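The send/receive picture can be imitated in a few lines. This is only a sketch of the programming discipline, not of MPI itself: Python threads actually share one address space, and the names `sender`, `receiver` and `network` are made up for illustration. The point is that the receiver never touches the sender's data directly; it sees only what arrives as an explicit message.

```python
import queue
import threading


def sender(channel, local_data):
    # local_data belongs to the sender; it leaves only through explicit sends.
    for item in local_data:
        channel.put(item)          # "send"
    channel.put(None)              # sentinel: no more messages


def receiver(channel, result):
    total = 0
    while True:
        item = channel.get()       # "receive": blocks until a message arrives
        if item is None:
            break
        total += item
    result.append(total)


network = queue.Queue()            # stands in for the network between two tasks
result = []
t_send = threading.Thread(target=sender, args=(network, [1, 2, 3, 4]))
t_recv = threading.Thread(target=receiver, args=(network, result))
t_send.start()
t_recv.start()
t_send.join()
t_recv.join()
print(result[0])                   # 10
```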
[Figure: three MPI tasks, each in its own memory, running on separate processors with caches]
Java implementations are available, too, but they are neither widely used nor standardized.
[Figure: four processors (Proc) connected by a crossbar / bus]
MPI on SMP Clusters
Today, most clusters have SMP nodes, and MPI is well suited to this architecture.
[Figure: eight processors, each with its own cache, connected to an external network]
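On a cluster of SMP nodes, two MPI tasks either share a node or have to communicate over the external network. The sketch below illustrates that distinction under a simple assumption: tasks are placed block-wise, filling all cores of one node before moving to the next. Real placement is decided by the MPI launcher and the batch system; `place_ranks` and `same_node` are hypothetical helper names.

```python
def place_ranks(num_ranks, cores_per_node):
    """Block placement: return (rank, node, core) for every MPI rank."""
    return [(rank, rank // cores_per_node, rank % cores_per_node)
            for rank in range(num_ranks)]


def same_node(rank_a, rank_b, cores_per_node):
    """True if both ranks sit on the same SMP node under block placement."""
    return rank_a // cores_per_node == rank_b // cores_per_node


for rank, node, core in place_ranks(8, 4):
    print(f"rank {rank} -> node {node}, core {core}")

print(same_node(0, 3, 4))   # True: both on node 0, no network hop needed
print(same_node(3, 4, 4))   # False: the message crosses the external network
```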
[Figure: two SMP nodes connected by an external network; in each node, four processors with caches share interleaved memory through a crossbar / bus]
Nodes of Today's Clusters Are Shared-Memory Machines with Multicore Processors
[Figure: nodes connected by an external network; each node has memory and dual-core chips whose cores have their own L1 caches and share an L2 cache]
HPC @ RZ.RWTH-AACHEN.DE
SunFire V40z Cluster
[Table: cluster nodes; row alignment was lost in extraction, values are listed in source order]
#nodes: 2 8 20 1 64 4 2 7 2 4 55 5
model: SF E25K, SF E6900, SF T2000, SF T2000, SF V40z, SF V40z, SF X4600, Dell 1950
network: Gigabit Ethernet, Gigabit Ethernet, Gigabit Ethernet, Gigabit Ethernet, Gigabit Ethernet, Gigabit Ethernet, Infiniband, Gigabit Ethernet, Gigabit Ethernet, Infiniband, Gigabit Ethernet, Infiniband, Gigabit Ethernet, Infiniband, Gigabit Ethernet, Infiniband, Gigabit Ethernet, Infiniband, Gigabit Ethernet
unlabeled values: 2 176; 0.19 6.64; 0.16 0.13 3.95
System Management
Frontend nodes for interactive work, program development and testing, GUIs
cluster.rz.RWTH-Aachen.DE = cluster-solaris.rz.RWTH-Aachen.DE = cluster-solaris-sparc.rz.RWTH-Aachen.DE
cluster-solaris-opteron.rz.RWTH-Aachen.DE
cluster-linux.rz.RWTH-Aachen.DE = cluster-linux-opteron.rz.RWTH-Aachen.DE
cluster-linux-xeon.rz.RWTH-Aachen.DE
cluster-windows.rz.RWTH-Aachen.DE = cluster-windows-xeon.rz.RWTH-Aachen.DE
Abbreviations: cl[uster], sol[aris], lin[ux], win[dows], x[eon], o[pteron], s[parc]
Batch systems for jobs (> 20 min): Sun Grid Engine and Microsoft Compute Cluster, respectively
Parallel Computer Architectures
Solaris Lin Win dbx sunstudio thread analyzer idb analyzer, collect, er_print, gprof vtune gprof gprof pgprof
Sun
Studio 12
F95/C/C++
F95/C++
F95/C++
V10.0
F95/C++
F95/C++
X X X X
X X
V4.0 F95/C++ gdb V4.2 F95/C++ F95/C++ gdb V7.1 F77/F90/C/C++ F77/F90/C/C++ F77/F90/C/C++ pgdbg Visual Studio Microsoft C++ Visual Studio 2003 Visual Studio Microsoft C++ C++ Visual Studio 2005 Etnus TotalView 8.3
X X X X X
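[Table: MPI libraries; partly flattened in extraction, rows matched by source order]
Vendor          MPI library
Sun             HPC ClusterTools 6
Sun             HPC ClusterTools 7.1, based on Open MPI
Intel           Version 3.1, based on mpich2
ANL             mpich 1.2.6
ANL             mpich2 1.0.x
Open MPI p.d.   OpenMPI 1.2.5, based on FT-MPI, LA-MPI, LAM, PACX
Microsoft       CCS V1, based on mpich2
Further vendor entries without a recoverable library cell: Univ Dresden, VI-HPS
Remaining cells, row alignment not recoverable: yes; TotalView; analyzer, mpprof; tcp, shm; yes; tcp, shm, ib; (yes) ?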
[Figure: performance plot; axis in Gflops with ticks 100.000, 10.000, 1.000]
[Table: cell text is truncated in the source and kept as extracted]
Rank  Site       Manufac   Computer   Country  Procs
1     LLNL       IBM       Blue Gen   USA      212992
2     FZ Juelic  IBM       Blue Gen   Germany  65536
3     SGI/NMC    SGI       SGI Altix  USA      14336
4     TATA SO    HP        Cluster P  India    14240
5     Gov Age    HP        Cluster P  Sweden   13728
6     Sandia N   Cray Inc  Sandia/    USA      26569
7     Oak Ridg   Cray Inc  Cray XT    USA      23016
8     IBM Wat    IBM       Blue Gen   USA      40960
9     NERSC/     Cray Inc  Cray XT    USA      19320
10    Stony Br   IBM       Blue Gen   USA      36864
11    LLNL       IBM       pSeries    USA      12208
12    Renssela   IBM       Blue Gen   USA      32768
13    Barcelon   IBM       BladeCe    Spain    10240
14    NCSA       Dell      PowerEd    USA      9600
15    Leibniz R  SGI       Altix 470  Germany  9728
16    GSIC, TI   NEC/Sun   Sun Fire   Japan    11664
17    Univ Edi   Cray Inc  Cray XT    UK       11328
18    Sandia N   Dell      PowerEd    USA      9024
19    CEA        Bull SA   NovaSca    France   9968
20    NASA/A     SGI       SGI Altix  USA      10160
Some Web Links
Information related to HPC at the RWTH
http://www.rz.rwth-aachen.de/hpc/
Information related to MPI at the RWTH
http://www.rz.rwth-aachen.de/mpi/
Sun Fire SMP Cluster Primer
http://www.rz.rwth-aachen.de/hpc/primer
Web page of the SunHPC and VI-HPS workshops with more links and information
http://www.rz.rwth-aachen.de/sunhpc
Joint SunHPC Seminar (March 3-4) and VI-HPS (March 5-7) Tuning Workshop