Lecturer Details

Name: Anne Meade
Work: Lero, the Irish Software Engineering Research Centre
Research Interests: Distributed Computing, Parallel Programming, Data Decomposition, Software Design
Contact: anne.meade@ul.ie, Tierney Building, Office T2-027
Acknowledgements

Parts of this course draw on MPI courses made available by the following:
- ICHEC, the Irish Centre for High-End Computing, Ireland
- NCI, Australia's National Research Computing Service
- CINECA Supercomputing Group, Italy
- Simlab Education Program (funded from the Stability Pact for South Eastern Europe)
- Training and Education Centre at the Edinburgh Parallel Computing Centre (EPCC-TEC)
Course Outline

Weeks 1-4:
- Introduction to Parallel Programming
- Distributed Programming Model
- Focus on MPI (Message Passing Interface)
- Mixture of lectures and hands-on lab exercises
- Assignment, Week 2: C coding exercise (warm-up exercise, not included in the overall grade)

Week 5: Mid-term lab sessions
- Lab session 1: Students will be required to convert a C-based serial Game of Life application to parallel by manually writing MPI calls.
  Purpose: to test knowledge of MPI and distributed computing.
  Grade: 12.5% (11% for correct functionality, 1.5% bonus marks for completion within the timeframe).
- Lab session 2: Students will use tooling support to semi-automate the generation of MPI and will convert a C-based serial Game of Life application to parallel by invoking the generated code.
  Purpose: to test understanding of the parallel intent of a program and the ability to reason about automation.
  Grade: 12.5% (11% for correct functionality, 1.5% bonus marks for completion within the timeframe).
Course Resources

MPI links:
- http://www.mpi-forum.org/docs/docs.html
- http://www.mpi-forum.org/docs/mpi1-report.pdf
- http://www.mcs.anl.gov/research/projects/mpi/

MPI books (can be used as references; however, the notes and web links should prove sufficient):
- Parallel Programming with MPI, Pacheco, Peter S. (1997): 2 copies available in the library
- Parallel Programming in C with MPI and OpenMP, Quinn, Michael J. (2007 & 2003): 1 copy available in the library (2007), 1 copy available in the library (2003)
1. Background to HPC
Also: Animation/Games

Example: Rendering a scene in a blockbuster movie
- Rendering: taking information from animation files (lighting, textures, shading) and applying it to 3D models to generate the 2D images that make up the frames of a film
- Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film
- Toy Story (Pixar, 1995): 100 dual-processor machines, 200 processing cores
- Monsters University (Pixar, 2013): 2,000 computers, 24,000 processing cores
  - 100 million CPU hours to render the film
  - 1 CPU would have taken 10,000 years to finish!
Memory Systems

Shared Memory
- Processors share the address space and can communicate with each other by writing and reading shared variables

Uniform Memory Access (UMA)
- All processors share a connection to a common memory and have equal access times (latency is the same for each CPU)
- As seen in Symmetric Multiprocessor (SMP) machines

[Fig. 1: Uniform Memory Access (UMA)]  [Fig. 2: Non-Uniform Memory Access (NUMA)]
Hybrid Systems

Hybrid Distributed Memory
- Combination of both distributed and shared memory

[Fig. 4: Hybrid Memory]

- Shares the advantages and disadvantages of both shared and distributed systems
- Can also use GPUs (Graphics Processing Units), known as accelerators, to speed up calculations
- Accelerators are separate from the system and do not share memory with the host system
Top500
www.top500.org
- Ranking of the most powerful supercomputers in the world
- Since 1993, parallel machine performance has been measured and recorded with the LINPACK benchmark (dense linear algebra)
Rank  System                Site                                                        Cores      Perf. (TFLOP/s)
1     Tianhe-2 (NUDT)       National University of Defence Technology, China            3,120,000  33,862.7
2     Titan (Cray)          DOE/SC/Oak Ridge National Laboratory, US                    560,640    17,590.0
3     Sequoia (IBM)         DOE/NNSA/LLNL, US                                           1,572,864  17,173.2
4     K Computer (Fujitsu)  RIKEN Advanced Institute for Computational Science, Japan   705,024    10,512.0
5     Mira (IBM)            DOE/SC/Argonne National Lab, US                             786,432    8,586.6
Speedup
- The execution time depends on what the program does
- A parallel program spends time in:
  - Work
  - Synchronization
  - Communication
  - Extra work (overheads)
- A program implemented for a parallel machine is likely to do extra work (compared to a sequential program)
- The goal of speedup is to use P processors to make a program run P times faster
- Speedup is the factor by which the program's speed improves:

      Speedup(P) = T(1) / T(P)

  where T(1) is the execution time on one processor and T(P) is the execution time on P processors
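For example (a worked illustration with numbers of our own, not from the slides): if a program takes T(1) = 100 s on one processor and T(8) = 25 s on 8 processors, the speedup is 100/25 = 4, well short of the ideal factor of 8 because of the overheads listed above.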
Limits to Speedup
- All parallel programs contain:
  - Parallel regions
  - Serial regions
- Practical limits:
  - Communication overhead
  - Synchronization overhead
  - Extra operations needed for the parallel version
- Amdahl's Law governs the maximum speedup from using N parallel processors on a problem, versus using only one serial processor:

      Speedup of an application = 1 / ((1 - P) + P/N)

  where
  - P represents the parallel part
  - 1 - P represents the serial part (non-parallelizable)
  - N represents the number of processors
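For example (a worked illustration with numbers of our own, not from the slides): with P = 0.95 and N = 8, the speedup is 1 / (0.05 + 0.95/8) ≈ 5.9. Even with infinitely many processors, the speedup can never exceed 1/(1 - P) = 1/0.05 = 20.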
2. Introduction to MPI

Note: All coding samples are in the language C
Message Passing
- Designed for implementation on distributed memory models
- Unlike the shared memory model, resources are local
- A process is a program performing a task on a processor
- Each process operates in its own environment (logical address space), and communication occurs via the exchange of messages
- Each process in a message passing program runs an instance/copy of a program
- The message passing scheme can also be implemented on shared memory architectures
- Messages can be instructions, data, or synchronization signals
MPI: Message Passing Interface
- A specification for the developers and users of message passing libraries
- Addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations
- Portable, with Fortran and C/C++ interfaces

History:
- 1992: MPI draft proposal; MPI Forum founded
- 1994: Final version of MPI-1.0 released
- 1996: MPI-2
- 2012: MPI-3.0
- Many implementations: MPICH, LAM, OpenMPI
Work Distribution
- MPI processes need identifiers: rank = identifying number
- All distribution decisions are based on the rank, i.e. which process works on what data (see the sketch after the diagram below)
[Diagram: ranks 0, 1, 2, ..., (size-1), each running a copy of the program on its own portion of the data, connected by a communication network]
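A minimal sketch of rank-based work distribution in C (our own illustration; the array size N and variable names are assumptions, and the MPI routines used here are introduced later in this section):

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000                        /* total number of elements (illustrative) */

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */

        /* Each rank claims its own contiguous chunk of the data */
        int chunk = N / size;
        int lo = rank * chunk;
        int hi = (rank == size - 1) ? N : lo + chunk;   /* last rank takes any remainder */
        printf("Rank %d of %d handles elements %d to %d\n", rank, size, lo, hi - 1);

        MPI_Finalize();
        return 0;
    }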
Message Passing
- Messages are packets of data (e.g. a data item Y) moving between sub-programs
- The communication system must allow the following three operations (a sketch follows the diagram below):
  - send(message)
  - receive(message)
  - synchronization
[Diagram: rank 0 executes send(Y) and rank 2 executes receive(Y); the message Y travels between them over the communication network while the other ranks continue with their own data]
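A minimal send/receive sketch in C (our own illustration; MPI_Send and MPI_Recv are covered in detail later, and the tag and variable names here are assumptions). Run it with at least 3 processes, e.g. mpirun -n 3 prog:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, Y = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            Y = 42;                                           /* the data to send */
            MPI_Send(&Y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* send Y to rank 2 */
        } else if (rank == 2) {
            MPI_Recv(&Y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                      /* receive Y from rank 0 */
            printf("Rank 2 received Y = %d\n", Y);
        }

        MPI_Finalize();
        return 0;
    }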
Communicators
- The communicator that includes all processes is called MPI_COMM_WORLD
- MPI_COMM_WORLD is a predefined object in mpi.h and is therefore automatically defined
- All MPI communication routines have a communicator argument
- A programmer can define many communicators at the same time (see the sketch after the diagram below)
- MPI_Comm_size shows how many processes are contained within a communicator
[Diagram: a communicator containing 8 processes, P0-P7]
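A sketch of defining an additional communicator (our own illustration using MPI_Comm_split, which is one way to create sub-communicators; the even/odd split is an assumption, not from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int world_rank, sub_rank;
        MPI_Comm Comm1;   /* a second communicator alongside MPI_COMM_WORLD */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Split MPI_COMM_WORLD into two communicators: even and odd world ranks */
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &Comm1);
        MPI_Comm_rank(Comm1, &sub_rank);
        printf("World rank %d has rank %d in its sub-communicator\n", world_rank, sub_rank);

        MPI_Comm_free(&Comm1);
        MPI_Finalize();
        return 0;
    }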
MPI_Comm_rank
- Determines the rank of the calling process in the communicator
- The mpi.h header file is necessary

Syntax:
    int MPI_Comm_rank(MPI_Comm comm, int *rank)

Input:  comm  Communicator (handle)
Output: rank  Rank of the calling process in the group of comm (integer)
[Diagram: processes P0-P7 in MPI_COMM_WORLD; a sub-communicator Comm1 contains four of them with ranks 0-3, and MPI_Comm_rank(Comm1, &rank) returns each process's rank within Comm1]
MPI_Comm_size
- Returns the size of the group associated with a communicator

Syntax:
    int MPI_Comm_size(MPI_Comm comm, int *size)

Input:  comm  Communicator (handle)
Output: size  Number of processes in the group of comm (integer)
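A minimal usage sketch combining MPI_Comm_size and MPI_Comm_rank (our own illustration, not from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes in the communicator */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank, 0..size-1 */
        if (rank == size - 1)
            printf("I am the last process (%d of %d)\n", rank, size);
        MPI_Finalize();
        return 0;
    }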
Initializing MPI
- MPI_Init is the first routine that must be called in an MPI program
- Every MPI program must call this routine once, before any other MPI routines
- Making multiple calls to MPI_Init is erroneous
- The C version of the routine accepts the arguments to the main function (argc and argv):

Syntax:
    int MPI_Init(int *argc, char ***argv)

Example use:
    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        . . .
    }
MPI_Finalize
- An MPI program should call the MPI routine MPI_Finalize when all communications have completed
- This routine cleans up all MPI data structures, etc.
- It does not cancel outstanding communications, so it is the responsibility of the programmer to make sure all communications have completed
- Once this routine has been called, no other calls can be made to MPI routines, not even MPI_Init, so a process cannot later re-enrol in MPI

Syntax:
    int MPI_Finalize()

Use:
    MPI_Finalize();
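Putting the lifecycle together, a minimal skeleton (our own illustration, not from the slides):

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);   /* first MPI call, made exactly once */

        /* ... all MPI communication happens here and must complete ... */

        MPI_Finalize();           /* last MPI call; no MPI routines may follow */
        return 0;
    }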
Exercise 1
- Write a program that prints "hello world" once per running process
  - Compile and run on one processor
  - Run on several processors in parallel
- Modify your program so that:
  - Every process writes its rank and the size of MPI_COMM_WORLD
  - Only the process ranked 0 in MPI_COMM_WORLD prints "hello world"
- Is the sequence of the output deterministic (in order)?
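One possible solution sketch (ours, not the official solution; note that MPI does not guarantee the order in which output from different ranks appears):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Process %d of %d\n", rank, size);
        if (rank == 0)
            printf("hello world\n");   /* only rank 0 prints this */

        MPI_Finalize();
        return 0;
    }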
Compilation

Compiling and executing a C program:
    gcc -o prog prog.c
    ./prog

Compiling and executing MPI in C:
    mpicc prog.c -o prog
    mpirun -n num prog