
Welcome

Parallelism in Software Development


Lecturer Details
Name: Anne Meade
Work: Lero, the Irish Software Engineering Research Centre
Research Interests: Distributed Computing, Parallel Programming, Data Decomposition, Software Design
Contact: anne.meade@ul.ie, Tierney Building, Office T2-027


Acknowledgements

Parts of this course draw on MPI courses made available by the following:
ICHEC - Irish Centre for High End Computing, Ireland
NCI, Australia's National Research Computing Service
CINECA Supercomputing Group, Italy
Simlab Education Program (funded by the Stability Pact for South Eastern Europe)
Training and Education Centre at the Edinburgh Parallel Computing Centre (EPCC-TEC)




Course Outline
Weeks 1-4:
Introduction to Parallel Programming
Distributed Programming Model
Focus on MPI (Message Passing Interface)
Mixture of lectures and hands-on lab exercises
Assignment, Week 2 - C coding exercise (warm-up exercise, not included in overall grade)

Week 5: Mid Term lab session 1
Students will be required to convert a C-based serial Game of Life application to parallel by manually writing MPI calls
Purpose: To test knowledge of MPI and distributed computing
Grade: 12.5% (11% for correct functionality, 1.5% bonus marks for completion within the timeframe)

Week 6: Mid Term lab session 2
Students will use tooling support to semi-automate generation of MPI and will convert a C-based serial Game of Life application to parallel by invoking generated code
Purpose: To test understanding of the parallel intent of a program and the ability to reason about automation
Grade: 12.5% (11% for correct functionality, 1.5% bonus marks for completion within the timeframe)


Course Resources
MPI links:
http://www.mpi-forum.org/docs/docs.html
http://www.mpi-forum.org/docs/mpi1-report.pdf
http://www.mcs.anl.gov/research/projects/mpi/
MPI books (can be used as references; however, the notes and web links should prove sufficient):
Parallel Programming with MPI - Pacheco, Peter S. (1997) - 2 copies available in the library
Parallel Programming in C with MPI and OpenMP - Quinn, Michael J. (2007 & 2003) - 1 copy of each edition (2007 and 2003) available in the library


1. Background to HPC


High Performance Computing (HPC)


Computers are used to model physical systems in many fields of science, medicine and engineering. Typical problems driving the need for HPC come from domains such as:
Climate modeling
Fluid turbulence
Ocean circulation
Semiconductor modeling
Protein folding
Signal processing
Combustion systems


Also Animation/Games
Example: Rendering a scene in a blockbuster movie
Rendering: taking information from animation files (lighting, textures, shading) and applying it to 3D models to generate the 2D images that make up the frames of the film
Parallel computing is essential to generate the number of frames (24 per second) needed for a feature-length film
Toy Story (Pixar, 1995): 100 dual-processor machines, 200 processing cores
Monsters University (Pixar, 2013): 2,000 computers, 24,000 processing cores

- 100 million CPU hours to render the film
- 1 CPU would have taken 10,000 years to finish!


Memory Systems
Shared Memory
Processors share the address space and can communicate with each other by writing and reading shared variables
Uniform Memory Access (UMA)
All processors share a connection to a common memory and have equal access times (latency is the same for each CPU)
As seen in Symmetric Multiprocessor (SMP) machines

Non Uniform Memory Access (NUMA)


Memory is shared but not all processors have equal access time to all memories (latency is determined by physical distance from the CPU)
Does not scale well

Fig. 1 Uniform Memory Access (UMA); Fig. 2 Non Uniform Memory Access (NUMA)



Shared Memory System


Advantages:
Global address space provides a user-friendly programming perspective on memory (such as with the OpenMP API; see the sketch below)
Data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs
Disadvantages:
Need for cache coherency (i.e. ensuring a local cache has the most up-to-date copy of a shared memory resource)
Lack of scalability between memory and CPUs: adding more CPUs can increase traffic on the shared memory-CPU path
Programmer is responsible for synchronization constructs that ensure "correct" access of global memory (i.e. prevent race conditions)
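To make the shared-address-space idea concrete, here is a minimal sketch in C (my illustration, not from the original slides) using the OpenMP API mentioned above: all threads update the shared variable sum, and the reduction clause supplies the necessary synchronization. The file name and loop bounds are illustrative assumptions.

#include <stdio.h>
#include <omp.h>                       /* OpenMP header */

int main(void) {
    int sum = 0;
    /* All threads share the address space, so they can all update sum;
       reduction(+:sum) provides the required synchronization. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }
    printf("sum = %d\n", sum);         /* prints 5050 */
    return 0;
}

Compile with, e.g., gcc -fopenmp sum.c -o sum (sum.c being a hypothetical file name).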


Distributed Memory Systems


Distributed Memory
Each process has its own address space and communicates with other processes by message passing
Each processor has its own local memory

Fig.3. Distributed Memory



Distributed Memory System


Advantages:
Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionately
Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency
Cost effectiveness: can use commodity, off-the-shelf processors and networking
Disadvantages:
Programmer is responsible for mapping data structures across nodes
Programmer is responsible for coordinating communication between nodes when remote data is required in a local computation (called message passing)
Access to remote memory is significantly slower than to local memory
Currently, only low-level programming APIs (such as MPI) are available to perform message passing between nodes


Hybrid Systems
Hybrid Distributed Memory
Combination of both distributed and shared memory


Fig. 4 Hybrid Memory

Shares the advantages and disadvantages of both shared and distributed systems
Can use GPUs (Graphics Processing Units), also known as accelerators, to speed up calculations
Accelerators are separate from the host system and do not share memory with it



Top500
www.top500.org
Ranking of the most powerful supercomputers in the world
Since 1993, parallel machine performance has been measured and recorded with the LINPACK benchmark (dense linear algebra)
Rank  Site                                                        System                Cores      Perf. (TFLOPS)*
1     National University of Defence Technology, China           Tianhe-2 (NUDT)       3,120,000  33,862.7
2     DOE/SC/Oak Ridge National Laboratory, US                   Titan (Cray)          560,640    17,590.0
3     DOE/NNSA/LLNL, US                                          Sequoia (IBM)         1,572,864  17,173.2
4     RIKEN Advanced Institute for Computational Science, Japan  K Computer (Fujitsu)  705,024    10,512.0
5     DOE/SC/Argonne National Lab, US                            Mira (IBM)            786,432    8,586.6

* TFLOPS: 10^12 (tera) floating point operations per second



Speedup
The execution time depends on what the program does
A parallel program spends time in:
Work
Synchronization
Communication
Extra work (overheads)
A program implemented for a parallel machine is likely to do extra work compared to a sequential program

The goal of speedup is to use P processors to make a program run P times faster
Speedup is the factor by which the program's speed improves:
Speedup(P) = T(1) / T(P), where T(1) is the execution time on one processor and T(P) is the execution time on P processors


Limits to Speedup
All parallel programs contain:
Parallel regions
Serial regions
Practical limits:
Communication overhead
Synchronization overhead
Extra operations needed for the parallel version
Amdahl's Law governs the maximum speedup obtainable by using parallel processors on a problem, versus using only one serial processor
Speedup of an application = 1 / ((1 - P) + P / N)
where P represents the parallel part, 1 - P represents the serial (non-parallelizable) part, and N represents the number of processors
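For example, with a program that is 95% parallelizable (P = 0.95) running on N = 8 processors, the maximum speedup is 1 / (0.05 + 0.95/8) ≈ 5.9, well short of the ideal factor of 8; even with unlimited processors the speedup can never exceed 1 / 0.05 = 20. (These numbers are an illustrative example, not from the original slides.)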


2. Introduction to MPI
Note: All coding samples are in the language C


Message passing
Designed for implementation on distributed memory models
Unlike the shared memory model, resources are local
A process is a program performing a task on a processor
Each process operates in its own environment (logical address space) and communication occurs via the exchange of messages
Each process in a message passing program runs an instance/copy of the program
The message passing scheme can also be implemented on shared memory architectures
Messages can be instructions, data or synchronization signals


MPI
Message Passing Interface
A specification for the developers and users of message passing libraries
Addresses the message-passing parallel programming model
Data is moved from the address space of one process to that of another process through cooperative operations
Portable, with Fortran and C/C++ interfaces
History:
1992: MPI draft proposal, MPI Forum founded
1994: Final version of MPI-1.0 released
1996: MPI-2
2012: MPI-3.0
Many implementations: MPICH, LAM, OpenMPI

Work distribution
MPI processes need identifiers: rank = identifying number
All distribution decisions are based on the rank, i.e. which process works on what data (see the sketch after the figure below)


[Figure: each rank (0, 1, 2, ..., size-1) runs its own copy of the program on its own data; all processes are connected by a communication network]
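A minimal sketch (my illustration, not from the slides) of rank-based work distribution in C: each process uses its rank to pick its own slice of the data. It relies on MPI_Init, MPI_Comm_rank, MPI_Comm_size and MPI_Finalize, which are introduced later in this section; the problem size N and the block-partitioning scheme are illustrative assumptions.

#include <stdio.h>
#include <mpi.h>

#define N 1000                              /* hypothetical total amount of work */

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my identifying number     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Each rank works on the block [start, end) of the data;
       the last rank also takes any remainder. */
    int chunk = N / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? N : start + chunk;

    printf("rank %d of %d works on elements %d to %d\n",
           rank, size, start, end - 1);

    MPI_Finalize();
    return 0;
}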


Message passing
Messages are packets of data (e.g. Y in the figure) moving between sub-programs
The communication system must allow the following three operations:
send(message)
receive(message)
synchronization
(A sketch of send and receive in MPI follows the figure below.)


[Figure: rank 0 calls send(Y) and rank 2 calls receive(Y); every rank (0 to size-1) runs the program on its own data, connected by the communication network]
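A minimal sketch (my illustration, not from the slides) of the send/receive shown in the figure, using MPI's point-to-point calls, which are covered in detail later in the course: rank 0 sends the value Y to rank 2. The value of Y and the use of a double are illustrative; the program needs at least 3 processes to run.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    double Y = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        Y = 3.14;                                            /* rank 0 owns Y */
        MPI_Send(&Y, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);   /* send(Y)       */
    } else if (rank == 2) {
        MPI_Recv(&Y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         /* receive(Y)    */
        printf("rank 2 received Y = %f\n", Y);
    }

    MPI_Finalize();
    return 0;
}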


Basic features of MPI Programs


An MPI program consists of multiple instances of a serial program that communicate by library calls
Calls can be roughly divided into four classes:
Calls used to initialize, manage and terminate communications
Calls used to communicate between pairs of processes (point-to-point communication)
Calls used to communicate among groups of processes (collective communication)
Calls used to create datatypes
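For orientation (representative routines chosen here for illustration, not listed on this slide): the first class includes MPI_Init and MPI_Finalize, the second MPI_Send and MPI_Recv, the third MPI_Bcast and MPI_Reduce, and the fourth MPI_Type_contiguous.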

The MPI_COMM_WORLD Communicator


It is possible to divide the total number of processes into groups called communicators
A communicator is a variable identifying a group of processes that are allowed to communicate with each other
The communicator that includes all processes is called MPI_COMM_WORLD
MPI_COMM_WORLD is a predefined object in mpi.h and is therefore automatically defined
All MPI communication routines have a communicator argument
A programmer can define many communicators at the same time
MPI_Comm_size shows how many processes are contained within a communicator
[Figure: a communicator containing 8 processes, P0-P7]

MPI_Comm_rank
Determines the rank of the calling process in the communicator
Syntax: int MPI_Comm_rank(MPI_Comm comm, int *rank)
The mpi.h header file is necessary
Input: comm - communicator (handle)
Output: rank - rank of the calling process in the group of comm (integer)

[Figure: MPI_COMM_WORLD containing processes P0-P7; a sub-communicator Comm1 whose members have ranks 0-3; the call MPI_Comm_rank(Comm1, &rank) returns each calling process's rank within Comm1]


MPI_Comm_size
Returns the size of the group associated with a communicator

Syntax: int MPI_Comm_size(MPI_Comm comm, int *size)
Input: comm - communicator (handle)
Output: size - number of processes in the group of comm (integer)

[Figure: MPI_COMM_WORLD containing 8 processes, P0-P7 (size = 8); a sub-communicator Comm1 containing 4 of them (size = 4)]
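As a sketch of how a sub-communicator like the Comm1 in the two figures could arise, the example below (my illustration, not from the slides) splits MPI_COMM_WORLD into two halves with MPI_Comm_split, a routine not otherwise covered here, and then queries rank and size in both communicators. The even/odd split is an arbitrary choice.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split MPI_COMM_WORLD into two groups (even ranks, odd ranks);
       each group becomes a new communicator, here called comm1. */
    MPI_Comm comm1;
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comm1);

    int sub_rank, sub_size;
    MPI_Comm_rank(comm1, &sub_rank);        /* rank inside comm1 */
    MPI_Comm_size(comm1, &sub_size);        /* size of comm1     */

    printf("world rank %d of %d -> rank %d of %d in group %d\n",
           world_rank, world_size, sub_rank, sub_size, color);

    MPI_Comm_free(&comm1);
    MPI_Finalize();
    return 0;
}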


Initializing MPI
MPI_Init is the first routine that must be called in an MPI program
Every MPI program must call this routine once, before any other MPI routines
Making multiple calls to MPI_Init is erroneous
The C version of the routine accepts the arguments to the main function (argc and argv):
int MPI_Init(int *argc, char ***argv)

Example use:
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    . . .
}

MPI_Finalize
An MPI program should call the MPI routine MPI_Finalize when all communications have completed
This routine cleans up all MPI data structures, etc.
It does not cancel outstanding communications, so it is the responsibility of the programmer to make sure all communications have completed
Once this routine has been called, no other calls can be made to MPI routines, not even MPI_Init, so a process cannot later re-enrol in MPI
Syntax:
int MPI_Finalize()

Use:
MPI_Finalize();


First MPI program



MPI include file
Serial code
Start parallel code
Do work
Finish parallel code
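The slide's own listing did not survive extraction, so here is one possible filling-in of the skeleton above (a sketch; the printed messages are illustrative):

#include <stdio.h>
#include <mpi.h>                        /* MPI include file      */

int main(int argc, char *argv[]) {
    printf("Serial code before MPI\n"); /* serial code           */

    MPI_Init(&argc, &argv);             /* start parallel code   */

    printf("Hello world\n");            /* do work               */

    MPI_Finalize();                     /* finish parallel code  */

    printf("Serial code after MPI\n");  /* serial code again     */
    return 0;
}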


Exercise 1
Write a program that prints "hello world" from each running process
Compile and run on one processor
Run on several processors in parallel
Modify your program so that:
Every process writes its rank and the size of MPI_COMM_WORLD
Only the process ranked 0 in MPI_COMM_WORLD prints "hello world"
Is the sequence of the output deterministic (in order)?

Compilation
Compilation of a C program: gcc -o prog prog.c
Execution of the C program: ./prog
Compilation of MPI in C: mpicc prog.c -o prog
Execution with num processes: mpirun -n num prog
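For example (hypothetical file name and process count, assuming an MPI installation that provides mpicc and mpirun): mpicc hello.c -o hello followed by mpirun -n 4 ./hello runs the program on 4 processes.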
