Laboratory no. 1
Execution model
Multiple MPI processes run simultaneously. These processes can be identical or different, and they can communicate collectively or in a point-to-point fashion. A communicator defines the communication domain; inside this domain, each process has a rank. The communicator identifies a group of processes that can communicate with each other and also offers communication protection.
Communicator types:
- intra-communicator: for communication between processes belonging to the same group
- inter-communicator: for communication between process groups

Core functions:
- MPI_Init: initializes an MPI process
- MPI_Finalize: terminates an MPI process
- MPI_Comm_size: determines the number of processes
- MPI_Comm_rank: determines the rank of the current process
- MPI_Send: sends a message
- MPI_Recv: receives a message
Examples:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int count, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &count);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("I am the process %d of %d\n", myrank, count);
    MPI_Finalize();
    return 0;
}
Fundamentals of Distributed Systems
Master study program: Distributed Systems and Web Technologies

int main(int argc, char **argv) {
    int count, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &count);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        // master process code
    } else {
        // slave process code
    }
    MPI_Finalize();
    return 0;
}
Communications
Communication types:
- point-to-point
- collective
Point-to-point communications

Blocking communications

Send:

Standard blocking send:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Buffered mode blocking send:
int MPI_Bsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Synchronous mode blocking send:
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Ready send:
int MPI_Rsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

<parameters>:
[IN] buf - initial address of send buffer (choice)
[IN] count - number of elements to send (nonnegative integer)
[IN] datatype - datatype of each send buffer element (handle)
[IN] dest - rank of destination (integer)
[IN] tag - message tag (integer)
[IN] comm - communicator (handle)

Receive:

Blocking receive:
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

<parameters>:
[IN] count - maximum number of elements to receive (integer)
[IN] datatype - datatype of each receive buffer entry (handle)
[IN] source - rank of source (integer)
[IN] tag - message tag (integer)
[IN] comm - communicator (handle)
[OUT] buf - initial address of receive buffer (choice)
[OUT] status - status object (status)

Non-blocking communications

Send:
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

<parameters>:
[IN] buf - initial address of send buffer (choice)
[IN] count - number of elements in send buffer (integer)
[IN] datatype - datatype of each send buffer element (handle)
[IN] dest - rank of destination (integer)
[IN] tag - message tag (integer)
[IN] comm - communicator (handle)
[OUT] request - communication request (handle)

Receive:
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

<parameters>:
[IN] buf - initial address of receive buffer (choice)
[IN] count - number of elements in receive buffer (integer)
[IN] datatype - datatype of each receive buffer element (handle)
[IN] source - rank of source (integer)
[IN] tag - message tag (integer)
[IN] comm - communicator (handle)
[OUT] request - communication request (handle)
Note: In non-blocking mode, the receive operation returns even if no message has been received yet. To wait for or test the completion of the operation, use the MPI_Wait() and MPI_Test() functions. Example:
MPI_Status status;
MPI_Request req;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    int a;
    MPI_Isend(&a, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req);
    ...
    MPI_Wait(&req, &status);
} else if (myrank == 1) {
    int b;
    MPI_Recv(&b, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}
Other functions: MPI_Waitany(), MPI_Testany(), MPI_Waitall(), MPI_Testall(), MPI_Waitsome(), MPI_Testsome().
Collective communications
Collective communications imply a group of participating processes. The core functions provided by MPI libraries for collective communications are:
- barrier: all group members are synchronized at the barrier (MPI_Barrier)
- broadcast: a group member broadcasts a message to all the other group members (MPI_Bcast)
- gather: a group member collects data from all group members (MPI_Gather)
- scatter: a group member spreads data to all group members (MPI_Scatter)
- all gather: gather variation where all group members receive the data (MPI_Allgather)
- all-to-all scatter/gather: collecting/spreading data from all group members to all group members (MPI_Alltoall); it is an extension of MPI_Allgather where each process may send distinct data to all the other processes
- reduce operations: sum, multiply, min, max or other user-defined functions, where the result is sent to all group members or only to one of them

Notes:
- these functions are executed in a collective fashion by all processes that belong to the specified communicator; all processes use the same parameters
- all processes must execute the collective communications
- no synchronization is guaranteed, except when using the barrier
- there are no non-blocking collective communications
- there are no tags
- the size of the receive buffers must match exactly the size of the send buffers
int MPI_Barrier(MPI_Comm comm)

<parameters>:
[IN] comm - communicator (handle)

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

<parameters>:
[IN] buffer - starting address of buffer (choice)
[IN] count - number of entries in buffer (integer)
[IN] datatype - data type of buffer (handle)
[IN] root - rank of broadcast root (integer)
[IN] comm - communicator (handle)
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

<parameters>:
[IN] sendbuf - address of send buffer (choice, significant only at root)
[IN] sendcount - number of elements sent to each process (integer, significant only at root)
[IN] sendtype - datatype of send buffer elements (handle, significant only at root)
[IN] recvcount - number of elements in receive buffer (integer)
[IN] recvtype - datatype of receive buffer elements (handle)
[IN] root - rank of sending process (integer)
[IN] comm - communicator (handle)
[OUT] recvbuf - address of receive buffer (choice)
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

<parameters>:
[IN] sendbuf - starting address of send buffer (choice)
[IN] sendcount - number of elements in send buffer (integer)
[IN] sendtype - datatype of send buffer elements (handle)
[IN] recvcount - number of elements for any single receive (integer, significant only at root)
[IN] recvtype - datatype of receive buffer elements (handle, significant only at root)
[IN] root - rank of receiving process (integer)
[IN] comm - communicator (handle)
[OUT] recvbuf - address of receive buffer (choice, significant only at root)
Examples:
// broadcast example
double param;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 5) {
    param = 23.0;
}
MPI_Bcast(&param, 1, MPI_DOUBLE, 5, MPI_COMM_WORLD);
printf("Process: %d, param: %f\n", rank, param);
MPI_Finalize();

// scatter example
MPI_Comm comm;
int gsize, *sendbuf;
int root, rbuf[100];
...
MPI_Comm_size(comm, &gsize);
sendbuf = (int *)malloc(gsize*100*sizeof(int));
...
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

// gather example
MPI_Comm comm;
int gsize, sendarray[100];
int root, *rbuf;
...
MPI_Comm_size(comm, &gsize);
rbuf = (int *)malloc(gsize*100*sizeof(int));
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
To run the application on multiple systems, create a file named lamhosts and insert on each line the IP address (or name) of every workstation and, if necessary, its number of CPUs. Example:
calc1.domeniu.ro/192.168.247.10 cpu=2
calc2.domeniu.ro/192.168.247.11 cpu=3
calc3.domeniu.ro/192.168.247.12 cpu=2
calc4.domeniu.ro/192.168.247.13
Running programs:
mpirun -np 4 prog
Exercises:
Implement an application that creates 2 processes: one that sends data and one that receives data. The data is represented by an array of double values (at least 2000 elements). Make 2 versions of the application: one that uses blocking communications and one that uses non-blocking communications. Compare the results.

Implement a producer/consumer application using non-blocking communications. Consider N processes, where N-1 processes are the producers and one process is the consumer.
Bibliography:
- Open MPI 1.5.4 API docs: http://www.open-mpi.org/doc/v1.5/
- MPI 2.2 standard: http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
OpenMP model

OpenMP programs display the following characteristics:
- all threads have access to the same shared memory
- there are two types of data: shared and private
- shared data is available to any thread in the working group
- private data can only be accessed by the owner thread
- data transfer takes place in a manner transparent to the programmer
- each execution has at least one synchronization mechanism
Execution model

The OpenMP API offers support for:
- nested parallel regions (a parallel region inside another parallel region)
- dynamic management of the number of threads used to execute parallel regions
Using compiler directives, one can specify:
- how to partition the work volume among threads
- how to synchronize threads
- variable visibility (private, shared etc.)
Runtime libraries:
- provide a function set used to query and modify the execution environment
- function signatures are located in <omp.h> (for C, C++)
#include <omp.h>

int main() {
    int var1, var2, *var3;

    // serial code
    ...

    // beginning of parallel section: forks a team of threads
    #pragma omp construct [clause [clause] ...]
    {
        int var4, var5;
        // parallel section executed by all threads
        ...
        // all threads join the master thread and are terminated
    }

    // resume serial code
    ...
}
Directive syntax:
#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)

    structured_block
Example:
#include <stdio.h>

int main(void) {
    printf("Master thread before omp parallel\n\n");
    #pragma omp parallel
    {
        printf("Hello from team thread\n");
    }
    printf("\nMaster thread after omp parallel\n");
    return 0;
}

omp for

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < MAX; ++i) {
        res[i] = huge();
    }
}

or, as a combined directive:

#pragma omp parallel for
for (i = 0; i < MAX; ++i) {
    res[i] = huge();
}
Directive syntax:
#pragma omp for [clause ...] newline
    schedule (type [,chunk])
    ordered
    private (list)
    firstprivate (list)
    lastprivate (list)
    shared (list)
    reduction (operator: list)
    collapse (n)
    nowait

    for_loop
Example:
#define N 12
...
#pragma omp parallel
#pragma omp for
for (i = 1; i < N+1; ++i)
    c[i] = a[i] + b[i];
Variable visibility
A variable defined outside a parallel block is implicitly shared. A variable defined inside a parallel block is implicitly private to the declaring thread. Variable visibility can be modified using the shared and private clauses:

- shared (variable list): if this clause is present in a pragma omp directive, all listed variables are shared among the threads spawned by the master thread. The OpenMP API controls variable access but does NOT implicitly perform any kind of synchronization.
- private (variable list): if this clause is present in a pragma omp directive, all listed variables are private to every thread spawned by the master thread. Every listed variable is redeclared in the private memory space of each thread, with the same name and data type, WITHOUT copying the variable's value from shared memory. All modifications to these private variables are visible only to the owning thread.
Collective operations
reduction (operator: variable list): the listed variables must be shared variables. Each listed variable is redeclared inside each thread as a private variable and is initialized according to the table below. These private variables are modified by each thread and, at the end of the parallel region, they are combined according to the specified operator into a resulting value, which is assigned to the shared variable.

Operand   Initial value
+         0
-         0
*         1
^         0
&         ~0
|         0
&&        1
||        0
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; ++i) {
    sum += a[i] * b[i];
}
The variable list cannot contain pointers, arrays, references or const variables. In C++, the operators cannot be overloaded.
Synchronization mechanisms
barrier Syntax:
#pragma omp barrier new-line
Example:
#pragma omp parallel shared (A, B, C)
{
    DoSomeWork(A, B); // processed A into B
    #pragma omp barrier
    DoSomeWork(B, C); // processed B into C
}
critical Syntax:
#pragma omp critical [(lock_name)]
Examples:
#include <omp.h>

int main() {
    int x = 0;
    #pragma omp parallel shared(x)
    {
        #pragma omp critical
        x = x + 1;
    } // end of parallel section
    return 0;
}
float dot_prod(float *a, float *b, int N) {
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; ++i) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}
Exercises:
Implement an application that computes the product between a matrix of dimensions NxN and a vector of dimension N. Consider N at least 1000. Run the application multiple times using a different number of threads and measure the running time of each run.
Bibliography:
- OpenMP 3.1 specification: http://www.openmp.org/mp-documents/OpenMP3.1.pdf
- Ruud van der Pas, An Introduction Into OpenMP, IWOMP 2005, University of Oregon, Eugene, Oregon, USA, June 1-4, 2005
- Blaise Barney, OpenMP, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/openMP/
- Intel Software College, Programming with OpenMP
- GNU libgomp Runtime Library Routines: http://gcc.gnu.org/onlinedocs/libgomp/Runtime-Library-Routines.html