
MPI

The gold standard for high-performance parallelism is MPI, the Message-Passing Interface. As an interface, MPI is a specification for how to communicate information between various processes, which may be close to or very far from one another.
The MPI-3.0 Standard was released in September 2012. There are
two primary open source projects that
implement MPI: MPICH and Open MPI. Since they implement the
same standard, these are largely interchangeable. They both take
great care to provide the MPI interface completely and correctly.
It is no overstatement to say that supercomputing is built on top
of MPI. This is because MPI is an abstraction for parallelism that is
independent of the machine. This allows physicists (and other
domain experts) to learn and write MPI code and have it work on any
computer. Meanwhile, the architects of the supercomputers can
implement a version of MPI that is optimized to the machines that
they are building. The architects do not have to worry about who is
going to use their version of MPI, or how they will use it.
MPI is a useful abstraction for anyone who buys into its model. It is
a successful abstraction because almost everyone at this
point does buy into it. MPI is huge and very flexible, and we do not
have space here to do it justice. It currently scales up to the level of
hundreds of thousands to millions of processors. It also works just
fine on a handful of processors. If you are serious about parallelism
on even medium scales (1,000+ processors), MPI is an unavoidable
boon.
As its name states, MPI is all about communication. Mostly this
applies to data, but it is also true for algorithms. The basic elements
of MPI all deal with how to communicate between processes. For a
user, this is primarily what is of interest. As with all good things,
there is a Python interface. In fact, there are many of them. The most
commonly used one is called mpi4py. We will be discussing the mpi4py package rather than the officially supported C, C++, or Fortran interfaces.
In MPI terminology, processes are called ranks and are given
integer identifiers starting at zero. As with the other forms of
parallelism we have seen, you may have more ranks than there are
physical processors. MPI will do its best to spread the ranks out
evenly over the available resources. Often, though not always,
rank 0 is considered to be a special master process that commands
and controls the other ranks.
TIP
Having a master rank is a great strategy to use, until it isn't! The point at
which this approach breaks down is when the master process is overloaded by
the sheer number of ranks. The master itself then becomes the bottleneck for
doling out work. Reimagining an algorithm to not have a master process can
be tricky.

At the core of MPI are communicator objects. These provide
metadata about how many processes there are with
the Get_size() method, which rank you are on
with Get_rank(), and how the ranks are grouped together.
Communicators also provide tools for sending messages from one
process and receiving them on other processes via
the send() and recv() methods. The mpi4py package has two
primary ways of communicating data. The slower but more general
way is that you can send arbitrary Python objects. This requires that
the objects are fully picklable. Pickling is Python's native storage
mechanism. Even though pickles are written in plain text, they are
not human readable by any means. To learn more about how
pickling works and what it looks like, please refer to the pickling
section of the Python documentation.
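For example, here is a minimal sketch of querying communicator metadata and sending a picklable object. The snippet is illustrative only and assumes the script is launched under mpiexec with at least two ranks:

from mpi4py.MPI import COMM_WORLD

size = COMM_WORLD.Get_size()   # total number of ranks MPI was started with
rank = COMM_WORLD.Get_rank()   # integer identifier of this process

if rank == 0:
    # any picklable Python object can be sent with the lowercase send()
    COMM_WORLD.send({"greeting": "hello", "from": rank}, dest=1)
elif rank == 1:
    obj = COMM_WORLD.recv(source=0)
    print("rank", rank, "of", size, "received", obj)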
NumPy arrays can also be used to communicate with mpi4py. In
situations where your data is already in NumPy arrays, it is most
appropriate to let mpi4py use these arrays. However, the
communication is then subject to the same constraints as normal
NumPy arrays. Instead of going into the details of how to use
NumPy and mpi4py together, here we will only use the generic
communication mechanisms. This is because they are easier to use,
and moving to NumPy-based communication does not add anything
to your understanding of parallelism.
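For comparison only (the rest of this section sticks to the generic mechanisms), the following sketch shows what buffer-based NumPy communication looks like: mpi4py's uppercase Send() and Recv() operate directly on array buffers, and the receiver must preallocate an array of matching shape and dtype:

import numpy as np
from mpi4py.MPI import COMM_WORLD

rank = COMM_WORLD.Get_rank()
if rank == 0:
    data = np.arange(10, dtype='f8')
    COMM_WORLD.Send(data, dest=1, tag=7)    # send the array buffer directly
elif rank == 1:
    buf = np.empty(10, dtype='f8')          # preallocate the receive buffer
    COMM_WORLD.Recv(buf, source=0, tag=7)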
The mpi4py package comes with a couple of common
communicators already instantiated. The one that is typically used is
called COMM_WORLD. This represents all of the processes
that MPI was started with and enables basic point-to-point
communication. Point-to-point communication allows any
process to communicate directly with any other process. Here we
will be using it to have the rank 0 process communicate back and
forth with the other ranks.
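As a sketch of this pattern (the task dictionaries here are made up for illustration), rank 0 might send a small piece of work to every other rank and then collect a reply from each:

from mpi4py.MPI import COMM_WORLD

rank = COMM_WORLD.Get_rank()
if rank == 0:
    # the master hands out one item per worker, then gathers the replies
    for worker in range(1, COMM_WORLD.Get_size()):
        COMM_WORLD.send({"work_item": worker}, dest=worker, tag=0)
    for worker in range(1, COMM_WORLD.Get_size()):
        print("reply:", COMM_WORLD.recv(source=worker, tag=0))
else:
    task = COMM_WORLD.recv(source=0, tag=0)
    COMM_WORLD.send({"rank": rank, "did": task}, dest=0, tag=0)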
As with multiprocessing, the main module must be importable. This
is because MPI must be able to launch its own processes. Typically
this is done through the command-line utility mpiexec. This takes
a -n switch and a number of nodes to run on. For simplicity, we
assume one process per node. The program to run (Python, here) is
then followed by any arguments it takes. Suppose that we have
written our N-body simulation in a file called n-body-mpi.py. If
we wish to run on four processes, we would start MPI with the
following command on the command line:
$ mpiexec -n 4 python n-body-mpi.py

Now we just need to write the n-body-mpi.py file! Implementing
an MPI-based solver for the N-body problem is not radically
different from the solutions that we have already seen.
The remove_i(), initial_cond(), a(), timestep(),
and timestep_i() functions are all the same as they were
in Multiprocessing.
What changes for MPI is the simulate() function. To be
consistent with the other examples in this chapter (and because it is a
good idea), we will also implement an MPI-aware process pool.
Let's begin by importing MPI and the following helpers:
from mpi4py import MPI
from mpi4py.MPI import COMM_WORLD
from types import FunctionType

The MPI module is the primary module in mpi4py. Within this
module lives the COMM_WORLD communicator that we will use, so
it is convenient to import it directly. Finally, types is a Python
standard library module that provides base classes for built-in
Python types. The FunctionType will be useful in the MPI-aware
Pool that is implemented here:
class Pool(object):
    """Process pool using MPI."""
    def __init__(self):
        self.f = None
        self.P = COMM_WORLD.Get_size()
        self.rank = COMM_WORLD.Get_rank()

    def wait(self):
        if self.rank == 0:
            raise RuntimeError("Proc 0 cannot wait!")
        status = MPI.Status()
        while True:
            task = COMM_WORLD.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if not task:
                break
            if isinstance(task, FunctionType):
                self.f = task
                continue
            result = self.f(task)
            COMM_WORLD.isend(result, dest=0, tag=status.tag)

    def map(self, f, tasks):
        N = len(tasks)
        P = self.P
        Pless1 = P - 1
        if self.rank != 0:
            self.wait()
            return
        if f is not self.f:
            self.f = f
            requests = []
            for p in range(1, self.P):
                r = COMM_WORLD.isend(f, dest=p)
                requests.append(r)
            MPI.Request.waitall(requests)
        requests = []
        for i, task in enumerate(tasks):
            r = COMM_WORLD.isend(task, dest=(i % Pless1) + 1, tag=i)
            requests.append(r)
        MPI.Request.waitall(requests)
        results = []
        for i in range(N):
            result = COMM_WORLD.recv(source=(i % Pless1) + 1, tag=i)
            results.append(result)
        return results

    def __del__(self):
        if self.rank == 0:
            for p in range(1, self.P):
                COMM_WORLD.isend(False, dest=p)

Working through the Pool class from top to bottom:

1. self.f holds a reference to the function to execute. The pool starts off with no function.
2. self.P is the total number of processors, and self.rank is which processor we are on.
3. wait() is a method for receiving data when the pool has no tasks. Normally, a task is data to give as arguments to the function f(). However, if the task is itself a function, it replaces the current f().
4. The master process cannot wait.
5. Inside the loop, wait() receives a new task from the master process.
6. If the task was a function, it is put onto the object and waiting continues.
7. If the task was not a function, then it must be a real task. The function is called on this task and the result is sent back.
8. map() is a method to be used like before.
9. The workers are made to wait while the master sends out tasks.
10. All of the workers are sent the function.
11. Tasks are evenly distributed to all of the workers.
12. The master waits for the results to come back from the workers.
13. __del__() shuts down all of the workers when the pool is shut down.
The purpose of the Pool class is to provide a map() method that is
similar to the map() on the multiprocessing pool. This class
implements the rank-0-as-master strategy. The map() method can
be used in the same way as for other pools. However, other parts of
the MPI pool operate somewhat differently. To start with, there is no
need to tell the pool its size. P is set on the command line and then
discovered with COMM_WORLD.Get_size() automatically in the
pool's constructor.
Furthermore, there will be an instance of Pool on each processor
because MPI runs the same executable (python) and script
(n-body-mpi.py) everywhere. This implies that each pool should be
aware of its own rank so that it can determine if it is the master or
just another worker. The Pool class has to jointly fulfill both the
worker and the master roles.
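As a usage sketch (the square() helper and the task list are made up for illustration and are not part of the N-body code), every rank constructs the pool, but only rank 0 calls map() while the others wait:

def square(x):
    # module-level functions pickle by reference, so they can be shipped to workers
    return x * x

pool = Pool()
if pool.rank == 0:
    results = pool.map(square, [1, 2, 3, 4])
    print(results)
else:
    pool.wait()

Note that a falsy task (such as 0 or an empty tuple) would be mistaken for the shutdown signal in wait(), so the tasks here are deliberately nonzero.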
The wait() method here has the same meaning
as Thread.run() from Threads. It does work when there is
work to do and sits idle otherwise. There are three paths
that wait() can take, depending on the kind of task it receives:
1. If a function was received, it assigns this function to the
attribute f for later use.
2. If an actual task was received, it calls the f attribute with the
task as an argument.
3. If the task is False, then it stops waiting.
The master process is not allowed to wait and therefore not allowed
to do real work. We can take this into account by telling MPI to
use P+1 nodes. This is similar to what we saw with threads.
However, with MPI we have to handle the master process explicitly.
With Python threading, Python handles the main thread, and thus the
master process, for us.
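For example, if we want four ranks doing real work in addition to the master, we would launch five processes:

$ mpiexec -n 5 python n-body-mpi.py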
The map() method again takes a function and a list of tasks. The
tasks are evenly distributed over the workers. The map() method is
only runnable on the master, while workers are told to wait. If the
function that is passed in is different than the current value of
the f attribute, then the function itself is sent to all of the workers.
Sending happens via the initiate send (COMM_WORLD.isend())
call. We ensure that the function has made it to all of the workers via
the call to MPI.Request.waitall(). This acts as an
acknowledgment between the sender and all of the receivers. Next,
the tasks are distributed to their appropriate ranks. Finally, the
results are received from the workers.
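In isolation, the isend()/waitall() handshake looks roughly like the following sketch (the "hello" payload is just a stand-in):

from mpi4py import MPI
from mpi4py.MPI import COMM_WORLD

if COMM_WORLD.Get_rank() == 0:
    requests = []
    for dest in range(1, COMM_WORLD.Get_size()):
        r = COMM_WORLD.isend("hello", dest=dest, tag=0)  # initiate a non-blocking send
        requests.append(r)
    MPI.Request.waitall(requests)  # block until every send has completed
else:
    print(COMM_WORLD.recv(source=0, tag=0))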
When the master pool instance is deleted, it will automatically
instruct the workers to stop waiting. This allows the workers to be
cleaned up correctly as well. Since the Pool API here is different
enough, a new version of the top-level simulate() function must
also be written. Only the master process should be allowed to
aggregate results together. The new version of simulate() is
shown here:
def simulate(N, D, S, G, dt):
    x0, v0, m = initial_cond(N, D)
    pool = Pool()
    if COMM_WORLD.Get_rank() == 0:
        for s in range(S):
            x1, v1 = timestep(x0, v0, G, m, dt, pool)
            x0, v0 = x1, v1
    else:
        pool.wait()

Lastly, if we want to run a certain case, we need to add a main
execution to the bottom of n-body-mpi.py. For 128 bodies in 3
dimensions over 300 time steps, we would call simulate() as
follows:
if __name__ == '__main__':
    simulate(128, 3, 300, 1.0, 1e-3)

Given MPI's fine-grained control over communication, how does
the N-body problem scale? With twice as many processors, we again
expect a 2x speedup. If the number of MPI nodes exceeds the
number of processors, however, we would expect a slowdown due to
the overhead of managing the excess processes. Figure 12-7 shows a
sample study on a dual-core laptop.
Figure 12-7. Speedup in MPI N-body simulation

While there is a speedup for the P=2 case, it is only about 1.4x,
rather than the hoped-for 2x. The downward trend for P>2 is still
present, and even steeper than with multiprocessing. Furthermore,
the P=1 MPI case is about 5.5x slower than the same simulation
with no parallelism. So, for small simulations MPI's overhead may
not be worth it.
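One way to produce timings like these (a sketch, not the measurement code behind the figure) is to wrap the call to simulate() with MPI.Wtime() and report the elapsed wall-clock time from the master rank:

from mpi4py import MPI
from mpi4py.MPI import COMM_WORLD

t0 = MPI.Wtime()                  # wall-clock time before the run
simulate(128, 3, 300, 1.0, 1e-3)
t1 = MPI.Wtime()                  # wall-clock time after the run
if COMM_WORLD.Get_rank() == 0:
    print("simulation took", t1 - t0, "seconds")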
Still, the situation presented here is a worst-case scenario for MPI:
arbitrary Python code with two-way communication on a small
machine of unspecified topology. If we had tried to optimize our
algorithm at all, by giving MPI more information or by using
NumPy arrays to communicate, the speedups would have been
much higher.
These results should thus be viewed from the vantage point that
even in the worst case, MPI is competitive. MPI truly shines in a
supercomputing environment, where everything that you have
learned about message passing still applies.
