Writing Message Passing Parallel Programs with MPI A Three

Report
Message-Passing Programming
Neil MacDonald, Elspeth Minty, Joel Malard,
Tim Harding, Simon Brown,
Mario Antonioletti, David Henty,
Gavin J. Pringle
Contents
 Message Passing Concepts
 Basic MPI Programs
 Point-to-Point Communication
 Profiling tool: VAMPIR
 Modes, Tags and Communicators
 Non-blocking Communication
 Collective Communications
 Derived Datatypes
 Virtual Topologies
 Decomposition Techniques
 MPI design
Motivation
The MPI library is the most important piece of
software in parallel programming
All the world’s largest supercomputers are
programmed using MPI
Writing parallel programs using MPI is fun!
Message-Passing Concepts
Sequential Programming Paradigm
Process
Memory
M
Processor
P
Programming Paradigm
Processes
0
1
2
3
Message Passing Interface
Communication Network
Programming Paradigm
All variables are private.
Processes communicate via special subroutine
calls.
Typically
Written in a conventional sequential language.
A single program is run on each processor.
A Message-Passing Program
if ( me == 0 ) then X = X + 111
Process A
Process B
X: 19
me: 0
X: 130
X: 5
me: 1
X: 5
What is SPMD?
Single Program, Multiple Data
Same program runs everywhere.
Restriction on the general message-passing
model.
Some vendors only support SPMD parallel
programs.
General message-passing model can be
emulated.
Emulating General Message Passing
Using SPMD: C
main (int argc, char **argv)
{
if (controller_process)
{
Controller( /* Arguments */ );
}
else
{
Worker
( /* Arguments */ );
}
}
Emulating General Message-Passing
Using SPMD: Fortran
PROGRAM SPMD
IF (controller_process) THEN
CALL CONTROLLER ( ! Arguments ! )
ELSE
CALL WORKER
ENDIF
END PROGRAM SPMD
( ! Arguments ! )
Messages
Messages are packets of data moving between
processes.
The message passing system has to be told
the following information:
–
–
–
–
–
–
–
Sending process(es)
Source location
Data type
Data length
Receiving process(es)
Destination location
Size of receive buffer(s)
Access
A process needs to register to the message
passing interface.
A message passing system is similar to:
–
–
–
–
Mail box
Phone line
fax machine
etc.
Addressing
Messages need to have addresses to be sent
to.
Addresses are similar to:
–
–
–
–
–
Postal address
Phone number
Fax number
Email address
etc.
Reception
It is important that the receiving process is
capable of dealing with messages it is sent.
–
–
–
–
–
Letter box
Telephone receiver
Printer
Email inbox
etc
Completion
The mode of a communication determines
when its constituent operations complete.
The form of an operation determines when the
procedure implementing that operation will
return.
Point-to-Point Communication
Simplest form of message passing.
One process sends a message to another.
Different types of point-to-point communication.
Synchronous Sends
Provide information about the completion of
the message.
Beep!
Asynchronous Sends
Only know when the message has left.
?
Blocking Operations
Relate to when the operation has completed.
Only return from the subroutine call when the
operation has completed.
Non-Blocking Operations
 Return straight away and allow the sub-program to
continue to perform other work. At some later time the
sub-program can test or wait for the completion of the
non-blocking operation.
Beep!
Non-Blocking Operations
All non-blocking operations should have
matching wait operations. Some systems
cannot free resources until wait has been
called.
A non-blocking operation immediately followed
by a matching wait is equivalent to a blocking
operation.
Non-blocking operations are not the same as
sequential subroutine calls as the operation
continues after the call has returned.
Collective communications
Collective communication routines are higher
level routines involving several processes at a
time.
Can be built out of point-to-point
communications.
Barriers
Synchronise processes.
Barrier
Barrier
Barrier
Broadcast
A one-to-all communication.
Reduction Operations
Combine data from several processes to
produce a single result.
Strike
Summary
Message-Passing is a programming model
– that is implemented by MPI
Essential to understand the basic concepts
– private variables
– explicit communications
– SPMD
Major difficulty in learning MPI is understanding
the Message-Passing model
– a very different model to sequential programming
Message Passing
Programming
What is MPI?
MPI Forum
First message-passing interface standard.
Sixty people from forty different organisations.
Users and vendors represented, from the US
and Europe.
Two-year process of proposals, meetings and
review.
Message Passing Interface document
produced.
Goals and Scope of MPI
MPI's prime goals are:
– To provide source-code portability.
– To allow efficient implementation.
It also offers:
– A great deal of functionality.
– Support for heterogeneous parallel architectures.
Header files
C:
#include <mpi.h>
Fortran:
include 'mpif.h'
Fortran 90:
use mpi
MPI Function Format
C:
error = MPI_Xxxxx(parameter, ...);
MPI_Xxxxx(parameter, ...);
Fortran:
CALL MPI_XXXXX(parameter, ..., IERROR)
Handles
MPI controls its own internal data structures.
MPI releases `handles' to allow programmers
to refer to these.
C handles are of defined typedefs.
Fortran handles are INTEGERs.
Initialising MPI
C:
int MPI_Init(int *argc, char ***argv)
Fortran:
MPI_INIT(IERROR)
INTEGER IERROR
Must be the first MPI procedure called.
– but multiple processes are already running before MPI_Init
MPI_Init
int main(int argc, char *argv[])
{
...
MPI_Init(&argc, &argv);
...
int main()
{
MPI_Init(NULL, NULL);
...
program my_mpi_program
integer :: ierror
...
CALL MPI_INIT(IERROR)
MPI_COMM_WORLD
Communicators
MPI_COMM_WORLD
0
2
1
4
3
5
6
Rank
How do you identify different processes in a
communicator?
MPI_Comm_rank(MPI_Comm comm, int *rank)
MPI_COMM_RANK(COMM, RANK, IERROR)
INTEGER COMM, RANK, IERROR
The rank is not the physical processor number.
– numbering is 0, 1, 2, ....
MPI_Comm_rank
int rank;
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank)
printf(“Hello from rank %d\n”, rank);
...
integer :: ierror
integer :: rank
...
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
write(*,*) ‘Hello from rank ‘, rank
...
Size
How many processes are contained within a
communicator?
MPI_Comm_size(MPI_Comm comm, int *size)
MPI_COMM_SIZE(COMM, SIZE, IERROR)
INTEGER COMM, SIZE, IERROR
Exiting MPI
C:
int MPI_Finalize()
Fortran:
MPI_FINALIZE(IERROR)
INTEGER IERROR
Must be the last MPI procedure called.
Aborting MPI
Aborting the execution from any processor
(e.g. error condition)
C:
int MPI_Abort(MPI_Comm comm,int errorcode)
Fortran:
MPI_ABORT(COMM, ERRORCODE, IERROR)
INTEGER COMM, ERRORCODE, IERROR
Behaviour
– will abort all processes even if only called by one process
– this is the ONLY MPI routine that can have this effect!
Summary
Have covered basic calls
– but no explicit message-passing yet
Can still write useful programs
– eg a task farm of independent jobs
Need to compile and launch parallel jobs
– procedure is not specified by MPI
– next lecture gives machine-specific details
Messages
Messages
A message contains a number of elements of
some particular datatype.
MPI datatypes:
– Basic types.
– Derived types.
Derived types can be built up from basic types.
C types are different from Fortran types.
MPI Basic Datatypes - C
MPI Datatype
C datatype
MPI_CHAR
signed char
MPI_SHORT
signed short int
MPI_INT
signed int
MPI_LONG
signed long int
MPI_UNSIGNED_CHAR
unsigned char
MPI_UNSIGNED_SHORT
unsigned short int
MPI_UNSIGNED
unsigned int
MPI_UNSIGNED_LONG
unsigned long int
MPI_FLOAT
float
MPI_DOUBLE
double
MPI_LONG_DOUBLE
long double
MPI_BYTE
MPI_PACKED
MPI Basic Datatypes - Fortran
MPI Datatype
Fortran Datatype
MPI_INTEGER
INTEGER
MPI_REAL
REAL
MPI_DOUBLE_PRECISION
DOUBLE PRECISION
MPI_COMPLEX
COMPLEX
MPI_LOGICAL
LOGICAL
MPI_CHARACTER
CHARACTER(1)
MPI_BYTE
MPI_PACKED
Point-to-Point Communication
Point-to-Point Communication
1
2
5
Destination
3
4
Source 0
Communicator
 Communication between two processes.
 Source process sends message to destination
process.
 Communication takes place within a communicator.
 Destination process is identified by its rank in the
communicator.
Point-to-point messaging in MPI
Sender calls a SEND routine
– specifying the data that is to be sent
– this is called the send buffer
Receiver calls a RECEIVE routine
– specifying where the incoming data should be stored
– this is called the receive buffer
Data goes into the receive buffer
Metadata describing message also transferred
– this is received into separate storage
– this is called the status
Communication modes
Sender mode
Notes
Synchronous
send
Only completes when the receive has completed.
Buffered send
Always completes (unless an error occurs), irrespective of receiver.
Standard
send
Either synchronous or buffered.
Ready send
Always completes (unless an error occurs), irrespective of whether the
receive has completed.
Receive
Completes when a message has arrived.
MPI Sender Modes
OPERATION
MPI CALL
Standard send
MPI_Send
Synchronous send
MPI_Ssend
Buffered send
MPI_Bsend
Ready send
MPI_Rsend
Receive
MPI_Recv
Sending a message
 C:
int MPI_Ssend(void *buf, int count,
MPI_Datatype datatype,
int dest, int tag,
MPI_Comm comm)
 Fortran:
MPI_SSEND(BUF, COUNT, DATATYPE, DEST,
TAG, COMM, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG
INTEGER COMM, IERROR
Send data from proc 1 to proc 3
// Array of ten integers
int x[10];
...
if (rank == 1)
MPI_Ssend(x, 10, MPI_INT, dest=3, tag=0, MPI_COMM_WORLD);
// Integer scalar
int x;
...
if (rank == 1)
MPI_Ssend(&x, 1, MPI_INT, dest=3, tag=0, MPI_COMM_WORLD);
Send data from proc 1 to proc 3
! Array of ten integers
integer, dimension(10) :: x
...
if (rank .eq. 1)
CALL MPI_SSEND(x, 10, MPI_INTEGER, dest=3, tag=0,
MPI_COMM_WORLD, ierr)
! Integer scalar
integer :: x
...
if (rank .eq. 1)
CALL MPI_SSEND(x, 1, MPI_INTEGER, dest=3, tag=0,
MPI_COMM_WORLD, ierr)
Receiving a message
 C:
int MPI_Recv(void *buf, int count,
MPI_Datatype datatype,
int source, int tag,
MPI_Comm comm, MPI_Status *status)
 Fortran:
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM,
STATUS, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM,
STATUS(MPI_STATUS_SIZE), IERROR
Receive data from proc 1 on proc 3
int y[10];
MPI_Status status;
...
if (rank == 3)
MPI_Recv(y, 10, MPI_INT, src=1, tag=0, MPI_COMM_WORLD,
&status);
int y;
...
if (rank == 3)
MPI_Recv(&y, 1, MPI_INT, src=1, tag=0, MPI_COMM_WORLD,
&status);
Receive data from proc 1 on proc 3
integer, dimension(10) :: y
integer, dimension(MPI_STATUS_SIZE) :: status
...
if (rank .eq. 3)
CALL MPI_RECV(y, 10, MPI_INTEGER, src=1, tag=0,
MPI_COMM_WORLD, status, ierr)
integer :: y
...
if (rank .eq. 3)
CALL MPI_RECV(y, 1, MPI_INTEGER, src=1, tag=0,
MPI_COMM_WORLD, status, ierr)
Synchronous Blocking Message-Passing
Processes synchronise.
Sender process specifies the synchronous mode.
Blocking both processes wait until the
transaction has completed.
For a communication to succeed:
Sender must specify a valid destination rank.
Receiver must specify a valid source rank.
The communicator must be the same.
Tags must match.
Message types must match.
Receiver's buffer must be large enough.
Wildcarding
 Receiver can wildcard.
 To receive from any source MPI_ANY_SOURCE
 To receive with any tag MPI_ANY_TAG
 Actual source and tag are returned in the receiver's
status parameter.
Communication Envelope
Senders Address
For the attention of:
Destination Address
Data
Item 1
Item 2
Item 3
Commmunication Envelope Information
Envelope information is returned from
MPI_RECV as status
Information includes:
– Source: status.MPI_SOURCE or status(MPI_SOURCE)
– Tag:
status.MPI_TAG or status(MPI_TAG)
– Count: MPI_Get_count or MPI_GET_COUNT
Received Message Count
 C:
int MPI_Get_count(MPI_Status *status,
MPI_Datatype datatype,
int *count)
 Fortran:
MPI_GET_COUNT(STATUS, DATATYPE, COUNT,
IERROR)
INTEGER STATUS(MPI_STATUS_SIZE),
DATATYPE, COUNT, IERROR
Message Order Preservation
1
2
5
3
4
0
Communicator
Messages do not overtake each other.
This is true even for non-synchronous sends.
Message Passing Programming
Modes, Tags and Communicators
Overview
Lecture will cover
– explanation of MPI modes (Ssend, Bsend and Send)
– meaning and use of message tags
– rationale for MPI communicators
These are all commonly misunderstood
– essential for all programmers to understand modes
– often useful to use tags
– certain cases benefit from exploiting different communicators
Modes
MPI_Ssend (Synchronous Send)
– guaranteed to be synchronous
– routine will not return until message has been delivered
MPI_Bsend (Buffered Send)
– guaranteed to be asynchronous
– routine returns before the message is delivered
– system copies data into a buffer and sends it later on
MPI_Send (standard Send)
– may be implemented as synchronous or asynchronous send
– this causes a lot of confusion (see later)
MPI_Ssend
Process A
Process B
Running other
non-MPI code
Ssend(x,B)
Wait in Ssend
Recv(y,A)
x
Ssend returns
x can be
overwritten by A
Data Transfer
y
Recv returns
y can now be
read by B
MPI_Bsend
Process A
Process B
Running other
non-MPI code
Bsend(x,B)
Bsend returns
x can be
overwritten by A
x
Recv(y,A)
y
Recv returns
y can now be
read by B
Notes
Recv is always synchronous
– if process B issued Recv before the Bsend from process A,
then B would wait in the Recv until Bsend was issued
Where does the buffer space come from?
– for Bsend, the user provides a single large block of memory
– make this available to MPI using MPI_Buffer_attach
If A issues another Bsend before the Recv
– system tries to store message in free space in the buffer
– if there is not enough space then Bsend will FAIL!
Send
Problems
– Ssend runs the risk of deadlock
– Bsend less likely to deadlock, and your code may run faster, but
• the user must supply the buffer space
• the routine will FAIL if this buffering is exhausted
MPI_Send tries to solve these problems
– buffer space is provided by the system
– Send will normally be asynchronous (like Bsend)
– if buffer is full, Send becomes synchronous (like Ssend)
MPI_Send routine is unlikely to fail
– but could cause your program to deadlock if buffering runs out
MPI_Send
Process A
Process B
Send(x,B)
Send(y,A)
Recv(x,B)
Recv(y,A)
This code is NOT guaranteed to work
– will deadlock if Send is synchronous
– is guaranteed to deadlock if you used Ssend!
Solutions
To avoid deadlock
– either match sends and receives explicitly
– eg for ping-pong
• process A sends then receives
• process B receives then sends
For a more general solution use non-blocking
communications (see later)
For this course you should program with Ssend
– more likely to pick up bugs such as deadlock than Send
Tags
Every message can have a tag
– this is a non-negative integer value
– not everyone uses them
– many MPI programs set all tags to zero
Tags can be useful in some situations
– can choose to receive messages only of a given tag
Most commonly used with MPI_ANY_TAG
– receives the most recent message regardless of the tag
– user then finds out the actual value by looking at the status
Communicators
All MPI communications take place within a
communicator
– a communicator is fundamentally a group of processes
– there is only one pre-defined communicator:
MPI_COMM_WORLD which contains ALL the processes
A message can ONLY be received within the
same communicator from which it was sent
– unlike tags, it is not possible to wildcard on comm
Uses of Communicators (i)
Can split MPI_COMM_WORLD into pieces
– each process has a new rank within each sub-communicator
– guarantees messages from the different pieces do not interact
• can attempt to do this using tags but there are no guarantees
MPI_COMM_WORLD
size=7
size=4
rank=1 rank=3 rank=5
rank=6
rank=0
rank=2 rank=4
rank=0
rank=1 rank=3
rank=2
comm1
rank=1
rank=2
rank=0
size=3
comm2
Uses of Communicators (ii)
Can make a copy of MPI_COMM_WORLD
– containing all the same processes but in a new communicator
Enables processes to communicate with each
other safely within a piece of code
– guaranteed that messages cannot be received by other code
– this is essential for people writing parallel libraries (eg a Fast
Fourier Transform) to stop library messages becoming mixed
up with user messages
• user cannot intercept the the library messages if the library keeps
the identity of the new communicator a secret
• not safe to simply try and reserve tag values due to wildcarding
Summary (i)
Question: Why bother with all these send modes?
Answer
– it is a little complicated, but you should make sure you understand
– Ssend and Bsend are clear
• map directly onto synchronous and asynchronous sends
– Send can be either synchronous or asynchronous
• MPI is trying to be helpful here, giving you the benefits of Bsend if
there is sufficient system memory available, but not failing completely
if buffer space runs out
• in practice this leads to endless confusion!
The amount of system buffer space is variable
– programs that run on one machine may deadlock on another
– you should NEVER assume that Send is asynchronous!
Summary (ii)
Question: What are the tags for?
Answer
– if you don’t need them don’t use them!
• perfectly acceptable to set all tags to zero
– can be useful for debugging
• eg always tag messages with the rank of the sender
Summary (iii)
Question: Can I just use MPI_COMM_WORLD?
Answer
– yes: many people never need to create new communicators in
their MPI programs
– however, it is probably bad practice to specify
MPI_COMM_WORLD explicitly in your routines
• using a variable will allow for greater flexibility later on, eg:
MPI_Comm comm;
/* or INTEGER for Fortran */
comm = MPI_COMM_WORLD;
...
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);
....
Non-Blocking
Communications
Deadlock
1
2
5
3
4
Communicator
0
Non-Blocking Communications
Separate communication into three phases:
Initiate non-blocking communication.
Do some work (perhaps involving other
communications?)
Wait for non-blocking communication to
complete.
Non-Blocking Send
1
2
5
3
4
Communicator
0
Non-Blocking Receive
1
2
5
3
4
Communicator
0
Handles used for Non-blocking Comms
datatype same as for blocking
(MPI_Datatype or INTEGER).
communicator same as for blocking
(MPI_Comm or INTEGER).
request MPI_Request or INTEGER.
A request handle is allocated when a
communication is initiated.
Non-blocking Synchronous Send
 C:
int MPI_Issend(void* buf, int count,
MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm,
MPI_Request *request)
int MPI_Wait(MPI_Request *request,
MPI_Status *status)
 Fortran:
MPI_ISSEND(buf, count, datatype, dest,
tag, comm, request, ierror)
MPI_WAIT(request, status, ierror)
Non-blocking Receive
 C:
int MPI_Irecv(void* buf, int count,
MPI_Datatype datatype, int src,
int tag, MPI_Comm comm,
MPI_Request *request)
int MPI_Wait(MPI_Request *request,
MPI_Status *status)
 Fortran:
MPI_IRECV(buf, count, datatype, src,
tag, comm, request, ierror)
MPI_WAIT(request, status, ierror)
Blocking and Non-Blocking
Send and receive can be blocking or nonblocking.
A blocking send can be used with a nonblocking receive, and vice-versa.
Non-blocking sends can use any mode synchronous, buffered, standard, or ready.
Synchronous mode affects completion, not
initiation.
Communication Modes
NON-BLOCKING OPERATION
MPI CALL
Standard send
MPI_ISEND
Synchronous send
MPI_ISSEND
Buffered send
MPI_IBSEND
Ready send
MPI_IRSEND
Receive
MPI_IRECV
Completion
 Waiting versus Testing.
 C:
int MPI_Wait(MPI_Request *request,
MPI_Status *status)
int MPI_Test(MPI_Request *request,
int *flag,
MPI_Status *status)
 Fortran:
MPI_WAIT(handle, status, ierror)
MPI_TEST(handle, flag, status, ierror)
Multiple Communications
Test or wait for completion of one message.
Test or wait for completion of all messages.
Test or wait for completion of as many
messages as possible.
Testing Multiple Non-Blocking Comms
Process
in
in
in
Derived Datatypes
MPI Datatypes
Basic types
Derived types
– vectors
– structs
– others
Motivation
Send / Recv calls need a datatype argument
– pre-defined values exist for pre-defined language types
– eg real <-> MPI_REAL; int <-> MPI_INT
What about types defined by a program?
– eg structures (in C) or user-defined types (Fortran)
Send / Recv calls take a count parameter
– what about data that isn’t contiguous in memory?
– eg subsections of 2D arrays
Approach
Can define new types in MPI
– user calls setup routines to describe new datatype to MPI
• remember, MPI is a library and NOT a compiler!
– MPI returns a new datatype handle
– store this value in a variable, eg MPI_MY_NEWTYPE
Derived types have same status as pre-defined
– can use in any message-passing call
Some care needed for reduction operations
– user must also define a new MPI_Op appropriate to the new
datatype to tell MPI how to combine them
Defining types
All derived types stored by MPI as a list of
basic types and displacements (in bytes)
– for a structure, types may be different
– for an array subsection, types will be the same
User can define new derived types in terms of
both basic types and other derived types
Derived Data types - Type
basic datatype 0
displacement of datatype 0
basic datatype 1
displacement of datatype 1
...
...
basic datatype n-1
displacement of datatype n-1
Contiguous Data
 The simplest derived datatype consists of a number of
contiguous items of the same datatype.
 C:
int MPI_Type_contiguous(int count,
MPI_Datatype oldtype,
MPI_Datatype *newtype)
 Fortran:
MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE,
NEWTYPE, IERROR)
INTEGER COUNT, OLDTYPE, NEWTYPE, IERROR
Use of contiguous
May make program clearer to read
Imagine sending a block of 10 integers
– use MPI_Ssend with MPI_INT / MPI_INTEGER and count = 10
Or …
– define a new contiguous type of 10 integers called BLOCK
– use MPI_Ssend with type=BLOCK and count = 1
May also be useful intermediate stage in
building more complicated types
– ie later used in definition of another derived type
Vector Datatype Example
Oldtype
5 element stride
between blocks
Newtype
3 elements per block
2 blocks
count = 2
stride = 5
blocklength = 3
What is a vector type?
Why is a pattern with blocks and gaps useful?
A vector type corresponds to a
subsection of a 2D array
 Think about how arrays are stored in memory
– unfortunately, different conventions for C and Fortran!
– must use statically allocated arrays in C because dynamically
allocated arrays (using malloc) have no defined storage format
– In Fortran, can use either static or allocatable arrays
Coordinate System (how I draw arrays)
x[0][3] x[1][3] x[2][3] x[3][3]
x[0][2] x[1][2] x[2][2] x[3][2]
x[i][j]
x[0][1] x[1][1] x[2][1] x[3][1]
j
x[0][0] x[1][0] x[2][0] x[3][0]
i
x(i,j)
x(1,4)
x(2,4) x(3,4)
x(4,4)
x(1,3)
x(2,3) x(3,3)
x(4,3)
x(1,2)
x(2,2) x(3,2)
x(4,2)
x(1,1)
x(2,1) x(3,1)
x(4,1)
Arrray Layout in Memory
C: x[16]
1
2
3
4
F: x(16)
5
6
7
8
9
10
C: x[4][4]
11
12
13
14
15
F: x(4,4)
13
14
15
16
15
9
10
11
12
10
14
5
6
7
8
9
13
1
2
3
4
4
8
12
16
3
7
11
2
6
1
5
16
j
i
Data is contiguous in memory
– different conventions for mapping 2D t o 1D arrays in C and Fortran
C example
C: x[5][4]
A 3 x 2 subsection of a 5 x 4 array
– three blocks of two elements separated by gaps of two
Fortran example
F: x(5,4)
A 3 x 2 subsection of a 5 x 4 array
– two blocks of three elements separated by gaps of two
Equivalent Vector Datatypes
count = 3
blocklength = 2
stride = 4
count = 2
blocklength = 3
stride = 5
Constructing a Vector Datatype
 C:
int MPI_Type_vector (int count,
int blocklength, int stride,
MPI_Datatype oldtype,
MPI_Datatype *newtype)
 Fortran:
MPI_TYPE_VECTOR (COUNT, BLOCKLENGTH,
STRIDE, OLDTYPE, NEWTYPE, IERROR)
Sending a vector
Have defined a 3x2 subsection of a 5x4 array
– but not defined WHICH subsection
– is it the bottom left-hand corner? top-right?
Data that is sent depends on what buffer you
pass to the send routines
– pass the address of the first element that should be sent
Vectors in send routines
MPI_Ssend(&x[1][1], 1, vector3x2, ...);
MPI_SSEND(x(2,2)
, 1, vector3x2, ...)
MPI_Ssend(&x[2][1], 1, vector3x2, ...);
MPI_SSEND(x(3,2)
, 1, vector3x2, ...)
Extent of a Datatatype
 May be useful to find out how big a derived type is
– can use these routines to compute extents of basic types too
– answer is returned in bytes
 C:
int MPI_Type_extent (MPI_Datatype datatype,
MPI_Aint *extent)
 Fortran:
MPI_TYPE_EXTENT( DATATYPE, EXTENT, IERROR)
INTEGER DATATYPE, EXTENT, IERROR
Structures
Can define compound objects in C and Fortran
struct compound
{
int
ival;
double dval[3];
};
type compound
integer
:: ival
double precision :: dval(3)
end type compound
Storage format NOT defined by the language
– different compilers do different things
– eg insert arbitrary padding between successive elements
– need to tell MPI the byte displacements of every element
Constructing a Struct Datatype
 C:
int MPI_Type_struct (int count,
int *array_of_blocklengths,
MPI_Aint *array_of_displacements,
MPI_Datatype *array_of_types,
MPI_Datatype *newtype)
 Fortran:
MPI_TYPE_STRUCT (COUNT,
ARRAY_OF_BLOCKLENGTHS,
ARRAY_OF_DISPLACEMENTS,
ARRAY_OF_TYPES, NEWTYPE, IERROR)
Struct Datatype Example
count = 2
array_of_blocklengths[0] = 1
array_of_types[0] = MPI_INT
array_of_blocklengths[1] = 3
array_of_types[1] = MPI_DOUBLE
But how do we compute the displacements?
– need to create a compound variable in our program
– explicitly compute memory addresses of every member
– subtract addresses to get displacements from origin
Address of a Variable
 C:
int MPI_Address (void *location,
MPI_Aint *address)
 Fortran:
MPI_ADDRESS( LOCATION, ADDRESS, IERROR)
<type> LOCATION (*)
INTEGER ADDRESS, IERROR
Committing a datatype
 Once a datatype has been constructed, it needs to be
committed before it is used in a message-passing call
 This is done using MPI_TYPE_COMMIT
 C:
int MPI_Type_commit (MPI_Datatype *datatype)
 Fortran:
MPI_TYPE_COMMIT (DATATYPE, IERROR)
INTEGER DATATYPE, IERROR
Virtual Topologies
Virtual Topologies
Convenient process naming.
Naming scheme to fit the communication
pattern.
Simplifies writing of code.
Can allow MPI to optimise communications.
How to use a Virtual Topology
Creating a topology produces a new
communicator.
MPI provides ``mapping functions''.
Mapping functions compute processor ranks,
based on the topology naming scheme.
Example
A 2-dimensional Cylinder
0
(0,0)
1
(0,1)
2
(0,2)
3
(0,3)
4
(1,0)
5
(1,1)
6
(1,2)
7
(1,3)
8
(2,0)
9
(2,1)
10
(2,2)
11
(2,3)
Topology types
Cartesian topologies
– each process is “connected” to its neighbours in a virtual grid.
• boundaries can be cyclic, or not.
• Optionally re-order ranks to allow MPI implementation to optimise
for underlying interconnectivity.
– processes are identified by cartesian coordinates.
Graph topologies
– general graphs
– not covered here
Creating a Cartesian Virtual Topology
 C:
int MPI_Cart_create(MPI_Comm comm_old,
int ndims, int *dims, int *periods,
int reorder, MPI_Comm *comm_cart)
 Fortran:
MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS,
PERIODS, REORDER, COMM_CART, IERROR)
INTEGER COMM_OLD, NDIMS, DIMS(*), COMM_CART, IERROR
LOGICAL PERIODS(*), REORDER
Balanced Processor Distribution
 C:
int MPI_Dims_create(int nnodes, int ndims,
int *dims)
 Fortran:
MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR)
INTEGER NNODES, NDIMS, DIMS(*), IERROR
MPI_Dims_create
 Call tries to set dimensions as close to each other as
possible
dims before the call
function call
dims on return
(0, 0)
MPI_DIMS_CREATE( 6, 2, dims)
(3, 2)
(0, 0)
MPI_DIMS_CREATE( 7, 2, dims)
(7, 1)
(0, 3, 0)
MPI_DIMS_CREATE( 6, 3, dims)
(2, 3, 1)
(0, 3, 0)
MPI_DIMS_CREATE( 7, 3, dims)
erroneous call
 Non zero values in dims sets the number of processors
required in that direction.
– WARNING:- make sure dims is set to 0 before the call!
Cartesian Mapping Functions
Mapping process grid coordinates to ranks
 C:
int MPI_Cart_rank(MPI_Comm comm,
int *coords, int *rank)
 Fortran:
MPI_CART_RANK (COMM, COORDS, RANK, IERROR)
INTEGER COMM, COORDS(*), RANK, IERROR
Cartesian Mapping Functions
Mapping ranks to process grid coordinates
 C:
int MPI_Cart_coords(MPI_Comm comm, int rank,
int maxdims, int *coords)
 Fortran:
MPI_CART_COORDS(COMM, RANK, MAXDIMS,COORDS,
IERROR)
INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR
Cartesian Mapping Functions
Computing ranks of my neighbouring processes
Following conventions of MPI_SendRecv
 C:
int MPI_Cart_shift(MPI_Comm comm,
int direction, int disp,
int *rank_source, int *rank_dest)
 Fortran:
MPI_CART_SHIFT(COMM, DIRECTION, DISP,
RANK_SOURCE, RANK_DEST, IERROR)
INTEGER COMM, DIRECTION, DISP,
RANK_SOURCE, RANK_DEST, IERROR
Cartesian Partitioning
Cut a grid up into “slices”.
A new communicator is produced for each slice.
Each slice can then perform its own collective
communications.
MPI_Cart_sub and MPI_CART_SUB generate
new communicators for the slices.
– Use array to specify which dimensions should be retained in the
new communicator.
Partitioning with MPI_CART_SUB
 C:
int MPI_Cart_sub (MPI_Comm comm,
int *remain_dims,
MPI_Comm *newcomm)
 Fortran:
MPI_CART_SUB (COMM, REMAIN_DIMS,
NEWCOMM,IERROR)
INTEGER COMM, NEWCOMM, IERROR
LOGICAL REMAIN_DIMS(*)
Collective Communications
Collective Communication
Communications involving a group of
processes.
Called by all processes in a communicator.
Examples:
– Barrier synchronisation.
– Broadcast, scatter, gather.
– Global sum, global maximum, etc.
Characteristics of Collective Comms
Collective action over a communicator.
All processes must communicate.
Synchronisation may or may not occur.
All collective operations are blocking.
No tags.
Receive buffers must be exactly the right size.
Barrier Synchronisation
 C:
int MPI_Barrier (MPI_Comm comm)
 Fortran:
MPI_BARRIER (COMM, IERROR)
INTEGER COMM, IERROR
Broadcast
 C:
int MPI_Bcast (void *buffer, int count,
MPI_Datatype datatype, int root,
MPI_Comm comm)
 Fortran:
MPI_BCAST (BUFFER, COUNT, DATATYPE, ROOT,
COMM, IERROR)
<type> BUFFER(*)
INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
Scatter
ABCDE
ABCDE
A
B
C
D
E
Scatter
 C:
int MPI_Scatter(void *sendbuf,
int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount,
MPI_Datatype recvtype, int root,
MPI_Comm comm)
 Fortran:
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE,
RECVBUF, RECVCOUNT, RECVTYPE,
ROOT, COMM, IERROR)
<type> SENDBUF, RECVBUF
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT
INTEGER RECVTYPE, ROOT, COMM, IERROR
Gather
A
B
C
D
E
C
D
E
ABCDE
A
B
Gather
 C:
int MPI_Gather(void *sendbuf, int sendcount,
MPI_Datatype sendtype, void *recvbuf,
int recvcount, MPI_Datatype recvtype,
int root, MPI_Comm comm)
 Fortran:
MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE,
RECVBUF, RECVCOUNT, RECVTYPE,
ROOT, COMM, IERROR)
<type> SENDBUF, RECVBUF
INTEGER
SENDCOUNT, SENDTYPE, RECVCOUNT
INTEGER RECVTYPE, ROOT, COMM, IERROR
Global Reduction Operations
Used to compute a result involving data
distributed over a group of processes.
Examples:
– global sum or product
– global maximum or minimum
– global user-defined operation
Predefined Reduction Operations
MPI Name
Function
MPI_MAX
Maximum
MPI_MIN
Minimum
MPI_SUM
Sum
MPI_PROD
Product
MPI_LAND
Logical AND
MPI_BAND
Bitwise AND
MPI_LOR
Logical OR
MPI_BOR
Bitwise OR
MPI_LXOR
Logical exclusive OR
MPI_BXOR
Bitwise exclusive OR
MPI_MAXLOC
Maximum and location
MPI_MINLOC
Minimum and location
MPI_Reduce
 C:
int MPI_Reduce(void *sendbuf, void *recvbuf,
int count, MPI_Datatype datatype,
MPI_Op op, int root, MPI_Comm comm)
 Fortran:
MPI_REDUCE(SENDBUF, RECVBUF, COUNT,
DATATYPE, OP, ROOT, COMM, IERROR)
<type> SENDBUF, RECVBUF
INTEGER
SENDCOUNT, SENDTYPE, RECVCOUNT
INTEGER RECVTYPE, ROOT, COMM, IERROR
MPI_REDUCE
Rank
Root
1
A B CD
A B CD
0
MPI_REDUCE
E F GH
E F GH
I J KL
I J KL
MNO P
MNO P
2
3
AoEoIoM
Example of Global Reduction
Integer global sum
 C:
MPI_Reduce(&x, &result, 1, MPI_INT,
MPI_SUM,0, MPI_COMM_WORLD)
 Fortran:
CALL MPI_REDUCE(x, result, 1, MPI_INTEGER,
MPI_SUM, 0,
MPI_COMM_WORLD, IERROR)
 Sum of all the x values is placed in result.
 The result is only placed there on processor 0.
User-Defined Reduction Operators
 Reducing using an arbitrary operator, 
 C - function of type MPI_User_function:
void my_op (void *invec,
void *inoutvec,int *len,
MPI_Datatype *datatype)
 Fortran - external subprogram of type
SUBROUTINE MY_OP(INVEC(*),INOUTVEC(*),
LEN, DATATYPE)
<type> INVEC(LEN), INOUTVEC(LEN)
INTEGER LEN, DATATYPE
Reduction Operator Functions
Operator function for  must act as:
for (i = 1 to len)
inoutvec(i) = inoutvec(i)

invec(i)
Operator  need not commute but must be
associative.
Registering User-Defined Operator
 Operator handles have type MPI_Op or INTEGER
 C:
int MPI_Op_create(MPI_User_function *my_op,
int commute, MPI_Op *op)
 Fortran:
MPI_OP_CREATE (MY_OP, COMMUTE, OP, IERROR)
EXTERNAL MY_OP
LOGICAL COMMUTE
INTEGER OP, IERROR
Variants of MPI_REDUCE
 MPI_Allreduce no root process
 MPI_Reduce_scatter result is scattered
 MPI_Scan “parallel prefix”
MPI_ALLREDUCE
Rank
0
A B CD
A B CD
MPI_ALLREDUCE
1
E F GH
E F GH
I J KL
I J KL
MNO P
MNO P
2
3
AoEoIoM
MPI_ALLREDUCE
Integer global sum
 C:
int MPI_Allreduce(void* sendbuf,
void* recvbuf, int count,
MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm)
 Fortran:
MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT,
DATATYPE, OP, COMM, IERROR)
MPI_SCAN
Rank
0
A B CD
A B CD
A
MPI_SCAN
1
E F GH
I J KL
AoE
E F GH
I J KL
2
AoEoI
3
MNO P
MNO P
AoEoIoM
MPI_SCAN
Integer partial sum
 C:
int MPI_Scan(void* sendbuf, void* recvbuf,
int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm)
 Fortran:
MPI_SCAN(SENDBUF, RECVBUF, COUNT,
DATATYPE, OP, COMM, IERROR)
Message Passing Programming
Designing MPI Applications
Overview
Lecture will cover
–
–
–
–
–
MPI portability
maintenance of serial code
general design
debugging
verification
MPI Portability
Potential deadlock
– you may be assuming that MPI_Send is asynchronous
– it often is buffered for small messages
• but threshhold can vary with implementation
– a correct code should run if you replace all MPI_Send calls
with MPI_Ssend
Buffer space
– cannot assume that there will be space for MPI_Bsend
– Sun default buffer space is zero!
– be sure to use MPI_Buffer_Attach
• some advice in MPI standard regarding required size
Data Sizes
Cannot assume anything about data sizes
– eg C float / Fortran REAL are 8 bytes on Cray T3E
– can be an issue when defining struct types
– use MPI_Type_extent to find out the number of bytes
Changing precision
– when changing from, say, float to double, must change all
the MPI types from MPI_FLOAT to MPI_DOUBLE as well
Easiest to achieve with an include file
– eg every routine includes precision.h
Changing Precision: C
Define a header file called, eg, precision.h
– typedef float RealNumber
– #define MPI_REALNUMBER MPI_FLOAT
Include in every function
–
–
–
–
#include “precision.h”
...
RealNumber x;
MPI_Routine(&x, MPI_REALNUMBER, ...);
Global change of precision now easy
– edit 2 lines in one file: float -> double, MPI_FLOAT -> MPI_DOUBLE
Changing Precision: Fortran
Define a module called, eg, precision
– integer, parameter :: REALNUMBER=kind(1.0e0)
– integer, parameter :: MPI_REALNUMBER = MPI_REAL
Use in every subroutine
–
–
–
–
use precision
...
REAL(kind=REALNUMBER):: x
call MPI_ROUTINE(x, MPI_REALNUMBER, ...)
Global change of precision now easy
– change 1.0e0 -> 1.0d0, MPI_REAL -> MPI_DOUBLE_PRECISION
Testing Portability
Run on more than one machine
– assuming the implementations are different
– most Beowulf systems will use MPICH
• running on different Beowulf machines may not be a good test
More than one implementation on same machine
– very useful test, and can give interesting performance numbers
Serial Code
Adding MPI can destroy a code
– would like to maintain a serial version
– ie can compile and run identical code without an MPI library
– not simply running MPI code with P=1!
Need to separate off communications routines
– put them all in a separate file
– provide a dummy library for the serial code
– no explicit reference to MPI in main code
Example: Initialisation
! parallel routine
subroutine par_begin(size, procid)
implicit none
integer :: size, procid
include "mpif.h"
call mpi_init(ierr)
call mpi_comm_size(MPI_COMM_WORLD, size, ierr)
call mpi_comm_rank(MPI_COMM_WORLD, procid, ierr)
procid = procid + 1
end subroutine par_begin
! dummy routine for serial machine
subroutine par_begin(size, procid)
implicit none
integer :: size, procid
size = 1
procid = 1
end subroutine par_begin
Example: Global Sum
! parallel routine
subroutine par_dsum(dval)
implicit none
include "mpif.h"
double precision :: dval, dtmp
call mpi_allreduce(dval, dtmp, 1, MPI_DOUBLE_PRECISION, &
MPI_SUM, comm, ierr)
dval = dtmp
end subroutine par_dsum
! dummy routine for serial machine
subroutine par_dsum(dval)
implicit none
double precision dval
end subroutine par_dsum
Example Makefile
SEQSRC= \
demparams.f90 demrand.f90 demcoord.f90 demhalo.f90 \
demforce.f90 demlink.f90 demcell.f90 dempos.f90 demons.f90
MPISRC= \
demparallel.f90 \
demcomms.f90
FAKESRC= \
demfakepar.f90 \
demfakecomms.f90
#PARSRC=$(FAKESRC)
PARSRC=$(MPISRC)
Advantages of Comms
Library
Can compile serial program from same source
– makes parallel code more readable
Enables code to be ported to other libraries
– more efficient but less versatile routines may exist
– eg Cray-specific SHMEM library
– can even choose to only port a subset of the routines
Library can be optimised for different MPIs
– eg choose the fastest send (Ssend, Send, Bsend?)
Design
Separate the communications into a library
Make parallel code similar as possible to serial
– could use the same update routine in serial and parallel
• only the array bounds would be different
– may have a large impact on the design of your serial code
Don’t try and be too clever
– don’t agonise whether one more halo swap is really necessary
– just do it for the sake of robustness
General Considerations
Compute everything everywhere
– eg use routines such as Allreduce
– perhaps the value only really needs to be know on the master
• but using Allreduce makes things simpler
• little or no performance implications
Often easiest to make P a compile-time constant
– may not seem elegant but can make coding much easier
• eg definition of array bounds
– put definition in an include file
– a clever Makefile can reduce the need for recompilation
• only recompile routines that define arrays rather than just use them
• pass array bounds as arguments to all other routines
Debugging
Parallel debugging can be hard
Don’t assume it’s a parallel bug!
– run the serial code first
– then the parallel code with P=1
– then on a small number of processes …
Writing output to separate files can be useful
– eg log.00, log.01, log.02, …. for ranks 0, 1, 2, ...
– need some way easily to switch this on and off
Some parallel debuggers exist
– Prism is available on Sun machines
– Totalview is the leader across all platforms
Verification: Is My Code
Working?
Should the output be identical for any P?
– very hard to accomplish in practice due to rounding errors
• may have to look hard to see differences in the last few digits
– typically, results vary slightly with number of processes
– need some way of quantifiying the differences from serial code
– and some definition of “acceptable”
What about the same code for fixed P?
– identical output for two runs on same number of processes?
– should be achievable with some care
• not in specific cases like dynamic task farms
• possible problems with global sums
• MPI doesn’t force reproducibility, but some implementations can
– without this, debugging is almost impossible
Parallelisation
Some parallel approaches may be simple
– but not necessarily optimal for performance
– often need to consider what is the realistic range of P
Some people write incredibly complicated code
– step back and ask: what do I actually want to do?
– is there an existing MPI routine or collective communication?
– should I reconsider my approach if it prohibits me from using
existing routines, even if it is not quite so efficient?
Optimisation
Keep running your code
– on a number of input data sets
– with a range of MPI processes
If scaling is poor
– find out what parallel routines are the bottlenecks
– again, much easier with a separate comms library
If performance is poor
– work on the serial code
– return to parallel issues later on
Conclusions
Run on a variety of machines
Keep it simple
Maintain a serial version
Don’t assume all bugs are parallel bugs
Find a debugger you like (good luck to you)

similar documents