Non-blocking communication
So far, we have only discussed blocking point-to-point communication. In blocking communication, a process sends or receives messages and waits for the transmission to complete before continuing.
Motivations
In some cases, using blocking communication can lead to a deadlock. To illustrate this, let's modify the one-dimensional diffusion example from the previous chapter to use periodic boundary conditions.
With periodic boundary conditions, the process at the start of the domain will send the leftmost value to the last process (with rank world_size - 1), while the last process will send its rightmost value to the process with rank 0. To implement this, we make the following changes:
-left_rank = my_rank > 0 ? my_rank - 1 : MPI_PROC_NULL;
+left_rank = my_rank > 0 ? my_rank - 1 : world_size - 1;
-right_rank = my_rank < world_size-1 ? my_rank + 1 : MPI_PROC_NULL;
+right_rank = my_rank < world_size-1 ? my_rank + 1 : 0;
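For context, a minimal sketch of the neighbor setup with periodic boundaries (assuming my_rank and world_size are obtained with MPI_Comm_rank and MPI_Comm_size, as in the previous chapter) could look like this:
int my_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
/* periodic boundaries: rank 0 wraps around to the last rank and vice versa */
int left_rank  = my_rank > 0 ? my_rank - 1 : world_size - 1;
int right_rank = my_rank < world_size - 1 ? my_rank + 1 : 0;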
As long as the eager protocol is used for communication, this change should not prevent our code from working properly. The eager protocol is very likely to be used, as it is the preferred protocol for small message sizes. However, relying on the assumption that MPI will use the eager protocol is not good practice.
If we switch to synchronous sends for the communication:
-MPI_Send(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
+MPI_Ssend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
-MPI_Send(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
+MPI_Ssend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
MPI_Recv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
we end up in a deadlock. The reason is that all the processes start with a send that blocks until the communication completes. Consequently, there is no opportunity for an MPI_Recv to be called since all processes are stuck in an MPI_Ssend.
In contrast, in the non-periodic scenario, the last process doesn't transmit any data to the right neighbor and can directly proceed with the MPI_Recv, thereby enabling the communication to progress.
One way to get around the deadlock is to communicate in two steps, with the even and odd ranks issuing their sends and receives in opposite order (even ranks send first, odd ranks receive first):
if (my_rank%2 == 0) {
MPI_Ssend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
MPI_Ssend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
MPI_Recv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
} else {
MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
MPI_Ssend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
MPI_Recv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
MPI_Ssend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
}
Non-blocking send and receive functions
An alternative to the blocking point-to-point functions are the non-blocking sends and receives. These functions have a name containing a capital "I", which stands for "immediate" return. For example, the non-blocking equivalent of MPI_Send is MPI_Isend, which has the following signature:
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)
The signature of the MPI_Isend function is similar to the one of its blocking counterpart, and the parameters of the non-blocking variant have the same meaning as those used for the blocking send. The only difference is the request parameter, which was not present in the case of the blocking send. The MPI_Request that is returned by MPI_Isend represents a handle on the non-blocking operation.
What we do when we issue an MPI_Isend is that we request a message to be sent, but we don't wait for this request to complete. Instead, MPI returns an MPI_Request object that can be used at a later point in the execution of our program to check that the communication did complete.
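For illustration, here is a minimal sketch of this pattern (the value 42 and the destination rank 1 are arbitrary choices; MPI_Wait, used to check completion, is described in the next section):
int value = 42;      /* data to send */
MPI_Request request;
MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
/* ... do work that does not modify value ... */
MPI_Wait(&request, MPI_STATUS_IGNORE);  /* after this, value may safely be reused */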
The non-blocking equivalent of an MPI_Recv is MPI_Irecv, which has the following signature:
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request * request)
Again, the parameters have the same meaning as for the blocking variant, with the addition of an MPI_Request parameter in place of the MPI_Status parameter.
Checking for communication completion
MPI_Wait waits for a non-blocking operation, i.e., it will block until the underlying non-blocking operation completes. This function has the following signature:
int MPI_Wait(MPI_Request *request, MPI_Status *status)
with request, a request handle on the non-blocking routine to wait on. This handle is returned by a call to MPI_Isend or MPI_Irecv. status is an MPI_Status in which the status returned by the non-blocking routine will be stored. As with the blocking receive, we can pass MPI_STATUS_IGNORE if we don't need the status.
For example, if we want to receive a single integer from the process of rank 0, we can issue an MPI_Irecv before calling MPI_Wait to check the completion of the receive operation.
MPI_Request request;
MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
// do something
MPI_Wait(&request, MPI_STATUS_IGNORE);
MPI_Waitall is a version of MPI_Wait that can be applied to an array of request handles. This function has the following signature:
int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])
with count the number of request handles to wait on, array_of_requests the request handles on the non-blocking routines to wait on, and array_of_statuses an array in which to return the statuses of the non-blocking operations. If we don't need to examine the array_of_statuses, we can use the predefined constant MPI_STATUSES_IGNORE. This constant is the equivalent of MPI_STATUS_IGNORE for an array of statuses.
For example, if we want to send and receive a single integer to and from the process of rank 0, we can issue an MPI_Isend and an MPI_Irecv before calling MPI_Waitall to check the completion of these operations.
MPI_Request requests[2];
MPI_Isend(&myvalue, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Irecv(&othervalue, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[1]);
// do something
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
An alternative to MPI_Wait is MPI_Test. This function checks if a non-blocking operation is complete. Unlike MPI_Wait, MPI_Test will not wait for the underlying non-blocking operation to complete. This function has the following signature:
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_Test returns flag = true if the operation identified by the request handle request is complete. In such a case, the status object status is set to contain information on the completed operation.
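For example, a minimal sketch of polling with MPI_Test (assuming, as above, that we expect a single integer from the process of rank 0) could look like this:
int value;           /* buffer for the incoming integer */
int flag = 0;
MPI_Request request;
MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
while (!flag) {
    /* ... do some useful computation while the message is in flight ... */
    MPI_Test(&request, &flag, MPI_STATUS_IGNORE);  /* flag becomes true once the receive has completed */
}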
Non-blocking diffusion in one dimension
Going back to the diffusion example, we can refactor the communication part of the code to use non-blocking sends and receives to avoid a deadlock in the periodic case.
This refactor is quite simple:
- replace all MPI_Send and MPI_Recv with MPI_Isend and MPI_Irecv
- issue an MPI_Waitall to wait for the completion of the communication
MPI_Request reqs[4];  /* request handles for the four communications */
MPI_Isend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0,
MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(&uold[0], 1, MPI_DOUBLE, left_rank, 0,
MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1,
MPI_COMM_WORLD, &reqs[2]);
MPI_Irecv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1,
MPI_COMM_WORLD, &reqs[3]);
MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
for (int i = 1; i <= my_size; i++) {
unew[i] = uold[i] + DIFF_COEF * dtdx2
* (uold[i+1] - 2.0 * uold[i] + uold[i-1]);
}
Overlapping communication and computation
In order to get the best performance with MPI, it is desirable to hide the communication. The objective is to create a scenario where computation can proceed concurrently with communication, allowing communication to progress in the background.
If we go back to the diffusion example and study the way our problem is structured, we can see that
- all the values we need to update the solution at the current time step were already computed at the previous time step
- the inner cells don't depend on the values of the ghost cells
From these observations, we can modify the code to initiate the non-blocking communication for the ghost cell update and directly start updating the inner cells at the current time step. In a second step, we check that the communication has completed and update the outer cells that require the values of the ghost cells.
MPI_Request recv_reqs[2], send_reqs[2];  /* separate handles for receives and sends */
MPI_Irecv(&uold[0], 1, MPI_DOUBLE, left_rank, 0,
MPI_COMM_WORLD, &recv_reqs[0]);
MPI_Irecv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1,
MPI_COMM_WORLD, &recv_reqs[1]);
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1,
MPI_COMM_WORLD, &send_reqs[0]);
MPI_Isend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0,
MPI_COMM_WORLD, &send_reqs[1]);
for (int i = 2; i < my_size; i++) {
unew[i] = uold[i] + DIFF_COEF * dtdx2
* (uold[i+1] - 2.0 * uold[i] + uold[i-1]);
}
MPI_Waitall(2, recv_reqs, MPI_STATUSES_IGNORE);
unew[1] = uold[1] + DIFF_COEF * dtdx2 * (uold[ 2] - 2.0 * uold[1] + uold[0]);
unew[my_size] = uold[my_size] + DIFF_COEF * dtdx2
* (uold[my_size+1] - 2.0 * uold[my_size] + uold[my_size-1]);
MPI_Waitall(2, send_reqs, MPI_STATUSES_IGNORE);
It's worth noting that we only need to verify the completion of the receive operations before updating the outer cells, as our primary concern is ensuring that the ghost cells have been updated. The outer cell update does not modify the values of uold, making it unnecessary to check for the completion of the send operations at this point.
At the end of the time step, however, we will perform a pointer swap, which means that uold will be modified, so we need to check that the send operations have completed before that point.
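As a minimal sketch (assuming uold and unew are pointers to the two solution arrays, following the naming used in the snippets above, and repeating the send-completion wait for clarity), the pointer swap at the end of the time step could look like this:
MPI_Waitall(2, send_reqs, MPI_STATUSES_IGNORE);  /* the sends no longer need the data in uold */
double *tmp = uold;  /* swap the buffers: the newly computed values become the old ones */
uold = unew;
unew = tmp;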