Non-blocking communication
So far, we have only discussed blocking point-to-point communication. In blocking communication, a process sends or receives a message and waits for the transmission to complete before continuing.
Motivations
In some cases, using blocking communication will lead to a deadlock. To illustrate this, let's take the one-dimensional diffusion example from the previous chapter and change it to use periodic boundary conditions.
With periodic boundary conditions, the process at the start of the domain will
send its leftmost value to the last process (with rank world_size - 1),
while the last process will send its rightmost value to the process with rank
0. To implement this, we make the following changes:
-left_rank = my_rank > 0 ? my_rank - 1 : MPI_PROC_NULL;
+left_rank = my_rank > 0 ? my_rank - 1 : world_size - 1;
-right_rank = my_rank < world_size-1 ? my_rank + 1 : MPI_PROC_NULL;
+right_rank = my_rank < world_size-1 ? my_rank + 1 : 0;
As long as the eager protocol is used for the communication, this change should not prevent our code from working properly. It is very likely that the eager protocol will be used, as it is the preferred protocol for small message sizes. However, relying on the assumption that MPI will use the eager protocol is not good practice.
If we switch to synchronous sends for the communication:
-MPI_Send(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
+MPI_Ssend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
-MPI_Send(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
+MPI_Ssend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
MPI_Recv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
we end up in a deadlock. The reason is that all the processes start with a send
that blocks until the communication completes. Consequently, there is no opportunity
for an MPI_Recv to be called since all processes are stuck in an MPI_Ssend.
In contrast, in the non-periodic scenario, the last process doesn't transmit any
data to its right neighbor and can directly proceed with the MPI_Recv, thereby
enabling the communication to progress.
One way to get around the deadlock is to communicate in two steps, with even and odd ranks issuing their sends and receives in opposite order (even ranks send first, odd ranks receive first):
if (my_rank%2 == 0) {
    // even ranks: send first, then receive
    MPI_Ssend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
    MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Ssend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
    MPI_Recv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
} else {
    // odd ranks: receive first, then send
    MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Ssend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
    MPI_Recv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Ssend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD);
}
Non-blocking send and receive functions
An alternative to the blocking point-to-point functions are the non-blocking
send and receive. These functions have a name containing a capital "I", which
stands for "immediate" return. For example, the non-blocking equivalent of
MPI_Send is MPI_Isend, which has the following signature:
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)
The signature of the MPI_Isend function is similar to that of its blocking
counterpart, and the parameters of the non-blocking variant have the same meaning as
the ones used for the blocking send. The only difference is the request
parameter, which was not present in the case of the blocking send. The
MPI_Request returned by MPI_Isend represents a handle on the
non-blocking operation.
When we issue an MPI_Isend, we request a message to be sent
but we don't wait for this request to complete. Instead, MPI returns an
MPI_Request object that can be used at a later point in the execution of our
program to check that the communication did complete.
The non-blocking equivalent of MPI_Recv is MPI_Irecv, which has the
following signature:
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request * request)
Again, the parameters have the same meaning as for the blocking variant, with
the addition of an MPI_Request parameter in place of the MPI_Status
parameter.
Checking for communication completion
MPI_Wait waits for a non-blocking operation, i.e., it will block until the
underlying non-blocking operation completes. This function has the following signature:
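int MPI_Wait(MPI_Request *request, MPI_Status *status)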
Here request is a request handle on the non-blocking routine to wait on. This
handle is returned by a call to MPI_Isend or MPI_Irecv. status is an
MPI_Status in which the status returned by the non-blocking routine will be
stored. As with the blocking receive, we can pass MPI_STATUS_IGNORE if
we don't need the status.
For example, if we want to receive a single integer from the process of rank
0, we can issue an MPI_Irecv and later call MPI_Wait to wait for the
completion of the receive operation.
MPI_Request request;
MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
// do something
MPI_Wait(&request, MPI_STATUS_IGNORE);
MPI_Waitall is a version of MPI_Wait that can be applied to an array of
request handles. This function has the following signature:
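int MPI_Waitall(int count, MPI_Request array_of_requests[],
                MPI_Status array_of_statuses[])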
Here count is the number of request handles to wait on, array_of_requests
contains the request handles of the non-blocking routines to wait on, and
array_of_statuses is an array in which to return the statuses of the non-blocking
operations. If we don't need to examine the array_of_statuses, we can use
the predefined constant MPI_STATUSES_IGNORE. This constant is the equivalent
of MPI_STATUS_IGNORE for an array of statuses.
For example, if we want to send a single integer to and receive another from the
process of rank 0, we can issue an MPI_Isend and an MPI_Irecv before calling
MPI_Waitall to wait for the completion of both operations.
MPI_Request requests[2];
MPI_Isend(&myvalue, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Irecv(&othervalue, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[1]);
// do something
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
An alternative to MPI_Wait is MPI_Test. This function checks if a
non-blocking operation is complete. Unlike MPI_Wait, MPI_Test will not wait
for the underlying non-blocking operation to complete. This function has the
following signature:
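int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)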
MPI_Test sets flag to true if the operation identified by the request handle
request is complete. In that case, the status object status is set to
contain information on the completed operation.
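For example, a minimal sketch, following the pattern of the earlier examples where a single integer value is received from the process of rank 0, of how MPI_Test can be used to poll for completion while doing other work:
MPI_Request request;
int value, flag = 0;
MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
while (!flag) {
    // do some useful work while the message has not yet arrived
    MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
}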
Non-blocking diffusion in one dimension
Going back to the diffusion example, we can refactor the communication part of the code to use non-blocking sends and receives to avoid a deadlock in the periodic case.
This refactor is quite simple:
- replace all MPI_Send and MPI_Recv with MPI_Isend and MPI_Irecv
- issue an MPI_Waitall to wait for the completion of the communication
MPI_Request reqs[4];

// exchange the ghost cells with the left and right neighbors
MPI_Isend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0,
          MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(&uold[0], 1, MPI_DOUBLE, left_rank, 0,
          MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1,
          MPI_COMM_WORLD, &reqs[2]);
MPI_Irecv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1,
          MPI_COMM_WORLD, &reqs[3]);

// wait for all four communications to complete before updating the cells
MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
for (int i = 1; i <= my_size; i++) {
unew[i] = uold[i] + DIFF_COEF * dtdx2
* (uold[i+1] - 2.0 * uold[i] + uold[i-1]);
}
Overlapping communication and computation
In order to get the best performance with MPI, it is desirable to hide the communication. The objective is to create a scenario where computation can proceed concurrently with communication, allowing communication to progress in the background.
If we go back to the diffusion example, and study the way our problem is structured, we can see that
- all the values we need to update the value at the current time step were already computed at the previous time step
- the inner cells don't depend on the values of the ghost cells
From these observations, we can modify the code to initiate the non-blocking communication that updates the ghost cells and directly start updating the inner cells at the current time step. In a second step, we check that the communication has completed and update the outer cells that require the values of the ghost cells.
MPI_Request recv_reqs[2], send_reqs[2];

// start the ghost cell exchange
MPI_Irecv(&uold[0], 1, MPI_DOUBLE, left_rank, 0,
          MPI_COMM_WORLD, &recv_reqs[0]);
MPI_Irecv(&uold[my_size+1], 1, MPI_DOUBLE, right_rank, 1,
          MPI_COMM_WORLD, &recv_reqs[1]);
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1,
          MPI_COMM_WORLD, &send_reqs[0]);
MPI_Isend(&uold[my_size], 1, MPI_DOUBLE, right_rank, 0,
          MPI_COMM_WORLD, &send_reqs[1]);

// update the inner cells while the communication progresses
for (int i = 2; i < my_size; i++) {
    unew[i] = uold[i] + DIFF_COEF * dtdx2
              * (uold[i+1] - 2.0 * uold[i] + uold[i-1]);
}

// the ghost cells must have been received before updating the outer cells
MPI_Waitall(2, recv_reqs, MPI_STATUSES_IGNORE);
unew[1] = uold[1] + DIFF_COEF * dtdx2 * (uold[2] - 2.0 * uold[1] + uold[0]);
unew[my_size] = uold[my_size] + DIFF_COEF * dtdx2
                * (uold[my_size+1] - 2.0 * uold[my_size] + uold[my_size-1]);

// the sends must have completed before uold is reused (see below)
MPI_Waitall(2, send_reqs, MPI_STATUSES_IGNORE);
It's worth noting that we only need to verify the completion of the receive
operations before updating the outer cells, as our primary concern is ensuring
that the ghost cells have been updated. The outer cell updates do not modify the
values of uold, making it unnecessary to check for the completion of the send
operations at that point.
At the end of the time step, however, we will perform a pointer swap, which means
that uold will be modified, so we do need to check that the send operations have
completed.
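As an illustration, a minimal sketch of such an end-of-step swap, assuming that uold and unew are plain double * pointers (their actual declarations are not shown in the excerpts above):
// safe only once MPI_Waitall on send_reqs has returned,
// since uold can then be reused without corrupting the outgoing messages
double *tmp = uold;
uold = unew;
unew = tmp;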

