Non-blocking point-to-point communication
In many parallel programs, processes must exchange data with one another in
order to continue their computations. Traditionally, these communications are
implemented using blocking operations such as MPI_Send and MPI_Recv. In a
blocking communication, the process initiating the communication is forced to
wait until the operation has completed before continuing execution.
While simple, this approach can be inefficient. In distributed-memory systems, message passing can be relatively slow compared to local computations. When every process waits for communications to complete before proceeding, significant amounts of time can be wasted in idle waiting, especially when the system scale is large or when communication latency is high.
However, in many cases, computations do not depend immediately on the communicated data. That means that communication and computation do not necessarily have to happen one after the other. If communication can proceed concurrently with computation, the total runtime can be reduced.
Non-blocking communication in MPI enables precisely this behavior: processes can initiate a data transfer and then perform other useful work while the message is being transmitted in the background. This capability allows us to overlap communication and computation and thereby improve performance.
Non-blocking communication in MPI
MPI provides non-blocking versions of its basic communication routines. The most
fundamental of these is MPI_Isend, the non-blocking send operation. When a
process calls MPI_Isend, it starts sending data to another process, but the
call returns immediately without waiting for the message to complete
transmission.
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
| Argument | Meaning |
|---|---|
| buf | (in) Initial address of send buffer |
| count | (in) Number of elements in the send buffer |
| datatype | (in) Datatype of each send buffer entry |
| dest | (in) Rank of destination |
| tag | (in) Message tag |
| comm | (in) Communicator |
| request | (out) Communication request |
This means that after calling MPI_Isend, the process is free to perform other
computations or to initiate additional communications while the MPI library
takes care of the data transfer in the background.

It’s important to note, however, that calling MPI_Isend does not guarantee
that the message has been sent, or even that the buffer is safe to reuse. The
communication progresses asynchronously, and the user must ensure that the data
in the send buffer is not modified until the send operation has completed. To
determine when this happens, completion routines such as MPI_Wait or
MPI_Test are used.
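As a minimal, self-contained sketch of this pattern (the ring exchange and the dummy local work below are illustrative placeholders, not part of the diffusion example discussed later), each rank can start a non-blocking send, do some computation, and only then complete the send:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Neighbors in a periodic ring.
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    double send_val = (double)rank;   // data to send (placeholder content)
    double recv_val = 0.0;

    // Start the send; the call returns immediately.
    MPI_Request request;
    MPI_Isend(&send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &request);

    // Useful work that does not touch send_val can go here.
    double local_sum = 0.0;
    for (int i = 0; i < 1000; i++)
        local_sum += i * 0.001;

    // Receive from the left neighbor, then complete the send.
    MPI_Recv(&recv_val, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&request, MPI_STATUS_IGNORE);  // send_val is now safe to modify

    printf("Rank %d received %.1f (local_sum = %.3f)\n", rank, recv_val, local_sum);

    MPI_Finalize();
    return 0;
}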
Checking non-blocking communication completion
When a non-blocking communication is initiated, MPI returns an MPI request
handle—an object of type MPI_Request. This handle represents the ongoing
communication operation.
MPI uses these request objects to track the progress and completion status of asynchronous operations. The request can later be passed to one of the completion routines to check whether the operation has finished or to block until it does.
There are two main families of completion routines:
- Wait routines, such as MPI_Wait and MPI_Waitall, which block the calling process until one or more requests have completed.
- Test routines, such as MPI_Test and MPI_Testall, which allow polling of the completion status without blocking, as sketched below.
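As a small sketch of the test-based approach (request is assumed to come from an earlier non-blocking call, and do_some_work() is a hypothetical placeholder for any available computation):

int flag = 0;
while (!flag) {
    do_some_work();                               // placeholder for useful computation
    MPI_Test(&request, &flag, MPI_STATUS_IGNORE); // flag becomes nonzero once the operation completes
}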
MPI_Wait is the simplest completion routine. It takes a single request as an
argument and blocks until the operation it represents has completed. Upon
completion, MPI fills an MPI_Status structure with information about the
communication (such as the source rank, tag, and number of received elements).
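For reference, its C prototype is:

int MPI_Wait(MPI_Request *request, MPI_Status *status)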
| Argument | Meaning |
|---|---|
| request | (in) The request |
| status | (out) Status object associated with the request |
When multiple non-blocking communications are in progress, it is often
convenient to wait for all of them simultaneously. This is done using
MPI_Waitall, which takes an array of request objects and blocks until all have
completed. Both send and receive operations can be waited on in this way.
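Its C prototype is:

int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])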
| Argument | Meaning |
|---|---|
| count | (in) Length of the array of requests |
| array_of_requests | (in) Array of requests to wait for |
| array_of_statuses | (out) Array of status objects |
When the operation completes and a wait function is called, the associated
MPI_Request is deallocated automatically and set to MPI_REQUEST_NULL.
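In practice, a completed request can therefore be recognized by comparing it against MPI_REQUEST_NULL (a small sketch, assuming request was returned by an earlier non-blocking call):

MPI_Wait(&request, MPI_STATUS_IGNORE);
// The wait has freed the request: it now compares equal to MPI_REQUEST_NULL,
// and waiting on it again would return immediately.
if (request == MPI_REQUEST_NULL) {
    // The operation has completed and the handle has been released.
}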
The 1D diffusion equation with non-blocking communication
To illustrate how non-blocking communication can improve performance, let’s return to the 1D diffusion equation example. Recall that the numerical update at each grid point depends on its two nearest neighbors. In a domain-decomposed setup, these neighbors may belong to adjacent processes, requiring communication of boundary (ghost) cells at each time step.
In the simplest implementation, these exchanges were handled using blocking send and receive calls. Each process sent data to its right neighbor, then waited to receive data from its left neighbor, and vice versa. This approach worked correctly but could cause unnecessary waiting, as processes often remained idle while waiting for communication to finish.
By replacing the blocking sends with non-blocking sends (MPI_Isend), we allow
the communication to start while the process continues executing other
instructions. The simplest mixed version still uses blocking receives, but even
this can reduce idle time because the sends can overlap with the subsequent
receives.
MPI_Request requests[2];
// Start sending the rightmost interior point to the right neighbor (non-blocking).
MPI_Isend(&uold[subdom_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &requests[0]);
// Receive the left ghost cell from the left neighbor (blocking).
MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// Start sending the leftmost interior point to the left neighbor (non-blocking).
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD, &requests[1]);
// Receive the right ghost cell from the right neighbor (blocking).
MPI_Recv(&uold[subdom_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// Make sure both sends have completed before uold is modified again.
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
Overlapping communication and computation
A more advanced optimization uses non-blocking sends and non-blocking receives together. This enables us to hide communication behind computation:
- Each process initiates non-blocking sends and receives for its ghost cells.
- While those messages are being transferred, it proceeds to compute the interior points of its subdomain (those that do not depend on the boundary data).
- After finishing the interior computations, the process waits for the boundary communications to complete using MPI_Waitall.
- Once the data in the ghost cells are ready, the process updates the boundary points.
In the previous example, we used a blocking MPI_Recv, but this operation could
just as well be replaced with a non-blocking receive.
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
| Argument | Meaning |
|---|---|
| buf | (out) Initial address of receive buffer |
| count | (in) Maximum number of elements to receive |
| datatype | (in) Datatype of each receive buffer entry |
| source | (in) Rank of source |
| tag | (in) Message tag |
| comm | (in) Communicator |
| request | (out) Communication request |
With MPI_Isend and MPI_Irecv, we can rewrite the example to overlap
communication with computation.
MPI_Request recv_requests[2];
MPI_Request send_requests[2];

// Post non-blocking receives for both ghost cells.
MPI_Irecv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD, &recv_requests[0]);
MPI_Irecv(&uold[subdom_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, &recv_requests[1]);
// Post non-blocking sends of the boundary points to the neighbors.
MPI_Isend(&uold[subdom_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &send_requests[0]);
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD, &send_requests[1]);

// Update the interior points, which do not depend on the ghost cells,
// while the communication proceeds in the background.
for (int i = 2; i <= subdom_size-1; i++) {
    unew[i] = uold[i] + alphadt_dx2 * (uold[i+1] - 2.0 * uold[i] + uold[i-1]);
}

// Wait until the ghost cells have arrived, then update the boundary points.
MPI_Waitall(2, recv_requests, MPI_STATUSES_IGNORE);
unew[1] = uold[1] + alphadt_dx2 * (uold[2] - 2.0 * uold[1] + uold[0]);
unew[subdom_size] = uold[subdom_size] + alphadt_dx2
    * (uold[subdom_size+1] - 2.0 * uold[subdom_size] + uold[subdom_size-1]);

// Make sure the sends have completed before uold is reused in the next step.
MPI_Waitall(2, send_requests, MPI_STATUSES_IGNORE);
In this way, communication time that would otherwise have been spent waiting is now effectively overlapped with computation. The overall runtime per time step is reduced, especially in large-scale systems where communication latency is significant.