
Non-blocking point-to-point communication

In many parallel programs, processes must exchange data with one another in order to continue their computations. Traditionally, these communications are implemented using blocking operations such as MPI_Send and MPI_Recv. In a blocking communication, the process initiating the communication is forced to wait until the operation has completed before continuing execution.

While simple, this approach can be inefficient. In distributed-memory systems, message passing can be relatively slow compared to local computations. When every process waits for communications to complete before proceeding, significant amounts of time can be wasted in idle waiting, especially when the system scale is large or when communication latency is high.

However, in many cases, computations do not depend immediately on the communicated data. That means that communication and computation do not necessarily have to happen one after the other. If communication can proceed concurrently with computation, the total runtime can be reduced.

Non-blocking communication in MPI enables precisely this behavior: processes can initiate a data transfer and then perform other useful work while the message is being transmitted in the background. This capability allows us to overlap communication and computation and thereby improve performance.

Non-blocking communication in MPI

MPI provides non-blocking versions of its basic communication routines. The most fundamental of these is MPI_Isend, the non-blocking send operation. When a process calls MPI_Isend, it starts sending data to another process, but the call returns immediately without waiting for the message to complete transmission.

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)
Argument Meaning
buf (in) Initial address of send buffer
count (in) Number of elements to send
datatype (in) Datatype of each send buffer entry
dest (in) Rank of destination
tag (in) Message tag
comm (in) Communicator
request (out) Communication request

This means that after calling MPI_Isend, the process is free to perform other computations or to initiate additional communications while the MPI library takes care of the data transfer in the background.

Figure: Non-blocking communication

It’s important to note, however, that calling MPI_Isend does not guarantee that the message has been sent, or even that the buffer is safe to reuse. The communication progresses asynchronously, and the user must ensure that the data in the send buffer is not modified until the send operation has completed. To determine when this happens, completion routines such as MPI_Wait or MPI_Test are used.
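
As a minimal sketch of this pattern (not part of the lesson's example; the buffer sizes, destination rank, tag, and intermediate work are arbitrary placeholders), a send can be started, overlapped with unrelated work, and then completed with MPI_Wait:

double sendbuf[100] = {0.0}, local[100] = {0.0};
MPI_Request request;

/* Start the send; the call returns immediately. */
MPI_Isend(sendbuf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request);

/* Useful work that does not touch sendbuf can proceed meanwhile. */
for (int i = 0; i < 100; i++)
  local[i] = 2.0 * local[i];

/* Block until the send has completed; only then is sendbuf safe to reuse. */
MPI_Wait(&request, MPI_STATUS_IGNORE);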

Checking non-blocking communication completion

When a non-blocking communication is initiated, MPI returns an MPI request handle—an object of type MPI_Request. This handle represents the ongoing communication operation.

MPI uses these request objects to track the progress and completion status of asynchronous operations. The request can later be passed to one of the completion routines to check whether the operation has finished or to block until it does.

There are two main families of completion routines:

  • Wait routines, such as MPI_Wait and MPI_Waitall, which block the calling process until one or more requests have completed.
  • Test routines, such as MPI_Test and MPI_Testall, which allow polling of the completion status without blocking, as sketched below.
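
The following sketch (not taken from the lesson code; the buffer, destination rank, and tag are placeholders) shows the typical MPI_Test polling pattern:

MPI_Request request;
MPI_Status status;
int flag = 0;
double sendbuf[100] = {0.0};

MPI_Isend(sendbuf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request);

/* Poll until flag becomes nonzero, doing other useful work between the polls. */
while (!flag) {
  MPI_Test(&request, &flag, &status);
  /* ... other useful work here ... */
}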

MPI_Wait is the simplest completion routine. It takes a single request as an argument and blocks until the operation it represents has completed. Upon completion, MPI fills an MPI_Status structure with information about the communication (such as the source rank, tag, and number of received elements).

int MPI_Wait(MPI_Request *request, MPI_Status *status)
Argument Meaning
request (inout) Request to wait for (set to MPI_REQUEST_NULL on completion)
status (out) Status object describing the completed operation
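
For instance (a sketch, assuming a matching send posted elsewhere and using the non-blocking receive MPI_Irecv, which is described later in this section), the status object can be inspected after the wait:

MPI_Request request;
MPI_Status status;
int count;
double recvbuf[100];

MPI_Irecv(recvbuf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &request);

/* Block until the receive has completed. */
MPI_Wait(&request, &status);

/* The status records the actual source, tag, and message size. */
MPI_Get_count(&status, MPI_DOUBLE, &count);
printf("Received %d doubles from rank %d with tag %d\n",
       count, status.MPI_SOURCE, status.MPI_TAG);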

When multiple non-blocking communications are in progress, it is often convenient to wait for all of them simultaneously. This is done using MPI_Waitall, which takes an array of request objects and blocks until all have completed. Both send and receive operations can be waited on in this way.

int MPI_Waitall(int count, MPI_Request array_of_requests[],
                MPI_Status *array_of_statuses)
Argument Meaning
count (in) Number of requests in the array
array_of_requests (inout) Array of requests to wait for
array_of_statuses (out) Array of status objects, one per request

Once a wait routine returns, the completed operation's MPI_Request is deallocated automatically and the handle is set to MPI_REQUEST_NULL.
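
As a brief sketch (not from the lesson; it assumes a periodic ring of ranks and uses MPI_Irecv, described later in this section), a send and a receive can be completed with a single MPI_Waitall:

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int left  = (rank - 1 + size) % size;   /* periodic neighbors, for illustration */
int right = (rank + 1) % size;

double send_val = 1.0, recv_val;
MPI_Request requests[2];

MPI_Irecv(&recv_val, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &requests[0]);
MPI_Isend(&send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &requests[1]);

/* Blocks until both operations have completed; afterwards both handles
   are MPI_REQUEST_NULL and the buffers are safe to reuse. */
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);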

The 1D diffusion equation with non-blocking communication

To illustrate how non-blocking communication can improve performance, let’s return to the 1D diffusion equation example. Recall that the numerical update at each grid point depends on its two nearest neighbors. In a domain-decomposed setup, these neighbors may belong to adjacent processes, requiring communication of boundary (ghost) cells at each time step.

In the simplest implementation, these exchanges were handled using blocking send and receive calls. Each process sent data to its right neighbor, then waited to receive data from its left neighbor, and vice versa. This approach worked correctly but could cause unnecessary waiting, as processes often remained idle while waiting for communication to finish.
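
For reference, that blocking exchange might have looked roughly like the sketch below (the earlier lesson's exact code may differ; the variable names match the non-blocking versions that follow):

/* Send last interior point to the right, receive the left ghost cell. */
MPI_Send(&uold[subdom_size],  1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD);
MPI_Recv(&uold[0],            1, MPI_DOUBLE, left_rank,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

/* Send first interior point to the left, receive the right ghost cell. */
MPI_Send(&uold[1],             1, MPI_DOUBLE, left_rank,  1, MPI_COMM_WORLD);
MPI_Recv(&uold[subdom_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Each rank sits idle in its receive calls until the neighbor's message arrives, which is the waiting that the non-blocking versions aim to reduce.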

By replacing the blocking sends with non-blocking sends (MPI_Isend), we let the communication start while the process continues executing other instructions. The simplest mixed version keeps the blocking receives, but even this can reduce idle time because the sends can overlap with the subsequent receives.

MPI_Request requests[2];

MPI_Isend(&uold[subdom_size],  1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Recv(&uold[0],             1, MPI_DOUBLE, left_rank,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Isend(&uold[1],            1, MPI_DOUBLE, left_rank,  1, MPI_COMM_WORLD, &requests[1]);
MPI_Recv(&uold[subdom_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

Overlapping communication and computation

A more advanced optimization uses non-blocking sends and non-blocking receives together. This enables us to hide communication behind computation:

  • Each process initiates non-blocking sends and receives for its ghost cells.
  • While those messages are being transferred, it proceeds to compute the interior points of its subdomain—those that do not depend on the boundary data.
  • After finishing the interior computations, the process waits for the boundary communications to complete using MPI_Waitall.
  • Once the data in the ghost cells are ready, the process updates the boundary points.

In the previous example, we used a blocking MPI_Recv, but this operation could just as well be replaced with a non-blocking receive.

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Request *request)
Argument Meaning
buf (out) Initial address of receive buffer
count (in) Maximum number of elements to receive
datatype (in) Datatype of each receive buffer entry
source (in) Rank of source
tag (in) Message tag
comm (in) Communicator
request (out) Communication request

With MPI_Isend and MPI_Irecv, we can rewrite the example to overlap communication with computation.

MPI_Request recv_requests[2];
MPI_Request send_requests[2];

MPI_Irecv(&uold[0],             1, MPI_DOUBLE, left_rank,  0, MPI_COMM_WORLD, &recv_requests[0]);
MPI_Irecv(&uold[subdom_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, &recv_requests[1]);

MPI_Isend(&uold[subdom_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &send_requests[0]);
MPI_Isend(&uold[1],           1, MPI_DOUBLE, left_rank,  1, MPI_COMM_WORLD, &send_requests[1]);

for (int i = 2; i <= subdom_size-1; i++) {
  unew[i] =  uold[i] + alphadt_dx2 * (uold[i+1] - 2.0 * uold[i] + uold[i-1]);
}

MPI_Waitall(2, recv_requests, MPI_STATUSES_IGNORE);

unew[1] =  uold[1] + alphadt_dx2 * (uold[2] - 2.0 * uold[1] + uold[0]);
unew[subdom_size] =  uold[subdom_size] + alphadt_dx2 
      * (uold[subdom_size+1] - 2.0 * uold[subdom_size] + uold[subdom_size-1]);

MPI_Waitall(2, send_requests, MPI_STATUSES_IGNORE);

In this way, communication time that would otherwise have been spent waiting is now effectively overlapped with computation. The overall runtime per time step is reduced, especially in large-scale systems where communication latency is significant.