Non-blocking point-to-point communication
In many parallel programs, processes must exchange data with one another in
order to continue their computations. Traditionally, these communications are
implemented using blocking operations such as MPI_Send and MPI_Recv. In a
blocking communication, the process initiating the communication is forced to
wait until the operation has completed before continuing execution.
While simple, this approach can be inefficient. In distributed-memory systems, message passing can be relatively slow compared to local computations. When every process waits for communications to complete before proceeding, significant amounts of time can be wasted in idle waiting, especially when the system scale is large or when communication latency is high.
However, in many cases, computations do not depend immediately on the communicated data. That means that communication and computation do not necessarily have to happen one after the other. If communication can proceed concurrently with computation, the total runtime can be reduced.
Non-blocking communication in MPI enables precisely this behavior: processes can initiate a data transfer and then perform other useful work while the message is being transmitted in the background. This capability allows us to overlap communication and computation and thereby improve performance.
Non-blocking communication in MPI
MPI provides non-blocking versions of its basic communication routines. The most
fundamental of these is MPI_Isend, the non-blocking send operation. When a
process calls MPI_Isend, it starts sending data to another process, but the
call returns immediately without waiting for the message to complete
transmission.
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
| Argument | Meaning |
|---|---|
| buf | (in) Initial address of send buffer |
| count | (in) Number of elements in the send buffer |
| datatype | (in) Datatype of each send buffer entry |
| dest | (in) Rank of destination |
| tag | (in) Message tag |
| comm | (in) Communicator |
| request | (out) Communication request |
This means that after calling MPI_Isend, the process is free to perform other
computations or to initiate additional communications while the MPI library
takes care of the data transfer in the background.

It’s important to note, however, that calling MPI_Isend does not guarantee
that the message has been sent, or even that the buffer is safe to reuse. The
communication progresses asynchronously, and the user must ensure that the data
in the send buffer is not modified until the send operation has completed. To
determine when this happens, completion routines such as MPI_Wait or
MPI_Test are used.
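As a minimal, self-contained sketch of this pattern (the ring exchange and the dummy local work below are illustrative placeholders, not part of the diffusion example discussed later), each rank can start a non-blocking send, do some computation, and only then complete the send:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Neighbors in a periodic ring.
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    double send_val = (double)rank;   // data to send (placeholder content)
    double recv_val = 0.0;

    // Start the send; the call returns immediately.
    MPI_Request request;
    MPI_Isend(&send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &request);

    // Useful work that does not touch send_val can go here.
    double local_sum = 0.0;
    for (int i = 0; i < 1000; i++)
        local_sum += i * 0.001;

    // Receive from the left neighbor, then complete the send.
    MPI_Recv(&recv_val, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&request, MPI_STATUS_IGNORE);  // send_val is now safe to modify

    printf("Rank %d received %.1f (local_sum = %.3f)\n", rank, recv_val, local_sum);

    MPI_Finalize();
    return 0;
}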
Checking non-blocking communication completion
When a non-blocking communication is initiated, MPI returns an MPI request
handle—an object of type MPI_Request. This handle represents the ongoing
communication operation.
MPI uses these request objects to track the progress and completion status of asynchronous operations. The request can later be passed to one of the completion routines to check whether the operation has finished or to block until it does.
There are two main families of completion routines:
- Wait routines, such as MPI_Wait and MPI_Waitall, which block the calling process until one or more requests have completed.
- Test routines, such as MPI_Test and MPI_Testall, which allow polling of the completion status without blocking, as sketched below.
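As a small sketch of the test-based approach (request is assumed to come from an earlier non-blocking call, and do_some_work() is a hypothetical placeholder for any available computation):

int flag = 0;
while (!flag) {
    do_some_work();                               // placeholder for useful computation
    MPI_Test(&request, &flag, MPI_STATUS_IGNORE); // flag becomes nonzero once the operation completes
}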
MPI_Wait is the simplest completion routine. It takes a single request as an
argument and blocks until the operation it represents has completed. Upon
completion, MPI fills an MPI_Status structure with information about the
communication (such as the source rank, tag, and number of received elements).
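For reference, its C prototype is:

int MPI_Wait(MPI_Request *request, MPI_Status *status)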
| Argument | Meaning |
|---|---|
| request | (in) The request |
| status | (out) Status object associated with the request |
When multiple non-blocking communications are in progress, it is often
convenient to wait for all of them simultaneously. This is done using
MPI_Waitall, which takes an array of request objects and blocks until all have
completed. Both send and receive operations can be waited on in this way.
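Its C prototype is:

int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])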
| Argument | Meaning |
|---|---|
| count | (in) Length of the array of requests |
| array_of_requests | (in) Array of requests to wait for |
| array_of_statuses | (out) Array of status objects |
When the operation completes and a wait function is called, the associated
MPI_Request is deallocated automatically and set to MPI_REQUEST_NULL.
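In practice, a completed request can therefore be recognized by comparing it against MPI_REQUEST_NULL (a small sketch, assuming request was returned by an earlier non-blocking call):

MPI_Wait(&request, MPI_STATUS_IGNORE);
// The wait has freed the request: it now compares equal to MPI_REQUEST_NULL,
// and waiting on it again would return immediately.
if (request == MPI_REQUEST_NULL) {
    // The operation has completed and the handle has been released.
}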
The 1D diffusion equation with non-blocking communication
To illustrate how non-blocking communication can improve performance, let’s return to the 1D diffusion equation example. Recall that the numerical update at each grid point depends on its two nearest neighbors. In a domain-decomposed setup, these neighbors may belong to adjacent processes, requiring communication of boundary (ghost) cells at each time step.
In the simplest implementation, these exchanges were handled using blocking send and receive calls. Each process sent data to its right neighbor, then waited to receive data from its left neighbor, and vice versa. This approach worked correctly but could cause unnecessary waiting, as processes often remained idle while waiting for communication to finish.
By replacing the blocking sends with non-blocking sends (MPI_Isend), we allow
the communication to start while the process continues executing other
instructions. The simplest mixed version still uses blocking receives, but even
this can reduce idle time because the sends can overlap with the subsequent
receives.
MPI_Request requests[2];
// Start sending the rightmost interior point to the right neighbor (non-blocking).
MPI_Isend(&uold[subdom_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &requests[0]);
// Receive the left ghost cell from the left neighbor (blocking).
MPI_Recv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// Start sending the leftmost interior point to the left neighbor (non-blocking).
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD, &requests[1]);
// Receive the right ghost cell from the right neighbor (blocking).
MPI_Recv(&uold[subdom_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// Make sure both sends have completed before uold is modified again.
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
Overlapping communication and computation
A more advanced optimization uses non-blocking sends and non-blocking receives together. This enables us to hide communication behind computation:
- Each process initiates non-blocking sends and receives for its ghost cells.
- While those messages are being transferred, it proceeds to compute the interior points of its subdomain (those that do not depend on the boundary data).
- After finishing the interior computations, the process waits for the boundary communications to complete using MPI_Waitall.
- Once the data in the ghost cells are ready, the process updates the boundary points.
In the previous example, we used a blocking MPI_Recv, but this operation could
just as well be replaced with a non-blocking receive.
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
| Argument | Meaning |
|---|---|
| buf | (out) Initial address of receive buffer |
| count | (in) Maximum number of elements to receive |
| datatype | (in) Datatype of each receive buffer entry |
| source | (in) Rank of source |
| tag | (in) Message tag |
| comm | (in) Communicator |
| request | (out) Communication request |
With MPI_Isend and MPI_Irecv, we can rewrite the example to overlap
communication with computation.
MPI_Request recv_requests[2];
MPI_Request send_requests[2];

// Post non-blocking receives for both ghost cells.
MPI_Irecv(&uold[0], 1, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD, &recv_requests[0]);
MPI_Irecv(&uold[subdom_size+1], 1, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, &recv_requests[1]);
// Post non-blocking sends of the boundary points to the neighbors.
MPI_Isend(&uold[subdom_size], 1, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &send_requests[0]);
MPI_Isend(&uold[1], 1, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD, &send_requests[1]);

// Update the interior points, which do not depend on the ghost cells,
// while the communication proceeds in the background.
for (int i = 2; i <= subdom_size-1; i++) {
    unew[i] = uold[i] + alphadt_dx2 * (uold[i+1] - 2.0 * uold[i] + uold[i-1]);
}

// Wait until the ghost cells have arrived, then update the boundary points.
MPI_Waitall(2, recv_requests, MPI_STATUSES_IGNORE);
unew[1] = uold[1] + alphadt_dx2 * (uold[2] - 2.0 * uold[1] + uold[0]);
unew[subdom_size] = uold[subdom_size] + alphadt_dx2
    * (uold[subdom_size+1] - 2.0 * uold[subdom_size] + uold[subdom_size-1]);

// Make sure the sends have completed before uold is reused in the next step.
MPI_Waitall(2, send_requests, MPI_STATUSES_IGNORE);
In this way, communication time that would otherwise have been spent waiting is now effectively overlapped with computation. The overall runtime per time step is reduced, especially in large-scale systems where communication latency is significant.