Compiling and running the MPI "hello world"
The distributed memory paradigm
The distributed memory paradigm refers to a method of parallel programming where each processing unit has its own separate memory space. These memory spaces are not directly accessible by other processing units, and data cannot be shared between them without explicit communication.
In HPC, it is quite obvious why we need distributed memory: a cluster consists of multiple nodes linked together by a network fabric. Each node has its own memory, and the memory of one node cannot be accessed directly by another node.
As a result, explicit communication needs to take place to exchange data or information between processing units. Processes running on different nodes must send messages to share data or coordinate their activities.
What is MPI?
MPI, which stands for Message Passing Interface, is a standard describing a communication protocol as well as an API. It's the dominant model for parallel distributed memory scientific applications. MPI provides a set of functions that allow processes to send and receive messages to and from each other.
MPI is a standard, not a particular implementation. There are multiple MPI implementations available. Some of them are open source, like OpenMPI and MPICH, while others are closed source and developed by vendors, like Intel MPI and Cray MPICH.
Using MPI, parallelism is achieved by starting multiple application processes in parallel. Each of these processes will work on part of the data and communicate with the other processes when they need data owned by another process. This way of achieving parallelism is sometimes referred to as Single Program, Multiple Data (SPMD).
Initialization and finalization of an MPI program
Every MPI program starts with the same function which initializes the MPI execution environment. This function has the following signature
MPI_Init must be called before any other MPI function and it should be called only once. The argc and argv parameters are pointers to the number of arguments and to the arguments themselves. It is allowed to pass NULL for both the argc and argv parameters of MPI_Init.
The counterpart of MPI_Init is MPI_Finalize, which terminates the MPI execution environment and has the following signature
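int MPI_Finalize(void);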
Now that we have introduced these two functions, we can say that the basic template for an MPI application is the following
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // application code

    MPI_Finalize();
}
where we include the mpi.h header in order to have the MPI_Init and MPI_Finalize functions declared.
MPI function return codes
Every MPI function returns an integer that indicates success or failure of the function. The error codes returned by MPI are left entirely to the implementation, with one exception: MPI_SUCCESS, which is returned when there is no error.
An error code can be converted to a string using the MPI_Error_string function, which returns the error string associated with an error code. The string parameter must have a storage capacity of at least MPI_MAX_ERROR_STRING characters. The number of characters actually written is returned in resultlen. The signature of this function is
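int MPI_Error_string(int errorcode, char *string, int *resultlen);

As a minimal sketch (not part of the original examples), checking the return code of MPI_Init and converting it to a readable message could look like this:

int err = MPI_Init(&argc, &argv);
if (err != MPI_SUCCESS) {
    // Convert the error code to a human-readable string
    char err_string[MPI_MAX_ERROR_STRING];
    int resultlen;
    MPI_Error_string(err, err_string, &resultlen);
    printf("MPI_Init failed: %s\n", err_string);
}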
How many? Who am I? Where am I?
We now know how to initialize MPI. The next step is to be able to determine how
many processes have been started as well as uniquely identify these processes.
However, before we see how to get this information, we have to discuss the
concept of a communicator: a group of processes that can communicate with each other. MPI defines a default communicator, identified as MPI_COMM_WORLD, which contains all processes launched when the program started.
A communicator has a size which is the number of processes in the group
described by the communicator. The MPI_Comm_size function allows us to determine the size (number of processes) of a given communicator. The signature of this function is
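int MPI_Comm_size(MPI_Comm comm, int *size);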
where comm is the communicator. After the call to the function, size will have a value corresponding to the number of processes in the group described by the communicator.
Below is an example of usage of the MPI_Comm_size function. If four processes have been launched when the program started, the value of world_size after the call to MPI_Comm_size should be 4.
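int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);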
Another important function is MPI_Comm_rank. It allows us to get the rank of a process. A rank is an integer that is unique for each process in a communicator. The signature of the MPI_Comm_rank function is
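int MPI_Comm_rank(MPI_Comm comm, int *rank);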
where comm is the communicator. After the call to the function, rank will have a value which corresponds to the rank of the calling process in the group defined by the communicator. For a communicator with N processes, the range of values for the ranks will be between 0 and N-1.
Below is an example usage of the MPI_Comm_rank function. If two processes have been launched when the program started, the first process will have 0 as the value for the rank while the second process will have 1.
If we want to get the name of the piece of hardware on which a process is running, we can use the MPI_Get_processor_name function which has the following signature
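int MPI_Get_processor_name(char *name, int *resultlen);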
After the call to the function, name will contain a string that identifies a particular piece of hardware. This must be an array of size at least MPI_MAX_PROCESSOR_NAME. resultlen is the length (in characters) of the name. An example of usage of this function is presented below.
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
The MPI "hello world" program
Downloading the example on NIC5
This section and those that follow contain links to download the examples presented in the code blocks. You can download these files by copying the link address (right click on the link) and using the wget command on the login node of NIC5 to download the example:
In the previous section, we introduced MPI and saw how we can query the number of processes and the rank of a particular process. In order to put everything together, we will use a simple MPI "hello world" program:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from node %s, I'm rank %d out of %d ranks\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
}
This program is very simple. Each process will query the number of processes and its rank as well as the name of the node (processor) on which it is running. Then, it will print this information to the standard output.
Compiling MPI code
If we try to compile the MPI hello world with GCC, compilation will fail with the following output.
$ gcc -o mpi_hello mpi_hello.c
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Init'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'ompi_mpi_comm_world'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Comm_size'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'ompi_mpi_comm_world'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Comm_rank'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Get_processor_name'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Finalize'
collect2: error: ld returned 1 exit status
As you can guess from this output, GCC cannot find the definition and code for the MPI functions we used. The reason is that we did not provide the necessary compiler options to link our executable with the MPI library.
In order to get access to the MPI library, we need to load an MPI module. On NIC5, the recommended MPI implementation is OpenMPI, which can be loaded into our environment using the OpenMPI module:
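$ module load OpenMPI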
Once the module is loaded, we have access to the mpicc command, which is a utility designed for compiling MPI code. To compile our MPI hello world example, we can use the following command to produce an executable named mpi_hello.
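$ mpicc -o mpi_hello mpi_hello.c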
What mpicc does
mpicc is not really a compiler, it is a compiler wrapper around the underlying C compiler (in our case gcc). What it does is add the necessary MPI library flags and settings for compiling MPI code. You can see the underlying compiler call mpicc will perform by using the -show option:
$ mpicc -show
gcc -I/opt/cecisw/arch/easybuild/2021b/software/OpenMPI/4.1.2-GCC-11.2.0/include
-L/opt/cecisw/arch/easybuild/2021b/software/OpenMPI/4.1.2-GCC-11.2.0/lib
-L/opt/cecisw/arch/easybuild/2021b/software/hwloc/2.5.0-GCCcore-11.2.0/lib
-L/opt/cecisw/arch/easybuild/2021b/software/libevent/2.1.12-GCCcore-11.2.0/lib
-Wl,-rpath -Wl,/opt/cecisw/arch/easybuild/2021b/software/OpenMPI/4.1.2-GCC-11.2.0/lib
-Wl,-rpath -Wl,/opt/cecisw/arch/easybuild/2021b/software/hwloc/2.5.0-GCCcore-11.2.0/lib
-Wl,-rpath -Wl,/opt/cecisw/arch/easybuild/2021b/software/libevent/2.1.12-GCCcore-11.2.0/lib
-Wl,--enable-new-dtags -lmpi
mpicc does the following things:
- makes sure the MPI includes are found by the compiler (-I)
- adds additional search paths for the libraries (-L)
- adds run-time search paths to the executable (-Wl,-rpath)
- links the executable with the MPI library (-lmpi)
All GCC options can be used with mpicc
mpicc is a compiler wrapper calling gcc under the hood (see the dropdown box above). This means you can use any valid option for gcc with mpicc. For example, you can use -O3 to specify the level of optimization.
If we run this executable directly, we will get an output similar to the following
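$ ./mpi_hello
Hello world from node nic5-login1, I'm rank 0 out of 1 ranks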
which indicates that a single process (rank) was executed. Of course, we want to run the code in parallel and for that we need some kind of launcher in order to execute multiple instances of our program in parallel.
With OpenMPI, we can use mpirun as the launcher. For example, to launch four MPI processes we can use the command
$ mpirun -np 4 ./mpi_hello
Hello world from node nic5-login1, I'm rank 0 out of 4 ranks
Hello world from node nic5-login1, I'm rank 2 out of 4 ranks
Hello world from node nic5-login1, I'm rank 3 out of 4 ranks
Hello world from node nic5-login1, I'm rank 1 out of 4 ranks
where the -np 4 option instructs mpirun to launch four processes. And indeed, this time, our application output contains 4 lines, one line per process.
Do not omit the -np option on the login node
If you omit the -np option on the login node of NIC5, mpirun will automatically detect the 64 cores available on the login node and start 64 processes. This can overload the login node, which is not supposed to be used to run that many processes.
Submit an MPI job to the queue
Now that we have successfully compiled and run our first MPI application on the login node, the next step is to run it on a compute node.
In Slurm terminology, MPI ranks are called tasks and the number of ranks (tasks) can be specified using the following directive
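#SBATCH --ntasks=NUM_MPI_RANKS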
where NUM_MPI_RANKS is the number of MPI processes (ranks) we want to start.
Below is the job batch script we will use to run our MPI application with four
processes on the compute nodes.
#!/bin/bash
#SBATCH --job-name="MPI Hello World"
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=mpi_hello.out
module load OpenMPI
srun ./mpi_hello
Note that instead of using mpirun, we used srun to launch our executable. The difference between the two is that mpirun is the MPI launcher provided by OpenMPI while srun is the one provided by Slurm. For most use cases, the two options are equivalent. In the example above, we can replace srun with mpirun and it will produce exactly the same result.
Number of processes and Slurm
The number of processes to launch is automatically inferred from the Slurm environment by srun (and mpirun) depending on the options provided via the #SBATCH directives.
If you want to run small tests on the login node, use mpirun. srun can only be used to run executables on the compute nodes.
Now that we have our job batch script ready, we can submit it to the queue for execution on a compute node. For that, we use the sbatch command:
After the job has finished, we can have a look at the output, which is in the mpi_hello.out file:
$ cat mpi_hello.out
Hello world from node nic5-w032, I'm rank 0 out of 4 ranks
Hello world from node nic5-w060, I'm rank 1 out of 4 ranks
Hello world from node nic5-w060, I'm rank 3 out of 4 ranks
Hello world from node nic5-w060, I'm rank 2 out of 4 ranks
We can see that we have four lines in the output. Each line corresponds to one of the 4 MPI processes (ranks) we requested (#SBATCH --ntasks=4). We can also see that three ranks (1 to 3) ran on nic5-w060 while rank 0 ran on nic5-w032.
Your output will differ
When you run the example yourself, the output you obtain will differ from the one presented above. Depending on the available resources, your job will run on other compute nodes. There is also a possibility that all your ranks will run on the same compute node.
Playing with the Slurm parameters
Now that we have compiled and run our first MPI application on a compute node, let's investigate some additional Slurm options to be more specific about where and how the code should be run.
Imposing the number of nodes
In the previous example, we only specified the number of tasks to run (--ntasks=4) and let Slurm decide the number of nodes to use.
Slurm allows us to choose the number of nodes to use with
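#SBATCH --nodes=MIN_NODES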
which specifies the minimum number of nodes (MIN_NODES) our job should use. We can also specify the maximum number of nodes (MAX_NODES) using the following syntax.
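#SBATCH --nodes=MIN_NODES-MAX_NODES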
As an example, we will modify the previous example to force Slurm to allocate three nodes to our job.
#!/bin/bash
#SBATCH --job-name="MPI Hello World"
#SBATCH --nodes=3-3
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=mpi_hello_3nodes.out
module load OpenMPI
srun ./mpi_hello
Here, the only change is the addition of a line which instructs Slurm to allocate precisely three nodes for the job:
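#SBATCH --nodes=3-3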
Next, we submit the job to the queue and wait for it to complete before inspecting the output.
$ sbatch mpi_hello_3nodes.job
...
$ cat mpi_hello_3nodes.out
Hello world from node nic5-w015, I'm rank 2 out of 4 ranks
Hello world from node nic5-w016, I'm rank 3 out of 4 ranks
Hello world from node nic5-w006, I'm rank 0 out of 4 ranks
Hello world from node nic5-w006, I'm rank 1 out of 4 ranks
As we can see, 3 nodes were used this time, with ranks 0 and 1 being executed on nic5-w006 while ranks 2 and 3 were executed on nic5-w015 and nic5-w016, respectively.
Imposing the number of tasks (ranks) per node
Another option is to specify the number of nodes and the number of MPI ranks (tasks) per node. This is done by using the directive
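#SBATCH --ntasks-per-node=NUM_MPI_RANKS_PER_NODE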
where NUM_MPI_RANKS_PER_NODE will fix the number of processes launched on each node. For example, to use four nodes and launch a single process on each of these nodes, we can use the batch job script below.
#!/bin/bash
#SBATCH --job-name="MPI Hello World"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=mpi_hello_1taskpernode.out
module load OpenMPI
srun ./mpi_hello
We submit the job to the queue and wait for it to complete before inspecting the output.
$ sbatch mpi_hello_1taskpernode.job
...
$ cat mpi_hello_1taskpernode.out
Hello world from node nic5-w020, I'm rank 3 out of 4 ranks
Hello world from node nic5-w006, I'm rank 0 out of 4 ranks
Hello world from node nic5-w015, I'm rank 1 out of 4 ranks
Hello world from node nic5-w016, I'm rank 2 out of 4 ranks
We can see that, as requested in our batch script, four nodes have been allocated and that on each of these nodes, a single process was running.