Compiling and running the MPI "hello world"
The distributed memory paradigm
The distributed memory paradigm refers to a method of parallel programming where each processing unit has its own separate memory space. These memory spaces are not directly accessible by other processing units, and data cannot be shared between them without explicit communication.
In HPC, it is quite obvious why we need distributed memory: a cluster consists of multiple nodes linked together by a network fabric. Each node has its own memory, and the memory of one node cannot be accessed directly by another node.
As a result, explicit communication needs to take place to exchange data or information between processing units. Processes running on different nodes must send messages to share data or coordinate their activities.
What is MPI?
MPI, which stands for Message Passing Interface, is a standard describing a communication protocol as well as an API. It's the dominant model for parallel distributed memory scientific applications. MPI provides a set of functions that allow processes to send and receive messages to and from each other.
MPI is a standard, not a particular implementation. There are multiple MPI implementations available. Some of them are open source, like OpenMPI and MPICH, while others are closed source and developed by vendors, like Intel MPI and Cray MPICH.
Using MPI, parallelism is achieved by starting multiple application processes in parallel. Each of these processes will work on part of the data and communicate with the other processes when they need data owned by another process. This way of achieving parallelism is sometimes referred to as Single Program, Multiple Data (SPMD).
Initialization and finalization of an MPI program
Every MPI program starts with the same function which initializes the MPI execution environment. This function has the following signature
MPI_Init must be called before any other MPI function and it should be called only once. The argc and argv parameters are pointers to the number of arguments and to the arguments themselves. It is allowed to pass NULL for both the argc and argv parameters of MPI_Init.
The counterpart of MPI_Init is MPI_Finalize, which terminates the MPI execution environment and has the following signature
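int MPI_Finalize(void);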
Now that we have introduced these two functions, we can say that the basic template for an MPI application is the following
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // application code

    MPI_Finalize();
}
where we include the mpi.h header in order to have the MPI_Init and MPI_Finalize functions declared.
MPI function return codes
Every MPI function returns an integer that indicates success or failure of the function. The error codes returned by MPI are left entirely to the implementation, with one exception: MPI_SUCCESS, which is returned when there is no error.
An error code can be converted to a string using the MPI_Error_string function, which returns the error string associated with an error code. The string parameter must have a storage capacity of at least MPI_MAX_ERROR_STRING characters. The number of characters actually written is returned in resultlen. The signature of this function is
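int MPI_Error_string(int errorcode, char *string, int *resultlen);

As a minimal sketch (not part of the original examples), checking the return code of MPI_Init and converting it to a readable message could look like this:

int err = MPI_Init(&argc, &argv);
if (err != MPI_SUCCESS) {
    // Convert the error code to a human-readable string
    char err_string[MPI_MAX_ERROR_STRING];
    int resultlen;
    MPI_Error_string(err, err_string, &resultlen);
    printf("MPI_Init failed: %s\n", err_string);
}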
How many? Who am I? Where am I?
We now know how to initialize MPI. The next step is to be able to determine how
many processes have been started as well as uniquely identify these processes.
However, before we see how to get this information, we have to discuss the
concept of a communicator: a group of processes that can communicate with each other. MPI defines a default communicator, identified as MPI_COMM_WORLD, which contains all processes launched when the program started.
A communicator has a size which is the number of processes in the group
described by the communicator. The MPI_Comm_size function allows us to determine the size (number of processes) of a given communicator. The signature of this function is
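int MPI_Comm_size(MPI_Comm comm, int *size);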
where comm is the communicator. After the call to the function, size will have a value corresponding to the number of processes in the group described by the communicator.
Below is an example of usage of the MPI_Comm_size function. If four processes have been launched when the program started, the value of world_size after the call to MPI_Comm_size should be 4.
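int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);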
Another important function is MPI_Comm_rank. It allows us to get the rank of a process. A rank is an integer that is unique for each process in a communicator. The signature of the MPI_Comm_rank function is
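int MPI_Comm_rank(MPI_Comm comm, int *rank);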
where comm is the communicator. After the call to the function, rank will have a value which corresponds to the rank of the calling process in the group defined by the communicator. For a communicator with N processes, the range of values for the ranks will be between 0 and N-1.
Below is an example usage of the MPI_Comm_rank function. If two processes have been launched when the program started, the first process will have 0 as the value for the rank while the second process will have 1.
If we want to get the name of the piece of hardware on which a process is running, we can use the MPI_Get_processor_name function which has the following signature
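int MPI_Get_processor_name(char *name, int *resultlen);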
After the call to the function, name will contain a string that identifies a particular piece of hardware. This must be an array of size at least MPI_MAX_PROCESSOR_NAME. resultlen is the length (in characters) of the name. An example of usage of this function is presented below.
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
The MPI "hello world" program
Downloading the example on NIC5
This section and those that follow contain links to download the examples presented in the code blocks. You can download these files by copying the link address (right click on the link) and using the wget command on the login node of NIC5 to download the example:
In the previous section, we introduced MPI and saw how we can query the number of processes and the rank of a particular process. In order to put everything together, we will use a simple MPI "hello world" program:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from node %s, I'm rank %d out of %d ranks\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
}
This program is very simple. Each process will query the number of processes and its rank as well as the name of the node (processor) on which it is running. Then, it will print this information to the standard output.
Compiling MPI code
If we try to compile the MPI hello world with GCC, compilation will fail with the following output.
$ gcc -o mpi_hello mpi_hello.c
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Init'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'ompi_mpi_comm_world'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Comm_size'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'ompi_mpi_comm_world'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Comm_rank'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Get_processor_name'
/tmp/ccOr7vCY.o:mpi_hello.c:function main: error: undefined reference to 'MPI_Finalize'
collect2: error: ld returned 1 exit status
As you can guess from this output, GCC cannot find the definition and code for the MPI functions we used. The reason is that we did not provide the necessary compiler options to link our executable with the MPI library.
In order to get access to the MPI library, we need to load an MPI module. On NIC5, the recommended MPI implementation is OpenMPI, which can be loaded into our environment using the OpenMPI module:
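$ module load OpenMPI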
Once the module is loaded, we have access to the mpicc command, which is a utility designed for compiling MPI code. To compile our MPI hello world example, we can use the following command to produce an executable named mpi_hello.
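$ mpicc -o mpi_hello mpi_hello.c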
What mpicc does
mpicc is not really a compiler, it is a compiler wrapper around the underlying C compiler (in our case gcc). What it does is add the necessary MPI library flags and settings for compiling MPI code. You can see the underlying compiler call mpicc will perform by using the -show option:
$ mpicc -show
gcc -I/opt/cecisw/arch/easybuild/2021b/software/OpenMPI/4.1.2-GCC-11.2.0/include
-L/opt/cecisw/arch/easybuild/2021b/software/OpenMPI/4.1.2-GCC-11.2.0/lib
-L/opt/cecisw/arch/easybuild/2021b/software/hwloc/2.5.0-GCCcore-11.2.0/lib
-L/opt/cecisw/arch/easybuild/2021b/software/libevent/2.1.12-GCCcore-11.2.0/lib
-Wl,-rpath -Wl,/opt/cecisw/arch/easybuild/2021b/software/OpenMPI/4.1.2-GCC-11.2.0/lib
-Wl,-rpath -Wl,/opt/cecisw/arch/easybuild/2021b/software/hwloc/2.5.0-GCCcore-11.2.0/lib
-Wl,-rpath -Wl,/opt/cecisw/arch/easybuild/2021b/software/libevent/2.1.12-GCCcore-11.2.0/lib
-Wl,--enable-new-dtags -lmpi
mpicc does the following things:
- makes sure the MPI includes are found by the compiler (-I)
- adds additional search paths for the libraries (-L)
- adds run-time search paths to the executable (-Wl,-rpath)
- links the executable with the MPI library (-lmpi)
All GCC options can be used with mpicc
mpicc is a compiler wrapper calling gcc under the hood (see the dropdown box above). This means you can use any valid option for gcc with mpicc. For example, you can use -O3 to specify the level of optimization.
If we run this executable directly, we will get an output similar to the following
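$ ./mpi_hello
Hello world from node nic5-login1, I'm rank 0 out of 1 ranks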
which indicates that a single process (rank) was executed. Of course, we want to run the code in parallel and for that we need some kind of launcher in order to execute multiple instances of our program in parallel.
With OpenMPI, we can use mpirun as the launcher. For example, to launch four MPI processes we can use the command
$ mpirun -np 4 ./mpi_hello
Hello world from node nic5-login1, I'm rank 0 out of 4 ranks
Hello world from node nic5-login1, I'm rank 2 out of 4 ranks
Hello world from node nic5-login1, I'm rank 3 out of 4 ranks
Hello world from node nic5-login1, I'm rank 1 out of 4 ranks
where the -np 4 option instructs mpirun to launch four processes. And indeed, this time, our application output contains 4 lines, one line per process.
Do not omit the -np option on the login node
If you omit the -np option on the login node of NIC5, mpirun will automatically detect the 64 cores available on the login node and start 64 processes. This can overload the login node, which is not supposed to be used to run that many processes.
Submit an MPI job to the queue
Now that we have successfully compiled and run our first MPI application on the login node, the next step is to run it on a compute node.
In Slurm terminology, MPI ranks are called tasks and the number of ranks (tasks) can be specified using the following directive
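#SBATCH --ntasks=NUM_MPI_RANKS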
where NUM_MPI_RANKS is the number of MPI processes (ranks) we want to start.
Below is the job batch script we will use to run our MPI application with four
processes on the compute nodes.
#!/bin/bash
#SBATCH --job-name="MPI Hello World"
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=mpi_hello.out
module load OpenMPI
srun ./mpi_hello
Note that instead of using mpirun, we used srun to launch our executable. The difference between the two is that mpirun is the MPI launcher provided by OpenMPI while srun is the one provided by Slurm. For most use cases, the two options are equivalent. In the example above, we can replace srun with mpirun and it will produce exactly the same result.
Number of processes and Slurm
The number of processes to launch is automatically inferred from the Slurm environment by srun (and mpirun) depending on the options provided via the #SBATCH directives.
If you want to run small tests on the login node, use mpirun. srun can only be used to run executables on the compute nodes.
Now that we have our job batch script ready, we can submit it to the queue for execution on a compute node. For that, we use the sbatch command:
After the job has finished, we can have a look at the output, which is in the mpi_hello.out file:
$ cat mpi_hello.out
Hello world from node nic5-w032, I'm rank 0 out of 4 ranks
Hello world from node nic5-w060, I'm rank 1 out of 4 ranks
Hello world from node nic5-w060, I'm rank 3 out of 4 ranks
Hello world from node nic5-w060, I'm rank 2 out of 4 ranks
We can see that we have four lines in the output. Each line corresponds to one of the 4 MPI processes (ranks) we requested (#SBATCH --ntasks=4). We can also see that three ranks (1 to 3) ran on nic5-w060 while rank 0 ran on nic5-w032.
Your output will differ
When you run the example yourself, the output you obtain will differ from the one presented above. Depending on the available resources, your job will run on other compute nodes. There is also a possibility that all your ranks will run on the same compute node.
Playing with the Slurm parameters
Now that we have compiled and run our first MPI application on a compute node, let's investigate some additional Slurm options to be more specific about where and how the code should be run.
Imposing the number of nodes
In the previous example, we only specified the number of tasks to run (--ntasks=4) and let Slurm decide the number of nodes to use.
Slurm allows us to choose the number of nodes to use with
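#SBATCH --nodes=MIN_NODES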
which specifies the minimum number of nodes (MIN_NODES) our job should use. We can also specify the maximum number of nodes (MAX_NODES) using the following syntax.
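#SBATCH --nodes=MIN_NODES-MAX_NODES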
As an example, we will modify the previous example to force Slurm to allocate three nodes to our job.
#!/bin/bash
#SBATCH --job-name="MPI Hello World"
#SBATCH --nodes=3-3
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=mpi_hello_3nodes.out
module load OpenMPI
srun ./mpi_hello
Here, the only change is the addition of a line which instructs Slurm to allocate precisely three nodes for the job:
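#SBATCH --nodes=3-3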
Next, we submit the job to the queue and wait for it to complete before inspecting the output.
$ sbatch mpi_hello_3nodes.job
...
$ cat mpi_hello_3nodes.out
Hello world from node nic5-w015, I'm rank 2 out of 4 ranks
Hello world from node nic5-w016, I'm rank 3 out of 4 ranks
Hello world from node nic5-w006, I'm rank 0 out of 4 ranks
Hello world from node nic5-w006, I'm rank 1 out of 4 ranks
As we can see, 3 nodes were used this time, with ranks 0 and 1 being executed on nic5-w006 while ranks 2 and 3 were executed on nic5-w015 and nic5-w016, respectively.
Imposing the number of tasks (ranks) per node
Another option is to specify the number of nodes and the number of MPI ranks (tasks) per node. This is done by using the directive
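#SBATCH --ntasks-per-node=NUM_MPI_RANKS_PER_NODE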
where NUM_MPI_RANKS_PER_NODE will fix the number of processes launched on each node. For example, to use four nodes and launch a single process on each of these nodes, we can use the batch job script below.
#!/bin/bash
#SBATCH --job-name="MPI Hello World"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=mpi_hello_1taskpernode.out
module load OpenMPI
srun ./mpi_hello
We submit the job to the queue and wait for it to complete before inspecting the output.
$ sbatch mpi_hello_1taskpernode.job
...
$ cat mpi_hello_1taskpernode.out
Hello world from node nic5-w020, I'm rank 3 out of 4 ranks
Hello world from node nic5-w006, I'm rank 0 out of 4 ranks
Hello world from node nic5-w015, I'm rank 1 out of 4 ranks
Hello world from node nic5-w016, I'm rank 2 out of 4 ranks
We can see that, as requested in our batch script, four nodes have been allocated and that on each of these nodes, a single process was running.