Compile and run an OpenMP GPU application
Compile an OpenMP target offload code
For this tutorial, we will consider a basic saxpy code. The source code of this example is presented below; we assume that the source file is named saxpy_gpu.c.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
void saxpy(int n, float a, float *x, float *y) {
    double elapsed = -1.0 * omp_get_wtime();

    // We don't need to map the variable a as scalars are firstprivate by default
    #pragma omp target teams distribute parallel for \
        map(to:x[0:n]) map(tofrom:y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }

    elapsed += omp_get_wtime();
    printf("saxpy done in %6.3lf seconds.\n", elapsed);
}

int main() {
    int n = 2000000;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    float alpha = 2.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        x[i] = 1;
        y[i] = i;
    }

    saxpy(n, alpha, x, y);

    free(x);
    free(y);

    return 0;
}
To compile the code, we will use the Clang compiler, which was built with support for OpenMP target offload to NVIDIA GPUs. To get access to the compiler, we load the corresponding modules:
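The module names below are the ones used in the job script later in this tutorial; the exact versions available on Lucia may differ.
$ module load EasyBuild/2023a
$ module load Clang/18.1.8-GCCcore-12.3.0-CUDA-12.2.0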
The code can then be compiled with the following command:
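The exact invocation may differ on your system; this sketch assumes the clang compiler provided by the module above and names the executable saxpy_gpu, as used in the rest of this tutorial.
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy_gpu.c -o saxpy_gpu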
where -fopenmp-targets=nvptx64-nvidia-cuda enables OpenMP target offload for NVIDIA GPUs.
An alternative is to use the (newer) --offload-arch
compiler flag to enable
OpenMP target offloading.
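As a sketch, the same compilation with --offload-arch could look like this:
$ clang -fopenmp --offload-arch=sm_80 saxpy_gpu.c -o saxpy_gpu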
where we have to specify the target GPU architecture. For the NVIDIA A100 GPUs of Lucia, the compute capability is 8.0, so we use sm_80 as the argument of the --offload-arch option.
The resulting saxpy_gpu
application can be executed directly on the login
node:
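A sketch of such a run is shown below; the timing printed by the program will vary.
$ ./saxpy_gpu
saxpy done in  X.XXX seconds.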
However, the login nodes do not have any GPUs. As a consequence, the code was run on the CPU. This absence of GPUs can be highlighted by setting the OMP_TARGET_OFFLOAD environment variable to MANDATORY. With this variable, we tell the OpenMP runtime that offloading to the GPU is mandatory.
If we set the OMP_TARGET_OFFLOAD environment variable and execute on the login node, the execution fails:
$ OMP_TARGET_OFFLOAD=MANDATORY ./saxpy_gpu
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware. Found 1 image(s): (sm_80)
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)
To execute our example on the GPU, we need to allocate a GPU on a compute node. This is described in the next section.
Running the saxpy example
Note
Lucia uses the Slurm scheduler, like NIC5. As a consequence, most of the content of the Slurm chapter applies to Lucia.
To submit a job to Lucia, you need to specify which project should be "billed" for your job, which was not the case on NIC5. This is done with the --account directive. The name of the project used for the course is ulghpsc:
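As in the job script below, the directive takes the following form:
#SBATCH --account=ulghpsc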
Warning
If you don't specify an account, the submission of your job will be denied.
To use the GPU nodes, we need to use the gpu
partition with the --partition
directive and allocate a GPU with the --gpus
directive.
Warning
Using the gpu partition requires allocating a GPU for the job, or the submission will be denied.
Below is an example job script to run the saxpy example on a GPU compute node of Lucia. We store this job script in a file named lucia_gpu.job.
#!/bin/bash
#SBATCH --job-name="saxpy GPU"
#SBATCH --output=saxpy_gpu.out
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --gpus=1
#SBATCH --time=15:00
#SBATCH --account=ulghpsc
module load EasyBuild/2023a
module load Clang/18.1.8-GCCcore-12.3.0-CUDA-12.2.0
export OMP_TARGET_OFFLOAD=MANDATORY
./saxpy_gpu
Note that we use the OMP_TARGET_OFFLOAD environment variable to make the application fail if GPU offloading is not possible. Using this variable is not strictly required, but it is a way to be on the safe side and make sure that the application does indeed run on a GPU.
To submit this job, we use the sbatch
command like on NIC5:
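Assuming the job script was saved as lucia_gpu.job, the submission looks like this (the job ID in the output will differ):
$ sbatch lucia_gpu.job
Submitted batch job XXXXXXX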
Get an interactive session on a compute node
For development purposes, when you want to quickly test your code, submitting a job to get access to a GPU might not be practical. To make your work easier, you can create an interactive session on a GPU node using the following command.
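The exact flags may differ, but a sketch of such a command, reusing the account, partition, and GPU request from the job script above, is:
$ srun --account=ulghpsc --partition=gpu --gpus=1 --time=1:00:00 --pty bash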
where we request access to one GPU for one hour. You should see an output looking like this:
srun: job XXXXXXX queued and waiting for resources
...
srun: job XXXXXXX has been allocated resources
(JOB_ID: XXXXXXX) user@cnaXXX:~ #
You should now be on a compute node. You can check that a GPU is available using
the nvidia-smi
command.
$ nvidia-smi
Fri Nov 17 15:09:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:06:00.0 Off | 0 |
| N/A 27C P0 52W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
When you have finished your test, you can end the session with the exit command.
Interactive jobs need to queue like regular jobs
Interactive jobs are put in the queue like regular jobs (submitted with sbatch). This means that if all GPU compute nodes are allocated, you will have to wait before getting an allocation.