Compile and run a CUDA GPU application

We will start our journey into CUDA programming with a simple Hello World program.

CUDA Hello World

hello_world.cu
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void hello_kernel() {
    // Each thread prints its own coordinates within the grid
    printf("Hello from GPU thread %d out of %d "
           "of block %d out of %d\n",
           threadIdx.x, blockDim.x, blockIdx.x, gridDim.x);
}

int main(int argc, char* argv[]) {
    const int num_threads = 4;  // Threads per block
    const int num_blocks = 2;   // Blocks per grid

    // Launch the kernel on the GPU: num_blocks blocks of num_threads threads each
    hello_kernel<<<num_blocks, num_threads>>>();

    // Wait for the GPU to finish before the program exits
    cudaDeviceSynchronize();

    return 0;
}

To define our hello_kernel function as a CUDA kernel, we use the __global__ function qualifier. This qualifier tells the compiler that the function will be executed on the GPU (device) but invoked from the CPU (host). In other words, it marks the function as globally accessible: callable from host code and visible to both the host and device, which is why it is referred to as the "global" qualifier.
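
For comparison, CUDA also provides the __device__ and __host__ qualifiers: a __device__ function runs on the GPU and can only be called from device code, while __host__ marks an ordinary CPU function. A minimal sketch illustrating the three qualifiers (the square example is purely illustrative and not part of our Hello World):

#include <stdio.h>
#include <cuda_runtime.h>

// __device__: runs on the GPU, callable only from device code
__device__ int square(int x) { return x * x; }

// __global__: runs on the GPU but is launched from the host
__global__ void square_kernel() {
    printf("thread %d squared: %d\n", threadIdx.x, square(threadIdx.x));
}

// __host__: runs on the CPU; this is the default and is usually omitted
__host__ void launch() {
    square_kernel<<<1, 4>>>();
    cudaDeviceSynchronize();
}

int main() {
    launch();
    return 0;
}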

The CUDA execution model

The CUDA execution model is organized hierarchically into grids, blocks, and threads.

  • A grid is a collection of thread blocks
  • Each block is a collection of threads
  • Every thread runs the same kernel function but operates on different portions of data, using its unique identifiers to determine which portion to process

This hierarchical structure allows CUDA to efficiently map computation across thousands of lightweight GPU threads.

Grid
 +-- Block 0
 |     +-- Thread 0
 |     +-- Thread 1
 |     +-- ...
 +-- Block 1
 |     +-- Thread 0
 |     +-- Thread 1
 |     +-- ...
 +-- ...
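
Grids and blocks are not limited to one dimension: both can have up to three dimensions, which is convenient for image or matrix computations. A minimal sketch of a two-dimensional launch configuration using the built-in dim3 type (my_2d_kernel is a placeholder name, not part of this example):

dim3 block_dim(16, 16);   // 16 x 16 = 256 threads per block
dim3 grid_dim(4, 3);      // 4 x 3 = 12 blocks in the grid
my_2d_kernel<<<grid_dim, block_dim>>>();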

Built-in thread and block variables

CUDA provides several built-in variables to help each thread identify its position within the grid and block:

  • threadIdx.x – the thread index within its block (ranges from 0 to blockDim.x - 1)
  • blockDim.x – the number of threads in the current block
  • blockIdx.x – the block index within the grid (ranges from 0 to gridDim.x - 1)
  • gridDim.x – the number of blocks in the grid

Together, these values allow each thread to compute its unique position and operate on the correct data element.
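
A common pattern, shown here only as an illustrative sketch (it is not used in our Hello World), is to combine these variables into a single global index so that each thread processes one element of an array:

__global__ void scale_kernel(float *data, int n) {
    // Global index of this thread across the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // Guard against threads past the end of the array
        data[i] *= 2.0f;
    }
}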

The Hello World kernel

In our simple example, the kernel prints information about the thread and block structure of the grid:

__global__ void hello_kernel() {
    printf("Hello from GPU thread %d out of %d "
           "of block %d out of %d\n", 
           threadIdx.x, blockDim.x, blockIdx.x, gridDim.x);
}

Each GPU thread executes this function independently. When the kernel is launched, many threads run this code simultaneously, each printing its own identifiers.

Launching the kernel

A CUDA kernel is launched using the triple-chevron syntax (<<<...>>>), which specifies how many blocks and threads per block to use:

const int num_threads = 4;  // Threads per block
const int num_blocks = 2;   // Blocks per grid

hello_kernel<<<num_blocks, num_threads>>>();

In this configuration:

  • gridDim.x = 2 (two blocks in the grid)
  • blockDim.x = 4 (four threads per block)

Therefore, the GPU launches a total of 2 x 4 = 8 threads. Each thread prints a message identifying itself by its block and thread indices.

CUDA kernel launches are asynchronous with respect to the host: the call returns immediately and the host continues execution without waiting for the device to finish. Synchronization can be forced using cudaDeviceSynchronize(), which blocks until the device has completed all previously requested tasks.
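
Because the launch itself does not report failures, errors are usually checked through the CUDA runtime API. A minimal sketch of how the launch in our example could be checked (error handling is not part of the code above):

hello_kernel<<<num_blocks, num_threads>>>();

// Detect errors in the launch configuration (e.g. too many threads per block)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
}

// Wait for the kernel to finish and detect errors raised during execution
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
}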

Compile the Hello World example

To compile the Hello World example on Lyra, we need access to a CUDA-capable compiler. To make it available in our environment, we load the CUDA module using the module load command:

module load CUDA
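
To check that the compiler is now available in our environment, we can, for example, print its version:

nvcc --version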

The code can be compiled using the NVIDIA compiler (nvcc):

 $ nvcc -o hello_world hello_world.cu
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

While the code compiled successfully, the compiler emitted a somewhat obscure warning. In CUDA, an architecture refers to a specific GPU hardware generation, identified by its Compute Capability. The warning indicates that support for offline compilation targeting architectures with a compute capability lower than 7.5 is deprecated and will be removed in a future CUDA release.

The term offline compilation refers to the process of generating GPU binaries (machine code) at compile time, as opposed to just-in-time (JIT) compilation, where the intermediate PTX code is compiled into GPU machine code at runtime, just before the kernel is launched.
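
As an aside, nvcc can embed both forms in the same binary: machine code for a specific architecture (offline compilation) and PTX that can be JIT-compiled on newer GPUs. A sketch using the -gencode option (not required for this example):

nvcc -gencode arch=compute_89,code=sm_89 \
     -gencode arch=compute_89,code=compute_89 \
     -o hello_world hello_world.cu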

This warning appears because the compiler is currently targeting Compute Capability 5.2. Instead, we want to compile for the specific architecture of the Lyra RTX 6000 Ada GPUs. To determine the compute capability of these GPUs, we can use the deviceQuery utility, which is included in the CUDA demo suite:

 $ $EBROOTCUDA/extras/demo_suite/deviceQuery | grep "CUDA Capability"
CUDA Capability Major/Minor version number:    8.9

From the output, we can see that the GPU’s Compute Capability is 8.9. This means we should specify the architecture to the compiler using the --gpu-architecture=sm_89 flag:

nvcc --gpu-architecture=sm_89 -o hello_world hello_world.cu
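
Alternatively, the compute capability can be queried programmatically through the CUDA runtime API. A minimal sketch, separate from the Hello World program:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0 (the first GPU)
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}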

Run the Hello World

The Lyra compute nodes are equipped with a single AMD EPYC 9354 32-core CPU, 128 GB of RAM, and one NVIDIA RTX 6000 Ada Generation GPU with 48 GB of memory. The login nodes also include a GPU. We can obtain detailed information about the GPU using the nvidia-smi command:

 $ nvidia-smi
Mon Nov  3 13:35:44 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:01:00.0 Off |                  Off |
| 30%   36C    P8             24W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

We can run our Hello World program like any other regular program:

 $ ./hello_world
Hello from GPU thread 0 out of 4 of block 0 out of 2
Hello from GPU thread 1 out of 4 of block 0 out of 2
Hello from GPU thread 2 out of 4 of block 0 out of 2
Hello from GPU thread 3 out of 4 of block 0 out of 2
Hello from GPU thread 0 out of 4 of block 1 out of 2
Hello from GPU thread 1 out of 4 of block 1 out of 2
Hello from GPU thread 2 out of 4 of block 1 out of 2
Hello from GPU thread 3 out of 4 of block 1 out of 2

Each line corresponds to one GPU thread reporting its position within the overall execution grid.

Submit a job

While you can use the GPUs on the login nodes for testing and debugging during development, these GPUs are shared among all Lyra users, so it is possible that another user is using the GPU at the same time. This can slow down your code. For accurate performance testing, you should submit a job to the compute nodes, where the GPU is dedicated to your job.

To gather information on the partition and compute nodes, we can use the sinfo command just as we did on NIC5. However, Lyra's configuration is different, so the output will differ:

 $ sinfo
PARTITION  AVAIL  TIMELIMIT   CPUS(A/I/O/T)   CPU_LOAD    GRES_USED         NODES   STATE  NODELIST                                          
batch*     up     5-00:00:00  111/17/0/128    23.36-25.00 gpu:a6000ada:1        4    mix-  ly-w[101,103,115,220]                             
batch*     up     5-00:00:00  502/170/0/672   22.93-24.47 gpu:a6000ada:0       21    mix-  ly-w[106-108,111-112,116,202,204,207-219]         
batch*     up     5-00:00:00  72/24/0/96      19.52-22.22 gpu:a6000ada:0        3     mix  ly-w[109-110,117]                                 
batch*     up     5-00:00:00  384/0/0/384     1.89-30.07  gpu:a6000ada:1       12   alloc  ly-w[102,104-105,113-114,118-120,201,203,205-206] 

For example, a line such as this

batch*     up     5-00:00:00  111/17/0/128    23.36-25.00 gpu:a6000ada:1        4    mix  ly-w[101,103,115,220]

means that 4 nodes are partially used (mix) and that the A6000 Ada GPU of each of these nodes is in use (gpu:a6000ada:1), while a line such as this:

batch*     up     5-00:00:00  502/170/0/672   22.93-24.47 gpu:a6000ada:0       21    mix  ly-w[106-108,111-112,116,202,204,207-219]

means that 21 nodes are partially used but the GPU on these nodes is not in use (gpu:a6000ada:0).

The Lyra compute nodes can run both CPU and GPU workloads. To execute a GPU workload, a GPU must be explicitly allocated to the job. This can be done by including the following directive in your job script:

#SBATCH --gpus=1

Lyra does not have a high-speed interconnect, so multi-node jobs are not supported. As a result, we will restrict ourselves to using a single GPU and launching a single task per job:

hello_world.job
#!/bin/bash
#
#SBATCH --job-name="CUDA Hello World"
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH --time=01:00
#SBATCH --output="cuda_hello_world.out"

module load CUDA

echo "Running on node: $(hostname)"

./hello_world

Now that our job script is ready, we can submit it using sbatch, just as we did when submitting jobs on NIC5:

sbatch hello_world.job
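
While the job is waiting in the queue or running, its state can be checked with squeue, here restricted to our own jobs:

squeue -u $USER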

Once the job has finished, the output should contain:

 $ cat cuda_hello_world.out
Running on node: ly-w219
Hello from GPU thread 0 out of 4 of block 0 out of 2
Hello from GPU thread 1 out of 4 of block 0 out of 2
Hello from GPU thread 2 out of 4 of block 0 out of 2
Hello from GPU thread 3 out of 4 of block 0 out of 2
Hello from GPU thread 0 out of 4 of block 1 out of 2
Hello from GPU thread 1 out of 4 of block 1 out of 2
Hello from GPU thread 2 out of 4 of block 1 out of 2
Hello from GPU thread 3 out of 4 of block 1 out of 2