Compile and run a CUDA GPU application
We will start our journey into CUDA programming with a simple Hello World program.
CUDA Hello World
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void hello_kernel() {
    printf("Hello from GPU thread %d out of %d "
           "of block %d out of %d\n",
           threadIdx.x, blockDim.x, blockIdx.x, gridDim.x);
}

int main(int argc, char* argv[]) {
    const int num_threads = 4;
    const int num_blocks = 2;

    hello_kernel<<<num_blocks, num_threads>>>();
    cudaDeviceSynchronize();

    return 0;
}
To define our hello_kernel function as a CUDA kernel, we use the __global__ function qualifier. This qualifier tells
the compiler that the function will be executed on the GPU (device) but invoked from the CPU (host). In other words, it
marks the function as globally accessible: callable from host code and visible to both the host and the device. This is why
it is referred to as the "global" qualifier.
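For comparison, CUDA also provides the __device__ qualifier (functions that run on the GPU and can only be called from GPU code) and the __host__ qualifier (ordinary CPU functions, the default). A minimal sketch illustrating the difference (not part of the Hello World program):
__device__ int square(int x) {
    // Runs on the GPU and is callable only from other GPU code
    return x * x;
}

__global__ void square_kernel() {
    // Runs on the GPU, but is launched from the CPU with <<<...>>>
    printf("%d squared is %d\n", threadIdx.x, square(threadIdx.x));
}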
The CUDA execution model
The CUDA execution model is organized hierarchically into grids, blocks, and threads:
- A grid is a collection of thread blocks
- Each block is a collection of threads
- Every thread runs the same kernel function but operates on different portions of data, using its unique identifiers to determine which portion to process
This hierarchical structure allows CUDA to efficiently map computation across thousands of lightweight GPU threads.
Grid
+-- Block 0
| +-- Thread 0
| +-- Thread 1
| +-- ...
+-- Block 1
| +-- Thread 0
| +-- Thread 1
| +-- ...
+-- ...
Built-in thread and block variables
CUDA provides several built-in variables to help each thread identify its position within the grid and block:
- threadIdx.x – the thread index within its block (ranges from 0 to blockDim.x - 1)
- blockDim.x – the number of threads in the current block
- blockIdx.x – the block index within the grid (ranges from 0 to gridDim.x - 1)
- gridDim.x – the number of blocks in the grid
Together, these values allow each thread to compute its unique position and operate on the correct data element.
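A very common pattern is to combine these variables into a single global index so that, for example, thread i of the grid processes element i of an array. A minimal sketch, where the kernel name, array, and scaling factor are purely illustrative:
__global__ void scale_kernel(float *data, int n, float factor) {
    // Global index of this thread across the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads that fall beyond the end of the array
    if (i < n) {
        data[i] *= factor;
    }
}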
The Hello World kernel
In our simple example, the kernel prints information about the thread and block structure of the grid:
__global__ void hello_kernel() {
    printf("Hello from GPU thread %d out of %d "
           "of block %d out of %d\n",
           threadIdx.x, blockDim.x, blockIdx.x, gridDim.x);
}
Each GPU thread executes this function independently. When the kernel is launched, many threads run this code simultaneously, each printing its own identifiers.
Launching the kernel
A CUDA kernel is launched using the triple-chevron syntax (<<<...>>>), which
specifies how many blocks and threads per block to use:
const int num_threads = 4; // Threads per block
const int num_blocks = 2; // Blocks per grid
hello_kernel<<<num_blocks, num_threads>>>();
In this configuration:
- gridDim.x = 2 (two blocks in the grid)
- blockDim.x = 4 (four threads per block)
Therefore, the GPU launches a total of 2 x 4 = 8 threads. Each thread prints a
message identifying itself by its block and thread indices.
CUDA kernel launches are asynchronous with respect to the host: the call returns
immediately and the host continues execution without waiting for the device to
finish. Synchronization may be forced using cudaDeviceSynchronize() which blocks
until the device has completed all preceding requested tasks.
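Because the launch is asynchronous, errors may also surface only later. A minimal sketch of how one might check both the launch itself and the subsequent kernel execution using the CUDA runtime API:
hello_kernel<<<num_blocks, num_threads>>>();

// Check that the launch was accepted (e.g. not an invalid configuration)
cudaError_t launch_err = cudaGetLastError();
if (launch_err != cudaSuccess) {
    fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(launch_err));
}

// Block until the kernel has finished, then check for execution errors
cudaError_t sync_err = cudaDeviceSynchronize();
if (sync_err != cudaSuccess) {
    fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(sync_err));
}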
Compile the Hello World example
To compile the Hello World example on Lyra, we need access to a CUDA-capable
compiler. To make it available in our environment, we load the CUDA module using
the module load command:
$ module load CUDA
The code can then be compiled using the NVIDIA compiler (nvcc):
$ nvcc -o hello_world hello_world.cu
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
While the code compiled successfully, the compiler emitted a somewhat obscure warning. In CUDA, an architecture refers to a specific GPU hardware generation, identified by its Compute Capability. The warning indicates that architectures with a compute capability lower than 7.5 will soon be deprecated and removed from future CUDA releases.
The term offline compilation refers to the process of generating GPU binaries (machine code) at compile time, as opposed to just-in-time (JIT) compilation, where the intermediate PTX code is compiled into GPU machine code at runtime, just before the kernel is launched.
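As an illustration, nvcc can embed both forms in the same executable: machine code (SASS) for a specific architecture plus PTX that can be JIT-compiled on newer GPUs. A sketch using standard nvcc options (sm_89/compute_89 refer to Compute Capability 8.9, which we determine below to be the architecture of the Lyra GPUs):
$ nvcc -gencode arch=compute_89,code=sm_89 \
       -gencode arch=compute_89,code=compute_89 \
       -o hello_world hello_world.cu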
This warning appears because the compiler is currently targeting Compute
Capability 5.2. Instead, we want to compile for the specific architecture of the
Lyra RTX 6000 Ada GPUs. To determine the compute capability of these GPUs, we
can use the deviceQuery utility, which is included in the CUDA demo suite:
$ $EBROOTCUDA/extras/demo_suite/deviceQuery | grep "CUDA Capability"
CUDA Capability Major/Minor version number: 8.9
From the output, we can see that the GPU’s Compute Capability is 8.9. This means
we should specify the architecture to the compiler using the
--gpu-architecture=sm_89 flag:
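$ nvcc --gpu-architecture=sm_89 -o hello_world hello_world.cu
With the architecture specified explicitly, the deprecation warning should no longer appear, since we are no longer targeting the default (deprecated) architecture.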
Run the Hello World
The Lyra compute nodes are equipped with a single AMD EPYC 9354 32-core CPU, 128
GB of RAM, and one NVIDIA RTX 6000 Ada Generation GPU with 48 GB of memory. The
login nodes also include a GPU. We can obtain detailed information about the GPU
using the nvidia-smi command:
$ nvidia-smi
Mon Nov 3 13:35:44 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... On | 00000000:01:00.0 Off | Off |
| 30% 36C P8 24W / 300W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
We can run our Hello World program like any regular program:
$ ./hello_world
Hello from GPU thread 0 out of 4 of block 0 out of 2
Hello from GPU thread 1 out of 4 of block 0 out of 2
Hello from GPU thread 2 out of 4 of block 0 out of 2
Hello from GPU thread 3 out of 4 of block 0 out of 2
Hello from GPU thread 0 out of 4 of block 1 out of 2
Hello from GPU thread 1 out of 4 of block 1 out of 2
Hello from GPU thread 2 out of 4 of block 1 out of 2
Hello from GPU thread 3 out of 4 of block 1 out of 2
Each line corresponds to one GPU thread reporting its position within the overall execution grid.
Submit a job
While you can use the GPUs on the login nodes for testing and debugging during development, these GPUs are shared among all Lyra users, so it is possible that another user is using the GPU at the same time. This can slow down your code. For accurate performance testing, you should submit a job to the compute nodes, where the GPU is dedicated to your job.
To gather information on the partition and compute nodes, we can use the sinfo
command just as we did on NIC5. However, Lyra's configuration is different, so
the output will differ:
$ sinfo
PARTITION AVAIL TIMELIMIT CPUS(A/I/O/T) CPU_LOAD GRES_USED NODES STATE NODELIST
batch* up 5-00:00:00 111/17/0/128 23.36-25.00 gpu:a6000ada:1 4 mix- ly-w[101,103,115,220]
batch* up 5-00:00:00 502/170/0/672 22.93-24.47 gpu:a6000ada:0 21 mix- ly-w[106-108,111-112,116,202,204,207-219]
batch* up 5-00:00:00 72/24/0/96 19.52-22.22 gpu:a6000ada:0 3 mix ly-w[109-110,117]
batch* up 5-00:00:00 384/0/0/384 1.89-30.07 gpu:a6000ada:1 12 alloc ly-w[102,104-105,113-114,118-120,201,203,205-206]
For example, a line such as this:
batch* up 5-00:00:00 111/17/0/128 23.36-25.00 gpu:a6000ada:1 4 mix- ly-w[101,103,115,220]
means that 4 nodes are partially used (mix) and that the A6000 Ada GPU on each of
these nodes is in use (gpu:a6000ada:1), while a line such as this:
batch* up 5-00:00:00 502/170/0/672 22.93-24.47 gpu:a6000ada:0 21 mix- ly-w[106-108,111-112,116,202,204,207-219]
means that 21 nodes are partially used but the GPU on these nodes is not in use
(gpu:a6000ada:0).
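If you want to query this information yourself, sinfo's --Format option can list the configured and currently used GRES per node. A sketch (the field names assume a reasonably recent Slurm version):
$ sinfo -N -O "NodeList:15,StateCompact:8,Gres:20,GresUsed:25"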
The Lyra compute nodes can run both CPU and GPU workloads. To execute a GPU workload, a GPU must be explicitly allocated to the job. This can be done by including the following directive in your job script:
#SBATCH --gpus=1
Lyra does not have a high-speed interconnect, so multi-node jobs are not supported. As a result, we will restrict ourselves to using a single GPU and launch a single task per job:
#!/bin/bash
#
#SBATCH --job-name="CUDA Hello World"
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH --time=01:00
#SBATCH --output="cuda_hello_world.out"
module load CUDA
echo "Running on node: $(hostname)"
./hello_world
The job is submitted with sbatch, just as we did when submitting jobs on NIC5.
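Assuming the job script above was saved in a file named hello_world.sh (a name chosen here purely for illustration), it is submitted with:
$ sbatch hello_world.sh    # hello_world.sh: illustrative file name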
Once the job has finished, the output file should contain:
$ cat cuda_hello_world.out
Running on node: ly-w219
Hello from GPU thread 0 out of 4 of block 0 out of 2
Hello from GPU thread 1 out of 4 of block 0 out of 2
Hello from GPU thread 2 out of 4 of block 0 out of 2
Hello from GPU thread 3 out of 4 of block 0 out of 2
Hello from GPU thread 0 out of 4 of block 1 out of 2
Hello from GPU thread 1 out of 4 of block 1 out of 2
Hello from GPU thread 2 out of 4 of block 1 out of 2
Hello from GPU thread 3 out of 4 of block 1 out of 2