Slurm
When you connect to NIC5, you land on the login node (nic5-login1). The login
node is shared by all users connected to NIC5. This node is the entry point to
NIC5 and is not intended to run resource-intensive calculations. Calculations
requiring significant resources should be run on the compute nodes.
The allocation of the resources of the compute nodes of NIC5 (and of almost all HPC clusters) is organized by a piece of software called a resource manager or job scheduler. Users submit jobs, which the job scheduler runs unattended, at a time and on resources decided by its scheduling algorithm. In the case of NIC5, the resource manager and job scheduler is Slurm.
Gathering information about a cluster
When submitting a job to an HPC cluster, you need some information about its organization and hardware resources, for example:
- How many compute nodes are available?
- How many cores are available on each of these compute nodes?
- How much memory do the compute nodes have?
- What is the maximum time a job can run on a compute node?
Gathering information about a cluster managed by Slurm is done using the sinfo
command. This command will print information about the compute nodes and
their state:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 2-00:00:00 12 mix nic5-w[030,041,043-046,048-049,051-052,065,069]
batch* up 2-00:00:00 58 alloc nic5-w[001-029,031-040,042,047,050,053-064,066-068,070]
hmem up 2-00:00:00 3 idle nic5-w[071-073]
bio up 62-00:00:00 1 idle nic5-w074
- The PARTITION column indicates the partition the compute node belongs to. Partitions can be considered job queues, each of which has a set of constraints such as type of hardware, job size limit, and job time limit. NIC5 is organized in three partitions:
  - batch, which is the default partition (indicated by the * next to the partition name), contains all the nodes with 256 GB of memory.
  - hmem is the partition that groups all the high-memory nodes (1 TB).
  - bio contains a private compute node reserved for a particular research group.
- The AVAIL column refers to the availability of the partition. A partition in the up state is available.
- The TIMELIMIT column gives the maximum time a job can run in a particular partition. The format is DD-HH:MM:SS. A time limit of 2-00:00:00 means that the maximum time a job can run is 2 days.
- The NODES column indicates the number of nodes in a particular state for a given partition.
- The STATE column gives the state of the nodes:
  - alloc means that the nodes are fully allocated: all CPU cores on these nodes are used by running jobs.
  - mix means that the nodes are partially allocated: some cores of these nodes are free.
  - idle means that no jobs are running on these nodes: all cores are free.
- The NODELIST column lists the nodes in a given state and partition.
A more concise output can be obtained using the -s
option.
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
batch* up 2-00:00:00 70/0/0/70 nic5-w[001-070]
hmem up 2-00:00:00 0/3/0/3 nic5-w[071-073]
bio up 62-00:00:00 0/1/0/1 nic5-w074
With the -s option, only one line is printed for each partition, with the
NODES(A/I/O/T) column presenting the number of nodes in each state:
- A gives the number of allocated nodes, which in this case means nodes with at least one core allocated to a job.
- I gives the number of idle nodes.
- O gives the number of nodes in a state other than A or I. Usually, these are nodes that are down or unavailable because something is wrong with them, either on the software side or the hardware side.
- T gives the total number of nodes, regardless of their state.
If you want to gather information about the maximum number of CPU cores and memory available, you can use a custom output format
$ sinfo --format="%10P %.5a %.11l %.6D %.4c %.10m"
PARTITION AVAIL TIMELIMIT NODES CPUS MEMORY
batch* up 2-00:00:00 70 64 257700
hmem up 2-00:00:00 3 64 1031900
bio up 62-00:00:00 1 256 2064000
Detailed information for each node can be obtained using the --Node and
--long options.
$ sinfo --Node --long
Fri Sep 29 17:52:34 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
nic5-w001 1 batch* allocated 64 2:32:1 257700 0 1 amd,rome none
nic5-w002 1 batch* allocated 64 2:32:1 257700 0 1 amd,rome none
nic5-w003 1 batch* allocated 64 2:32:1 257700 0 1 amd,rome none
...
nic5-w072 1 hmem allocated 64 2:32:1 1031900 0 1 amd,rome none
nic5-w073 1 hmem idle 64 2:32:1 1031900 0 1 amd,rome none
nic5-w074 1 bio idle 256 2:64:2 2064000 0 1 amd,rome none
Submitting a job
A job is composed of two components:
- A resource component: the number of cores, the number of nodes, the memory, the maximum time during which the job needs the resources...
- A compute component: the setup of the environment in which the application needs to run and the command(s) to run.
A job is specified by a batch script, stored as a text file, which includes specifications for the required resources and the command or commands to be executed. This job is subsequently submitted to the Slurm scheduler, which will examine the resource requirements from the batch script and verify the availability of the resources. If the requested resources are available and your job has a sufficiently high priority, it will begin execution immediately. However, if the resources are currently unavailable, your job will be placed in a queue and will remain there until the required resources become accessible.
Your first job batch script
To illustrate the process of a job submission, let's consider the following job batch script:
#!/bin/bash
#SBATCH --job-name="My first job"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=firstjob.out
echo "Hello! I'm job with ID ${SLURM_JOBID}."
echo "I'm running on compute node(s) ${SLURM_JOB_NODELIST}."
To submit this job, we will put the content of the code block above in a text file.
We will name this file firstjob_submit.sh and use the simple command-line
text editor nano. To create the file, run the command
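$ nano firstjob_submit.sh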
Then, paste the content of the batch script. To exit nano, press Ctrl+X,
then press Y to confirm that you want to save the changes, and finally press
Enter to confirm the name of the file you want to write.
Submit your first job
Now that our job batch script is ready, we can submit it using the sbatch
command
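$ sbatch firstjob_submit.sh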
If everything works as intended, the sbatch
command should produce an output
similar to
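Submitted batch job 5971751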
The number at the end of the output is called a job ID and is a unique identifier assigned by Slurm to every submitted job. This identifier can be used later to alter, cancel or get information about a job.
Common error
We often see users submitting jobs using bash JOBSCRIPT
instead of
sbatch JOBSCRIPT
. While these two commands look similar, the first one will
cause your job script to run on the login node, resulting in poor performance.
After submitting your job, you should see the sbatch output after a few
seconds. If not, you probably used bash instead of sbatch: you will see either
no output or the output of the commands run in your batch script. In that case,
you should immediately terminate the process by pressing Ctrl+C.
A closer look at the script
Now that we have submitted our first job, let's take a moment to look at the
batch script in detail. The first line is called a shebang line and is
commonly used on UNIX-like systems to indicate the interpreter to use. A shebang
line is always the first line of a script and starts with #!
followed by the
name/path to the interpreter executable.
In our case, we want to use bash as the scripting language, so we specify bash
as the interpreter in the shebang line:
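#!/bin/bash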
Shebang is mandatory
The shebang line is mandatory. If you omit it, sbatch will refuse to submit your job and print an error saying that your file does not look like a batch script.
The next 5 lines are prefixed by #SBATCH
and are directives for Slurm. For
example, the first of these lines
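#SBATCH --job-name="My first job"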
sets the name of the job to My first job
. The next two lines request some
resources for the job.
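#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1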
We will not discuss the meaning of the --ntasks
option right now. It will come
later when we discuss MPI. The important thing to understand right now is that
we request one CPU core with --cpus-per-task=1
. On the next line, we specify
the time limit for the job.
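#SBATCH --time=01:00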
Here, we set the time limit to one minute. The format for this option is
DD-HH:MM:SS. For example:
- 1-12:00:00 requests 1 day and 12 hours
- 06:00:00 requests 6 hours
- 15:00 requests 15 minutes
The --time option specifies the upper limit. If the job runs for longer than
this limit, the scheduler will terminate it.
The last directive (--output) specifies the output file we want to use for this
job.
The final two lines of our script are the commands we want to run on the compute node.
echo "Hello! I'm job with ID ${SLURM_JOBID}."
echo "I'm running on compute node(s) ${SLURM_JOB_NODELIST}."
The echo
command is used to print something to the standard output (similar to
printf
in C). ${SLURM_JOBID}
and ${SLURM_JOB_NODELIST}
are variables set
automatically by Slurm. If we look at the output produced by our job using the
cat command, which prints the content of a file to the terminal, we get
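$ cat firstjob.out
Hello! I'm job with ID 5971751.
I'm running on compute node(s) nic5-w070.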
Here, ${SLURM_JOBID}
has been replaced by our job ID (5971751
) and
${SLURM_JOB_NODELIST}
by the name of the compute node on which the job
ran (nic5-w070
).
Inspecting the queue
It's very likely that the job we submitted in the previous section started almost immediately as we only requested one core. It is also very short as the only thing we do in this script is printing some information about the job.
In order to artificially increase the duration of the job, we will add
sleep 60
(wait 60 seconds) at the end of it
#!/bin/bash
#SBATCH --job-name="My first job"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00
#SBATCH --output=firstjob.out
echo "Hello! I'm job with ID ${SLURM_JOBID}."
echo "I'm running on compute node(s) ${SLURM_JOB_NODELIST}."
sleep 60
and submit it again
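$ sbatch firstjob_submit.sh
Submitted batch job 5971997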
Now, we will inquire about the status of our job using the squeue
command, which
is the command used to inspect the Slurm job queue. By default, this command prints
information about all the jobs running or waiting in the queue. To only get
information about your jobs, you need to add the --me
option.
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5971997 batch My first olouant PD 0:00 1 (Resources)
Looking at the ST
column, we see that the job is in the PD
state, which is an
abbreviation for "Pending". The job is waiting for a compute node to be
available. This is confirmed by the NODELIST(REASON)
column, which indicates that
the job is waiting for resources to become available. Another possible output
might be
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5971997 batch My first olouant R 0:04 1 nic5-w070
Here, the job is in the R
state (running) and has been running on node
nic5-w070
for 4 seconds.
A last possibility is that your job is already completed, either because all the
commands successfully terminated, or because of an error in your job script. In
these cases, no job will be visible in the output of the squeue
command:
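$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)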
Canceling a job
Sometimes, you realize you made a mistake in your job script and want to correct it. However, correcting your job script while your job is pending in the queue or when your job is already running will have no effect. In order to correct your mistake, you need to cancel your job.
The Slurm command to use to cancel a job is scancel
followed by the job ID of
the job you wish to cancel. For example, to cancel the job submitted in the
previous section, we can use the command
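$ scancel 5971997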
If you don't remember the ID of the job you want to cancel, you can always use the
squeue --me
command to retrieve it.
Summary
Command | Description
---|---
sinfo | Gather information about the nodes and partitions
sbatch BATCH_SCRIPT_FILE | Submit a job defined in the file BATCH_SCRIPT_FILE to the queue
squeue --me | List your jobs in the queue
scancel JOBID | Cancel the job with job ID JOBID