Hardware performance counters with LIKWID
Introduction
When profiling an application, we often focus on the runtime of each individual function in the code in order to determine the bottleneck of the application. While measuring runtime is important, it doesn't provide any information on the reason why a function is slow. This is where hardware performance counters can come to the rescue.
Hardware performance counters are special-purpose registers built into the CPU that store the count of hardware-related activities. They are designed to gather information about specific events occurring in the hardware. Each time the event occurs, the counter is incremented. For example, using performance counters, we can gather information such as the number of data cache misses or the number of floating-point instructions.
The way hardware performance counters are implemented is very hardware-dependent. Different CPUs may have different performance counters, and a particular CPU may have performance counters that are not available on another CPU. The consequence is that the programmer needs to know the details of how performance counters work for the specific architecture they are using. Fortunately, some libraries and tools provide an abstraction layer that makes it possible to perform measurements without understanding all the intricate details of the hardware.
Performance counters with LIKWID
LIKWID is an open-source performance monitoring and benchmarking suite that abstracts some of the differences between different manufacturers.
Listing available performance groups
LIKWID offers preconfigured event sets, called performance groups, with useful preselected events and derived metrics. These performance groups can be listed using the likwid-perfctr -a command. In order to get access to this command, we first have to load the module:
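$ module load likwid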
Once the module is loaded, we have access to the likwid-perfctr command. The first thing we can do with this command is to list all the available performance groups using the -a option.
$ likwid-perfctr -a
Group name Description
--------------------------------------------------------------------------------
BRANCH Branch prediction miss rate/ratio
CACHE Data cache miss rate/ratio
CLOCK Cycles per instruction
CPI Cycles per instruction
DATA Load to store ratio
DIVIDE Divide unit information
ENERGY Power and Energy consumption
FLOPS_DP Double Precision MFLOP/s
FLOPS_SP Single Precision MFLOP/s
ICACHE Instruction cache miss rate/ratio
L2 L2 cache bandwidth in MBytes/s (experimental)
L3 L3 cache bandwidth in MBytes/s
MEM Main memory bandwidth in MBytes/s (experimental)
MEM_DP Main memory bandwidth in MBytes/s (experimental)
MEM_SP Main memory bandwidth in MBytes/s (experimental)
NUMA Local and remote memory accesses (experimental)
TLB TLB miss rate/ratio
Get information about a group
We can select one of the groups presented in the list above and ask for more information about it by selecting the group with the -g GROUPNAME option together with the -H (for help) option:
$ likwid-perfctr -g CACHE -H
Group CACHE:
Formulas:
data cache requests = DATA_CACHE_ACCESSES
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
data cache misses = DATA_CACHE_REFILLS_ALL
data cache miss rate = DATA_CACHE_REFILLS_ALL / RETIRED_INSTRUCTIONS
data cache miss ratio = DATA_CACHE_REFILLS_ALL / DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. Data cache request rate tells you how data intensive your code is
or how many data accesses you have on average per instruction.
The data cache miss rate gives a measure how often it was necessary to get
cache lines from higher levels of the memory hierarchy. And finally
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be given by your algorithm you should try to get data cache miss ratio
Measuring performance counters
On the login node
The general syntax of the likwid-perfctr command for performance counter measurements is as follows:
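$ likwid-perfctr -g PERFGROUP COMMAND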
where PERFGROUP is the performance group you want to measure and COMMAND is the executable for which you want to perform the measurement. For example, to measure the L1 cache accesses of the hostname command, we can use
$ likwid-perfctr -C 32 -g CACHE hostname
--------------------------------------------------------------------------------
CPU name: AMD EPYC 7542 32-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.90 GHz
--------------------------------------------------------------------------------
nic5-login1
--------------------------------------------------------------------------------
Group 1: CACHE
+------------------------+---------+-------------+
| Event | Counter | HWThread 32 |
+------------------------+---------+-------------+
| ACTUAL_CPU_CLOCK | FIXC1 | 0 |
| MAX_CPU_CLOCK | FIXC2 | 0 |
| RETIRED_INSTRUCTIONS | PMC0 | 383991 |
| CPU_CLOCKS_UNHALTED | PMC1 | 405876 |
| DATA_CACHE_ACCESSES | PMC2 | 138904 |
| DATA_CACHE_REFILLS_ALL | PMC3 | 3469 |
+------------------------+---------+-------------+
+-------------------------+-------------+
| Metric | HWThread 32 |
+-------------------------+-------------+
| Runtime (RDTSC) [s] | 0.0024 |
| Runtime unhalted [s] | 0 |
| Clock [MHz] | - |
| CPI | 1.0570 |
| data cache requests | 138904 |
| data cache request rate | 0.3617 |
| data cache misses | 3469 |
| data cache miss rate | 0.0090 |
| data cache miss ratio | 0.0250 |
+-------------------------+-------------+
Note the use of the -C 32 option. The role of this option is to pin the executable to a particular core and to indicate to LIKWID which core to measure. If we don't use this option, all available cores will be measured. In our example, we pin the command to core number 32 (the first core of the second socket).
The output of likwid-perfctr first presents the raw values collected for each performance counter. The second part of the output presents metrics computed from these counter values.
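These metrics follow directly from the formulas of the performance group applied to the raw counter values. For the CACHE group output above, for instance:
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS = 138904 / 383991 ≈ 0.3617
data cache miss rate    = DATA_CACHE_REFILLS_ALL / RETIRED_INSTRUCTIONS = 3469 / 383991 ≈ 0.0090
data cache miss ratio   = DATA_CACHE_REFILLS_ALL / DATA_CACHE_ACCESSES = 3469 / 138904 ≈ 0.0250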
If we have an OpenMP application, then we need to pin it to multiple cores. For example, with 4 threads, we can pin our application and the measurement to cores 32, 33, 34 and 35 using the -C 32-35 option.
$ export OMP_NUM_THREADS=4
$ likwid-perfctr -C 32-35 -g CACHE ./omp_app
--------------------------------------------------------------------------------
CPU name: AMD EPYC 7542 32-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.90 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: CACHE
+------------------------+---------+-------------+-------------+-------------+-------------+
| Event | Counter | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+------------------------+---------+-------------+-------------+-------------+-------------+
| ACTUAL_CPU_CLOCK | FIXC1 | 0 | 0 | 0 | 0 |
| MAX_CPU_CLOCK | FIXC2 | 0 | 0 | 0 | 0 |
| RETIRED_INSTRUCTIONS | PMC0 | 161191317 | 78095363 | 78074577 | 75929953 |
| CPU_CLOCKS_UNHALTED | PMC1 | 120392720 | 95388787 | 95188872 | 74983664 |
| DATA_CACHE_ACCESSES | PMC2 | 55717473 | 19338050 | 19312044 | 19086053 |
| DATA_CACHE_REFILLS_ALL | PMC3 | 459846 | 21921 | 19367 | 21457 |
+------------------------+---------+-------------+-------------+-------------+-------------+
+-----------------------------+---------+-----------+----------+-----------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+-----------------------------+---------+-----------+----------+-----------+--------------+
| ACTUAL_CPU_CLOCK STAT | FIXC1 | 0 | 0 | 0 | 0 |
| MAX_CPU_CLOCK STAT | FIXC2 | 0 | 0 | 0 | 0 |
| RETIRED_INSTRUCTIONS STAT | PMC0 | 393291210 | 75929953 | 161191317 | 9.832280e+07 |
| CPU_CLOCKS_UNHALTED STAT | PMC1 | 385954043 | 74983664 | 120392720 | 9.648851e+07 |
| DATA_CACHE_ACCESSES STAT | PMC2 | 113453620 | 19086053 | 55717473 | 28363405 |
| DATA_CACHE_REFILLS_ALL STAT | PMC3 | 522591 | 19367 | 459846 | 130647.7500 |
+-----------------------------+---------+-----------+----------+-----------+--------------+
+-------------------------+-------------+-------------+-------------+-------------+
| Metric | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+-------------------------+-------------+-------------+-------------+-------------+
| Runtime (RDTSC) [s] | 0.0999 | 0.0999 | 0.0999 | 0.0999 |
| Runtime unhalted [s] | 0 | 0 | 0 | 0 |
| Clock [MHz] | - | - | - | - |
| CPI | 0.7469 | 1.2214 | 1.2192 | 0.9875 |
| data cache requests | 55717473 | 19338050 | 19312044 | 19086053 |
| data cache request rate | 0.3457 | 0.2476 | 0.2474 | 0.2514 |
| data cache misses | 459846 | 21921 | 19367 | 21457 |
| data cache miss rate | 0.0029 | 0.0003 | 0.0002 | 0.0003 |
| data cache miss ratio | 0.0083 | 0.0011 | 0.0010 | 0.0011 |
+-------------------------+-------------+-------------+-------------+-------------+
+------------------------------+-----------+----------+----------+-------------+
| Metric | Sum | Min | Max | Avg |
+------------------------------+-----------+----------+----------+-------------+
| Runtime (RDTSC) [s] STAT | 0.3996 | 0.0999 | 0.0999 | 0.0999 |
| Runtime unhalted [s] STAT | 0 | 0 | 0 | 0 |
| Clock [MHz] STAT | 0 | inf | 0 | 0 |
| CPI STAT | 4.1750 | 0.7469 | 1.2214 | 1.0437 |
| data cache requests STAT | 113453620 | 19086053 | 55717473 | 28363405 |
| data cache request rate STAT | 1.0921 | 0.2474 | 0.3457 | 0.2730 |
| data cache misses STAT | 522591 | 19367 | 459846 | 130647.7500 |
| data cache miss rate STAT | 0.0037 | 0.0002 | 0.0029 | 0.0009 |
| data cache miss ratio STAT | 0.0115 | 0.0010 | 0.0083 | 0.0029 |
+------------------------------+-----------+----------+----------+-------------+
Batch job (OpenMP)
Using likwid-perfctr in a batch job to measure hardware counters for an OpenMP application is similar to the way it's done on a login node. However, as we cannot know in advance which CPU cores will be allocated, we can remove the -C option: the CPU cores available to the job are already limited by Slurm.
The example below shows the use of likwid-perfctr in a batch script using four OpenMP threads.
#!/bin/bash -l
#
#SBATCH --job-name="LIKWID OpenMP"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
module load likwid
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
likwid-perfctr -g CACHE ./omp_app
Batch job (MPI)
LIKWID includes the likwid-mpirun utility, which can be used to measure hardware performance counters of MPI applications. Like with likwid-perfctr, the group is selected using the -g option. The number of processes (MPI ranks) to launch is defined using the -np option.
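In its simplest form, an invocation therefore looks like the sketch below, where <NPROC>, PERFGROUP and ./mpi_app are placeholders for the number of ranks, the performance group and your executable:
$ likwid-mpirun -np <NPROC> -g PERFGROUP ./mpi_app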
The example script below shows the use of likwid-mpirun in a batch script launching four processes (MPI ranks) distributed over two nodes.
#!/bin/bash -l
#
#SBATCH --job-name="LIKWID MPI"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=15:00
module load likwid
module load OpenMPI
likwid-mpirun --mpi slurm -np ${SLURM_NTASKS} \
-g CACHE ./mpi_app
For hybrid MPI+OpenMP jobs, the strategy is to unset the OMP_NUM_THREADS environment variable, which is set to 1 by default, and to use the -t option of likwid-mpirun to specify the number of OpenMP threads. In a Slurm batch script, we can use this option with ${SLURM_CPUS_PER_TASK} to retrieve the number of threads from Slurm.
The example job script below launches an application with four processes (MPI ranks) distributed over two nodes. Each process uses four OpenMP threads.
#!/bin/bash -l
#
#SBATCH --job-name="LIKWID MPI+OpenMP"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
module load likwid
module load OpenMPI
unset OMP_NUM_THREADS
# make sure --cpus-per-task is propagated
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
likwid-mpirun --mpi slurm -np ${SLURM_NTASKS} \
-t ${SLURM_CPUS_PER_TASK} \
-g CACHE ./mpi_omp_app
Using markers
In most cases, we don't want to perform measurements for the entire application but only for selected regions of the code. In this situation, we can use the LIKWID Marker API. The Marker API consists of function calls and defines that enable the measurement of code regions.
For example, the following defines are available:
LIKWID_MARKER_INIT: initialize the Marker API
LIKWID_MARKER_REGISTER(char* tag): register a region name tag to the Marker API
LIKWID_MARKER_START(char* tag): start a named region identified by tag
LIKWID_MARKER_STOP(char* tag): stop a named region identified by tag
LIKWID_MARKER_CLOSE: finalize the Marker API
To illustrate the use of the Marker API, we will use a code that performs two types of operations with the elements of two arrays: a sum and a product. The first step is to initialize the Marker API with LIKWID_MARKER_INIT, then we register two regions with the tags sum and prod. We register these tags in a parallel OpenMP region so that they are registered for all the threads. The two regions are delimited using LIKWID_MARKER_START and LIKWID_MARKER_STOP. At the end of the program, the Marker API is finalized with LIKWID_MARKER_CLOSE.
#include <stdlib.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif
int main(int argc, char *argv[]) {
const int size = 1000;
double *a = malloc(size * sizeof(double));
double *b = malloc(size * sizeof(double));
double *sum = malloc(size * sizeof(double));
double *prod = malloc(size * sizeof(double));
LIKWID_MARKER_INIT;
#pragma omp parallel
{
LIKWID_MARKER_REGISTER("sum");
LIKWID_MARKER_REGISTER("prod");
#pragma omp barrier
#pragma omp for
for (int i = 0; i < size; i++) {
a[i] = (double)(i + 1);
b[i] = a[i];
}
LIKWID_MARKER_START("sum");
#pragma omp for
for (int i = 0; i < size; i++) {
sum[i] = a[i] + b[i];
}
LIKWID_MARKER_STOP("sum");
LIKWID_MARKER_START("prod");
#pragma omp for
for (int i = 0; i < size; i++) {
prod[i] = a[i] * b[i];
}
LIKWID_MARKER_STOP("prod");
}
printf(" Sum: first = %lf - last = %lf\n", sum[0], sum[size-1]);
printf(" Product: first = %lf - last = %lf\n", prod[0], prod[size-1]);
LIKWID_MARKER_CLOSE;
free(a); free(b); free(sum); free(prod);
return 0;
}
The example above can be compiled with and without the Marker API, which is enabled by defining the LIKWID_PERFMON macro at compile time. In addition, we need to provide the paths to the LIKWID include and library directories, as well as link against the LIKWID library:
export LIKWID_LIB=$(dirname $(which likwid-perfctr))/../lib/
export LIKWID_INC=$(dirname $(which likwid-perfctr))/../include/
gcc -O3 -fopenmp -DLIKWID_PERFMON \
-L$LIKWID_LIB -I$LIKWID_INC \
-o markers markers_example.c -llikwid
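To build the same source without the Marker API, we can simply omit the LIKWID_PERFMON define, the LIKWID paths and the -llikwid flag; the fallback macros defined at the top of the example then expand to nothing. For instance:
gcc -O3 -fopenmp -o markers markers_example.c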
To enable the Marker API during the measurement, we need to pass the -m option. This option can be used with both likwid-perfctr and likwid-mpirun.
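For example, to measure the CACHE group for the regions defined in the markers executable built above, using four OpenMP threads pinned to cores 32-35 as in the earlier example, we could run:
$ export OMP_NUM_THREADS=4
$ likwid-perfctr -C 32-35 -g CACHE -m ./markers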
The output of this command will contain two sections, one section per region defined by the marker API:
Region sum, Group 1: CACHE
+-------------------+-------------+-------------+-------------+-------------+
| Region Info | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+-------------------+-------------+-------------+-------------+-------------+
| RDTSC Runtime [s] | 0.000008 | 0.000027 | 0.000001 | 0.000015 |
| call count | 1 | 1 | 1 | 1 |
+-------------------+-------------+-------------+-------------+-------------+
...
Region prod, Group 1: CACHE
+-------------------+-------------+-------------+-------------+-------------+
| Region Info | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+-------------------+-------------+-------------+-------------+-------------+
| RDTSC Runtime [s] | 0.000021 | 0.000007 | 0.000014 | 0.000000 |
| call count | 1 | 1 | 1 | 1 |
+-------------------+-------------+-------------+-------------+-------------+