Hardware performance counters with LIKWID
Introduction
When profiling an application, we often focus on the runtime of each individual function in the code in order to determine the bottleneck of the application. While measuring runtime is important, it doesn't provide any information on the reason why a function is slow. This is where hardware performance counters can come to the rescue.
Hardware performance counters are special-purpose registers built into the CPU that store the count of hardware-related activities. They are designed to gather information about specific events occurring in the hardware. Each time the event occurs, the counter is incremented. For example, using performance counters, we can gather information such as the number of data cache misses or the number of floating-point instructions.
The way hardware performance counters are implemented is very hardware-dependent. Different CPUs may have different performance counters, and a particular CPU may have performance counters that are not available on another CPU. The consequence is that the programmer needs to know the details of how performance counters work for the specific architecture they are using. Fortunately, some libraries and tools provide an abstraction layer that makes it possible to perform measurements without understanding all the intricate details of the hardware.
Performance counters with LIKWID
LIKWID is an open-source performance monitoring and benchmarking suite that abstracts some of the differences between different manufacturers.
Listing available performance groups
LIKWID offers preconfigured event sets, called performance groups, with useful preselected events and derived metrics. These performance groups can be listed using the likwid-perfctr -a command. In order to get access to this command, we first have to load the module:
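$ module load likwid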
Once the module is loaded, we have access to the likwid-perfctr command. The first thing we can do with this command is to list all the available performance groups using the -a option.
$ likwid-perfctr -a
Group name Description
--------------------------------------------------------------------------------
BRANCH Branch prediction miss rate/ratio
CACHE Data cache miss rate/ratio
CLOCK Cycles per instruction
CPI Cycles per instruction
DATA Load to store ratio
DIVIDE Divide unit information
ENERGY Power and Energy consumption
FLOPS_DP Double Precision MFLOP/s
FLOPS_SP Single Precision MFLOP/s
ICACHE Instruction cache miss rate/ratio
L2 L2 cache bandwidth in MBytes/s (experimental)
L3 L3 cache bandwidth in MBytes/s
MEM Main memory bandwidth in MBytes/s (experimental)
MEM_DP Main memory bandwidth in MBytes/s (experimental)
MEM_SP Main memory bandwidth in MBytes/s (experimental)
NUMA Local and remote memory accesses (experimental)
TLB TLB miss rate/ratio
Get information about a group
We can select one of the groups presented in the list above and ask for more information about it by selecting the group with the -g GROUPNAME option together with the -H (for help) option:
$ likwid-perfctr -g CACHE -H
Group CACHE:
Formulas:
data cache requests = DATA_CACHE_ACCESSES
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
data cache misses = DATA_CACHE_REFILLS_ALL
data cache miss rate = DATA_CACHE_REFILLS_ALL / RETIRED_INSTRUCTIONS
data cache miss ratio = DATA_CACHE_REFILLS_ALL / DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. Data cache request rate tells you how data intensive your code is
or how many data accesses you have on average per instruction.
The data cache miss rate gives a measure how often it was necessary to get
cache lines from higher levels of the memory hierarchy. And finally
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be given by your algorithm you should try to get data cache miss ratio
Measuring performance counters
On the login node
The general syntax of the likwid-perfctr command for performance counter measurements is as follows:
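$ likwid-perfctr -g PERFGROUP COMMAND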
where PERFGROUP is the performance group you want to measure and COMMAND is the executable for which you want to perform the measurement. For example, to measure the L1 cache accesses of the hostname command, we can use
$ likwid-perfctr -C 32 -g CACHE hostname
--------------------------------------------------------------------------------
CPU name: AMD EPYC 7542 32-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.90 GHz
--------------------------------------------------------------------------------
nic5-login1
--------------------------------------------------------------------------------
Group 1: CACHE
+------------------------+---------+-------------+
| Event | Counter | HWThread 32 |
+------------------------+---------+-------------+
| ACTUAL_CPU_CLOCK | FIXC1 | 0 |
| MAX_CPU_CLOCK | FIXC2 | 0 |
| RETIRED_INSTRUCTIONS | PMC0 | 383991 |
| CPU_CLOCKS_UNHALTED | PMC1 | 405876 |
| DATA_CACHE_ACCESSES | PMC2 | 138904 |
| DATA_CACHE_REFILLS_ALL | PMC3 | 3469 |
+------------------------+---------+-------------+
+-------------------------+-------------+
| Metric | HWThread 32 |
+-------------------------+-------------+
| Runtime (RDTSC) [s] | 0.0024 |
| Runtime unhalted [s] | 0 |
| Clock [MHz] | - |
| CPI | 1.0570 |
| data cache requests | 138904 |
| data cache request rate | 0.3617 |
| data cache misses | 3469 |
| data cache miss rate | 0.0090 |
| data cache miss ratio | 0.0250 |
+-------------------------+-------------+
Note the use of the -C 32 option. The role of this option is to pin the executable to a particular core and to indicate to LIKWID which core to measure. If we don't use this option, all available cores will be measured. In our example, we pin the command to core number 32 (the first core of the second socket).
The output of likwid-perfctr first presents the raw values collected for each performance counter. The second part of the output presents metrics computed from these counter values.
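These metrics follow directly from the formulas of the performance group applied to the raw counter values. For the CACHE group output above, for instance:
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS = 138904 / 383991 ≈ 0.3617
data cache miss rate    = DATA_CACHE_REFILLS_ALL / RETIRED_INSTRUCTIONS = 3469 / 383991 ≈ 0.0090
data cache miss ratio   = DATA_CACHE_REFILLS_ALL / DATA_CACHE_ACCESSES = 3469 / 138904 ≈ 0.0250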
If we have an OpenMP application, then we need to pin it to multiple cores. For example, with 4 threads, we can pin our application and the measurement to cores 32, 33, 34 and 35 using the -C 32-35 option.
$ export OMP_NUM_THREADS=4
$ likwid-perfctr -C 32-35 -g CACHE ./omp_app
--------------------------------------------------------------------------------
CPU name: AMD EPYC 7542 32-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.90 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: CACHE
+------------------------+---------+-------------+-------------+-------------+-------------+
| Event | Counter | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+------------------------+---------+-------------+-------------+-------------+-------------+
| ACTUAL_CPU_CLOCK | FIXC1 | 0 | 0 | 0 | 0 |
| MAX_CPU_CLOCK | FIXC2 | 0 | 0 | 0 | 0 |
| RETIRED_INSTRUCTIONS | PMC0 | 161191317 | 78095363 | 78074577 | 75929953 |
| CPU_CLOCKS_UNHALTED | PMC1 | 120392720 | 95388787 | 95188872 | 74983664 |
| DATA_CACHE_ACCESSES | PMC2 | 55717473 | 19338050 | 19312044 | 19086053 |
| DATA_CACHE_REFILLS_ALL | PMC3 | 459846 | 21921 | 19367 | 21457 |
+------------------------+---------+-------------+-------------+-------------+-------------+
+-----------------------------+---------+-----------+----------+-----------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+-----------------------------+---------+-----------+----------+-----------+--------------+
| ACTUAL_CPU_CLOCK STAT | FIXC1 | 0 | 0 | 0 | 0 |
| MAX_CPU_CLOCK STAT | FIXC2 | 0 | 0 | 0 | 0 |
| RETIRED_INSTRUCTIONS STAT | PMC0 | 393291210 | 75929953 | 161191317 | 9.832280e+07 |
| CPU_CLOCKS_UNHALTED STAT | PMC1 | 385954043 | 74983664 | 120392720 | 9.648851e+07 |
| DATA_CACHE_ACCESSES STAT | PMC2 | 113453620 | 19086053 | 55717473 | 28363405 |
| DATA_CACHE_REFILLS_ALL STAT | PMC3 | 522591 | 19367 | 459846 | 130647.7500 |
+-----------------------------+---------+-----------+----------+-----------+--------------+
+-------------------------+-------------+-------------+-------------+-------------+
| Metric | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+-------------------------+-------------+-------------+-------------+-------------+
| Runtime (RDTSC) [s] | 0.0999 | 0.0999 | 0.0999 | 0.0999 |
| Runtime unhalted [s] | 0 | 0 | 0 | 0 |
| Clock [MHz] | - | - | - | - |
| CPI | 0.7469 | 1.2214 | 1.2192 | 0.9875 |
| data cache requests | 55717473 | 19338050 | 19312044 | 19086053 |
| data cache request rate | 0.3457 | 0.2476 | 0.2474 | 0.2514 |
| data cache misses | 459846 | 21921 | 19367 | 21457 |
| data cache miss rate | 0.0029 | 0.0003 | 0.0002 | 0.0003 |
| data cache miss ratio | 0.0083 | 0.0011 | 0.0010 | 0.0011 |
+-------------------------+-------------+-------------+-------------+-------------+
+------------------------------+-----------+----------+----------+-------------+
| Metric | Sum | Min | Max | Avg |
+------------------------------+-----------+----------+----------+-------------+
| Runtime (RDTSC) [s] STAT | 0.3996 | 0.0999 | 0.0999 | 0.0999 |
| Runtime unhalted [s] STAT | 0 | 0 | 0 | 0 |
| Clock [MHz] STAT | 0 | inf | 0 | 0 |
| CPI STAT | 4.1750 | 0.7469 | 1.2214 | 1.0437 |
| data cache requests STAT | 113453620 | 19086053 | 55717473 | 28363405 |
| data cache request rate STAT | 1.0921 | 0.2474 | 0.3457 | 0.2730 |
| data cache misses STAT | 522591 | 19367 | 459846 | 130647.7500 |
| data cache miss rate STAT | 0.0037 | 0.0002 | 0.0029 | 0.0009 |
| data cache miss ratio STAT | 0.0115 | 0.0010 | 0.0083 | 0.0029 |
+------------------------------+-----------+----------+----------+-------------+
Batch job (OpenMP)
Using likwid-perfctr in a batch job to measure hardware counters for an OpenMP application is similar to the way it's done on a login node. However, as we cannot know in advance which CPU cores will be allocated, we can remove the -C option: the CPU cores available to the job are already limited by Slurm.
The example below shows the use of likwid-perfctr in a batch script using four OpenMP threads.
#!/bin/bash -l
#
#SBATCH --job-name="LIKWID OpenMP"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
module load likwid
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
likwid-perfctr -g CACHE ./omp_app
Batch job (MPI)
LIKWID includes the likwid-mpirun utility, which can be used to measure hardware performance counters of MPI applications. Like with likwid-perfctr, the group is selected using the -g option. The number of processes (MPI ranks) to launch is defined using the -np option.
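In its simplest form, an invocation therefore looks like the sketch below, where <NPROC>, PERFGROUP and ./mpi_app are placeholders for the number of ranks, the performance group and your executable:
$ likwid-mpirun -np <NPROC> -g PERFGROUP ./mpi_app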
The example script below shows the use of likwid-mpirun in a batch script launching four processes (MPI ranks) distributed over two nodes.
#!/bin/bash -l
#
#SBATCH --job-name="LIKWID MPI"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=15:00
module load likwid
module load OpenMPI
likwid-mpirun --mpi slurm -np ${SLURM_NTASKS} \
-g CACHE ./mpi_app
For hybrid MPI+OpenMP jobs, the strategy is to unset the OMP_NUM_THREADS environment variable, which is set to 1 by default, and to use the -t option of likwid-mpirun to specify the number of OpenMP threads. In a Slurm batch script, we can use this option with ${SLURM_CPUS_PER_TASK} to retrieve the number of threads from Slurm.
The example job script below launches an application with four processes (MPI ranks) distributed over two nodes. Each process uses four OpenMP threads.
#!/bin/bash -l
#
#SBATCH --job-name="LIKWID MPI+OpenMP"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
module load likwid
module load OpenMPI
unset OMP_NUM_THREADS
# make sure --cpus-per-task is propagated
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
likwid-mpirun --mpi slurm -np ${SLURM_NTASKS} \
-t ${SLURM_CPUS_PER_TASK} \
-g CACHE ./mpi_omp_app
Using markers
In most cases, we don't want to perform measurements for the entire application but only for selected regions of the code. In this situation, we can use the LIKWID Marker API. The Marker API consists of function calls and defines that enable the measurement of code regions.
For example, the following defines are available:
LIKWID_MARKER_INIT: initialize the Marker API
LIKWID_MARKER_REGISTER(char* tag): register a region name tag to the Marker API
LIKWID_MARKER_START(char* tag): start a named region identified by tag
LIKWID_MARKER_STOP(char* tag): stop a named region identified by tag
LIKWID_MARKER_CLOSE: finalize the Marker API
To illustrate the use of the Marker API, we will use a code that performs two types of operations with the elements of two arrays: a sum and a product. The first step is to initialize the Marker API with LIKWID_MARKER_INIT, then we register two regions with the tags sum and prod. We register these tags in a parallel OpenMP region so that they are registered for all the threads. The two regions are delimited using LIKWID_MARKER_START and LIKWID_MARKER_STOP. At the end of the program, the Marker API is finalized with LIKWID_MARKER_CLOSE.
#include <stdlib.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif
int main(int argc, char *argv[]) {
const int size = 1000;
double *a = malloc(size * sizeof(double));
double *b = malloc(size * sizeof(double));
double *sum = malloc(size * sizeof(double));
double *prod = malloc(size * sizeof(double));
LIKWID_MARKER_INIT;
#pragma omp parallel
{
LIKWID_MARKER_REGISTER("sum");
LIKWID_MARKER_REGISTER("prod");
#pragma omp barrier
#pragma omp for
for (int i = 0; i < size; i++) {
a[i] = (double)(i + 1);
b[i] = a[i];
}
LIKWID_MARKER_START("sum");
#pragma omp for
for (int i = 0; i < size; i++) {
sum[i] = a[i] + b[i];
}
LIKWID_MARKER_STOP("sum");
LIKWID_MARKER_START("prod");
#pragma omp for
for (int i = 0; i < size; i++) {
prod[i] = a[i] * b[i];
}
LIKWID_MARKER_STOP("prod");
}
printf(" Sum: first = %lf - last = %lf\n", sum[0], sum[size-1]);
printf(" Product: first = %lf - last = %lf\n", prod[0], prod[size-1]);
LIKWID_MARKER_CLOSE;
free(a); free(b); free(sum); free(prod);
return 0;
}
The example above can be compiled with and without the Marker API, which is enabled by defining the LIKWID_PERFMON macro at compile time. In addition, we need to provide the paths to the LIKWID include and library directories, as well as link against the LIKWID library:
export LIKWID_LIB=$(dirname $(which likwid-perfctr))/../lib/
export LIKWID_INC=$(dirname $(which likwid-perfctr))/../include/
gcc -O3 -fopenmp -DLIKWID_PERFMON \
-L$LIKWID_LIB -I$LIKWID_INC \
-o markers markers_example.c -llikwid
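To build the same source without the Marker API, we can simply omit the LIKWID_PERFMON define, the LIKWID paths and the -llikwid flag; the fallback macros defined at the top of the example then expand to nothing. For instance:
gcc -O3 -fopenmp -o markers markers_example.c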
To enable the Marker API during the measurement, we need to pass the -m option. This option can be used with both likwid-perfctr and likwid-mpirun.
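For example, to measure the CACHE group for the regions defined in the markers executable built above, using four OpenMP threads pinned to cores 32-35 as in the earlier example, we could run:
$ export OMP_NUM_THREADS=4
$ likwid-perfctr -C 32-35 -g CACHE -m ./markers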
The output of this command will contain two sections, one section per region defined by the marker API:
Region sum, Group 1: CACHE
+-------------------+-------------+-------------+-------------+-------------+
| Region Info | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+-------------------+-------------+-------------+-------------+-------------+
| RDTSC Runtime [s] | 0.000008 | 0.000027 | 0.000001 | 0.000015 |
| call count | 1 | 1 | 1 | 1 |
+-------------------+-------------+-------------+-------------+-------------+
...
Region prod, Group 1: CACHE
+-------------------+-------------+-------------+-------------+-------------+
| Region Info | HWThread 32 | HWThread 33 | HWThread 34 | HWThread 35 |
+-------------------+-------------+-------------+-------------+-------------+
| RDTSC Runtime [s] | 0.000021 | 0.000007 | 0.000014 | 0.000000 |
| call count | 1 | 1 | 1 | 1 |
+-------------------+-------------+-------------+-------------+-------------+