Parallel application performance analysis with Score-P and Scalasca
Identifying performance bottlenecks is critical to running an application at scale. This chapter is a short tutorial explaining how to perform a performance analysis using Score-P and Scalasca.
Score-P is a measurement infrastructure for profiling and event tracing of HPC applications. Score-P supports analyzing C, C++ and Fortran applications that make use of multiprocessing (MPI), thread parallelism (OpenMP), and accelerators (GPGPU).
Scalasca is a tool that helps with the performance optimization of parallel applications by identifying potential performance bottlenecks. Scalasca relies on Score-P to measure the application behavior.
Reference run
In this tutorial, we will consider a hybrid OpenMP+MPI application.
Before running any profiling, we will first do a reference measurement, i.e., run the application as we would do in production: without instrumentation. The first step is to compile the application to produce an executable with name myapp.
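For a hybrid MPI+OpenMP Fortran code, the compilation line could look like the example below; the compiler wrapper, source files and flags are placeholders to adapt to your own application and toolchain:
# placeholder compile line: MPI compiler wrapper with OpenMP support enabled
mpifort -O2 -fopenmp *.f90 -o myapp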
Next, we will submit a job to the queue using the following job script:
#!/bin/bash -l
#
#SBATCH --job-name="Scalasca profile"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
#SBATCH --mem-per-cpu=2000
module load OpenMPI
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
srun ./myapp
The job script above runs the application using 8 ranks, distributed on 4 nodes (2 ranks per node). Each rank will use 4 threads (8 ranks x 4 threads = 32 cores). We also set OMP_PROC_BIND to true to prevent the migration of the threads to a different core during the run. With thread binding set to true, a thread will always use the same core.
We can submit the job to the queue using the command:
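# submit the reference run; "reference.job" is a placeholder for the file containing the job script above
sbatch reference.job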
This run will serve later as a reference to determine whether the measurement overhead is significant or not. For now, we note that this reference run took about 7 minutes.
Run a first profile
When profiling an application, we can use two methods to collect data. The first one is sampling. Sampling is a statistical profiling method: by taking regular snapshots of the application's call stack, we can create a statistical profile of where the application spends most of its time.
One of the main advantages of a sampling experiment is its low overhead, which is fixed by the choice of sampling rate. On the other hand, sampling is non-deterministic and can only provide a statistical picture of the application behaviour.
The second option is to use tracing. Tracing revolves around specific program events, like entering or exiting a function. This allows the collection of accurate information about specific areas of the code every time the event occurs, and therefore provides more accurate and more detailed information, as data are collected from every traced function call rather than from a statistical average. Tracing may require the program to be instrumented, i.e., introducing probe points in the binary to collect data about its execution.
Score-P, which is the tool we will use to collect data, relies on instrumentation. Instrumentation adds probe points to our executable in order to perform measurements. The Score-P instrumenter command scorep automatically takes care of compilation and linking to produce an instrumented executable. To use it, we simply prefix our compilation line with the scorep command:
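# illustrative example: the placeholder compile line from before, prefixed with scorep
scorep mpifort -O2 -fopenmp *.f90 -o myapp_instr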
This time we choose to add a _instr
suffix to the name of the executable so
that we can differentiate the non-instrumented and instrumented binaries.
In a first step, we will perform a profiling of this instrumented binary. Profiling collects basic information about our application, like the number of times a function is called and how much time this function takes to execute. The goal of the profiling experiment is to determine if the measurement introduces overhead.
An example job script to submit the profiling experiment to the queue is presented below.
#!/bin/bash -l
#
#SBATCH --job-name="Scalasca profile"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=30:00
#SBATCH --mem-per-cpu=2000
# Loading Scalasca will also load OpenMPI
module load Scalasca
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
scalasca -analyze srun ./myapp_instr
There are two differences with the job script used for the reference run:
- we doubled the time allocated to the job, going from 15 minutes to 30 minutes, in case the measurement introduces a large overhead
- we added scalasca -analyze in front of srun so that Scalasca can configure and process the performance experiment
The next step is to submit the job to the queue
and wait for the execution to complete. Once the execution is completed, a
scorep_myapp_instr_8x4_sum
directory will contain the data collected. The name
of this directory will depend on the executable name and the number of processes
and number of threads. The _sum
suffix means it's a summary (profiling)
experiment.
We can perform the post-processing step using the scalasca -examine command:
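# score the summary experiment and produce a textual report
scalasca -examine -s scorep_myapp_instr_8x4_sum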
The -s option is used to produce a textual report and skip running the graphical user interface (GUI) application, as running GUI applications on NIC5 is not ideal. This command will produce two files:
- summary.cubex: a post-processed runtime summary result
- scorep.score: a detailed measurement score report
One thing we can also note is that we probably have a significant overhead for the instrumented application: the non-instrumented run took 7 minutes, while the instrumented application took 26 minutes. In order to investigate the problem, we can examine the report (score) file:
$ head -n 25 ./scorep_myapp_instr_8x4_sum/scorep.score
Estimated aggregate size of event trace: 3330GB
Estimated requirements for largest trace buffer (max_buf): 418GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 418GB
(warning: The memory requirements cannot be satisfied by Score-P to avoid
intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
maximum supported memory or reduce requirements using USR regions filters.)
flt type max_buf[B] visits time[s] time[%] time/visit[us] region
ALL 448,221,306,979 137,470,265,525 47757.80 100.0 0.35 ALL
USR 447,952,616,462 137,419,074,397 23817.62 49.9 0.17 USR
OMP 256,609,792 47,542,272 23653.47 49.5 497.53 OMP
COM 11,721,190 3,606,520 13.13 0.0 3.64 COM
MPI 359,494 42,328 273.57 0.6 6463.19 MPI
SCOREP 41 8 0.00 0.0 51.01 SCOREP
USR 145,711,040,072 44,677,967,872 9565.12 20.0 0.21 binvcrhs
USR 145,711,040,072 44,677,967,872 7873.16 16.5 0.18 matmul_sub
USR 145,711,040,072 44,677,967,872 5873.59 12.3 0.13 matvec_sub
USR 3,916,768,154 1,152,495,616 208.91 0.4 0.18 lhsinit
USR 3,916,768,154 1,152,495,616 163.12 0.3 0.14 binvrhs
USR 3,510,517,192 1,080,078,336 133.68 0.3 0.12 exact_solution
OMP 22,361,088 2,056,192 0.59 0.0 0.29 !$omp parallel @exch_qbc.f90:206
OMP 22,361,088 2,056,192 0.59 0.0 0.29 !$omp parallel @exch_qbc.f90:217
OMP 22,361,088 2,056,192 0.60 0.0 0.29 !$omp parallel @exch_qbc.f90:245
The first thing we gather from this report is that Score-P estimates that 3330 GB will be required to store the trace data when we do a tracing experiment, which means that 418 GB of memory per process would be required to avoid intermediate flushes to the disk.
From the next section, we can see that most of the data comes from six functions which are called often (high visits count) for a very short period of time (low time/visit). These six functions account for 50% of the execution time, most of which is very likely measurement overhead due to frequently executed small functions. We can also see that these six functions contribute massively to the trace data storage requirements (max_buf column).
Optimize the measurement
Adapt these steps to your application
The steps described in this section are very application dependent. You might not need a filter, and the functions to filter will depend on the application under study.
In order to reduce the trace size and measurement overhead, we can create a
filter file to exclude the functions that are called often but do not actually
take a long time to execute. The filter file that we will name scorep.filter
looks like this:
SCOREP_REGION_NAMES_BEGIN
EXCLUDE
binvcrhs
matmul_sub
matvec_sub
lhsinit
binvrhs
exact_solution
SCOREP_REGION_NAMES_END
We can immediately see the effect of filtering by running the post-processing
command again, but this time, with the filter applied (-f scorep.filter
option).
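For example, reusing the experiment directory produced by the profiling run:
# re-score the summary experiment with the filter file applied
scalasca -examine -s -f scorep.filter scorep_myapp_instr_8x4_sum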
Then, if we look at the report, we can see that the total size of the trace has dropped from 3330 GB to 2053 MB. The memory requirement to avoid intermediate flushes is now 265 MB.
$ head -n 7 ./scorep_myapp_instr_8x4_sum/scorep.score
Estimated aggregate size of event trace: 2053MB
Estimated requirements for largest trace buffer (max_buf): 257MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 265MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=265MB to avoid intermediate flushes
or reduce requirements using USR regions filters.)
Visualizing the data
We can use the Cube GUI to open the summary cube file generated by the profiling experiment. Cube GUI is available for all major operating systems.
The first step is to copy the summary cube file (summary.cubex
) located in the
experiment directory (scorep_myapp_instr_8x4_sum
) from NIC5 to your computer.
See this section to see how to copy files from NIC5
to your computer.
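For example, from a terminal on your own machine, a command along these lines can be used (the host alias and the path to the experiment directory are placeholders; adapt them to your SSH configuration and to where you ran the job):
# copy the summary file from NIC5 to the current directory on your computer
scp nic5:path/to/scorep_myapp_instr_8x4_sum/summary.cubex .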
Next, start the Cube application on your computer and open the summary.cubex
file.
The Cube GUI application presents the data in three panes (hence the name "cube"):
- a pane with the performance metrics, where a number of metrics are available, such as computation and communication time
- a pane with the call path, which contains the call tree of your application
- a pane with the system resources, which contains compute nodes, processes and threads, depending on the parallel programming model
Basic usage of the Cube GUI is as follows:
- We select a metric in the left pane.
- The center pane will display the total contribution of each function of your application to the selected metric.
- The right pane will display the contribution of each system element (compute nodes, processes and/or threads) to the metric(s) and function(s) selected.
With a profiling experiment, we only have a handful of metrics available. Moreover, as discussed before, we cannot consider these measurements reliable, as we observed a significant overhead during profiling.
Tracing run
In order to collect more information and obtain more metrics, we can run a tracing experiment. The modified job script to perform this tracing experiment is presented below.
#!/bin/bash -l
#
#SBATCH --job-name="Scalasca profile"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
#SBATCH --mem-per-cpu=2000
module load Scalasca
export SCOREP_TOTAL_MEMORY=265MB
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
scalasca -analyze -q -t -f scorep.filter srun ./myapp_instr
The difference with the profiling experiment we ran before is that we disabled profiling using the -q option and enabled tracing with the -t option. We also used the -f FILTERFILE option to provide a filter file with a list of functions to exclude from the measurement.
The order of the options is important
The order in which the -q and -t options are used is important. The -q option disables both profiling and tracing. The -t option enables tracing. If the options are passed in reverse order, then tracing will be disabled.
We can submit the job to the queue using the command:
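# submit the tracing run; "trace.job" is a placeholder for the file containing the job script above
sbatch trace.job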
Then, we wait for the execution to complete. This time, the execution time was very similar to the execution time of the reference run with the non-instrumented application. This means that the instrumentation did not introduce significant overhead and that the data collected should be close to the actual values for a non-instrumented application.
We can perform the post-processing step. This time the data collected will be
stored in the scorep_myapp_instr_8x4_trace
directory.
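The post-processing command is the same as for the profiling experiment, now pointing at the trace experiment directory:
# post-process and score the trace experiment
scalasca -examine -s scorep_myapp_instr_8x4_trace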
This will produce a file with name trace.cubex
. We can copy this file to our
computer for visualization with Cube.
Compared to the profiling run, we have more metrics available. Information about the metrics can be found here.