Parallel application performance analysis with Score-P and Scalasca
Identifying performance bottlenecks is critical to running an application at scale. This chapter is a short tutorial explaining how to perform a performance analysis using Score-P and Scalasca.
Score-P is a measurement infrastructure for profiling and event tracing of HPC applications. Score-P supports analyzing C, C++ and Fortran applications that make use of multiprocessing (MPI), thread parallelism (OpenMP), and accelerators (GPGPU).
Scalasca is a tool that helps with the performance optimization of parallel applications by identifying potential performance bottlenecks. Scalasca relies on Score-P to measure the application behavior.
Reference run
In this tutorial, we will consider a hybrid OpenMP+MPI application.
Before running any profiling, we will first do a reference measurement, i.e., run the application as we would do in production: without instrumentation. The first step is to compile the application to produce an executable with name myapp.
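For a hybrid MPI+OpenMP Fortran code, the compilation line could look like the example below; the compiler wrapper, source files and flags are placeholders to adapt to your own application and toolchain:
# placeholder compile line: MPI compiler wrapper with OpenMP support enabled
mpifort -O2 -fopenmp *.f90 -o myapp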
Next, we will submit a job to the queue using the following job script:
#!/bin/bash -l
#
#SBATCH --job-name="Scalasca profile"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
#SBATCH --mem-per-cpu=2000
module load OpenMPI
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
srun ./myapp
The job script above runs the application using 8 ranks, distributed on 4 nodes (2 ranks per node). Each rank will use 4 threads (8 ranks x 4 threads = 32 cores). We also set OMP_PROC_BIND to true to prevent the migration of the threads to a different core during the run. With thread binding set to true, a thread will always use the same core.
We can submit the job to the queue using the command:
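# submit the reference run; "reference.job" is a placeholder for the file containing the job script above
sbatch reference.job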
This run will serve later as a reference to determine whether the measurement overhead is significant or not. For now, we note that this reference run took about 7 minutes.
Run a first profile
When profiling an application, we can use two methods to collect data. The first one is sampling. Sampling is a statistical profiling method: by taking regular snapshots of the application's call stack, we can create a statistical profile of where the application spends most of its time.
One of the main advantages of a sampling experiment is its low overhead, which is fixed by the choice of sampling rate. On the other hand, sampling is non-deterministic and can only provide a statistical picture of the application behaviour.
The second option is to use tracing. Tracing revolves around specific program events, like entering or exiting a function. This allows the collection of accurate information about specific areas of the code every time the event occurs, and therefore provides more accurate and more detailed information, as data are collected from every traced function call rather than from a statistical average. Tracing may require the program to be instrumented, i.e., introducing probe points in the binary to collect data about its execution.
Score-P, which is the tool we will use to collect data, relies on instrumentation. Instrumentation adds probe points to our executable in order to perform measurements. The Score-P instrumenter command scorep automatically takes care of compilation and linking to produce an instrumented executable. To use it, we simply prefix our compilation line with the scorep command:
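# illustrative example: the placeholder compile line from before, prefixed with scorep
scorep mpifort -O2 -fopenmp *.f90 -o myapp_instr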
This time we choose to add a _instr
suffix to the name of the executable so
that we can differentiate the non-instrumented and instrumented binaries.
In a first step, we will perform a profiling of this instrumented binary. Profiling collects basic information about our application, like the number of times a function is called and how much time this function takes to execute. The goal of the profiling experiment is to determine if the measurement introduces overhead.
An example job script to submit the profiling experiment to the queue is presented below.
#!/bin/bash -l
#
#SBATCH --job-name="Scalasca profile"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=30:00
#SBATCH --mem-per-cpu=2000
# Loading Scalasca will also load OpenMPI
module load Scalasca
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
scalasca -analyze srun ./myapp_instr
There are two differences with the job script used for the reference run:
- we doubled the time allocated to the job, going from 15 minutes to 30 minutes, in case the measurement introduces a large overhead
- we added scalasca -analyze in front of srun so that Scalasca can configure and process the performance experiment
The next step is to submit the job to the queue
and wait for the execution to complete. Once the execution is completed, a
scorep_myapp_instr_8x4_sum
directory will contain the data collected. The name
of this directory will depend on the executable name and the number of processes
and number of threads. The _sum
suffix means it's a summary (profiling)
experiment.
We can perform the post-processing step using the scalasca -examine command:
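# score the summary experiment and produce a textual report
scalasca -examine -s scorep_myapp_instr_8x4_sum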
The -s option is used to produce a textual report and skip running the graphical user interface (GUI) application, as running GUI applications on NIC5 is not ideal. This command will produce two files:
- summary.cubex: a post-processed runtime summary result
- scorep.score: a detailed measurement score report
One thing we can also note is that we probably have a significant overhead for the instrumented application: the non-instrumented run took 7 minutes, while the instrumented application took 26 minutes. In order to investigate the problem, we can examine the report (score) file:
$ head -n 25 ./scorep_myapp_instr_8x4_sum/scorep.score
Estimated aggregate size of event trace: 3330GB
Estimated requirements for largest trace buffer (max_buf): 418GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 418GB
(warning: The memory requirements cannot be satisfied by Score-P to avoid
intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
maximum supported memory or reduce requirements using USR regions filters.)
flt type max_buf[B] visits time[s] time[%] time/visit[us] region
ALL 448,221,306,979 137,470,265,525 47757.80 100.0 0.35 ALL
USR 447,952,616,462 137,419,074,397 23817.62 49.9 0.17 USR
OMP 256,609,792 47,542,272 23653.47 49.5 497.53 OMP
COM 11,721,190 3,606,520 13.13 0.0 3.64 COM
MPI 359,494 42,328 273.57 0.6 6463.19 MPI
SCOREP 41 8 0.00 0.0 51.01 SCOREP
USR 145,711,040,072 44,677,967,872 9565.12 20.0 0.21 binvcrhs
USR 145,711,040,072 44,677,967,872 7873.16 16.5 0.18 matmul_sub
USR 145,711,040,072 44,677,967,872 5873.59 12.3 0.13 matvec_sub
USR 3,916,768,154 1,152,495,616 208.91 0.4 0.18 lhsinit
USR 3,916,768,154 1,152,495,616 163.12 0.3 0.14 binvrhs
USR 3,510,517,192 1,080,078,336 133.68 0.3 0.12 exact_solution
OMP 22,361,088 2,056,192 0.59 0.0 0.29 !$omp parallel @exch_qbc.f90:206
OMP 22,361,088 2,056,192 0.59 0.0 0.29 !$omp parallel @exch_qbc.f90:217
OMP 22,361,088 2,056,192 0.60 0.0 0.29 !$omp parallel @exch_qbc.f90:245
The first thing we gather from this report is that Score-P estimates that 3330 GB will be required to store the trace data when we do a tracing experiment, which means that 418 GB of memory per process would be required to avoid intermediate flushes to the disk.
From the next section, we can see that most of the data comes from six functions which are called often (high visits count) for a very short period of time (low time/visit). These six functions account for 50% of the execution time, most of which is very likely measurement overhead due to frequently executed small functions. We can also see that these six functions contribute massively to the trace data storage requirements (max_buf column).
Optimize the measurement
Adapt these steps to your application
The steps described in this section are very application dependent. You might not need a filter, and the functions to filter will depend on the application under study.
In order to reduce the trace size and measurement overhead, we can create a
filter file to exclude the functions that are called often but do not actually
take a long time to execute. The filter file that we will name scorep.filter
looks like this:
SCOREP_REGION_NAMES_BEGIN
EXCLUDE
binvcrhs
matmul_sub
matvec_sub
lhsinit
binvrhs
exact_solution
SCOREP_REGION_NAMES_END
We can immediately see the effect of filtering by running the post-processing
command again, but this time, with the filter applied (-f scorep.filter
option).
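For example, reusing the experiment directory produced by the profiling run:
# re-score the summary experiment with the filter file applied
scalasca -examine -s -f scorep.filter scorep_myapp_instr_8x4_sum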
Then, if we look at the report, we can see that the total size of the trace has dropped from 3330 GB to 2053 MB. The memory requirement to avoid intermediate flushes is now 265 MB.
$ head -n 7 ./scorep_myapp_instr_8x4_sum/scorep.score
Estimated aggregate size of event trace: 2053MB
Estimated requirements for largest trace buffer (max_buf): 257MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 265MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=265MB to avoid intermediate flushes
or reduce requirements using USR regions filters.)
Visualizing the data
We can use the Cube GUI to open the summary cube file generated by the profiling experiment. Cube GUI is available for all major operating systems.
The first step is to copy the summary cube file (summary.cubex
) located in the
experiment directory (scorep_myapp_instr_8x4_sum
) from NIC5 to your computer.
See this section to see how to copy files from NIC5
to your computer.
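For example, from a terminal on your own machine, a command along these lines can be used (the host alias and the path to the experiment directory are placeholders; adapt them to your SSH configuration and to where you ran the job):
# copy the summary file from NIC5 to the current directory on your computer
scp nic5:path/to/scorep_myapp_instr_8x4_sum/summary.cubex .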
Next, start the Cube application on your computer and open the summary.cubex
file.
The Cube GUI application presents the data in three panes (hence the name "cube"):
- a pane with the performance metrics, where a number of metrics are available, such as computation and communication time
- a pane with the call path, which contains the call tree of your application
- a pane with the system resources, which contains compute nodes, processes and threads, depending on the parallel programming model
Basic usage of the Cube GUI is as follows:
- We select a metric in the left pane.
- The center pane will display the total contribution of each function of your application to the selected metric.
- The right pane will display the contribution of each system element (compute nodes, processes and/or threads) to the metric(s) and function(s) selected.
With a profiling experiment, we only have a handful of metrics available. Moreover, as discussed before, we cannot consider these measurements reliable, as we observed a significant overhead during profiling.
Tracing run
In order to collect more information and obtain more metrics, we can run a tracing experiment. The modified job script to perform this tracing experiment is presented below.
#!/bin/bash -l
#
#SBATCH --job-name="Scalasca profile"
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00
#SBATCH --mem-per-cpu=2000
module load Scalasca
export SCOREP_TOTAL_MEMORY=265MB
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
scalasca -analyze -q -t -f scorep.filter srun ./myapp_instr
The difference with the profiling experiment we ran before is that we disabled profiling using the -q option and enabled tracing with the -t option. We also used the -f FILTERFILE option to provide a filter file with a list of functions to exclude from the measurement.
The order of the options is important
The order in which the -q and -t options are used is important. The -q option disables both profiling and tracing. The -t option enables tracing. If the options are passed in reverse order, then tracing will be disabled.
We can submit the job to the queue using the command:
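# submit the tracing run; "trace.job" is a placeholder for the file containing the job script above
sbatch trace.job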
Then, we wait for the execution to complete. This time, the execution time was very similar to the execution time of the reference run with the non-instrumented application. This means that the instrumentation did not introduce significant overhead and that the data collected should be close to the actual values for a non-instrumented application.
We can perform the post-processing step. This time the data collected will be
stored in the scorep_myapp_instr_8x4_trace
directory.
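The post-processing command is the same as for the profiling experiment, now pointing at the trace experiment directory:
# post-process and score the trace experiment
scalasca -examine -s scorep_myapp_instr_8x4_trace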
This will produce a file with name trace.cubex
. We can copy this file to our
computer for visualization with Cube.
Compared to the profiling run, we have more metrics available. Information about the metrics can be found here.