
Scalability Analysis

Once you've parallelized your code, it is important to quantify how efficient this parallelization is. For a parallel application, the speedup achieved by running on $N$ cores, $S(N)$, can be calculated by measuring the time it takes to run on one core, $T(1)$, and dividing it by the time it takes to run the calculation on $N$ cores, $T(N)$.

$$ S(N) = \frac{T(1)}{T(N)} $$

For a perfectly parallel application, the speedup should be equal to $N$, i.e., the speedup from parallelization would be linear. In practice, this is rarely the case. This is the reason why we introduce a second metric, the parallel efficiency $P_{eff}$, which measures how efficient the parallelization is compared to the ideal case.

$$ P_{eff}(N) = \frac{1}{N} \cdot \frac{T(1)}{T(N)} $$
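
As an illustration of these two formulas, the snippet below computes both metrics from two measured timings; the values used here are taken from the MPI strong scaling results presented later in this chapter.

# Compute speedup and parallel efficiency from two measured run times.
T1=449.174   # execution time on 1 core, in seconds
TN=8.890     # execution time on N cores, in seconds
N=64         # number of cores

awk -v t1="$T1" -v tn="$TN" -v n="$N" 'BEGIN {
    printf "Speedup: %.2f\n", t1 / tn
    printf "Parallel efficiency: %.2f%%\n", 100 * t1 / (n * tn)
}'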

As a general rule, you should not run a parallel application at a scale for which the parallel efficiency is lower than 70%. For example, if you can run your application on 64 ranks with a 76% parallel efficiency (24% of the resources wasted) and on 128 ranks with a 65% parallel efficiency (35% wasted), you should run it on 64 ranks instead of wasting resources with 128 ranks. Even for a small HPC cluster like NIC5, with a total cost of ownership (TCO) of ~1.4M€, the difference over the lifetime of the machine between wasting 24% and wasting 35% of the resources represents a significant amount of money:

$$ (1.4\text{M€} \cdot 35\%) - (1.4\text{M€} \cdot 24\%) = 154\ 000\text{€} $$

At the scale of the largest supercomputer in Europe, LUMI at the time of writing, with a TCO of ~200M€, this difference represents

$$ (200\text{M€} \cdot 35\%) - (200\text{M€} \cdot 24\%) = 22\text{M€} $$

Wasted money is not the only factor that should be taken into account. The energy cost of your job can also be significant: an inefficient job consumes more energy than an efficient one to produce the same result. This means that you may be wasting money (cost of energy) as well as producing unnecessary CO2 emissions.

In this chapter, we analyze the scalability of the miniWeather Mini App, which mimics the basic dynamics seen in atmospheric weather and climate.

For the exam

This chapter presents typical results you may obtain while performing the scaling analysis of your own code, but it does not present an analysis of the underlying causes of the observed behavior. However, for the exam, you should not limit yourself to data collection. You need to explain the behavior of your code based on what was taught in the courses.

Archive to reproduce the results of this chapter
Excel file with the results

General Recommendations

Scalability analysis should be performed in a way that avoids influence from external factors such as other jobs running on the same node. For this reason, it is recommended to run the analysis using the --exclusive option of sbatch, which allocates full node(s) to the job. For example, on NIC5, you can use the following directives in your job script:

#SBATCH --exclusive
#SBATCH --partition=hmem
#SBATCH --mem=0

Here, we use the hmem partition as it is easier to get a full-node allocation using that partition. We also allocate the entire memory on the node with the --mem=0 option.
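
Putting these recommendations together, a complete job script could look like the sketch below. The module and executable names are placeholders and must be adapted to your own environment and build of miniWeather.

#!/bin/bash
#SBATCH --job-name=scaling
#SBATCH --exclusive
#SBATCH --partition=hmem
#SBATCH --mem=0
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --time=01:00:00

# Load the toolchain used to build the application (placeholder module name)
module load OpenMPI

# Run the application (placeholder executable name)
srun ./miniweather_mpi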

Strong Scaling

To conduct the strong scaling analysis, we increase the number of OpenMP threads and/or MPI ranks while keeping the problem size constant. Strong scaling measures the reduction of the time to solution that can be achieved by using more computational resources.

OpenMP Strong Scaling

For the OpenMP strong scaling analysis, we increase the number of OpenMP threads and vary the thread binding to investigate whether the proximity or separation of the threads affects the scalability of the miniWeather application.

Setting the OMP_PROC_BIND environment variable to spread with

export OMP_PROC_BIND=spread

ensures that threads are spaced widely across cores. For instance, on NIC5, which has 64 cores per node, configuring 4 threads with spread binding results in assignments to cores 0, 16, 32, and 48.
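
For reference, one measurement of this series could be launched as sketched below, inside an exclusive single-node job. The executable name is a placeholder, and the use of OMP_PLACES=cores is an assumption about the exact configuration used.

export OMP_NUM_THREADS=4      # number of threads: 1, 2, 4, 8, 16, 32 or 64
export OMP_PROC_BIND=spread   # spread the threads over the available cores
export OMP_PLACES=cores       # one place per physical core

# Give the single task access to all 64 cores of the node so that the OpenMP
# runtime can spread the threads across the whole node.
srun --ntasks=1 --cpus-per-task=64 ./miniweather_openmp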

The table below presents the results of the OpenMP strong scaling with this value of OMP_PROC_BIND.

Num. threads Exec. time (s) Ideal time (s) Speedup Ideal Speedup Parallel Eff.
1 448.278 448.278 1.00 1.00 100.00%
2 305.802 224.139 1.47 2.00 73.30%
4 184.969 112.070 2.42 4.00 60.59%
8 98.357 56.035 4.56 8.00 56.97%
16 73.980 28.017 6.06 16.00 37.87%
32 45.404 14.009 9.87 32.00 30.85%
64 33.496 7.004 13.38 64.00 20.91%

Based on the table above, we can conclude that the miniWeather application does not scale well in this configuration. The parallel efficiency drops below the 70% threshold as soon as we use 4 threads and keeps degrading as more threads are added.

An alternative is to perform the same measurement with the threads bound close to each other by using

export OMP_PROC_BIND=close
which will lead to the results presented in the table below.

Num. threads Exec. time (s) Ideal time (s) Speedup Ideal Speedup Parallel Eff.
1 448.660 448.660 1.00 1.00 100.00%
2 226.300 224.330 1.98 2.00 99.13%
4 114.200 112.165 3.93 4.00 98.22%
8 63.225 56.083 7.10 8.00 88.70%
16 36.429 28.041 12.32 16.00 76.98%
32 26.844 14.021 16.71 32.00 52.23%
64 35.563 7.010 12.62 64.00 19.71%

This configuration leads to better results than the previous one. The parallel efficiency stays above 70% up to 16 threads. Moreover, the parallelization with 2 and 4 threads is very efficient, with a parallel efficiency above 98%.

Conclusions

  • The most efficient thread binding is close.
  • Using 4 or 8 threads is the most efficient option (high parallel efficiency).
  • We should not use more than 16 threads, as using more threads leads to a parallel efficiency below 70%.

MPI Strong Scaling

We can now perform the same analysis with MPI, by increasing the number of ranks while keeping the problem size constant. The results obtained on a single node are presented in the table below.
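
Such a series of measurements can be collected by sweeping over the number of ranks inside a single exclusive job, as in the sketch below (the executable name is a placeholder).

# Strong scaling sweep: the problem size stays constant, only the number
# of MPI ranks changes.
for NRANKS in 1 2 4 8 16 32 64; do
    echo "=== ${NRANKS} MPI ranks ==="
    srun --ntasks=${NRANKS} ./miniweather_mpi
done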

Num. ranks Exec. time (s) Ideal time (s) Speedup Ideal Speedup Parallel Eff.
1 449.174 449.174 1.00 1.00 100.00%
2 226.807 224.587 1.98 2.00 99.02%
4 114.392 112.294 3.93 4.00 98.17%
8 58.277 56.147 7.71 8.00 96.34%
16 29.782 28.073 15.08 16.00 94.26%
32 15.423 14.037 29.12 32.00 91.01%
64 8.890 7.018 50.53 64.00 78.95%

We can see that the code scales better with MPI than with OpenMP: on 64 ranks, the parallel efficiency is close to 79%, while with OpenMP it was ~20%.

The beauty of MPI is that we are not limited to a single compute node. The table below presents the results obtained using 2 compute nodes.
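
For example, the two-node runs can be requested with directives such as the ones below (a sketch; the executable name is a placeholder).

#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64

srun ./miniweather_mpi   # 2 x 64 = 128 ranks spread over the two nodes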

Num. ranks Exec. time (s) Ideal time (s) Speedup Ideal Speedup Parallel Eff.
1 449.174 449.174 1.00 1.00 100.00%
2 226.895 224.587 1.98 2.00 98.98%
4 114.515 112.294 3.92 4.00 98.06%
8 58.027 56.147 7.74 8.00 96.76%
16 29.923 28.073 15.01 16.00 93.82%
32 15.767 14.037 28.49 32.00 89.03%
64 8.644 7.018 51.96 64.00 81.19%
128 4.962 3.509 90.52 128.00 70.72%

Overall, the application scales as well on two nodes as on a single node. For an equivalent number of ranks, the parallel efficiencies on one node and on two nodes are very similar.

Note that the results for 64 and 128 ranks might not reflect the actual parallel efficiency of the application. As we increase the number of ranks, the amount of work per MPI rank might become so small that the communication overhead becomes significant compared to the time spent doing actual computation.

Conclusions

  • The application's MPI scalability is better than its OpenMP scalability.
  • The behavior on a single compute node and on 2 compute nodes is similar.

Hybrid MPI+OpenMP Strong Scaling

Now that we have scaling results for OpenMP and MPI, we can perform the same analysis by mixing the two models. As we have learned previously that the OpenMP parallel efficiency with more than 8 threads is very low, we limit the analysis to 2, 4, and 8 threads per rank. The results are presented in the tables below.
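
A hybrid run combines a number of MPI ranks with a number of OpenMP threads per rank; one possible launch is sketched below (the executable name is a placeholder).

export OMP_NUM_THREADS=4     # OpenMP threads per MPI rank (2, 4 or 8 in this analysis)
export OMP_PROC_BIND=close   # the binding that performed best in the OpenMP analysis
export OMP_PLACES=cores

# 8 MPI ranks x 4 threads per rank = 32 cores in total
srun --ntasks=8 --cpus-per-task=${OMP_NUM_THREADS} ./miniweather_mpi_openmp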

2 OpenMP threads per MPI rank

Num. ranks Exec. time (s) Ideal time (s) Speedup Ideal Speedup Parallel Eff.
1 226.362 224.587 1.98 2.00 99.22%
2 114.655 112.294 3.92 4.00 97.94%
4 58.109 56.147 7.73 8.00 96.62%
8 29.632 28.073 15.16 16.00 94.74%
16 15.206 14.037 29.54 32.00 92.31%
32 8.138 7.018 55.19 64.00 86.24%

4 OpenMP threads per MPI rank

Num. ranks Exec. time (s) Ideal time (s) Speedup Ideal Speedup Parallel Eff.
1 114.054 112.294 3.94 4.00 98.46%
2 58.240 56.147 7.71 8.00 96.41%
4 29.626 28.073 15.16 16.00 94.76%
8 15.027 14.037 29.89 32.00 93.41%
16 8.009 7.018 56.08 64.00 87.63%

8 OpenMP threads per MPI rank

Num. ranks Exec. time (s) Ideal time (s) Speedup Ideal Speedup Parallel Eff.
1 62.600 56.147 7.18 8.00 89.69%
2 32.996 28.073 13.61 16.00 85.08%
4 17.968 14.037 25.00 32.00 78.12%
8 10.973 7.018 40.93 64.00 63.96%

From the tables above, we can see that mixing OpenMP and MPI leads to better scaling than using OpenMP alone. Using 2 or 4 threads per rank results in timings close to those obtained when running with MPI alone. While the difference is small, one can argue that the application scales slightly better with 4 threads per rank than with MPI alone.

Conclusions

  • On a single node, running in hybrid OpenMP+MPI mode is better than running only with OpenMP.
  • When using 2 or 4 threads, the strong scaling behavior is similar or slightly better than running in pure MPI mode.

Weak Scaling

Weak scaling involves increasing the number of threads and/or MPI ranks while proportionally increasing the problem size to match the added computational resources. In theory, as the amount of work per thread/rank remains constant, the execution time of the application should remain constant as well.
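
In the tables below, the weak scaling efficiency is therefore computed by comparing the single-rank execution time with the time measured on $N$ ranks (for a problem $N$ times larger):

$$ E_{weak}(N) = \frac{T(1)}{T(N)} $$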

MPI Weak Scaling

The table below shows the MPI weak scaling results of the miniWeather application up to 128 ranks (2 compute nodes).

Num. ranks Exec. time (s) Weak scaling eff.
1 230.391 100.00%
2 227.082 101.46%
4 228.239 100.94%
8 227.195 101.41%
16 234.806 98.12%
32 236.069 97.59%
64 241.561 95.38%
128 248.337 92.77%

The results above indicate good weak scaling performance, with minor inefficiencies emerging as the number of ranks increases: a weak scaling efficiency above 90% at 128 ranks is quite acceptable.

Hybrid MPI+OpenMP Weak Scaling

From the strong scaling analysis, we concluded that OpenMP does not scale well above 8 threads. As a consequence, we do not perform a weak scaling analysis with OpenMP alone. However, we can analyze the weak scaling behavior when running in hybrid OpenMP+MPI mode. The results are presented in the table below.

Num. ranks Num. threads Exec. time (s) Weak scaling eff.
1 2 226.731 101.56%
1 4 247.551 93.02%
1 8 260.567 88.37%
2 2 227.599 101.18%
2 4 231.382 99.52%
2 8 261.184 88.17%
4 2 227.459 101.24%
4 4 234.047 98.39%
4 8 263.285 87.46%
8 2 235.167 97.92%
8 4 237.614 96.91%
8 8 267.544 86.07%
16 2 234.312 98.28%
16 4 242.185 95.08%
16 8 292.457 78.74%
32 2 238.763 96.45%
32 4 241.512 95.35%
64 2 241.352 95.41%

From the table above, we can see that for 2 and 4 threads per rank, the weak scaling performance is similar to (or slightly better than) the results obtained in pure MPI mode. These results are consistent with those obtained for the strong scaling.