Scalability Analysis
Once you've parallelized your code, it is important to quantify how efficient the parallelization is. For a parallel application, the speedup achieved by running on $N$ cores, $S(N)$, is calculated by measuring the time it takes to run on one core, $T(1)$, and dividing it by the time it takes to run the calculation on $N$ cores, $T(N)$:
$$ S(N) = \frac{T(1)}{T(N)} $$
For a perfectly parallel application, the speedup is equal to $N$, i.e., the speedup from parallelization is linear. In practice, this is rarely the case, which is why we introduce a second metric, the parallel efficiency $P_{eff}$, which measures how efficient the parallelization is compared to this ideal case:
$$ P_{eff}(N) = \frac{1}{N} \cdot \frac{T(1)}{T(N)} $$
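These two formulas are easy to evaluate directly in the shell. As an illustration, the sketch below uses two timings (in seconds) taken from the OpenMP strong scaling table later in this chapter:

```shell
# Speedup S(N) = T(1)/T(N) and parallel efficiency P_eff(N) = S(N)/N.
# Timings (seconds) from the OpenMP strong scaling table of this chapter.
T1=448.278   # T(1): time on one core
TN=33.496    # T(64): time on 64 cores
N=64

S=$(awk "BEGIN { printf \"%.2f\", $T1 / $TN }")
EFF=$(awk "BEGIN { printf \"%.2f\", 100 * $T1 / ($N * $TN) }")
echo "Speedup: $S, parallel efficiency: $EFF%"   # prints: Speedup: 13.38, parallel efficiency: 20.91%
```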
As a general rule, you should not run a parallel application at a scale for which the parallel efficiency is lower than 70%. For example, if you can run your application on 64 ranks with a 76% efficiency and on 128 ranks with a 65% parallel efficiency, you should run it on 64 ranks instead of wasting resources with 128 ranks. Even for a small HPC cluster like NIC5 with a total cost of ownership (TCO) of ~1.4M€, the difference over the lifetime of the machine between wasting 24% and 35% represents a significant amount of money:
$$ (1.4\text{M€} \cdot 35\%) - (1.4\text{M€} \cdot 24\%) = 154\ 000\text{€} $$
At the scale of the largest supercomputer in Europe, LUMI at the time of writing, with a TCO of 200M€, the same difference represents
$$ (200\text{M€} \cdot 35\%) - (200\text{M€} \cdot 24\%) = 22\text{M€} $$
Wasted money is not the only factor that should be taken into account. The energy cost of your job can also be significant: an inefficient job may consume as much energy as an efficient one while doing less useful work. This means that you may be wasting money (the cost of the energy) as well as producing extra CO2 emissions.
In this chapter, we analyze the scalability of the miniWeather mini-app, which mimics the basic dynamics seen in atmospheric weather and climate.
For the exam
This chapter presents typical results you may obtain while performing the scaling analysis of your own code, but it does not analyze the underlying causes of the observed behavior. However, for the exam, you should not limit yourself to data collection: you need to explain the behavior of your code based on what was taught in the courses.
Archive to reproduce the results of this chapter
Excel file with the results
General Recommendations
Scalability analysis should be performed in a way that avoids interference from external factors such as other jobs running on the same node. For this reason, it is recommended to run the analysis using the --exclusive option of sbatch, which allocates full node(s) to the job. For example, on NIC5, you can use the following directives in your job:
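A sketch of such a job-script header, based on the options mentioned in this section (the job name and time limit are placeholders to adapt to your own runs):

```shell
#!/bin/bash
#SBATCH --job-name=scaling    # placeholder job name
#SBATCH --partition=hmem      # easier to get a full-node allocation on NIC5
#SBATCH --exclusive           # do not share the node(s) with other jobs
#SBATCH --mem=0               # allocate the entire memory of the node
#SBATCH --time=01:00:00       # placeholder time limit
```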
Here, we use the hmem partition as it is easier to get a full-node allocation using that partition. We also allocate the entire memory of the node with the --mem=0 option.
Strong Scaling
To conduct the strong scaling analysis, we increase the number of OpenMP threads and/or MPI ranks while maintaining a constant problem size. Strong scaling measures the reduction in the time to solution that can be achieved by using more computational resources.
OpenMP Strong Scaling
For the OpenMP strong scaling analysis, we increase the number of OpenMP threads and vary the thread binding to investigate whether the proximity or separation of threads affects the scalability of the miniWeather application. Setting the OMP_PROC_BIND environment variable to spread ensures that threads are spaced widely across cores. For instance, on NIC5, which has 64 cores per node, configuring 4 threads with spread binding results in assignments to cores 0, 16, 32, and 48.
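In the job script, the binding is set before launching the application; a minimal sketch for the 4-thread spread configuration described above:

```shell
# Spread the threads widely across the cores of the node.
# Use "close" instead to pack the threads on neighboring cores.
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=4
```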
The table below presents the results of the OpenMP strong scaling with this value of OMP_PROC_BIND.
Num. threads | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 448.278 | 448.278 | 1.00 | 1.00 | 100.00% |
2 | 305.802 | 224.139 | 1.47 | 2.00 | 73.30% |
4 | 184.969 | 112.070 | 2.42 | 4.00 | 60.59% |
8 | 98.357 | 56.035 | 4.56 | 8.00 | 56.97% |
16 | 73.980 | 28.017 | 6.06 | 16.00 | 37.87% |
32 | 45.404 | 14.009 | 9.87 | 32.00 | 30.85% |
64 | 33.496 | 7.004 | 13.38 | 64.00 | 20.91% |
Based on the table above, we can conclude that the miniWeather application does not scale well in this configuration: the parallel efficiency falls below the 70% threshold as soon as we use 4 threads or more.
An alternative is to perform the same measurement with the threads bound close to each other by setting OMP_PROC_BIND to close, which leads to the results presented in the table below.

Num. threads | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 448.660 | 448.660 | 1.00 | 1.00 | 100.00% |
2 | 226.300 | 224.330 | 1.98 | 2.00 | 99.13% |
4 | 114.200 | 112.165 | 3.93 | 4.00 | 98.22% |
8 | 63.225 | 56.083 | 7.10 | 8.00 | 88.70% |
16 | 36.429 | 28.041 | 12.32 | 16.00 | 76.98% |
32 | 26.844 | 14.021 | 16.71 | 32.00 | 52.23% |
64 | 35.563 | 7.010 | 12.62 | 64.00 | 19.71% |
This configuration leads to better results than the previous one: the parallel efficiency stays above 70% up to 16 threads. Moreover, the parallelization with 2 and 4 threads is very efficient, with a parallel efficiency above 98%.
Conclusions
- The most efficient thread binding is close.
- Using 4 or 8 threads is the most efficient option (high parallel efficiency).
- We should not use more than 16 threads, as using more threads leads to a parallel efficiency below 70%.
MPI Strong Scaling
We can now perform the same analysis with MPI, by increasing the number of ranks while keeping the problem size constant. The results obtained on a single node are presented in the table below.
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 449.174 | 449.174 | 1.00 | 1.00 | 100.00% |
2 | 226.807 | 224.587 | 1.98 | 2.00 | 99.02% |
4 | 114.392 | 112.294 | 3.93 | 4.00 | 98.17% |
8 | 58.277 | 56.147 | 7.71 | 8.00 | 96.34% |
16 | 29.782 | 28.073 | 15.08 | 16.00 | 94.26% |
32 | 15.423 | 14.037 | 29.12 | 32.00 | 91.01% |
64 | 8.890 | 7.018 | 50.53 | 64.00 | 78.95% |
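A rank sweep like the one behind this table can be scripted; a minimal sketch, assuming a full-node allocation and an MPI binary named ./miniweather_mpi (the name is hypothetical):

```shell
# Strong scaling measurements: same problem size, increasing rank count.
for n in 1 2 4 8 16 32 64; do
    srun --ntasks=$n ./miniweather_mpi   # hypothetical binary name
done
```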
We can see that the code scales better with MPI than with OpenMP. On 64 ranks the parallel efficiency is above 78% while with OpenMP it was ~20%.
The beauty of MPI is that we are not limited to a single compute node. The table below presents the results obtained using 2 compute nodes.
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 449.174 | 449.174 | 1.00 | 1.00 | 100.00% |
2 | 226.895 | 224.587 | 1.98 | 2.00 | 98.98% |
4 | 114.515 | 112.294 | 3.92 | 4.00 | 98.06% |
8 | 58.027 | 56.147 | 7.74 | 8.00 | 96.76% |
16 | 29.923 | 28.073 | 15.01 | 16.00 | 93.82% |
32 | 15.767 | 14.037 | 28.49 | 32.00 | 89.03% |
64 | 8.644 | 7.018 | 51.96 | 64.00 | 81.19% |
128 | 4.962 | 3.509 | 90.52 | 128.00 | 70.72% |
Overall, the application scales as well on two nodes as on a single node: for an equivalent number of ranks, the parallel efficiencies on a single node and on two nodes are very similar.
Note that the results for 64 and 128 ranks might not reflect the actual parallel efficiency of the application: as we increase the number of ranks, the amount of work per MPI rank might become so small that the communication overhead becomes significant compared to the time spent doing actual computation.
Conclusions
- The application's MPI scalability is better than its OpenMP scalability.
- The behavior on a single compute node and on 2 compute nodes is similar.
Hybrid MPI+OpenMP Strong Scaling
Now that we have scaling results for OpenMP and MPI, we can perform the same analysis mixing the two models. Since we learned previously that the parallel efficiency with more than 8 OpenMP threads is very low, we limit the analysis to 2, 4, and 8 threads per rank. The results are presented in the tables below.
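A hybrid run combines the two models in a single launch; a minimal sketch for 8 MPI ranks with 4 OpenMP threads each (the binary name is hypothetical):

```shell
# 8 MPI ranks x 4 OpenMP threads = 32 cores in total
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=close   # the binding that performed best in the OpenMP analysis
srun --ntasks=8 --cpus-per-task=4 ./miniweather_mpi   # hypothetical binary name
```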
2 OpenMP threads per MPI rank
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 226.362 | 224.587 | 1.98 | 2.00 | 99.22% |
2 | 114.655 | 112.294 | 3.92 | 4.00 | 97.94% |
4 | 58.109 | 56.147 | 7.73 | 8.00 | 96.62% |
8 | 29.632 | 28.073 | 15.16 | 16.00 | 94.74% |
16 | 15.206 | 14.037 | 29.54 | 32.00 | 92.31% |
32 | 8.138 | 7.018 | 55.19 | 64.00 | 86.24% |
4 OpenMP threads per MPI rank
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 114.054 | 112.294 | 3.94 | 4.00 | 98.46% |
2 | 58.240 | 56.147 | 7.71 | 8.00 | 96.41% |
4 | 29.626 | 28.073 | 15.16 | 16.00 | 94.76% |
8 | 15.027 | 14.037 | 29.89 | 32.00 | 93.41% |
16 | 8.009 | 7.018 | 56.08 | 64.00 | 87.63% |
8 OpenMP threads per MPI rank
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 62.600 | 56.147 | 7.18 | 8.00 | 89.69% |
2 | 32.996 | 28.073 | 13.61 | 16.00 | 85.08% |
4 | 17.968 | 14.037 | 25.00 | 32.00 | 78.12% |
8 | 10.973 | 7.018 | 40.93 | 64.00 | 63.96% |
From the tables above, we can see that mixing OpenMP and MPI leads to better scaling than using OpenMP alone. Using 2 and 4 threads per rank results in timings close to those obtained when running with MPI alone. While the difference is small, one can argue that the application scales slightly better with 4 threads per rank than with MPI alone.
Conclusions
- On a single node, running in hybrid OpenMP+MPI mode is better than running only with OpenMP.
- When using 2 or 4 threads, the strong scaling behavior is similar or slightly better than running in pure MPI mode.
Weak Scaling
Weak scaling involves increasing the number of threads and/or MPI ranks while proportionally increasing the problem size to match the added computational resources. In theory, as the amount of work per thread/rank remains constant, the execution time of the application should remain constant as well. The weak scaling efficiency is therefore defined as $T(1)/T(N)$.
MPI Weak Scaling
The table below shows the MPI weak scaling results of the miniWeather application up to 128 ranks (2 compute nodes).
Num. ranks | Exec. time (s) | Weak scaling eff. |
---|---|---|
1 | 230.391 | 100.00% |
2 | 227.082 | 101.46% |
4 | 228.239 | 100.94% |
8 | 227.195 | 101.41% |
16 | 234.806 | 98.12% |
32 | 236.069 | 97.59% |
64 | 241.561 | 95.38% |
128 | 248.337 | 92.77% |
The results above indicate good weak scaling performance, with minor inefficiencies emerging as the system size increases: a weak scaling efficiency above 90% at 128 ranks is quite acceptable.
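The weak scaling efficiency column is simply $T(1)/T(N)$; for example, for the 128-rank run:

```shell
# Weak scaling efficiency = T(1) / T(N), timings (seconds) from the table above.
T1=230.391     # 1 rank
T128=248.337   # 128 ranks
awk "BEGIN { printf \"%.2f%%\n\", 100 * $T1 / $T128 }"   # prints: 92.77%
```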
Hybrid MPI+OpenMP Weak Scaling
From the strong scaling analysis, we concluded that OpenMP does not scale well above 8 threads. As a consequence, we will not perform a weak scaling analysis with OpenMP alone. However, we can analyze the weak scaling behavior when running in hybrid OpenMP+MPI mode. The results are presented in the table below.
Num. ranks | Num. threads | Exec. time (s) | Weak scaling eff. |
---|---|---|---|
1 | 2 | 226.731 | 101.56% |
 | 4 | 247.551 | 93.02% |
 | 8 | 260.567 | 88.37% |
2 | 2 | 227.599 | 101.18% |
 | 4 | 231.382 | 99.52% |
 | 8 | 261.184 | 88.17% |
4 | 2 | 227.459 | 101.24% |
 | 4 | 234.047 | 98.39% |
 | 8 | 263.285 | 87.46% |
8 | 2 | 235.167 | 97.92% |
 | 4 | 237.614 | 96.91% |
 | 8 | 267.544 | 86.07% |
16 | 2 | 234.312 | 98.28% |
 | 4 | 242.185 | 95.08% |
 | 8 | 292.457 | 78.74% |
32 | 2 | 238.763 | 96.45% |
 | 4 | 241.512 | 95.35% |
64 | 2 | 241.352 | 95.41% |
From the table above, we can see that for 2 and 4 threads, the weak scaling performance is similar to (or slightly better than) the results obtained in pure MPI mode. These results are consistent with those obtained for the strong scaling.