Scalability Analysis
Once you've parallelized your code, it is important to quantify how efficient the parallelization is. For a parallel application, the speedup achieved by running on $N$ cores, $S(N)$, is calculated by measuring the time it takes to run on one core, $T(1)$, and dividing it by the time it takes to run the calculation on $N$ cores, $T(N)$:
$$ S(N) = \frac{T(1)}{T(N)} $$
For a perfectly parallel application, the speedup is equal to $N$, i.e., the speedup from parallelization is linear. In practice, this is rarely the case, which is why we introduce a second metric, the parallel efficiency $P_{eff}$, which measures how efficient the parallelization is compared to this ideal case:
$$ P_{eff}(N) = \frac{1}{N} \cdot \frac{T(1)}{T(N)} $$
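These two formulas are easy to evaluate directly in the shell. As an illustration, the sketch below uses two timings (in seconds) taken from the OpenMP strong scaling table later in this chapter:

```shell
# Speedup S(N) = T(1)/T(N) and parallel efficiency P_eff(N) = S(N)/N.
# Timings (seconds) from the OpenMP strong scaling table of this chapter.
T1=448.278   # T(1): time on one core
TN=33.496    # T(64): time on 64 cores
N=64

S=$(awk "BEGIN { printf \"%.2f\", $T1 / $TN }")
EFF=$(awk "BEGIN { printf \"%.2f\", 100 * $T1 / ($N * $TN) }")
echo "Speedup: $S, parallel efficiency: $EFF%"   # prints: Speedup: 13.38, parallel efficiency: 20.91%
```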
As a general rule, you should not run a parallel application at a scale for which the parallel efficiency is lower than 70%. For example, if you can run your application on 64 ranks with a 76% efficiency and on 128 ranks with a 65% parallel efficiency, you should run it on 64 ranks instead of wasting resources with 128 ranks. Even for a small HPC cluster like NIC5 with a total cost of ownership (TCO) of ~1.4M€, the difference over the lifetime of the machine between wasting 24% and 35% represents a significant amount of money:
$$ (1.4\text{M€} \cdot 35\%) - (1.4\text{M€} \cdot 24\%) = 154\ 000\text{€} $$
At the scale of the largest supercomputer in Europe, LUMI at the time of writing, with a TCO of 200M€, the same difference represents
$$ (200\text{M€} \cdot 35\%) - (200\text{M€} \cdot 24\%) = 22\text{M€} $$
Wasted money is not the only factor that should be taken into account. The energy cost of your job can also be significant: an inefficient job may consume as much energy as an efficient one while doing less useful work. This means that you may be wasting money (the cost of the energy) as well as producing extra CO2 emissions.
In this chapter, we analyze the scalability of the miniWeather mini-app, which mimics the basic dynamics seen in atmospheric weather and climate.
For the exam
This chapter presents typical results you may obtain while performing the scaling analysis of your own code, but it does not analyze the underlying causes of the observed behavior. However, for the exam, you should not limit yourself to data collection: you need to explain the behavior of your code based on what was taught in the courses.
Archive to reproduce the results of this chapter
Excel file with the results
General Recommendations
Scalability analysis should be performed in a way that avoids interference from external factors such as other jobs running on the same node. For this reason, it is recommended to run the analysis using the --exclusive option of sbatch, which allocates full node(s) to the job. For example, on NIC5, you can use the following directives in your job:
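A sketch of such a job-script header, based on the options mentioned in this section (the job name and time limit are placeholders to adapt to your own runs):

```shell
#!/bin/bash
#SBATCH --job-name=scaling    # placeholder job name
#SBATCH --partition=hmem      # easier to get a full-node allocation on NIC5
#SBATCH --exclusive           # do not share the node(s) with other jobs
#SBATCH --mem=0               # allocate the entire memory of the node
#SBATCH --time=01:00:00       # placeholder time limit
```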
Here, we use the hmem partition as it is easier to get a full-node allocation using that partition. We also allocate the entire memory of the node with the --mem=0 option.
Strong Scaling
To conduct the strong scaling analysis, we increase the number of OpenMP threads and/or MPI ranks while maintaining a constant problem size. Strong scaling measures the reduction in the time to solution that can be achieved by using more computational resources.
OpenMP Strong Scaling
For the OpenMP strong scaling analysis, we increase the number of OpenMP threads and vary the thread binding to investigate whether the proximity or separation of threads affects the scalability of the miniWeather application. Setting the OMP_PROC_BIND environment variable to spread ensures that threads are spaced widely across cores. For instance, on NIC5, which has 64 cores per node, configuring 4 threads with spread binding results in assignments to cores 0, 16, 32, and 48.
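In the job script, the binding is set before launching the application; a minimal sketch for the 4-thread spread configuration described above:

```shell
# Spread the threads widely across the cores of the node.
# Use "close" instead to pack the threads on neighboring cores.
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=4
```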
The table below presents the results of the OpenMP strong scaling with this value of OMP_PROC_BIND.
Num. threads | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 448.278 | 448.278 | 1.00 | 1.00 | 100.00% |
2 | 305.802 | 224.139 | 1.47 | 2.00 | 73.30% |
4 | 184.969 | 112.070 | 2.42 | 4.00 | 60.59% |
8 | 98.357 | 56.035 | 4.56 | 8.00 | 56.97% |
16 | 73.980 | 28.017 | 6.06 | 16.00 | 37.87% |
32 | 45.404 | 14.009 | 9.87 | 32.00 | 30.85% |
64 | 33.496 | 7.004 | 13.38 | 64.00 | 20.91% |
Based on the table above, we can conclude that the miniWeather application does not scale well in this configuration: the parallel efficiency falls below the 70% threshold as soon as we use 4 threads or more.
An alternative is to perform the same measurement with the threads bound close to each other by setting OMP_PROC_BIND to close, which leads to the results presented in the table below.

Num. threads | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 448.660 | 448.660 | 1.00 | 1.00 | 100.00% |
2 | 226.300 | 224.330 | 1.98 | 2.00 | 99.13% |
4 | 114.200 | 112.165 | 3.93 | 4.00 | 98.22% |
8 | 63.225 | 56.083 | 7.10 | 8.00 | 88.70% |
16 | 36.429 | 28.041 | 12.32 | 16.00 | 76.98% |
32 | 26.844 | 14.021 | 16.71 | 32.00 | 52.23% |
64 | 35.563 | 7.010 | 12.62 | 64.00 | 19.71% |
This configuration leads to better results than the previous one: the parallel efficiency stays above 70% up to 16 threads. Moreover, the parallelization with 2 and 4 threads is very efficient, with a parallel efficiency above 98%.
Conclusions
- The most efficient thread binding is close.
- Using 4 or 8 threads is the most efficient option (high parallel efficiency).
- We should not use more than 16 threads, as using more threads leads to a parallel efficiency below 70%.
MPI Strong Scaling
We can now perform the same analysis with MPI, by increasing the number of ranks while keeping the problem size constant. The results obtained on a single node are presented in the table below.
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 449.174 | 449.174 | 1.00 | 1.00 | 100.00% |
2 | 226.807 | 224.587 | 1.98 | 2.00 | 99.02% |
4 | 114.392 | 112.294 | 3.93 | 4.00 | 98.17% |
8 | 58.277 | 56.147 | 7.71 | 8.00 | 96.34% |
16 | 29.782 | 28.073 | 15.08 | 16.00 | 94.26% |
32 | 15.423 | 14.037 | 29.12 | 32.00 | 91.01% |
64 | 8.890 | 7.018 | 50.53 | 64.00 | 78.95% |
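A rank sweep like the one behind this table can be scripted; a minimal sketch, assuming a full-node allocation and an MPI binary named ./miniweather_mpi (the name is hypothetical):

```shell
# Strong scaling measurements: same problem size, increasing rank count.
for n in 1 2 4 8 16 32 64; do
    srun --ntasks=$n ./miniweather_mpi   # hypothetical binary name
done
```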
We can see that the code scales better with MPI than with OpenMP. On 64 ranks the parallel efficiency is above 78% while with OpenMP it was ~20%.
The beauty of MPI is that we are not limited to a single compute node. The table below presents the results obtained using 2 compute nodes.
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 449.174 | 449.174 | 1.00 | 1.00 | 100.00% |
2 | 226.895 | 224.587 | 1.98 | 2.00 | 98.98% |
4 | 114.515 | 112.294 | 3.92 | 4.00 | 98.06% |
8 | 58.027 | 56.147 | 7.74 | 8.00 | 96.76% |
16 | 29.923 | 28.073 | 15.01 | 16.00 | 93.82% |
32 | 15.767 | 14.037 | 28.49 | 32.00 | 89.03% |
64 | 8.644 | 7.018 | 51.96 | 64.00 | 81.19% |
128 | 4.962 | 3.509 | 90.52 | 128.00 | 70.72% |
Overall, the application scales as well on two nodes as on a single node: for an equivalent number of ranks, the parallel efficiencies on a single node and on two nodes are very similar.
Note that the results for 64 and 128 ranks might not reflect the actual parallel efficiency of the application: as we increase the number of ranks, the amount of work per MPI rank might become so small that the communication overhead becomes significant compared to the time spent doing actual computation.
Conclusions
- The application's MPI scalability is better than its OpenMP scalability.
- The behavior on a single compute node and on 2 compute nodes is similar.
Hybrid MPI+OpenMP Strong Scaling
Now that we have scaling results for OpenMP and MPI, we can perform the same analysis mixing the two models. Since we learned previously that the parallel efficiency with more than 8 OpenMP threads is very low, we limit the analysis to 2, 4, and 8 threads per rank. The results are presented in the tables below.
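A hybrid run combines the two models in a single launch; a minimal sketch for 8 MPI ranks with 4 OpenMP threads each (the binary name is hypothetical):

```shell
# 8 MPI ranks x 4 OpenMP threads = 32 cores in total
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=close   # the binding that performed best in the OpenMP analysis
srun --ntasks=8 --cpus-per-task=4 ./miniweather_mpi   # hypothetical binary name
```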
2 OpenMP threads per MPI rank
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 226.362 | 224.587 | 1.98 | 2.00 | 99.22% |
2 | 114.655 | 112.294 | 3.92 | 4.00 | 97.94% |
4 | 58.109 | 56.147 | 7.73 | 8.00 | 96.62% |
8 | 29.632 | 28.073 | 15.16 | 16.00 | 94.74% |
16 | 15.206 | 14.037 | 29.54 | 32.00 | 92.31% |
32 | 8.138 | 7.018 | 55.19 | 64.00 | 86.24% |
4 OpenMP threads per MPI rank
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 114.054 | 112.294 | 3.94 | 4.00 | 98.46% |
2 | 58.240 | 56.147 | 7.71 | 8.00 | 96.41% |
4 | 29.626 | 28.073 | 15.16 | 16.00 | 94.76% |
8 | 15.027 | 14.037 | 29.89 | 32.00 | 93.41% |
16 | 8.009 | 7.018 | 56.08 | 64.00 | 87.63% |
8 OpenMP threads per MPI rank
Num. ranks | Exec. time (s) | Ideal time (s) | Speedup | Ideal Speedup | Parallel Eff. |
---|---|---|---|---|---|
1 | 62.600 | 56.147 | 7.18 | 8.00 | 89.69% |
2 | 32.996 | 28.073 | 13.61 | 16.00 | 85.08% |
4 | 17.968 | 14.037 | 25.00 | 32.00 | 78.12% |
8 | 10.973 | 7.018 | 40.93 | 64.00 | 63.96% |
From the tables above, we can see that mixing OpenMP and MPI leads to better scaling than using OpenMP alone. Using 2 and 4 threads per rank results in timings close to those obtained when running with MPI alone. While the difference is small, one can argue that the application scales slightly better with 4 threads per rank than with MPI alone.
Conclusions
- On a single node, running in hybrid OpenMP+MPI mode is better than running only with OpenMP.
- When using 2 or 4 threads, the strong scaling behavior is similar or slightly better than running in pure MPI mode.
Weak Scaling
Weak scaling involves increasing the number of threads and/or MPI ranks while proportionally increasing the problem size to match the added computational resources. In theory, as the amount of work per thread/rank remains constant, the execution time of the application should remain constant as well. The weak scaling efficiency is therefore defined as $T(1)/T(N)$.
MPI Weak Scaling
The table below shows the MPI weak scaling results of the miniWeather application up to 128 ranks (2 compute nodes).
Num. ranks | Exec. time (s) | Weak scaling eff. |
---|---|---|
1 | 230.391 | 100.00% |
2 | 227.082 | 101.46% |
4 | 228.239 | 100.94% |
8 | 227.195 | 101.41% |
16 | 234.806 | 98.12% |
32 | 236.069 | 97.59% |
64 | 241.561 | 95.38% |
128 | 248.337 | 92.77% |
The results above indicate good weak scaling performance, with minor inefficiencies emerging as the system size increases: a weak scaling efficiency above 90% at 128 ranks is quite acceptable.
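The weak scaling efficiency column is simply $T(1)/T(N)$; for example, for the 128-rank run:

```shell
# Weak scaling efficiency = T(1) / T(N), timings (seconds) from the table above.
T1=230.391     # 1 rank
T128=248.337   # 128 ranks
awk "BEGIN { printf \"%.2f%%\n\", 100 * $T1 / $T128 }"   # prints: 92.77%
```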
Hybrid MPI+OpenMP Weak Scaling
From the strong scaling analysis, we concluded that OpenMP does not scale well above 8 threads. As a consequence, we will not perform a weak scaling analysis with OpenMP alone. However, we can analyze the weak scaling behavior when running in hybrid OpenMP+MPI mode. The results are presented in the table below.
Num. ranks | Num. threads | Exec. time (s) | Weak scaling eff. |
---|---|---|---|
1 | 2 | 226.731 | 101.56% |
 | 4 | 247.551 | 93.02% |
 | 8 | 260.567 | 88.37% |
2 | 2 | 227.599 | 101.18% |
 | 4 | 231.382 | 99.52% |
 | 8 | 261.184 | 88.17% |
4 | 2 | 227.459 | 101.24% |
 | 4 | 234.047 | 98.39% |
 | 8 | 263.285 | 87.46% |
8 | 2 | 235.167 | 97.92% |
 | 4 | 237.614 | 96.91% |
 | 8 | 267.544 | 86.07% |
16 | 2 | 234.312 | 98.28% |
 | 4 | 242.185 | 95.08% |
 | 8 | 292.457 | 78.74% |
32 | 2 | 238.763 | 96.45% |
 | 4 | 241.512 | 95.35% |
64 | 2 | 241.352 | 95.41% |
From the table above, we can see that for 2 and 4 threads, the weak scaling performance is similar to (or slightly better than) the results obtained in pure MPI mode. These results are consistent with those obtained for the strong scaling.