gdtk-uq/gdtk

Parallelism

gdtk/eilmer - I have requested more than two nodes' worth of processors to run a case, but in each run gdtk only uses two nodes' worth of CPUs. For instance, I have requested 6 nodes with 40 cores each, 240 cores in total, but Eilmer only subscribes 80.

What is causing this? I have had the same issue on multiple clusters. I use something similar to the first PBS example on the gdtk website to set up the run.
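
For context, the resource request takes roughly the shape sketched below (job name, walltime, and the e4mpi invocation are placeholders here, not copied from my actual script):

```sh
#!/bin/bash
#PBS -N eilmer-case                     # hypothetical job name
#PBS -l select=6:ncpus=40:mpiprocs=40   # 6 nodes x 40 cores = 240 MPI ranks
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
# Let the scheduler supply the rank count and mapping.
mpirun e4mpi --run --job=myCase
```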

@uqngibbo Do you have tips on this?

Have you tried running with mpirun -np 240 (assuming you've configured the MPI mapping for 240 processes)? I know most clusters recommend against this and prefer to let the scheduler handle the assignment, but it might be worth trying. It is not an issue I've encountered running Eilmer so far.
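
Concretely, that means giving mpirun an explicit process count instead of relying on the scheduler-provided default, something like the sketch below (the job name is a placeholder):

```sh
# Ask MPI for exactly 240 ranks rather than the default mapping.
mpirun -np 240 e4mpi --run --job=myCase
```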

Hi JR,
We'll probably need a bit more information. I can help from the Eilmer side. The cluster you're using will have its own nuances and we might need to work that out via private email. I'll give you the principal ideas for Eilmer's parallelism.

  1. Eilmer does domain decomposition at the block level.
  2. You require number_blocks >= number_cores.

Simulations fall into two cases: one where number_blocks precisely matches number_cores, and one where number_blocks > number_cores.

In the first case (number_blocks = number_cores), the match is usually achieved by partitioning a large unstructured grid (*) with the Metis partitioning tool; this is the route favoured by people running many-core simulations. Alternatively, a hand-curated multi-block structured grid can fall into this category, and people have had success by paying careful attention to their use of FBArray{} in their input scripts. The intent is the same either way: aim for blocks with approximately equal cell counts so that the load on each core is approximately equal.
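
As a rough illustration of the FBArray route, a minimal input-script sketch might look like the following. The geometry, flow states, and boundary conditions here are placeholders, not code from any particular case:

```lua
-- Sketch only: a simple rectangular patch stands in for real geometry.
surf = CoonsPatch:new{p00=Vector3:new{x=0.0, y=0.0}, p10=Vector3:new{x=1.0, y=0.0},
                      p11=Vector3:new{x=1.0, y=0.2}, p01=Vector3:new{x=0.0, y=0.2}}
grid = StructuredGrid:new{psurface=surf, niv=241, njv=121}
-- 240x120 cells split 6x4 gives 24 blocks of 40x30 cells each,
-- so the per-core load is even for up to 24 cores.
blks = FBArray:new{grid=grid, initialState=inflow,
                   bcList={west=InFlowBC_Supersonic:new{flowState=inflow},
                           east=OutFlowBC_Simple:new{}},
                   nib=6, njb=4}
```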

The second case (number_blocks > number_cores) requires the Eilmer load-balancing tools. These cases arise from using FBArray, or sometimes from GridPro grids where the block count is very large. The balancing can be done in-script, or directly after "--prep" with a call to e4loadbalance. This mode works best when number_blocks is MUCH greater than number_cores. The load balancer maps several (or many) blocks to each core in a configuration that balances the computational load per core.
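
For the post-prep route, the sequence is roughly as follows (flag names from memory, so check e4loadbalance --help for the exact options; the job name is a placeholder):

```sh
# Prepare the simulation, then map its many blocks onto 240 MPI tasks.
e4shared --prep --job=myCase
e4loadbalance --job=myCase --ntasks=240

# Run with one MPI rank per task, using the load-balanced mapping.
mpirun -np 240 e4mpi --run --job=myCase
```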

So in your case, you need at least 240 blocks if you want to use 240 cores. If you've met this condition, then we'll need to look at your cluster configuration and whether you've exceeded some queuing limits.

(*) Unstructured in terms of data storage. It is common for people to use a third-party structured grid generator, but then write the grid out in an unstructured format (SU2) so that it can be partitioned with Metis.
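
The partitioning step itself uses the ugrid_partition helper that ships with gdtk. The argument order below (grid file, mapped-cells file, number of partitions, dimensions) is from memory, so treat it as a sketch:

```sh
# Split an SU2-format grid into 240 blocks for a 3D case;
# ugrid_partition calls Metis to do the decomposition.
ugrid_partition myGrid.su2 mapped_cells 240 3
```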

This is not something I've encountered before. Do you mind sharing both your Lua configuration file and your PBS submission script?

@rjgollan-on-github I will try this out. It is possible that I never modified this part of the test case when I increased the number of cores; if that is the reason, I will reply to confirm it worked. @uqngibbo As for the Lua and PBS scripts, I will have to share those when I return to my desktop on Monday. There is a chance it could be partly due to the fact that I worked with my cluster manager to create a Singularity container to install and run Eilmer. This was necessary as I do not have root access, so a container seemed the more appropriate choice at the time. First things first, however: I will respond with the files as soon as I return to my lab on Monday.

Thank you both for the rapid replies.

JR,
You should not need root access to run Eilmer on clusters. We have installed the code on many clusters over the years without root access. You do require an MPI installation on the cluster and some basic libraries, but these should be standard on anything that calls itself a cluster.

@rjgollan-on-github, perhaps we should resolve this through email? It seems that the issue is specific to the two separate clusters I run on. Both show the same error on install, but they are managed by two entirely separate entities. Hopefully the error is on my end. How should I reach you?

My email address is on my UQ researchers page: search my name + UQ.

Any luck with this @jr144393?

Issue closed as inactive.