pypr/pysph

Confusing timing of SpatialHashNNPS (or LinkedListNNPS) when I run in serial and in parallel

fabiotrovato opened this issue · 8 comments

Hello,

I am using pysph for my project, which mostly reduces to using either LinkedListNNPS or SpatialHashNNPS to find the points of a spherical grid that are neighbours of a protein's atoms.

What I have observed so far can be summarized by the following test performed on an HPC cluster with SLURM scheduler.
If I run my script serially on 1 proc, the neighbour search takes 0.00285557 sec. This number is an average over 96 repetitions, i.e. the single process loops over the same search 96 times.
If I run my script in parallel on 96 procs (using mpi4py), the neighbour search takes 0.212207 sec. In this case every process executes the neighbour search only once, which gives 96 such numbers, averaging to ~0.2 sec.
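In code, the two setups look roughly like the following sketch (simplified: random coordinates stand in for the sphere grid and the protein atoms, and the array size and radius_scale are placeholders, not the values from my real script):

import time
import numpy as np
from mpi4py import MPI
from pysph.base.utils import get_particle_array
from pysph.base import nnps

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Setup (i): a single process repeats the same search 96 times.
# Setup (ii): each of the 96 ranks runs the search once; the 96 timings
# are averaged afterwards.
n_repeats = 96 if size == 1 else 1

# Random coordinates standing in for the sphere grid and the protein atoms.
x, y, z = (np.random.random(5000) for _ in range(3))

timings = []
for _ in range(n_repeats):
    t0 = time.time()
    pa_src = get_particle_array(name='source', x=x, y=y, z=z, h=1.0)
    pa_dst = get_particle_array(name='destin', x=x, y=y, z=z, h=1.0)
    nps = nnps.SpatialHashNNPS(dim=3, particles=[pa_src, pa_dst], radius_scale=3.0)
    timings.append(time.time() - t0)

print(rank, sum(timings) / len(timings))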

I tried to find a reason for this 100-fold difference, but I cannot really understand it. Why does a single call of SpatialHashNNPS require ~0.002 sec when I run serially, but ~0.2 sec when I run in parallel? Shouldn't the time for a single call be mostly independent of the number of processors I use?
The only thing that might explain this is that SpatialHashNNPS is parallelized and therefore dependent on the number of processors. Is it so?
What I would like is to have a performance of ~0.002, regardless of the number of processors I use.

I can provide a script to test/reproduce my results, if necessary.

Thank you,
Fabio

How specifically are you doing the timing? When running in parallel, when the compute accelerations method on the integrator is called, it calls parallel_manager.update(), and this will actually take a lot of time as there are more processors involved. However, I do not know how you are testing or timing this, so I cannot say for sure. Once the particles are exchanged, the nnps itself should take the same time in serial as it does in parallel, modulo the number of remote particles: in serial there are no remote particles to worry about, but in parallel there will be "more" particles due to the remote particles.

I am timing using the module time, as follows:

import time
from pysph.base.utils import get_particle_array
from pysph.base import nnps

ta = time.time()
pa_src = get_particle_array(name='source', x=src_rx, y=src_ry, z=src_rz, h=1.0)
pa_dst = get_particle_array(name='destin', x=dst_rx, y=dst_ry, z=dst_rz, h=1.0)
nps = nnps.SpatialHashNNPS(dim=3, particles=[pa_src, pa_dst], radius_scale=rcut)
tb = time.time()
print("Time for nnps.SpatialHashNNPS", tb - ta)

Not really sure about what happens during the exchanges you mention, but I would like to be able to run with the same performance as in serial. Simply imagine the example I gave at the beginning, where I am calling SpatialHashNNPS many times in a loop. I have parallelized this loop with mpi4py, but of course all my effort would be wasted by a 100-fold drop in performance due only to pysph.
Am I doing anything wrong, perhaps? Something trivial I should avoid?
If not, is there a way to tell the parallel_manager to run serially at each call?

Sorry, I still do not understand. There should really be no change when you run nnps on data that you manage. How exactly are you distributing the data? Are you using pysph the framework, i.e. you have an application subclass and are calling that with mpirun? Or is this your own Python code instantiating PySPH objects and calling them and distributing things with mpi4py by yourself? Could you email me a minimal script that reproduces this problem offline? The whole point of the pysph design is that only the parallel manager does any parallel communication. The rest are the same serial pieces with local and remote data. A simple reproducible example would really help.

I'm using pysph in my own code. And to simplify the discussion I was not distributing anything with mpi4py in this code. This means I simply run my neighbour search on either (i) 1 proc with a for loop of 96 iterations or (ii) 96 procs with a for loop of 1 iteration.

The attached archive contains the code test_pysph.py. You can run it using the provided slurm submission script: type "sbatch test_pysph.slurm". Note that in the .slurm file you need to choose either 1 or 96 procs as per points (i) and (ii). Correspondingly, in the .py code you need to uncomment/comment the right for loop at lines 185/186.

You will also need the module mdtraj, used to read in the two pdb files involved in the neighbour search (one is the protein 1ake and the other is sphere.pdb).

Hope this helps.
Fabio

test_pysph.tar.gz

PS I'll read the answer tomorrow, thank you.

Hello Prabhu,

did you have a chance to look at my script and see if there's anything weird that might explain the performance I see in serial and parallel?

Hi,
Sorry I will get to this tomorrow and let you know. I have been rather busy.

Hi,

I am posting here the solution I have found, because other people might benefit from it and because it was not easy to spot.
First of all, I experienced the mentioned problem on just one of two HPC clusters. The cause seemed to be some weird (and environment-specific) interaction between OpenMP and pysph, most likely with the packages imported by pysph. I am not sure which one, or whether it is just one package, but numpy is a candidate as per https://stackoverflow.com/questions/30791550/limit-number-of-threads-in-numpy. Adding the following lines to the very beginning of my code solved the huge performance slowdown:

import os
os.environ["OMP_NUM_THREADS"] = "1"
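For clarity, the variable needs to be set before numpy/pysph are imported so the threading runtimes pick it up. A slightly fuller sketch follows; the OPENBLAS_NUM_THREADS and MKL_NUM_THREADS settings are an extra guess for other BLAS backends, not something I verified on this cluster:

import os

# Set these before numpy/pysph are imported so the OpenMP/BLAS runtimes
# pick them up instead of their defaults.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # guess: only relevant if numpy links OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # guess: only relevant if numpy links MKL

import numpy as np
from pysph.base import nnps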

Cheers,
Fabio

I guess this can be closed now that we've debugged your problem and found a solution?