MPI Issue on Perlmutter
I am seeing big holes in fields when painting catalogs on Perlmutter across multiple nodes, e.g. for the mean over the z axis:
instead of (computed on 1 node):
There is no issue with mpi4py sending messages between nodes, nor with nbodykit on a single node.
In fact there is no issue even when using one rank per node on two nodes.
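The cross-node mpi4py check I have in mind is of this kind (a minimal sketch of that sort of test, not the exact script I ran):
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# pass a small Python object around a ring and report which host each message came from
msg = (rank, MPI.Get_processor_name())
recv = comm.sendrecv(msg, dest=(rank + 1) % size, source=(rank - 1) % size)
print('rank %d on %s received from rank %d on %s' % (rank, msg[1], recv[0], recv[1]))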
I expect this is related to fastpm/fastpm#112.
(Tangentially, following the fix discussed in the linked issue, I can also run fastpm just fine on as many as 32 nodes.)
Following up on some of @rainwoodman and @biweidai's discussion, it looks like this effect depends on the number of MPI tasks and nodes used. I suspect that for a large MPI grid some reordering happens somewhere?
The working 2-node, 2-task behavior I mentioned above is similar to what Biwei saw. I also see different patterns when using 2 or 3 nodes vs. 4 nodes (all with 128 tasks per node):
That said, 4 nodes with 64 ranks per node gives a result identical to 4 nodes with 128 ranks per node.
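One way I could look for such reordering (a rough sketch, assuming it would show up in the rank-to-node mapping) is to print where each rank lands:
from mpi4py import MPI

comm = MPI.COMM_WORLD
# gather (rank, hostname) pairs on rank 0 and print them in rank order
pairs = comm.gather((comm.rank, MPI.Get_processor_name()), root=0)
if comm.rank == 0:
    for r, host in sorted(pairs):
        print('rank %3d -> %s' % (r, host))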
Is there a fix similar to the fastpm one that could be applied in nbodykit? Or maybe I am missing something?
Here is a reproducer (can also post the job script):
Python file:
from nbodykit.lab import *
from nbodykit import setup_logging

setup_logging()  # turn on logging to screen

redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')

# generate a log-normal catalog and paint it to a 256^3 mesh with TSC interpolation
cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=1.0, seed=42)
mesh = cat.to_mesh(window='tsc')
one_plus_delta = mesh.paint(mode='real')

# save the painted 1+delta field so it can be inspected offline
FieldMesh(one_plus_delta).save('linear-mesh-real.bigfile', mode='real', dataset='Field')
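To check the saved mesh for holes without plotting, one can reload it and look at the projection over the z axis (a minimal sketch, assuming the bigfile path from the reproducer above; holes show up as near-zero entries):
from nbodykit.lab import BigFileMesh

# reload the mesh written by the reproducer; this can be run on a single rank
mesh = BigFileMesh('linear-mesh-real.bigfile', dataset='Field')

# low-resolution projection with the z axis collapsed, like the images above
proj = mesh.preview(axes=[0, 1])
print('projection over z: min=%g, max=%g' % (proj.min(), proj.max()))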