Resampling a very big file with nbodykit
boryanah opened this issue · 6 comments
Hi Yu Feng,
I am trying to resample a mesh stored in a bigfile from 2304^3 (40 GB) to 1152^3 and save it again to a bigfile mesh using a parallel script (on NERSC).
So far I have tried doing the following, granting the task 512 GB spread across 16 nodes, but got an out of memory error. Here is the script:
mesh = BigFileMesh(dens_dir+"density_%d.bigfile"%N_dim, mode='real', dataset='Field')
print("loaded bigfile")
mesh.save(dens_dir+"density_%d.bigfile"%N_dim_new, mode='real', dataset='Field', Nmesh=N_dim_new)
(The failure is after loading the bigfile. Also I have very very slightly modified the save function in the source code to allow the user to specify the new downsampled mesh-size Nmesh_new by adding Nmesh=Nmesh to the following line in the source code self.compute(mode,Nmesh=Nmesh) in the save function and have tested through a simple example that it works)
The failure of this line makes me think that the save step is perhaps not parallelized or perhaps that when resampling (i.e. Nmesh is not equivalent to the Nmesh of the original mesh) the self._paint_XXX function isn't.
A new approach that I intend to pursue is loading the bigfile mesh and converting it to a complex field through the paint command (as I believe that uses pfft) and then zeroing all modes that are higher than pi Nmesh_new/Lbox before reverting back to real space through the paint command (and saving to a bigfile the downsampled field). Is that approach parallelized? And if not what functions can I apply to make it so?
My nbodykit version is 0.3.15
Hi,
The current resampling algorithm in pmesh is doing more or less what you wanted to do. It calls to https://github.com/rainwoodman/pmesh/blob/f8f1d78eb22027fbcf71345d382477cea25ab9e3/pmesh/pm.py#L479
What you do appears to be correct, but the memory is actually bit tight.
The resampling algorithm does use a lot of memory -- it wasn't very optimized. It was originally written to handle 128/256 meshes under extremely low memory load (in the over-decomposed regime) I think.
When I scanned the code in pmesh,, I see scratch spaces for r2c (1x), for the indices (2 to 4 long integers, 4x or 8x of floats), for global sorting (2x-ish), and c2r(1x), and output space (1x), and some masks that can probably be ignored.
So the 512 GB appears to be a bit tight. (as 13 * 40G = 520G).
As usually more nodes = faster processing time, if you can get 32 nodes, then give it a try. We can also try to curb the memory usage of resample, it is likely some of the objects can be eagerly freed with del, for example.
Another reason to run into OOM is that if you use too many ranks such that 1152x1152 is not divided by the process mesh; in that case some ranks will not receive any part of the mesh. This seems to be unlikely as you only have 16 nodes (and potentially 16*64 = 32x32 ranks; how many ranks did you run the script with?), and 1152 = 128 * 9.
Thank you -- that's good to know!
I am currently using I believe 64 ranks (2 tasks per node and 32 nodes):
#!/bin/bash -l
#SBATCH --tasks-per-node=2
#SBATCH --nodes=32
#SBATCH --mem=16GB
#SBATCH -q debug
#SBATCH -t 00:05:00
#SBATCH -C haswell
#SBATCH -J test-fields
#SBATCH --mail-user=boryana.hadzhiyska@cfa.harvard.edu
#SBATCH --mail-type=ALL
conda activate desc
which mpirun
echo $SLURM_NTASKS
mpirun -n $SLURM_NTASKS python obtain_fields.py
I am not super familiar with the desc environment.
Could you do a conda list? Is the the environment you built?
mpirun on NERSC may not actually go across multiple nodes.
Did you check if the result of these commands make sense to you?
mpirun -n 64 hostname
and
mpirun -n 64 python -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.rank)'
(A year ago they were giving unexpected results)
Running on NERSC computing nodes with MPI, I usually use the bcast based environment provided by m3035:
source /usr/common/software/m3035/conda-activate.sh 3.7
# optionally, add
# bcast-pip [path_to_sdist_zip_file_of_additional_package.zip ...]
srun -n 64 python obtain_fields.py
Hi Yu Feng,
I will attempt to follow those instructions in more detail if I run into more issues in the future, but for now it seems like mpirun is working properly across nodes and outputs different host names!
I was more worried about the question of whether the code is properly parallelized: e.g. when I do manipulations on a bigfile mesh such as mesh.paint('real'), mesh.apply(filter), ArrayMesh(numpy_array), FFTPower(mesh), and mesh.save()? I am slightly confused by the documentation about which functions are and are not parallelized, but I may also be misreading!
Cheers,
Boryana
Thanks a lot for your help! This is all very useful to know!