script hangs when using unbuffered output
valerio-marra opened this issue · 17 comments
General information
- Corrfunc version: 2.4.0
- platform: cluster with CentOS 7.4.1708
- installation method (pip/source/other?): first pip, then source
Issue description
I'm running Corrfunc on a simulation snapshot: I'm computing the angular correlation function in thin shells using DDtheta_mocks. I first installed Corrfunc via pip and then, in order to increase performance, from source:
$ git clone https://github.com/manodeep/Corrfunc/
$ make
$ make install
$ python -m pip install . --user
$ make tests
However, while the code would parallelize over multiple threads with the pip install, the source build now runs mostly on one thread. I'm submitting my SLURM job via:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=480000
#SBATCH --exclusive
[…]
export OMP_NUM_THREADS=48
srun -n 1 python -u $PY_CODE > $LOGS
Expected behavior
To run on 48 threads at ~100%.
Actual behavior
To mostly run on 1 thread. I checked with htop.
What have you tried so far?
To re-install it from source.
Minimal failing example
I'm attaching the log file corrfunc-logs.txt, which includes the output of make tests.
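For reference, a minimal sketch of the kind of call that hangs (the coordinates and theta bins below are placeholders, not the actual snapshot data; this assumes binfile accepts an array of bin edges, as recent Corrfunc versions do):
# Sketch only: random placeholder points on the sky, not the simulation snapshot.
import numpy as np
from Corrfunc.mocks import DDtheta_mocks

rng = np.random.default_rng(42)
N = 10**4                                   # reduced particle count from the report
ra = rng.uniform(0.0, 360.0, N)             # RA in degrees
dec = np.degrees(np.arcsin(rng.uniform(-1.0, 1.0, N)))  # DEC in degrees, uniform on the sphere
theta_bins = np.linspace(0.1, 10.0, 21)     # angular bin edges in degrees (placeholder)

# Auto pair counts in angular bins; this is the call that never returns on the cluster.
results = DDtheta_mocks(autocorr=1, nthreads=48, binfile=theta_bins,
                        RA1=ra, DEC1=dec, verbose=True)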
Hi, @valerio-marra, looking at your log, it seems to build okay (using OpenMP), so maybe it's an affinity issue. Does the behavior change when you specify #SBATCH -c 48? I would have thought that --exclusive would have taken care of that, but you never know... Maybe also try passing -c to srun: srun -c 48 -n1 python -u $PY_CODE [...].
Can you also double-check that the OMP_PROC_BIND and OMP_PLACES bash environment variables are unset? Setting OMP_DISPLAY_ENV=TRUE will print their values when the application starts and can help debug.
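As a quick sanity check, a sketch that just prints the variables mentioned above as seen by the Python process inside the job:
import os
for var in ("OMP_NUM_THREADS", "OMP_PROC_BIND", "OMP_PLACES"):
    print(var, "=", os.environ.get(var, "<unset>"))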
Does the parallelism work locally, and fail through Slurm? Or does it always run single-threaded?
If the parallelism worked with the pip installation but not from source, and you're invoking it with the exact same Slurm script, then it might be an OpenMP library issue, e.g. Python is linked against one OpenMP library and Corrfunc against another. Since you're using Anaconda, you can try to build Corrfunc with Anaconda's compilers instead:
$ conda install gcc_linux-64
$ cd Corrfunc/
$ make distclean
$ CC=x86_64-conda_cos6-linux-gnu-gcc make # or better yet edit CC in common.mk
$ pip install -e ./
The name x86_64-conda_cos6-linux-gnu-gcc might be different on your platform. I think installing the conda compiler package is actually supposed to alias gcc to the conda compiler; you can check.
Hi, thanks! I tried what you suggested and nothing worked. Actually, I had assumed the code was running on a single thread (and killed it because it was taking too long), but in fact when I call DDtheta_mocks it just keeps running without doing anything. I reduced the number of particles to 10**4 and it still produces nothing (the same run takes a few seconds on my laptop).
When I first installed Corrfunc via pip, it was working. I tried to re-install it via pip, but it does not work anymore (when I call DDtheta_mocks, it just keeps running without doing anything). I think the problem is that the old system gcc is being loaded instead of the conda one (which I installed as you suggested).
It might be running, but just very, very slowly because 48 threads are fighting for one core. If you run it with DDtheta_mocks(..., nthreads=1), does it complete? Adding verbose=True ought to give a progress bar.
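Something like the following (a sketch; ra, dec, and theta_bins stand for whatever arrays and bin edges your script already builds):
from Corrfunc.mocks import DDtheta_mocks
# Single thread plus the progress bar, to see whether the call finishes at all.
results = DDtheta_mocks(autocorr=1, nthreads=1, binfile=theta_bins,
                        RA1=ra, DEC1=dec, verbose=True)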
I also just realized I got the syntax wrong for the make command; it should be:
$ make CC=x86_64-conda_cos6-linux-gnu-gcc
You may have realized this already if you saw Corrfunc was still building with gcc instead of that long compiler name.
Thanks @lgarrison
For a one-line solution (maybe only in modern enough pip versions?), you can use the --install-option parameter: python -m pip install --install-option="CC=x86_64-conda_cos6-linux-gnu-gcc" -e . --verbose
Hi @lgarrison, regarding the compilation: indeed it was using the system gcc, but I edited CC in common.mk.
I'm attaching the compilation logs; they gave this warning:
../common.mk:371: DISABLING AVX-512 SUPPORT DUE TO GNU ASSEMBLER BUG. UPGRADE TO BINUTILS >=2.32 TO FIX THIS.
How can I update BINUTILS?
Regarding verbose=True, I've been using it and it works on my laptop, but when I run the script via Slurm it does not show anything. Again, it seems that DDtheta_mocks just keeps running without doing anything.
Regarding OMP_DISPLAY_ENV=TRUE, I'm attaching the logs.
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '48'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'FALSE'
OMP_PLACES = ''
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OMP_DEFAULT_DEVICE = '0'
OMP_MAX_TASK_PRIORITY = '0'
OMP_DISPLAY_AFFINITY = 'FALSE'
OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END
Regarding running with nthreads=1, same as before: DDtheta_mocks just keeps running without doing anything.
I think I'm running out of ideas, other than to try yet more compilers and/or Python stacks. If your cluster has other compilers (e.g. clang, icc, other versions of gcc) available via modules (module load clang ...), that would probably be the easiest thing to try. Same with different Python environments (e.g. try a clean conda environment, or a non-conda environment if you have module load python).
Another thought: if you have a cluster where the submission nodes might have a different architecture than the compute nodes, make sure you build on the compute nodes.
If you want to confirm that the issue is OpenMP-related, you can disable OpenMP support by commenting out OPT += -DUSE_OMP in common.mk.
I wouldn't worry about the binutils bug for now; it's secondary to getting the code running at all.
@manodeep do you have any ideas?
Thanks, @lgarrison, I'll try that (I already tried using a clean conda environment).
Could it be that the uninstalled pip version is still being called? Otherwise, why does make tests succeed?
One more thing: you said to "double-check that the OMP_PROC_BIND and OMP_PLACES bash environment variables are unset", but the log shows OMP_PROC_BIND = 'FALSE'. Is this a problem?
Oh, that's true, I didn't read the logs carefully enough! I just assumed the C tests were passing, but it looks like the Python tests are passing too. Maybe the issue is exactly what you suggested, and you're installing in one environment and running in another. Make sure to repeat pip uninstall Corrfunc until no installations remain. Don't run it from inside the Corrfunc source directory. Then reinstall in a fresh environment, and make sure that environment is loaded when you run your Python script. Use print(Corrfunc.__file__) to see which installation is being used. (This is all just general advice for managing Python packages; nothing here is specific to Corrfunc.)
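For example (a sketch; run it with the same Python interpreter the Slurm job uses):
import sys
import Corrfunc
# Shows which interpreter and which Corrfunc installation are actually in use.
print(sys.executable)
print(Corrfunc.__file__)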
OMP_PROC_BIND = 'FALSE' is fine; that's the same as unset.
@lgarrison, @manodeep, I found the problem: if I set verbose=False, then it works! I was always using verbose=True. Is this a bug or a compilation issue?
Wow, that's pretty unusual! Glad it's working. I see you were running Python in unbuffered mode with python -u $PY_CODE; does verbose=True work if you remove the -u?
I will note we've seen one other instance of verbose causing problems here: #224, but it still seems to be a rare problem.
@lgarrison, if I remove -u it does work, although it updates the log file with low cadence and, actually, it does not print the info that verbose=True usually prints; that is, it is as if I had set verbose=False. Does verbose=True work only in interactive mode?
Should I fix the binutils bug to increase performance? I'll be running my code on hundreds of snapshots.
Okay, I think I might understand the root cause here. It's probably this issue: minrk/wurlitzer#20
Specifically, we're filling up some buffer (or perhaps even blocking while trying to do an unbuffered write), but the code that drains the buffer (in Wurlitzer) is at the Python level. And that code can't run because we don't release the GIL when we call into Corrfunc.
I'll need to think about the right way to fix this. Releasing the GIL is probably something we ought to be doing anyway, although it will need to be tested. In addition, it's possible that we're not doing the output redirection in the simplest/most robust way.
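For illustration only (this is a simplified sketch of pipe-based stdout redirection, not Corrfunc's or Wurlitzer's actual code), the failure mode would look roughly like this:
import os, sys

# Point the C-level stdout (file descriptor 1) at the write end of a pipe.
read_fd, write_fd = os.pipe()
saved_stdout = os.dup(1)
os.dup2(write_fd, 1)

# Stand-in for the extension's printf-style progress output.
os.write(1, b"pair counting progress ...\n")
# If the extension writes more than the pipe buffer (~64 KiB on Linux) while the
# Python-level drain thread cannot run (e.g. because the GIL is held), the pipe
# fills up and the C-level write() blocks forever: the observed hang.

# Restore stdout and forward whatever was captured.
os.dup2(saved_stdout, 1)
os.close(write_fd)
captured = os.read(read_fd, 65536)
os.close(read_fd)
os.close(saved_stdout)
sys.stdout.write(captured.decode(errors="replace"))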
On binutils/AVX-512: if you want the extra performance (usually a factor of < 2x), your best bet is to find another compiler stack to use, like clang, icc, or a more modern gcc. If one is not readily available, you can try to install one from scratch, although at that point it might not be worth your time! If you're feeling brave, here are instructions that worked at least once: #196 (comment)
Oops, actually we are releasing the GIL. In which case I'm not exactly sure what's happening. Will investigate...
Hi @valerio-marra, can you please check if PR #270 fixes your issue? Just test your same code on the fix-std-redir branch.
Hi @lgarrison, it works! Now the verbose output is printed to the Slurm job's standard error.
Regarding binutils/AVX-512, is it necessary to create a new environment? Also, shouldn't there be a make before pip in #196 (comment)?
It probably has the best chance of success in a new environment (and it has the least chance of disrupting any of your other work that uses an existing environment).
Pip runs make behind the scenes, so an explicit make is not necessary.