ms4alg hangs after lots of thread-related log output
My run of the sorting algorithm produces the following output and seems to hang during the re-assigning phase (attaching output file). It will stay at this spot for over 15 minutes, even while running with 60 CPUs and 100GB of RAM (processing a 40GB file). Is it normal for the re-assigning phase to take that long?
Could it be related to all the OpenBLAS outputs I get? How should I interpret the OpenBLAS outputs:
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 6190469 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
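For reference when reading these two lines: the first reports the RLIMIT_NPROC resource limit (4096 soft, 6190469 hard), and "Resource temporarily unavailable" is the message for EAGAIN, which pthread_create returns when a new thread cannot be created, for example because that limit has been reached. A minimal sketch for checking the same limit from Python on a Unix system follows; the helper name `nproc_limits` is hypothetical and not part of ms4alg or OpenBLAS.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values seen by this process.

    On Linux this limit counts threads as well as processes for the user,
    so many workers that each start their own BLAS threads can reach it.
    A value of resource.RLIM_INFINITY means "no limit".
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft, hard


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("RLIMIT_NPROC soft limit:", soft)
    print("RLIMIT_NPROC hard limit:", hard)
```

The soft value should correspond to the "current" number in the OpenBLAS message and the hard value to "max", assuming the script runs in the same shell environment as the sorter.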
Can you post the output of
echo $OPENBLAS_NUM_THREADS
?
If this is >2, I suggest adding the following to your .bashrc
(or at least running it before you sort).
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
This will prevent the linear algebra libraries from trying to do multi-threading on their own and instead leave that up to the language you are calling them from.
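If editing .bashrc is not convenient, the same two variables can in principle be set from Python. This is only a sketch under one assumption: the variables must be set before NumPy (or any other library that loads OpenBLAS or MKL) is imported, because the thread pool is usually sized when the library loads. The helper name `limit_blas_threads` is hypothetical.

```python
import os

# The two variables from the export lines above.
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(num_threads=1):
    """Set the BLAS thread-count variables for this process and its children."""
    for var in BLAS_THREAD_VARS:
        os.environ[var] = str(num_threads)


# Call this first, then import the numerical libraries.
limit_blas_threads(1)
# import numpy as np  # import only after the call above
```

Worker processes started afterwards with multiprocessing inherit the environment, so each of them should also run single-threaded BLAS.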
Sweet! That worked, though it revealed a new error. I am attaching the output below.
Clustering for channel 80 (phase1)...
Found 0 clusters for channel 80 (phase1)...
Computing templates for channel 80 (phase1)...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 513, in run_phase1_sort
neighborhood_sorter.runPhase1Sort()
File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 345, in runPhase1Sort
self.runSort(mode='phase1')
File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 395, in runSort
templates=compute_templates_from_timeseries_model(X,times,labels,nbhd_channels=nbhd_channels,clip_size=clip_size,chunk_infos=chunk_infos)
File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 269, in compute_templates_from_timeseries_model
K=np.max(labels)
File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2320, in amax
out=out, **kwargs)
File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py", line 26, in _amax
return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg_spec.py", line 11, in <module>
if not PM.run(sys.argv):
File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/mltools/processormanager/processormanager_impl.py", line 37, in run
return P(**args)
File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/p_ms4alg.py", line 72, in sort
MS4.sort()
File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 591, in sort
pool.map(run_phase1_sort, neighborhood_sorters)
File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
ValueError: zero-size array to reduction operation maximum which has no identity
[ Removing temporary directory ... ]
Process returned with non-zero exit code.
This error only appears when I sort an hour-long session (96 channels × 107919471 time points). Sorting only a small subset of that session (96 channels × 18000000 time points) runs successfully with your fix above.
OK. This one looks like an actual bug: an error is raised when no clusters are found on a particular channel. Could you open a new issue with the above content? Thanks!
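For anyone hitting the same traceback: the failing call is `K=np.max(labels)`, and NumPy raises this ValueError whenever `np.max` is given an empty array, which is what happens when a channel yields 0 clusters. The following is a sketch of one possible guard, not the fix that was actually applied to ms4alg; the helper name `num_clusters` is hypothetical.

```python
import numpy as np


def num_clusters(labels):
    """Return the largest cluster label, or 0 when there are no labels.

    np.max raises ValueError on a zero-size array, so the empty case
    is handled explicitly before reducing.
    """
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0
    return int(np.max(labels))


print(num_clusters([1, 3, 2]))  # largest label present
print(num_clusters([]))         # empty input handled without raising
```

Whether returning 0 is the right behaviour depends on how the surrounding template code handles a channel with no clusters, which this thread does not show.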