flatironinstitute/mountainlab-js

ms4alg hangs after lots of thread-related log output

Closed this issue · 3 comments

My run of the sorting algorithm produces the following output and seems to hang during the re-assigning phase (output file attached). It stays at this spot for over 15 minutes, even when running with 60 CPUs and 100 GB of RAM (processing a 40 GB file). Is it normal for the re-assigning phase to take that long?

Could it be related to all the OpenBLAS outputs I get? How should I interpret the OpenBLAS outputs:

OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 6190469 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable

_sort_hang.log

Can you post the output of
echo $OPENBLAS_NUM_THREADS ?

If this is >2, I suggest adding the following to your .bashrc (or at least running it before you sort).

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

This will prevent the linear algebra libraries from trying to do multi-threading on their own and instead leave that up to the language you are calling them from.
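If editing .bashrc is not convenient, the same limits can be set from inside a Python launcher script. This is only a minimal sketch: `limit_blas_threads` is a hypothetical helper, not part of mountainlab or ml_ms4alg, and it assumes it runs before numpy (or anything else that loads OpenBLAS/MKL) is imported, since the libraries read these variables when they are first loaded.

```python
import os


def limit_blas_threads(n_threads=1):
    """Set the BLAS thread-count variables for this process.

    Child processes started afterwards (e.g. by multiprocessing)
    inherit the environment, so they pick up the same limit.
    """
    for var in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n_threads)


# Call this at the very top of the script, before importing numpy.
limit_blas_threads(1)
```

With one BLAS thread per worker, the total thread count stays close to the number of worker processes instead of multiplying by the core count, which is what runs into the RLIMIT_NPROC limit shown in the log.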

Sweet! That worked, though it revealed a new error. I am attaching the output below.

Clustering for channel 80 (phase1)...
Found 0 clusters for channel 80 (phase1)...
Computing templates for channel 80 (phase1)...

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 513, in run_phase1_sort
    neighborhood_sorter.runPhase1Sort()
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 345, in runPhase1Sort
    self.runSort(mode='phase1')
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 395, in runSort
    templates=compute_templates_from_timeseries_model(X,times,labels,nbhd_channels=nbhd_channels,clip_size=clip_size,chunk_infos=chunk_infos)
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 269, in compute_templates_from_timeseries_model
    K=np.max(labels)
  File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2320, in amax
    out=out, **kwargs)
  File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
"""

The above exception was the direct cause of the following exception:


Traceback (most recent call last):
  File "/data/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg_spec.py", line 11, in <module>
    if not PM.run(sys.argv):
  File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/mltools/processormanager/processormanager_impl.py", line 37, in run
    return P(**args)
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/p_ms4alg.py", line 72, in sort
    MS4.sort()
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 591, in sort
    pool.map(run_phase1_sort, neighborhood_sorters)
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: zero-size array to reduction operation maximum which has no identity

[ Removing temporary directory ... ]
Process returned with non-zero exit code.

This error only appears when I sort an hour-long session (96 channels × 107919471 time points). Sorting only a small subset of that session (96 channels × 18000000 time points) runs successfully with your fix above.

_sort_array_err.log

OK. This one looks like an actual bug: an error is raised when no clusters are found on a particular channel. Could you open a new issue with the above content? Thanks!
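For reference, the failure in the traceback comes from calling `np.max` on an empty `labels` array. Below is a rough sketch of the kind of guard that would avoid it. `count_clusters` is a hypothetical helper, not a function in ms4alg.py, and whether the rest of the template computation copes with zero clusters is an assumption that would need checking in the actual code.

```python
import numpy as np


def count_clusters(labels):
    """Return the largest cluster label, or 0 when no events were labelled."""
    labels = np.asarray(labels)
    if labels.size == 0:
        # np.max raises ValueError on a zero-size array, so report
        # zero clusters explicitly instead of letting the worker crash.
        return 0
    return int(np.max(labels))
```

In this sketch a channel with no clusters yields `K = 0`, so the caller can skip template computation for that channel rather than fail the whole `pool.map` call.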