Ongoing issues with pykilosort
rossant opened this issue · 0 comments
(previously at MouseLand/Kilosort#201)
This GitHub issue tracks ongoing issues with the Python port of Kilosort 2 that should be given high development priority:
- On large datasets, the split2 step fails at its very end, just after the main loop, because of a GPU out-of-memory error (reproduced on an 8 GB GPU with one of Nick Steinmetz's drift datasets). One fix would be to make sure that intermediate CuPy arrays that are no longer needed are freed at the end of each step. It might also be worth adding a checkpoint just after the long split iteration, and before the failing line, so that the split iteration does not have to be rerun on every attempt.
```
12:17:22.403 [I] postprocess:649 Finished splitting. Found 658 splits, checked 2077/2077 clusters, nccg 113
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
~/spikesorting/drift/run.py in <module>
22 probe.kcoords = matdata.kcoords.squeeze()
23 start = datetime.datetime.now()
---> 24 run(dat_path, probe=probe, dir_path=dir_path, n_channels=385, dtype=np.int16, sample_rate=3e4)
25 finish = datetime.datetime.now()
26 print(start, finish)
~/git/pykilosort/pykilosort/main.py in run(dat_path, probe, params, dir_path, stop_after, **kwargs)
222 # final splits by amplitudes
223 with ctx.time('split_2'):
--> 224 out = splitAllClusters(ctx, False)
225 out['st3_s0'] = out.pop('st3_s')
226 ctx.save(**out)
~/git/pykilosort/pykilosort/postprocess.py in splitAllClusters(ctx, flag)
665
666 # we re-compute similarity scores between templates
--> 667 WtW, iList = getMeWtW(W.astype(cp.float32), U.astype(cp.float32), Nnearest)
668 # ir.iList = iList # over-write the list of nearest templates
669
~/git/pykilosort/pykilosort/learn.py in getMeWtW(W, U0, Nnearest)
520 wtw0 = mexWtW2(Params, W[:, :, i], W[:, :, j], utu0)
521 # add it to the full correlation array
--> 522 WtW = WtW + wtw0
523
524 # the maximum across timelags accounts for sample alignment mismatch
cupy/core/core.pyx in cupy.core.core.ndarray.__add__()
cupy/core/_kernel.pyx in cupy.core._kernel.ufunc.__call__()
cupy/core/_kernel.pyx in cupy.core._kernel._get_out_args()
cupy/core/core.pyx in cupy.core.core.ndarray.__init__()
cupy/cuda/memory.pyx in cupy.cuda.memory.alloc()
cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()
cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()
cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()
cupy/cuda/memory.pyx in cupy.cuda.memory._try_malloc()
OutOfMemoryError: out of memory to allocate 2087942144 bytes (total 7810840064 bytes)
```
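To avoid rerunning the long split iteration on every attempt, one option is to persist the expensive intermediate arrays to disk just before the failing line. A minimal sketch, using hypothetical helper names and a plain directory rather than pykilosort's actual context object:

```python
import os

import numpy as np


def save_checkpoint(checkpoint_dir, **arrays):
    """Persist intermediate arrays so a failed step can resume from here
    instead of rerunning the expensive split loop (hypothetical helper)."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    for name, arr in arrays.items():
        # np.asarray also accepts CuPy arrays via their host-copy path,
        # but here we assume plain NumPy inputs for the sketch.
        np.save(os.path.join(checkpoint_dir, name + ".npy"), np.asarray(arr))


def load_checkpoint(checkpoint_dir, names):
    """Reload the arrays saved by save_checkpoint."""
    return {name: np.load(os.path.join(checkpoint_dir, name + ".npy"))
            for name in names}
```

On the GPU side, after `del`-ing CuPy arrays that are no longer referenced, calling `cp.get_default_memory_pool().free_all_blocks()` releases the pool's cached blocks back to the device; doing this at the end of each step could address the out-of-memory failure above.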
- pykilosort caches intermediate results after each step (preprocess, cluster, learn, merge, split...) and provides a system to automatically resume a run from the last successful step if a later step fails. However, in some instances a CUDA segmentation fault occurs when resuming. In the example below, the split2 step failed with a GPU out-of-memory error, and resuming it on the next run fails, perhaps because of a shape mismatch in one of the intermediate arrays. One should check, on a minimal failing example, that the inputs to the split2 step are identical between the first and the second (resumed) run.
```
12:35:26.643 [I] postprocess:649 Finished splitting. Found 658 splits, checked 2077/2077 clusters, nccg 113
---------------------------------------------------------------------------
CUDADriverError Traceback (most recent call last)
~/spikesorting/drift/run.py in <module>
22 probe.kcoords = matdata.kcoords.squeeze()
23 start = datetime.datetime.now()
---> 24 run(dat_path, probe=probe, dir_path=dir_path, n_channels=385, dtype=np.int16, sample_rate=3e4)
25 finish = datetime.datetime.now()
26 print(start, finish)
~/git/pykilosort/pykilosort/main.py in run(dat_path, probe, params, dir_path, stop_after, **kwargs)
222 # final splits by amplitudes
223 with ctx.time('split_2'):
--> 224 out = splitAllClusters(ctx, False)
225 out['st3_s0'] = out.pop('st3_s')
226 ctx.save(**out)
~/git/pykilosort/pykilosort/postprocess.py in splitAllClusters(ctx, flag)
662 Ka, Kb = getKernels(params)
663 # we run SVD
--> 664 W, U, mu = mexSVDsmall2(Params, dWU, W, iC, iW, Ka, Kb)
665
666 # we re-compute similarity scores between templates
~/git/pykilosort/pykilosort/learn.py in mexSVDsmall2(Params, dWU, W, iC, iW, Ka, Kb)
312 # compute dWU * dWU'
313 getwtw = cp.RawKernel(code, 'getwtw')
--> 314 getwtw((Nfilt,), tpS, (d_Params, d_dWUb, d_wtw))
315
316 # get W by power svd iterations
cupy/core/raw.pyx in cupy.core.raw.RawKernel.__call__()
cupy/util.pyx in cupy.util.memoize.decorator.ret()
cupy/core/raw.pyx in cupy.core.raw._get_raw_kernel()
cupy/core/carray.pxi in cupy.core.core.compile_with_cache()
cupy/core/carray.pxi in cupy.core.core.compile_with_cache()
~/anaconda3/lib/python3.7/site-packages/cupy/cuda/compiler.py in compile_with_cache(source, options, arch, cache_dir, extra_source)
160 cubin_hash = six.b(hashlib.md5(cubin).hexdigest())
161 if hash == cubin_hash:
--> 162 mod.load(cubin)
163 return mod
164
cupy/cuda/function.pyx in cupy.cuda.function.Module.load()
cupy/cuda/function.pyx in cupy.cuda.function.Module.load()
cupy/cuda/driver.pyx in cupy.cuda.driver.moduleLoadData()
cupy/cuda/driver.pyx in cupy.cuda.driver.check_status()
CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
```
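One way to check that the inputs to split2 are identical between the first and the resumed run is to fingerprint each input array and compare the digests across runs. A minimal sketch, assuming the arrays have been copied to the host as NumPy arrays:

```python
import hashlib

import numpy as np


def array_fingerprint(arr):
    """Digest of an array's dtype, shape, and raw bytes. Differing
    fingerprints between the first and the resumed run point to a
    corrupted or mismatched intermediate array."""
    arr = np.ascontiguousarray(arr)  # tobytes() needs a contiguous buffer
    h = hashlib.md5()
    h.update(str(arr.dtype).encode())
    h.update(str(arr.shape).encode())
    h.update(arr.tobytes())
    return h.hexdigest()
```

Logging `array_fingerprint()` for each split2 input (e.g. `dWU`, `W`, `iC`, `iW`) at the start of the step in both runs would immediately show which array, if any, changed shape or content after the resume.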
- Care was taken during the port to ensure a line-by-line match between MATLAB and Python, but some discrepancies may remain. In particular, the number of good units found by the Python version on a given dataset appears to be significantly lower than with MATLAB. More in-depth MATLAB/Python comparisons of the postprocessing steps should be done to verify that each processing step gives the same outputs on the same inputs.
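Such a step-by-step comparison could be driven by a small helper that takes the outputs of one postprocessing step from each implementation (e.g. the MATLAB side exported to `.mat` and loaded with `scipy.io.loadmat`) and reports which arrays differ. A minimal sketch, with hypothetical variable names:

```python
import numpy as np


def compare_step_outputs(mat_out, py_out, rtol=1e-5, atol=1e-8):
    """Compare dicts of arrays produced by the MATLAB and Python versions
    of one processing step; return a per-array report of mismatches."""
    report = {}
    for name in sorted(set(mat_out) | set(py_out)):
        if name not in mat_out or name not in py_out:
            report[name] = "missing on one side"
            continue
        a = np.asarray(mat_out[name])
        b = np.asarray(py_out[name])
        if a.shape != b.shape:
            report[name] = f"shape {a.shape} vs {b.shape}"
        elif not np.allclose(a, b, rtol=rtol, atol=atol):
            report[name] = f"max abs diff {np.max(np.abs(a - b)):.3g}"
        else:
            report[name] = "match"
    return report
```

Running this on the outputs of each postprocessing step in turn (merge, split, CCG checks) should isolate the first step where the two implementations diverge, which is where the missing good units are most likely being lost.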