SpikeInterface/spikeinterface

timeout error when spike sorting with container

Opened this issue · 24 comments

Hi,

I'm recording using neuropixel 2 probe and running kilosort for each shank on docker image. My code is as follows:
sorting = si.run_sorter_by_property('kilosort3', rec, grouping_property='group', working_folder= base_folder/'kilosort3_output', mode_if_folder_exists="overwrite", docker_image=True, verbose=True, **params_kilosort3)

However, I got a timeout error that says connection aborted. Attached is the detailed error code. Thanks in advance!!

Starting container

TimeoutError Traceback (most recent call last)
File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
704 conn,
705 method,
706 url,
707 timeout=timeout_obj,
708 body=body,
709 headers=headers,
710 chunked=chunked,
711 )
713 # If we're going to release the connection in finally:, then
714 # the response doesn't need to know about the connection. Otherwise
715 # it will also try to release it and we'll have a double-release
716 # mess.

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.
--> 449 six.raise_from(e, None)
450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File :3, in raise_from(value, from_value)

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
443 try:
--> 444 httplib_response = conn.getresponse()
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.

File ~.conda\envs\spikeinterface\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
1376 try:
-> 1377 response.begin()
1378 except ConnectionError:

File ~.conda\envs\spikeinterface\lib\http\client.py:320, in HTTPResponse.begin(self)
319 while True:
--> 320 version, status, reason = self._read_status()
321 if status != CONTINUE:

File ~.conda\envs\spikeinterface\lib\http\client.py:281, in HTTPResponse._read_status(self)
280 def _read_status(self):
--> 281 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
282 if len(line) > _MAXLINE:

File ~.conda\envs\spikeinterface\lib\site-packages\docker\transport\npipesocket.py:224, in NpipeFileIOBase.readinto(self, buf)
223 def readinto(self, buf):
--> 224 return self.sock.recv_into(buf)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\transport\npipesocket.py:25, in check_closed..wrapped(self, *args, **kwargs)
22 raise RuntimeError(
23 'Can not reuse socket after connection was closed.'
24 )
---> 25 return f(self, *args, **kwargs)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\transport\npipesocket.py:151, in NpipeSocket.recv_into(self, buf, nbytes)
150 win32file.CancelIo(self._handle)
--> 151 raise TimeoutError
152 return win32file.GetOverlappedResult(self._handle, overlapped, 0)

TimeoutError:

During handling of the above exception, another exception occurred:

ProtocolError Traceback (most recent call last)
File ~.conda\envs\spikeinterface\lib\site-packages\requests\adapters.py:487, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
486 try:
--> 487 resp = conn.urlopen(
488 method=request.method,
489 url=url,
490 body=request.body,
491 headers=request.headers,
492 redirect=False,
493 assert_same_host=False,
494 preload_content=False,
495 decode_content=False,
496 retries=self.max_retries,
497 timeout=timeout,
498 chunked=chunked,
499 )
501 except (ProtocolError, OSError) as err:

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
785 e = ProtocolError("Connection aborted.", e)
--> 787 retries = retries.increment(
788 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
789 )
790 retries.sleep()

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\util\retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
549 if read is False or not self._is_method_retryable(method):
--> 550 raise six.reraise(type(error), error, _stacktrace)
551 elif read is not None:

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\packages\six.py:769, in reraise(tp, value, tb)
768 if value.traceback is not tb:
--> 769 raise value.with_traceback(tb)
770 raise value

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
704 conn,
705 method,
706 url,
707 timeout=timeout_obj,
708 body=body,
709 headers=headers,
710 chunked=chunked,
711 )
713 # If we're going to release the connection in finally:, then
714 # the response doesn't need to know about the connection. Otherwise
715 # it will also try to release it and we'll have a double-release
716 # mess.

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.
--> 449 six.raise_from(e, None)
450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File :3, in raise_from(value, from_value)

File ~.conda\envs\spikeinterface\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
443 try:
--> 444 httplib_response = conn.getresponse()
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.

File ~.conda\envs\spikeinterface\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
1376 try:
-> 1377 response.begin()
1378 except ConnectionError:

File ~.conda\envs\spikeinterface\lib\http\client.py:320, in HTTPResponse.begin(self)
319 while True:
--> 320 version, status, reason = self._read_status()
321 if status != CONTINUE:

File ~.conda\envs\spikeinterface\lib\http\client.py:281, in HTTPResponse._read_status(self)
280 def _read_status(self):
--> 281 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
282 if len(line) > _MAXLINE:

File ~.conda\envs\spikeinterface\lib\site-packages\docker\transport\npipesocket.py:224, in NpipeFileIOBase.readinto(self, buf)
223 def readinto(self, buf):
--> 224 return self.sock.recv_into(buf)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\transport\npipesocket.py:25, in check_closed..wrapped(self, *args, **kwargs)
22 raise RuntimeError(
23 'Can not reuse socket after connection was closed.'
24 )
---> 25 return f(self, *args, **kwargs)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\transport\npipesocket.py:151, in NpipeSocket.recv_into(self, buf, nbytes)
150 win32file.CancelIo(self._handle)
--> 151 raise TimeoutError
152 return win32file.GetOverlappedResult(self._handle, overlapped, 0)

ProtocolError: ('Connection aborted.', TimeoutError())

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last)
Cell In[12], line 5
2 params_kilosort3 = {'do_correction': True}
4 # Sorting by groups
----> 5 sorting = si.run_sorter_by_property('kilosort3', rec,
6 grouping_property='group',
7 working_folder= base_folder/'kilosort3_output',
8 mode_if_folder_exists="overwrite",
9 docker_image=True, verbose=True, **params_kilosort3)

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\launcher.py:119, in run_sorter_by_property(sorter_name, recording, grouping_property, working_folder, mode_if_folder_exists, engine, engine_kwargs, verbose, docker_image, singularity_image, **sorter_params)
116 assert grouping_property in recording.get_property_keys(), f"The 'grouping_property' {grouping_property} is not "
117 f"a recording property!"
118 recording_dict = recording.split_by(grouping_property)
--> 119 sorting_output = run_sorters([sorter_name], recording_dict, working_folder,
120 mode_if_folder_exists=mode_if_folder_exists,
121 engine=engine,
122 engine_kwargs=engine_kwargs,
123 verbose=verbose,
124 with_output=True,
125 docker_images={sorter_name: docker_image},
126 singularity_images={sorter_name: singularity_image},
127 sorter_params={sorter_name: sorter_params})
129 grouping_property_values = None
130 sorting_list = []

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\launcher.py:269, in run_sorters(sorter_list, recording_dict_or_list, working_folder, sorter_params, mode_if_folder_exists, engine, engine_kwargs, verbose, with_output, docker_images, singularity_images)
266 if engine == 'loop':
267 # simple loop in main process
268 for task_args in task_args_list:
--> 269 _run_one(task_args)
271 elif engine == 'joblib':
272 from joblib import Parallel, delayed

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\launcher.py:38, in _run_one(arg_list)
35 # because we won't want the loop/worker to break
36 raise_error = False
---> 38 run_sorter(sorter_name, recording, output_folder=output_folder,
39 remove_existing_folder=remove_existing_folder,
40 delete_output_folder=delete_output_folder,
41 verbose=verbose, raise_error=raise_error,
42 docker_image=docker_image, singularity_image=singularity_image,
43 with_output=with_output, **sorter_params)

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\runsorter.py:137, in run_sorter(sorter_name, recording, output_folder, remove_existing_folder, delete_output_folder, verbose, raise_error, docker_image, singularity_image, with_output, **sorter_params)
135 else:
136 container_image = singularity_image
--> 137 return run_sorter_container(
138 container_image=container_image,
139 mode=mode,
140 **common_kwargs,
141 )
143 return run_sorter_local(**common_kwargs)

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\runsorter.py:478, in run_sorter_container(sorter_name, recording, mode, container_image, output_folder, remove_existing_folder, delete_output_folder, verbose, raise_error, with_output, extra_requirements, sorter_params)
476 if verbose:
477 print('Starting container')
--> 478 container_client.start()
480 if verbose and install_si_from_source:
481 print("
****")

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\runsorter.py:305, in ContainerClient.start(self)
303 def start(self):
304 if self.mode == 'docker':
--> 305 self.docker_container.start()
306 elif self.mode == 'singularity':
307 self.client_instance.start()

File ~.conda\envs\spikeinterface\lib\site-packages\docker\models\containers.py:406, in Container.start(self, **kwargs)
397 def start(self, **kwargs):
398 """
399 Start this container. Similar to the docker start command, but
400 doesn't support attach options.
(...)
404 If the server returns an error.
405 """
--> 406 return self.client.api.start(self.id, **kwargs)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\utils\decorators.py:19, in check_resource..decorator..wrapped(self, resource_id, *args, **kwargs)
15 if not resource_id:
16 raise errors.NullResource(
17 'Resource ID was not provided'
18 )
---> 19 return f(self, resource_id, *args, **kwargs)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\api\container.py:1126, in ContainerApiMixin.start(self, container, *args, **kwargs)
1120 raise errors.DeprecatedMethod(
1121 'Providing configuration in the start() method is no longer '
1122 'supported. Use the host_config param in create_container '
1123 'instead.'
1124 )
1125 url = self._url("/containers/{0}/start", container)
-> 1126 res = self._post(url)
1127 self._raise_for_status(res)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\utils\decorators.py:46, in update_headers..inner(self, *args, **kwargs)
44 else:
45 kwargs['headers'].update(self._general_configs['HttpHeaders'])
---> 46 return f(self, *args, **kwargs)

File ~.conda\envs\spikeinterface\lib\site-packages\docker\api\client.py:233, in APIClient._post(self, url, **kwargs)
231 @update_headers
232 def _post(self, url, **kwargs):
--> 233 return self.post(url, **self._set_request_timeout(kwargs))

File ~.conda\envs\spikeinterface\lib\site-packages\requests\sessions.py:635, in Session.post(self, url, data, json, **kwargs)
624 def post(self, url, data=None, json=None, **kwargs):
625 r"""Sends a POST request. Returns :class:Response object.
626
627 :param url: URL for the new :class:Request object.
(...)
632 :rtype: requests.Response
633 """
--> 635 return self.request("POST", url, data=data, json=json, **kwargs)

File ~.conda\envs\spikeinterface\lib\site-packages\requests\sessions.py:587, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
582 send_kwargs = {
583 "timeout": timeout,
584 "allow_redirects": allow_redirects,
585 }
586 send_kwargs.update(settings)
--> 587 resp = self.send(prep, **send_kwargs)
589 return resp

File ~.conda\envs\spikeinterface\lib\site-packages\requests\sessions.py:701, in Session.send(self, request, **kwargs)
698 start = preferred_clock()
700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
703 # Total elapsed time of the request (approximately)
704 elapsed = preferred_clock() - start

File ~.conda\envs\spikeinterface\lib\site-packages\requests\adapters.py:502, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
487 resp = conn.urlopen(
488 method=request.method,
489 url=url,
(...)
498 chunked=chunked,
499 )
501 except (ProtocolError, OSError) as err:
--> 502 raise ConnectionError(err, request=request)
504 except MaxRetryError as e:
505 if isinstance(e.reason, ConnectTimeoutError):
506 # TODO: Remove this in 3.0.0: see #2811

ConnectionError: ('Connection aborted.', TimeoutError())

@shunnnli does this error only appear if you sort by property?

No, I've tried run_sorter as well and it also returns the same result

Ok, can you make sure you followed the instructions here?
https://spikeinterface.readthedocs.io/en/latest/modules/sorters.html#running-sorters-in-docker-singularity-containers

Basically you need to have Docker desktop installed, an account on dockerhub, and being logged in (for Windows)

Yes, as you can see from my previous error (issue #1662), I can start enter the container normally.

Hi I restarted my computer and it runs now! But I got the following error:

Starting container
Installing spikeinterface==0.97.1 in spikeinterface/kilosort2_5-compiled-base
Installing extra requirements: ['neo']
Running kilosort2_5 sorter inside spikeinterface/kilosort2_5-compiled-base
Stopping container
Starting container
Installing spikeinterface==0.97.1 in spikeinterface/kilosort2_5-compiled-base
Installing extra requirements: ['neo']
Running kilosort2_5 sorter inside spikeinterface/kilosort2_5-compiled-base
Stopping container
Starting container
Installing spikeinterface==0.97.1 in spikeinterface/kilosort2_5-compiled-base
Installing extra requirements: ['neo']
Running kilosort2_5 sorter inside spikeinterface/kilosort2_5-compiled-base
Stopping container

SpikeSortingError Traceback (most recent call last)
Cell In[15], line 2
1 # Sorting by groups
----> 2 sorting = si.run_sorter_by_property('kilosort2_5', rec,
3 grouping_property='group',
4 working_folder= base_folder/'kilosort2.5_output',
5 mode_if_folder_exists="overwrite",
6 docker_image=True,verbose=True)

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\launcher.py:119, in run_sorter_by_property(sorter_name, recording, grouping_property, working_folder, mode_if_folder_exists, engine, engine_kwargs, verbose, docker_image, singularity_image, **sorter_params)
116 assert grouping_property in recording.get_property_keys(), f"The 'grouping_property' {grouping_property} is not "
117 f"a recording property!"
118 recording_dict = recording.split_by(grouping_property)
--> 119 sorting_output = run_sorters([sorter_name], recording_dict, working_folder,
120 mode_if_folder_exists=mode_if_folder_exists,
121 engine=engine,
122 engine_kwargs=engine_kwargs,
123 verbose=verbose,
124 with_output=True,
125 docker_images={sorter_name: docker_image},
126 singularity_images={sorter_name: singularity_image},
127 sorter_params={sorter_name: sorter_params})
129 grouping_property_values = None
130 sorting_list = []

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\launcher.py:348, in run_sorters(sorter_list, recording_dict_or_list, working_folder, sorter_params, mode_if_folder_exists, engine, engine_kwargs, verbose, with_output, docker_images, singularity_images)
343 print(f'Warning!! With engine="{engine}" you cannot have directly output results\n'
344 'Use : run_sorters(..., with_output=False)\n'
345 'And then: results = collect_sorting_outputs(output_folders)')
346 return
--> 348 results = collect_sorting_outputs(working_folder)
349 return results

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\launcher.py:413, in collect_sorting_outputs(working_folder)
408 """Collect results in a working_folder.
409
410 The output is a dict with double key access results[(rec_name, sorter_name)] of SortingExtractor.
411 """
412 results = {}
--> 413 for rec_name, sorter_name, sorting in iter_sorting_output(working_folder):
414 results[(rec_name, sorter_name)] = sorting
415 return results

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\launcher.py:403, in iter_sorting_output(working_folder)
401 for rec_name, sorter_name, output_folder in iter_working_folder(working_folder):
402 SorterClass = sorter_dict[sorter_name]
--> 403 sorting = SorterClass.get_result_from_folder(output_folder)
404 yield rec_name, sorter_name, sorting

File ~.conda\envs\spikeinterface\lib\site-packages\spikeinterface\sorters\basesorter.py:274, in BaseSorter.get_result_from_folder(cls, output_folder)
272 log_file = output_folder / 'spikeinterface_log.json'
273 if not log_file.is_file():
--> 274 raise SpikeSortingError('get result error: the folder do not contain spikeinterface_log.json')
276 with log_file.open('r', encoding='utf8') as f:
277 log = json.load(f)

SpikeSortingError: get result error: the folder do not contain spikeinterface_log.json

Hi @shunnnli

This is progress!! I'll look into it but I think I know where the source of the error is.

In the meanwhile (to test if the container mechanism works), can you try to run spike sorting on one shank only?

rec_by_group_dict = recording.split_by("group")
# this gives a dict with group names as keys and recording objectes as values

group_names = list(rec_by_group_dict.keys())

recording_to_sort = rec_by_group_dict[group_names[0]]

sorting = si.run_sorter('kilosort3', recording_to_sort, output_folder= base_folder/'kilosort3_output_shank0', docker_image=True, verbose=True, **params_kilosort3)

Let me know if this works!

Hi! This works!! However, the kilosort has been running for 3 hours with no sign of finishing. I'm willing to stop and rerun for all shanks if you have any suggestions!

Let's let it finish! Still investigating!

It's still running after 24 hours haha :(

image

Mmmm...it's not supposed to be taking that much! How long is the recording?

Can you also check if the GPU is beong used?

The recording is like 40 minutes, so it shouldn't take that long. How can I check whether GPU is being used in windows?

you should have an NVIDIA application or somtehing, but I'm not 100% sure

Yeah I have it on my computer

So I also re-create the timeout error: I stopped the kernal, shut it down, deactivate conda; then I reactivate conda and re-run the code, now running sorter on single shank would show me timeout error as above

Not really sure what's happening then..it might be due to a problem with the Docker installation itself. Are you able to try the analysis on a Linux machine?

I can try install it in Linux. Is there any sample output for kilosort after printing 'Running kilosort2_5 sorter inside spikeinterface/kilosort2_5-compiled-base'? I just want to know which step the code got stuck on

You should have a sorter_output folder in base_folder/'kilosort3_output_shank0' with the files that Kilosort is (supposedly) creating. What is the content of that?

Its a chanMap.mat file and a recording.dat file
image

Hi! I ran into the same thing and was able to fix it by updating the docker library to 6.1.2

See the relevant issue and PR here: docker/docker-py#3124

musall commented

Unfortunately, we are having the same issue running Docker under Windows with larger datasets. The sorter runs indefinitely (although uses large CPU and memory resources) and only creates chanMap.mat and recording.dat but doesn't start creating temp_wh.dat

We are running docker 6.1.3. so I don't think updating fully resolves this problem (currently rerunning it with downgraded 6.1.2 to make sure).

Does Docker create a log file that we can use to identify where the code gets stuck?

I'm having the same issue, also running on windows locally, for a ~90 gb concatenated neuropixels 1.0 recording. I have docker 6.1.3. Has anyone found a solution or work around?

not sure if it's related but I'm also running into this memory overflow error #2201 when concatenating, so I changed the dtype of the recordings from int16 to int64 and I no longer get the warning but the print statement is still telling me the recording is much smaller than the actual size:

CommonReferenceRecording: 383 channels - 30.0kHz - 1 segments - 58,042,836 samples
1,934.76s (32.25 minutes) - float64 dtype - 1.63 GiB

I'm experiencing a similar issue where the Docker has been running for more than 24 hours, but there's no progress. It only created two files, chanMap.mat, and recording.dat, and then appeared to be idle without any error messages. I've checked my GPU and CPU, but they're also idle. The recording size is around 136GB (npx 1.0), and I've been able to sort it using a locally installed KS3. However, Docker's KS3 seems to be running indefinitely, and I would love to set up my workflow on Docker. Do you have any suggestions? Thanks!