galaxyproject/galaxy-helm

No space left on device even though NFS and nodes have disk space

Closed this issue · 4 comments

pcm32 commented

After I resized the NFS disk (and a fresh call to df -h . from within the containers shows there is capacity), I keep getting these errors from the container resolver:

galaxy.tool_util.deps.containers INFO 2022-12-05 17:03:17,624 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] Checking with container resolver [ExplicitContainerResolver[]] found description [None]
galaxy.tool_util.deps.containers ERROR 2022-12-05 17:03:18,020 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] Could not get container description for tool 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_read_10x/scanpy_read_10x/1.8.1+2+galaxy0'
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/tool_util/deps/containers.py", line 320, in find_best_container_description
    resolved_container_description = self.resolve(enabled_container_types, tool_info, **kwds)
  File "/galaxy/server/lib/galaxy/tool_util/deps/containers.py", line 351, in resolve
    container_description = container_resolver.resolve(
  File "/galaxy/server/lib/galaxy/tool_util/deps/container_resolvers/mulled.py", line 557, in resolve
    name = targets_to_mulled_name(
  File "/galaxy/server/lib/galaxy/tool_util/deps/container_resolvers/mulled.py", line 361, in targets_to_mulled_name
    tags = mulled_tags_for(namespace, target.package_name, resolution_cache=resolution_cache, session=session)
  File "/galaxy/server/lib/galaxy/tool_util/deps/mulled/util.py", line 127, in mulled_tags_for
    if not _namespace_has_repo_name(namespace, image, resolution_cache):
  File "/galaxy/server/lib/galaxy/tool_util/deps/mulled/util.py", line 113, in _namespace_has_repo_name
    preferred_resolution_cache[cache_key] = repo_names
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/cache.py", line 374, in __setitem__
    self.put(key, value)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/cache.py", line 317, in put
    self._get_value(key, **kw).set_value(value)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 417, in set_value
    self.namespace.release_write_lock()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 231, in release_write_lock
    self.close(checkcount=True)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 254, in close
    self.do_close()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 685, in do_close
    util.safe_write(self.file, pickled)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/util.py", line 502, in safe_write
    fh.close()
OSError: [Errno 28] No space left on device

I did leave the Galaxy containers running while increasing the disk size, so this might be due to Galaxy needing a restart after the disk resize? It is the only piece of code complaining about disk space; everything else seems to work. Mostly leaving this here for reference.
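Note that the traceback ends in Beaker's file-backed cache, which writes pickled values to a local directory, not necessarily the NFS share. A quick way to check which filesystem is actually full from inside the pod (paths below are assumptions based on the traceback, not confirmed mount points):

```shell
# Run inside the failing job-handler pod, e.g.:
#   kubectl exec -it <galaxy-job-handler-pod> -- /bin/sh

# Free space on every filesystem the pod can see
df -h

# ENOSPC (Errno 28) can also mean inode exhaustion, which "df -h" hides
df -i

# Check the filesystem backing the Galaxy install/cache dir specifically
# (path is an assumption; adjust to your deployment)
df -h /galaxy/server/database 2>/dev/null || true
```

If `df -h` shows space everywhere but `df -i` shows 100% inode use on some mount, that would produce exactly this error despite apparent free capacity.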

pcm32 commented

...mmm... deleting the pod and getting a new one for the job handler didn't do the trick; it kept hitting this error. So something is trying to write somewhere with no space. It's not the nodes and it's not the shared file system....
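To track down which directory is actually filling up, a `du` sweep over the candidate writable paths can help (the paths here are guesses at typical Galaxy mount points, not confirmed):

```shell
# Largest directories up to two levels deep under each candidate path.
# -x stops du from crossing filesystem boundaries, so each sweep stays
# on one mount and points at the culprit filesystem directly.
for p in /galaxy/server /tmp; do
    echo "== $p =="
    du -xh -d 2 "$p" 2>/dev/null | sort -rh | head -10
done
```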

pcm32 commented

It seems to be failing when setting a key on this ResolutionCache (`from .container_resolvers import ResolutionCache`).
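For anyone hitting this later: the mulled resolution cache location is configurable in `galaxy.yml`, so it can be pointed at a volume with room. A sketch, assuming these option names (check `galaxy.yml.sample` for your Galaxy release before relying on them):

```yaml
galaxy:
  # Assumed option names; verify against your Galaxy version.
  mulled_resolution_cache_type: file
  mulled_resolution_cache_data_dir: /galaxy/server/database/mulled/data
  mulled_resolution_cache_lock_dir: /galaxy/server/database/mulled/locks
```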

pcm32 commented

Downscaling everything that had mounted the NFS and then re-upscaling seems to have fixed it.
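For reference, the downscale/upscale cycle can be sketched like this. The namespace, release name, and label selector are assumptions for a typical galaxy-helm install; replica counts will differ per deployment:

```shell
# Illustrative sketch: release the NFS mounts by scaling the release to
# zero, wait for the pods to go away, then scale back up.
NS=galaxy
SELECTOR=app.kubernetes.io/instance=galaxy

kubectl scale deployment -n "$NS" -l "$SELECTOR" --replicas=0
kubectl wait pod -n "$NS" -l "$SELECTOR" --for=delete --timeout=300s
kubectl scale deployment -n "$NS" -l "$SELECTOR" --replicas=1
```

Remounting forces the clients to pick up the resized filesystem geometry, which is consistent with the hot-resize explanation in the next comment.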

pcm32 commented

This was probably because the disk resize was partly done "hot", i.e. while the volume was still mounted by the clients.