iterative/dvc

import: hangs when pulling many files from GCS remote and one fails

dluks opened this issue · 2 comments

dluks commented

Bug Report

import/pull: hangs when pulling many files from GCS and one (or a few) fails

Description

I have a directory in a data registry that is tracked with DVC which contains ~1.6k files. I pushed it to a Google Cloud Storage remote, and now I am trying to either dvc import or dvc pull the data from the GCS remote to a new machine. This works well for ~99% of the files, but sometimes a few seem to simply fail silently and never actually download.

When this happens, the entire process hangs once all successful downloads have completed, leaving only the hung/failed/paused (not sure) downloads remaining. They never seem to resume and I am forced to Ctrl+C out of the process.

This is less of an issue when using dvc pull as I can simply re-run the command and it will only download the missing files, but with dvc import I am forced to re-run the process.

Reproduce

  1. dvc init (data registry project)
  2. dvc add <large directory containing many moderately-sized files (each around 30MB)>
  3. dvc remote add <google cloud storage bucket>
  4. dvc push <google cloud storage>
  5. git commit -am "commit msg" && git push
  6. (new machine or repository) git init <new-project>
  7. dvc import https://github.com/<data-registry-repo> path/to/large-dir

Most files finish downloading, but some fail (or hang) silently and prevent the entire dvc import from completing:

image

I eventually have to Ctrl-C to exit the process or else it hangs for hours. Here is the full output before and after the Ctrl-C interrupt:

$ (traits) ➜  cit-sci-traits git:(wip) ✗ dvc import https://github.com/GeoSense-Freiburg/panops-data-registry data/gbif/all_tracheophyta_non-cult_2024-04-10 -o data/raw --remote gcs --force --verbose   
2024-05-07 22:49:38,950 DEBUG: v3.50.1 (pip), CPython 3.10.12 on Linux-6.5.0-1020-oem-x86_64-with-glibc2.35                                                                                             
2024-05-07 22:49:38,950 DEBUG: command: /home/daniel/.local/bin/dvc import https://github.com/GeoSense-Freiburg/panops-data-registry data/gbif/all_tracheophyta_non-cult_2024-04-10 -o data/raw --remote
 gcs --force --verbose                                                                                                                                                                                  
2024-05-07 22:49:39,321 DEBUG: Removing output 'data/raw/all_tracheophyta_non-cult_2024-04-10' of stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc'.                                           
2024-05-07 22:49:39,322 DEBUG: Removing '/home/daniel/projects/ufr/cit-sci-traits/data/raw/all_tracheophyta_non-cult_2024-04-10'                                                                        
Importing 'data/gbif/all_tracheophyta_non-cult_2024-04-10 (https://github.com/GeoSense-Freiburg/panops-data-registry)' -> 'data/raw/all_tracheophyta_non-cult_2024-04-10'                               
2024-05-07 22:49:39,406 DEBUG: Computed stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc' md5: '4f2d2baf24cfc67c5da509212106b2af'                                                              
2024-05-07 22:49:39,406 DEBUG: 'md5' of stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc' changed.                                                                                             
2024-05-07 22:49:39,407 DEBUG: Creating external repo https://github.com/GeoSense-Freiburg/panops-data-registry@None                                                                                    
2024-05-07 22:49:39,407 DEBUG: erepo: git clone 'https://github.com/GeoSense-Freiburg/panops-data-registry' to a temporary dir                                                                          
2024-05-07 22:49:40,906 DEBUG: Lockfile '../../../../../dvc.lock' needs to be updated.                                                                                                                  
2024-05-07 22:49:41,101 DEBUG: failed to load ('data', 'gbif', 'all_tracheophyta_non-cult_2024-04-10') from storage local (/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5) - [Errno 2] No
 such file or directory: '/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5/ad/a860692d425bd047bd09fd4531fdbf.dir'                                                                          
Traceback (most recent call last):                                                                                                                                                                      
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/index/index.py", line 611, in _load_from_storage                                                                       
    _load_from_object_storage(trie, entry, storage)                                                                                                                                                     
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/index/index.py", line 547, in _load_from_object_storage                                                                
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)                                                                                                                 
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load                                                                                   
    with obj.fs.open(obj.path, "r") as fobj:                                                                                                                                                            
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 324, in open                                                                                      
    return self.fs.open(path, mode=mode, **kwargs)                                                                                                                                                      
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 131, in open                                                                                     
    return open(path, mode=mode, encoding=encoding)  # noqa: SIM115                                                                                                                                     
FileNotFoundError: [Errno 2] No such file or directory: '/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5/ad/a860692d425bd047bd09fd4531fdbf.dir'                                           
                                                                                                                                                                                                        
 98% Downloading all_tracheophyta_non-cult_2024-04-10|████████████████████████████████████████████████████████████████████████████████████████████████████████▎  |1.54k/1.58k [54:11<00:53, 1.37s/files]
2024-05-07 23:47:02,580 ERROR: interrupted by the user                                                                                                                                                  
Traceback (most recent call last):                                                                                                                                                                      
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 264, in _get                                                                                   
    list(callback.wrap(it))                                                                                                                                                                             
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/fsspec/callbacks.py", line 157, in wrap                                                                                         
    for item in iterable:
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/executors.py", line 37, in imap_unordered                                                                           
    done, tasks = futures.wait(tasks, return_when=futures.FIRST_COMPLETED)phyta_non-cult_2024-04-10.parquet/000000                                                14.2M/72.8M [13:40<07:01,     146kB/s]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 307, in wait                         
    waiter.event.wait(timeout)                    
  File "/usr/lib/python3.10/threading.py", line 607, in wait                                        
    signaled = self._cond.wait(timeout)           
  File "/usr/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/commands/imp.py", line 15, in run
    self.repo.imp(
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/imp.py", line 44, in imp
    return self.imp_url(path, out=out, erepo=erepo, frozen=True, **kwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/scm_context.py", line 143, in run
    return method(repo, *args, **kw)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/imp_url.py", line 86, in imp_url
    stage.run(jobs=jobs, no_download=no_download)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/decorators.py", line 44, in rwlocked
    return call()
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs) 
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/__init__.py", line 603, in run
    self._sync_import(dry, force, kwargs.get("jobs", None), no_download)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/decorators.py", line 44, in rwlocked
    return call()                                                                                                                                                                               [4/1937]
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs) 
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/__init__.py", line 640, in _sync_import
    sync_import(self, dry, force, jobs, no_download)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/imports.py", line 61, in sync_import
    stage.deps[0].download(stage.outs[0], jobs=jobs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/dependency/base.py", line 48, in download
    fs_download(self.fs, self.fs_path, to.fs_path, jobs=jobs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/fs/__init__.py", line 75, in download
    generic.copy(fs, from_infos, localfs, to_infos, callback=cb, batch_size=jobs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 97, in copy
    return _get(
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 262, in _get
    with executor:
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/executors.py", line 44, in __exit__
    self.shutdown(wait=True, cancel_futures=cancel_futures)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/usr/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

2024-05-07 23:47:02,586 DEBUG: Analytics is enabled.                                                
2024-05-07 23:47:02,644 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp50h17rzm', '-v']    
2024-05-07 23:47:02,678 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp50h17rzm', '-v'] with pid 314508                                                                                                
2024-05-07 23:47:02,682 DEBUG: Removing '/tmp/tmp2elzelv6dvc-clone'                                 
2024-05-07 23:47:02,688 DEBUG: Removing '/tmp/tmp47j66_8ldvc-cache'                                 
  1%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000048                                               687k/57.1M [57:19<1:08:10,    14.5kB/s]
  4%|▎         |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000023                                               482k/12.9M [57:19<3:50:00,      947B/s]
 15%|█▍        |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000025                                                5.72M/38.4M [57:20<03:16,     175kB/s]
  1%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000058                                             460k/39.3M [57:19<106:54:50,      106B/s]
  0%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000007                                              276k/79.7M [57:20<14:38:23,    1.58kB/s]
 16%|█▋        |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000005                                              4.48M/27.2M [57:19<6:26:13,    1.03kB/s]
 34%|███▍      |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000010                                                11.5M/33.8M [57:19<01:55,     202kB/s]
 37%|███▋      |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000008                                                13.5M/36.7M [57:19<03:14,     125kB/s]
 20%|██        |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000000                                                14.9M/72.8M [57:20<06:56,     146kB/s]
  3%|▎         |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000055                                             1.08M/39.3M [57:19<15:07:16,      736B/s]
  0%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000013                                             266k/67.4M [57:20<165:16:44,      118B/s]
  3%|▎         |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000001                                             1.96M/60.1M [57:20<22:28:28,      754B/s]
  1%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000059                                               251k/39.4M [57:19<9:11:19,    1.24kB/s]

Expected

I expect dvc import to detect download timeouts and either restart the hanging download or to complete the process with a warning and to then allow for a partial dvc import to attempt to gather only the missing files instead of requiring a complete re-download of the data.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-1020-oem-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.2
Supports:
        gs (gcsfs = 2024.3.1),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2024.4.1)
Config:
        Global: /home/daniel/.config/dvc
        System: /etc/xdg/xdg-cinnamon/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p5
Caches: local
Remotes: local, gs, gs
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1545d141359143110c3029289a778e59

Additional Information (if any):

dluks commented

Update

I found a workaround by running dvc import --no-downloadto first import the .dvcfile only, and then using dvc pull several times to download the remaining data after the above-mentioned failures.

Note: this does not solve the underlying bug, but it is a workaround.

Thanks for the report @dluks and that workaround sounds like a good way around it for now. I'll keep this open but not sure it's actionable at the moment without a more consistent way to reproduce.