iterative/dvc

exp run: unnecessary hashing during experiments

gregstarr opened this issue · 1 comments

Bug Report

Description

Not sure if this is a bug per se; it is probably more of a discussion. I noticed that running experiments in parallel was very slow because they took a long time to start. This is because DVC is recomputing all the hashes for my large dataset.

DVC typically avoids recomputing hashes by using a cache stored in site_cache_dir. On Linux, the site cache dir should be something like /var/tmp/dvc/repo/{hash}. This hash is formed from several components, including root_dir (i.e. the DVC repo dir) and btime. The btime is meant to approximate the creation time of the root directory, but it is actually taken from the mtime of the btime file in the .dvc/tmp folder.

When you run experiments in parallel, copies of the repo are made in the temp directory and the experiments are run from the copies. This means that the specific site cache dir for the repo copies will be different because the repo paths are different and the mtimes of the copied btime files are different. This results in DVC thinking that there is no cache yet and so it recomputes all the necessary hashes for each experiment. I have evidence of this because I only have one dvc repo, but my site cache dir has many cache folders.
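To illustrate the effect described above, here is a minimal sketch. The function `site_cache_token` is hypothetical and hashes only the repo path and btime; DVC's actual token includes more components, and the paths and timestamps below are made up. The point is only that changing either input yields a different cache folder name.

```python
import hashlib


def site_cache_token(root_dir: str, btime: float) -> str:
    """Hypothetical, simplified token: md5 over the repo path and btime only."""
    return hashlib.md5(str((root_dir, btime)).encode()).hexdigest()


# Illustrative values only: a base repo and a temp-dir copy made for one experiment.
base_token = site_cache_token("/home/user/project", 1708000000.0)
copy_token = site_cache_token("/tmp/exp-copy/project", 1708100000.0)

# Even if the copy preserved the btime file's mtime, the path alone changes the token.
same_btime_token = site_cache_token("/tmp/exp-copy/project", 1708000000.0)

print("base:", base_token)
print("copy:", copy_token)
print("copy with same btime:", same_btime_token)
```

Under these assumptions, all three tokens differ, so each repo copy would look up an empty cache folder and start hashing from scratch.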

Unless I'm missing something, it seems like experiments should use the same site cache as the base repo.

Reproduce

  1. Look in your site cache dir and take note of the hashes.
  2. Run a bunch of experiments in parallel.
  3. See that the site cache dir now has more cache folders.

$ ls -al /scratch/tmp/starrgw1/dvc/site_cache_dir/repo/
total 72
drwxrwxrwx 18 starrgw1 starrgw1 4096 Feb 17 06:09 .
drwxrwxr-x  3 starrgw1 starrgw1 4096 Feb 15 17:11 ..
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 18:25 048e839878f97ba9324bb139fa8e4b06
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 0c53c5b78086c5438b3ee6b4aaef570d
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 17 06:09 1f18cf09ad43f0845bea96b6b719b3ee
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 378f0eae8f9824f1f96149c481621d03
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 441bd548b8b298abffb2449dc7c1cf54
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 465bff9fb0df8bd1be46b6ec24fdb069
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 511df2ed3e7fdf1d12303c5929277158
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 73887b1a621845b9038bb7d3ec4ba704
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 20:17 7f4e8b33c6bc7ef879b1491b9ed50fec
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 82c6be6b9d97c42ec7ba7569d39a9a65
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 aee9b76e8f486264f0800522304b53b0
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 18:53 d412c540ff7f186df3641073fe15a061
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 e4efc309f726450d1b3bdb37748a60d5
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 e93d6446a825241907ed374d37e1f58d
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 f0fb5078327924424b4c3ae74fe98b46
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 18:25 f2d3168db34ebf88584f37903b9b3dcc
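
The before/after comparison in the steps above can also be scripted. This is a rough sketch, not part of DVC: `list_cache_dirs` and `new_cache_dirs` are hypothetical helpers, and the demo runs against a throwaway directory with made-up folder names rather than a real site cache dir.

```python
import os
import tempfile


def list_cache_dirs(repos_dir):
    """Return the names of the per-repo cache folders under repos_dir."""
    return {entry.name for entry in os.scandir(repos_dir) if entry.is_dir()}


def new_cache_dirs(before, after):
    """Return folder names present in `after` but not in `before`, sorted."""
    return sorted(after - before)


# Demo against a temporary directory; the folder names are placeholders.
with tempfile.TemporaryDirectory() as repos_dir:
    os.mkdir(os.path.join(repos_dir, "base"))
    before = list_cache_dirs(repos_dir)
    # Stand-in for "run a bunch of experiments in parallel".
    for name in ("exp-a", "exp-b"):
        os.mkdir(os.path.join(repos_dir, name))
    created = new_cache_dirs(before, list_cache_dirs(repos_dir))

print(created)
```

In real use, you would take the first snapshot of the `repo/` directory before starting the experiments and the second one afterwards.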

Environment information

$ dvc doctor
DVC version: 3.38.1 (pip)
-------------------------
Platform: Python 3.10.13 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 3.7.0
        dvc_objects = 3.0.3
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 2.0.2
Supports:
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3)
Config:
        Global: /home/starrgw1/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: lustre on 192.168.199.212@o2ib:192.168.199.213@o2ib:/scratch
Caches: local
Remotes: local
Workspace directory: nfs on master:/home
Repo: dvc, git
Repo.site_cache_dir: /scratch/tmp/starrgw1/dvc/site_cache_dir/repo/d412c540ff7f186df3641073fe15a061

Possibly related: #9813