unchanged files showing as changed
sherwoac opened this issue · 8 comments
Bug Report
files show as changed but are not.
I wanted to make a simple change to a tracked JSON file and push it back to the DVC remote (S3). This is the first test of a repo I recently pulled from that remote with dvc pull. On the local problem repo:
adam@z10:~/DATA/***/snt-lab-data$ dvc add ./mds_simulated/2023-09-19-134852/labels.json
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: None, md5: 5d4df06eca9b1487cc4d87ed190819f1.dir
Adding...
ERROR: could not read '5d4df06eca9b1487cc4d87ed190819f1.dir'
adam@z10:~/DATA/***/snt-lab-data$
ok, so I guess I have an issue, I thought I would check I'm in sync with remote by doing a pull from the remote, anticipating a warning about my local changed file:
adam@z10:~/DATA/***/snt-lab-data$ dvc pull
Collecting |0.00 [00:00, ?entry/s]
Fetching
Building workspace index |41.9k [00:01, 22.2kentry/s]
Comparing indexes |41.9k [00:00, 77.5kentry/s]
ERROR: failed to pull data from the cloud - Can't remove the following unsaved files without confirmation. Use `--force` to force.
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP0_LI30_TR5_6Ucolor_2023-05-16-17-43-50/BS_DB_LP0_LI30_TR5_6Ucolor_2023-05-16-17-43-50.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-11-31/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-11-31.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-20-10/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-20-10.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-16-01-51/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-16-01-51.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-23-19/BS_DB_LP2_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-23-19.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP1_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-15-58-15/BS_DB_LP1_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-15-58-15.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP0_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-10-28/BS_DB_LP0_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-10-28.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp300-1000_2023-05-17-16-05-07/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp300-1000_2023-05-17-16-05-07.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-04-18/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-04-18.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-15-30/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-15-30.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-26-47/BS_DB_LP2_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-26-47.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP0_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-17-43-50/BS_DB_LP0_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-17-43-50.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP0_LI40_TR0__2023-01-19-11-20-45/BS_DB_LP0_LI40_TR0__2023-01-19-11-20-45.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP0_LI20_TR0__2023-01-19-11-17-36/BS_DB_LP0_LI20_TR0__2023-01-19-11-17-36.bag
/home/adam/DATA/***/snt-lab-data/test-5/SetupValidation_2023-01-19-10-30-57/SetupValidation_2023-01-19-10-30-57.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP1_LI20_TR0__2023-01-19-11-28-24/BS_DB_LP1_LI20_TR0__2023-01-19-11-28-24.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP1_LI40_TR5__2023-01-19-11-31-54/BS_DB_LP1_LI40_TR5__2023-01-19-11-31-54.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP0_LI40_TR5__2023-01-19-11-23-37/BS_DB_LP0_LI40_TR5__2023-01-19-11-23-37.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP5_LI40_TR5_noOptitrack_2023-01-19-11-59-23/BS_DB_LP5_LI40_TR5_noOptitrack_2023-01-19-11-59-23.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP5_LI40_TR5__2023-01-19-11-42-44/BS_DB_LP5_LI40_TR5__2023-01-19-11-42-44.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP5_LI20_TR0__2023-01-19-11-38-54/BS_DB_LP5_LI20_TR0__2023-01-19-11-38-54.bag
/home/adam/DATA/***/snt-lab-data/test-5/validationFiday_2023-01-20-12-37-45/validationFiday_2023-01-20-12-37-45.bag
/home/adam/DATA/***/snt-lab-data/mds_simulated/2023-09-19-134852/labels.json
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI25_LP1_2022-12-02-11-15-53/BS_DB_LI25_LP1_2022-12-02-11-15-53.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI40_LP3_2022-12-01-11-15-46/BS_DB_LI40_LP3_2022-12-01-11-15-46.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI25_LP1_2022-12-01-11-37-21/BS_DB_LI25_LP1_2022-12-01-11-37-21.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI25_LP3_2022-12-02-11-11-48/BS_DB_LI25_LP3_2022-12-02-11-11-48.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI10_LP3_roundtrip_2022-12-01-11-48-14/BS_DB_LI10_LP3_roundtrip_2022-12-01-11-48-14.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI40_LP1_2022-12-02-11-18-23/BS_DB_LI40_LP1_2022-12-02-11-18-23.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP4_LI30_TRV_Tgray_2023-03-31-16-04-58/BS_DS_closed_loop_LP4_LI30_TRV_Tgray_2023-03-31-16-04-58.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP1_LI30_TRV_Tgray_2023-03-31-16-25-04/BS_DS_closed_loop_LP1_LI30_TRV_Tgray_2023-03-31-16-25-04.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP5_LI40_TRV_Tgray_2023-03-31-14-06-56/BS_DS_closed_loop_LP5_LI40_TRV_Tgray_2023-03-31-14-06-56.bag
wow, that's weird, the .bag files are unchanged. To check, I did an md5sum on them, on this local repo and another (originating golden source), and checked they're the same:
on originating golden source repo:
ubuntu@ip-***:/data/snt-lab-data$ md5sum test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
0ebd33226d18e8699116efd0be55f173 test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
and local problem repo:
adam@z10:~/DATA/***/snt-lab-data$ md5sum test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
0ebd33226d18e8699116efd0be55f173 test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
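To run the same check across every .bag file rather than one at a time, something like the following could be used. It is only a sketch: the helper name md5_manifest and the manifest file names are made up for illustration, and it assumes GNU find and md5sum are available on both machines.

```shell
# Hypothetical helper: print a sorted "md5  path" manifest for every
# .bag file under the given directory, with paths relative to it.
md5_manifest() {
    dir="$1"
    ( cd "$dir" && find . -type f -name '*.bag' -exec md5sum {} + | sort -k 2 )
}
```

Running md5_manifest /data/snt-lab-data > golden.md5 on the golden source and md5_manifest ~/DATA/***/snt-lab-data > local.md5 on the local repo, then diff golden.md5 local.md5, should print nothing if the files really are byte-identical.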
dvc status output on local problem repo:
adam@z10:~/DATA/***/snt-lab-data$ dvc status
test-5.dvc:
changed outs:
modified: test-5
test-8.dvc:
changed outs:
modified: test-8
test-4.dvc:
changed outs:
modified: test-4
mds_simulated.dvc:
changed outs:
modified: mds_simulated
test-7.dvc:
changed outs:
modified: test-7
adam@z10:~/DATA/***/snt-lab-data$
dvc doctor output on local problem repo:
adam@z10:~/DATA/***/snt-lab-data$ dvc doctor
DVC version: 3.50.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Subprojects:
dvc_data = 3.15.1
dvc_objects = 5.1.0
dvc_render = 1.0.1
dvc_task = 0.4.0
scmrepo = 3.3.1
Supports:
http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
s3 (s3fs = 2024.3.1, boto3 = 1.34.51),
ssh (sshfs = 2024.4.1)
Config:
Global: /home/adam/.config/dvc
System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: ssh, s3
Workspace directory: ext4 on /dev/nvme0n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/8877645b7466dfc116ad2a8134c823c5
adam@z10:~/DATA/***/snt-lab-data$
dvc doctor on originating golden source repo (which I pushed to s3):
ubuntu@ip-***:/data/snt-lab-data$ dvc doctor
DVC version: 2.57.2 (pip)
-------------------------
Platform: Python 3.8.10 on Linux-5.15.0-1033-aws-x86_64-with-glibc2.29
Subprojects:
dvc_data = 0.51.0
dvc_objects = 0.22.0
dvc_render = 0.5.2
dvc_task = 0.2.1
scmrepo = 1.0.3
Supports:
azure (adlfs = 2023.4.0, knack = 0.10.1, azure-identity = 1.13.0),
gdrive (pydrive2 = 1.15.3),
gs (gcsfs = 2023.5.0),
hdfs (fsspec = 2023.5.0, pyarrow = 12.0.0),
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
oss (ossfs = 2023.5.0),
s3 (s3fs = 2023.5.0, boto3 = 1.26.76),
ssh (sshfs = 2023.4.1),
webdav (webdav4 = 0.9.8),
webdavs (webdav4 = 0.9.8),
webhdfs (fsspec = 2023.5.0)
Config:
Global: /home/ubuntu/.config/dvc
System: /etc/xdg/dvc
Cache types: reflink, hardlink, symlink
Cache directory: xfs on /dev/xvdf
Caches: local
Remotes: ssh, s3, local
Workspace directory: xfs on /dev/xvdf
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/a639c39231cdedf7ce3c5e00b401799d
You are using dvc version 2.57.2; however, version 3.50.0 is available.
To upgrade, run 'pip install --upgrade dvc'.
ubuntu@ip-***:/data/snt-lab-data$
...which is way out of date, but I don't think that's material to this issue, as local data that hasn't changed is showing as changed.
Could you try dvc cache migrate --dvc-files to make sure all the data is being tracked in dvc 3.x mode?
thanks for the response.
I ran that command and it seemed to take a long time to recalculate all the hashes, as intended I guess. I'm not sure the result is what I want though: when I run dvc status --cloud (which I think compares the local repo to the 'cloud'), nearly everything shows as new.
adam@z10:~/DATA/***/snt-lab-data$ find . -type f -not -path '*/.*' | wc -l
41830
adam@z10:~/DATA/***/snt-lab-data$ dvc status --cloud | grep new | wc -l
40452
Just to reiterate the mission here: I made a small change to one file (labels.json) out of ~40k files, but ~25 large files showed as changed too. I just wanted to change and update that one file.
so now it looks like I'm left with the prospect of either:
- deleting all the data, downloading it again, making the small change to the labels.json file and hoping that the change tracking works this time, or;
- uploading all the 'changed' data that isn't actually changed
any further suggestions welcome.
addendum:
adam@z10:~/DATA/***/snt-lab-data$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: LabData-2022-09-15-for-training.dvc
modified: mds_simulated.dvc
modified: office_data.dvc
modified: test-4.dvc
modified: test-5.dvc
modified: test-7.dvc
modified: test-8.dvc
Unfortunately, 3.x is a major version update that changes how all file hashes are calculated, so even though your data change is minor, updating your repo to 3.x is not. As soon as you change any file in a directory that is tracked by dvc, the whole directory will have to be migrated to 3.x. If you want to continue with 3.x, I think you are best off uploading all the 3.x data, after which hopefully everything will be smooth. The alternative is to stay on 2.x for now. Apologies for the inconvenience.
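One way to see which tracked directories are still in 2.x format is to look inside the .dvc files. The sketch below rests on an assumption that is worth verifying against your own files: that .dvc files written by DVC 3.x carry a "hash: md5" line on each output, while files written by 2.x do not. The helper name legacy_dvc_files is hypothetical, and it only inspects .dvc files directly inside the given directory.

```shell
# Hypothetical helper: list .dvc files in a directory that have no
# "hash: md5" line, i.e. (by the assumption above) were last written by DVC 2.x.
legacy_dvc_files() {
    dir="${1:-.}"
    for f in "$dir"/*.dvc; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        grep -q 'hash: md5' "$f" || printf '%s\n' "$f"
    done
}
```

Calling legacy_dvc_files . from the repo root before and after dvc cache migrate --dvc-files would then show which .dvc files were rewritten.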
Thanks again for your response.
I didn't appreciate the implications of your initial advice, might be worth adding that in advance in future.
In my original post I mentioned that I have the original source of the data using an old version of DVC:
...
DVC version: 2.57.2 (pip)
am I right in thinking:
- I could update this version
- run the same upgrade cache command you suggested (dvc cache migrate --dvc-files)
- push the changes to the (S3) repo with dvc push --remote ..
- ..then my local (which I ran dvc cache migrate --dvc-files on above) and remote repos could be in sync?
Yes, it's hard to say for sure, but I suspect that having all local and remote copies of your repo migrated will get you back to a smoother working state.
thanks for the response, appreciate your support.
I have now run dvc cache migrate --dvc-files on my golden copy. The cache migration is complete but, as there aren't any local data files changed:
ubuntu@ip-***:/data/snt-lab-data$ dvc status
Data and pipelines are up to date.
I have nothing to commit I guess. I've just done a remote push with dvc push --remote <my remote> and it seems to be updating the hashes. Is that the right way to update the remote after the cache migration?
ok, so just to complete this issue, in the end I:
- migrated both caches
- pushed my golden source cache to s3
- checked my local cache was in sync with it, showing just the files I changed locally:
adam@z10:~/DATA/***/snt-lab-data$ dvc status --cloud
new: mds_simulated
new: mds_simulated/2023-09-19-134852/labels.json
- then added that change with dvc add mds_simulated/2023-09-19-134852/labels.json
- then dvc push
- now dvc status --cloud shows all in sync
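For anyone landing here later, the steps above can be sketched as three small shell functions. The function names are hypothetical and the remote name is whatever your own remote is called. The dvc commands themselves are the ones quoted in this thread.

```shell
# Run in every copy of the repo (golden source and local clone):
# re-hash the tracked data and rewrite the .dvc files in 3.x format.
migrate_repo() {
    dvc cache migrate --dvc-files
}

# Run in the golden source copy after migrating: upload the migrated cache.
# $1 is the name of your DVC remote.
push_migrated() {
    dvc push --remote "$1"
}

# Run in the local clone once the remote holds the migrated data.
# $1 is the path of the one file that was edited.
push_single_change() {
    dvc status --cloud          # expect only the edited file and its directory as new
    dvc add "$1" || return 1
    dvc push || return 1
    dvc status --cloud          # should now report everything in sync
}
```

In this case the order was: migrate_repo in both copies, push_migrated in the golden source, then push_single_change mds_simulated/2023-09-19-134852/labels.json in the local clone.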
thanks for your help.