iterative/dvc

unchanged files showing as changed

sherwoac opened this issue · 8 comments

Bug Report

Files show as changed but are not.

I wanted to make a simple change to a tracked JSON file and push it back to the DVC remote (S3). This was the first test of a local repo (the "local problem repo" below) that I had recently populated with dvc pull from that remote:

adam@z10:~/DATA/***/snt-lab-data$ dvc add ./mds_simulated/2023-09-19-134852/labels.json
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                      
name: None, md5: 5d4df06eca9b1487cc4d87ed190819f1.dir                                                                  
Adding...
ERROR: could not read '5d4df06eca9b1487cc4d87ed190819f1.dir'
adam@z10:~/DATA/***/snt-lab-data$ 

OK, so I guess I have an issue. I thought I would check that I'm in sync with the remote by pulling from it, anticipating a warning about my locally changed file:

adam@z10:~/DATA/***/snt-lab-data$ dvc pull
Collecting                                                                                   |0.00 [00:00,    ?entry/s]
Fetching
Building workspace index                                                                   |41.9k [00:01, 22.2kentry/s]
Comparing indexes                                                                          |41.9k [00:00, 77.5kentry/s]
ERROR: failed to pull data from the cloud - Can't remove the following unsaved files without confirmation. Use `--force` to force.
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP0_LI30_TR5_6Ucolor_2023-05-16-17-43-50/BS_DB_LP0_LI30_TR5_6Ucolor_2023-05-16-17-43-50.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-11-31/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-11-31.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-20-10/BS_DB_LP1_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-20-10.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-16-01-51/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-16-01-51.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-23-19/BS_DB_LP2_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-23-19.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP1_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-15-58-15/BS_DB_LP1_LI40_TR5_1Ucolor_IntelExp150_2023-05-17-15-58-15.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP0_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-10-28/BS_DB_LP0_LI30_TR5_6Uwhite_IntelExp50_2023-05-17-11-10-28.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp300-1000_2023-05-17-16-05-07/BS_DB_LP2_LI40_TR5_1Ucolor_IntelExp300-1000_2023-05-17-16-05-07.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-04-18/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-18-04-18.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-15-30/BS_DB_LP5_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-15-30.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP2_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-26-47/BS_DB_LP2_LI30_TR5_6Ucolor_IntelExp50_2023-05-17-10-26-47.bag
/home/adam/DATA/***/snt-lab-data/test-8/BS_DB_LP0_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-17-43-50/BS_DB_LP0_LI30_TR5_6Ucolor_IntelExp200_2023-05-16-17-43-50.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP0_LI40_TR0__2023-01-19-11-20-45/BS_DB_LP0_LI40_TR0__2023-01-19-11-20-45.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP0_LI20_TR0__2023-01-19-11-17-36/BS_DB_LP0_LI20_TR0__2023-01-19-11-17-36.bag
/home/adam/DATA/***/snt-lab-data/test-5/SetupValidation_2023-01-19-10-30-57/SetupValidation_2023-01-19-10-30-57.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP1_LI20_TR0__2023-01-19-11-28-24/BS_DB_LP1_LI20_TR0__2023-01-19-11-28-24.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP1_LI40_TR5__2023-01-19-11-31-54/BS_DB_LP1_LI40_TR5__2023-01-19-11-31-54.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP0_LI40_TR5__2023-01-19-11-23-37/BS_DB_LP0_LI40_TR5__2023-01-19-11-23-37.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP5_LI40_TR5_noOptitrack_2023-01-19-11-59-23/BS_DB_LP5_LI40_TR5_noOptitrack_2023-01-19-11-59-23.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP5_LI40_TR5__2023-01-19-11-42-44/BS_DB_LP5_LI40_TR5__2023-01-19-11-42-44.bag
/home/adam/DATA/***/snt-lab-data/test-5/BS_DB_LP5_LI20_TR0__2023-01-19-11-38-54/BS_DB_LP5_LI20_TR0__2023-01-19-11-38-54.bag
/home/adam/DATA/***/snt-lab-data/test-5/validationFiday_2023-01-20-12-37-45/validationFiday_2023-01-20-12-37-45.bag
/home/adam/DATA/***/snt-lab-data/mds_simulated/2023-09-19-134852/labels.json
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI25_LP1_2022-12-02-11-15-53/BS_DB_LI25_LP1_2022-12-02-11-15-53.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI40_LP3_2022-12-01-11-15-46/BS_DB_LI40_LP3_2022-12-01-11-15-46.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI25_LP1_2022-12-01-11-37-21/BS_DB_LI25_LP1_2022-12-01-11-37-21.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI25_LP3_2022-12-02-11-11-48/BS_DB_LI25_LP3_2022-12-02-11-11-48.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI10_LP3_roundtrip_2022-12-01-11-48-14/BS_DB_LI10_LP3_roundtrip_2022-12-01-11-48-14.bag
/home/adam/DATA/***/snt-lab-data/test-4/Data/BS_DB_LI40_LP1_2022-12-02-11-18-23/BS_DB_LI40_LP1_2022-12-02-11-18-23.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP4_LI30_TRV_Tgray_2023-03-31-16-04-58/BS_DS_closed_loop_LP4_LI30_TRV_Tgray_2023-03-31-16-04-58.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP1_LI30_TRV_Tgray_2023-03-31-16-25-04/BS_DS_closed_loop_LP1_LI30_TRV_Tgray_2023-03-31-16-25-04.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
/home/adam/DATA/***/snt-lab-data/test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP5_LI40_TRV_Tgray_2023-03-31-14-06-56/BS_DS_closed_loop_LP5_LI40_TRV_Tgray_2023-03-31-14-06-56.bag

Wow, that's weird: the .bag files are unchanged. To check, I ran md5sum on them in this local repo and in another (the originating golden source) and confirmed the hashes are the same.

On the originating golden source repo:

ubuntu@ip-***:/data/snt-lab-data$ md5sum test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
0ebd33226d18e8699116efd0be55f173  test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag

And on the local problem repo:

adam@z10:~/DATA/***/snt-lab-data$ md5sum test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
0ebd33226d18e8699116efd0be55f173  test-7/BS_test_Mar23/verified/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49/BS_DS_closed_loop_LP0_LI40_TRV_Tgray_2023-03-31-14-27-49.bag
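
For anyone wanting to repeat this check across many files, here is a minimal Python sketch of the same comparison. It only computes a plain MD5 over the raw bytes, like md5sum does; the function names `file_md5` and `compare_with_expected` are made up for this example and are not part of DVC.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat for large files such as .bag recordings.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_with_expected(path, expected_md5):
    """Return True if the local file matches a digest computed on another machine."""
    return file_md5(path) == expected_md5.strip().lower()
```

Calling `compare_with_expected(path, digest)` with the digest printed by md5sum on the golden source repo should return True for an unchanged file.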

dvc status output on local problem repo:

adam@z10:~/DATA/***/snt-lab-data$ dvc status
test-5.dvc:                                                                                                            
	changed outs:
		modified:           test-5
test-8.dvc:
	changed outs:
		modified:           test-8
test-4.dvc:
	changed outs:
		modified:           test-4
mds_simulated.dvc:
	changed outs:
		modified:           mds_simulated
test-7.dvc:
	changed outs:
		modified:           test-7
adam@z10:~/DATA/***/snt-lab-data$ 

dvc doctor output on local problem repo:

adam@z10:~/DATA/***/snt-lab-data$ dvc doctor
DVC version: 3.50.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 3.15.1
	dvc_objects = 5.1.0
	dvc_render = 1.0.1
	dvc_task = 0.4.0
	scmrepo = 3.3.1
Supports:
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2024.3.1, boto3 = 1.34.51),
	ssh (sshfs = 2024.4.1)
Config:
	Global: /home/adam/.config/dvc
	System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: ssh, s3
Workspace directory: ext4 on /dev/nvme0n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/8877645b7466dfc116ad2a8134c823c5
adam@z10:~/DATA/***/snt-lab-data$ 

dvc doctor on originating golden source repo (which I pushed to s3):

ubuntu@ip-***:/data/snt-lab-data$ dvc doctor
DVC version: 2.57.2 (pip)
-------------------------
Platform: Python 3.8.10 on Linux-5.15.0-1033-aws-x86_64-with-glibc2.29
Subprojects:
	dvc_data = 0.51.0
	dvc_objects = 0.22.0
	dvc_render = 0.5.2
	dvc_task = 0.2.1
	scmrepo = 1.0.3
Supports:
	azure (adlfs = 2023.4.0, knack = 0.10.1, azure-identity = 1.13.0),
	gdrive (pydrive2 = 1.15.3),
	gs (gcsfs = 2023.5.0),
	hdfs (fsspec = 2023.5.0, pyarrow = 12.0.0),
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	oss (ossfs = 2023.5.0),
	s3 (s3fs = 2023.5.0, boto3 = 1.26.76),
	ssh (sshfs = 2023.4.1),
	webdav (webdav4 = 0.9.8),
	webdavs (webdav4 = 0.9.8),
	webhdfs (fsspec = 2023.5.0)
Config:
	Global: /home/ubuntu/.config/dvc
	System: /etc/xdg/dvc
Cache types: reflink, hardlink, symlink
Cache directory: xfs on /dev/xvdf
Caches: local
Remotes: ssh, s3, local
Workspace directory: xfs on /dev/xvdf
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/a639c39231cdedf7ce3c5e00b401799d

You are using dvc version 2.57.2; however, version 3.50.0 is available.
To upgrade, run 'pip install --upgrade dvc'.
ubuntu@ip-***:/data/snt-lab-data$ 

...which is way out of date, but I don't think that is material to this issue, since local data that hasn't changed is showing as changed.

Could you try dvc cache migrate --dvc-files to make sure all the data is being tracked in dvc 3.x mode?

thanks for the response.

I ran that command and it seemed to take a long time to recalculate all the hashes, as intended I guess. I'm not sure the result is what I want though: when I run dvc status --cloud (which I think compares the local repo to the 'cloud'), nearly everything shows as new.

adam@z10:~/DATA/***/snt-lab-data$ find . -type f -not -path '*/.*' | wc -l
41830
adam@z10:~/DATA/***/snt-lab-data$ dvc status --cloud | grep new | wc -l
40452                                                                                                                  
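
The two counts above can also be taken from Python, which may be handy when scripting the comparison. This is a rough sketch: `count_visible_files` approximates the find command (it skips dot-prefixed files and directories, and symlinks), and `count_new_entries` mirrors `grep new | wc -l` on text you have already captured from dvc status --cloud. Both names are hypothetical helpers, not DVC APIs.

```python
import os


def count_visible_files(root):
    """Count files under root, skipping any dot-prefixed file or directory."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories (e.g. .git, .dvc) so they are never descended into.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip hidden files and symlinks (find's -type f does not match symlinks).
            if name.startswith(".") or os.path.islink(full):
                continue
            total += 1
    return total


def count_new_entries(status_output):
    """Count output lines containing 'new', like `grep new | wc -l`."""
    return sum(1 for line in status_output.splitlines() if "new" in line)
```

If the two numbers are close, as they are here (41830 files, 40452 reported as new), nearly the whole dataset is being treated as not yet present on the remote.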

Just to reiterate the mission here: I made a small change to one file (labels.json) out of ~40k files, but ~25 large files showed as changed too. I just wanted to change and update that one file.

so now it looks like I'm left with the prospect of either:

  1. deleting all the data and downloading it again, and making the small change to the labels.json file and hoping that the change tracking works this time, or;
  2. uploading all the 'changed' data, that isn't actually changed

any further suggestions welcome.

addendum:

adam@z10:~/DATA/***/snt-lab-data$ git status
On branch master
Your branch is up-to-date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   LabData-2022-09-15-for-training.dvc
	modified:   mds_simulated.dvc
	modified:   office_data.dvc
	modified:   test-4.dvc
	modified:   test-5.dvc
	modified:   test-7.dvc
	modified:   test-8.dvc

Unfortunately, 3.x is a major version update that changes how all file hashes are calculated, so even though your data change is minor, updating your repo to 3.x is not. As soon as you change any file in a directory that is tracked by dvc, the whole directory will have to be migrated to 3.x. If you want to continue with 3.x, I think you are best off uploading all the 3.x data, after which hopefully everything will be smooth. The alternative is to stay on 2.x for now. Apologies for the inconvenience.
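
The reply above does not say what exactly changed in the hash calculation. As an assumption drawn from the DVC 3.0 release notes rather than from this thread: 2.x hashed text files after converting CRLF line endings to LF, while 3.x hashes the raw bytes and records which scheme each tracked entry uses. The sketch below only illustrates that idea with made-up helper names; it is not DVC's code.

```python
import hashlib


def md5_raw(data: bytes) -> str:
    """MD5 over the bytes exactly as stored (what md5sum reports)."""
    return hashlib.md5(data).hexdigest()


def md5_crlf_normalized(data: bytes) -> str:
    """MD5 after converting CRLF to LF (assumed 2.x-style behaviour for text files)."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


# Content without CRLF sequences hashes identically under both schemes.
lf_only = b"frame,label\n1,ok\n"
same = md5_raw(lf_only) == md5_crlf_normalized(lf_only)

# Content with CRLF line endings gets two different digests.
with_crlf = b"frame,label\r\n1,ok\r\n"
differs = md5_raw(with_crlf) != md5_crlf_normalized(with_crlf)
```

Under that assumption, an identical md5sum for a .bag file on both machines is expected, and the "modified" status would come from how the tracked directories are recorded in 2.x versus 3.x rather than from the file bytes.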

Thanks again for your response.

I didn't appreciate the implications of your initial advice; it might be worth spelling those out in advance in future.

In my original post I mentioned that I have the original source of the data using an old version of DVC:

...
DVC version: 2.57.2 (pip)

am I right in thinking:

  • I could update this version,
  • run the same cache upgrade command you suggested (dvc cache migrate --dvc-files),
  • push the changes to the (S3) repo (dvc push --remote ..),
  • ..and then my local repo (on which I ran dvc cache migrate --dvc-files above) and the remote could be in sync?

Yes, it's hard to say for sure, but I suspect that having all local and remote copies of your repo migrated will get you back to a smoother working state.

thanks for the response, appreciate your support.

I have now run:

dvc cache migrate --dvc-files

on my golden copy, the cache migration is complete, but, as there aren't any local data files changed:

ubuntu@ip-***:/data/snt-lab-data$ dvc status
Data and pipelines are up to date.

I have nothing to commit I guess. I've just done a remote push with:

dvc push --remote <my remote>

..and it seems to be updating the hashes. Is that the right way to update the remote after the cache migration?

ok, so just to complete this issue, in the end I:

  • migrated both caches
  • pushed my golden source cache to s3
  • checked my local cache was in sync with it:
    adam@z10:~/DATA/***/snt-lab-data$ dvc status --cloud
            new:                mds_simulated                                                                              
    	new:                mds_simulated/2023-09-19-134852/labels.json
    
    showing just the files I changed locally
  • then added that change with dvc add mds_simulated/2023-09-19-134852/labels.json
  • then dvc push
  • now dvc status --cloud shows all in sync
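
The steps above can be sketched as a small Python script for anyone repeating them. It uses only the dvc commands quoted in this thread. The function names, the `dry_run` switch, and the idea of passing the remote name as an argument are assumptions made for the example. With `dry_run=True` (the default) nothing is executed and the commands are only printed.

```python
import shlex
import subprocess


def golden_source_steps(remote):
    """Commands run on the golden source repo: migrate, then push to the remote."""
    return [
        ["dvc", "cache", "migrate", "--dvc-files"],
        ["dvc", "push", "--remote", remote],
    ]


def local_repo_steps(changed_path):
    """Commands run on the local repo: migrate, check, add the change, push, re-check."""
    return [
        ["dvc", "cache", "migrate", "--dvc-files"],
        ["dvc", "status", "--cloud"],
        ["dvc", "add", changed_path],
        ["dvc", "push"],
        ["dvc", "status", "--cloud"],
    ]


def run_steps(steps, cwd=".", dry_run=True):
    """Print each command and run it only when dry_run is False."""
    rendered = []
    for argv in steps:
        line = shlex.join(argv)
        rendered.append(line)
        print(f"[{cwd}] $ {line}")
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(argv, cwd=cwd, check=True)
    return rendered
```

For example, `run_steps(local_repo_steps("mds_simulated/2023-09-19-134852/labels.json"))` prints the five local commands in order. Keep in mind that the migration rehashes the whole dataset, so the real run can take a long time.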

thanks for your help.