neuropoly/intranet.neuro.polymtl.ca

git-annex whereis returns nothing

jcohenadad opened this issue · 7 comments

I've uploaded a new dataset (canproco), and at the very end, git-annex whereis returns nothing:

julien-macbook:~/data.neuro/canproco $ git annex whereis
julien-macbook:~/data.neuro/canproco $ 

It looks like all the files in canproco are directly in the repository, instead of being handled by git-annex. So, the good news is that the files are all available (with a simple git clone), but the bad news is that a simple git clone downloads 16G of data directly.
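One way to check this kind of thing, sketched in a throwaway repository rather than in canproco itself: ask git how large the blob it stores for a path is. A file committed with plain git has a blob as large as the file, whereas an annexed file is only a short symlink or pointer. The directory layout, file name, and 1 MiB size below are made up for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway repository so nothing here touches a real dataset.
demo=$(mktemp -d)
cd "$demo"
git init -q .

# Hypothetical 1 MiB stand-in for an imaging file.
mkdir -p sub-01/anat
dd if=/dev/zero of=sub-01/anat/sub-01_T2w.nii.gz bs=1024 count=1024 2>/dev/null

# Commit it with plain git, with no git-annex involved.
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add data"

# Size in bytes of the blob that git itself stores for that path.
blob_size=$(git cat-file -s HEAD:sub-01/anat/sub-01_T2w.nii.gz)
echo "blob size: $blob_size bytes"
```

If the reported size matches the file size on disk, the content is in git directly and every clone will download it.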

Looking at git log, it seems that everything was added in the initial commit, instead of just the configuration files (.gitattributes and .gitignore), followed by a second commit with the data files (that would be handled by git-annex). Which instructions did you follow?

I can try to fix things, but I'll have to force-push the master branch, otherwise the 16G of data files would always be downloaded as part of the history.
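To illustrate why a follow-up commit would not be enough, here is a small sketch in a throwaway repository (the file name and contents are hypothetical): removing a file in a new commit leaves its blob reachable from the earlier commit, so it is still part of what a clone transfers.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)
cd "$demo"
git init -q .

# Helper so the demo does not depend on a configured git identity.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# Hypothetical stand-in for a data file committed directly to git.
printf 'stand-in for imaging data\n' > scan.nii.gz
git add scan.nii.gz
g commit -q -m "Initial commit"
blob=$(git rev-parse HEAD:scan.nii.gz)

# Remove the file in a second commit.
git rm -q scan.nii.gz
g commit -q -m "Remove data"

# The blob is still reachable through the first commit.
if git rev-list --objects HEAD | grep -q "^$blob"; then
    still_in_history=yes
else
    still_in_history=no
fi
echo "blob still in history: $still_in_history"
```

Only rewriting the history (and force-pushing the result) makes the blob unreachable from the branch.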

> Looking at git log, it seems that everything was added in the initial commit, instead of just the configuration files (.gitattributes and .gitignore), followed by a second commit with the data files (that would be handled by git-annex). Which instructions did you follow?

I'll copy/paste the screenshots in case the page evolves with time (so as to not create confusion).

I started from this:
[screenshot]

So, the first step was to click on the 'recipe', which brought me here:
[screenshot]

Given that the repo with the data was already there, I skipped:

mkdir my-new-repo
cd my-new-repo

and I did all the rest.

But then, I realized at the end of the recipe that I should NOT have included the data, because I saw this:
[screenshot]

but it was too late, so I continued.

Maybe we should add a big disclaimer saying something like: "THE FOLDER SHOULD BE EMPTY" or something like that. I'll suggest something in a PR.

> I can try to fix things, but I'll have to force-push the master branch, otherwise the 16G of data files would always be downloaded as part of the history.

Hmm, I don't know if it is worth your time (given the other priorities in the lab). At some point we discussed the possibility of dropping git-annex and relying only on git (with the inconvenience of double-size repos). In light of the repeated confusion in the lab since git-annex was introduced, maybe we can revisit this option?

kousu commented

> Maybe we should add a big disclaimer saying something like: "THE FOLDER SHOULD BE EMPTY" or something like that. I'll suggest something in a PR.

git add README && git commit -m "Initial commit"

and

git add .gitignore .gitattributes && git commit -m "Configure git-annex"

should have achieved the same thing as leaving the folder empty, but more flexibly, because they name the files to add explicitly and so carve out the git-annex configuration before git-annex or the dataset files get involved. Either order should be safe, and I must have written those commands the way they are with that in mind.

So maybe a clearer wording for that last section would be:

# If you have not yet, copy in your dataset files now using
# rsync, wget, curl, tar, dropbox, etc <...>

git add .

Judging from the latest git log

canproco git log
commit 1576b3430d958d01473555b55d30b0abbae18ea3 (HEAD -> master)
Author: Julien Cohen-Adad <jcohen@polymtl.ca>
Date:   Wed Sep 14 11:31:35 2022 -0400

    Configure git-annex

.gitattributes
.gitignore
dataset_description.json
participants.json
participants.tsv
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_STIR.json
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_STIR.nii.gz
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_T2star.json
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_T2star.nii.gz
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_T2w.json
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_T2w.nii.gz
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_acq-MT_MTS.json
sub-cal056/ses-M0/anat/sub-cal056_ses-M0_acq-MT_MTS.nii.gz
sub-cal072/ses-M0/anat/sub-cal072_ses-M0_STIR.json
sub-cal072/ses-M0/anat/sub-cal072_ses-M0_STIR.nii.gz
sub-cal072/ses-M0/anat/sub-cal072_ses-M0_T2star.json
sub-cal072/ses-M0/anat/sub-cal072_ses-M0_T2star.nii.gz
...

you accidentally did

git add . && git commit -m "Configure git-annex"

instead, which unfortunately means something different.
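For anyone following along, the difference can be seen in a throwaway repository. This is only a sketch: the placeholder file contents and the data file name are made up, and no git-annex is involved, just what each form of git add stages.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)
cd "$demo"
git init -q .

# Hypothetical working tree: two config files plus one data file already present.
printf '# placeholder\n' > .gitattributes
printf '# placeholder\n' > .gitignore
printf 'not real imaging data\n' > sub-01_T2w.nii.gz

# Explicit paths: only the two config files are staged.
git add .gitignore .gitattributes
echo "staged after 'git add .gitignore .gitattributes':"
git diff --cached --name-only
n_explicit=$(git diff --cached --name-only | wc -l | tr -d ' ')

# "git add ." stages everything in the working tree, data included.
git add .
echo "staged after 'git add .':"
git diff --cached --name-only
n_all=$(git diff --cached --name-only | wc -l | tr -d ' ')
```

With explicit paths the data file stays unstaged, so it can go into a later commit once the configuration is in place; with `git add .` it lands in the same commit as the configuration.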

It's not a big deal, it's just something we need to back out and rewrite, and this is why we've imposed code review on ourselves on these datasets after all. The fastest fix would just be for @mguaypaq to download the current state, erase the .git folder, and re-follow the recipe from there, and then erase/rename the dataset on the server side before re-uploading.

Ok, this is done, so data.neuro.polymtl.ca:datasets/canproco should be in a good state right now. I tested by doing:

$ git clone git@data.neuro.polymtl.ca:datasets/canproco
$ cd canproco
$ file=sub-cal104/ses-M0/anat/sub-cal104_ses-M0_acq-MT_MTS.nii.gz
$ git annex whereis "$file"
(merging origin/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
whereis sub-cal104/ses-M0/anat/sub-cal104_ses-M0_acq-MT_MTS.nii.gz (2 copies) 
  	5d6b1659-18d9-4766-aad3-f7d6f97aadff -- u119414@joplin.neuro.polymtl.ca:/tmp/tmp.GVDsJDPDQr/canproco
   	9d6c1dbe-a3bb-4509-b357-1139aec886c3 -- git@data.neuro.polymtl.ca:~/repositories/datasets/canproco.git [origin]
ok
$ git annex get "$file"
get sub-cal104/ses-M0/anat/sub-cal104_ses-M0_acq-MT_MTS.nii.gz (from origin...) 
ok                                    
(recording state in git...)
$ du -sh "$file"
4.9M	sub-cal104/ses-M0/anat/sub-cal104_ses-M0_acq-MT_MTS.nii.gz

For your local copy of the repository, I would recommend making a new git clone git@data.neuro.polymtl.ca:datasets/canproco to make sure you're also in a good state.

> At some point we discussed the possibility of dropping git-annex and relying only on git (with the inconvenience of double-size repos). In light of the repeated confusion in the lab since git-annex was introduced, maybe we can revisit this option?

I guess that's always an option, but that's a bigger discussion than will fit in this thread!

kousu commented

Just a gentle note that this is off topic for this repository. I'm going to close this. Please reopen in data-management if needed.