git-annex: support forking
kousu opened this issue · 2 comments
I just forked https://data.dev.neuropoly.org/neuropoly/spine-generic-single -> https://data.dev.neuropoly.org/kousu/spine-generic-single.
Server side, this caused a local clone:
gitea@data:~/data/gitea-repositories$ cd kousu/spine-generic-single.git/
gitea@data:~/data/gitea-repositories/kousu/spine-generic-single.git$ git remote -v
origin /srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git (fetch)
origin /srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git (push)
and per git-clone(1)
-l, --local When the repository to clone from is on a local machine, this flag bypasses the normal "Git aware" transport mechanism and clones the repository by making a copy of HEAD and everything under objects and refs directories. The files under .git/objects/ directory are hardlinked to save space when possible. If the repository is specified as a local path (e.g., /path/to/repo), this is the default, and --local is essentially a no-op.
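To see that man page behaviour in isolation, here is a minimal sketch that clones a throwaway repository by local path and lists the object files that end up with two links. The helper names (make_local_clone, hardlinked_objects) and the demo paths are hypothetical, and it assumes both repositories sit on the same filesystem.

```bash
#!/usr/bin/env bash
# make_local_clone DIR: build a tiny repo in DIR/src and clone it by local
# path into DIR/fork.git (all names here are throwaway).
make_local_clone() {
    git init --quiet "$1/src"
    echo hello > "$1/src/file"
    git -C "$1/src" add file
    git -C "$1/src" -c user.name=demo -c user.email=demo@example.com \
        commit --quiet -m init
    # A local path as the source means --local is implied.
    git clone --quiet --bare "$1/src" "$1/fork.git"
}

# hardlinked_objects DIR: list object files under DIR with a link count of 2.
hardlinked_objects() {
    find "$1" -path '*/objects/*' -type f -links 2
}

demo=$(mktemp -d)
make_local_clone "$demo"
hardlinked_objects "$demo"    # each shared object is listed once per repository
rm -rf "$demo"
```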
Evidence
gitea@data:~/data/gitea-repositories$ find . -links 2 -type f
./neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
./neuropoly/spine-generic-single.git/objects/f8/b1bcb88b6dfc3503c3493ae732d38f3a37135d
./neuropoly/spine-generic-single.git/objects/f8/747259e448b3a32a6917c5d63862ff731e1059
./neuropoly/spine-generic-single.git/objects/f8/6cd7b6e9b495c79f172e3c9996848de028e6bd
./neuropoly/spine-generic-single.git/objects/f8/ea925abf181f1d007a19c4d7655f007e78d746
./neuropoly/spine-generic-single.git/objects/f8/64332f6c2267b256ea4448bb36b6ebe8ec11a9
./neuropoly/spine-generic-single.git/objects/f8/a78f3f0c2ba75c2ccf0fea91a8501556e90acc
./neuropoly/spine-generic-single.git/objects/e5/ba1898582c0c46fdeffdc49fdc596093f1355e
./neuropoly/spine-generic-single.git/objects/99/c56b08b0508e0e6a9865696d152d31fa92ee17
[...]
./neuropoly/spine-generic-single.git/objects/54/f59b8b8b0a843937ac73251a9b34d432cf8ac0
./neuropoly/spine-generic-single.git/objects/54/1e6e09c8d6b6e1e63fba519ba2d2c68a9e00a1
./kousu/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
./kousu/spine-generic-single.git/objects/f8/b1bcb88b6dfc3503c3493ae732d38f3a37135d
./kousu/spine-generic-single.git/objects/f8/747259e448b3a32a6917c5d63862ff731e1059
./kousu/spine-generic-single.git/objects/f8/6cd7b6e9b495c79f172e3c9996848de028e6bd
./kousu/spine-generic-single.git/objects/f8/ea925abf181f1d007a19c4d7655f007e78d746
./kousu/spine-generic-single.git/objects/f8/64332f6c2267b256ea4448bb36b6ebe8ec11a9
./kousu/spine-generic-single.git/objects/f8/a78f3f0c2ba75c2ccf0fea91a8501556e90acc
./kousu/spine-generic-single.git/objects/e5/ba1898582c0c46fdeffdc49fdc596093f1355e
./kousu/spine-generic-single.git/objects/99/c56b08b0508e0e6a9865696d152d31fa92ee17
And to make doubly sure, here's looking one up by its actual inode number:
gitea@data:~/data/gitea-repositories$ stat neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
File: neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
Size: 198 Blocks: 8 IO Block: 4096 regular file
Device: fc01h/64513d Inode: 1032761 Links: 2
Access: (0444/-r--r--r--) Uid: ( 996/ gitea) Gid: ( 996/ gitea)
Access: 2022-12-12 00:00:00.137276119 -0500
Modify: 2022-11-30 02:24:29.875003217 -0500
Change: 2022-12-12 16:59:12.058919207 -0500
Birth: 2022-11-30 02:24:29.875003217 -0500
gitea@data:~/data/gitea-repositories$ find . -inum 1032761
./neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
./kousu/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
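The same check can be scripted without reading inode numbers by eye. A minimal sketch, assuming a shell whose `[` supports `-ef` (bash and dash do); same_inode is a hypothetical helper name and the demo files are throwaway stand-ins for the two object paths above.

```bash
#!/usr/bin/env bash
# same_inode A B: succeed if A and B are the same file on disk
# (same device and inode), i.e. hard links of each other.
same_inode() {
    [ "$1" -ef "$2" ]
}

# Demo on throwaway files; on the server the two arguments would be the
# matching object paths under neuropoly/ and kousu/.
demo=$(mktemp -d)
echo "blob" > "$demo/original"
ln "$demo/original" "$demo/hardlink"    # second name for the same inode
cp "$demo/original" "$demo/copy"        # independent copy with its own inode

same_inode "$demo/original" "$demo/hardlink" && echo "hardlink: shared inode"
same_inode "$demo/original" "$demo/copy" || echo "copy: separate inode"
rm -rf "$demo"
```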
But the repo sizes are wildly different: ~885MB for the original vs ~1.5MB for the fork.
And this is of course because it didn't clone the annex files:
gitea@data:~/data/gitea-repositories$ ls kousu/spine-generic-single.git/annex
ls: cannot access 'kousu/spine-generic-single.git/annex': No such file or directory
And of course this means the fork is broken: cloning it succeeds, but git annex get cannot find any of the content:
p115628@joplin:~/src/neurogitea/test$ git clone https://data.dev.neuropoly.org/kousu/spine-generic-single spine-generic-single-fork
Cloning into 'spine-generic-single-fork'...
remote: Enumerating objects: 3703, done.
remote: Counting objects: 100% (3703/3703), done.
remote: Compressing objects: 100% (1255/1255), done.
remote: Total 3703 (delta 2015), reused 2942 (delta 1550), pack-reused 0
Receiving objects: 100% (3703/3703), 338.08 KiB | 9.39 MiB/s, done.
Resolving deltas: 100% (2015/2015), done.
p115628@joplin:~/src/neurogitea/test$ cd spine-generic-single-fork/
p115628@joplin:~/src/neurogitea/test/spine-generic-single-fork$ git annex get
(merging origin/git-annex origin/synced/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
get derivatives/labels/sub-douglas/anat/sub-douglas_T1w_RPI_r_labels-manual.nii.gz (not available)
Maybe add some of these git remotes (git remote add ...):
5c733c49-b0a9-4d18-989a-11829918dcc1 -- gitea@data.dev.neuropoly.org:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi_moco_dwi_mean_seg-manual.nii.gz (not available)
Maybe add some of these git remotes (git remote add ...):
5c733c49-b0a9-4d18-989a-11829918dcc1 -- gitea@data.dev.neuropoly.org:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_labels-manual.nii.gz (not available)
Same over SSH:
p115628@joplin:~/src/neurogitea/test$ git clone gitea@data.dev.neuropoly.org:kousu/spine-generic-single.git spine-generic-single-fork
Cloning into 'spine-generic-single-fork'...
remote: Enumerating objects: 3703, done.
remote: Counting objects: 100% (3703/3703), done.
remote: Compressing objects: 100% (1255/1255), done.
remote: Total 3703 (delta 2015), reused 2942 (delta 1550), pack-reused 0
Receiving objects: 100% (3703/3703), 338.08 KiB | 9.39 MiB/s, done.
Resolving deltas: 100% (2015/2015), done.
p115628@joplin:~/src/neurogitea/test$ cd spine-generic-single-fork/
p115628@joplin:~/src/neurogitea/test/spine-generic-single-fork$ git annex get
(merging origin/git-annex origin/synced/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
get derivatives/labels/sub-douglas/anat/sub-douglas_T1w_RPI_r_labels-manual.nii.gz (not available)
Maybe add some of these git remotes (git remote add ...):
5c733c49-b0a9-4d18-989a-11829918dcc1 -- gitea@data.dev.neuropoly.org:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi_moco_dwi_mean_seg-manual.nii.gz (not available)
Maybe add some of these git remotes (git remote add ...):
5c733c49-b0a9-4d18-989a-11829918dcc1 -- gitea@data.dev.neuropoly.org:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_labels-manual.nii.gz (not available)
Maybe add some of these git remotes (git remote add ...):
But if I run git annex get inside the remote repo:
gitea@data:~/data/gitea-repositories/kousu/spine-generic-single.git$ git annex get
(recording state in git...)
get SHA256E-s896332--71a1699d1944f4817f8aaf0d0d36660576649eeaafd56273f67437855135d3d1.nii.gz (from origin...)
ok
get SHA256E-s2101125--c07a5070d63235cd576195a5a3580152dd079e4399e18d4b74e5efba4cceef83.nii.gz (from origin...)
ok
get SHA256E-s1755316--3564eb18fc031d066a4c3f2956a40ffa60a8b4d12b8a5cdbc2f24eb5d7b92e3c.nii.gz (from origin...)
ok
get SHA256E-s8190151--594c0a052fae3ee009212af444420398ba9874502dec4ec23d96157bff7eeed2.nii.gz (from origin...)
ok
get SHA256E-s1756168--2fef600a9ddee9cacdf83d94068b786d213f0b598b0ada4417da0416e078b15c.nii.gz (from origin...)
ok
get SHA256E-s1455109--edc02370aaef945de7e3a13fe0e975a7fbb01af76c5ccfb69cda44f0a24e2bf7.nii.gz (from origin...)
[..]
get SHA256E-s3350533--aae0efb7544e05e33bde3d8fd3b633a7a41eee629bc7c231d4016bc7cd09670b.nii.gz (from origin...)
ok
get SHA256E-s1824049--5472fd5f7ca43b8d2c3b35ca210ccaf5373f709cdf4b08845df8b221ba0c025b.nii.gz (from origin...)
ok
get SHA256E-s1180152--efdf45e83f7548c1632214c8a8332db44eed4f1581523e3142d7f180bb6762cd.nii.gz (from origin...)
ok
get SHA256E-s1082687--290a43b80da6f608e3d47107f3b6c05e98eebe56ed4eea633748c08bd1a7837a.nii.gz (from origin...)
ok
(recording state in git...)
Then a fresh clone of the fork works:
p115628@joplin:~/src/neurogitea/test$ git clone gitea@data.dev.neuropoly.org:kousu/spine-generic-single.git spine-generic-single-fork
Cloning into 'spine-generic-single-fork'...
remote: Enumerating objects: 4134, done.
remote: Counting objects: 100% (4134/4134), done.
remote: Compressing objects: 100% (1544/1544), done.
remote: Total 4134 (delta 2296), reused 2943 (delta 1550), pack-reused 0
Receiving objects: 100% (4134/4134), 360.49 KiB | 6.21 MiB/s, done.
Resolving deltas: 100% (2296/2296), done.
p115628@joplin:~/src/neurogitea/test$ cd spine-generic-single-fork/
p115628@joplin:~/src/neurogitea/test/spine-generic-single-fork$ git annex get
(merging origin/git-annex origin/synced/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
get derivatives/labels/sub-douglas/anat/sub-douglas_T1w_RPI_r_labels-manual.nii.gz (from origin...)
ok
get derivatives/labels/sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi_moco_dwi_mean_seg-manual.nii.gz (from origin...)
ok
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_labels-manual.nii.gz (from origin...)
ok
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_seg-manual.nii.gz (from origin...)
ok
get derivatives/labels/sub-perform/anat/sub-perform_T1w_RPI_r_labels-manual.nii.gz (from origin...)
ok
get derivatives/labels/sub-perform/anat/sub-perform_T1w_RPI_r_seg-manual.nii.gz (from origin...)
ok
get derivatives/labels/sub-perform/dwi/sub-perform_dwi_moco_dwi_mean_seg-manual.nii.gz (from origin...)
[...]
So, we need to add a call to git annex get to the Gitea "Fork" button -- but only in git-annex repos, of course.
However, if we can, we should try to use hardlinks the way git clone does, as the git annex get I ran above actually made copies:
gitea@data:~/data/gitea-repositories$ du -hs kousu/spine-generic-single.git/ neuropoly/spine-generic-single.git/
886M kousu/spine-generic-single.git/
882M neuropoly/spine-generic-single.git/
The key seems to be annex.hardlink. I deleted and reforked the repo, then ran:
gitea@data:~/data/gitea-repositories$ git config --global annex.hardlink true
Then copying the annex files was much faster
gitea@data:~/data/gitea-repositories$ cd kousu/spine-generic-single.git/
gitea@data:~/data/gitea-repositories/kousu/spine-generic-single.git$ git annex get
get SHA256E-s896332--71a1699d1944f4817f8aaf0d0d36660576649eeaafd56273f67437855135d3d1.nii.gz (from origin...)
ok
get SHA256E-s2101125--c07a5070d63235cd576195a5a3580152dd079e4399e18d4b74e5efba4cceef83.nii.gz (from origin...)
ok
[...]
get SHA256E-s1082687--290a43b80da6f608e3d47107f3b6c05e98eebe56ed4eea633748c08bd1a7837a.nii.gz (from origin...)
ok
(recording state in git...)
git-annex: get: 12 failed
And the counts come out showing they are indeed now avoiding the duplication:
gitea@data:~/data/gitea-repositories$ du -hs kousu/spine-generic-single.git/ neuropoly/spine-generic-single.git/
886M kousu/spine-generic-single.git/
2,9M neuropoly/spine-generic-single.git/
gitea@data:~/data/gitea-repositories$ # but counting them separately shows them as full sized
gitea@data:~/data/gitea-repositories$ du -hs kousu/spine-generic-single.git/; du -hs neuropoly/spine-generic-single.git/
886M kousu/spine-generic-single.git/
885M neuropoly/spine-generic-single.git/
gitea@data:~/data/gitea-repositories$
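The difference between the two du runs comes from du charging a hardlinked file only once per invocation, to whichever directory it walks first. A minimal sketch reproducing that on throwaway data; make_linked_pair and the directory names are hypothetical, and it uses random bytes so a compressing filesystem does not hide the size.

```bash
#!/usr/bin/env bash
# make_linked_pair DIR: create DIR/upstream/blob (1 MiB) and a hard link to
# it at DIR/fork/blob, mimicking one annexed file shared by two repos.
make_linked_pair() {
    mkdir "$1/upstream" "$1/fork"
    head -c 1048576 /dev/urandom > "$1/upstream/blob"
    ln "$1/upstream/blob" "$1/fork/blob"
}

demo=$(mktemp -d)
make_linked_pair "$demo"

# One invocation: the shared inode is counted for the first directory only,
# so the second directory looks almost empty.
du -sk "$demo/fork" "$demo/upstream"

# Separate invocations: each run meets the inode for the first time,
# so both directories report the full size.
du -sk "$demo/fork"; du -sk "$demo/upstream"
rm -rf "$demo"
```

So the small figure in the combined run is the evidence of deduplication; the separate runs only show what each repository would cost on its own.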
The git-annex manpage says
When a repository is set up using git clone --shared, git-annex init will automatically set annex.hardlink and mark the repository as untrusted.
which I guess means gitea is not doing git clone --shared. Perhaps a pity? But probably not something we can risk changing.
It also warns
Use with caution -- This can invalidate numcopies counting, since with hard links, fewer copies of a file can exist. So, it is a good idea to mark a repository using this setting as untrusted.
but I think that's just a standard assumption we always have to live with (git-annex makes a lot of design choices and assumptions that aren't actually enforceable in physical reality, where entropy exists).
Note: this triggered #32, in a different way than before, because the git annex get was run after the repo size had been cached. But as in #32, a single git push was enough to trigger the size recomputation.
tl;dr:
- make gitea set git config annex.hardlink true, either in all repos it creates or with --global (I'm unsure which is better)
- add git annex get to the internal fork process
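As a rough illustration of those two steps, here is a shell sketch of what a post-fork step could do. It is not Gitea's code: is_annex_repo and annex_after_fork are hypothetical names, detecting an annex repo by the presence of a refs/heads/git-annex branch is an assumption, and it picks the per-repository config rather than --global only to keep the example self-contained.

```bash
#!/usr/bin/env bash
# is_annex_repo REPO: succeed if the repository at REPO has a git-annex
# branch (assumed here to be a good enough marker of a git-annex repo).
is_annex_repo() {
    git -C "$1" show-ref --verify --quiet refs/heads/git-annex
}

# annex_after_fork REPO: hypothetical step to run once the fork's bare
# clone exists on the server.
annex_after_fork() {
    repo=$1
    if ! is_annex_repo "$repo"; then
        echo "skip: $repo has no git-annex branch"
        return 0
    fi
    # Per-repository setting; the experiment above used --global instead.
    git -C "$repo" config annex.hardlink true
    # Pull annexed content from the fork's origin, i.e. the upstream
    # repository on the same disk. Requires git-annex to be installed.
    git -C "$repo" annex get
}

# Usage (path is illustrative):
#   annex_after_fork /path/to/fork.git
```

A non-zero exit from git annex get (as with the "12 failed" run above) propagates to the caller, so the fork process would have to decide whether that counts as a failed fork.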