iterative/scmrepo

clone: seems to be pulling orphaned revisions unlike `git clone`

efiop opened this issue · 6 comments

efiop commented

One of our users (https://iterativeai.slack.com/archives/C03JS2V4MQU/p1689332412460989?thread_ts=1689272591.583879&cid=C03JS2V4MQU) found that when using dvcfs, we were downloading large files that they had removed from history. When they ran `git clone`, those files were not downloaded.

Need to check if maybe one of our git backends is accidentally cloning more than intended.

pmrowla commented

Those files probably aren't orphaned; more likely they are referenced by exp refs that are still pushed to the repo. `git clone` does not fetch exp refs, but `git clone --mirror` (and DVC's clone implementation) do.
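To illustrate the difference, here is a minimal sketch of which remote refs each kind of clone would pick up. The patterns are simplified assumptions: a plain clone is modeled as fetching branches and tags, a mirror clone as fetching everything under `refs/`, and the experiment ref name is hypothetical (DVC keeps exp refs under `refs/exps/`).

```python
from fnmatch import fnmatchcase

# Simplified models of what each clone mode fetches (assumptions, not git's exact logic).
PLAIN_CLONE_PATTERNS = ("refs/heads/*", "refs/tags/*")
MIRROR_CLONE_PATTERNS = ("refs/*",)


def refs_fetched(remote_refs, patterns):
    """Return the refs that match at least one of the given patterns."""
    return [
        ref
        for ref in remote_refs
        if any(fnmatchcase(ref, pattern) for pattern in patterns)
    ]


remote_refs = [
    "refs/heads/main",
    "refs/tags/v1.0",
    "refs/exps/ab/cdef0123/exp-large-file",  # hypothetical experiment ref
]

plain = refs_fetched(remote_refs, PLAIN_CLONE_PATTERNS)
mirror = refs_fetched(remote_refs, MIRROR_CLONE_PATTERNS)
```

In this model the experiment ref only shows up in `mirror`, so any large file reachable solely from that ref would be downloaded by a mirror-style clone but not by a plain one.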

efiop commented

@pmrowla Indeed, that explains it. Thank you!

> (and DVC's clone implementation) do

Is it intentional though?

pmrowla commented

It's intentional, so that you can `dvc import` or `dvc get` from named DVC experiments the same way you can import from a branch or tag name. We could try to make that lazier and only search for and fetch exp refs if we fail to resolve a name on import, but that would be a DVC issue and not an scmrepo one.
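The lazier approach could look roughly like the sketch below. Everything here is hypothetical: `resolve_rev_lazily`, `resolve`, and `fetch_exp_refs` are illustrative names and not part of DVC or scmrepo, and the fakes only stand in for a real repository.

```python
from typing import Callable, Optional


def resolve_rev_lazily(
    name: str,
    resolve: Callable[[str], Optional[str]],
    fetch_exp_refs: Callable[[], None],
) -> Optional[str]:
    """Resolve a name, fetching exp refs only if the first attempt fails."""
    sha = resolve(name)
    if sha is not None:
        return sha  # branch or tag found, no extra fetch needed
    fetch_exp_refs()  # fall back to pulling experiment refs
    return resolve(name)


# Fake repository state for demonstration purposes only.
known = {"main": "1111111"}
fetch_calls = []


def fake_resolve(name):
    return known.get(name)


def fake_fetch_exp_refs():
    fetch_calls.append("fetch")
    known["exp-a"] = "2222222"  # the experiment becomes resolvable after the fetch


branch_sha = resolve_rev_lazily("main", fake_resolve, fake_fetch_exp_refs)
calls_after_branch = len(fetch_calls)
exp_sha = resolve_rev_lazily("exp-a", fake_resolve, fake_fetch_exp_refs)
calls_after_exp = len(fetch_calls)
```

With this shape, importing from a branch or tag never triggers the exp ref fetch; only an unresolved name pays for it.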

efiop commented

@pmrowla Makes sense. So you mean that at least the vanilla clone implemented in scmrepo should not clone exp refs by default, right? We should have some kind of flag, so that the default behaviour is closer to `git clone`. Or am I misunderstanding something?

pmrowla commented

Vanilla `Git.clone` in scmrepo doesn't do anything with exp refs. The exp ref behavior is kept in the DVC erepo code (it's technically an additional fetch after the default clone finishes):
https://github.com/iterative/dvc/blob/f1764bdc772916d40f824531705fffdfc462793e/dvc/repo/open_repo.py#L217C20-L217C20
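
As a rough command-line equivalent of that two-step flow, the sketch below builds a plain clone followed by a separate fetch of the experiment refs. This is not DVC's actual code (which goes through scmrepo rather than the git CLI); the function name and the `+refs/exps/*:refs/exps/*` refspec are assumptions for illustration.

```python
def clone_with_exp_refs_commands(url, dest):
    """Build the git commands for a default clone plus an extra exp ref fetch."""
    return [
        # Step 1: default clone, which leaves exp refs alone.
        ["git", "clone", url, dest],
        # Step 2: additional fetch that brings in the experiment refs (assumed refspec).
        ["git", "-C", dest, "fetch", "origin", "+refs/exps/*:refs/exps/*"],
    ]


commands = clone_with_exp_refs_commands("https://example.com/repo.git", "repo")
```

Each command list could be passed to `subprocess.run(cmd, check=True)` in order; skipping the second one gives behaviour closer to a plain `git clone`.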

efiop commented

@pmrowla Ah, indeed, I completely missed that one. Thank you. Closing this issue for now then. Overall, it is now clear that this is intended behaviour, so no action is needed here.