iterative/scmrepo

dvc has problems when there is a git "insteadOf" configuration in place that transforms "https://" urls to "ssh://" urls

larsks opened this issue · 2 comments

I'm running this in a clean environment (an Ubuntu 23.04 container into which I've installed git, python3, etc, and no explicit git configuration other than user.name and user.email).

I start with an empty repository:

git init dvctest
cd dvctest
echo 'dvc example' > README.md
git add README.md
git commit -m 'Initial commit'

And then install dvc into a virtual environment:

python3 -m venv .venv
. .venv/bin/activate
pip install dvc

And initialize dvc in the directory:

dvc init

Now, if I try the dvc get command from the "Get Started" document, it works as expected:

dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

But with this Git configuration in place:

# git config --global url.ssh://git@github.com/.insteadof https://github.com/

The same command fails:

# dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
ERROR: failed to get 'get-started/data.xml' from 'https://github.com/iterative/dataset-registry' - Git failed to fetch ref from 'https://github.com/iterative/dataset-registry'

Running with -v, it looks as if remote.ls_remotes() is throwing an authentication error:

Traceback (most recent call last):
  File "/dvctest/.venv/lib/python3.11/site-packages/funcy/flow.py", line 84, in reraise
    yield
  File "/dvctest/.venv/lib/python3.11/site-packages/scmrepo/git/backend/pygit2/__init__.py", line 704, in fetch_refspecs
    for head in remote.ls_remotes(callbacks=cb)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dvctest/.venv/lib/python3.11/site-packages/pygit2/remote.py", line 164, in ls_remotes
    self.connect(callbacks=callbacks, proxy=proxy)
  File "/dvctest/.venv/lib/python3.11/site-packages/pygit2/remote.py", line 112, in connect
    payload.check_error(err)
  File "/dvctest/.venv/lib/python3.11/site-packages/pygit2/callbacks.py", line 98, in check_error
    check_error(error_code)
  File "/dvctest/.venv/lib/python3.11/site-packages/pygit2/errors.py", line 65, in check_error
    raise GitError(message)
_pygit2.GitError: authentication required but no callback set

But that doesn't make sense, because cloning the remote repository works just fine:

# git clone https://github.com/iterative/dataset-registry
Cloning into 'dataset-registry'...
remote: Enumerating objects: 296, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 296 (delta 52), reused 43 (delta 37), pack-reused 205
Receiving objects: 100% (296/296), 45.06 KiB | 1.22 MiB/s, done.
Resolving deltas: 100% (84/84), done.

You can see that git has replaced the https:// url with an ssh:// url:

# git -C dataset-registry remote -v
origin  ssh://git@github.com/iterative/dataset-registry (fetch)
origin  ssh://git@github.com/iterative/dataset-registry (push)
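
For readers unfamiliar with the rewrite shown above, here is a minimal sketch of how an insteadOf rule is applied to a URL. It is an illustration only, not git's or scmrepo's actual code. The helper name `apply_instead_of` and the `rules` mapping are hypothetical. The one rule it relies on is the documented git behaviour that, when several insteadOf prefixes match, the longest match is used.

```python
def apply_instead_of(url, rules):
    """Rewrite `url` the way a `url.<base>.insteadOf` setting would.

    `rules` is a hypothetical mapping of base URL -> list of insteadOf
    prefixes, e.g. {"ssh://git@github.com/": ["https://github.com/"]}.
    """
    best_base, best_prefix = None, ""
    for base, prefixes in rules.items():
        for prefix in prefixes:
            # Keep the longest matching prefix, as git documents.
            if url.startswith(prefix) and len(prefix) > len(best_prefix):
                best_base, best_prefix = base, prefix
    if best_base is None:
        return url  # no rule matched, URL is left untouched
    return best_base + url[len(best_prefix):]


# The single rule configured in this report.
rules = {"ssh://git@github.com/": ["https://github.com/"]}
print(apply_instead_of("https://github.com/iterative/dataset-registry", rules))
# prints: ssh://git@github.com/iterative/dataset-registry
```

A client that does not perform this rewrite itself would still try to contact the original https:// URL, which may behave differently from the command-line git client.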

And we can run git ls-remote without a problem:

# git -C dataset-registry ls-remote
From ssh://git@github.com/iterative/dataset-registry
0f1b2967161751e1bc6b117952588bcfca123d89        HEAD
6672e265ea03930dba33146b0533942dcb6c5f30        refs/heads/artifact
e9769688078894f478b5051039a576c7e793e187        refs/heads/docs-dvc-remote
.
.
.

If I explicitly use an ssh url in the dvc get command, like this:

# dvc get -v ssh://git@github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

Then it works fine:

2024-01-26 15:44:49,135 DEBUG: v3.42.0 (pip), CPython 3.11.4 on Linux-6.6.12-100.fc38.x86_64-x86_64-with-glibc2.37
2024-01-26 15:44:49,135 DEBUG: command: /dvctest/.venv/bin/dvc get -v ssh://git@github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-01-26 15:44:49,213 DEBUG: Creating external repo ssh://git@github.com/iterative/dataset-registry@None
2024-01-26 15:44:49,213 DEBUG: erepo: git clone 'ssh://git@github.com/iterative/dataset-registry' to a temporary dir
2024-01-26 15:44:52,362 DEBUG: Analytics is enabled.
2024-01-26 15:44:52,379 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpyvk2pa46', '-v']
2024-01-26 15:44:52,384 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpyvk2pa46', '-v'] with pid 5207
2024-01-26 15:44:52,384 DEBUG: Removing '/tmp/tmpz4ybeh7wdvc-clone'
2024-01-26 15:44:52,386 DEBUG: Removing '/tmp/tmp6aot93__dvc-cache'

This should be fixed in scmrepo==2.1.1, which was just released.

With the updated scmrepo I was getting a new error...

ERROR: unexpected error - [Errno 2] No storage files available: 'get-started/data.xml' 

...but it turns out that's because requests, to my surprise, parses ~/.netrc by default and was picking up some credentials it should not have been using. With that file out of the way, I am able to successfully dvc get.
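
For anyone hitting the same surprise, a small sketch for checking whether a netrc file holds an entry that could be picked up for a given URL. It uses only the standard-library `netrc` module and approximates a lookup by hostname; it is not the code requests runs. The helper name `netrc_login_for` and the explicit `netrc_path` argument are assumptions made for illustration.

```python
import netrc
from urllib.parse import urlparse


def netrc_login_for(url, netrc_path):
    """Return the login a netrc file would supply for `url`, or None.

    Hypothetical helper: looks the URL's hostname up in `netrc_path`.
    A `default` entry in the file, if present, matches any host.
    """
    host = urlparse(url).hostname
    try:
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return None  # missing or unparsable file: nothing to pick up
    # entry is a (login, account, password) tuple when a match exists.
    return entry[0] if entry else None
```

Calling it with the path to `~/.netrc` and the URL of the storage host shows whether stale credentials are in play. In requests itself, setting `trust_env = False` on a `Session` disables the netrc lookup (along with environment proxy settings).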