commercialhaskell/pantry

"Could not parse object" when referencing the commit of a github pull request

Closed this issue · 8 comments

I had this error from stack build, but I gather the problem is in pantry.

If I reference a repository on github, and set the commit hash to reference a commit found in a pull request made from a fork, I get the error fatal: Could not parse object '<hash>'. If I reference the repository that the PR was made from, it works fine.

For example (though this example won't work if the PR in question gets merged), if I have an extra-deps entry

  - git: https://github.com/haskell-streaming/streaming-postgresql-simple
    commit: 32762bdb7e352200ea3dd93f6466a060752baae6

referencing haskell-streaming/streaming-postgresql-simple#3, I get the error:

$ stack build --fast
Cloning 32762bdb7e352200ea3dd93f6466a060752baae6 from https://github.com/haskell-streaming/streaming-postgresql-simple
Received ExitFailure 128 when running
Raw command: /usr/bin/git reset --hard 32762bdb7e352200ea3dd93f6466a060752baae6
Run from: /tmp/with-repo949997/cloned
Standard error:

fatal: Could not parse object '32762bdb7e352200ea3dd93f6466a060752baae6'.

If I change the repository to https://github.com/ivan-m/streaming-postgresql-simple, I don't.

I can reproduce the error with git itself. After a clone, if I manually run the raw command:

$ git reset --hard 32762bdb7e352200ea3dd93f6466a060752baae6
fatal: Could not parse object '32762bdb7e352200ea3dd93f6466a060752baae6'.

An easy solution here is to fetch the commit explicitly:

$ git fetch origin 32762bdb7e352200ea3dd93f6466a060752baae6
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (14/14), done.
remote: Total 42 (delta 14), reused 14 (delta 14), pack-reused 28
Unpacking objects: 100% (42/42), 6.05 KiB | 344.00 KiB/s, done.
From github.com:haskell-streaming/streaming-postgresql-simple
 * branch            32762bdb7e352200ea3dd93f6466a060752baae6 -> FETCH_HEAD
$ git reset --hard 32762bdb7e352200ea3dd93f6466a060752baae6
HEAD is now at 32762bd Re-export exceptions and connection functions from postgresql-simple

So I expect that adding this extra fetch to Pantry.Repo.withRepo would fix the issue. It does have the downside that the extra fetch takes a bit of extra time, even if the commit is already available. There are probably more sophisticated approaches that would avoid that.

Oh, it turns out that using github instead of git works fine (i.e. github: haskell-streaming/streaming-postgresql-simple, keeping the commit: line). It doesn't seem to work with whatever nix setup my company is using, though, so it would still be nice for git to work.

@ChickenProp, I understand (from https://docs.haskellstack.org/en/stable/pantry/#package-location) that the difference is that a github: location downloads an archive of the repository from GitHub, rather than cloning it with Git.

I also understand that 'forking' and 'pull requests' are a feature of GitHub and not a feature of Git itself. So, it seems to me that the git: specification cannot adopt the GitHub 'archive' feature.

So I'm no expert, but I don't think my proposed git fetch would be GitHub-specific. Forking and PRs are indeed features of GitHub, but I think what's happening here can be described purely in terms of git:

  1. When you git clone, git downloads some subset of the origin's commits, not all of them.
  2. When you git fetch a specific commit, it downloads that one.
  3. When you git reset to a commit, that commit needs to have been downloaded already.

Where github comes into it is that apparently the commits downloaded in a clone don't include commits found only in unmerged PRs made from forks. But those commits are still part of the repo, which is why git fetch can work.
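That description can be demonstrated with git alone, using two local repositories in place of GitHub. This is only a sketch: the repository layout and the uploadpack.allowAnySHA1InWant setting are assumptions of the demo, standing in for whatever GitHub does server-side to make PR commits fetchable by hash.

```shell
#!/bin/sh
# Demo: a clone only brings over commits reachable from named refs,
# but an explicit fetch of a commit hash can retrieve others.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# An "upstream" repository with a single commit on its only branch.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"

# Simulate an unmerged PR: create a commit that lives in upstream's
# object store but is not reachable from any branch or tag.
pr=$(git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit-tree "HEAD^{tree}" -p HEAD -m "pr commit")

# Stand-in for GitHub's server-side behaviour: allow clients to
# request arbitrary objects by hash.
git -C upstream config uploadpack.allowAnySHA1InWant true

# Step 1: clone (--no-local forces the normal fetch machinery, as
# over a network); the unreferenced commit does not come across,
# so the reset fails with "Could not parse object".
git clone -q --no-local upstream cloned
cd cloned
if git reset --hard -q "$pr" 2>/dev/null; then
  first_reset=succeeded
else
  first_reset=failed
fi

# Steps 2 and 3: fetch the commit explicitly, then the reset works.
git fetch -q origin "$pr"
git reset --hard -q "$pr"
echo "first reset: $first_reset; HEAD is now $(git rev-parse HEAD)"
```

Running the analogous sequence against a real GitHub repository behaves the same way, which is why this looks like ordinary git semantics rather than anything GitHub-specific.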

It's possible that github is doing something unusual by not sending every commit in the original clone. In that case no other upstream would ever need the extra fetch, but I'd expect it to at least be harmless (apart from the time cost, which I think should be avoidable). But I'd be a bit surprised if this is the case. My guess would be that this is standard behavior, something like "you get every commit reachable from a named branch". Then we'd also get this error when pointing at a commit on a branch that has since been force-pushed away, and the git fetch would fix it.

I'm not going to do this right now, but I think I could test that, and also test a gitlab instance and see if it has the same behavior with PRs. (I don't think I have access to a gitlab instance where I could test the force pushing thing.)

OK, I understand from this that, in the case of GitHub:

Once a pull request is opened, GitHub stores all of the changes remotely. In other words, commits in a pull request are available in a repository even before the pull request is merged. You can fetch an open pull request and recreate it as your own.

I don't know what it means for a commit in a pull request to be 'available' in a remote repository, but not available to be included in a clone with git clone - I'm still investigating.

So, to recap, with a GitHub repository and:

- git: <repo>
  commit: <commit>

Stack, effectively, does:

git clone <repo> <dir>
cd <dir>
git reset --hard <commit>

and that does not work if <commit> relates to an unmerged pull request from a different repository <fork>.

The alternatives appear to be:

  1. Specify:

    - git: <fork>
      commit: <commit>

    If the user knows the identity of the <commit>, they should also know the <fork> from which it comes.

  2. Change the behaviour of Pantry so that it effectively does:

    git clone <repo> <dir>
    cd <dir>
    git fetch <repo> <commit> # This will be redundant for commits brought across by the clone
    git reset --hard <commit>
    

    The fetch has a cost for all users, but that cost is likely small relative to the cost of the clone.

If I were to make the call, I would say that Alternative 1 (plus clear online documentation) is the better one. However, others may have a different view.

That sounds about right. Some clarifications:

and that does not work if <commit> relates to an unmerged pull request from a different repository <fork>.

To confirm, I just tested with gitlab and it behaves the same in this case.

And on github (and I assume gitlab but didn't test this one) the reset-without-fetch also doesn't work if a commit is on a deleted branch, but fetch-then-reset does work in that case. I don't know if it will continue to work indefinitely; I think GH won't garbage collect it if that commit can still be found in the GH user interface (e.g. if it was a PR that got rejected and the branch deleted), but might eventually do so if not (e.g. someone just pushed a branch and then deleted it, but then how did you find it?). If it will get GCed, then it might actually be better for pantry to fail before then, so users can still look up the commit manually (by copying its shasum into the URL bar) if necessary. But this feels like something that won't come up very often.

The alternatives appear to be:

Two more options:

  1. When the repo is on github, the user can also use github: instead of git:, and then the upstream repo continues to work. Though I don't know why this is; presumably pantry does different git commands in that case, but is there a reason for that? (This doesn't work for me for reasons to do with our nix setup, but probably works for most users.)

  2. Pantry could instead do a conditional fetch. Something like

    git clone <repo> <dir>
    cd <dir>
    git reset --hard <commit>
    # If that succeeded, great. If not:
    git fetch <repo> <commit>
    git reset --hard <commit>
    

    ought to work, or there's probably some way to check if the reset will succeed without actually trying.
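For the "check without actually trying" variant, here is a sketch of one way to do it, assuming git cat-file -e (which tests whether an object exists in the local repository without touching the working tree) is an acceptable probe:

```shell
#!/bin/sh
# Demo: probe for a commit locally before deciding whether to fetch.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
commit=$(git rev-parse HEAD)

# cat-file -e exits 0 iff the object exists; the ^{commit} suffix
# additionally checks that it is (or peels to) a commit.
if git cat-file -e "$commit^{commit}" 2>/dev/null; then
  present=yes    # safe to reset directly
else
  present=no     # fetch first, then reset
fi

# A hash no repository contains, for contrast.
bogus=0000000000000000000000000000000000000000
if git cat-file -e "$bogus^{commit}" 2>/dev/null; then
  bogus_present=yes
else
  bogus_present=no
fi

echo "real commit present: $present; bogus present: $bogus_present"
```

With a check like this, the fetch-then-retry branch only runs for commits the clone didn't bring across, so the common case pays no extra network cost.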

If the user knows the identity of the <commit>, they should also know the <fork> from which it comes.

Yeah. It's awkward for me for two reasons. One is that I have to edit both the commit: and the git: lines, and often I'll forget to edit the second and get the error. The other is about what happens next time I come to update this dependency. Suppose I'm following <repo> but want to use a PR from <fork>, maybe it's doing a version bump of dependencies or something. Then a few months later the dep needs updating again. I want to be looking at <repo>, not <fork>, to try to find recent releases and other PRs. But if the URL is pointing at <fork>, that's not clear unless I remembered to add a comment when I switched the dep to point at it.

I don't think documentation would help much with this, because for the most part I just don't look at these docs.

Neither of these is a big deal. It would be a small QOL improvement for me, but if you think it's not worth it, I'm not going to argue.

@ChickenProp, the conditional fetch seemed like a good idea. I've raised a pull request to implement it.

@mpilgrem Awesome, thank you!