martinvonz/jj

linux-x86: Most git operations fail with "failed to stat <some file>"

senekor opened this issue · 38 comments

Update/TLDR

It looks like the issue is that cargo binstall jj-cli installs a broken binary from QuickInstall for jj 0.18 on linux x86 systems.

Fix:

The problem is solved on recent versions of cargo-binstall. Make sure you're running the latest one before reinstalling jj to solve the problem:

cargo binstall cargo-binstall
cargo binstall --force jj-cli

We'd appreciate a confirmation that the fix works for people.

Alternatively, or if you want to be doubly sure, you can run

cargo binstall --strategies crate-meta-data --force jj-cli
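
For anyone scripting the check: below is a minimal sketch, assuming you already have the cargo-binstall version string in hand (for example from `cargo binstall -V`, if your version supports that flag). It only compares the string against 1.7.0, the release that a later comment in this thread names as containing the fix. The function names are made up for illustration and are not part of cargo-binstall.

```python
# Hypothetical helper: decide whether a cargo-binstall version is new
# enough to contain the fix discussed in this thread (released in 1.7.0).
FIXED_IN = (1, 7, 0)


def parse_version(text):
    """Turn a string such as '1.7.1' or 'v1.7.1' into a tuple of ints."""
    cleaned = text.strip().lstrip("v")
    return tuple(int(part) for part in cleaned.split("."))


def binstall_has_fix(version_text):
    """Return True if the given version is 1.7.0 or newer."""
    return parse_version(version_text) >= FIXED_IN


if __name__ == "__main__":
    for candidate in ("1.6.9", "1.7.0", "1.10.2"):
        print(candidate, binstall_has_fix(candidate))
```

Comparing tuples of integers rather than raw strings avoids the usual trap where "1.10.0" sorts before "1.7.0".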

Description

I just updated to 0.18 and most jj commands fail (some are fine, e.g. log) with similar error messages:

$ jj new x
Error: Git operation failed
Caused by: failed to stat '/home/senk/.gitconfig'; class=Config (7)

(I do have a git config, it just lives in ~/.config/git/config.)
If I touch that file, the process repeats with ./.git/info/grafts and ./.git/shallow.
Finally, git operations of jj seemingly work again (at least jj new).
However, using the git cli itself I get a warning about .git/info/grafts being deprecated.
So just creating these dummy files doesn't seem like a good solution.

Steps to Reproduce the Problem

This is the second computer I updated and the first didn't show this problem.
So I don't know how to reproduce it yet.

Expected Behavior

Git operations should succeed as normal.

Actual Behavior

$ jj new x
Error: Git operation failed
Caused by: failed to stat '/home/senk/.gitconfig'; class=Config (7)

Specifications

  • Platform: Linux
  • Version: 6.8.11-300.fc40.x86_64

This happens in all repositories, even when I do a fresh jj git init --colocate in a previously only git repo.

It does NOT happen in a repo that isn't colocated. Only noticing it now because I default to colocated repos.

Confirming I'm seeing this in my colocated repos as well with the upgrade to 0.18.0.

EDIT: I can't even fresh clone a repo actually, it's not even a matter of colocation.
EDIT2: Even if I add empty grafts and shallow files, other operations like git fetch fail with more "failed to stat" errors:

❯ jj git fetch
Error: failed to stat '/home/michael/dotfiles/.jj/repo/store/git/objects/pack/pack_git2_4978ca90985didx': Resource temporarily unavailable; class=Os (2)

I tried to reproduce this briefly on a Mac, but haven't been able to yet. I can also try Linux later. So, I'm totally guessing in the dark.

It might be worth double-checking that it's not some sort of a temporary problem with the /home filesystem. Do the files it's complaining about exist? Would rebooting the computer make a difference? Do plain git commands work in that repo?

It might also be some sort of incompatibility between a libgit2 and/or gix version and/or some git version. Which version of Git do you use?

It would also be very helpful if one of you could build jj from source and bisect the v0.17.0..v0.18.0 range.

yuja commented
❯ jj git fetch
Error: failed to stat '/home/michael/dotfiles/.jj/repo/store/git/objects/pack/pack_git2_4978ca90985didx': Resource temporarily unavailable; class=Os (2)

Seems like underlying libgit2 code paths are totally broken. It's weird that fstat returned EAGAIN. Running with strace might show something wrong.

Btw, which Linux distribution and version are you using?

I'm running Fedora 40. The thing is that my personal computer doesn't have this issue, only my work computer. (Same setup for all intents and purposes. Obv. not the exact same hardware.)

I'll find some time later today to go to the office and do the bisect.

Well, installing from source works perfectly fine 😕 I'll try to reproduce the CI build more closely, starting with using musl instead of glibc.

This is quite strange. I built the v0.18.0 tag in an Ubuntu container with the musl target... and it works fine. If I cargo-binstall it, the problem is back. At this point, I don't know what to try anymore to reproduce it. It seems almost like something went wrong with the build, rather than the source.

Have you been using cargo install --locked or cargo build to build it, as opposed to a plain cargo install?

If so, another remote possibility (grasping at straws here) is that perhaps the binary uses some CPU instruction that your processor doesn't support (got optimized for Intel-only instructions or AMD-only or ??).

Also, the release binaries are built with Rust 1.76. I'd be surprised if this made a difference...

I used cargo build. I don't really know about the nuances of what those different commands do. Let me know if I should try something else.

My home pc has an amd cpu, the office one is intel (11900k I think?).

I'm using the latest rust version (1.78).

cargo build should use Cargo.lock and therefore the same dependencies as the release build.

If you installed Rust with rustup, you could relatively easily install 1.76 toolchain (rustup toolchain add 1.76, I believe, and then cargo +1.76 build).

I guess it's possible that the binary doesn't work correctly on Intel CPUs. We can ask on Discord, that's probably the easiest way to find out.

Shoot, looks like I made a mistake. Apparently I had a self-compiled version of jj installed on my personal machine all along. Now that I installed the precompiled binary here as well, I can at least reproduce at home. So Intel / AMD cpu doesn't seem to be the issue.

I'll try to compile myself with Rust 1.76 in a minute.

Compiling with 1.76 also doesn't cause any issues, jj works fine.

I had another idea that it might be the feature flags. I'm pretty sure I had them set just like in the CI workflow when testing earlier at the office. But I'll try again. openssl always gives me trouble with Rust builds 😢

Nope, the features don't make a difference either. Still at square one: Local builds are fine, prebuilt ones are borked.

What the... if I download the prebuilt binary manually, it seems to be fine as well. Only cargo binstall causes issues! Now I have to figure out how and why I'm getting a different thing from binstall.

With the following command, the correct binary seems to be installed:

cargo binstall jj-cli --target x86_64-unknown-linux-musl
# ...
WARN The package jj-cli v0.18.0 (x86_64-unknown-linux-musl) has been downloaded from github.com
# ...

Notice the snippet of the output (github.com). This is different without the explicit target:


cargo binstall jj-cli
# ...
WARN The package jj-cli v0.18.0 (x86_64-unknown-linux-gnu) has been downloaded from third-party source QuickInstall
# ...

So, QuickInstall. It looks like they also provide prebuilt releases for some rust programs. Presumably, the borked binaries are downloaded from this release:

https://github.com/cargo-bins/cargo-quickinstall/releases/tag/jj-cli-0.18.0

I have no idea how the QuickInstall system works. But it seems they only provide binaries for x86_64-unknown-linux-gnu (the one we don't provide).
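
To make the difference between the two runs easier to spot, here is a rough sketch that pulls the target and the download source out of the WARN line. It assumes the log wording is exactly what appears in the two outputs above; that wording may well differ in other cargo-binstall versions, and the function names are hypothetical.

```python
import re

# Matches the WARN line as quoted in this thread, e.g.
# "WARN The package jj-cli v0.18.0 (<target>) has been downloaded from <source>"
# The exact wording is an assumption based on the two outputs above.
_DOWNLOADED = re.compile(
    r"\((?P<target>[^)]+)\) has been downloaded from (?P<source>.+)$"
)


def describe_download(line):
    """Return (target, source) for a matching log line, else None."""
    match = _DOWNLOADED.search(line.strip())
    if match is None:
        return None
    return match.group("target"), match.group("source")


def from_quickinstall(line):
    """True if the line says the package came from QuickInstall."""
    info = describe_download(line)
    return info is not None and "QuickInstall" in info[1]


if __name__ == "__main__":
    sample = (
        "WARN The package jj-cli v0.18.0 (x86_64-unknown-linux-gnu) "
        "has been downloaded from third-party source QuickInstall"
    )
    print(describe_download(sample), from_quickinstall(sample))
```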

Would there be any downside to providing gnu and musl binaries for linux? It would probably be even easier than adding aarch64 binaries. It's a little redundant since musl works fine everywhere, but it's an easy solution.

I'll investigate the QuickInstall system a bit to maybe figure out what went wrong there.

I tried "pinning" this issue and putting the workaround in the first comment.

Great! The flag --strategies crate-meta-data works for me.

The quick-install version is built against glibc 2.17 to maintain backwards compatibility with CentOS 7. Perhaps that's the reason why it failed?

I could:

  • set glibc version higher for jj on quick-install
  • enable gix on quick-install

to work around this issue

@NobodyXu good questions, thank you for looking into it! Unfortunately, I don't know about glibc.

Re enabling gix, this will not get rid of the libgit2 dependency. Pushing and pulling is only implemented with libgit2 in jj currently.

Pushing and pulling is only implemented with libgit2 in jj currently.

Not sure about pushing, but pulling using gix is definitely doable.

yuja commented

The binary downloaded from https://github.com/cargo-bins/cargo-quickinstall/releases/tag/jj-cli-0.18.0 appears to link some of the libc functions statically.

For example, git_config_add_file_ondisk calls statically-linked stat64->fstatat64. OTOH, it refers to dynamically-linked errno.

(gdb) disas git_config_add_file_ondisk
Dump of assembler code for function git_config_add_file_ondisk:
   ...
   0x0000000000ee4aee <+46>:	mov    rbp,rdi
   0x0000000000ee4af1 <+49>:	lea    rsi,[rsp+0x10]
   0x0000000000ee4af6 <+54>:	mov    rdi,rbx
   0x0000000000ee4af9 <+57>:	call   0x1274b70 <stat64>
   0x0000000000ee4afe <+62>:	test   eax,eax
   0x0000000000ee4b00 <+64>:	js     0xee4b8e <git_config_add_file_ondisk+206>
   ...
   0x0000000000ee4b8e <+206>:	call   0x1274ca0 <__errno_location@plt>
   0x0000000000ee4b93 <+211>:	mov    eax,DWORD PTR [rax]
   0x0000000000ee4b95 <+213>:	cmp    eax,0x2
   0x0000000000ee4b98 <+216>:	je     0xee4b06 <git_config_add_file_ondisk+70>
   0x0000000000ee4b9e <+222>:	cmp    eax,0x14
   0x0000000000ee4ba1 <+225>:	je     0xee4b06 <git_config_add_file_ondisk+70>
   0x0000000000ee4ba7 <+231>:	lea    rsi,[rip+0xffffffffff399803]        # 0x27e3b1
   0x0000000000ee4bae <+238>:	mov    edi,0x7
   0x0000000000ee4bb3 <+243>:	mov    rdx,rbx
   0x0000000000ee4bb6 <+246>:	xor    eax,eax
   0x0000000000ee4bb8 <+248>:	call   0xeed080 <git_error_set>
   0x0000000000ee4bbd <+253>:	jmp    0xee4b79 <git_config_add_file_ondisk+185>
(gdb) disas fstatat64
Dump of assembler code for function fstatat64:
   0x0000000001274bf0 <+0>:	push   rbp
   0x0000000001274bf1 <+1>:	mov    rbp,rsp
   0x0000000001274bf4 <+4>:	mov    eax,0x106
   0x0000000001274bf9 <+9>:	mov    r10d,ecx
   0x0000000001274bfc <+12>:	syscall
   0x0000000001274bfe <+14>:	xor    ecx,ecx
   0x0000000001274c00 <+16>:	cmp    eax,0xfffff001
   0x0000000001274c05 <+21>:	jb     0x1274c18 <fstatat64+40>
   0x0000000001274c07 <+23>:	neg    eax
   0x0000000001274c09 <+25>:	mov    rcx,0xfffffffffffffff8
   0x0000000001274c10 <+32>:	mov    DWORD PTR fs:[rcx],eax
   0x0000000001274c13 <+35>:	mov    ecx,0xffffffff
   0x0000000001274c18 <+40>:	mov    eax,ecx
   0x0000000001274c1a <+42>:	pop    rbp
   0x0000000001274c1b <+43>:	ret

(errno = -ret is inlined?)

I think that's the source of the problem.
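
One way to read the disassembly, sketched as a toy model: the statically linked fstatat64 stores the error code in one thread-local slot (the write to fs:[rcx]), while the caller reads errno through `__errno_location@plt` from the dynamically linked libc. If those are two different locations, the caller never sees ENOENT and instead sees whatever was left there before. The class below is purely illustrative; it is not libgit2 or glibc code, and the stale EAGAIN value is an assumption chosen to match the "Resource temporarily unavailable" message reported earlier.

```python
# Linux x86_64 errno values; 0x2 and 0x14 are the constants compared
# against in the git_config_add_file_ondisk disassembly above.
ENOENT, EAGAIN, ENOTDIR = 2, 11, 20


class ToyProcess:
    """Toy model of a process with one or two errno locations."""

    def __init__(self, shared_errno):
        # Slot written by the statically linked fstatat64.
        self.static_errno = 0
        # Slot read via __errno_location; starts with a stale value
        # (EAGAIN here is an assumption for illustration).
        self.dynamic_errno = EAGAIN
        self.shared_errno = shared_errno

    def stat_missing_file(self):
        """Simulate stat() on a missing file: store ENOENT, return -1."""
        if self.shared_errno:
            self.dynamic_errno = ENOENT
        else:
            self.static_errno = ENOENT
        return -1

    def add_file_ondisk(self):
        """Mimic the caller: a missing file is fine, other errors are not."""
        if self.stat_missing_file() < 0:
            err = self.dynamic_errno  # what __errno_location would yield
            if err in (ENOENT, ENOTDIR):
                return "ok: missing file ignored"
            return f"failed to stat (errno {err})"
        return "ok"


if __name__ == "__main__":
    print(ToyProcess(shared_errno=True).add_file_ondisk())
    print(ToyProcess(shared_errno=False).add_file_ondisk())
```

With a single errno location the missing file is ignored, as intended. With two locations the caller reports a stat failure with an unrelated error code, which resembles the symptoms in this issue.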

Quickinstall uses cargo-zigbuild to cross compile to glibc 2.17, perhaps that's the source of the error?

Like @senekor, I've confirmed that if I install via cargo install jj-cli --locked this problem does not exist. My initial problem was with cargo binstall, which will install the binaries from the release and, if there are none, default to cargo install.

Two questions:

  • If we created our own dynamically linked glibc builds, and published them in a release, would cargo binstall use those instead of the upstream zig version that's apparently broken?
  • Alternatively, is it possible to get cargo binstall to use the musl binary in the meantime? Or permanently, on all systems?

I ask because it would be nice of us to mitigate this for our users ASAP, so they don't run into it. We can/should also update our docs I suppose in the meantime, but I think it's important to stop the bleeding, so to speak.

In either case, I would prefer if cargo binstall would only install our binaries from our release archives by default, because that's what we CI and test against, and it would be really nice if that could work OOTB without any extra flags — so that users can get working results even if they mistype something or just run cargo binstall blindly without looking too hard.

If we created our own dynamically linked glibc builds, and published them in a release, would cargo binstall use those instead of the upstream zig version that's apparently broken?

I believe so. Of course, then it would be up to us to make a glibc-linked binary that works on many Linux systems, which I don't have much experience with, but you might.

Alternatively, is it possible to get cargo binstall to use the musl binary in the meantime? Or permanently, on all systems?
In either case, I would prefer if cargo binstall would only install our binaries from our release archives by default, because that's what we CI and test against, and it would be really nice if that could work OOTB without any extra flags — so that users can get working results even if they mistype something or just run cargo binstall blindly without looking too hard.

AFAIU, there's no way currently, but this is now tracked in cargo-bins/cargo-binstall#1721.

@thoughtpolice

If we created our own dynamically linked glibc builds, and published them in a release, would cargo binstall use those instead of the upstream zig version that's apparently broken?

Yes.

Alternatively, is it possible to get cargo binstall to use the musl binary in the meantime? Or permanently, on all systems?

You can use --disable-strategies quick-install

I think the question is whether we can do this from the project's side. Asking every user to do this manually isn't reliable. I'm personally so used to binstall working well that I often don't check a project's installation documentation and just binstall it.

I do wonder: Can't you disable building jj and delete the broken release on the quickinstall side? Is it set up in a way that this is impossible or difficult?

We actually have recommended the correct command in https://martinvonz.github.io/jj/latest/install-and-setup/#cargo-binstall for a long time, but I expect binstall users are less likely to read that page than an average user.

I think the question is whether we can do this from the project's side. Asking every user to do this manually isn't reliable. I'm personally so used to binstall working well that I often don't check a project's installation documentation and just binstall it.

It's tracked in cargo-bins/cargo-binstall#1721, though I've been quite busy recently.

I do wonder: Can't you disable building jj and delete the broken release on the quickinstall side? Is it set up in a way that this is impossible or difficult?

Yes, I can remove the broken release, though it might be built again.
I think it might be a bug in rust-cross/cargo-zigbuild#255 or ziglang, since cargo-quickinstall uses it to build for glibc 2.17.

Now that this PR is merged, the problem should go away with the next release of cargo-binstall. Of course, people will be running outdated versions for some time, so we might want to keep this issue pinned for a little longer.

cargo-binstall 1.7.1 has been released. cargo-bins/cargo-binstall#1741, which fixed this issue, was released in 1.7.0.

Issue can be closed. Failure was due to combination of glibc target below 2.33 and compilation via Zig prior to 0.12.0.
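
Stated as a predicate, the conclusion above might look like the sketch below. It takes the two thresholds from the closing comment at face value (glibc target below 2.33, Zig before 0.12.0). The function names and the example inputs are hypothetical; in particular, the thread does not say which Zig version QuickInstall actually used.

```python
def version_tuple(text):
    """Turn '2.17' or '0.12.0' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def build_is_affected(glibc_target, zig_version):
    """True when both conditions named in the closing comment hold."""
    old_glibc = version_tuple(glibc_target) < (2, 33)
    old_zig = version_tuple(zig_version) < (0, 12, 0)
    return old_glibc and old_zig


if __name__ == "__main__":
    # "0.11.0" is an example input, not a version confirmed by the thread.
    print(build_is_affected("2.17", "0.11.0"))
    print(build_is_affected("2.17", "0.12.0"))
```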


(errno = -ret is inlined?)

I think that's the source of the problem.

It would be fine with glibc 2.33+ where these calls became proper functions AFAIK (see the referenced quotes in link below).

With glibc 2.32 and earlier, the problem is Zig-specific, and they resolved it in Zig 0.12.0 👍

Details can be found here: rust-cross/cargo-zigbuild#255 (comment)