ofborg fails nondeterministically
amjoseph-nixpkgs opened this issue · 6 comments
From
Ofborg fails with:
error: builder for '/nix/store/bgj8mchi88pzgs9c24xb0kivlmglfmhp-nix-2.3.17.drv' failed with exit code 2;
last 10 log lines:
> ran test tests/gc.sh... [PASS]
> ran test tests/gc-runtime.sh... [PASS]
> ran test tests/binary-cache.sh... [PASS]
> ran test tests/user-envs.sh... [PASS]
> ran test tests/remote-store.sh... [PASS]
> ran test tests/lang.sh... [PASS]
> ran test tests/fixed.sh... [PASS]
> ran test tests/timeout.sh... [PASS]
> ran test tests/gc-auto.sh... [PASS]
> ran test tests/gc-concurrent.sh... [PASS]
For full logs, run 'nix log /nix/store/bgj8mchi88pzgs9c24xb0kivlmglfmhp-nix-2.3.17.drv'.
Yet I can build the exact same derivation locally:
$ nix build /nix/store/bgj8mchi88pzgs9c24xb0kivlmglfmhp-nix-2.3.17.drv^*
$
This is why it is so important that we are able to run CI locally.
Please always post more than the last 10 lines mentioned by ofborg. Posting only those lines makes debugging harder than it needs to be.
I think the log format provided by ofborg is just not great: it combines everything, which can make it hard to find the one derivation that failed. It could also certainly be that the test itself is flaky.
By searching for fail] in https://gist.githubusercontent.com/GrahamcOfBorg/e550c0c6974dc65290e52db4934143e7/raw/1594d4c46950d777be85a7770155df2c69ce39e7/ofborg-eval-lib-tests I found the failing test:
ran test tests/filter-source.sh... [PASS]
ran test tests/misc.sh... [PASS]
ran test tests/add.sh... [FAIL]
++ nix-store --add ./dummy
+ path1=/build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
+ echo /build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
/build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
++ nix-store --add-fixed sha256 --recursive ./dummy
+ path2=/build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
+ echo /build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
/build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
+ test /build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy '!=' /build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
++ nix-store --add-fixed sha256 ./dummy
+ path3=/build/nix-test/add/store/1b0chpd74drqysiwsskw53zwlg18rcjl-dummy
+ echo /build/nix-test/add/store/1b0chpd74drqysiwsskw53zwlg18rcjl-dummy
/build/nix-test/add/store/1b0chpd74drqysiwsskw53zwlg18rcjl-dummy
+ test /build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy '!=' /build/nix-test/add/store/1b0chpd74drqysiwsskw53zwlg18rcjl-dummy
++ nix-store --add-fixed sha1 --recursive ./dummy
+ path4=/build/nix-test/add/store/dcjypnx18bpzy81n499bfgq7fl548swl-dummy
+ echo /build/nix-test/add/store/dcjypnx18bpzy81n499bfgq7fl548swl-dummy
/build/nix-test/add/store/dcjypnx18bpzy81n499bfgq7fl548swl-dummy
+ test /build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy '!=' /build/nix-test/add/store/dcjypnx18bpzy81n499bfgq7fl548swl-dummy
++ nix-store -q --hash /build/nix-test/add/store/g1qxg63zbanhv79ibby90311521d4237-dummy
+ hash1=sha256:0ip26j2h11n1kgkz36rl4akv694yz65hr72q4kv4b3lxcbi65b3p
+ echo sha256:0ip26j2h11n1kgkz36rl4akv694yz65hr72q4kv4b3lxcbi65b3p
sha256:0ip26j2h11n1kgkz36rl4akv694yz65hr72q4kv4b3lxcbi65b3p
++ nix-hash --type sha256 --base32 ./dummy
+ hash2=0lc8c8k1yc8m563wxg9ikalz4q9f56gc667qnnsjiwgiv7ya8xbw
+ echo 0lc8c8k1yc8m563wxg9ikalz4q9f56gc667qnnsjiwgiv7ya8xbw
0lc8c8k1yc8m563wxg9ikalz4q9f56gc667qnnsjiwgiv7ya8xbw
+ test sha256:0ip26j2h11n1kgkz36rl4akv694yz65hr72q4kv4b3lxcbi65b3p = sha256:0lc8c8k1yc8m563wxg9ikalz4q9f56gc667qnnsjiwgiv7ya8xbw
make: *** [mk/lib.mk:128: tests/add.sh.test] Error 1
make: *** Waiting for unfinished jobs....
ran test tests/pass-as-file.sh... [PASS]
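The search described above can be sketched as a small shell helper that prints every failing test together with the trace that follows it. This is only an illustration: `show_failed_tests` is a hypothetical name, and it assumes the log uses the `ran test <name>... [PASS]` / `[FAIL]` line format seen in the excerpt and has already been saved to a local file.

```shell
# Hypothetical helper: print each "[FAIL]" test line and the lines
# after it, stopping at the next "ran test" line.
# Assumes the "ran test <name>... [PASS|FAIL]" format from the excerpt.
show_failed_tests() {
  awk '
    # A failing test: start printing, beginning with this line.
    /ran test .*\[FAIL\]/ { printing = 1; print; next }
    # Any other test result line ends the current failure block.
    /ran test /           { printing = 0 }
    # Lines in between (shell trace, make errors) belong to the failure.
    printing              { print }
  ' "$1"
}
```

Running `show_failed_tests ofborg.log` on a saved copy of the log would then print only the `tests/add.sh` block and the `make` errors, instead of the whole combined output.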
Yet the same derivation builds for me.
Were you able to reproduce this failure?
I didn't try reproducing it, but I assume it probably has something to do with its setup. Maybe ZFS is to blame?
I apologize; this is not an ofborg bug, it is a concurrency bug in the Nix test apparatus. I had it happen to me locally during a build of the TVL monorepo and it failed in a way which strongly implicates enableParallelChecking=true:
@SuperSandro2000 thanks for nudging me to look closer here. I was wrong about this.
Maybe ZFS is to blame?
We can definitely rule that out; the machine where it happened to me uses BTRFS for / and /nix/store and tmpfs for $NIX_BUILD_TOP.
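Because the failure is intermittent, a single successful local build does not show that the derivation is fine. A minimal sketch of repeating a command until it fails is given here; `repeat_until_fail` is a hypothetical helper, not part of Nix or ofborg, and the commented invocation assumes that `nix build --rebuild` is an acceptable way to force the build and its tests to run again.

```shell
# Hypothetical helper: run a command up to N times and stop at the
# first failure. Returns 1 if any attempt failed, 0 otherwise.
repeat_until_fail() {
  max=$1
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    if ! "$@"; then
      echo "failed on attempt $i of $max" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "no failure in $max attempts" >&2
  return 0
}

# Possible use (assumption: --rebuild forces the checks to run again):
# repeat_until_fail 20 nix build --rebuild \
#   '/nix/store/bgj8mchi88pzgs9c24xb0kivlmglfmhp-nix-2.3.17.drv^*'
```

A run that never fails proves nothing about a race, but a failure on some attempt would confirm that the problem is reproducible outside ofborg.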