Github CI Anomaly Zoo
andres-erbsen opened this issue · 40 comments
- https://github.com/mit-plv/fiat-crypto/actions/runs/3163210890/jobs/5150550477 was terminated after 320 minutes, which is a round enough number to be a timeout, but the documented limit is 6 hours (360 minutes).
- https://github.com/mit-plv/bedrock2/actions/runs/3161224197/jobs/5146573405 shows that it was canceled, but none of us remember doing it nor can think of a reason why we would have done it.
- mit-plv/bedrock2@7654f3f shows ❌ even though all builds of it show ✔️
- An intermediate head on the Garagedoor PR showed ✔️ even though listing its builds showed some ❌s
- mit-plv/bedrock2@7654f3f shows ❌ even though all builds of it show ✔️
To shed a bit more light on this
If you go to the PR at mit-plv/bedrock2#280, we see
and the actions log shows
so it looks like the overall status of the commit includes runs of that commit across all branches, but the displayed checks are only for master or something?
- mit-plv/fiat-crypto/actions/runs/3163210890/jobs/5150550477 was terminated after 320 minutes, which is a round enough number to be a timeout, but the documented limit is 6 hours (360 minutes).
I'm thinking this might actually be an OOM kill, especially given that we're running the checker while replacing vm_compute
with the lazy
reduction machine.
- mit-plv/bedrock2/actions/runs/3161224197/jobs/5146573405 shows that it was canceled, but none of us remember doing it nor can think of a reason why we would have done it.
Plausibly related: actions/runner-images#736 (comment) (but this is only about mac)
Maybe worth asking about at https://github.com/orgs/community/discussions/
https://github.com/mit-plv/fiat-crypto/actions/runs/3245312644/jobs/5322679386
Cloning into 'boringssl'...
error: 44224 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
+ exit 128
https://github.com/mit-plv/fiat-crypto/actions/runs/3245312653/jobs/5322679307
This is new
# ocamlfind: Package `findlib' not found
# Error while running 'D:\a\fiat-crypto\fiat-crypto\_opam\bin/ocamlfind.exe query findlib -format %v' (exit code 2)
# Configuration script failed!
I've seen a bunch of errors about dynlink failing, but not this one
We should include snippets of the logs in the comments; GH deletes logs after a couple months
https://github.com/mit-plv/fiat-crypto/actions/runs/3242704485/jobs/5316358457
COQFLAGS="-Q src/bedrock2 bedrock2 -Q src/bedrock2Examples bedrock2Examples -Q /github/workspace/rupicola/bedrock2/deps/coqutil/src/coqutil coqutil " ../etc/bytedump.sh bedrock2.PrintListByte.allBytes > special/BytedumpTest.out.tmp
make[2]: *** [Makefile:64: special/BytedumpTest.out] Error 1
make[1]: *** [Makefile:105: bedrock2_ex] Error 2
How do we debug this? Maybe we shouldn't suppress stderr in bytedump?
validate failed on bytedump after previous steps succeeded:
https://github.com/mit-plv/fiat-crypto/actions/runs/3247910630/jobs/5328487093#step:17:3485
In fact bytedump is failing very frequently now, can we do something about this?
https://github.com/mit-plv/fiat-crypto/actions/runs/3247892148/jobs/5328446887
https://github.com/mit-plv/fiat-crypto/actions/runs/3247910630/jobs/5328487093
https://github.com/mit-plv/fiat-crypto/actions/runs/3247916390/jobs/5328499981
- https://github.com/mit-plv/fiat-crypto/actions/runs/3271198008/jobs/5380690445#step:7:131 on Mac OS:
Command exited with non-zero status 2
/Users/runner/work/fiat-crypto/fiat-crypto/rupicola/bedrock2/deps/coqutil/src/coqutil/Tactics/syntactic_unify.vo (real: 0.35, user: 0.09, sys: 0.09, mem: 57556 ko)
make[3]: *** [/Users/runner/work/fiat-crypto/fiat-crypto/rupicola/bedrock2/deps/coqutil/src/coqutil/Tactics/syntactic_unify.vo] Error 2
This suspiciously started happening when I started not trying to rebuild on Mac OS
Can anyone reproduce the (seemingly consistent?) issue that make -j2
fails on Mac OS the first time it's run after a clean checkout? c.f. #1458
# Usage: coqworker.opt --kind=[proof|query|tactic] $args
# got -main-channel
# File ".\src/Curves/Edwards/XYZT/Basic.v", line 124, characters 6-336:
# Error:
# Anomaly
# "Uncaught exception Failure("The spawned process did not connect back in 10.0s")."
# Please report at [http://coq.inria.fr/bugs/.](http://coq.inria.fr/bugs/)
#
# make: *** [Makefile.coq:793: src/Curves/Edwards/XYZT/Basic.vo] Error 129
https://github.com/mit-plv/fiat-crypto/actions/runs/3281473086/jobs/5403532417#step:17:1812
https://github.com/mit-plv/bedrock2/actions/runs/3292916290/jobs/5428839023 self-cancelled on start
Sometimes the version of Coq changes between one step and the next in the docker action
coq-community/docker-coq-action#80
https://github.com/mit-plv/fiat-crypto/actions/runs/3287380197/jobs/5416490558#step:17:501
This results in a bytedump failure because the bedrock2 files are compiled with an outdated version of Coq and so we get inconsistent assumptions
https://github.com/mit-plv/bedrock2/actions/runs/3305594557/jobs/5455825957 self-cancelled on submit
https://github.com/mit-plv/bedrock2/actions/runs/3337004947/jobs/5522821413 self-cancelled on submit
https://github.com/mit-plv/bedrock2/actions/runs/3337008636/jobs/5522829175 self-cancelled on submit
https://github.com/mit-plv/bedrock2/actions/runs/3352143850/jobs/5554084845 self-cancelled on submit
https://github.com/mit-plv/bedrock2/actions/runs/3390987808/jobs/5635703710 self-cancelled on submit
gcc: fatal error: no input files
https://github.com/mit-plv/fiat-crypto/actions/runs/3420869283/jobs/5696244304#step:17:453
gcc: fatal error: no input files
https://github.com/mit-plv/fiat-crypto/actions/runs/3420869283/jobs/5696244304#step:17:453
That's probably just me messing up how to get the version of gcc
; I guess it's not gcc -v
? The line to be fixed is
The actual failure here is
COQFLAGS="-Q src/bedrock2 bedrock2 -Q src/bedrock2Examples bedrock2Examples -Q /github/workspace/rupicola/bedrock2/deps/coqutil/src/coqutil coqutil " ../etc/bytedump.py bedrock2.PrintListByte.allBytes > special/BytedumpTest.out.tmp
Coq < Coq < Toplevel input, characters 1-55:
> Require bedrock2.PrintListByte bedrock2.PrintListByte.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error:
Compiled library bedrock2.PrintListByte (in file /github/workspace/rupicola/bedrock2/bedrock2/src/bedrock2/PrintListByte.vo) makes inconsistent assumptions over library Coq.Init.Ltac
Coq <
Coq <
Unnamed_thm <
Unnamed_thm <
Unnamed_thm < Toplevel input, characters 32-63:
> PrintListByte.print_list_byte bedrock2.PrintListByte.allBytes.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: The reference bedrock2.PrintListByte.allBytes was not found in the current environment.
Unnamed_thm <
Coq <
Error: make[2]: *** [Makefile:64: special/BytedumpTest.out] Error 2
Error: make[1]: *** [Makefile:105: bedrock2_ex] Error 2
make: *** No rule to make target 'bedrock2', needed by '.Makefile.coq.d'. Stop.
which presumably is #1394 (comment)
https://github.com/mit-plv/fiat-crypto/actions/runs/3446654504/jobs/5751826046 Windows: Retrieve the Cygwin cache
Error: getaddrinfo ENOTFOUND www.cygwin.com
https://github.com/mit-plv/fiat-crypto/actions/runs/3457159795/jobs/5770405148#step:6:116
** Fatal error: Cannot run cygpath -m "libshell32" "libshell32.lib" "libshell32.dll.a" "libshell32.a" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\libshell32" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\libshell32.lib" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\libshell32.dll.a" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\libshell32.a" "lib\libshell32" "lib\libshell32.lib" "lib\libshell32.dll.a" "lib\libshell32.a" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\flexdll\libshell32" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\flexdll\libshell32.lib" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\flexdll\libshell32.dll.a" "D:/a/fiat-crypto/fiat-crypto/_opam/lib/ocaml\flexdll\libshell32.a" "/usr/lib/gcc/x86_64-w64-mingw32/11/libshell32" "/usr/lib/gcc/x86_64-w64-mingw32/11/libshell32.lib" "/usr/lib/gcc/x86_64-w64-mingw32/11/libshell32.dll.a" "/usr/lib/gcc/x86_64-w64-mingw32/11/libshell32.a" "/usr/x86_64-w64-mingw32/lib/x86_64-w64-mingw32/11/libshell32" "/usr/x86_64-w64-mingw32/lib/x86_64-w64-mingw32/11/libshell32.lib" "/usr/x86_64-w64-mingw32/lib/x86_64-w64-mingw32/11/libshell32.dll.a" "/usr/x86_64-w64-mingw32/lib/x86_64-w64-mingw32/11/libshell32.a" "/usr/x86_64-w64-mingw32/lib/libshell32" "/usr/x86_64-w64-mingw32/lib/libshell32.lib" "/usr/x86_64-w64-mingw32/lib/libshell32.dll.a" "/usr/x86_64-w64-mingw32/lib/libshell32.a" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/x86_64-w64-mingw32/11/libshell32" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/x86_64-w64-mingw32/11/libshell32.lib" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/x86_64-w64-mingw32/11/libshell32.dll.a" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/x86_64-w64-mingw32/11/libshell32.a" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/libshell32" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/libshell32.lib" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/libshell32.dll.a" "/usr/x86_64-w64-mingw32/sys-root/mingw/lib/libshell32.a"
# File "caml_startup", line 1:
# Error: Error during linking (exit code 2)
# make[1]: *** [Makefile.common:184: _build/default/coq-core.install] Error 1
# make[1]: *** Waiting for unfinished jobs....
# make[1]: Leaving directory '/cygdrive/d/a/fiat-crypto/fiat-crypto/_opam/.opam-switch/build/coq.8.15.2'
# make: *** [Makefile.make:122: submake] Error 2
2 | opam exec -- cat ~/.opam/log/*
| ~~~~
| The term 'opam' is not recognized as a name of a cmdlet, function, script file, or executable program. Check the
| spelling of the name, or if a path was included, verify that the path is correct and try again.
https://github.com/mit-plv/fiat-crypto/actions/runs/3470364042/jobs/5798475008
2 | opam exec -- cat ~/.opam/log/* | ~~~~ | The term 'opam' is not recognized as a name of a cmdlet, function, script file, or executable program. Check the | spelling of the name, or if a path was included, verify that the path is correct and try again.
https://github.com/mit-plv/fiat-crypto/actions/runs/3470364042/jobs/5798475008
This is because I have the cat
step set to run even if earlier steps failed. The actual issue here is that opam failed to install in an earlier step with
connect ETIMEDOUT 185.199.110.133:443
Waiting 20 seconds before trying again
connect ETIMEDOUT 185.199.110.133:443
Waiting 14 seconds before trying again
Error: connect ETIMEDOUT 185.199.110.133:443
validate times out in https://github.com/mit-plv/fiat-crypto/actions/runs/3470357503/jobs/5798461079 and https://github.com/mit-plv/fiat-crypto/actions/runs/3470364029/jobs/5798476434 because changed Coq version between steps results in project rebuild. We should really have a better way of dealing with Coq changing versions under us... (no progress on coq-community/docker-coq-action#80 yet...)
boringssl test failure at https://github.com/mit-plv/fiat-crypto/actions/runs/3524875493/jobs/5910863477#step:7:111 on a PR changing exact _
to refine _
in a Qed
'd proof (aka a PR that cannot have changed anything impacting BoringSSL)
FAILED: crypto_test_data.cc /home/runner/work/fiat-crypto/fiat-crypto/boringssl/build/crypto_test_data.cc
cd /home/runner/work/fiat-crypto/fiat-crypto/boringssl && /opt/hostedtoolcache/go/1.13.15/x64/bin/go run util/embed_test_data.go -file-list /home/runner/work/fiat-crypto/fiat-crypto/boringssl/build/embed_test_data_args.txt > /home/runner/work/fiat-crypto/fiat-crypto/boringssl/build/crypto_test_data.cc
# command-line-arguments
Error: util/embed_test_data.go:79:16: undefined: os.ReadFile
Error: util/embed_test_data.go:129:16: undefined: os.ReadFile
Package g++-7 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'g++-7' has no installation candidate
https://github.com/mit-plv/fiat-crypto/actions/runs/3524875493/jobs/5936926857#step:3:67
https://github.com/mit-plv/bedrock2/actions/runs/3653687403/jobs/6176913824 timed out for no reason I can see
https://github.com/mit-plv/bedrock2/actions/runs/3653687403/jobs/6176913824 timed out for no reason I can see
Windows does this sometimes, I think: it fails to quit terminal after all scripts complete
https://github.com/mit-plv/fiat-crypto/actions/runs/4151994561/jobs/7182649901
connection error: 12037 fetching https://cygwin.mirror.constant.com/x86_64/setup.bz2.sig
https://github.com/mit-plv/fiat-crypto/actions/runs/7700820211/job/20985462543?pr=1811
error: failed retrieving file 'ocaml-compiler-libs-5.1.0-1-x86_64.pkg.tar.zst.sig' from geo.mirror.pkgbuild.com : Operation too slow. Less than 1 bytes/sec transferred the last 10 seconds
coq-8.18.0-2-x86_64 downloading...
error: failed retrieving file 'ghc-9.0.2-3-x86_64.pkg.tar.zst.sig' from geo.mirror.pkgbuild.com : Operation too slow. Less than 1 bytes/sec transferred the last 10 seconds
warning: failed to retrieve some files
error: failed to commit transaction (unexpected error)
Connecting to gitlab.alpinelinux.org (172.105.69.85:443)
wget: server returned error: HTTP/1.1 404 Not Found
ssl_client: write: Broken pipe
Error: Process completed with exit code 1.
https://github.com/mit-plv/rewriter/actions/runs/7705023867/job/20998315736?pr=148
error: failed retrieving file 'perl-5.38.2-1-x86_64.pkg.tar.zst.sig' from geo.mirror.pkgbuild.com : Operation too slow. Less than 1 bytes/sec transferred the last 10 seconds
https://github.com/mit-plv/fiat-crypto/actions/runs/8750960028/job/24015572653?pr=1885
https://github.com/mit-plv/fiat-crypto/actions/runs/9056914029/job/24884198262#step:7:29
Run jirutka/setup-alpine@v1
Run sudo -E ./setup-alpine.sh
Prepare rootfs directory
Download static apk-tools
▷ Downloading https://gitlab.alpinelinux.org/api/v4/projects/5/packages/generic/v2.14.0/x86_64/apk.static
curl: (7) Failed to connect to gitlab.alpinelinux.org port 443 after 9945 ms: Connection timed out
Error occurred at line 176:
173 | APK="$RUNNER_TEMP/apk"
174 |
175 | info "Downloading ${INPUT_APK_TOOLS_URL%\#*}"
> 176 | download_file "$INPUT_APK_TOOLS_URL" "$APK"
177 | chmod +x "$APK"
178 |
179 |
Error: Error occurred at line 176: download_file "$INPUT_APK_TOOLS_URL" "$APK" (see the job log for more information)
Error: Process completed with exit code 1.
[1 of 2] Compiling Main ( src/ExtractionHaskell/bedrock2_fiat_crypto.hs, src/ExtractionHaskell/bedrock2_fiat_crypto.o )
panic! (the 'impossible' happened)
GHC version 9.10.1:
heap overflow
Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
https://github.com/mit-plv/fiat-crypto/actions/runs/9138561025/job/25130899016?pr=1913#step:4:4949
Maybe we need to bump -M7G
to -M8G
? (And maybe we should upstream a bug report to not call heap overflows impossible?)