EnzymeAD/Enzyme

OpenStack github runners down?

Closed this issue · 7 comments

Based on https://github.com/EnzymeAD/Enzyme/actions
it looks like neither the MLIR nor the benchmark actions are being run anymore. Does anyone know why?
See also: https://github.com/EnzymeAD/Enzyme/actions/workflows/enzyme-mlir.yml

It looks like something (presumably the failing benchmarking CI) crashed the runners.

Can you restart them? Or do we wait for them to recover on their own?

Yeah, but we need to fix the cause (presumably the bad benchmarks), otherwise the runners will keep crashing.

Yeah, then please restart them. The PR works locally, so without CI I have no way to make progress on debugging:
[image]

And the llvm16 proof:
[image]

Okay, they are back up again. However, I did confirm that your benchmark PR was the cause of their failure.

Can you look into reducing RAM usage, or otherwise fixing the PR so it doesn't cause that issue again?

Sep 09 02:05:35 github22-ci-5 systemd[1]: actions.runner.EnzymeAD.github22-ci-5.service: A process of this unit has been killed by the OOM killer.
Sep 09 02:05:35 github22-ci-5 runsvc.sh[669]: Shutting down runner listener
Sep 09 02:05:35 github22-ci-5 runsvc.sh[669]: Sending SIGINT to runner listener to stop
Sep 09 02:05:35 github22-ci-5 runsvc.sh[669]: Sending SIGKILL to runner listener
Sep 09 02:05:35 github22-ci-5 runsvc.sh[669]: Exiting...
Sep 09 02:05:51 github22-ci-5 runsvc.sh[669]: 2024-09-09 02:05:51Z: Job Benchmark Test on os openstack22 and llvm 16 mode Release completed with result: Canceled
Sep 09 02:05:51 github22-ci-5 runsvc.sh[669]: Runner listener exited with error code 0
Sep 09 02:05:51 github22-ci-5 runsvc.sh[669]: Runner listener exit with 0 return code, stop the service, no retry needed.
Sep 09 02:05:51 github22-ci-5 systemd[1]: actions.runner.EnzymeAD.github22-ci-5.service: Failed with result 'oom-kill'.
Sep 09 02:05:51 github22-ci-5 systemd[1]: actions.runner.EnzymeAD.github22-ci-5.service: Consumed 16h 31.577s CPU time.
Removed /etc/systemd/system/multi-user.target.wants/actions.runner.EnzymeAD.github22-ci-5.service.
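
For anyone triaging the other runners, the failure signature in the journal excerpt above can be checked mechanically. This is only a minimal sketch in Python: the function name `oom_killed_units` and the regular expression are illustrative assumptions, written against the line format shown in this log and not against any systemd guarantee.

```python
import re

# Assumption: matches systemd lines shaped like the ones in the excerpt, e.g.
#   "... systemd[1]: <unit>.service: Failed with result 'oom-kill'."
OOM_RESULT = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+\.service): Failed with result 'oom-kill'\."
)


def oom_killed_units(lines):
    """Return the sorted unit names that systemd reported as failed by oom-kill."""
    units = set()
    for line in lines:
        match = OOM_RESULT.search(line)
        if match:
            units.add(match.group("unit"))
    return sorted(units)
```

Feeding it the saved journal text for each runner, one line per list element, would list the services that ended with an `oom-kill` result.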
ZuseZ4 commented

Yep, I already reduced the number of benchmark samples we run in one of them (because of the runtime rather than memory usage, but both should benefit). Very happy to reduce it in the remaining ones as well.
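
Separately from reducing the sample count, one option (not something proposed in this thread) would be to run each benchmark under a memory cap, so that a runaway benchmark fails on its own instead of getting the whole runner service OOM-killed. A rough Linux-only sketch in Python follows; the helper name `run_with_memory_cap` and the idea of wrapping the benchmark command this way are assumptions, not how the Enzyme CI is actually set up.

```python
import resource
import subprocess


def run_with_memory_cap(cmd, max_bytes):
    """Run cmd with its address space capped at max_bytes.

    Returns (returncode, peak_rss), where peak_rss is ru_maxrss over the
    waited-for children of this process (kibibytes on Linux).
    """

    def apply_limit():
        # Runs in the child between fork and exec, so only the child is limited.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    completed = subprocess.run(cmd, preexec_fn=apply_limit)
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak_rss
```

With a cap in place, allocations beyond the limit fail inside the benchmark process, which then exits with a nonzero code that CI can report as an ordinary job failure. The reported peak also gives a number to compare before and after reducing the sample count.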