openHPI/poseidon

Nomad Rescheduling

On 2024-07-23 (02:21, agent 2) and 2024-07-24 (02:34, agent 2), we observed that Nomad did not (successfully) reschedule runners. On both days, this behavior was triggered by an unattended upgrade of docker-ce.

In the syslogs, we see:

  • Docker beginning to restart
  • Nomad beginning a graceful restart
  • Docker warning: `ShouldRestart failed, container will not be restarted`
  • Docker ignoring the event `topic=/tasks/delete`
  • Containerd warning: `runc did not terminate successfully: exit status 255` (runtime `io.containerd.runc.v2`)
  • Systemd remarking: `Found left-over process 1680662 (nomad) in control group while starting unit. Ignoring. This usually indicates unclean termination of a previous run, or service implementation deficiencies.`
  • Nomad repeatedly logging `error reading from server: EOF`
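
Whether the runners actually came back can be checked against the Nomad API. The following is a minimal sketch (not part of Poseidon; it assumes the official `github.com/hashicorp/nomad/api` client and the usual `NOMAD_ADDR`/`NOMAD_TOKEN` environment variables) that counts the allocations Nomad marked as lost after such a restart:

```go
package main

import (
	"fmt"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig honors NOMAD_ADDR and NOMAD_TOKEN from the environment.
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Nomad client: %v", err)
	}

	// List the allocations visible to the configured namespace.
	allocations, _, err := client.Allocations().List(nil)
	if err != nil {
		log.Fatalf("listing allocations: %v", err)
	}

	lost := 0
	for _, allocation := range allocations {
		// "lost" is the client status Nomad assigns when an agent stops
		// reporting for an allocation, e.g. after an unclean restart.
		if allocation.ClientStatus == nomadApi.AllocClientStatusLost {
			lost++
		}
	}
	fmt.Printf("%d of %d allocations are lost\n", lost, len(allocations))
}
```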

In #612, we are currently investigating whether batch jobs restart/reschedule at all.
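
For context, both behaviors are configured per task group in the job specification. The sketch below uses the Nomad Go API structs to show where the two policies live for a batch job; the concrete names and numbers are illustrative assumptions, not Poseidon's actual configuration:

```go
package main

import (
	"fmt"
	"time"

	nomadApi "github.com/hashicorp/nomad/api"
)

// pointerOf is a small local helper; recent nomad/api versions ship an equivalent.
func pointerOf[T any](value T) *T { return &value }

// runnerJob shows where a batch job declares its restart and reschedule
// behavior. All concrete values are illustrative assumptions.
func runnerJob() *nomadApi.Job {
	job := nomadApi.NewBatchJob("runner-example", "runner-example", "global", 50)
	group := nomadApi.NewTaskGroup("runners", 1)

	// Restart policy: retry a failed task in place, on the same node.
	group.RestartPolicy = &nomadApi.RestartPolicy{
		Attempts: pointerOf(3),
		Interval: pointerOf(24 * time.Hour),
		Delay:    pointerOf(15 * time.Second),
		Mode:     pointerOf("fail"),
	}

	// Reschedule policy: place a replacement allocation on another node once
	// the restart attempts are exhausted.
	group.ReschedulePolicy = &nomadApi.ReschedulePolicy{
		Attempts:  pointerOf(1),
		Interval:  pointerOf(24 * time.Hour),
		Unlimited: pointerOf(false),
	}

	// The task definition itself is omitted for brevity.
	job.AddTaskGroup(group)
	return job
}

func main() {
	job := runnerJob()
	fmt.Printf("job %s has %d task group(s)\n", *job.ID, len(job.TaskGroups))
}
```

If no explicit reschedule policy is set, Nomad's batch defaults apply, i.e. at most one reschedule attempt per 24 hours, which is one reason the rescheduling behavior of our batch jobs is worth verifying.
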
I suggest the following:

  • Wait for #673
  • Test for lost runners using our (sequential) Ansible deployments
    • Maybe inject a Nomad or Docker service restart (sketched below)
  • Keep an eye on #587 and on reduced numbers of idle runners after deployments and unattended-upgrades
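
For the fault injection, a hypothetical end-to-end check could restart the agent and then poll the cluster state. The sketch below is an assumption of how such a test might look (it presumes a systemd-managed Nomad agent on the local host, sufficient privileges, and again the official Go API client), not an existing Poseidon test:

```go
package main

import (
	"log"
	"os/exec"
	"time"

	nomadApi "github.com/hashicorp/nomad/api"
)

// allRunning reports whether every allocation currently has the client
// status "running".
func allRunning(client *nomadApi.Client) (bool, error) {
	allocations, _, err := client.Allocations().List(nil)
	if err != nil {
		return false, err
	}
	for _, allocation := range allocations {
		if allocation.ClientStatus != nomadApi.AllocClientStatusRunning {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Nomad client: %v", err)
	}

	// Inject the fault: restart the local Nomad agent, as an unattended
	// upgrade would.
	if out, err := exec.Command("systemctl", "restart", "nomad").CombinedOutput(); err != nil {
		log.Fatalf("restarting nomad: %v (%s)", err, out)
	}

	// Poll until the runners report as running again, or give up.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if ok, err := allRunning(client); err == nil && ok {
			log.Println("all allocations are running again")
			return
		}
		time.Sleep(5 * time.Second)
	}
	log.Fatal("allocations did not recover before the deadline")
}
```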

Since we have two dedicated issues for #673 and #587, this issue is only about the sequential restart of Nomad agents and the resulting rescheduling behavior. In our past experience, restarting Nomad agents sequentially is more fault tolerant than restarting them all simultaneously and produces fewer errors. That's why we also included a rolling restart of Nomad in our Ansible pipeline.

Since the upstream issue created for #673 is not really about simultaneous restarts (but rather about restarting Nomad in general with the batch jobs we use), this issue currently does not provide many additional insights. To keep pending issues clearly visible, and since we expect #673 to improve the situation anyway, we are closing this one.