openHPI/poseidon

Nomad Rescheduling

On 2024-07-23 (02:21, agent 2) and 2024-07-24 (02:34, agent 2), we observed that Nomad did not (successfully) reschedule runners. On both days, this behavior was triggered by an unattended upgrade of docker-ce.

In the syslogs, we see:

  • Docker beginning to restart
  • Nomad beginning a graceful restart
  • Docker warning: `ShouldRestart failed, container will not be restarted`
  • Docker ignoring the event `topic=/tasks/delete`
  • Containerd warning: `runc did not terminate successfully: exit status 255` (runtime `io.containerd.runc.v2`)
  • Systemd remarking: `Found left-over process 1680662 (nomad) in control group while starting unit. Ignoring. This usually indicates unclean termination of a previous run, or service implementation deficiencies.`
  • Nomad repeatedly logging `error reading from server: EOF`
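
Whether the runners actually came back can be checked against the Nomad API. The following is a minimal sketch (not part of Poseidon; it assumes the official `github.com/hashicorp/nomad/api` client and the usual `NOMAD_ADDR`/`NOMAD_TOKEN` environment variables) that counts the allocations Nomad marked as lost after such a restart:

```go
package main

import (
	"fmt"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig honors NOMAD_ADDR and NOMAD_TOKEN from the environment.
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Nomad client: %v", err)
	}

	// List the allocations visible to the configured namespace.
	allocations, _, err := client.Allocations().List(nil)
	if err != nil {
		log.Fatalf("listing allocations: %v", err)
	}

	lost := 0
	for _, allocation := range allocations {
		// "lost" is the client status Nomad assigns when an agent stops
		// reporting for an allocation, e.g. after an unclean restart.
		if allocation.ClientStatus == nomadApi.AllocClientStatusLost {
			lost++
		}
	}
	fmt.Printf("%d of %d allocations are lost\n", lost, len(allocations))
}
```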

In #612, we are currently investigating whether batch jobs restart/reschedule at all.
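
For context, both behaviors are configured per task group in the job specification. The sketch below uses the Nomad Go API structs to show where the two policies live for a batch job; the concrete names and numbers are illustrative assumptions, not Poseidon's actual configuration:

```go
package main

import (
	"fmt"
	"time"

	nomadApi "github.com/hashicorp/nomad/api"
)

// pointerOf is a small local helper; recent nomad/api versions ship an equivalent.
func pointerOf[T any](value T) *T { return &value }

// runnerJob shows where a batch job declares its restart and reschedule
// behavior. All concrete values are illustrative assumptions.
func runnerJob() *nomadApi.Job {
	job := nomadApi.NewBatchJob("runner-example", "runner-example", "global", 50)
	group := nomadApi.NewTaskGroup("runners", 1)

	// Restart policy: retry a failed task in place, on the same node.
	group.RestartPolicy = &nomadApi.RestartPolicy{
		Attempts: pointerOf(3),
		Interval: pointerOf(24 * time.Hour),
		Delay:    pointerOf(15 * time.Second),
		Mode:     pointerOf("fail"),
	}

	// Reschedule policy: place a replacement allocation on another node once
	// the restart attempts are exhausted.
	group.ReschedulePolicy = &nomadApi.ReschedulePolicy{
		Attempts:  pointerOf(1),
		Interval:  pointerOf(24 * time.Hour),
		Unlimited: pointerOf(false),
	}

	// The task definition itself is omitted for brevity.
	job.AddTaskGroup(group)
	return job
}

func main() {
	job := runnerJob()
	fmt.Printf("job %s has %d task group(s)\n", *job.ID, len(job.TaskGroups))
}
```

If no explicit reschedule policy is set, Nomad's batch defaults apply, i.e. at most one reschedule attempt per 24 hours, which is one reason the rescheduling behavior of our batch jobs is worth verifying.
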
I suggest the following:

  • Wait for #673
  • Test for lost runners using our (sequential) Ansible deployments
    • Maybe inject a Nomad or Docker service restart (sketched below)
  • Keep an eye on #587 and on reduced numbers of idle runners after deployments and unattended-upgrades
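
For the fault injection, a hypothetical end-to-end check could restart the agent and then poll the cluster state. The sketch below is an assumption of how such a test might look (it presumes a systemd-managed Nomad agent on the local host, sufficient privileges, and again the official Go API client), not an existing Poseidon test:

```go
package main

import (
	"log"
	"os/exec"
	"time"

	nomadApi "github.com/hashicorp/nomad/api"
)

// allRunning reports whether every allocation currently has the client
// status "running".
func allRunning(client *nomadApi.Client) (bool, error) {
	allocations, _, err := client.Allocations().List(nil)
	if err != nil {
		return false, err
	}
	for _, allocation := range allocations {
		if allocation.ClientStatus != nomadApi.AllocClientStatusRunning {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Nomad client: %v", err)
	}

	// Inject the fault: restart the local Nomad agent, as an unattended
	// upgrade would.
	if out, err := exec.Command("systemctl", "restart", "nomad").CombinedOutput(); err != nil {
		log.Fatalf("restarting nomad: %v (%s)", err, out)
	}

	// Poll until the runners report as running again, or give up.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if ok, err := allRunning(client); err == nil && ok {
			log.Println("all allocations are running again")
			return
		}
		time.Sleep(5 * time.Second)
	}
	log.Fatal("allocations did not recover before the deadline")
}
```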

Since we have two dedicated issues for #673 and #587, this issue is only about the sequential restart of Nomad agents and the resulting rescheduling behavior. In our past experience, restarting Nomad agents sequentially is more fault tolerant than restarting them all simultaneously and produces fewer errors. That's why we also included a rolling restart of Nomad in our Ansible pipeline.

Since the upstream issue created for #673 is not really about simultaneous restarts (but rather about restarting Nomad in general with the batch jobs we use), this issue currently does not provide many additional insights. To keep pending issues clearly visible, and since we expect #673 to improve the situation anyway, we are closing this one.