Nomad Rescheduling
Closed this issue · 2 comments
On 2024-07-23 (02:21, agent 2) and 2024-07-24 (02:34, agent 2), we observed that Nomad did not (successfully) reschedule runners. On both days, this behavior was triggered by an unattended upgrade of docker-ce.
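One way to avoid this trigger entirely (a sketch, not something we have in place yet; the package names are assumptions based on the affected upgrade) would be to exclude the Docker packages from unattended upgrades:

```text
# /etc/apt/apt.conf.d/50unattended-upgrades (excerpt, hypothetical)
# Exclude Docker-related packages so they are only upgraded manually,
# e.g. during a maintenance window with a controlled Nomad restart.
Unattended-Upgrade::Package-Blacklist {
    "docker-ce";
    "docker-ce-cli";
    "containerd.io";
};
```

Alternatively, `apt-mark hold docker-ce` pins the package against all automatic upgrades, at the cost of also blocking security updates until the hold is lifted.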
In the syslogs, we see:
- Docker starting to restart
- Nomad starting to restart gracefully
- Docker warning: `ShouldRestart failed, container will not be restarted`
- Docker: `ignoring event topic=/tasks/delete`
- Containerd warning: `runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n`
- Systemd remarking: `Found left-over process 1680662 (nomad) in control group while starting unit. Ignoring. This usually indicates unclean termination of a previous run, or service implementation deficiencies.`
- Nomad repeatedly throwing: `error reading from server: EOF`
In #612, we are currently investigating whether batch jobs restart/reschedule at all.
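For reference, rescheduling of batch jobs is governed by Nomad's `reschedule` block at the job or group level; batch jobs default to a single reschedule attempt per 24 hours, which may explain runners not coming back after a failed restart. A sketch of a more aggressive policy (the job and group names are placeholders, and the values are illustrative, not our production settings):

```hcl
job "runner" {
  type = "batch"

  group "runner" {
    # Hypothetical reschedule policy; batch defaults are
    # attempts = 1, interval = "24h", which is easy to exhaust
    # when a node-level restart kills the allocation.
    reschedule {
      attempts       = 5
      interval       = "1h"
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "10m"
      unlimited      = false
    }
  }
}
```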
Since we have two dedicated issues, #673 and #587, this issue is only about the sequential restart of Nomad agents together with the rescheduling behavior. Restarting Nomad sequentially is more fault-tolerant than a simultaneous restart and, in our past experience, produces fewer errors. That is why we also included a rolling restart of Nomad in our Ansible pipeline.
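The rolling restart can be expressed with Ansible's `serial` keyword, which processes one host per batch. This is a minimal sketch; the inventory group `nomad_agents` and the retry counts are assumptions, not our actual playbook:

```yaml
# Hypothetical rolling-restart play: one Nomad agent at a time.
- hosts: nomad_agents   # assumed inventory group name
  serial: 1             # restart one node per batch
  tasks:
    - name: Restart Nomad gracefully
      ansible.builtin.systemd:
        name: nomad
        state: restarted

    - name: Wait until the local agent responds again
      ansible.builtin.command: nomad agent-info
      register: agent_info
      changed_when: false
      retries: 10
      delay: 15
      until: agent_info.rc == 0
```

Waiting for `nomad agent-info` to succeed before moving on keeps at most one agent down at any time, which matches the observation that sequential restarts are gentler than simultaneous ones.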
Since the upstream issue created for #673 is not really about simultaneous restarts (but rather about restarting Nomad in general with the batch jobs we use), this issue currently does not provide many additional insights. To keep better visibility of pending issues, and since we expect that #673 will improve the situation anyway, we are closing this one.