`error waiting for container: unexpected EOF` failing builds
claraberendsen opened this issue · 12 comments
Description
We have been frequently encountering the error `error waiting for container: unexpected EOF` on ci.ros2.org. It seems to be an issue with memory availability for the container.
Example build
https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2664/console
Statistics on the issue for the first two weeks of May:
nightly_linux_debug: 20% of the time
nightly_linux-aarch64_release: 13.33% of the time
nightly_linux-aarch64_repeated: 13.33% of the time
nightly_linux_release: 7.14% of the time
Container EOF error statistics
15 days
job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
---|---|---|---|---|---|
nightly_linux_debug | 2023-05-19 | 2023-05-10 | 16 | 5 | 31.25 |
nightly_linux_foxy_coverage | 2023-05-18 | 2023-05-07 | 16 | 2 | 12.5 |
nightly_linux-rhel_repeated | 2023-05-18 | 2023-05-06 | 21 | 5 | 23.81 |
nightly_linux-rhel_release | 2023-05-18 | 2023-05-12 | 15 | 4 | 26.67 |
nightly_linux-rhel_debug | 2023-05-18 | 2023-05-06 | 15 | 3 | 20.0 |
Humble cyclonedds | 2023-05-18 | 2023-05-09 | 6 | 2 | 33.33 |
Humble connext | 2023-05-18 | 2023-05-06 | 5 | 2 | 40.0 |
Foxy connext | 2023-05-14 | 2023-05-14 | 5 | 1 | 20.0 |
Rolling cyclonedds | 2023-05-13 | 2023-05-13 | 17 | 1 | 5.88 |
Rolling connext | 2023-05-13 | 2023-05-13 | 16 | 1 | 6.25 |
nightly_linux_release | 2023-05-12 | 2023-05-12 | 24 | 1 | 4.17 |
nightly_linux-aarch64_release | 2023-05-11 | 2023-05-10 | 15 | 2 | 13.33 |
nightly_linux-aarch64_repeated | 2023-05-10 | 2023-05-10 | 15 | 1 | 6.67 |
5 days
job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
---|---|---|---|---|---|
nightly_linux_debug | 2023-05-19 | 2023-05-18 | 6 | 2 | 33.33 |
nightly_linux_foxy_coverage | 2023-05-18 | 2023-05-18 | 6 | 1 | 16.67 |
nightly_linux-rhel_repeated | 2023-05-18 | 2023-05-17 | 6 | 2 | 33.33 |
nightly_linux-rhel_release | 2023-05-18 | 2023-05-16 | 5 | 3 | 60.0 |
nightly_linux-rhel_debug | 2023-05-18 | 2023-05-18 | 5 | 1 | 20.0 |
Humble cyclonedds | 2023-05-18 | 2023-05-18 | 2 | 1 | 50.0 |
Humble connext | 2023-05-18 | 2023-05-18 | 2 | 1 | 50.0 |
Foxy connext | 2023-05-14 | 2023-05-14 | 2 | 1 | 50.0 |
Today two aarch64 nightlies failed with this issue: nightly_linux-aarch64_debug #2384 and nightly_linux-aarch64_repeated #2371. Since the agents weren't destroyed afterwards, the logs were preserved and a detailed investigation was possible.
Investigation
1. Rule out OOM killer stopping the docker container.
The command `journalctl -k` was run to check whether the OS OOM killer had terminated either the daemon or the Docker process. No indication of OOM activity was found in the log.
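For reference, a minimal sketch of how such a check might look; the time window below is a placeholder and not taken from the actual investigation:

```sh
# Sketch: search the kernel log for OOM-killer activity around the failure window.
# Timestamps are illustrative placeholders.
journalctl -k --since "2023-06-06 06:00" --until "2023-06-06 07:00" \
  | grep -iE "out of memory|oom-kill|killed process" \
  || echo "no OOM events found in this window"
```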
2. Check the docker.service logs for a better indication of what was happening
We ran the command `journalctl -u docker.service`. In the Docker logs we could see that systemd was causing the Docker daemon to restart. See the following lines:
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.277386638Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.781644906Z" level=info msg="Daemon shutdown complete"
And then this line:
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802888332Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Similar logs can be found on the other agent where the nightly was run.
Note that, on each agent, the timing of these logs is close to when the error occurs in the build. This is a good indication that the errors are caused by the restart of the docker service.
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.277386638Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.430955206Z" level=info msg="ignoring event" container=88acb4d47b1e53cb20be3fa2c2c3c43472b51f9fe05b9d81a4e7a4aacf46ad8a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: run-docker-netns-d1893231c907.mount: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: var-lib-docker-overlay2-1c135b999632ae831e47372d4458c53f1bf556a3062789f60394c15a953ed313-merged.mount: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.781644906Z" level=info msg="Daemon shutdown complete"
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: docker.service: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.834629993Z" level=info msg="Starting up"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.835519557Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839813954Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839843632Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839870011Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839881745Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843368324Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843398814Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843419565Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843431356Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083075793Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083108146Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083115251Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083282143Z" level=info msg="Loading containers: start."
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.220468635Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.551039037Z" level=info msg="Loading containers: done."
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.588748768Z" level=info msg="Docker daemon" commit="20.10.21-0ubuntu1~20.04.2" graphdriver(s)=overlay2 version=20.10.21
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.589147012Z" level=info msg="Daemon has completed initialization"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.609178145Z" level=info msg="API listen on /run/docker.sock"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.976374499Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.977224089Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.977467411Z" level=info msg="Daemon shutdown complete"
Jun 06 06:44:14 ip-10-0-1-221 systemd[1]: docker.service: Succeeded.
Jun 06 06:44:14 ip-10-0-1-221 systemd[1]: docker.socket: Succeeded.
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.285510408Z" level=info msg="Starting up"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.286470569Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287077320Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287100057Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287121070Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287131712Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288953180Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288979888Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288998989Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.289011009Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362675376Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362707343Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362715433Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362883835Z" level=info msg="Loading containers: start."
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.495188699Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.535984330Z" level=info msg="Loading containers: done."
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.555383161Z" level=info msg="Docker daemon" commit="20.10.21-0ubuntu1~20.04.2" graphdriver(s)=overlay2 version=20.10.21
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.555446997Z" level=info msg="Daemon has completed initialization"
Jun 06 06:44:15 ip-10-0-1-221 systemd[1]: var-lib-docker-overlay2-check\x2doverlayfs\x2dsupport223685642-merged.mount: Succeeded.
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.580350166Z" level=info msg="API listen on /run/docker.sock"
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802841899Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802888332Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802922719Z" level=info msg="blockingPicker: the picked transport is not ready, loop back to repick" module=grpc
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802963933Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802999568Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
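For anyone reproducing this triage, a hedged sketch of narrowing the `docker.service` journal to the window around a failure (timestamps are illustrative):

```sh
# Sketch: show docker.service entries around the failure time and highlight
# the daemon shutdown/startup transitions seen above.
journalctl -u docker.service --since "2023-06-06 06:40" --until "2023-06-06 06:50" \
  | grep -E "Processing signal 'terminated'|Daemon shutdown complete|Starting up"
```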
3. Check journalctl logs in general
Running `journalctl` around the time of the failures, we can see the following a couple of minutes before the error:
Jun 06 06:25:01 ip-10-0-1-221 CRON[428020]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Jun 06 06:25:01 ip-10-0-1-221 CRON[428018]: pam_unix(cron:session): session closed for user root
Jun 06 06:34:19 ip-10-0-1-221 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 06 06:35:56 ip-10-0-1-221 dbus-daemon[592]: [system] Activating via systemd: service name='org.freedesktop.PackageKit' unit='packagekit.service' requested by ':1.28' (uid=0 pid=444913 comm="/usr/bin/gdbus call --system --dest org.freedeskto" label="unconfined")
This triggers a restart of most daemons by systemd. The `Daily apt upgrade` log entry indicates that an `unattended-upgrade` is being run.
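To double-check that this systemd activity actually corresponds to an `unattended-upgrades` run, something along these lines could be used (the unit name and log path are the Ubuntu defaults and the time window is illustrative):

```sh
# Sketch: correlate the "Daily apt upgrade" unit with the unattended-upgrades log.
journalctl -u apt-daily-upgrade.service --since "2023-06-06 06:30" --until "2023-06-06 07:00"
tail -n 50 /var/log/unattended-upgrades/unattended-upgrades.log
```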
Conclusion
It would seem that the latest issues with `error waiting for container: unexpected EOF` are related to the timing of the `unattended-upgrades` runs, and not to an OOM or CPU issue as initially considered.
Even though we cannot rule out that an OOM issue may be causing some of them, there is a clear periodicity to when jobs fail, affecting mostly nightlies and at roughly the same time of day.
As an example, `nightly_linux-aarch64_debug` failed on two different agents but at similar times of day in the last two runs, #2383 and #2384.
Further investigation / Unanswered questions
- Why does it fail on normal agents and not on the 2xlarge instances? One avenue of research is to investigate how the execution time of `unattended-upgrades` is configured (see the sketch below).
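As a starting point for that question, a sketch of how the timer configuration could be inspected on an agent (the unit names are the usual Ubuntu defaults):

```sh
# Sketch: show when the apt timers are scheduled to fire on this agent.
# OnCalendar plus RandomizedDelaySec determine the (randomized) execution time.
systemctl list-timers apt-daily.timer apt-daily-upgrade.timer
systemctl cat apt-daily-upgrade.timer
```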
Today another nightly failed with this error: nightly_linux-rhel_debug. In the logs we can see the time of failure:
06:52:51 time="2023-06-07T06:52:52Z" level=error msg="error waiting for container: unexpected EOF"
If we ssh into the instance, we can see that the unattended upgrade was run during that time window as well:
Jun 07 06:40:39 ip-10-0-1-235 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 07 06:53:35 ip-10-0-1-235 systemd[1]: Finished Daily apt upgrade and clean activities.
aarch64 builds have been green since the weekend.
Waiting for: https://github.com/osrf/chef-osrf/pull/204
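For context, a common way to keep unattended upgrades from interfering with builds is to disable the apt timers on the agents. This is only a sketch of that general approach and not necessarily what the linked PR implements:

```sh
# Sketch of one possible mitigation (not necessarily what chef-osrf#204 does):
# stop and mask the apt timers so unattended upgrades cannot run mid-build.
sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer
sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer
# Alternatively, turn off unattended-upgrades via apt's periodic configuration:
echo 'APT::Periodic::Unattended-Upgrade "0";' | sudo tee /etc/apt/apt.conf.d/99-disable-unattended-upgrades
```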
This issue has been mentioned on ROS Discourse. There might be relevant details there:
https://discourse.ros.org/t/ros-infrastructure-updates-from-june-2023/32258/1
I have encountered a similar issue that says `error waiting for container: unexpected EOF`:
https://build.ros2.org/job/Hbin_uJ64__aws_sdk_cpp_vendor__ubuntu_jammy_amd64__binary/16/display/redirect
https://build.ros2.org/job/Rbin_ujv8_uJv8__aws_sdk_cpp_vendor__ubuntu_jammy_arm64__binary/11/display/redirect
@wep21 thanks for reporting! The PR with the fix has gotten lost in the weeds of other things, but given that the error is appearing again I have re-prioritized it and added it to my backlog for this week.
This, however, is most likely unrelated to the general `unexpected EOF` error.
Fixing this issue in https://github.com/osrf/chef-osrf/pull/212
This is not completely closed, as we are still missing the new images for build.ros2.org.
Finally fixed by: https://github.com/osrf/osrf-terraform/pull/147