No network connectivity in some docker containers after upgrade to 1153.0.0
cdwertmann opened this issue · 19 comments
Issue Report
Bug
After upgrading from stable (1068.10.0) to alpha (1153.0.0), some freshly submitted fleet services start containers that have no network connectivity. From within such a container I cannot ping the docker bridge (the default gateway address), any other container, or any host interface. A simple docker restart of the affected container resolves the issue.
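A quick way to check whether a given container is affected (my-service is a placeholder name, 172.17.42.1 is the bridge address from my --bip setting below, and this assumes the image ships ping):
$ docker exec my-service ping -c 1 -W 2 172.17.42.1    # times out in an affected container
$ docker restart my-service
$ docker exec my-service ping -c 1 -W 2 172.17.42.1    # succeeds after the restart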
CoreOS Version
$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1153.0.0
VERSION_ID=1153.0.0
BUILD_ID=2016-08-27-0408
PRETTY_NAME="CoreOS 1153.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
Environment
QEMU/KVM on OpenStack
Expected Behavior
All docker containers have network connectivity, as it has always been in the past.
Actual Behavior
A few containers cannot reach any hosts and are not even able to ping the default gateway (docker0).
Reproduction Steps
- upgrade to CoreOS alpha 1153.0.0
- submit your usual containers
- find the container that does not have network access
- restart the container and see that it now does have access
Other Information
This happens across different docker images that are based on different distributions, so I don't think it is related to the image. docker inspect shows no difference before (when networking is down) and after a restart of the container (when networking works again).
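For reference, the comparison was roughly this (my-service again being a placeholder container name):
$ docker inspect my-service > /tmp/inspect-broken.json     # while the container has no connectivity
$ docker restart my-service
$ docker inspect my-service > /tmp/inspect-working.json    # after the restart
$ diff /tmp/inspect-broken.json /tmp/inspect-working.json  # network settings come out identical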
I'm passing these options to docker:
$ cat /etc/systemd/system/docker.service.d/50-insecure-registry.conf
[Service]
Environment='DOCKER_OPTS=--bip 172.17.42.1/16 --dns 172.17.42.1 --dns-search=service.consul --insecure-registry="0.0.0.0/0"'
I was able to reproduce this with Docker 1.11.2 on Linux 4.7.1 and 4.6.3, but was unable to reproduce with Docker 1.10.3.
@crawford did you reproduce generically? or only in QEMU/KVM?
@bryanlatten I've been reproducing it with QEMU. I haven't tried other platforms.
I have the same issue: Docker doesn't attach the veth interface to the docker0 bridge. Restarting the daemon helps, as does manually attaching the interface by running brctl addif docker0 veth. We started to see this issue after upgrading to Docker 1.12.1 on a bare-metal CoreOS server.
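In case it helps others, the manual recovery looks roughly like this (vethXXXXXXX stands for whichever host-side veth belongs to the broken container):
$ brctl show docker0                       # the container's veth is missing from the interfaces column
$ ip -o link show type veth                # list host-side veths to spot the one without a master
$ sudo brctl addif docker0 vethXXXXXXX     # reattach it; connectivity comes back immediately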
docker info:
Containers: 38
Running: 20
Paused: 0
Stopped: 18
Images: 60
Server Version: 1.12.1
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null bridge host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: selinux
Kernel Version: 4.7.3-coreos
Operating System: CoreOS 1164.1.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 377.9 GiB
Name: host-1
ID: 655L:2CC7:MOFH:H2NQ:UKZG:DAEO:BVIR:3IAY:HHSN:UPBF:6EMS:NT7M
Docker Root Dir: /var/lib/docker
os-release:
NAME=CoreOS
ID=coreos
VERSION=1164.1.0
VERSION_ID=1164.1.0
BUILD_ID=2016-09-10-0834
PRETTY_NAME="CoreOS 1164.1.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
Linking moby/moby#26492
This is what I have observed so far from trying to test and narrow it down to a problematic component: Docker containers' network links randomly fail to have their master set. This happens with Docker in CoreOS alpha and beta. The ip link command can be used on the host to set the master and restore networking for the containers.
It continues to fail when booting a kernel from stable and the user space from alpha or beta. It does not fail with alpha or beta kernels and stable user spaces. It fails whether Docker is built with Go 1.6 or 1.7. It fails with all Project Atomic patches applied. It fails when patching libnetwork to just use an ioctl instead of netlink to set the master. The contents of the netlink request in LinkSetMasterByIndex are essentially identical between working and failing containers. Calling LockOSThread around the syscalls and logging the thread's network namespace shows no indication of the Go runtime leaking namespaces.
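For reference, the manual recovery from the host is just iproute2, which sends the same kind of netlink request that LinkSetMasterByIndex builds (vethXXXXXXX is a placeholder for the affected container's host-side veth):
$ ip -o link show type veth | grep -v 'master docker0'   # veths that never got their master set
$ sudo ip link set vethXXXXXXX master docker0            # set the master; the container recovers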
I'll add that we're experiencing the same issue using CoreOS Beta (1153.4.0) running in AWS:
$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1153.4.0
VERSION_ID=1153.4.0
BUILD_ID=2016-09-10-0107
PRETTY_NAME="CoreOS 1153.4.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
I used a container that simply curls the AWS metadata service (http://169.254.169.254/latest/meta-data) and then wrote a script to run it over and over. The result was that about 60% of the time the container could not curl the above endpoint while the remaining 40% could. I could not detect any pattern, as identical docker run invocations executed within seconds of each other could produce completely different results.
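The test loop was nothing fancier than something like this (curl-test is a placeholder for any image with curl installed):
$ for i in $(seq 1 50); do
    docker run --rm curl-test curl -sf -m 5 http://169.254.169.254/latest/meta-data/ >/dev/null \
      && echo "run $i: ok" || echo "run $i: no connectivity"
  done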
Proof-of-concept fix is here: dm0-/libnetwork@4343ba4c21f1a121f9e867efda3231a61dc5565e. Waiting for confirmation from upstream.
I believe I have a (rather unfortunate) workaround for people who can't run a patched Docker: stop/mask systemd-networkd.service. Obviously, keep in mind the implications of stopping your network manager. You'd only want to do this after it's initialized your real interfaces. I'll see if there is a less destructive way to work around this.
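For completeness, the workaround amounts to the following, and the unmask/start pair reverts it once a fixed Docker ships:
$ sudo systemctl stop systemd-networkd.service
$ sudo systemctl mask systemd-networkd.service
# to revert later:
$ sudo systemctl unmask systemd-networkd.service
$ sudo systemctl start systemd-networkd.service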
This was fixed with coreos/coreos-overlay@874c1b8 and coreos/docker#29 and should roll out in the next Alpha. Assuming nothing goes wrong, we'll backport this to Docker 1.11.2 in Beta in the coming weeks.
This is now available in Stable. /cc @bryanlatten
I'm interested in why you don't modify the systemd-networkd config to avoid matching on these interfaces?
Well, it's not possible to exactly specify what you don't want, but you could write some rules like 'Name=eth*' to match what you do want. Maybe too hard to cover all bases?
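To illustrate, the kind of rule being suggested would be a unit along these lines (purely hypothetical; the file name and DHCP setting are only for the example):
cat /etc/systemd/network/10-ethernet.network
[Match]
Name=eth*

[Network]
DHCP=yes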
There isn't a way to match all ethernet devices. The names use the persistent naming scheme (so they won't be eth*).
At the moment this seems to be a regression between CoreOS stable 1122.3.0 and 1185.3.0 that breaks Weave for some users.
Should we have a separate issue to track that somewhere?
I note that systemd/systemd#4228 has now been merged and the upstream Docker PR moby/libnetwork#1450 was rejected.
Do you have a CoreOS issue to make use of the new systemd feature?