Intermittent responses for Kubernetes service endpoints (postmortem)
Closed this issue · 4 comments
Follow up to #63.
## Beginning of Problems
At some point in time a known-good deployment stopped succeeding on newly created clusters. This was caused by several disparate issues across several versions/configurations/components.
- Init containers would not progress because service availability checks would fail
- A service would appear to exist (`kubectl get svc`) and point at pods with the correct endpoints (`kubectl describe service`)
- Attaching to pods directly for inspection would show them operating as expected
- Sometimes parts would succeed, but not uniformly and with no clear pattern
The first step to check whether a service is working correctly is actually a simple DNS check (`nslookup service`). By chance, this would often appear to function as expected, suggesting the problem must be elsewhere (not necessarily with Kubernetes).
However, not to bury the lede: running `nslookup` in a loop would later expose that it was timing out sporadically. That is the sort of thing that makes a bug sinister, as it misdirects debugging efforts away from the real problem.
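Running the lookup in a loop is what finally exposed the sporadic timeouts. A minimal sketch of such a probe (the `probe_loop` helper name and the iteration count are mine, not from any official tooling):

```shell
# probe_loop CMD N: run CMD N times and report how many invocations failed.
# From inside a pod you would use something like:
#   probe_loop "nslookup kubernetes.default" 50
# A healthy cluster should report zero failures; intermittent DNS shows up
# as a nonzero, non-deterministic failure count.
probe_loop() {
  cmd=$1; n=$2; fails=0; i=1
  while [ "$i" -le "$n" ]; do
    $cmd >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "failures: $fails/$n"
}

probe_loop true 5   # demo with a command that always succeeds
```

A single successful lookup proves very little here; only the failure rate over many iterations does.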
## Known KubeDNS Issues Encountered
- **Secrets volume & SELinux permissions**
  The SELinux context was missing `svirt_sandbox_file_t` on the secrets volume, so from the perspective of the KubeDNS pod `/var/run/secrets/kubernetes.io/serviceaccount/` was mangled and it could not, in turn, use it to connect to the master.
- **Secrets volume got stale**
  The kube-controller is responsible for injecting the secrets volume into pods and keeping it up to date. There were/are known bugs where it would fail to do so. As a result, KubeDNS would mysteriously stop working because its tokens for connecting to the master had grown stale. (This sort of thing: kubernetes/kubernetes#24928)
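A quick way to see whether a pod's mounted serviceaccount credentials have gone bad is to check them from inside the affected pod. A sketch (the `check_token` helper name is mine; the master address in the comment is the one from the KubeDNS logs below):

```shell
# check_token DIR: report whether the serviceaccount token mounted at DIR
# is present and non-empty; a stale or unrefreshed secrets volume fails this.
check_token() {
  if [ -s "$1/token" ]; then
    echo "token present"
  else
    echo "token missing or empty"
  fi
}

check_token /var/run/secrets/kubernetes.io/serviceaccount

# If the token is present but KubeDNS still cannot reach the master,
# exercise the credentials directly:
#   SA=/var/run/secrets/kubernetes.io/serviceaccount
#   curl -s --cacert "$SA/ca.crt" \
#        -H "Authorization: Bearer $(cat "$SA/token")" \
#        https://10.16.0.1:443/version
```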
- **Typo**
  The official skydns-rc.yaml at some point had a typo: `--domain=` was missing the trailing dot.
- **Scalability**
  It is now recommended to scale the number of KubeDNS pods proportionally to the number of nodes in the cluster.
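The upstream approach to this is a horizontal autoscaler that grows the number of DNS replicas linearly with cluster size. The arithmetic can be sketched as follows (the nodes-per-replica value of 16 is illustrative, not an official default):

```shell
# dns_replicas NODES NODES_PER_REPLICA: ceiling division, i.e. one extra
# DNS replica for every NODES_PER_REPLICA nodes in the cluster.
dns_replicas() {
  nodes=$1; per=$2
  echo $(( (nodes + per - 1) / per ))
}

dns_replicas 40 16   # a 40-node cluster gets 3 DNS replicas at 16 nodes/replica
```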
These problems would crop up and get resolved yet errors would stubbornly persist.
```shell
kubectl logs $(kubectl --namespace=kube-system get pods | tail -n1 | cut -d' ' -f1) --namespace=kube-system --container kubedns
```
```
I0829 20:19:21.696107 1 server.go:94] Using https://10.16.0.1:443 for kubernetes master, kubernetes API: <nil>
I0829 20:19:21.699491 1 server.go:99] v1.4.0-alpha.2.1652+c69e3d32a29cfa-dirty
I0829 20:19:21.699518 1 server.go:101] FLAG: --alsologtostderr="false"
I0829 20:19:21.699536 1 server.go:101] FLAG: --dns-port="10053"
I0829 20:19:21.699548 1 server.go:101] FLAG: --domain="cluster.local."
I0829 20:19:21.699554 1 server.go:101] FLAG: --federations=""
I0829 20:19:21.699560 1 server.go:101] FLAG: --healthz-port="8081"
I0829 20:19:21.699565 1 server.go:101] FLAG: --kube-master-url=""
I0829 20:19:21.699571 1 server.go:101] FLAG: --kubecfg-file=""
I0829 20:19:21.699577 1 server.go:101] FLAG: --log-backtrace-at=":0"
I0829 20:19:21.699584 1 server.go:101] FLAG: --log-dir=""
I0829 20:19:21.699600 1 server.go:101] FLAG: --log-flush-frequency="5s"
I0829 20:19:21.699607 1 server.go:101] FLAG: --logtostderr="true"
I0829 20:19:21.699613 1 server.go:101] FLAG: --stderrthreshold="2"
I0829 20:19:21.699618 1 server.go:101] FLAG: --v="0"
I0829 20:19:21.699622 1 server.go:101] FLAG: --version="false"
I0829 20:19:21.699629 1 server.go:101] FLAG: --vmodule=""
I0829 20:19:21.699681 1 server.go:138] Starting SkyDNS server. Listening on port:10053
I0829 20:19:21.699729 1 server.go:145] skydns: metrics enabled on : /metrics:
I0829 20:19:21.699751 1 dns.go:167] Waiting for service: default/kubernetes
I0829 20:19:21.700458 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0829 20:19:21.700474 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0829 20:19:26.691900 1 logs.go:41] skydns: failure to forward request "read udp 10.32.0.2:49468->172.20.0.2:53: i/o timeout"
```
## Known Kubernetes Networking Issues Encountered
### Initial Checks
Kubernetes imposes the following fundamental requirements on any networking implementation:
- all containers can communicate with all other containers without NAT
- all nodes can communicate with all containers (and vice-versa) without NAT
- the IP that a container sees itself as is the same IP that others see it as
In other words, to make sure networking is not seriously broken/misconfigured check:
- Pods are being created / destroyed
- Pods are able to ping each other
At first blush these looked fine, but pod creation was sluggish (30-60 seconds), and that is a red flag.
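Because raw pass/fail checks hid the problem, timing them is worth the extra line or two. A small helper along these lines (the `timed_check` name and the thresholds are mine) would have flagged the sluggish pod creation immediately:

```shell
# timed_check LIMIT_SECONDS CMD...: run CMD, measure elapsed wall time,
# and flag the run when it exceeds the limit. The 30-60s pod creation
# seen here would trip e.g.:
#   timed_check 10 kubectl run slow-test --image=busybox --restart=Never -- true
timed_check() {
  limit=$1; shift
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -gt "$limit" ]; then
    echo "SLOW (${elapsed}s, limit ${limit}s)"
  else
    echo "ok (${elapsed}s)"
  fi
}

timed_check 5 true   # demo with an instant command
```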
### Missing Dependencies
As described in #62, at some version the CNI folder started missing binaries.
More undocumented dependencies (#64) were found by staring at logs and noting weirdness.
The really important ones are conntrack-tools, socat, and bridge-utils; these are now being pinned down upstream.
The errors were time-consuming to understand because their phrasing often left something to be desired. Unfortunately, there is at least one known false-positive warning (kubernetes/kubernetes#23385).
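A cheap preflight check would have saved much of the log-staring: verify the binaries those packages provide before debugging the network. A sketch (the `check_deps` name is mine; the binary names are the usual ones shipped by those packages):

```shell
# check_deps BIN...: report which required binaries are on the PATH.
check_deps() {
  for bin in "$@"; do
    if command -v "$bin" >/dev/null 2>&1; then
      echo "$bin: found"
    else
      echo "$bin: MISSING"
    fi
  done
}

# conntrack, socat, and brctl come from conntrack-tools, socat, and
# bridge-utils respectively.
check_deps conntrack socat brctl
```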
### Cluster CIDR overlaps
```
--cluster-cidr="": CIDR Range for Pods in cluster.
--service-cluster-ip-range="": CIDR Range for Services in cluster.
```
In my case, services got a /16 starting at 10.0.0.0, and the cluster-cidr got a /16 at 10.244.0.0.
The service CIDR is routable because kube-proxy is constantly writing iptables rules on every minion.
For Weave in particular, `--ipalloc-range` needs to be passed and must exactly match what is given to the Kubernetes `cluster-cidr`.
Whatever your network overlay, it must not clobber the service range!
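Since an overlap between the service range and the pod range is this destructive, it is worth checking mechanically rather than by eyeball. A self-contained sketch in plain shell (the function names are mine):

```shell
# ip2int DOTTED_QUAD: convert an IPv4 address to a 32-bit integer.
ip2int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidrs_overlap A/LEN B/LEN: succeed (exit 0) when the two ranges overlap,
# by comparing both networks under the shorter of the two prefix masks.
cidrs_overlap() {
  ip1=${1%/*}; len1=${1#*/}
  ip2=${2%/*}; len2=${2#*/}
  min=$(( len1 < len2 ? len1 : len2 ))
  mask=$(( (0xFFFFFFFF << (32 - min)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "$ip1") & mask )) -eq $(( $(ip2int "$ip2") & mask )) ]
}

# The ranges from this cluster: services on 10.0.0.0/16, pods on 10.244.0.0/16.
if cidrs_overlap 10.0.0.0/16 10.244.0.0/16; then
  echo "overlap: pick different ranges"
else
  echo "no overlap"
fi
```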
### Iptables masquerade conflicts
#### Flannel
If using Flannel, be sure to follow the newly documented instructions:
```
DOCKER_OPTS="--iptables=false --ip-masq=false"
```
Kube-proxy makes extensive use of masquerading rules. Just as an overlay clobbering the service range causes trouble, another component (such as the Docker daemon itself) mucking about with masquerade rules will cause unexpected behavior.
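One way to see who is fighting over masquerading is to count the MASQUERADE rules in the nat table. The helper below only parses text, so the actual `iptables` invocation (which needs root on a node) is left as a comment; the `count_masq` name is mine:

```shell
# count_masq: count MASQUERADE rules fed on stdin. On a node you would run:
#   iptables -t nat -S POSTROUTING | count_masq
# More rules than you expect from kube-proxy and your overlay suggests
# another component (e.g. dockerd without --ip-masq=false) is also writing them.
count_masq() {
  grep -c -- '-j MASQUERADE'
}

printf '%s\n' \
  '-A POSTROUTING -s 10.244.0.0/16 -j MASQUERADE' \
  '-A POSTROUTING -m comment --comment "kubernetes service traffic" -j MASQUERADE' \
  | count_masq   # → 2
```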
#### Weave
Weave was originally, erroneously, started with `--docker-endpoint=unix:///var/run/weave/weave.sock`, which similarly caused unexpected behavior. This flag is extraneous and has to be omitted when used with CNI.
## Final Configuration
### Image
CentOS 7, source_ami: `ami-bec022de`
### Dependencies
SELinux disabled.
Yum installed:
- docker
- etcd
- conntrack-tools
- socat
- bridge-utils
kubernetes_version: 1.4.0-alpha.3
(b44b716965db2d54c8c7dfcdbcb1d54792ab8559)
weave_version: 1.6.1
### 1 Master (172.20.0.78)
A gist of journalctl output shows it boots fine: docker, etcd, kube-apiserver, the scheduler, and the controller all start. The minion registers successfully.
```
$ kubectl get componentstatuses
NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health": "true"}

$ kubectl get nodes
NAME                                        STATUS    AGE
ip-172-20-0-18.us-west-2.compute.internal   Ready     1m
```
### 1 Minion (172.20.0.18)
```shell
kubectl run -i --tty --image concourse/busyboxplus:curl dns-test42-$RANDOM --restart=Never /bin/sh
```
The pod is created (not sluggishly). Multiple pods can ping each other.
### Weave
Weave and weaveproxy are up and running just fine.
```
$ weave status

        Version: 1.6.0 (version 1.6.1 available - please upgrade!)

        Service: router
       Protocol: weave 1..2
           Name: ce:1a:4b:b0:07:6d(ip-172-20-0-18)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 0
    Connections: 0
          Peers: 1
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.244.0.0/16
  DefaultSubnet: 10.244.0.0/16

        Service: proxy
        Address: unix:///var/run/weave/weave.sock

$ weave status ipam
ce:1a:4b:b0:07:6d(ip-172-20-0-18)    65536 IPs (100.0% of total)
```
## Conclusion
Kubernetes is rapidly evolving with many open issues -- there are now efforts upstream to pin down and document the dependencies, along with making the errors and warnings in the logs more user-friendly.
As future versions become less opaque, it will become easier to know which open issue is relevant to your setup, whether an obvious dependency is missing, and what a good setup looks like.
The nominal sanity-check command that currently exists (`kubectl get componentstatuses`) does not go far enough. It might show everything as healthy. Pods might be successfully created. Services might work.
And yet these can all be misleading, as a cluster may still not be entirely healthy.
A useful test I found in the official repo simply tests connectivity (and authentication) to the master. Sluggishness is not tested, and sluggishness, it turns out, is a red flag.
In fact, there's an entire folder of these, but as far as I can tell they are not well documented.
I believe a smoke test that can be deployed against any running cluster to run through a suite of checks and benchmarks (to take into account unexpectedly poor performance) would significantly improve the debugging experience.
This is a great start. Could you please do another thorough edit, and then I'll give my comments over the phone Monday or Tuesday on where you're leaving out the story.
In particular, on your reread, please look at the following things, which are quite distracting:
- Please try to use capitalization consistently.
- Please try to use section headers in a consistent fashion. Preferably, make each section header a complete thought that summarizes the contents of that section. If not, name it something that describes what will be discussed in that section.
- Command text should be inside fenced code blocks, not just blockquoted.
- Write a new conclusion. What is the takeaway from the experience? Why was it hard to debug? What are the specific lessons that would have saved you a week if you could have found this page via Google? Don't include the text `(newly documented)` and not make it a link. What is the `extraneous flag` you're referring to?
Template: I expected `x`, I got `y`, based on `url`, I did `z` to fix the problem.
Possible other template sections:
- Expected Behavior
- Actual Behavior
- Steps to Reproduce
- Resolution / Fix
- Related Issues