hjacobs/kubernetes-failure-stories

Kubernetes the very hard way (by Datadog) contains some lessons

bgrant0607 opened this issue · 3 comments

Thanks. I think I watched it already, probably have to rewatch.. ⏳

My unstructured notes after watching the talk (for future processing 😏):

multiple cloud providers (AWS + 2nd)
self-driven, API-driven

certificates:
refresh certs every 24h (kubelet rotation config sketch after this list)

  • etcd did not reload certs for client connections (fixed upstream)
  • Kubernetes masters don't reload certs (a master needs to be restarted)
  • flaky bootstraps (Vault dependency)
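
Not from the talk (Datadog's flow is Vault-based), but for reference, a minimal sketch of how kubelets can rotate short-lived certificates themselves via KubeletConfiguration instead of requiring restarts:

```yaml
# Generic sketch, not Datadog's Vault-based setup: let the kubelet rotate
# its own short-lived certificates instead of restarting components.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true      # rotate the client certificate before it expires
serverTLSBootstrap: true      # request and rotate the serving certificate via CSRs
```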

runtime containerd:

  • issues with Docker -> moved to containerd
  • many tools assume Docker
  • shims sometimes hang and require kill -9

health monitor on GKE

  • restart docker if docker ps hangs
  • kubelet restart

network overlays

  • using native pod routing (CNI plugin by Lyft)
  • hard to debug
  • good relationship with the devs of the Lyft plugin

ingress

  • trying to achieve native pod routing

kube-proxy

  • lots of iptables rules
  • decided to go with IPVS (config sketch below)
  • IPVS was too good to be true: no graceful termination
  • mostly fixed (fine in 1.12?)
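
A minimal sketch of switching kube-proxy to IPVS mode (scheduler choice is illustrative, not from the talk):

```yaml
# Sketch: kube-proxy in IPVS mode instead of iptables.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; other IPVS schedulers are available
```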

IPv6 and DNS

  • race condition in the conntrack code
  • lookups sometimes take 5s (dnsConfig workaround sketch below)
  • Alpine's musl uses a parallel resolver
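
A common workaround for the 5s conntrack-related DNS stalls is to tweak the pod's resolver options via dnsConfig; this is a generic sketch (names/images are placeholders), and it only helps glibc-based images, since musl/Alpine ignores these options:

```yaml
# Sketch: avoid the conntrack race by not sending parallel A/AAAA queries
# over the same socket. Only effective for glibc-based images.
apiVersion: v1
kind: Pod
metadata:
  name: dns-workaround-example        # hypothetical pod
spec:
  dnsConfig:
    options:
      - name: single-request-reopen
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
```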

cloud integrations

  • different LB behaviors
  • magically disappearing AWS instances
  • GCE-specific code in the Cloud CIDR Allocator
  • no docs for aws.go (only comments)

ecosystem

  • almost never tested on large clusters
  • kube-state-metrics: 100MB payload

scaling 100 -> 1000

  • API server under high load: CPU, "TargetRAM" sizing
  • controller-manager/scheduler impossible to split out
  • etcd imbalance due to long-lived connections
  • CoreDNS issues (OOM)

create 200 deployments

  • etcd latency skyrockets
  • main issues: apiserver sizing and traffic imbalance

31m: footguns: DaemonSet

  • DaemonSet: high DDoS risk (API server and cloud rate limits)
  • outage: permission for the account to read from the image bucket was removed; imagePullPolicy: Always on a DaemonSet caused 9k image pulls/s (sketch after this list)
  • API rate limits (429 Too Many Requests)
  • DaemonSet rollouts can stop updating pods and get stuck
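
A hedged sketch of the less explosive defaults for a DaemonSet (all names/images are placeholders, not from the talk): pinned tag, no forced pulls, gradual rollout.

```yaml
# Sketch: avoid DaemonSet-induced registry/API stampedes across every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                      # placeholder name
spec:
  selector:
    matchLabels:
      app: node-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%               # roll nodes gradually, not all at once
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
        - name: agent
          image: example.com/agent:v1.2.3   # pinned tag, placeholder image
          imagePullPolicy: IfNotPresent     # avoid one pull per pod restart
```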

StatefulSets

  • local volumes: a new node with the same name collides (hash function uses the node name)
  • problem if you replace a broken node
  • solution: put a UUID in the mount path (PV sketch below)
  • trying to freeze a Kafka process, containerd first freezes the whole cgroup
  • could not freeze Kafka process because of IO
  • problem: volumes would be created in the wrong zone
  • scheduling of StatefulSets (4 current, 5 desired, nothing happening), pod 3 was missing
  • a StatefulSet will not create pods with a higher ordinal while a previous pod is in CrashLoopBackOff
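
The "UUID in the mount path" fix might look roughly like this local PersistentVolume sketch; path, UUID and node name are made up for illustration:

```yaml
# Sketch: embed a UUID in the local volume path so a replacement node with
# the same hostname cannot collide with the old volume's identity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-3f9c2b7e               # placeholder name
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage
  local:
    path: /mnt/disks/3f9c2b7e-1d4a-4c8e-9b2f-7a6e5d4c3b2a   # UUID, not the node name
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-a1"]       # placeholder node
```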

Cargo culting

  • Stack Overflow pattern: trap TERM/INT, sleep infinity & wait (example below)
  • people not familiar with containers
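
The copied pattern typically looks roughly like the sketch below (shown only to illustrate what gets cargo-culted, not as a recommendation; all names/images are placeholders):

```yaml
# Sketch of the Stack Overflow entrypoint pattern people copy verbatim:
# trap TERM/INT, background the real work, then wait on `sleep infinity`.
apiVersion: v1
kind: Pod
metadata:
  name: cargo-cult-example              # hypothetical pod
spec:
  containers:
    - name: app
      image: example.com/app:latest     # placeholder image
      command: ["/bin/sh", "-c"]
      args:
        - |
          trap 'exit 0' TERM INT
          /usr/local/bin/app &          # placeholder binary
          sleep infinity &
          wait
```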

Zombies

  • Redis + zombies; exec probes (readinessProbe with an exec command); use tini as PID 1
  • kubelet was killing the probe command (doing a Redis ping)
  • the Redis server was not reaping its children
  • standardizing on Tini as PID 1 (pod spec sketch below)
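
A sketch of the Tini-as-PID-1 plus exec-probe setup; it assumes tini is installed in the image, and names/images are placeholders:

```yaml
# Sketch: run the workload under tini so zombie children get reaped,
# and keep the exec readiness probe cheap.
apiVersion: v1
kind: Pod
metadata:
  name: redis-example                   # placeholder name
spec:
  containers:
    - name: redis
      image: example.com/redis:6        # placeholder image with tini installed
      command: ["tini", "--"]           # tini becomes PID 1 and reaps zombies
      args: ["redis-server"]
      readinessProbe:
        exec:
          command: ["redis-cli", "ping"]
        periodSeconds: 10
        timeoutSeconds: 2
```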

Containers not VMs

  • complex process trees, many open files

native resources

  • NodePort, externalIP
  • removed a load balancer (with an external IP); the cloud provider deleted the LB, and a new node later got the LB's IP, which was still assigned as an externalIP (Service sketch below)
  • internal app on port 443 caused a port conflict: the API server could not bind port 443
  • => moving to native pod routing
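
The externalIP footgun in a nutshell: if a Service pins an externalIP on port 443 and a node later acquires that IP, kube-proxy claims port 443 for that IP on the node, clashing with anything else that wants it. A sketch of the kind of Service involved; the IP and names are made up:

```yaml
# Sketch: an externalIP pinned on port 443, left over from a deleted LB.
apiVersion: v1
kind: Service
metadata:
  name: internal-app                    # placeholder name
spec:
  selector:
    app: internal-app
  ports:
    - port: 443
      targetPort: 8443
  externalIPs:
    - 203.0.113.10                      # example IP formerly owned by the deleted LB
```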

44m: OOMKiller

  • limits set too low trigger a cgroup OOM
  • system OOM
  • people not setting limits and setting requests too low => OOMed by the system, with no idea why something was OOMed (example below)
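
Setting explicit requests and a memory limit at least makes OOM kills attributable to the container's own cgroup rather than a mysterious node-level OOM; a generic sketch with placeholder names/values:

```yaml
# Sketch: explicit requests/limits so memory pressure shows up as a cgroup
# OOM for this container, not an unexplained system OOM kill.
apiVersion: v1
kind: Pod
metadata:
  name: resources-example               # placeholder name
spec:
  containers:
    - name: app
      image: example.com/app:latest     # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          memory: "1Gi"                 # exceeding this OOM-kills only this container
```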

InitContainers

  • pod resources = max(max(initContainers), sum(containers)) (worked example below)
  • LimitRanger also applies
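
A worked example of the effective-request formula: the pod below requests max(1Gi, 256Mi + 256Mi) = 1Gi of memory for scheduling, not 1.5Gi. Names and images are placeholders:

```yaml
# Sketch: effective pod request = max(max(initContainers), sum(containers)),
# per resource. Here memory: max(1Gi, 256Mi + 256Mi) = 1Gi.
apiVersion: v1
kind: Pod
metadata:
  name: init-resources-example          # placeholder name
spec:
  initContainers:
    - name: migrate
      image: example.com/migrate:latest # placeholder image
      resources:
        requests:
          memory: "1Gi"
  containers:
    - name: app
      image: example.com/app:latest     # placeholder image
      resources:
        requests:
          memory: "256Mi"
    - name: sidecar
      image: example.com/sidecar:latest # placeholder image
      resources:
        requests:
          memory: "256Mi"
```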

Future Plans

  • isolate control plane