Kubernetes the very hard way (by Datadog) contains some lessons
bgrant0607 opened this issue · 3 comments
hjacobs commented
Thanks. I think I watched it already, probably have to rewatch.. ⏳
hjacobs commented
My unstructured notes after watching the talk (for future processing 😏):
multiple cloud providers (AWS + 2nd)
self-driven, API driven
certificates:
refresh certs every 24h (rotation config sketch below, for reference)
- etcd did not reload certs for client connections (fixed upstream)
- Kubernetes masters don't reload certs (master needs to be restarted)
- flaky bootstraps (vault dependency)
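For reference (this is not the Vault-based flow described in the talk), upstream kubelet certificate rotation is just two KubeletConfiguration fields; a minimal sketch:

```yaml
# KubeletConfiguration sketch: upstream kubelet cert rotation
# (assumption: shown for reference only, not Datadog's Vault-based flow)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true   # rotate the client certificate as it approaches expiry
serverTLSBootstrap: true   # request serving certificates via the certificates API
```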
runtime containerd:
- issues with Docker -> switched to containerd
- many tools assume Docker
- shims sometimes hang and require kill -9
health monitor on GKE
- restart docker if docker ps hangs
- kubelet restart
network overlays
- using native pod routing (CNI plugin by Lyft)
- hard to debug
- good relation with devs of Lyft plugin
ingress
- trying to achieve native pod routing
kube-proxy
- lots of iptables rules
- decided to go with IPVS (config sketch below)
- IPVS was too good to be true: no graceful termination
- mostly fixed (fine in 1.12?)
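For context, switching kube-proxy from iptables to IPVS is a small configuration change; a minimal sketch (not Datadog's actual config):

```yaml
# kube-proxy configuration sketch: IPVS mode instead of iptables
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; other IPVS schedulers are available
```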
IPv6 and DNS
- race condition in the conntrack code
- sometimes takes 5s (workaround sketch below)
- alpine musl uses parallel resolver
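A commonly used mitigation for the 5s conntrack-race timeouts is the glibc `single-request-reopen` resolver option, set per pod; a sketch (not necessarily what the talk used, and it does not help musl-based Alpine images):

```yaml
# Pod dnsConfig sketch: glibc resolver workaround for the conntrack race
# (assumption: illustrative pod name and image)
apiVersion: v1
kind: Pod
metadata:
  name: dns-workaround-example
spec:
  dnsConfig:
    options:
      - name: single-request-reopen   # avoid parallel A/AAAA queries over the same socket
  containers:
    - name: app
      image: example/app:latest
```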
cloud integrations
- different LB behaviors
- magical disappearing AWS instances
- GCE specific code in Cloud CIDR Allocator
- no docs for aws.go (only comments)
ecosystem
- almost never tested on large clusters
- kube-state-metrics: 100MB payload
scaling 100 -> 1000
- API server high load, CPU, "TargetRAM"
- controller manager/scheduler impossible to split out
- etcd imbalance, long lived connections
- CoreDNS issues (OOM)
create 200 deployments
- etcd latency skyrockets
- main issues: apiserver sizing and traffic imbalance
31m: footguns: DaemonSet
- DaemonSet: high DDoS risk (API server and cloud rate limits)
- outage: permission to read from the bucket was removed from the account, imagePullPolicy: Always on a DaemonSet, 9k image pulls/s (sketch below)
- API rate limits (429 Too many requests)
- DaemonSet rollouts might stop updating pods and get stuck
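The image-pull storm in the outage above comes from `imagePullPolicy: Always` on a DaemonSet: every node re-pulls on every pod (re)start. A hypothetical manifest sketch of the safer settings:

```yaml
# DaemonSet sketch: avoid registry/API hammering from imagePullPolicy: Always
# (assumption: names, image, and values are illustrative)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
        - name: agent
          image: example/agent:1.2.3      # pin a tag or digest
          imagePullPolicy: IfNotPresent   # only pull when the image is missing locally
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%                 # throttle the rollout across nodes
```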
StatefulSets
- local volumes: a new node comes up with the same name (hash function uses the node name)
- problem if you replace a broken node
- solution: put a UUID in the mount path (PV sketch after this list)
- trying to freeze a Kafka process, containerd first freezes the whole cgroup
- could not freeze Kafka process because of IO
- problem: volumes would be created in wrong zone
- scheduling of StatefulSets (4 current, 5 desired, nothing happening), pod 3 was missing
- StatefulSet will not create pods with a higher ordinal if a previous pod is in CrashLoopBackOff
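For the UUID fix above, the idea is to make the on-disk path unique per provisioned disk instead of derived from the (reused) node name; a minimal local PersistentVolume sketch with hypothetical names and values:

```yaml
# Local PersistentVolume sketch: UUID in the mount path so a replacement node
# with the same hostname does not collide with the old volume identity
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kafka-data-6f1c2b3a
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/disks/6f1c2b3a-9d4e-4c8a-b0f1-2a3b4c5d6e7f   # UUID chosen when the disk is prepared
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-a1"]
```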
Cargo culting
- Stack Overflow copy/paste: trap TERM INT, sleep infinity & wait (sketch below)
- people not familiar with containers
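The copy/pasted entrypoint pattern in question looks roughly like this (a sketch of the anti-pattern, not a recommendation; pod name and image are illustrative):

```yaml
# Sketch of the cargo-culted Stack Overflow entrypoint: trap signals,
# background-sleep forever, and wait - copied without understanding PID 1 behavior
apiVersion: v1
kind: Pod
metadata:
  name: cargo-cult-example
spec:
  containers:
    - name: app
      image: ubuntu:22.04
      command: ["/bin/sh", "-c"]
      args:
        - |
          trap 'exit 0' TERM INT
          sleep infinity &
          wait
```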
Zombies
- Redis + zombies, exec probes (readinessProbe with exec command), use tini as PID 1
- kubelet was killing the probe command (doing a Redis ping)
- Redis server not reaping children
- standardizing on using Tini as pid 1
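A sketch of the resulting pattern: tini as PID 1 so it reaps children (e.g. the exec-probe processes the kubelet kills, or children Redis never waits on), plus the exec readiness probe from the notes; image and names are hypothetical:

```yaml
# Pod sketch: tini as PID 1 + exec readiness probe
apiVersion: v1
kind: Pod
metadata:
  name: redis-example
spec:
  containers:
    - name: redis
      image: example/redis-with-tini:6   # assumes an image that ships /tini
      command: ["/tini", "--"]           # tini reaps zombies as PID 1
      args: ["redis-server"]
      readinessProbe:
        exec:
          command: ["redis-cli", "ping"]  # the exec probe from the notes
        periodSeconds: 10
        timeoutSeconds: 2
```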
Containers not VMs
- complex process trees, many open files
native resources
- node port, external IP
- removed a load balancer (exposed via External IP); the cloud provider removed the LB, and a new node got the LB's IP (which was still assigned as an External IP)
- internal app with port 443, port conflict, API server could not bind port 443
- => moving to native pod routing
44m: OOMKiller
- limits too low will trigger cgroup oom
- system oom
- people not setting limits or setting requests too low => OOMed by the system (hard to tell why something was OOMed); sketch below
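A sketch of the resulting guidance: set requests and limits explicitly so an OOM kill happens inside the container's own cgroup (and is attributable to that container) instead of coming from the system OOM killer; values are illustrative:

```yaml
# Container resources sketch: explicit requests and limits
# (assumption: pod name, image, and numbers are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: resources-example
spec:
  containers:
    - name: app
      image: example/app:latest
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          memory: 512Mi   # memory limit => cgroup OOM instead of system OOM
```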
InitContainers
- pod effective resources = max(max(initContainers), sum(containers)) (worked example below)
- LimitRanger also applies
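Worked example of that formula with hypothetical numbers: one init container requesting 1 CPU / 1Gi plus two app containers requesting 250m / 256Mi each gives an effective pod request of max(1, 0.5) = 1 CPU and max(1Gi, 512Mi) = 1Gi, so the init container dominates scheduling:

```yaml
# Pod sketch: effective request = max(max(initContainers), sum(containers))
# (assumption: names, images, and numbers are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: init-resources-example
spec:
  initContainers:
    - name: migrate
      image: example/migrate:1.0
      resources:
        requests: {cpu: "1", memory: 1Gi}
  containers:
    - name: app
      image: example/app:1.0
      resources:
        requests: {cpu: 250m, memory: 256Mi}
    - name: sidecar
      image: example/sidecar:1.0
      resources:
        requests: {cpu: 250m, memory: 256Mi}
# effective pod request: cpu = max(1, 0.25 + 0.25) = 1, memory = max(1Gi, 512Mi) = 1Gi
```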
Future Plans
- isolate control plane
hjacobs commented
KubeCon talk video: https://www.youtube.com/watch?v=QKI-JRs2RIE