Kubernetes the very hard way (by Datadog) contains some lessons
bgrant0607 opened this issue · 3 comments
hjacobs commented
Thanks. I think I watched it already, probably have to rewatch.. ⏳
hjacobs commented
My unstructured notes after watching the talk (for future processing 😏):
multiple cloud providers (AWS + 2nd)
self-driven, API driven
certificates:
refresh certs every 24h (rotation config sketch below, for reference)
- etcd did not reload certs for client connections (fixed upstream)
- Kubernetes masters don't reload certs (master needs to be restarted)
- flaky bootstraps (vault dependency)
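For reference (this is not the Vault-based flow described in the talk), upstream kubelet certificate rotation is just two KubeletConfiguration fields; a minimal sketch:

```yaml
# KubeletConfiguration sketch: upstream kubelet cert rotation
# (assumption: shown for reference only, not Datadog's Vault-based flow)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true   # rotate the client certificate as it approaches expiry
serverTLSBootstrap: true   # request serving certificates via the certificates API
```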
runtime containerd:
- issues with Docker -> switched to containerd
- many tools assume Docker
- shims sometimes hang and require kill -9
health monitor on GKE
- restart docker if docker ps hangs
- kubelet restart
network overlays
- using native pod routing (CNI plugin by Lyft)
- hard to debug
- good relation with devs of Lyft plugin
ingress
- trying to achieve native pod routing
kube-proxy
- lots of iptables rules
- decided to go with IPVS (config sketch below)
- IPVS was too good to be true: no graceful termination
- mostly fixed (fine in 1.12?)
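For context, switching kube-proxy from iptables to IPVS is a small configuration change; a minimal sketch (not Datadog's actual config):

```yaml
# kube-proxy configuration sketch: IPVS mode instead of iptables
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; other IPVS schedulers are available
```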
IPv6 and DNS
- race condition in the conntrack code
- sometimes takes 5s (workaround sketch below)
- alpine musl uses parallel resolver
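A commonly used mitigation for the 5s conntrack-race timeouts is the glibc `single-request-reopen` resolver option, set per pod; a sketch (not necessarily what the talk used, and it does not help musl-based Alpine images):

```yaml
# Pod dnsConfig sketch: glibc resolver workaround for the conntrack race
# (assumption: illustrative pod name and image)
apiVersion: v1
kind: Pod
metadata:
  name: dns-workaround-example
spec:
  dnsConfig:
    options:
      - name: single-request-reopen   # avoid parallel A/AAAA queries over the same socket
  containers:
    - name: app
      image: example/app:latest
```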
cloud integrations
- different LB behaviors
- magical disappearing AWS instances
- GCE specific code in Cloud CIDR Allocator
- no docs for aws.go (only comments)
ecosystem
- almost never tested on large clusters
- kube-state-metrics: 100MB payload
scaling 100 -> 1000
- API server high load, CPU, "TargetRAM"
- controller manager/scheduler impossible to split out
- etcd imbalance, long lived connections
- CoreDNS issues (OOM)
create 200 deployments
- etcd latency skyrockets
- main issues: apiserver sizing and traffic imbalance
31m: footguns: DaemonSet
- DaemonSet: high DDoS risk (API server and cloud rate limits)
- outage: permission to read from the bucket was removed from the account, imagePullPolicy: Always on a DaemonSet, 9k image pulls/s (sketch below)
- API rate limits (429 Too many requests)
- DaemonSet rollouts might stop updating pods and get stuck
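The image-pull storm in the outage above comes from `imagePullPolicy: Always` on a DaemonSet: every node re-pulls on every pod (re)start. A hypothetical manifest sketch of the safer settings:

```yaml
# DaemonSet sketch: avoid registry/API hammering from imagePullPolicy: Always
# (assumption: names, image, and values are illustrative)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
        - name: agent
          image: example/agent:1.2.3      # pin a tag or digest
          imagePullPolicy: IfNotPresent   # only pull when the image is missing locally
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%                 # throttle the rollout across nodes
```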
StatefulSets
- local volumes: a new node comes up with the same name (hash function uses the node name)
- problem if you replace a broken node
- solution: put a UUID in the mount path (PV sketch after this list)
- trying to freeze a Kafka process, containerd first freezes the whole cgroup
- could not freeze Kafka process because of IO
- problem: volumes would be created in wrong zone
- scheduling of StatefulSets (4 current, 5 desired, nothing happening), pod 3 was missing
- StatefulSet will not create pods with a higher ordinal if a previous pod is in CrashLoopBackOff
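For the UUID fix above, the idea is to make the on-disk path unique per provisioned disk instead of derived from the (reused) node name; a minimal local PersistentVolume sketch with hypothetical names and values:

```yaml
# Local PersistentVolume sketch: UUID in the mount path so a replacement node
# with the same hostname does not collide with the old volume identity
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kafka-data-6f1c2b3a
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/disks/6f1c2b3a-9d4e-4c8a-b0f1-2a3b4c5d6e7f   # UUID chosen when the disk is prepared
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-a1"]
```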
Cargo culting
- Stack Overflow copy/paste: trap TERM INT, sleep infinity & wait (sketch below)
- people not familiar with containers
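The copy/pasted entrypoint pattern in question looks roughly like this (a sketch of the anti-pattern, not a recommendation; pod name and image are illustrative):

```yaml
# Sketch of the cargo-culted Stack Overflow entrypoint: trap signals,
# background-sleep forever, and wait - copied without understanding PID 1 behavior
apiVersion: v1
kind: Pod
metadata:
  name: cargo-cult-example
spec:
  containers:
    - name: app
      image: ubuntu:22.04
      command: ["/bin/sh", "-c"]
      args:
        - |
          trap 'exit 0' TERM INT
          sleep infinity &
          wait
```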
Zombies
- Redis + zombies, exec probes (readinessProbe with exec command), use tini as PID 1
- kubelet was killing the probe command (doing a Redis ping)
- Redis server not reaping children
- standardizing on using Tini as pid 1
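A sketch of the resulting pattern: tini as PID 1 so it reaps children (e.g. the exec-probe processes the kubelet kills, or children Redis never waits on), plus the exec readiness probe from the notes; image and names are hypothetical:

```yaml
# Pod sketch: tini as PID 1 + exec readiness probe
apiVersion: v1
kind: Pod
metadata:
  name: redis-example
spec:
  containers:
    - name: redis
      image: example/redis-with-tini:6   # assumes an image that ships /tini
      command: ["/tini", "--"]           # tini reaps zombies as PID 1
      args: ["redis-server"]
      readinessProbe:
        exec:
          command: ["redis-cli", "ping"]  # the exec probe from the notes
        periodSeconds: 10
        timeoutSeconds: 2
```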
Containers not VMs
- complex process trees, many open files
native resources
- node port, external IP
- removed a load balancer (exposed via External IP); the cloud provider removed the LB, and a new node got the LB's IP (which was still assigned as an External IP)
- internal app with port 443, port conflict, API server could not bind port 443
- => moving to native pod routing
44m: OOMKiller
- limits too low will trigger cgroup oom
- system oom
- people not setting limits or setting requests too low => OOMed by the system (hard to tell why something was OOMed); sketch below
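A sketch of the resulting guidance: set requests and limits explicitly so an OOM kill happens inside the container's own cgroup (and is attributable to that container) instead of coming from the system OOM killer; values are illustrative:

```yaml
# Container resources sketch: explicit requests and limits
# (assumption: pod name, image, and numbers are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: resources-example
spec:
  containers:
    - name: app
      image: example/app:latest
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          memory: 512Mi   # memory limit => cgroup OOM instead of system OOM
```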
InitContainers
- pod effective resources = max(max(initContainers), sum(containers)) (worked example below)
- LimitRanger also applies
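Worked example of that formula with hypothetical numbers: one init container requesting 1 CPU / 1Gi plus two app containers requesting 250m / 256Mi each gives an effective pod request of max(1, 0.5) = 1 CPU and max(1Gi, 512Mi) = 1Gi, so the init container dominates scheduling:

```yaml
# Pod sketch: effective request = max(max(initContainers), sum(containers))
# (assumption: names, images, and numbers are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: init-resources-example
spec:
  initContainers:
    - name: migrate
      image: example/migrate:1.0
      resources:
        requests: {cpu: "1", memory: 1Gi}
  containers:
    - name: app
      image: example/app:1.0
      resources:
        requests: {cpu: 250m, memory: 256Mi}
    - name: sidecar
      image: example/sidecar:1.0
      resources:
        requests: {cpu: 250m, memory: 256Mi}
# effective pod request: cpu = max(1, 0.25 + 0.25) = 1, memory = max(1Gi, 512Mi) = 1Gi
```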
Future Plans
- isolate control plane
hjacobs commented
KubeCon talk video: https://www.youtube.com/watch?v=QKI-JRs2RIE