hjacobs/kubernetes-failure-stories

Idea: annotate keywords/topics and contributing factors?

hjacobs opened this issue · 2 comments

The list of failure stories is still pretty short, but it might already make sense to add more information, such as keywords hinting at possible contributing factors. This would let readers find relevant information more easily, e.g.:

  • "I saw problems with kubelet connecting to API server, let's look at the kubelet, dynamic ELB IPs outage post"
  • "I saw DNS issues in our cluster, let's see what the incident report with keyword DNS has to say"

It might also be useful to annotate them with the platform used (AWS, EKS, GKE, GCP, OpenStack, on-premises…), since, from what I have seen, many outages are platform-dependent.

If you make this a table, adding the date and K8s version might also be relevant.
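To make the annotation idea concrete, here is a minimal sketch of what an annotated entry and a keyword lookup could look like. Everything in it is an assumption for illustration: the `Story` record, its field names, the `find_by_keyword` helper, and the two placeholder entries are hypothetical and do not describe the repository's actual format or real incidents.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Story:
    """One failure story plus the annotations proposed in this issue (hypothetical schema)."""
    title: str
    keywords: set[str] = field(default_factory=set)
    platform: Optional[str] = None      # e.g. "AWS", "GKE", "OpenStack"
    date: Optional[str] = None          # ISO date string, left unset when unknown
    k8s_version: Optional[str] = None   # left unset when unknown


# Placeholder entries only; titles and annotations are made up for the example.
STORIES = [
    Story("placeholder story A", {"kubelet", "API server", "ELB"}, platform="AWS"),
    Story("placeholder story B", {"DNS"}, platform="GKE"),
]


def find_by_keyword(stories, keyword):
    """Return the stories annotated with `keyword`, ignoring case."""
    wanted = keyword.lower()
    return [s for s in stories if wanted in {k.lower() for k in s.keywords}]


if __name__ == "__main__":
    for story in find_by_keyword(STORIES, "dns"):
        print(story.title, story.platform)
```

Keeping date and version optional means entries can be annotated incrementally, without guessing values the original post does not state.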

Another idea would be to create a section per topic and list pitfalls with the known incidents/postmortems linked, e.g. section "API Server" --> "OOM" (link to OOM due to many pods incident), "SPoF" (link to incident about ingress going down if API server goes down).
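The per-topic layout could be generated from the same annotations rather than maintained by hand. The sketch below is only one possible approach: the `(topic, pitfall, link)` tuples, the `group_by_topic` and `render` helpers, and the Markdown-style output are assumptions, and the link strings are placeholders rather than real URLs.

```python
from collections import defaultdict


def group_by_topic(incidents):
    """Build a topic -> pitfall -> [links] index from (topic, pitfall, link) tuples."""
    index = defaultdict(lambda: defaultdict(list))
    for topic, pitfall, link in incidents:
        index[topic][pitfall].append(link)
    return index


def render(index):
    """Render the index as one heading per topic with one bullet per pitfall."""
    lines = []
    for topic in sorted(index):
        lines.append(f"## {topic}")
        for pitfall in sorted(index[topic]):
            links = ", ".join(index[topic][pitfall])
            lines.append(f"- {pitfall}: {links}")
    return "\n".join(lines)


# Placeholder data mirroring the example in this issue; the links are not real URLs.
INCIDENTS = [
    ("API Server", "OOM", "<link to OOM incident>"),
    ("API Server", "SPoF", "<link to ingress incident>"),
]

if __name__ == "__main__":
    print(render(group_by_topic(INCIDENTS)))
```

An incident with several contributing factors would simply appear as several tuples, so it shows up under every relevant pitfall.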