Idea: annotate keywords/topics and contributing factors?
hjacobs opened this issue · 2 comments
The list of failure stories is still pretty short, but it might already make sense to add more information, such as keywords hinting at possible contributing factors. This would allow readers to find relevant information more easily, e.g.:
- "I saw problems with kubelet connecting to API server, let's look at the `kubelet`, `dynamic ELB IPs` outage post"
- "I saw DNS issues in our cluster, let's see what the incident report with keyword `DNS` has to say"
It might also be useful to annotate them with the platform used (AWS, EKS, GKE, GCP, OpenStack, on-premises…), since from what I saw many outages are platform-dependent.
If you make this a table, adding the date and K8s version might also be relevant.
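To make the annotation idea concrete, here is a rough sketch of what one annotated entry and a keyword lookup could look like. Everything in it is an assumption for illustration: the `FailureStory` fields, the `find_by_keyword` helper, the placeholder titles, and the `example.com` URLs are hypothetical and do not describe how the list is actually stored.

```python
from dataclasses import dataclass, field


@dataclass
class FailureStory:
    """One list entry plus the annotations proposed above (hypothetical schema)."""
    title: str
    url: str
    keywords: set = field(default_factory=set)
    platform: str = ""      # e.g. "AWS", "GKE", "OpenStack"
    date: str = ""          # left empty when unknown
    k8s_version: str = ""   # left empty when unknown


def find_by_keyword(stories, keyword):
    """Return all stories tagged with the keyword (case-insensitive match)."""
    wanted = keyword.lower()
    return [s for s in stories if wanted in {k.lower() for k in s.keywords}]


# Placeholder entries, not real posts from the list.
STORIES = [
    FailureStory(
        title="Placeholder: kubelet cannot reach API server",
        url="https://example.com/story-1",
        keywords={"kubelet", "dynamic ELB IPs"},
        platform="AWS",
    ),
    FailureStory(
        title="Placeholder: DNS issues in cluster",
        url="https://example.com/story-2",
        keywords={"DNS"},
    ),
]

if __name__ == "__main__":
    for story in find_by_keyword(STORIES, "dns"):
        print(story.title, story.url)
```

Keeping the annotations as structured data would also make it straightforward to generate the table mentioned above, with platform, date, and K8s version as columns.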
Another idea would be to create a section by topic and list pitfalls with known incidents/postmortems linked, e.g. section "API Server" --> "OOM" (link to OOM due to many pods incident), "SPoF" (link to incident about ingress going down if API server goes down)
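The topic-based layout could be sketched as a nested mapping from topic to pitfall to linked incidents, rendered into sections. This is only an illustration: the `TOPICS` structure, the `render_topics` helper, the heading style, and the `example.com` URLs are made up here and stand in for the real incident links.

```python
# Hypothetical index: topic -> pitfall -> list of incident/postmortem links.
TOPICS = {
    "API Server": {
        "OOM": ["https://example.com/oom-due-to-many-pods"],
        "SPoF": ["https://example.com/ingress-down-with-api-server"],
    },
}


def render_topics(topics):
    """Render the index as one section per topic with a bullet per pitfall."""
    lines = []
    for topic, pitfalls in topics.items():
        lines.append(f"## {topic}")
        for pitfall, links in pitfalls.items():
            lines.append(f"- {pitfall}: {', '.join(links)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_topics(TOPICS))
```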