A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.
- Kubernetes Load Balancer Konfiguration – Vorsicht beim Drainen von Nodes (German; in English: "Kubernetes Load Balancer Configuration: Beware When Draining Nodes") - DevOps Hof - blog post 2019
- On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019
- Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018
- Outages? Downtime? - Veracode - blog post 2018
- NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018
- A Perfect DNS Storm - Toyota Connected - blog post 2018
- Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018
- Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018
- AirMap Platform Service Outage - AirMap - incident report 2018
- Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018
- 101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018
- 101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017
- Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017
- Search and Reporting Outage - Universe - incident report 2017
- Our First Kubernetes Outage - Saltside - blog post 2017
- Our Failure Migrating to Kubernetes - Saltside - blog post 2017
- SaleMove US System Issue - SaleMove - incident report 2017
Kubernetes is a fairly complex system with many moving parts. Its ecosystem is constantly evolving and adding even more layers (service mesh, ...) to the mix. Given this complexity, we hear too few real-world horror stories to learn from each other! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production. For more information, see the blog post.
Please help the community and share a link to your failure story by opening a Pull Request! Failure stories can take any form: blog posts, conference or meetup talks, incident postmortems, tweet storms, and so on.
I would also be glad to hear about your failure stories on Twitter: my handle is @try_except_