A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.
- On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019
- Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018
- NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018
- Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018
- Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018
- AirMap Platform Service Outage - AirMap - incident report 2018
- Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018
- 101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018
- 101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017
- Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017
- Our First Kubernetes Outage - Saltside - blog post 2017
- Our Failure Migrating to Kubernetes - Saltside - blog post 2017
Kubernetes is a fairly complex system with many moving parts. Its ecosystem is constantly evolving and adding even more layers (service mesh, ...) to the mix. Given this complexity, we don't hear enough real-world horror stories to learn from! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production.
Please help the community and share a link to your failure story by opening a pull request! A failure story can be anything: a blog post, a conference or meetup talk, an incident postmortem, a tweet storm, ...
I would also be glad to hear about your failure stories on Twitter: my handle is @try_except_