Kubernetes Failure Stories

A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.

On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018
NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018
Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018
Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018
AirMap Platform Service Outage - AirMap - incident report 2018
Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018
101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018
101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017
Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017
Our First Kubernetes Outage - Saltside - blog post 2017
Our Failure Migrating to Kubernetes - Saltside - blog post 2017

Why

Kubernetes is a fairly complex system with many moving parts. Its ecosystem is constantly evolving and adding even more layers (service mesh, ...) to the mix. Given this environment, we don't hear enough real-world horror stories to learn from! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production.

Contributing

Please help the community and share a link to your failure story by opening a Pull Request! A failure story can be a blog post, a conference or meetup talk, an incident postmortem, a tweet storm, or anything similar.

I would also be glad to hear about your failure stories on Twitter: my handle is @try_except_

jameskumar/kubernetes-failure-stories