# A List of Post-mortems!

A collection of postmortems. Pull requests welcome!



## Table of Contents

  • Config Errors
  • Hardware/Power Failures
  • Conflicts
  • Uncategorized
  • Other lists of postmortems
  • Analysis
  • Contributors

## Config Errors

Cloudflare. A bad config (router rule) caused all of their edge routers to crash, taking down all of Cloudflare.

Etsy. Sending multicast traffic without properly configuring switches caused an Etsy global outage.

Facebook. A bad config took down both Facebook and Instagram.

Google. A bad config (autogenerated) took down most Google services.

Google. A bad config caused a quota service to fail, which caused multiple services to fail (including gmail).

Microsoft. A bad config took down Azure storage.

Stack Overflow. A bad firewall config blocked stackexchange/stackoverflow.

Valve. Although there's no official postmortem, it looks like a bad BGP config severed Valve's connection to Level 3, Telia, and Abovenet/Zayo, which resulted in a global Steam outage.

## Hardware/Power Failures

Amazon. An unknown event caused a transformer to fail. One of the PLCs that checks that generator power is in phase failed for an unknown reason, which prevented a set of backup generators from coming online. This affected EC2, EBS, and RDS in EU West.

Amazon. Bad weather caused power failures throughout AWS US East. A single backup generator failed to deliver stable power when power switched over to backup and the generator was loaded. This is despite having passed a load test two months earlier, and passing weekly power-on tests.

FirstEnergy / General Electric. FirstEnergy had a local failure when some transmission lines hit untrimmed foliage. The normal process is to have an alarm go off, which causes human operators to re-distribute power. But the GE system that was monitoring this had a bug which prevented the alarm from getting triggered, which eventually caused a cascading failure that affected 55 million people.

Sun/Oracle. Sun famously didn't include ECC in a couple generations of server parts. This resulted in data corruption and crashing. Following Sun's typical MO, they made customers that reported a bug sign an NDA before explaining the issue.

Google. Successive lightning strikes on their European datacenter (europe-west1-b) caused loss of power to Google Compute Engine storage systems within that region. I/O errors were observed on a subset of Standard Persistent Disks (HDDs) and permanent data loss was observed on a small fraction of those.

## Conflicts

CCP Games. A typo and a name conflict caused the installer to sometimes delete the boot.ini file on installation of an expansion for EVE Online - with consequences.

GoCardless. All queries on a critical PostgreSQL table were blocked by the combination of an extremely fast database migration and a long-running read query, causing 15 seconds of downtime.

Knight Capital. A combination of conflicting deployed versions and re-using a previously used bit caused a $460M loss.
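
A minimal sketch of the flag-reuse failure mode, using hypothetical flag names rather than Knight's actual code: when a new feature recycles a bit that an old, retired feature used, any server still running the old binary interprets the recycled bit as a request for the retired code path.

```python
# Hypothetical flag values; both features end up sharing the same bit.
FLAG_RETIRED_FEATURE = 0x01   # dead code path still present in the old binary
FLAG_NEW_FEATURE = 0x01       # new feature deployed reusing the same bit

def old_binary(order_flags: int) -> str:
    # A server that missed the deploy still runs this logic.
    if order_flags & FLAG_RETIRED_FEATURE:
        return "execute retired (dead) code path"
    return "normal order handling"

def new_binary(order_flags: int) -> str:
    if order_flags & FLAG_NEW_FEATURE:
        return "execute new feature"
    return "normal order handling"

# The same order behaves very differently depending on which binary a
# given server happens to be running.
print(new_binary(FLAG_NEW_FEATURE))   # "execute new feature"
print(old_binary(FLAG_NEW_FEATURE))   # "execute retired (dead) code path"
```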

## Uncategorized

Allegro. The Allegro platform suffered a failure of a subsystem responsible for asynchronous distributed task processing. The problem affected many areas, e.g. features such as purchasing numerous offers via cart and bulk offer editing (including price list editing) did not work at all. Moreover, the daily newsletter with new offers was only partially sent, and some parts of the internal administration panel were also affected.

Amazon. Major Amazon EC2/RDS outage in US East Region. Human error during a routine networking upgrade led to a resource crunch, exacerbated by software bugs, that ultimately resulted in an outage across all US East Availability Zones, affecting many popular websites, as well as a loss of 0.07% of volumes.

Amazon. Elastic Load Balancer ran into problems when "a maintenance process that was inadvertently run against the production ELB state data".

AppNexus. A double free revealed by a database update caused all "impression bus" servers to crash simultaneously. This wasn't caught in staging and made it into production because a time delay is required to trigger the bug, and the staging period didn't have a built-in delay.

Bitly. Hosted source code repo contained credentials granting access to bitly backups, including hashed passwords.

BrowserStack. An old prototype machine with the Shellshock vulnerability still active had secret keys on it which ultimately led to a security breach of the Production system.

CCP Games. A problematic logging channel caused cluster nodes to die off during the cluster start sequence after rolling out a new game patch, resulting in a day's worth of troubleshooting and extended downtime.

CircleCI. A GitHub outage and recovery caused an unexpectedly large incoming load. For reasons that aren't specified, a large load causes CircleCI's queue system to slow down, in this case to handling one transaction per minute.

Dropbox. This postmortem is pretty thin and I'm not sure what happened. It sounds like, maybe, a scheduled OS upgrade somehow caused some machines to get wiped out, which took out some databases.

European Space Agency. An overflow occurred when a 64-bit floating-point number was converted to a 16-bit integer in the Ariane 5 inertial guidance system, causing the rocket to crash. The overflow occurred in code that wasn't necessary for operation but was running anyway. According to one account, this caused a diagnostic error message to get printed out, and the diagnostic error message was somehow interpreted as actual valid data. According to another account, no trap handler was installed for the overflow.
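
A rough illustration of the failure class (Python here, not the actual Ada flight software): a 64-bit floating-point value drifts outside the range a 16-bit signed integer can represent, and the unhandled conversion error takes the process down.

```python
import struct

horizontal_bias = 40000.0  # hypothetical value; exceeds the 16-bit range

try:
    # Packing as a signed 16-bit integer (-32768..32767) raises when the
    # value doesn't fit, loosely analogous to the unhandled operand error
    # that shut down the inertial reference system.
    struct.pack(">h", int(horizontal_bias))
except struct.error as exc:
    print(f"conversion overflow: {exc}")
```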

Etsy. First, a deploy that was supposed to be a small bugfix deploy also caused live databases to get upgraded on running production machines. To make sure that this didn't cause any corruption, Etsy stopped serving traffic to run integrity checks. Second, an overflow in ids (signed 32-bit ints) caused some database operations to fail. Etsy didn't trust that this wouldn't result in data corruption and took down the site while the upgrade got pushed.

Gitlab. After the primary locked up and was restarted, it was brought back up with the wrong filesystem, causing a global outage.

Google. Checking the vendor string instead of feature flags renders NaCl unusable on otherwise compatible non-mainstream hardware platforms.

Google. A mail system emailed people more than 20 times. This happened because mail was sent with a batch cron job that sent mail to everyone who was marked as waiting for mail. This was a non-atomic operation and the batch job didn't mark people as not waiting until all messages were sent.
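
A minimal sketch (in-memory stand-ins, not the actual system) of why a non-atomic batch job re-sends mail: if the "waiting" flags are only cleared after the whole batch completes, any failure or re-run mid-batch mails everyone again, while clearing the flag per recipient bounds the duplicates.

```python
waiting = {"alice": True, "bob": True, "carol": True}
sent = []

def send_mail(user):
    sent.append(user)

def fragile_batch():
    # Flags are cleared only after every message is sent, so a crash or a
    # second run of the job before that point re-mails everyone.
    for user, flag in waiting.items():
        if flag:
            send_mail(user)
    for user in waiting:
        waiting[user] = False

def safer_batch():
    # Clearing the flag as each message goes out limits a re-run to (at
    # most) the single in-flight recipient.
    for user, flag in list(waiting.items()):
        if flag:
            send_mail(user)
            waiting[user] = False

safer_batch()
print(sent)  # ['alice', 'bob', 'carol'], each mailed exactly once
```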

GPS/GLONASS. A bad update that caused incorrect orbital mechanics calculations caused GPS satellites that use GLONASS to broadcast incorrect positions for 10 hours. The bug was noticed and rolled back almost immediately but, due to (?), this didn't fix the issue.

Healthcare.gov.

Heroku. Having a system that requires scheduled manual updates resulted in an error which caused US customers to be unable to scale, stop or restart dynos, or route HTTP traffic, and also prevented all customers from being able to deploy.

Intel. A scripting bug caused the generation of the divider logic in the Pentium to very occasionally produce incorrect results. The bug wasn't caught in testing because of an incorrect assumption in a proof of correctness.

Joyent. Operations on Manta were blocked because a lock couldn't be obtained on their PostgreSQL metadata servers. This was due to a combination of PostgreSQL's transaction wraparound maintenance taking a lock on something, and a Joyent query that unnecessarily tried to take a global lock.

Kickstarter. Primary DB became inconsistent with all replicas, which wasn't detected until a query failed. This was caused by a MySQL bug which sometimes caused ORDER BY to be ignored.

Medium. Due to a series of unfortunate events, Polish users were unable to use their "Ś" key on Medium.

NASA. Use of different units of measurement (metric vs. English) caused Mars Climate Orbiter to fail. This is basically the same issue Sweden ran into in 1628 with the Vasa (see the Sweden entry below).

Netflix. Netflix's extensive preparations enable them to gracefully handle a degradation of Amazon EBS service.

Sentry. Sentry was down for most of the US working day due to transaction ID wraparound in Postgres.

Spotify. Lack of exponential backoff in a microservice caused a cascading failure, leading to notable service degradation.
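
A minimal sketch of capped exponential backoff with jitter, the kind of client-side retry discipline whose absence lets synchronized retries pile onto an already struggling service; the names and parameters here are illustrative, not Spotify's actual code.

```python
import random
import time

def call_with_backoff(request, max_attempts=6, base_delay=0.1, max_delay=10.0):
    """Retry `request`, sleeping a jittered, exponentially growing interval."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Grow the delay exponentially with each failure and add jitter
            # so callers don't retry in lockstep against the failing service.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```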

Sweden. Use of different rulers by builders caused the Vasa to be more heavily built on its port side and the ship's designer, not having built a ship with two gun decks before, overbuilt the upper decks, leading to a design that was top heavy. Twenty minutes into its maiden voyage in 1628, the ship heeled to port and sank.

Valve. Steam's desktop client deleted all local files and directories. The thing I find most interesting about this is that, after this blew up on social media, there were widespread reports that this was reported to Valve months earlier. But Valve doesn't triage most bugs, resulting in an extremely long time-to-mitigate, despite having multiple bug reports on this issue.

Unfortunately, most of the interesting post-mortems I know about are locked inside confidential pages at Google and Microsoft. Please add more links if you know of any interesting public post-mortems! Links to other collections of post-mortems (like those listed below) are also appreciated.

## Other lists of postmortems

Availability Digest website.

Google+ postmortems community.

John Daily's list of postmortems (in json).

Jeff Hammerbacher's list of postmortems.

NASA lessons learned database.

Wikimedia's postmortems.

## Analysis

How Complex Systems Fail

John Allspaw on Resilience Engineering

## Contributors

  • Ahmet Alp Balkan
  • Amber Yust
  • BigEd/Ed S?
  • Brock Boland
  • Connor Shea
  • Dan Luu
  • David Pate
  • Florent Genette
  • Grey Baker
  • James Graham
  • Jason Dusek
  • John Daily
  • jomo
  • Julia Hansbrough
  • Julian Szulc
  • Kunal Mehta
  • Luan Cestari
  • Mark Dennehy
  • Matt Day
  • Michael Robinson
  • Nat Welch
  • Nate Parsons
  • Raul Ochoa
  • Samuel Hunter
  • Siddharth Kannan
  • Vincent Ambo