A collection of postmortems. Sorry for the delay in merging PRs!

A List of Post-mortems!



Table of Contents

Config Errors

Hardware/Power Failures

Conflicts

Time

Uncategorized

Other lists of postmortems

Analysis

Contributors



Config Errors

Cloudflare. A bad config (router rule) caused all of their edge routers to crash, taking down all of Cloudflare.

Etsy. Sending multicast traffic without properly configuring switches caused an Etsy global outage.

Facebook. A bad config took down both Facebook and Instagram.

GoCardless. A bad config combined with an uncommon set of failures led to an outage of a database cluster, taking the API and Dashboard offline.

Google Cloud. A bad config (autogenerated) removed all Google Compute Engine IP blocks from BGP announcements.

Google. A bad config (autogenerated) took down most Google services.

Google. A bad config caused a quota service to fail, which caused multiple services to fail (including gmail).

Google. / was checked into the URL blacklist, causing every URL to show a warning.

Google. A bug in a configuration roll-out to a load balancer led to increased error rates for 22 minutes.

Heroku. An automated remote configuration change did not propagate fully. Web dynos could not be started.

Microsoft. A bad config took down Azure storage.

OWASA. The wrong push of a button led to a water treatment plant shutting down due to excessively high levels of fluoride.

Stack Overflow. A bad firewall config blocked stackexchange/stackoverflow.

Sentry. Wrong Amazon S3 settings on backups led to a data leak.

TravisCI. A configuration issue (incomplete password rotation) led to "leaking" VMs, leading to elevated build queue times.

TravisCI. A configuration issue (automated age-based Google Compute Engine VM image cleanup job) caused stable base VM images to be deleted.

TravisCI. A configuration change made builds start to fail. Manual rollback broke.

Valve. Although there's no official postmortem, it looks like a bad BGP config severed Valve's connection to Level 3, Telia, and Abovenet/Zayo, which resulted in a global Steam outage.

Hardware/Power Failures

Amazon. An unknown event caused a transformer to fail. One of the PLCs that checks that generator power is in phase failed for an unknown reason, which prevented a set of backup generators from coming online. This affected EC2, EBS, and RDS in EU West.

Amazon. Bad weather caused power failures throughout AWS US East. A single backup generator failed to deliver stable power when power switched over to backup and the generator was loaded. This is despite having passed a load test two months earlier, and passing weekly power-on tests.

Amazon. At 10:25pm PDT on June 4, loss of power at an AWS Sydney facility resulting from severe weather in that area led to disruption to a significant number of instances in an Availability Zone. Due to the signature of the power loss, power isolation breakers did not engage, resulting in backup energy reserves draining into the degraded power grid.

ARPANET. A malfunctioning IMP (Interface Message Processor) corrupted routing data, software recomputed checksums propagating bad data with good checksums, incorrect sequence numbers caused buffers to fill, full buffers caused loss of keepalive packets and nodes took themselves off the network. From 1980.

FirstEnergy / General Electric. FirstEnergy had a local failure when some transmission lines hit untrimmed foliage. The normal process is to have an alarm go off, which causes human operators to re-distribute power. But the GE system that was monitoring this had a bug which prevented the alarm from getting triggered, which caused a cascading failure that eventually affected 55 million people.

GitHub. On January 28th, 2016 GitHub experienced a disruption in the power at their primary datacenter.

Google. Successive lightning strikes on their European datacenter (europe-west1-b) caused loss of power to Google Compute Engine storage systems within that region. I/O errors were observed on a subset of Standard Persistent Disks (HDDs) and permanent data loss was observed on a small fraction of those.

Sun/Oracle. Sun famously didn't include ECC in a couple generations of server parts. This resulted in data corruption and crashing. Following Sun's typical MO, they made customers that reported a bug sign an NDA before explaining the issue.

Conflicts

CCP Games. A typo and a name conflict caused the installer to sometimes delete the boot.ini file on installation of an expansion for EVE Online - with consequences.

GoCardless. All queries on a critical PostgreSQL table were blocked by the combination of an extremely fast database migration and a long-running read query, causing 15 seconds of downtime.

Google. Many changes to a rarely modified load balancer were applied through a very slow code path. This froze all public addressing changes for ~2 hours.

Knight Capital. A combination of conflicting deployed versions and re-using a previously used bit caused a $460M loss.

WebKit code repository. The WebKit repository, a Subversion repository configured to use deduplication, became unavailable after two files with the same SHA-1 hash were checked in as test data, with the intention of implementing a safety check for collisions. The two files had different md5 sums and so a checkout would fail a consistency check. For context, the first public SHA-1 hash collision had very recently been announced, with an example of two colliding files.

Time

Azure. Certificates that were valid for one year were created. Instead of using an appropriate library, someone wrote code that computed one year to be the current date plus one year. On February 29th 2012, this resulted in the creation of certificates with an expiration date of February 29th 2013, which were rejected because of the invalid date. This caused an Azure global outage that lasted for most of a day.
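The leap-day arithmetic described above can be sketched in Python. This is only an illustration of the failure mode, not the code Azure ran; the function names are hypothetical, and falling back to February 28th is one assumed policy among several reasonable ones.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Mirrors the bug: bump the year and keep month and day unchanged.
    # For February 29th the target date does not exist, so this raises ValueError.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Assumed policy: when the anniversary does not exist, use February 28th.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)
```

Python refuses to build the invalid date, whereas the system in the entry above produced it and failed later at validation time.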

Cloudflare. Backwards time flow from tracking the 27th leap second on 2016-12-31T23:59:60Z caused the weighted round-robin selection of DNS resolvers (RRDNS) to panic and fail on some CNAME lookups. Go's time.Now() was incorrectly assumed to be monotonic; this injected negative values into calls to rand.Int63n(), which panics in that case.
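A rough Python analogue of the same hazard, assuming a duration is computed from two wall-clock readings and then used as a bound for a random pick. The function names are hypothetical and this is not the RRDNS code; `random.randrange` plays the role of `rand.Int63n` here because it also rejects a non-positive bound (with a `ValueError` rather than a panic).

```python
import random
import time


def elapsed_wall(start: float, end: float) -> float:
    # Difference of two wall-clock readings; negative if the clock stepped back.
    return end - start


def elapsed_clamped(start: float, end: float) -> float:
    # Defensive variant: never report a negative duration.
    return max(0.0, end - start)


def measure(fn) -> float:
    # time.monotonic() cannot go backwards, so the result is never negative.
    start = time.monotonic()
    fn()
    return time.monotonic() - start


def pick_index(bound: int) -> int:
    # Raises ValueError when bound <= 0.
    return random.randrange(bound)
```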

Linux. Leap second code was called from the timer interrupt handler, which held xtime_lock. That code did a printk to log the leap second. printk wakes up klogd, which can sometimes try to get the time, which waits on xtime_lock, causing a deadlock.

Linux. When a leap second occurred, CLOCK_REALTIME was rewound by one second. This was not done via a mechanism that would update hrtimer base.offset (clock_was_set). This meant that when a timer interrupt happened, TIMER_ABSTIME CLOCK_REALTIME timers got expired one second early, including timers set for less than one second. This caused applications that used sleep for less than one second in a loop to spinwait without sleeping, causing high load on many systems. This caused a large number of web services to go down in 2012.

Uncategorized

Allegro. The Allegro platform suffered a failure of a subsystem responsible for asynchronous distributed task processing. The problem affected many areas, e.g. features such as purchasing numerous offers via cart and bulk offer editing (including price list editing) did not work at all. Moreover, it partially failed to send the daily newsletter with new offers. Some parts of the internal administration panel were also affected.

Amazon. Human error. On February 28th 2017 9:37AM PST, the Amazon S3 team was debugging a minor issue. Despite using an established playbook, one of the commands intending to remove a small number of servers was issued with a typo, inadvertently causing a larger set of servers to be removed. These servers supported critical S3 systems. As a result, dependent systems required a full restart to correctly operate, and the system underwent widespread outages for US-EAST-1 (Northern Virginia) until final resolution at 1:54PM PST. Since Amazon's own services such as EC2 and EBS rely on S3 as well, it caused a vast cascading failure which affected hundreds of companies.

Amazon. Message corruption caused the distributed server state function to overwhelm resources on the S3 request processing fleet.

Amazon. Human error during a routine networking upgrade led to a resource crunch, exacerbated by software bugs, that ultimately resulted in an outage across all US East Availability Zones as well as a loss of 0.07% of volumes.

Amazon. Elastic Load Balancer ran into problems when "a maintenance process that was inadvertently run against the production ELB state data".

Amazon. A "network disruption" caused metadata services to experience load that caused response times to exceed timeout values, causing storage nodes to take themselves down. Nodes that took themselves down continued to retry, ensuring that load on metadata services couldn't decrease.

AppNexus. A double free revealed by a database update caused all "impression bus" servers to crash simultaneously. This wasn't caught in staging and made it into production because a time delay is required to trigger the bug, and the staging period didn't have a built-in delay.

AT&T. A bad line of C code introduced a race hazard which in due course collapsed the phone network. After a planned outage, the quickfire resumption messages triggered the race, causing more reboots which retriggered the problem. "The problem repeated iteratively throughout the 114 switches in the network, blocking over 50 million calls in the nine hours it took to stabilize the system." From 1990.

BBC Online. In July 2014, BBC Online experienced a very long outage of several of its popular online services including the BBC iPlayer. When the database backend was overloaded, it had started to throttle requests from various services. Services that hadn't cached the database responses locally began timing out and eventually failed completely.

Bitly. Hosted source code repo contained credentials granting access to bitly backups, including hashed passwords.

BrowserStack. An old prototype machine with the Shellshock vulnerability still active had secret keys on it which ultimately led to a security breach of the Production system.

Buildkite. A database capacity downgrade made in an attempt to minimise AWS spend resulted in a lack of capacity to support Buildkite customers at peak, leading to a cascading collapse of dependent servers.

CCP Games. A problematic logging channel caused cluster nodes to die off during the cluster start sequence after rolling out a new game patch.

Chef.io. The recipe community site Supermarket crashed two hours after launch due to intermittent unresponsiveness and increased latency. One of the main reasons for failure identified in the post mortem was very low health check timeouts.

CircleCI. A GitHub outage and recovery caused an unexpectedly large incoming load. For reasons that aren't specified, a large load causes CircleCI's queue system to slow down, in this case to handling one transaction per minute.

Cloudflare. A parser bug caused Cloudflare edge servers to return memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.

Discord. A flapping service led to a thundering herd reconnecting to it once it came up. This led to a cascading error where frontend services ran out of memory due to internal queues filling up.

Discord. "At approximately 14:01, a Redis instance acting as the primary for a highly-available cluster used by Discord's API services was migrated automatically by Google’s Cloud Platform. This migration caused the node to incorrectly drop offline, forcing the cluster to rebalance and trigger known issues with the way Discord API instances handle Redis failover. After resolving this partial outage, unnoticed issues on other services caused a cascading failure through Discord’s real time system. These issues caused enough critical impact that Discord’s engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes."

Dropbox. This postmortem is pretty thin and I'm not sure what happened. It sounds like, maybe, a scheduled OS upgrade somehow caused some machines to get wiped out, which took out some databases.

Epic Games. Extreme load (a new peak of 3.4 million concurrent users) resulted in a mix of partial and total service disruptions.

European Space Agency. An overflow occurred when converting a 64-bit floating point number to a 16-bit signed integer in the Ariane 5 inertial guidance system, causing the rocket to crash. The actual overflow occurred in code that wasn't necessary for operation but was running anyway. According to one account, this caused a diagnostic error message to get printed out, and the diagnostic error message was somehow interpreted as actual valid data. According to another account, no trap handler was installed for the overflow.
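To make the overflow concrete, here is a minimal Python sketch of narrowing a floating point value into a 16-bit signed integer, with one variant that raises and one that saturates. The function names and the choice between raising and saturating are assumptions for illustration; this is not the flight software, which was written in Ada.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    # Truncate toward zero, then refuse values that do not fit in 16 bits.
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return n


def to_int16_saturating(value: float) -> int:
    # Clamp out-of-range values to the nearest representable bound.
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

Whether an unhandled exception or a clamped value is the safer outcome depends on what consumes the result, which is part of what the accounts above disagree on.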

Etsy. First, a deploy that was supposed to be a small bugfix deploy also caused live databases to get upgraded on running production machines. To make sure that this didn't cause any corruption, Etsy stopped serving traffic to run integrity checks. Second, an overflow in ids (signed 32-bit ints) caused some database operations to fail. Etsy didn't trust that this wouldn't result in data corruption and took down the site while the upgrade got pushed.

Foursquare. MongoDB fell over under load when it ran out of memory. The failure was catastrophic and not graceful due to a query pattern that involved a read-load with low levels of locality (each user check-in caused a read of all check-ins for the user's history, and records were 300 bytes with no spatial locality, meaning that most of the data pulled in from each page was unnecessary). A lack of monitoring on the MongoDB instances caused the high load to go undetected until the load became catastrophic, causing 17 hours of downtime spanning two incidents in two days.

Gitlab 2014. After the primary locked up and was restarted, it was brought back up with the wrong filesystem, causing a global outage. See also HN discussion.

Gitlab 2017. Influx of requests overloaded the database, caused replication to lag, tired admin deleted the wrong directory, six hours of data lost. See also earlier report and HN discussion.

Gliffy. While attempting to resolve an issue with a backup system the production database was accidentally deleted causing a system outage.

Google. Checking the vendor string instead of feature flags renders NaCl unusable on otherwise compatible non-mainstream hardware platforms.

Google. A mail system emailed people more than 20 times. This happened because mail was sent with a batch cron job that sent mail to everyone who was marked as waiting for mail. This was a non-atomic operation and the batch job didn't mark people as not waiting until all messages were sent.
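One way to avoid the repeated sends described above is to clear each recipient's "waiting" flag right after their message goes out, rather than once at the end of the batch. The sketch below assumes an in-memory set and a caller-supplied `send` function, both hypothetical stand-ins for whatever storage and mailer the real system used.

```python
def send_pending(waiting: set, send) -> None:
    # Iterate over a snapshot so the set can be modified while looping.
    for recipient in sorted(waiting):
        send(recipient)
        # Mark this recipient as done immediately, so a crash later in the
        # batch does not cause them to be mailed again on the next run.
        waiting.discard(recipient)
```

A crash between `send` and `discard` can still produce one duplicate for one recipient, so this narrows the window rather than closing it.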

GPS/GLONASS. A bad update that caused incorrect orbital mechanics calculations caused GPS satellites that use GLONASS to broadcast incorrect positions for 10 hours. The bug was noticed and rolled back almost immediately, but (?) this didn't fix the issue.

Healthcare.gov.

Heroku. Having a system that requires scheduled manual updates resulted in an error which caused US customers to be unable to scale, stop or restart dynos, or route HTTP traffic, and also prevented all customers from being able to deploy.

Heroku. An upgrade silently disabled a check that was meant to prevent filesystem corruption in running containers. A subsequent deploy caused filesystem corruption in running containers.

Heroku. An upstream apt update broke pinned packages, which led to customers experiencing write permission failures to /dev.

Instapaper, also this. Limits were hit for a hosted database. It took many hours to migrate over to a new database.

Intel. A scripting bug caused the generation of the divider logic in the Pentium to very occasionally produce incorrect results. The bug wasn't caught in testing because of an incorrect assumption in a proof of correctness.

Joyent. Operations on Manta were blocked because a lock couldn't be obtained on their PostgreSQL metadata servers. This was due to a combination of PostgreSQL's transaction wraparound maintenance taking a lock on something, and a Joyent query that unnecessarily tried to take a global lock.

Kickstarter. Primary DB became inconsistent with all replicas, which wasn't detected until a query failed. This was caused by a MySQL bug which sometimes caused order by to be ignored.

Kings College London. A 3PAR system suffered a catastrophic outage which highlighted a failure in internal process.

Mailgun. Secondary MongoDB servers became overloaded and, while troubleshooting, the team accidentally pushed a change that sent all secondary traffic to the primary MongoDB server, overloading it as well and exacerbating the problem.

Medium. Polish users were unable to use their "Ś" key on Medium.

NASA. A design flaw in the Apollo 11 rendezvous radar produced excess CPU load, causing the spacecraft computer to restart during lunar landing.

NASA. Use of different units of measurement (metric vs. English) caused Mars Climate Orbiter to fail. There were also organisational and procedural failures[ref] and defects in the navigation software[ref].
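A unit mix-up of this kind is harder to make when the unit is part of the type rather than an unlabelled float. The sketch below is a hypothetical illustration using impulse in pound-force seconds and newton-seconds; the class and method names are made up, and only the conversion factor (1 lbf = 4.4482216152605 N) is a fixed fact.

```python
from dataclasses import dataclass

# 1 pound-force expressed in newtons.
LBF_TO_N = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    # Stored in SI units only; callers must say which unit they are supplying.
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        return cls(value * LBF_TO_N)
```

With this shape, a producer working in English units has to call `from_pound_force_seconds` explicitly, and every consumer reads `newton_seconds`.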

NASA. NASA's Mars Pathfinder spacecraft experienced system resets a few days after landing on Mars (1997). Debugging features were remotely enabled until the cause was found: a priority inversion problem in the VxWorks operating system. The OS software was remotely patched (all the way to Mars) to fix the problem by adding priority inheritance to the task scheduler.

Netflix. An EBS outage in one availability zone was mitigated by migrating to other availability zones.

Pagerduty. In April 2013, Pagerduty, a cloud service providing application uptime monitoring and real-time notifications, suffered an outage when two of its three independent cloud deployments in different data centers began experiencing connectivity issues and high network latency. It was found later that the two independent deployments shared a common peering point which was experiencing network instability. While the third deployment was still operational, Pagerduty's applications failed to establish quorum due to high network latency and hence failed in their ability to send notifications.

PagerDuty. A third party service for sending SMS and making voice calls experienced an outage due to AWS having issues in a region.

Parity. $30 million of cryptocurrency value was diverted (stolen) with another $150 million diverted to a safe place (rescued), after a 4000-line software change containing a security bug was mistakenly labelled as a UI change, inadequately reviewed, deployed, and used by various unsuspecting third parties. See also this analysis.

Platform.sh. Outage during a scheduled maintenance window because there was too much data for Zookeeper to boot.

Reddit. An outage of 1.5 hours was followed by another 1.5 hours of degraded performance on Thursday, August 11, 2016. This was due to an error during a migration of a critical backend system.

Salesforce. An initial disruption due to a power failure in one datacenter led to cascading failures with a database cluster and file discrepancies, resulting in cross data center failover issues.

Sentry. Transaction ID Wraparound in Postgres caused Sentry to go down for most of a working day.

Shapeshift. Poor security practices enabled an employee to steal $200,000 in cryptocurrency in 3 separate hacks over a 1 month period. The company's CEO expanded upon the story in a blog post.

Skyliner. A memory leak in a third party library led to Skyliner being unavailable on two occasions.

Slack. A combination of factors resulted in a large number of Slack's users being disconnected from the server. The subsequent massive disconnection-reconnection process exceeded the database capacity and caused cascading connection failures, leading to 5% of Slack's users not being able to connect to the server for up to 2 hours.

Spotify. Lack of exponential backoff in a microservice caused a cascading failure, leading to notable service degradation.
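For readers unfamiliar with the term, exponential backoff means waiting longer after each failed attempt so that retries do not pile onto a struggling service. The sketch below is a generic version with random jitter, not Spotify's code; the function name, default values, and the injectable `sleep` parameter are all assumptions for illustration.

```python
import random
import time


def call_with_backoff(fn, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    # Retry fn up to `attempts` times. The wait before retry k is drawn
    # uniformly from [0, min(cap, base * 2**k)] so clients do not retry in lockstep.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Passing `sleep` as a parameter keeps the helper testable without real delays.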

Square. A cascading error from an adjacent service led to the merchant authentication service being overloaded. This impacted merchants for ~2 hours.

Stackdriver. In October 2013, Stackdriver experienced an outage when its Cassandra cluster crashed. Data published by various services into a message bus was being ingested into the Cassandra cluster. When the cluster failed, the failure percolated to various producers, which ended up blocking on queue insert operations, eventually leading to the failure of the entire application.

Stack Exchange. Enabling StackEgg for all users resulted in heavy load on load balancers and consequently, a DDoS.

Stack Exchange. Backtracking implementation in the underlying regex engine turned out to be very expensive for a particular post leading to health-check failures and eventual outage.
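As an illustration of how a harmless-looking pattern can become expensive in a backtracking engine, consider trimming trailing whitespace. The pattern below is an assumed stand-in, not necessarily the one Stack Exchange used: on a long run of whitespace that is not at the end of the string, the engine retries the match from every starting position, so the work grows roughly quadratically with the length of the run. A plain string method does the same job in linear time.

```python
import re

# Backtracking-prone on input such as " " * 20000 + "x".
TRAILING_WS = re.compile(r"\s+$")


def strip_trailing_regex(text: str) -> str:
    return TRAILING_WS.sub("", text)


def strip_trailing_linear(text: str) -> str:
    # Same result for trailing whitespace, without any regex backtracking.
    return text.rstrip()
```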

Stack Exchange. Porting old Careers 2.0 code to the new Developer Story caused a leak of users' information.

Stack Exchange. The primary SQL-Server triggered a bugcheck on the SQL Server process, causing the Stack Exchange sites to go into read only mode, and eventually a complete outage.

Strava. Hit the signed integer limit on a primary key, causing uploads to fail.
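A small sketch of the arithmetic involved, assuming a signed 32-bit key (the entry above does not state the width). The helper names are hypothetical; the idea is simply to watch how much of the key space is used before inserts start failing.

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def id_space_used(current_max_id: int, limit: int = INT32_MAX) -> float:
    # Fraction of the available key space already consumed.
    return current_max_id / limit


def needs_attention(current_max_id: int, threshold: float = 0.8) -> bool:
    # Assumed policy: flag the table once 80% of the key space is used.
    return id_space_used(current_max_id) >= threshold
```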

Stripe. Manual operations are regularly executed on production databases. A manual operation was done incorrectly (missing dependency), causing the Stripe API to go down for 90 minutes.

Sweden. Use of different rulers by builders caused the Vasa to be more heavily built on its port side and the ship's designer, not having built a ship with two gun decks before, overbuilt the upper decks, leading to a design that was top heavy. Twenty minutes into its maiden voyage in 1628, the ship heeled to port and sank.

Tarsnap. A batch job which scans for unused blocks in Amazon S3 and marks them to be freed encountered a condition where all retries for freeing certain blocks would fail. The batch job logs its actions to local disk and this log grew without bound. When the filesystem filled, this caused other filesystem writes to fail, and the Tarsnap service stopped. Manually removing the log file restored service.
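One common guard against a log file growing without bound is size-based rotation. The sketch below uses Python's standard `logging.handlers.RotatingFileHandler`; the logger name, size limit, and backup count are arbitrary assumptions, and this is not how Tarsnap is implemented.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_bounded_logger(path: str, max_bytes: int = 1_000_000, backups: int = 3):
    # Disk usage is capped at roughly max_bytes * (backups + 1).
    logger = logging.getLogger("batch_job")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    logger.addHandler(handler)
    return logger
```

Rotation bounds disk usage at the cost of discarding the oldest entries, so it complements rather than replaces alerting on repeated failures.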

Telstra. A fire in a datacenter caused SMS text messages to be sent to random destinations. Corrupt messages were also experienced by customers.

Therac-25. The Therac-25 was a radiation therapy machine involved in at least six accidents between 1985 and 1987 in which patients were given massive overdoses of radiation. Because of concurrent programming errors, it sometimes gave its patients radiation doses that were thousands of times greater than normal, resulting in death or serious injury.

Twilio. In 2013, a temporary network partition in the Redis cluster used for billing operations caused a massive resynchronization from slaves. The overloaded master crashed and, when it was restarted, it started up in read-only mode. This resulted in failed transactions from Twilio's auto-recharge service, which unfortunately billed the customers before updating their balance internally. So the auto-recharge system continued to retry the transaction again and again, resulting in multiple charges to customers' credit cards.
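The ordering problem described above (charge first, record afterwards) can be sketched with a ledger that is written before the card is charged, keyed by an idempotency key. Everything here is hypothetical: the class, the key format, and the ledger interface are illustrative assumptions, not Twilio's design.

```python
class Recharger:
    def __init__(self, charge_card, ledger):
        self._charge_card = charge_card  # callable taking an amount
        self._ledger = ledger            # mutable mapping: key -> status

    def recharge(self, key: str, amount: int) -> bool:
        if key in self._ledger:
            return False  # already attempted; a retry must not charge again
        # Record intent first. If the store is read-only this raises,
        # and the card is never charged.
        self._ledger[key] = "pending"
        self._charge_card(amount)
        self._ledger[key] = "charged"
        return True
```

A charge that fails after the "pending" write leaves the key stuck in that state, so a real system would also need a reconciliation step.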

Valve. Steam's desktop client deleted all local files and directories. The thing I find most interesting about this is that, after this blew up on social media, there were widespread reports that this was reported to Valve months earlier. But Valve doesn't triage most bugs, resulting in an extremely long time-to-mitigate, despite having multiple bug reports on this issue.

VZaar. A release was made from the wrong VCS branch. This led to database changes being applied which broke production.

Yeller. A network partition in a cluster caused some messages to get delayed, up to 6-7 hours. For reasons that aren't clear, a rolling restart of the cluster healed the partition. There's some suspicion that it was due to cached routes, but there wasn't enough logging information to tell for sure.

CCP Games Devblog. Documents a Stackless Python memory reuse bug that took years to track down.

Unfortunately, most of the interesting post-mortems I know about are locked inside confidential pages at Google and Microsoft. Please add more links if you know of any interesting public post mortems! is a pretty good resource; other links to collections of post mortems are also appreciated.

Other lists of postmortems

Availability Digest website.

Google+ postmortems community.

John Daily's list of postmortems (in json).

Jeff Hammerbacher's list of postmortems.

NASA lessons learned database.

Tim Freeman's list of postmortems.

Wikimedia's postmortems.

Autopsy.io's list of Startup failures.

SRE Weekly usually has an Outages section at the end.

Analysis

How Complex Systems Fail

John Allspaw on Resilience Engineering

Contributors

  • Aaron Wigley
  • Ahmet Alp Balkan
  • Allon Murienik
  • Amber Yust
  • Anthony Elizondo
  • Anuj Pahuja
  • Benjamin Gilbert
  • Brad Baris
  • Brendan McLoughlin
  • Brian Scanlan
  • Brock Boland
  • Chris Higgs
  • Chris Sinjakli
  • Connor Shea
  • Dan Luu
  • Dan Nguyen
  • David Pate
  • Dov Murik
  • Ed Spittles
  • Florent Genette
  • Franck Arnulfo
  • gnomon
  • Grey Baker
  • Isaac Rogers
  • Jacob Kaplan-Moss
  • James Graham
  • Jameson Lopp
  • Jason Dusek
  • Jens Rantil
  • John Daily
  • jomo
  • Julia Hansbrough
  • Julian Szulc
  • Justin Montgomery
  • KS Chan
  • Kevin Brown
  • KlavierCat
  • Kunal Mehta
  • Luan Cestari
  • Mark Dennehy
  • Massimiliano Arione
  • Matt Day
  • Michael Robinson
  • Mike Doherty
  • Mohit Agarwal
  • Nat Welch
  • Nate Parsons
  • Nick Sweeting
  • Owen Jacobson
  • Raul Ochoa
  • Ruairi Carroll
  • Samuel Hunter
  • Sean Escriva
  • Shriram Rajagopalan
  • Siddharth Kannan
  • Tim Freeman
  • Tom Crayford
  • Vaibhav Bhembre
  • Veit Heller
  • Vincent Ambo