How they SRE

A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

Introduction

How They SRE is a curated knowledge repository of the SRE best practices, tools, techniques, and culture adopted by leading technology and tech-savvy organizations.

Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. The content in this repository is curated from these sources.

Note to readers: This list includes some articles, posts, videos, tools, and techniques published before 2015. Use such material with caution, as more recent advances in technology and practice may offer better alternatives and perspectives.

Topics

Site Reliability Engineering
Hiring and Building SRE teams
SRE Culture
DevOps
Monitoring & Observability
Alerting
Incident Response & Post-Mortem
On-Call
Testing in Production
Chaos Engineering
Automation
Performance

Organizations

Achievers

Blog Posts

Airbnb

Blog Posts

Algolia

Blog Posts

Alibaba Cloud

Blog Posts

Asana

Blog Posts

ASOS

Blog Posts

Atlassian

Blog Posts

Back Market

Blog Posts

How Back Market SREs prepared for Black Friday

Baidu

Videos

Basecamp

Blog Posts

Books

Shape Up

Bloomberg

Videos

Booking.com

Blog Posts

Videos

Capital One

Blog Posts

Major incidents & analysis reports

Videos

Coinbase

Blog Posts

Open Sourcing Coinbase’s Secure Deployment Pipeline

DAZN

Blog Posts

Site Reliability at DAZN

DBS

Blog Posts

Videos

SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS

DeepSource

Blog Posts

Dream11

Blog Posts

Dropbox

Blog Posts

Videos

Service Discovery Challenges at Scale

eBay

Blog Posts

Video

Madaari: Ordering for the Monkeys

Epic Games

Video

AWS re:Invent 2018: Epic Games Uses AWS to Deliver Fortnite to 200 Million Players

Etsy

Blog Posts

Videos

Expedia

Blog Posts

Fastly

Videos

Getaround

Blog Posts

GitHub

Blog Posts

Major incidents & analysis reports

Videos

One on One SRE

GitLab

Blog Posts

GoCardless

Blog Posts

Major incidents & analysis reports

GoDaddy

Blog Posts

Gojek

Blog Posts

Goldman Sachs

Blog Posts

Google

Blog Posts

Videos

Grab

Blog Posts

Grammarly

Blog Posts

Gusto

Blog Posts

Halodoc

Blog Posts

Site Reliability Engineering for Native mobile apps

Heroku

Blog Posts

IBM

Blog Posts

Indeed

Blog Posts

Videos

Are We Getting Better Yet? Progress Toward Safer Operations

Khan Academy

Blog Posts

Videos

Tools

On-Call

Loggi

Blog Posts

Loveholidays

Blog Posts

Macquarie

Blog Posts

Mattermost

Blog Posts

Meituan (美团)

Blog Posts

The development and practice of SRE in the cloud (云端的SRE发展与实践)

Mercari

Blog Posts

Videos

Microsoft

Videos

MIRO

Blog Posts

Monzo

Blog Posts

Videos

Eventually Consistent Service Discovery

Tools

Response

Netflix

Blog Posts

Major incidents & analysis reports

Post-mortem of October 22, 2012 AWS degradation

Videos

Podcasts

Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems

Tools

Dispatch

New Relic

Blog Posts

Nubank

Blog Posts

OpenAI

Blog Posts

PayPal

Blog Posts

Videos

Picnic

Blog Posts

Videos

Postman

Blog Posts

Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana

Prezi

Blog Posts

Red Hat

Blog Posts

Riot Games

Blog Posts

Salesforce

Blog Posts

Schibsted Media

Blog Posts

Reliability engineering for some of top 10 sites in Scandinavia

Scribd

Blog Posts

Shopify

Blog Posts

Videos

Sky Betting and Gaming

Blog Posts

Slack

Blog Posts

Videos

Slalom Build

Blog Posts

SoundCloud

Blog Posts

Spotify

Blog Posts

Videos

Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance

Squarespace

Blog Posts

Under the Hood: Ensuring Site Reliability

Videos

Stack Overflow

Blog Posts

Videos

Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline

Strava

Blog Posts

Stripe

Blog Posts

Videos

Target

Blog Posts

Teads

Blog Posts

Scaling your on-duty team

Tinder

Blog Posts

Tokopedia

Blog Posts

Trivago

Blog Posts

How To Get Fooled By Metrics

Twilio

Blog Posts

Twilio SRE Gameday Template

Twitter

Blog Posts

Uber

Blog Posts

Videos

Udemy

Blog Posts

upGrad

Blog Posts

VGW

Blog Posts

The SRE Incident Response game

Videos

Level Up Your Incident Response With Gameplay

Wikimedia Foundation

Videos

Wix

Blog Posts

Yelp

Blog Posts

The process: Implementing Yelp’s failover strategy

Videos

Yelp - What I Wish I Knew before Going On-Call

Zalando

Blog Posts

Zerodha

Blog Posts

Infrastructure monitoring with Prometheus at Zerodha

Zomato

Blog Posts

Huddle Diaries – DevOps and Data Platform

SRECon Mix Playlist

Videos

Resources

Books

Events

Other Resources

Awesome Lists

SRE Resources from various organizations

Incidents & postmortems

Newsletters

Credits

Inspired by Howtheytest by Abhijeet Vaikar
The list of organizations is drawn from my other repo, awesome-engineering
Banner image: cartoon vector created by vectorjuice (www.freepik.com)

Other How They... repos

Contribute

Contributions welcome! Read the contribution guidelines first.

License

To the extent possible under law, Unmesh Gundecha has waived all copyright and related or neighboring rights to this work.

If you decide to use this anywhere, please credit @upgundecha on Twitter. If you like my work, check out my other projects on GitHub.
