A curated list of awesome Site Reliability and Production Engineering resources.
"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE
Please take a look at the contribution guidelines first. Contributions are always welcome!
- Culture
- Education
- Books
- Hiring
- Reliability
- Alerting
- Monitoring
- On-Call
- Post-Mortem
- Capacity Planning
- Service Level Agreement
- Performance
- Articles
- Blogs
- Conferences & Meetups
- What is Site Reliability Engineering?
- Keys To SRE by Ben Treynor
- Google SRE Resources
- Notes from Production Engineering by Pedro Canahuati
- PostOps: Recovery from Operations
- Love DevOps? Wait 'till you meet SRE [video]
- How Google Does Planet-Scale Engineering for Planet-Scale Infra
- Site Reliability Engineering at Facebook
- A History of Site Reliability Engineering at Uber
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox
- Site Reliability Engineers — Keeping Google up and running 24/7
- Site Reliability Engineering at Salesforce
- From Sys Admin to Netflix SRE
- SRE@Google: Thousands of DevOps Since 2004
- Transactional System Administration Is Killing Us and Must be Stopped
- Maslow's hierarchy of SRE needs
- PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
- From SysAdmin to Netflix SRE
- SRE: An incomplete guide to cultural Narnia - [Video]
- Putting Together Great SRE Teams
- Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air
- Toil: A Word Every Engineer Should Know
- Engineering Reliability into Web Sites: Google SRE
- DEVOPS & SRE AMA - Building High Performance Organizations
- Site Reliability Engineering with Paul Newson
- How SysAdmins Devalue Themselves
- The Softer Side of DevOps
- SRE, noun. See also: confidence, trust.
- Site Reliability Engineering with Stephen Weinberg
- We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything!
- We are the Google Site Reliability Engineering team. Ask us Anything!
- The Ops Identity Crisis
- The Irreproducibility Of Bugs In Large-Scale Production Systems
- Panel: Educating SRE
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- New to an SRE team?
- The Systems Engineering Side of Site Reliability Engineering
- Site Reliability Engineering: How Google Runs Production Systems
- The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
- Web Operations - Keeping the Data On Time
- The Checklist Manifesto: How to Get Things Right
- Microservices in Production - Standard Principles and Requirements
- Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization
- The Realities of the Job of Delivering Reliability
- Fail at Scale by Ben Maurer
- Embracing Failure: Fault-Injection and Service Reliability
- 10 Years of Crashing Google
- How we break things at Twitter: failure testing
- Reliable Cron across the Planet
- Push our limits - reliability testing at Twitter
- The Verification of a Distributed System by Caitie McCaffrey
- Weathering the Unexpected
- The Remediation Ballet
- SRE Hour: Tech Talks by Box & Yelp
- Simplicity: A Prerequisite for Reliability
- The Two Sides to Google Infrastructure for Everyone Else
- How Embracing Continuous Release Reduced Change Complexity
- Making "Push On Green" a Reality
- BeyondCorp: A New Approach to Enterprise Security
- Brainstorming Failure by Jeff Smith
- The Ripple Effect Of Outages And Downtime Cannot Be Underestimated
- The infrastructure behind Twitter: efficiency and optimization
- Dickerson's Hierarchy of Reliability
- The Morning Paper on Operability
- A Working Theory-of-Monitoring
- The Evolution of Monitoring Systems at Google - Tony Rippy
- Monitoring without Infrastructure @ Airbnb
- Monitoring distributed systems
- Observability at Uber Engineering: Past, Present, Future
- Being an On-Call Engineer: A Google SRE Perspective
- Inside Atlassian: how our site reliability engineers do incident management
- Inside Atlassian: how IT & SRE use ChatOps to run incident management
- Incident Response at Heroku
- Who's On Call?
- A collection of post-mortems
- Blameless PostMortems and a Just Culture
- A Tale of Postmortems
- Building a Blameless Post-Mortem Culture with Jason Hand
- The infinite hows
- Failure is Always An Option: How a Blameless Culture Leads to Better Results
- How to write an Incident Report / Postmortem
- SLA Aware Maintenance for Operators - Joe Smith
- If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues
- Service Level Agreements in the Cloud: Who cares?
- Making a point with SLAs
- What is SRE (Site Reliability Engineering)?
- Here’s How Google Makes Sure It (Almost) Never Goes Down
- Are site reliability engineers the next data scientists?
- Site Reliability Engineers: "solving the most interesting problems"
- Site Reliability Engineers: the "world’s most intense pit crew"
- Site reliability engineering kicks rote tasks out of IT ops
- Notes on Site Reliability Engineering
- Adventures in SRE-land: Welcome to Google Mission Control
- LinkedIn Preps Site Reliability Engineers (SREs) For Exciting Careers
- Book Review: Site Reliability Engineering - How Google Runs Production Systems
- #sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
- #incident_response channel at Hangops Slack - Discussion about Incident Response.
- Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
- Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
- High Scalability - Technical Blog Posts About Systems Architecture.
- rachelbythebay - Technical Blog Posts.
- SRE Weekly - Weekly Site Reliability Newsletter.
- Production Ready - A mailing list about building resilient infrastructure and tools.
- Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
- SREally?!
- SRECon Conferences - The Official SRE Conference.
- LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
- SRE Tech Talks - SRE Talks Hosted by Google.
- South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
- San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
- Alice Goldfuss - SRE @ New Relic - Tweets About the SRE Culture.
- Brendan Gregg - SRE @ Netflix - Technical Resources about Systems, Performance and Site Reliability Engineers.
- Caitie McCaffrey - Tweets About Reliability and Distributed Systems.
- Dave Hahn - SRE @ Netflix.
- Highscal - Feed of the High Scalability Blog.
- Jennifer Petoff - Program Manager for Google's Site Reliability Engineering team.
- Jesse Dearing - SRE @ InVisionApp.
- Jonah Horowitz - SRE @ Netflix.
- Niall Murphy - SRE @ Google.
- Nick Craver - SRE @ StackOverflow.
- SREBook - The Official Twitter Account of Site Reliability Engineering Book.
- SREcon - SREcon's Official Twitter Account.
- Thomas A. Limoncelli - Prominent Author About SysAdmin/DevOps/SRE.
- Todd Underwood - SRE @ Google.
- Twitter SRE - The Official Twitter Account of Twitter's SRE team.
- USENIX Association - The Official USENIX Twitter Account.