Awesome Site Reliability Engineering

A curated list of awesome Site Reliability and Production Engineering resources.

What is Site Reliability Engineering?

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE

Contributing

Please take a look at the contribution guidelines first. Contributions are always welcome!

Contents

Culture
Education
Books
Hiring
Reliability
Alerting
Monitoring
On-Call
Post-Mortem
Capacity Planning
Service Level Agreement
Performance
Programming
Articles
Real-time Messaging
Blogs
Conferences & Meetups
Twitter

Culture

Education

Books

Hiring

Reliability

Alerting

Monitoring

On-Call

Post-Mortem

Capacity Planning

Capacity Planning

Service Level Agreement

Performance

Performance Checklists for SREs

Programming

Articles

Real-time Messaging

#sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
#incident_response channel at Hangops Slack - Discussion about Incident Response.

Blogs

Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
High Scalability - Technical Blog Posts About Systems Architecture.
rachelbythebay - Technical Blog Posts.
SRE Weekly - Weekly Site Reliability Newsletter.
Production Ready - A mailing list about building resilient infrastructure and tools.
Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
SREally?!

Conferences & Meetups

SREcon Conferences - The Official SRE Conference.
LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
SRE Tech Talks - SRE Talks Hosted by Google.
South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.

Twitter

Alice Goldfuss - SRE @ New Relic - Tweets About the SRE Culture.
Brendan Gregg - SRE @ Netflix - Technical Resources about Systems, Performance and Site Reliability Engineering.
Caitie McCaffrey - Tweets About Reliability and Distributed Systems.
Dave Hahn - SRE @ Netflix.
Highscal - Feed of the High Scalability Blog.
Jennifer Petoff - Program Manager for Google's Site Reliability Engineering team.
Jesse Dearing - SRE @ InVisionApp.
Jonah Horowitz - SRE @ Netflix.
Niall Murphy - SRE @ Google.
Nick Craver - SRE @ StackOverflow.
SREBook - The Official Twitter Account of the Site Reliability Engineering Book.
SREcon - SREcon's Official Twitter Account.
Thomas A. Limoncelli - Prominent Author About SysAdmin/DevOps/SRE.
Todd Underwood - SRE @ Google.
Twitter SRE - The Official Twitter Account of Twitter's SRE team.
USENIX Association - The Official USENIX Twitter Account.
