Awesome Site Reliability Engineering

A curated list of awesome Site Reliability and Production Engineering resources.

What is Site Reliability Engineering?

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE

Contributing

Please take a look at the contribution guidelines first. Contributions are always welcome!

Contents

Culture
Education
Books
Hiring
Reliability
Monitoring & Observability & Alerting
On-Call
Post-Mortem
Capacity Planning
Service Level Agreement
Performance
Misc Articles
Blogs
Conferences & Meetups
Twitter

Culture

Education

Books

Hiring

Reliability

Monitoring & Observability & Alerting

On-Call

Post-Mortem

Capacity Planning

Service Level Agreement

Performance

Programming

Misc Articles

Real-time Messaging

#sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
#incident_response channel at Hangops Slack - Discussion about Incident Response.

Blogs

Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
High Scalability - Technical Blog Posts About Systems Architecture.
rachelbythebay - Technical Blog Posts.
SRE Weekly - Weekly Site Reliability Newsletter.
Production Ready - A mailing list about building resilient infrastructure and tools.
Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
SREally?!
SysAdvent - One article for each day of December, ending on the 25th.
Operations for Developers - A collection of resources for developers to strengthen their Ops skills.
ProdOps: From Product to Production
Stephen Thorne's Blog - Blog Posts About SRE.
Increment - A digital magazine about how teams build and operate software systems at scale.
O’Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
GopherSRE - Blog Posts about Go and SRE.

Conferences & Meetups

SREcon Conferences - The Official SRE Conference.
LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
SRE Tech Talks - SRE Talks Hosted by Google.
South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
Front Range Site Reliability Engineering - SRE Meetup in Boulder/Denver/Golden/DTC/FoCo area.
Site Reliability Engineering Munich, Germany - SRE Meetup in the greater area of Oktoberfest city.

Twitter

Alice Goldfuss
Brendan Gregg
Caitie McCaffrey
Charity Majors
Dave Hahn
Highscal
Jennifer Petoff
Jesse Dearing
Jonah Horowitz
Julia Evans
Krishelle Hardson-Hurley
Niall Murphy
Nick Craver
SREBook - The Official Twitter Account of Site Reliability Engineering Book.
SREcon - SREcon's Official Twitter Account.
Tammy Bütow
Thomas A. Limoncelli
Todd Underwood
Twitter SRE - The Official Twitter Account of Twitter's SRE team.
USENIX Association - The Official USENIX Twitter Account.

HiFaraz/awesome-sre