Awesome Site Reliability Engineering

A curated list of awesome Site Reliability and Production Engineering resources.

What is Site Reliability Engineering?

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE

Contributing

Please take a look at the contribution guidelines first. Contributions are always welcome!

Culture
Education
Books
Hiring
Reliability
Monitoring & Observability & Alerting
On-Call
Post-Mortem
Capacity Planning
Service Level Agreement
Performance
Misc Articles
Blogs
Newsletters
Conferences & Meetups
Twitter
SRE Tools

Culture

Education

Books

Hiring

Reliability

Monitoring & Observability & Alerting

On-Call

Post-Mortem

Capacity Planning

Service Level Agreement

Performance

Programming

Misc Articles

Real-time Messaging

#sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
#incident_response channel at Hangops Slack - Discussion about Incident Response.
USENIX SREcon Slack

Blogs

Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
High Scalability - Technical Blog Posts About Systems Architecture.
rachelbythebay - Technical Blog Posts.
Production Ready - A mailing list about building resilient infrastructure and tools.
Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
SysAdvent - One article for each day of December, ending on the 25th.
Operations for Developers - A collection of resources for developers to strengthen their Ops skills.
Stephen Thorne's Blog - Blog Posts About SRE.
Increment - A digital magazine about how teams build and operate software systems at scale.
GopherSRE - Blog Posts about Go and SRE.
Cindy Sridharan - Blog posts about distributed systems and their management.
Blameless Blog - Blog posts about SRE culture and practices.
Resilience Roundup - Weekly analysis of Resilience Engineering and Human Factors research designed for software systems.
Squadcast Blog - Blog posts about SRE best practices, reliability, on-call and incident management.

Newsletters

DevOpsLinks - A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.
KubeWeekly - The weekly newsletter for all things Kubernetes, curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.
SRE Weekly - Weekly Site Reliability Newsletter.
O’Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
ChaosEngineering.news - Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!

Conferences & Meetups

SREcon Conferences - The Official SRE Conference.
LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
SRE Tech Talks - SRE Talks Hosted by Google.
South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
Front Range Site Reliability Engineering - SRE Meetup in Boulder/Denver/Golden/DTC/FoCo area.
Site Reliability Engineering Munich, Germany - SRE Meetup in the greater area of Oktoberfest city.
ADDO - All Day DevOps - A 24 hour conference that is completely online and free.
Site Reliability Engineering Paris, France - SRE Meetup in the city of light.

Twitter

Google SRE Twitter Account - Google's SRE Twitter Account.
SREBook - The Official Twitter Account of Site Reliability Engineering Book.
SREcon - SREcon's Official Twitter Account.
SREWorkbook - The Official Twitter Account of Site Reliability Workbook.
The SRE Dev - SRE-related Posts from dev.to.
Twitter SRE - The Official Twitter Account of Twitter's SRE team.
Twitter SRE Weekly - The Official Twitter Account of SRE Weekly Newsletter.
USENIX Association - The Official USENIX Twitter Account.

SRE Tools

Awesome SRE Tools - A curated list of Site Reliability and Production Engineering tools.
List of Continuous Integration services
