Due to my background in large-scale automation, I'm often asked for advice on SRE practices. Fortunately, I've worked with some of the best SREs in the business, including @NathenHarvey - Developer Advocate for Google Cloud Platform.
When I meet a customer or colleague who asks about SRE, I always ask these initial questions:
- How did you find out about your last outage?
- What was the impact to your customers?
- How did your organisation respond?
Site Reliability Engineering (SRE) is a set of principles, practices, and organisational constructs that seek to balance the reliability of a service with the need to continually deliver new features.
The REAL definition.
Yes, it was created by Google. SRE is what you get when you treat operations as a software problem: using code to manage availability, latency, performance and capacity.
Some things to remember:
- The MOST important feature of any system is its reliability.
- Our monitoring does not decide our reliability; our users do.
- Reliability comes from well-engineered software, operations and business working together.
The 3 most important measures for an SRE.
Metrics that describe the user experience (a minimal SLI sketch follows these three measures):
- Response times
- Ticket Queue monitoring
- Net Promoter Score
- Twitter sentiment
- Reddit issues
Our target health
- How good do our customers need it to be?
Contractual obligations
- Penalties
- Restore times
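To make the first measure concrete, here's a minimal Python sketch of an availability SLI computed as good events over valid events. The function name and the hard-coded counts are my own illustration; in practice the numbers would come from your monitoring system.

```python
# Minimal sketch: an availability SLI as good requests / total requests.
# The counts below are placeholders; real values come from monitoring queries.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that met the 'good' criteria."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing has failed
    return good_requests / total_requests

# Example: 999,532 successful responses out of 1,000,000 requests.
sli = availability_sli(999_532, 1_000_000)
print(f"Availability SLI: {sli:.4%}")  # -> Availability SLI: 99.9532%
```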
We want to be asking ourselves:
When do we need to make a system more reliable?
What is an Error Budget?
- Targeting less than 100% reliability means targeting more than 0% unreliability.
- An acceptable rate of errors.
- This is a budget that can be allocated.
E.g.
SLO: 99.9%
Error budget: 100% - 99.9% = 0.1%
For a 1B query/month service, that's 1 million errors to “spend”!
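That arithmetic is simple enough to sketch in a few lines of Python (the variable names are mine; the numbers are the ones from the example above):

```python
# Sketch of the error-budget arithmetic from the example above.
SLO = 0.999                      # 99.9% availability target
monthly_queries = 1_000_000_000  # 1B queries/month

error_budget_ratio = 1 - SLO                       # 0.1% of traffic may fail
error_budget_events = monthly_queries * error_budget_ratio

print(f"Error budget: {error_budget_ratio:.1%} "
      f"= {error_budget_events:,.0f} errors to spend this month")
# -> Error budget: 0.1% = 1,000,000 errors to spend this month
```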
What to do when you have budget to spend, and what to do when you're out of budget! (A simple budget-gate sketch in Python follows this list.)
When you're out of budget:
- Prioritise postmortem items
- Automate deployment pipelines
- Improve monitoring and observability
- Require SRE consultation
When you have budget to spend:
- Expected system changes
- Inevitable failures in hardware, networks, etc.
- Planned downtime
- Risky experiments
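Here's a minimal sketch of how that decision could be encoded. The function and its messages are hypothetical, my own illustration of the core comparison (errors spent versus errors allowed); a real error-budget policy is an organisational agreement, not four lines of code.

```python
# Hypothetical error-budget policy gate: if budget remains, spend it on
# risky changes; if it's exhausted, shift effort to reliability work.

def release_decision(slo: float, good: int, total: int) -> str:
    budget = (1 - slo) * total  # errors we are allowed this window
    spent = total - good        # errors we have actually served
    if spent < budget:
        return "budget remaining: risky experiments and releases may proceed"
    return "budget exhausted: freeze features, prioritise postmortem items"

# Example: 99.9% SLO, 1M requests, 1,200 failures -> budget of 1,000 is blown.
print(release_decision(0.999, 998_800, 1_000_000))
```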
This is the hierarchy of SRE as defined by Google.
Firefighters say, “We have never responded to an emergency!” They practice putting out fires so that even a burning house is routine.
SREs must practice responding to incidents because outages WILL happen!
As they say:
An incident is an unplanned investment!
The Framework is as follows:
Detect -> Investigate -> Mitigate -> Fix -> Learn
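As a rough illustration, the framework's phases can be tracked as a simple timeline so you can measure things like time-to-detect and time-to-mitigate. This is a sketch of my own, not a Google tool; the class and field names are made up.

```python
# Minimal sketch: tracking an incident through the framework's phases
# so durations (e.g. time to mitigate) can be measured per incident.
from datetime import datetime, timezone

PHASES = ["Detect", "Investigate", "Mitigate", "Fix", "Learn"]

class Incident:
    def __init__(self) -> None:
        self.timeline: dict[str, datetime] = {}

    def advance(self, phase: str) -> None:
        """Record when the incident entered a given phase."""
        assert phase in PHASES, f"unknown phase: {phase}"
        self.timeline[phase] = datetime.now(timezone.utc)

incident = Incident()
incident.advance("Detect")
# ... responders investigate and apply a mitigation ...
incident.advance("Mitigate")
ttm = incident.timeline["Mitigate"] - incident.timeline["Detect"]
print(f"Time to mitigate: {ttm}")
```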
- Free books - https://sre.google/books/
- Splunk SRE site - https://www.observability.splunk.com/en_us/infrastructure-monitoring/guide-to-sre-and-the-four-golden-signals-of-monitoring.html
So, what are the answers to my questions above?
- How did you find out about your last outage? It’s your SLI.
- What was the impact to your customers? It’s your SLO.
- How did your organisation respond? Error budget consequences.
Create a CoP, or Community of Practice. Nominate some key leaders from each group and include everyone. The CoP is a place for everyone to learn, practice skills, plan, design and build better platforms, applications and monitoring.
Discussion topics for the CoP to start with:
- SCM / GitOps.
- DevSecOps
- CI/CD Pipelines.
- Infra as Code. (Terraform, Config Mgmt.)
- Security as Code.
- Application Automation.
- Immutable Artifacts and Artifact Repos.
- Blameless Postmortems.
- SLIs, SLOs and SLAs / Error Budgets.
- Monitoring.
- Alerting.
- Observability.
- Feature Toggles.
- Auto Scaling.
- Self Healing Apps and Infra.
- Automated Testing.