- What to promise to whom
- What metrics to measure
- How much reliability is good enough
- SLA
- Violating SLAs requires costly compensation.
- SLO
- They should always be stricter than your SLAs, because customers are usually impacted before the SLA is actually breached.
- An SLO is effectively an internal promise to meet customer expectations.
- When you meet your target SLOs, users are happy. When you miss them, users are sad.
- Happiness Test
- The test states that services need target SLOs that capture the performance and availability levels that, if barely met, would keep a typical customer happy.
- The challenge is quantifying and measuring the happiness of your customers since you can't do this directly.
- You also have to make sure you're thinking about all groups of your customers.
- Measuring Reliability
- Time to start playing (= latency)
- No interruptions or issues with playback
- SLI (Service Level Indicator)
- (good events / valid events)
- ex)
- request latency
- error rate = errors / total requests
- throughput (requests per second)
- pros, cons, tradeoffs
- For example, perhaps you're already exporting the data for your SLI to your monitoring system, which is a big plus. But maybe it doesn't retain history far enough back, or you've come up with a perfect measure of user experience but actually implementing that measurement is far too complex.
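- For illustration, here is a minimal Python sketch (not from the course; the request records and the 300 ms threshold are made-up assumptions) that computes an availability SLI and a latency SLI as good events / valid events:

```python
# Hypothetical request records: (status_code, latency_ms)
requests = [(200, 120), (200, 350), (500, 90), (200, 80), (503, 2000)]

LATENCY_THRESHOLD_MS = 300  # assumed "fast enough" threshold for users

valid = len(requests)  # in this simple example, every request counts as a valid event
available = sum(1 for code, _ in requests if code < 500)                        # non-5xx = good
fast = sum(1 for code, ms in requests if code < 500 and ms < LATENCY_THRESHOLD_MS)

availability_sli = available / valid   # good events / valid events
latency_sli = fast / valid

print(f"availability SLI: {availability_sli:.1%}")  # 60.0%
print(f"latency SLI:      {latency_sli:.1%}")       # 40.0%
```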
- How do you set SLOs for SLIs?
- Edge cases
- The impact of outages isn't always constant over time.
- ex) Black Friday
- Customers might shift from wanting a three-nines service for most of the year to something closer to four nines, which teams address with temporary strategies such as over-provisioning resources, implementing change freezes, or running war rooms.
- ex) Outage duration can impact customer happiness
- 4 one-hour outages vs. 1 four-hour outage vs. constant 0.5% errors.
- It's common to set one target for the median user and another for the long tail, to make sure the tail doesn't get too long (see the sketch below).
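- As a rough illustration of median vs. long-tail targets, a small sketch with invented latency samples and assumed 200 ms / 1000 ms targets:

```python
import math

# Illustrative latency samples in milliseconds (made up for the example).
latencies_ms = sorted([80, 95, 110, 120, 150, 180, 240, 400, 900, 2500])

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[rank - 1]

p50 = percentile(latencies_ms, 50)   # what the median user experiences
p99 = percentile(latencies_ms, 99)   # what the long tail experiences

# Assumed targets: median under 200 ms, tail under 1000 ms.
print(f"p50 = {p50} ms, p99 = {p99} ms")
print("median OK:", p50 <= 200, "| tail OK:", p99 <= 1000)
```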
- 100% is the wrong target
"100% is the wrong reliability target for basically everything." - Ben Treynor, Google, LLC
- Higher availability costs more to provide and reduces your ability to make changes and release new features.
- If you want to improve reliability, sooner or later you'll need to slow down changes with things like increased testing, less frequent releases, and more manual analysis.
- Iterating
- Running a service with an SLO is an adaptive and iterative process.
- You should expect to regularly re-evaluate the details of your chosen SLIs and the SLO targets you set.
- Learn how to...
- Quantify missed reliability targets with error budgets.
- Make business decisions based on reliability targets.
- Make a service more reliable.
- Error budgets
- Inverse of availability.
- How unreliable your service is allowed to be.
- 99.9% success (SLO) = 0.1% failure (error budget) = 40.32 minutes of downtime per month (28 days).
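- The arithmetic behind that number, as a quick sketch (other SLO levels shown only for comparison):

```python
WINDOW_MINUTES = 28 * 24 * 60  # 28-day rolling window = 40,320 minutes

for slo in (0.999, 0.9995, 0.9999):
    error_budget = 1 - slo                        # allowed fraction of bad minutes
    allowed_downtime = WINDOW_MINUTES * error_budget
    print(f"SLO {slo:.2%}: {allowed_downtime:.2f} minutes of downtime per 28 days")
```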
- Benefits
- Common incentives for Devs and SREs
- Dev team can self-manage risk
- Unrealistic goals become unattractive
- Everything is a trade-off
- When you try to maximize your reliability, you limit the development velocity of new features, quicker releases, etc.
- Align Incentives
- Devs can take risks and push more quickly.
- SRE team can work more proactively.
- Reliability vs. (Cost, Velocity)
- Effective SLOs
- Have executive buy-in.
- Have consequences.
- Are accurately measured.
- Advanced Techniques
- Dynamic release cadence
- Based on remaining error budget
- "Rainy Day" Fund
- Covers unexpected events
- Error budget-based alerts
- Budget exhaustion (burn) rate drives alerting; see the sketch after this list
- "Silver bullets"
- For "critical" new features
- Silver bullets aren't required to fix latency, quality or reliability problems that are causing a negative user experience.
- Silver bullets do not roll over.
- The use of a silver bullet is generally regarded as a failure, and it should trigger a post mortem or other retrospective to learn what went wrong and how to fix it for next time.
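- A minimal sketch of how the "dynamic release cadence" and "error budget-based alerts" ideas above might be combined; the 2x burn-rate threshold and the 28-day window are assumptions, not prescriptions from the course:

```python
WINDOW_DAYS = 28
SLO = 0.999
TOTAL_BUDGET = 1 - SLO  # fraction of requests allowed to fail within the window

def budget_status(bad_ratio_so_far, days_elapsed):
    """bad_ratio_so_far: failed / valid requests observed so far in the window."""
    spent = bad_ratio_so_far / TOTAL_BUDGET        # fraction of the error budget consumed
    expected = days_elapsed / WINDOW_DAYS          # uniform-spend baseline
    burn_rate = spent / expected if expected > 0 else float("inf")

    alert = burn_rate > 2.0          # assumed threshold: burning twice as fast as sustainable
    freeze_releases = spent >= 1.0   # budget exhausted: slow or stop feature releases
    return spent, burn_rate, alert, freeze_releases

spent, burn, alert, freeze = budget_status(bad_ratio_so_far=0.0006, days_elapsed=7)
print(f"budget spent: {spent:.0%}, burn rate: {burn:.1f}x, alert: {alert}, freeze: {freeze}")
```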
- Excessive helpfulness is harmful.
- Axes of improvement
- TTD: time-to-detect
- TTR: time-to-repair or time-to-resolution
- Improve reliability by
- Reducing detection time: lower TTD
- Reducing repair time: lower TTR
- Reducing impact %: Canary deployment
- Reducing frequency: higher TTF (time-to-failure); see the sketch below
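- A rough, illustrative model of how those levers interact; the function and its numbers are invented to show that expected error-budget burn scales with each factor:

```python
def expected_bad_minutes_per_month(ttd_min, ttr_min, impact_fraction, incidents_per_month):
    """Approximate error-budget burn: each incident is 'bad' for roughly TTD + TTR minutes
    (time until detected plus time to repair), scaled by the fraction of users affected."""
    return (ttd_min + ttr_min) * impact_fraction * incidents_per_month

baseline = expected_bad_minutes_per_month(ttd_min=20, ttr_min=60, impact_fraction=1.0,
                                          incidents_per_month=2)
with_canary = expected_bad_minutes_per_month(ttd_min=20, ttr_min=60, impact_fraction=0.1,
                                             incidents_per_month=2)
print(baseline, with_canary)  # 160.0 vs 16.0 bad minutes: limiting impact % pays off quickly
```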
- Operational approach to increasing reliability
- Report on uneven error budget spend
- Provide input on achieving targets
- Standardize Infrastructure
- Consult on system design
- Build safe release and rollback
- Author postmortems
- Use phased rollouts
- Module summary
- Define your problem space: SLOs and SLIs.
- Make your system as reliable as it must be, but no more.
- Error budgets are your primary basis of communication.
- SLOs are not set in stone forever.
- The team relationship has to be strong to make this work.
- Learn how to...
- Choose the right metric.
- Measure your SLIs.
- Choose an SLI specification.
- Refine an SLI implementation.
- Reduce SLI complexity.
- Set SLO targets.
- User happiness in metric form
- Ideally, you want to define SLIs that have a predictable, mostly linear relationship with the happiness of your users.
- The properties of good SLI metrics
- (X) System metrics (load average, CPU utilization, memory usage, bandwidth, ...)
- Users don't directly see your CPUs pegged at 100 percent.
- They see your service responding slowly.
- (X) Internal state (thread pool fullness, request queue length, ...)
- The data is noisy, and there are many reasons why large changes could occur.
- (V) Has a predictable relationship with user happiness.
- (V) Shows service is working as users expect it to.
- (V) Expressed as: (good events) / (valid events)
- (V) Aggregated over a long time horizon.
- Ways of measuring SLIs
- First up is deriving SLI metrics from processing server-side request logs or data (see the sketch below).
- Covering all the edge cases in a complex journey with a synthetic client can be a huge engineering task.
- Measuring from the client's perspective also captures a lot of factors that are outside of your direct control, which can lower the signal to noise ratio of prospective SLIs.
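- As a sketch of the first approach (server-side log processing), with an invented log format and thresholds; a real pipeline would live in your logging or metrics system:

```python
import re

# Hypothetical access-log lines: "... status=<code> latency_ms=<ms>"
log_lines = [
    "GET /play status=200 latency_ms=143",
    "GET /play status=502 latency_ms=31",
    "GET /browse status=200 latency_ms=87",
]

pattern = re.compile(r"status=(\d+) latency_ms=(\d+)")

good = valid = 0
for line in log_lines:
    m = pattern.search(line)
    if not m:
        continue                         # unparseable lines are not counted as valid events
    status, latency = int(m.group(1)), int(m.group(2))
    valid += 1
    if status < 500 and latency < 300:   # "good" = served successfully and fast enough
        good += 1

print(f"SLI from logs: {good / valid:.1%}")  # 66.7% for this sample
```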
- The SLI menu
- Request/Response
- Availability
- Latency
- Quality
- Data Processing
- Coverage
- Correctness
- Freshness
- Throughput
- Storage
- Durability
- The SLI equation
- SLI : (good events / valid events)
- SLIs fall between 0 and 100%
- 0% : nothing works
- 100% : nothing is broken
- Consistency makes building common tooling easier
- What makes an event valid?
- Sometimes you may need to completely exclude some events recorded by your underlying monitoring metrics from your SLI, so they cannot consume your error budget (see the sketch below).
- What makes an event good?
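- To make the valid/good distinction concrete, a small sketch with an assumed policy: health checks and client-side 4xx errors are excluded from validity, while 5xx responses stay valid but are not good:

```python
# Hypothetical events: (path, status_code)
events = [
    ("/healthz", 200),   # load-balancer health check: not a real user, excluded
    ("/checkout", 200),
    ("/checkout", 404),  # client error: excluded from validity under this example policy
    ("/checkout", 500),  # valid but not good: consumes error budget
    ("/browse", 200),
]

def is_valid(path, status):
    return path != "/healthz" and not (400 <= status < 500)

def is_good(status):
    return status < 500

valid_events = [(p, s) for p, s in events if is_valid(p, s)]
good_events = [(p, s) for p, s in valid_events if is_good(s)]

print(f"SLI: {len(good_events)}/{len(valid_events)} = "
      f"{len(good_events) / len(valid_events):.1%}")  # SLI: 2/3 = 66.7%
```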