Questionnaire: What makes a great SRE?

Question

Closed this issue 6 years ago · 7 comments

Becoming a great SRE is a challenging task. Apart from rich knowledge of hardware, operating systems and software engineering, there is a fair bit of data generated by users and machines that needs to be analyzed. This document gathers examples of such applications.

Please contribute, and vote for things you find important.

Guidelines:

Each comment should describe a single example
Each example should be as concrete as possible
Examples should be abstract: do not mention specific technologies (mysql) or tools (ganglia)
Each example should complete the sentence: "A great SRE should be able to..."

Answer 1 · 2016-07-08T14:47:08.000Z

Calculate the Uptime of a node or a service over a given amount of time
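
As a rough illustration of the calculation above, here is a minimal sketch. It assumes uptime is measured from equally spaced health checks, which is only one of several possible definitions. The function name and the sample data are hypothetical.

```python
def uptime_fraction(checks):
    """Fraction of successful checks.

    checks: list of booleans, one per equally spaced health check
    (True = the node or service answered). This is an assumed data model.
    """
    if not checks:
        raise ValueError("need at least one check")
    return sum(checks) / len(checks)


# Hypothetical data: one check per minute for a day, 3 of them failed.
checks = [True] * 1437 + [False] * 3
availability = uptime_fraction(checks)
downtime_minutes = len(checks) - sum(checks)
print(f"uptime: {availability:.4%}, downtime: {downtime_minutes} min")
```

Note that the result depends on the check interval: an outage shorter than the interval may not be counted at all.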

Answer 2 · 2016-07-08T14:48:13.000Z

Understand the data shown in graphs produced by monitoring tools:

Display period vs. aggregation period, spike erosion
Linear and Log scale
Best practices for creating graphs: Caption, Legend, Units, ...
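
Spike erosion can be shown with a small sketch. It assumes a graph that averages raw samples into wider aggregation windows before display. The helper names and the sample values are made up for illustration.

```python
def aggregate(samples, window, func):
    """Collapse consecutive groups of `window` samples into one value each."""
    return [func(samples[i:i + window]) for i in range(0, len(samples), window)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical one-minute samples: a flat signal with a single spike.
samples = [10] * 60
samples[17] = 500

by_mean = aggregate(samples, 10, mean)  # 10-minute averages
by_max = aggregate(samples, 10, max)    # 10-minute maxima

print(max(by_mean))  # the spike is eroded to 59.0
print(max(by_max))   # the spike survives as 500
```

With averaging, the peak of 500 appears as 59 on the graph. Aggregating with the maximum keeps the peak visible, at the cost of hiding how short it was.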

Answer 3 · 2016-07-08T14:53:27.000Z

Monitor and calculate percentiles of API latencies over:

different time periods
different levels of aggregation (node, service, endpoint)
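
One pitfall when changing the level of aggregation is that percentiles cannot simply be averaged. The sketch below uses the nearest-rank definition of a percentile, which is one choice among several, and hypothetical latency data for two nodes.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds, per node.
latencies_ms = {
    "node-a": [12, 15, 11, 14, 13, 16, 12, 15, 14, 13],
    "node-b": [40, 250, 45, 300, 42, 48, 41, 44, 43, 46],
}

per_node_p90 = {node: percentile(v, 90) for node, v in latencies_ms.items()}

# Service-level percentile: computed over the merged raw data.
merged = [x for v in latencies_ms.values() for x in v]
service_p90 = percentile(merged, 90)

# Averaging the per-node percentiles gives a different, misleading number.
mean_of_p90 = sum(per_node_p90.values()) / len(per_node_p90)

print(per_node_p90, service_p90, mean_of_p90)
```

Here the per-node p90 values are 15 ms and 250 ms, their average is 132.5 ms, but the p90 over all requests is 48 ms. The same effect appears when rolling up percentiles over time periods.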

Answer 4 · 2016-07-08T14:55:17.000Z

Select appropriate metrics to monitor a new service for availability and service quality

Answer 5 · 2016-07-08T15:16:15.000Z

Determine the demarcation line between hardware and software issues

i.e. correlate information from different graphs to pinpoint the issue quickly
e.g. if load is high but transactions per second are not increasing, we could have saturated the NIC or run out of bandwidth on the provider side, or the cause could be the cache hit ratio in the database

Contributed by Riley Berton

Answer 6 · 2016-07-08T16:41:03.000Z

Differentiate between normal operation and unusual operation.
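
One simple way to make "unusual" concrete is to compare a new value against a baseline of past values. The sketch below assumes a roughly stable metric and uses a standard-deviation threshold; the function name, the threshold of 3 and the data are illustrative assumptions, not a recommendation from this thread.

```python
import statistics


def is_unusual(baseline, value, threshold=3.0):
    """True if `value` lies more than `threshold` sample standard
    deviations away from the mean of `baseline`."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # needs at least two baseline points
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical requests-per-second readings during normal operation.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_unusual(baseline, 104))  # 2 standard deviations away: False
print(is_unusual(baseline, 130))  # 15 standard deviations away: True
```

This approach breaks down for metrics with daily or weekly cycles, or with heavy-tailed distributions, where a fixed mean and deviation do not describe normal behavior well.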

Answer 7 · 2016-07-08T16:41:47.000Z

Be proactive - effectively predict problematic behavior.