HeinrichHartmann/Statistics-for-Engineers

Questionnaire: What makes a great SRE?

Closed this issue · 7 comments

Becoming a great SRE is a challenging task. Apart from rich knowledge of hardware, operating systems and software engineering, it requires analyzing a fair amount of data generated by users and machines. This document gathers examples of such applications.

Please contribute, and vote for things you find important.

Guidelines:

  • Each comment should describe one single example
  • Each example should be as concrete as possible
  • Examples should be abstract: do not mention specific technologies (mysql) or tools (ganglia)
  • Each example should complete the sentence: "A great SRE should be able to..."

Calculate the uptime of a node or a service over a given period of time
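
One way to make this concrete, as a minimal sketch: assume outages are recorded as (start, end) timestamp pairs, and define uptime as the fraction of the observation window not covered by any outage. The function name `uptime_fraction` and the data layout are assumptions for illustration, not something the thread prescribes.

```python
from datetime import datetime, timedelta


def uptime_fraction(outages, window_start, window_end):
    """Return the fraction of [window_start, window_end) not covered by outages.

    `outages` is a list of (start, end) datetime pairs; they may overlap
    and may extend beyond the window.
    """
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be after window_start")

    # Clip each outage to the window and sort by start time.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
        if end > window_start and start < window_end
    )

    # Merge overlapping outages so that downtime is not counted twice.
    downtime = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                downtime += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        downtime += (cur_end - cur_start).total_seconds()

    return 1.0 - downtime / window


t0 = datetime(2024, 1, 1)
outages = [
    (t0 + timedelta(hours=1), t0 + timedelta(hours=2)),
    (t0 + timedelta(hours=1, minutes=30), t0 + timedelta(hours=3)),  # overlaps the first
]
print(uptime_fraction(outages, t0, t0 + timedelta(hours=100)))
```

The two overlapping outages merge into a single two-hour outage, so the result is about 0.98 for the 100-hour window. Whether planned maintenance counts as downtime is a policy decision that this sketch leaves open.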

Understand the data shown in graphs produced by monitoring tools:

  • Display period vs. aggregation period, spike erosion
  • Linear and Log scale
  • Best practices for creating graphs: Caption, Legend, Units, ...
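
Spike erosion can be illustrated with a small sketch: when a graph covers a long display period, each plotted point aggregates many raw samples, and a mean aggregation flattens short spikes. The helper `aggregate` and the sample data are hypothetical, chosen only to show the effect.

```python
def aggregate(samples, period, reducer):
    """Reduce consecutive chunks of `period` samples to one value each."""
    return [reducer(samples[i:i + period]) for i in range(0, len(samples), period)]


def mean(values):
    return sum(values) / len(values)


# Sixty one-second samples at 10 ms, with a single one-second spike of 1000 ms.
samples = [10] * 60
samples[30] = 1000

avg_view = aggregate(samples, 60, mean)  # one point per minute, mean aggregation
max_view = aggregate(samples, 60, max)   # one point per minute, max aggregation

print(avg_view)  # [26.5]
print(max_view)  # [1000]
```

The mean view shows 26.5 ms where the raw data peaked at 1000 ms. Plotting the per-period max alongside the mean is one way to keep such spikes visible.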

Monitor and calculate percentiles of API latencies over:

  • different time periods
  • different levels of aggregation (node, service, endpoint)
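
As a rough sketch of why the level of aggregation matters: percentiles cannot in general be averaged across nodes, so a service-level percentile should be computed from the merged samples (or from mergeable summaries such as histograms). The nearest-rank definition used here is one of several common percentile definitions, and the node data is made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: a busy, fast node and a quiet, slow node.
node_a = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
node_b = [50, 400]

p90_a = percentile(node_a, 90)                 # 13
p90_b = percentile(node_b, 90)                 # 400
averaged = (p90_a + p90_b) / 2                 # 206.5, not a real percentile
service_p90 = percentile(node_a + node_b, 90)  # 50

print(p90_a, p90_b, averaged, service_p90)
```

Averaging the per-node values gives 206.5 ms, while the 90th percentile over all requests is 50 ms. The same caveat applies when combining percentiles across time periods.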

Select appropriate metrics to monitor a new service for availability and service quality

Determine the demarcation line between hardware and software issues

That is, correlate information across different graphs to pinpoint the issue quickly.
For example, if load is high but transactions per second are not increasing, the NIC may be saturated, bandwidth on the provider side may be exhausted, or the database cache hit ratio may be the cause.
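
A numerical version of "correlating graphs" might look like the following sketch, which computes the Pearson correlation between two metric series over a window. The function name, the series and the interpretation are illustrative assumptions; a weak correlation only hints at saturation and does not identify the bottleneck.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series.

    Returns None when either series is constant, because the
    coefficient is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Hypothetical window 1: throughput follows load.
print(pearson([1, 2, 3, 4], [100, 200, 300, 400]))

# Hypothetical window 2: load keeps rising, throughput has plateaued.
print(pearson([4, 5, 6, 7], [300, 301, 299, 300]))
```

In the first window the coefficient is close to 1. In the second it is weak and slightly negative, which matches the situation where load rises but transactions per second do not.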

Contributed by Riley Berton

Differentiate between normal operation and unusual operation.
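
One simple way to approach this, offered as a sketch and not as the method the thread has in mind, is a robust outlier score based on the median and the median absolute deviation (MAD) of recent history. The threshold of 3.5 and the scaling constant 0.6745 are common conventions for this score; the function name and the data are hypothetical.

```python
import statistics


def is_unusual(history, value, threshold=3.5):
    """Flag `value` if its modified z-score against `history` exceeds `threshold`.

    The score uses the median and MAD, which are less sensitive to past
    outliers in `history` than the mean and standard deviation.
    """
    median = statistics.median(history)
    mad = statistics.median([abs(x - median) for x in history])
    if mad == 0:
        # History has no spread around the median: anything different is unusual.
        return value != median
    score = 0.6745 * (value - median) / mad
    return abs(score) > threshold


# Hypothetical requests-per-second history.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_unusual(history, 101))  # False
print(is_unusual(history, 150))  # True
```

This ignores seasonality and trends, so for metrics with daily or weekly patterns the history would need to be taken from comparable time slots.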

Be proactive - effectively predict problematic behavior.
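
A basic example of prediction, sketched under the assumption that a resource grows roughly linearly: fit a least-squares line to past usage and extrapolate to the point where it crosses a limit. The function names and the disk usage numbers are made up for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_to_threshold(xs, ys, threshold):
    """Extrapolate the fitted line to the x value where it reaches `threshold`.

    Returns None if the fitted trend is flat or decreasing.
    """
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical disk usage in percent, sampled once per day.
days = [0, 1, 2, 3, 4]
usage = [50, 52, 54, 56, 58]
print(time_to_threshold(days, usage, 100))  # 25.0
```

With growth of 2 percentage points per day from 50%, the fitted line reaches 100% on day 25. Real usage is rarely this regular, so such a forecast is best treated as a rough early warning.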