
An Opinionated Roadmap to Become an SRE (Concepts > Tools)

Distributed systems

  • Concepts
    • Fallacies of distributed computing
    • Synchronous vs. asynchronous
    • Event log vs. message queue
    • Exactly-once delivery
    • Different types of message failure
    • Orchestration vs. choreography
    • Causality
    • CDN
    • Hashing
      • Consistent hashing
      • Geohashing
      • Perfect hashing
    • Read-heavy vs. write-heavy impacts
    • Federation
    • Latency
      • Latency, throughput, goodput
      • Latency numbers every programmer should know
      • How to prevent latency variability
      • Tail latency
    • How to reduce sharing
    • Idempotency
    • Load balancer
      • Concepts
      • Layer 4 vs. layer 7 load balancer
    • Liveness vs. safety properties
    • Microservices: pros and cons
    • REST
    • gRPC
    • Service mesh
    • Source of truth
    • Stateful vs. stateless
    • Total vs. partial order
    • Why can't we rely on the system clock in distributed systems
    • Vector clock
  • Cache
    • When to use a cache
    • Cache-aside vs. read-through
    • Eviction policy
    • Refresh-ahead
    • Write-through vs. write-back
    • Distributed cache
    • Performance cache vs. capacity cache
  • Databases
    • Different types of databases
      • NoSQL vs. SQL databases
      • Relational vs. document
      • Column-oriented databases
      • Graph databases
      • Vector databases
      • Object-based storage
    • ACID
    • Partitioning
      • Criteria
      • Methods
      • Replication vs. partition
    • Hotspot
    • CALM theorem
    • CAP theorem
    • PACELC theorem
    • Cardinality
    • Chain replication
    • Consensus
    • Concurrency control
    • Consistency models
    • Isolation levels
    • Serializability
    • Linearizability
    • CRDT
    • Indexes
      • Tradeoff
      • Primary vs. secondary indexes
    • Denormalization
    • View & materialized view
    • Transaction
    • Downsides of distributed transactions
    • Strategies to handle rebalancing
    • Leader election
    • MVCC
    • N+1 select problem
    • Quorum
    • Raft
    • Read repair
    • Single-leader, multi-leader, leaderless replication
    • Split-brain
    • 2PC
    • 3PC
    • WAL
    • Write and read amplification
  • Data structures
    • Probabilistic data structures
      • Bloom filter
      • Count-min sketch
      • HyperLogLog
    • Storage
      • LSM tree
      • B-tree
      • SSTable
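As an illustration of the probabilistic data structures listed above, here is a minimal Bloom filter sketch in Python. LSM-tree storage engines commonly keep a filter like this per SSTable so that a read can skip files that definitely do not hold a key. The class name, method names, sizes, and the salted SHA-256 hashing scheme are assumptions made for this example, not a reference implementation.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal Bloom filter: k hash positions over an m-slot array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per logical bit keeps the sketch readable (not space-optimal).
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))
```

An added key always answers True; a key that was never added usually answers False but can occasionally answer True (a false positive). That asymmetry is the trade-off to remember.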


Reliability

  • Concepts
    • Difference between availability, resiliency, robustness, fault-tolerance, and reliability
    • Why is it wrong to target 100% availability
    • Blast radius
    • Failure domain
    • Cascading failures
    • Hard vs. soft dependencies
    • Scalability
      • Concepts
      • Knee point
      • Ceiling
    • Number one source of outages
    • Tail tolerance
    • Toil
  • Patterns/Anti-patterns
    • Bulkhead pattern
    • Circuit breaker
    • Exponential backoff
    • Jitter
    • Graceful degradation
    • Load shedding
    • Retry amplification
    • Backpressure
    • Rate limiting
    • Request hedging
  • Practices
    • Chaos engineering
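To make the exponential backoff and jitter entries above concrete, here is a small Python sketch of a retry loop with capped exponential backoff and "full jitter" (a uniformly random delay between zero and the current backoff ceiling). The function names, default values, and the injectable `sleep` and `rng` parameters are assumptions chosen for this example.

```python
import random
import time


def backoff_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(op, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep, rng=None):
    """Call op() until it succeeds or max_attempts is reached."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            sleep(backoff_delay(attempt, base, cap, rng))
```

The randomness spreads out clients that failed at the same moment, and the hard limit on attempts keeps the retries themselves from amplifying load on a struggling dependency.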


Observability

  • Concepts
    • What's the difference between monitoring and observability
    • Trace vs. metric vs. log
    • Golden signals
    • Observer effect
    • Percentile
    • Streetlight anti-method
    • Time-series based monitoring lies
    • USE method
    • Main metrics for cache
    • Why should we be careful about average performance metrics
  • Alerting
    • Alerting strategy
    • Alert fatigue
    • Characteristics of a good alert
    • Slow vs. fast burn alert
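The percentile and "average performance metrics" entries above are easier to grasp with numbers. The sketch below uses a made-up latency sample (98 fast requests, 2 very slow ones) and a nearest-rank percentile helper. Both the helper and the data are assumptions for illustration only.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 98 requests at 10 ms and 2 requests at 2000 ms.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)  # about 49.8 ms
p50_ms = percentile(latencies_ms, 50)    # 10 ms
p99_ms = percentile(latencies_ms, 99)    # 2000 ms
```

The mean describes no request that actually happened, while the median hides the slow tail entirely. Only the high percentile shows what the unluckiest users experienced.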


Rollout

  • Concepts
    • Bake time
    • Feature flag
    • Feature freeze
    • Rollout supervision
  • Rollout types
    • Blue-green rollout
    • Canary rollout
    • Progressive rollout
    • Shadow rollout
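One possible way to combine the feature flag and progressive rollout ideas above is stable percentage bucketing, sketched here in Python. The function name, the `flag:user_id` salt format, and the 100-bucket granularity are assumptions for this example. Real feature-flag systems differ in their details.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if the user falls inside the first `percent` of 100 stable buckets."""
    # Hashing flag and user together gives each flag its own independent bucketing.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because a user's bucket never changes, raising `percent` from 10 to 50 only adds users. Anyone already exposed stays exposed, which keeps behavior consistent during the bake time between steps.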


Service level objectives

  • Concepts
    • SLI vs. SLO vs. SLA
    • Error budget
  • SLO
    • Difference between KPIs and SLOs
    • Benefits of having alerts based on SLOs
    • Why is exceeding an SLO not necessarily a good thing
    • SLO for data (freshness, completeness, consistency, etc.)
    • SLO for mobile clients
    • SLO for services
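The error budget entry above, and burn-rate alerting more generally, come down to a few lines of arithmetic. The Python sketch below assumes an availability-style SLO expressed as a ratio (for example 0.999) over a 30-day window. The function names and defaults are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """Budget consumption speed; 1.0 spends exactly the whole budget over the window."""
    return error_ratio / (1 - slo)


def budget_consumed(rate: float, elapsed_hours: float, window_days: int = 30) -> float:
    """Fraction of the error budget spent after burning at `rate` for `elapsed_hours`."""
    return rate * elapsed_hours / (window_days * 24)
```

For example, a 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget. An error ratio of 1.44% against that SLO is a burn rate of about 14.4, and one hour at that rate spends about 2% of the monthly budget. That kind of threshold is what a fast-burn alert is meant to catch, while a slow-burn alert watches a lower rate over a longer window.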


Containers

  • Container
  • Container orchestration


Operating systems

  • Scripting
  • Filesystem
  • Memory
  • Processes
  • Resource utilization
  • Network


Network

  • ARP
  • Bandwidth
  • BGP
  • CoDel
  • CORS
  • DNS
  • Ping vs. heartbeat
  • TCP
    • TCP vs. UDP
    • Congestion control
    • Connection backlog
    • Flow control
    • Handshake
  • HTTP
  • HTTP/2
  • Head-of-line blocking
  • Health checks: passive vs. active
  • Internet model
  • NTP
  • OSI model
  • Routers
  • Switch
  • Network topologies
  • What happens if you type google.com in your browser


Security

  • Authentication
  • Certificate
  • Certificate authority
  • Cipher
  • Confidentiality
  • Encryption
  • TLS
  • PKI
  • Signature


Troubleshooting

  • Core analysis loop
  • Correlation vs. causation
  • First principles
  • Five whys technique
  • Incident management
    • How to address an incident (assess, mitigate, resolve)
    • Incident roles
    • How to write a postmortem
    • 3C principles (Coordinate, Communicate, maintain Control)


Miscellaneous

  • SRE role
  • Version control

Soft skills

  • Communication
    • Writing
    • Oral
    • Presentation
  • Collaboration
  • Problem solving
  • Curiosity
  • Navigating ambiguity
  • Staying humble