This repository is an attempt to consolidate useful resources for Site Reliability Engineer (SRE) interview preparation.
Please take a look at the contribution guidelines first. Contributions are always welcome!
- Simple: What happens when you type in ‘www.cnn.com’ in your browser?
- Detailed: What happens when you type google.com into your browser's address box and press enter?
- Introduction to Linux – Full Course for Beginners
- What every SRE should know about GNU/Linux shell related internals: file descriptors, pipes, terminals, user sessions, process groups and daemons
- How Does Linux Boot Process Work?
- An introduction to the Linux boot and startup processes
- What happens when we turn on computer?
- From Power up to login prompt
- Understanding Inodes
- Understand UNIX / Linux Inodes Basics with Examples
- Understanding proc filesystem
- Common Mount Options
- Understanding Linux filesystems: ext4 and beyond
- Explain the basics of Linux kernel
- Kernel Space and User Space
- Linux Kernel Process Management
- Linux Addressing
- Linux Kernel Memory Management
- STACK AND HEAP
- Paging and Segmentation
- Linux Kernel System Calls
- The Virtual Filesystem
- Concurrency and Race Conditions
- Memory Leak
- What is a kernel Panic?
- Book about the Linux kernel
- Linux troubleshooting tools
- Linux Performance Analysis in 60,000 Milliseconds
- strace
- lsof
- Linux system debugging
- SaaS where users can test their Linux troubleshooting skills
- The Internet explained from first principles
- Network protocols for anyone who knows a programming language
- Introduction to Linux interfaces for virtual networking
- Multi-tier load-balancing with Linux
- Introduction to modern network load balancing and proxying
- Load Balancing Algorithms
- Introduction to Docker and Containers
- Containers Patterns
- Docker Container Anti Patterns
- Anti-Patterns When Building Container Images
- Deploying and Scaling Microservices with Docker and Kubernetes
- Demystifying the Kubernetes Iceberg
- What happens when ... Kubernetes edition!
- Kubernetes Production Patterns
- Kubernetes production best practices
- A Guide to the Kubernetes Networking Model
- 47 Things To Become a Kubernetes Expert
- Kubernetes Best Practices 101
- 15 Kubernetes Best Practices Every Developer Should Know
- THE KUBERNETES NETWORKING GUIDE
- The life of a DNS query in Kubernetes
- Terraform
- A Comprehensive Guide to Terraform
- Ansible
- Getting Started With Terraform on AWS
- Google Cloud: Best practices for using Terraform
- Things You Should Know About Databases
- 7 Database Paradigms
- CAP theorem
- Evolutionary Database Design
- ACID vs BASE in Databases
- Understanding Database Sharding
- Database Replication
- SQL vs. NoSQL Database: When to Use, How to Choose
- How do database indexes work?
- Redis Explained
- Database Sharding Explained
- 7 Pipeline Design Patterns for Continuous Delivery
- CI/CD patterns
- Six Strategies for Application Deployment
- A tour of Go
- Go by Example
- Go Tutorials & Examples
- Learn Go with Tests
- Getting up and running with Go
- Effective Go
- Go Design Patterns
- Go Memory Management
- Style Guide
- Style Decisions
- Best Practices
- 50 Shades of Go: Traps, Gotchas, and Common Mistakes for New Golang Devs
- AlgoExpert
- Hacking a Google Interview – Handout 1
- Hacking a Google Interview – Handout 2
- Hacking a Google Interview – Handout 3
- SystemsExpert course from AlgoExpert
- System Design 101
- Grokking the System Design Interview
- The System Design Primer
- Crack the System Design Interview
- System design interview for IT companies
- Web Architecture 101
- What's in a Production Web Application?
- Distributed systems
- Failover
- Monoliths, Service Architecture, and Microservices
- Scale From Zero To Millions Of Users
- SLOs & You: A Guide To Service Level Objectives
- Setting up Service Monitoring — The Why’s and What’s
- How NOT to Measure Latency
- The four Golden Signals of Kubernetes monitoring
- Introduction to Prometheus
- Prometheus Relabeling Training
- Avoid These 6 Mistakes When Getting Started With Prometheus
- A Deep Dive Into the Four Types of Prometheus Metrics
- How Prometheus Querying Works
- PromQL Cheat Sheet
- The practical guide to incident management
- Incident Response
- Postmortems
- Runbooks
- Identifying and tracking toil using SRE principles
- Building SRE from Scratch
- SRE at Google: Our complete list of CRE life lessons
- Incident Management vs. Incident Response - What's the Difference?
- Practical Guide to SRE: Using SLOs to Increase Reliability
- Practical Guide to SRE: Automating On-Call
- Going from Zero to SRE
- An Incident Command Training Handbook
- Howie guide to post‑incident investigations
- Rundown of LinkedIn’s SRE practices
- Rundown of Uber’s SRE practice
- SRE in the Real World
- SRE Engagement Models
- SRE Checklist
- Why bother with SLI and SLO?
- The System Resiliency Pyramid
- A collection of questions to practice with for SRE interviews
- SRE Interview Questions
- Sysadmin Test Questions
- Kubernetes job interview questions
- DevOps Guide
- Questions I ask in SRE interviews
- DevOps Roadmap: Learn to become a DevOps Engineer or SRE
- The Must-Know Terraform Interview Questions
- SRE Interviews in Silicon Valley
- Preparing the SRE interview
- How to Get Into SRE
- My Job Interview at Google
- Path to Site Reliability Management
- Becoming a Site Reliability Engineer
- How I get a job at Google as SRE
- Become A DevOps Engineer in 2023: [Detailed Guide]
- How to Get an SRE Role
- Site Reliability Engineering
- The Site Reliability Workbook
- Seeking SRE
- Building Secure and Reliable Systems
- Implementing Service Level Objectives