Copyright 2014, 2016 Kyle Kingsbury
This outline accompanies a 12-16 hour overview class on distributed systems fundamentals. The course aims to introduce software engineers to the practical basics of distributed systems, through lecture and discussion. Participants will gain an intuitive understanding of key distributed systems terms, an overview of the algorithmic landscape, and explore production concerns.
Lamport, 1987:
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
- First glance: *nix boxen in our colo, running processes communicating via
TCP or UDP.
- Or boxes in EC2, rackspace, etc
- Maybe communicating over infiniband
- Separated by inches and a LAN
- Or by kilometers and the internet
- Most mobile apps are also taking part in a distributed system
- Communicating over a truly awful network
- Same goes for desktop web browsers
- It's not just servers--it's clients too!
- More generally: distributed systems are
- Made up of parts which interact
- Slowly
- And often unreliably
- Whatever those mean for you
- So also:
- Redundant CPUs in an airplane
- ATMs and Point-of-Sale terminals
- Space probes
- Paying bills
- Doctors making referrals
- Drunk friends trying to make plans via text message
- Every business meeting ever
- We call each part of a distributed system a node
- Also known as processes, agents, or actors
- Characteristic latency
- Operations inside a node are "fast"
- Operations between nodes are "slow"
- What's fast or slow depends on what the system does
- Nodes are coherent
- You know when problems occur
- Things effectively happen in order
- Typically modeled as some kind of single-threaded state machine
- A node could itself be a distributed system
- But so long as that system as a whole provides "fast, coherent" operations, we can treat it as a single node.
- Formal models for processes
- Pi calculus
- Ambient calculus
- Communicating Sequential Processes
- Formal models for node failure
- Crash-stop
- Crash-recover
- Crash-amnesia
- Byzantine
- Nodes interact via a network
- Humans interact via spoken words
- Particles interact via fields
- Computers interact via IP, or UDP, or SCTP, or ...
- We model those interactions as discrete messages sent between nodes
- Messages take time to propagate
- This is the "slow" part of the distributed system
- We call this "latency"
- We can represent the interaction of nodes and the network as a diagram
- Time flows up
- Nodes are vertical lines separated by space
- Messages as slanted paths connecting nodes
- Execute in lockstep
- Fixed bounds on transmission delay
- Accurate global clock
- Easy to prove stuff about
- You probably don't have one
- Execute independently, whenever
- Unbounded message delays
- No global clocks
- IP networks are definitely asynchronous
- But in practice the really pathological stuff doesn't happen
- Most networks recover in seconds to weeks, not "never"
- Conversely, human timescales are on the orders of seconds to weeks
- So we can't pretend the problems don't exist
- Asynchronous networks are allowed to
- Duplicate
- Delay
- Drop
- Reorder
- Drops and delays are indistinguishible
- Byzantine networks are allowed to mess with messages arbitrarily
- Including rewriting their content
- They mostly don't happen in real networks
- Mostly
- TCP works. Use it.
- Not perfect; you can go faster
- But you'll know when this is the case
- In practice, TCP prevents duplicates and reorders in the context of a single
TCP conn
- But you're probably gonna open more than one connection
- If for no other reason than TCP conns eventually fail
- And when that happens, you'll either a.) have missed messages or b.) retry
- You can reconstruct an ordering by encoding your own sequence numbers on top of TCP
- Same addressing rules as TCP, but no stream invariants
- Lots of people want UDP "for speed"
- Don't consider that routers and nodes can and will arbitrarily drop packets
- Don't consider that their packets will be duplicated
- And reordered
- "But at least it's unbiased right?"
- WRONG!
- This causes all kinds of havoc in, say, metrics collection
- And debugging it is hard
- TCP gives you flow control and repacks logical messages into packets
- You'll need to re-build flow-control and backpressure
- TLS over UDP is a thing, but tough
- UDP is really useful where TCP FSM overhead is prohibitive
- Memory pressure
- Lots of short-lived conns and socket reuse
- Especially useful where best-effort delivery maps well to the system goals
- Voice calls: people will apologize and repeat themselves
- Games: stutters and lag, but catch up later
- Higher-level protocols impose sanity on underlying chaos
- When a system is split into independent parts, we still want some kind of order for events
- Clocks help us order things: first this, THEN that
- In theory, the operating system clock gives you a partial order on system events
- Caveat: NTP is probably not as good as you think
- Caveat: Definitely not well-synced between nodes
- Caveat: Hardware can drift
- Caveat: By centuries
- Caveat: POSIX time is not monotonic by definition
- Caveat: The timescales you want to measure may not be attainable
- Caveat: Threads can sleep
- Caveat: Runtimes can sleep
- Caveat: OS's can sleep
- Caveat: "Hardware" can sleep
- Just don't.
- Lamport 1977: "Time, Clocks, and the Ordering of Events in a Distributed System"
- One clock per process
- Increments monotonically with each state transition:
t' = t + 1
- Included with every message sent
t' = max(t, t_msg + 1)
- If we have a total ordering of processes, we can impose a total order on
events
- But that order could be pretty unintuitive
- Generalizes Lamport clocks to a vector of all process clocks
t_i' = max(t_i, t_msg_i)
- For every operation, increment that process' clock in the vector
- Provides a partial order
- Specifically, given a pair of events, we can determine causal relationships
- Independent
- A in causal past of B
- B in causal past of A
- Specifically, given a pair of events, we can determine causal relationships
- Pragmatically: the past is shared; the present is independent
- Only "present", independent states need to be preserved
- Ancestor states can be discarded
- Lets us garbage-collect the past
- O(processes) in space
- Requires coordination for GC
- Or sacrifice correctness and prune old vclock entries
- Variants
- Dotted Version Vectors - for client/server systems, orders more events
- Interval Tree Clocks - for when processes come and go
- Much better than NTP
- Globally distributed total orders on the scale of milliseconds
- Can do one possibly conflicting thing per uncertainty window
- Only people with this right now are Google
- Spanner: globally distributed strongly consistent transactions
- And they're not sharing
- More expensive than you'd like
- Several hundred per GPS receiver
- Atomic clocks for local corroboration: $$$$?
- Need multiple types of GPS: vendors can get it wrong
- I don't know who's doing it yet, but I'd bet datacenters in the future will offer dedicated HW interfaces for bounded-accuracy time.
We've covered the fundamental primitives of distributed systems. Nodes exchange messages through a network, and both nodes and networks can fail in various ways. Protocols like TCP and UDP give us primitive channels for processes to communicate, and we can order events using clocks. Now, we'll discuss some high-level properties of distributed systems.
- Availability is basically the fraction of attempted operations which succeed.
- Naive: every operation succeeds
- In consistency lit: every operation on a non-failing node succeeds
- Nothing you can do about the failing nodes
- Every operation against a non-failing node succeeds
- With the constraint that clients always talk to the same nodes
- Better than if the system weren't distributed.
- e.g. tolerant of up to f failures, but no more
- Maybe some operations fail
- Operations succeed if they occur on a node which can communicate with a majority of the cluster
- Operations against minority components may fail
- We talk a lot about "uptime"
- Are systems up if nobody uses them?
- Is it worse to be down during peak hours?
- Can measure "fraction of requests satisfied during a time window"
- Then plot that fraction over windows at different times
- Timescale affects reported uptime
- Apdex
- Not all successes are equal
- Classify operations into "OK", "meh", and "awful"
- Apdex = P(OK) + P(meh)/2
- Again, can report on a yearly basis
- "We achieved 99.999 apdex for the year"
- And on finer timescales!
- "Apdex for the user service just dropped to 0.5; page ops!"
- Ideally: integral of happiness delivered by your service?
- A consistency model is the set of "safe" histories of events in the system
- Once I read a value, any subsequent read will return that state or later values
- If I make a write, any subsequent writes I make will take place after the first write
- Once I write a value, any subsequent read I perform will return that write (or later values)
- Once I read a value, any subsequent write will take place after that read
- All operations (transactions) appear to execute atomically
- In some order
- No constraints on what that order is
- Perfectly okay to read from the past, for instance
- Suppose operations can be linked by a DAG of causal relationships
- A write that follows a read, for instance, is causally related
- Assuming the process didn't just throw away the read data
- Operations not linked in that DAG are concurrent
- A write that follows a read, for instance, is causally related
- Constraint: before a process can execute an operation, all its precursors must have executed on that node
- Concurrent ops can be freely reordered
- Like causal consistency, constrains possible orders
- All operations appear to execute atomically
- Every process agrees on the order of operations
- Operations from a given process always occur in order
- But nodes can lag behind
- All operations appear to execute atomically
- Every process agrees on the order of operations
- Every operation appears to take place between its invocation and completion times
- Real-time, external constraints let us build very strong systems
- ANSI SQL's ACID isolation levels are weird
- Basically codified the effects of existing vendor implementations
- Definitions in the spec are ambiguous
- Adya 1999: Weak Consistency: A Generalized Theory and Optimistic
Implementations for Distributed Transactions
- Each ANSI SQL isolation level prohibits a weird phenomenon
- Read Uncommitted
- Prevents P0: dirty writes
- w1(x) ... w2(x)
- Can't write over another transaction's data until it commits
- Can read data while a transaction is still modifying it
- Can read data that will be rolled back
- Prevents P0: dirty writes
- Read Committed
- Prevents P1: dirty reads
- w1(x) ... r2(x)
- Can't read a transaction's uncommitted values
- Prevents P1: dirty reads
- Repeatable Read
- Prevents P2: fuzzy reads
- r1(x) ... w2(x)
- Once a transaction reads a value, it won't change until the transaction commits
- Prevents P2: fuzzy reads
- Serializable
- Prevents P3: phantoms
- Given some predicate P
- r1(P) ... w2(y in P)
- Once a transaction reads a set of elements satisfying a query, that set won't change until the transaction commits
- Not just values, but which values even would have participated.
- Prevents P3: phantoms
- Cursor Stability
- Transactions have a set of cursors
- A cursor refers to an object being accessed by the transaction
- Read locks are held until cursor is removed, or commit
- At commit time, cursor is upgraded to a writelock
- Prevents lost-update
- Transactions have a set of cursors
- Snapshot Isolation
- Transactions always read from a snapshot of committed data, taken before the transaction begins
- Commit can only occur if no other committed transaction with an
overlapping [start..commit] interval has written to any of the objects
we wrote
- First-committer-wins
- Ideally, we want total availability and linearizability
- Consistency requires coordination
- If every order is allowed, we don't need to do any work!
- If we want to disallow some orders of events, we have to exchange messages
- Coordinating comes (generally) with costs
- More consistency is slower
- More consistency is more intuitive
- More consistency is less available
- CAP Theorem: Linearizability OR total availability
- But wait, there's more!
- Bailis 2014: Highly Available Transactions: Virtues and Limitations
- Other theorems disallow totally or sticky available...
- Strong serializable
- Serializable
- Repeatable Read
- Cursor Stability
- Snapshot Isolation
- You can have sticky available...
- Causal
- PRAM
- Read Your Writes
- You can have totally available...
- Read Uncommitted
- Read Committed
- Monotonic Atomic View
- Writes Follow Reads
- Monotonic Reads
- Monotonic Writes
- Fox & Brewer, 1999: Harvest, Yield, and Scalable Tolerant Systems
- Yield: probability of completing a request
- Harvest: fraction of data reflected in the response
- Examples
- Node faults in a search engine can cause some results to go missing
- Updates may be reflected on some nodes but not others
- Consider an AP system split by a partition
- You can write data that some people can't read
- Streaming video degrades to preserve low latency
- This is not an excuse to violate your safety invariants
- Just helps you quantify how much you can exceed safety invariants
- e.g. "99% of the time, you can read 90% of your prior writes"
- Strongly dependent on workload, HW, topology, etc
- Can tune harvest vs yield on a per-request basis
- "As much as possible in 10ms, please"
- "I need everything, and I understand you might not be able to answer"
- So, you've got a spectrum of choices!
- Chances are different parts of your infrastructure have different needs
- Pick the weakest model that meets your constraints
- But consider probabilistic bounds; visibility lag might be prohibitive
- See Probabilistically Bounded Staleness in Dynamo Quorums
- Not all data is equal
- Big data is usually less important
- Small data is usually critical
- Linearizable user ops, causally consistent social feeds
Availability is a measure of how often operations succeed. Consistency models are the rules that govern what operations can happen and when. Stronger consistency models generally come at the cost of performance and availability. Next, we'll talk about different ways to build systems, from weak to strong consistency.
- Consistency As Logical Monotonicity
- If you can prove a system is logically monotonic, it is coordination free
- What the heck is "coordination"
- For that matter, what's "monotonic"?
- Montonicity, informally, is retraction-free
- Deductions from partial information are never invalidated by new information
- Both relational algebra and Datalog without negation are monotone
- Ameloot, et al, 2011: Relational transducers for declarative networking
- Theorem which shows coordination-free networks of processes unaware of the
network extent can only compute only monotone queries in Datalog
- This is not an easy read
- "Coordination-free" doesn't mean no communication
- Algo succeeds even in face of arbitrary horizontal partitions
- Theorem which shows coordination-free networks of processes unaware of the
network extent can only compute only monotone queries in Datalog
- In very loose practical terms
- Try to phrase your problem such that you only add new facts to the system
- When you compute a new fact based on what's currently known, can you ensure that fact will never be retracted?
- Consider special "sealing facts" that mark a block of facts as complete
- These "grow-only" algorithms are usually easier to implement
- Likely tradeoff: incomplete reads
- Bloom language
- Unordered programming with flow analysis
- Can tell you where coordination would be required
- Message broadcast system
- Useful for cluster management, service discovery, CDNs, etc
- Very weak consistency
- Very high availability
- Global broadcast
- Send a message to every other node
- O(nodes)
- Mesh networks
- Epidemic models
- Relay to your neighbors
- Propagation times on the order of max-free-path
- Spanning trees
- Instead of a mesh, use a tree
- Hop up to a connector node which relays to other connector nodes
- Reduces superfluous messages
- Reduces latency
- Plumtree (Leit ̃ao, Pereira, & Rodrigues, 2007: Epidemic Broadcast Trees)
- Order-free datatypes that converge
- Counters, sets, maps, etc
- Tolerate dupes, delays, and reorders
- Unlike sequentially consistent systems, no "single source of truth"
- But unlike naive eventually consistent systems, never lose information
- Unless you explicitly make them lose information
- Works well in highly-available systems
- Web/mobile clients
- Dynamo
- Gossip
- INRIA: Shapiro, Preguiça, Baquero, Zawirski, 2011: "A comprehensive study of
Convergent and Commutative Replicated Data Types"
- Composed of a data type X and a merge function m, which is:
- Associative: m(x1, m(x2, x3)) = m(m(x1, x2), x3)
- Commutative: m(x1, x2) = m(x2, x1)
- Idempotent: m(x1, x1) = m(x1)
- Composed of a data type X and a merge function m, which is:
- Easy to build. Easy to reason about. Gets rid of all kinds of headaches.
- Did communication fail? Just retry! It'll converge!
- Did messages arrive out of order? It's fine!
- How do I synchronize two replicas? Just merge!
- Downsides
- Some algorithms need order and can't be expressed with CRDTs
- Reads may be arbitrarily stale
- Higher space costs
- Bailis, Davidson, Fekete, et al, 2013: "Highly Available Transactions,
Virtues and Limitations"
- Guaranteed responses from any replica
- Low latency (1-3 orders of magnitude faster than serializable protocols!)
- Read Committed
- Monotonic Atomic View
- Excellent for commutative/monotonic systems
- Foreign key constraints for multi-item updates
- Limited uniqueness constraints
- Can ensure convergence given arbitrary finite delay ("eventual consistency")
- Good candidates for geographically distributed systems
- Probably best in concert with stronger transactional systems
- See also: COPS, Swift, Eiger, Calvin, etc
-
The consensus problem:
- Three process types
- Proposers: propose values
- Acceptors: choose a value
- Learners: read the chosen value
- Classes of acceptors
- N acceptors total
- F acceptors allowed to fail
- M malicious acceptors
- Three invariants:
- Nontriviality: Only values proposed can be learned
- Safety: At most one value can be learned
- Liveness: If a proposer p, a learner l, and a set of N-F acceptors are non-faulty and can communicate with each other, and if p proposes a value, l will eventually learn a value.
- Three process types
-
Whole classes of systems are equivalent to the consensus problem
- So any proofs we have here apply to those systems too
- Lock services
- Ordered logs
- Replicated state machines
-
FLP tells us consensus is impossible in asynchronous networks
- Kill a process at the right time and you can break any consensus algo
- True but not as bad as you might think
- Realistically, networks work often enough to reach consensus
- Moreover, FLP assumes deterministic processes
- Real computers aren't deterministic
- Ben-Or 1983: "Another Advantage of free choice"
- Nondeterministic algorithms can achieve consensus
-
Lamport 2002: tight bounds for asynchronous consensus
- With at least two proposers, or one malicious proposer, N > 2F + M
- "Need a majority"
- With at least 2 proposers, or one malicious proposer, it takes at least 2 message delays to learn a proposal.
- With at least two proposers, or one malicious proposer, N > 2F + M
-
This is a pragmatically achievable bound
- In stable clusters, you can get away with only a single round-trip to a majority of nodes.
- More during cluster transitions.
- Paxos is the Gold Standard of consensus algorithms
- Lamport 1970 - The Part Time Parliament
- Written as a description of an imaginary Greek democracy
- Lamport 2001 - Paxos Made Simple
- "The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers [5]. In fact, it is among the simplest and most obvious of distributed algorithms... The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system—an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems [4]."
- Google 2007 - Paxos Made Live
- Notes from productionizing Chubby, Google's lock service
- Van Renesse 2011 - Paxos Made Moderately Complex
- Turns out you gotta optimize
- Also pseudocode would help
- A page of pseudocode -> several thousand lines of C++
- Lamport 1970 - The Part Time Parliament
- Provides consensus on independent proposals
- Typically deployed in majority quorums, 5 or 7 nodes
- Several optimizations
- Multi-Paxos
- Fast Paxos
- Generalized Paxos
- It's not always clear which of these optimizations to use, and which can be safely combined
- Each implementation uses a slightly different flavor
- Used in a variety of production systems
- Chubby
- Cassandra
- Riak
- FoundationDB
- WANdisco SVN servers
- We're not even sure Chubby is Paxos, as described in the papers
- Paxos is really more of a family of algorithms than a well-described single entity
- Though we're pretty darn confident in the various proofs at this point
- ZAB is the Zookeeper Atomic Broadcast protocol
- Junqueira, Reed, and Serafini 2011 - Zab: High-performance broadcast for primary-backup systems
- Differs from Paxos
- Provides sequential consistency (linearizable writes, lagging ordered reads)
- Useful because ZK clients typically want fast local reads
- But there's also a SYNC command that guarantees real-time visibility
- (SYNC + op) allows linearizable reads as well
- Again, majority quorum, 5 or 7 nodes
- Presented as a replication protocol, but also a consensus algorithm
- Transaction processing plus a view change algorithm
- Majority-known values are guaranteed to survive into the future
- I'm not aware of any production systems, but I'm sure they're out there
- Along with Paxos, inspired Raft in some ways
- Ongaro & Ousterhout 2014 - In Search of an Understandable Consensus Algorithm
- Lamport says it's easy, but we still have trouble grokking Paxos
- What if there were a consensus algorithm we could actually understand?
- Paxos approaches independent decisions when what we want is state machines
- Maintains a replicated log of state machine transitions instead
- Also builds in cluster membership transitions, which is key for real systems
- Very new, but we have a TLA+ proof of the core algorithm
- Can be used to write arbitrary sequential or linearizable state machines
- RethinkDB
- etcd
- Consul
Systems which only add facts, not retract them, require less coordination to build. We can use gossip systems to broadcast messages to other processes, CRDTs to merge updates from our peers, and HATs for weakly isolated transactions. Serializability and linearizability require consensus, which we can obtain through Paxos, ZAB, VR, or Raft. Now, we'll talk about different scales of distributed systems.
- Different kinds of systems have different definitions of "slow"
- Different goals
- Different algorithms
- Multicore (and especially NUMA) architectures are sort of like a distributed system
- Nodes don't fail pathologically, but message exchange is slow!
- Synchronous network provided by a bus (e.g. Intel QPI)
- Whole complicated set of protocols in HW & microcode to make memory look sane
- Non-temporal store instructions (e.g. MOVNTI)
- They provide abstractions to hide that distribution
- MFENCE/SFENCE/LFENCE
- Introduce a serialization point against load/store instructions
- Characteristic latencies: ~100 cycles / ~30 ns
- Really depends on HW, caches, instructions, etc
- CMPXCHG Compare-and-Swap (sequentially consistent modification of memory)
- LOCK
- Lock the full memory subsystem across cores!
- MFENCE/SFENCE/LFENCE
- But those abstractions come with costs
- Hardware lock elision may help but is nascent
- Blog: Mechanical Sympathy
- Avoid coordination between cores wherever possible
- Context switches (process or thread!) can be expensive
- Processor pinning can really improve things
- When writing multithreaded programs, try to divide your work into
independent chunks
- Try to align memory barriers to work unit boundaries
- Allows the processor to cheat as much as possible within a work unit
- You'll often deploy replicated systems across something like an ethernet LAN
- Message latencies can be as low as 100 micros
- But across any sizable network (EC2), expect low millis
- Sometimes, packets could be delayed by five minutes
- Plan for this
- Network is within an order of mag compared to uncached disk seeks
- Or faster, in EC2
- EC2 disk latencies can routinely hit 20ms
- 200ms?
- 20,000 ms???
- Because EBS is actually other computers
- LMAO if you think anything in EC2 is real
- Wait, real disks do this too?
- What even are IO schedulers?
- Wait, real disks do this too?
- 20,000 ms???
- 200ms?
- EC2 disk latencies can routinely hit 20ms
- Or faster, in EC2
- But network is waaaay slower than memory/computation
- If your aim is throughput, work units should probably take longer than a millisecond
- But there are other reasons to distribute
- Sharding resources
- Isolating failures
- You deploy worldwide for two reasons
- End-user latency
- Humans can detect ~10ms lag, will tolerate ~100ms
- SF--Denver: 50ms
- SF--Tokyo: 100 ms
- SF--Madrid: 200 ms
- Only way to beat the speed of light: move the service closer
- Humans can detect ~10ms lag, will tolerate ~100ms
- Disaster recovery
- Datacenter power is good but not perfect
- Hurricanes are a thing
- Entire Amazon regions can and will fail
- Yes, regions, not AZs
- End-user latency
- Minimum of 1 round-trip for consensus
- Maybe as bad as 4 rounds
- Maybe 4 rounds all the time if you have a bad Paxos impl (e.g. Cassandra)
- So if you do Paxos between datacenters, be ready for that cost!
- Because the minimum latencies are higher than users will tolerate
- Cache cache cache
- Queue writes and relay asynchronously
- Consider reduced consistency guarantees in exchange for lower latency
- CRDTs can always give you safe local writes
- Causal consistency and HATs can be good calls here
- Maybe as bad as 4 rounds
- What about strongly consistent stuff?
- Chances are a geographically distributed service has natural planes of
cleavage
- EU users live on EU servers; US users live on US servers
- Use consensus to migrate users between datacenters
- Pin/proxy updates to home datacenter
- Which is hopefully the closest datacenter!
- But maybe not! I believe Facebook still pushes all writes through 1 DC!
- Where sequential consistency is OK, cache reads locally!
- You probably leverage caching in a single DC already
- Chances are a geographically distributed service has natural planes of
cleavage
We discussed three characteristic scales for distributed systems: multicore processors coupled with a synchronous network, computers linked by a LAN, and datacenters linked by the internet or dedicated fiber. CPU consequences are largely performance concerns: knowing how to minimize coordination. On LANs, latencies are short enough for many network hops before users take notice. In geographically replicated systems, high latencies drive eventually consistent and datacenter-pinned solutions.
- General recommendations for building distributed systems
- Hard-won experience
- Repeating what other experts tell me
- Over beers
- Hearsay
- Oversimplifications
- Cargo-culting
- Stuff I just made up
- YMMV
- Rule 1: don't distribute where you don't have to
- Is this thing small enough to fit on one node?
- "I have a big data problem"
- Softlayer will rent you a box with 3TB of ram for $5000/mo.
- Supermicro: 6TB in one box for ~$115,000
- "I have a big data problem"
- Can this service tolerate a single node's guarantees?
- Could we just stand up another one if it breaks?
- Could manual intervention take the place of the distributed algorithm?
- Is this thing small enough to fit on one node?
- If we have to distribute, can we push the work onto some other software?
- What about a distributed database or log?
- Can we pay Amazon to do this for us?
- Conversely, what are the care and feeding costs?
- How much do you have to learn to use/operate that distributed system?
- Buy really expensive hardware
- Make changes to software and hardware in a controlled fashion
- Dry-run deployments against staging environments
- Possible to build very reliable networks and machines
- At the cost of moving slower, buying more expensive HW, finding talent
- HW/nework failure still happens, but sufficiently rare => low priority
- OK fine; we're gonna go down
- What's our SLA anyway?
- Can we recover by hand?
- Can we pay someone to fix it?
- Could insurance cover the damage?
- Could we just call the customer and apologize?
- Sounds silly, but may be much cheaper
- We can never prevent 100% of system failures
- Consciously choosing to recover above the level of the system
- This is how financial companies and retailers do it!
- Backups are essentially sequential consistency, BUT you lose a window of ops.
- When done correctly
- Some backup programs don't snapshot state, which leads to FS or DB corruption
- Broken fkey relationships, missing files, etc...
- Allow you to recover in a matter of minutes to days
- But more than fault recovery, they allow you to step back in time
- Useful for recovering from logical faults
- Distributed DB did its job correctly, but you told it to delete key data
- Useful for recovering from logical faults
- When done correctly
- OK, so failure is less of an option
- Want to reduce the probability of failure
- Have the same state and same computation take place on several nodes
- I'm not a huge believer in active-spare
- Spare might have cold caches, broken disks, old versions, etc
- Spares tend to fail when becoming active
- Active-active wherever possible
- Predictability over efficiency
- Also not a huge fan of only having 2 copies
- Node failure probabilities just too high
- OK for not-important data
- I generally want three copies of data
- For important stuff, 4 or 5
- For Paxos and other majority-quorum systems, odd numbers: 3, 5, 7 common
- Common DR strategy: Paxos across 5 nodes; 3 or 4 in primary DC
- Ops can complete as soon as the local nodes ack; low latencies
- Resilient to single-node failure (though latencies will spike)
- But you still have a seqentially consistent backup in the other DC
- So in the event you lose an entire DC, all's not lost
- See Camille Fournier's talks on ZK deployment
- I'm not a huge believer in active-spare
- Redundancy improves availability so long as failures are uncorrelated
- Failures are not uncorrelated
- Disks from the same batch failing at the same time
- Same-rack nodes failing when the top-of-rack switch blows
- Same-DC nodes failing when the UPS blows
- See entire EC2 AZ failures
- Running the same bad computation on every node will break every node
- Expensive queries
- Riak list-keys
- Cassandra doomstones
- Cascading failures
- Thundering-herd
- TCP incast
- Failures are not uncorrelated
- The problem is too big
- Break the problem into parts small enough to fit on a node
- Not too small: small parts => high overhead
- Not too big: need to rebalance work units gradually from node to node
- Somewhere around 10-100 work units/node is ideal, IMO
- Ideal: work units of equal size
- Beware hotspots
- Beware changing workloads with time
- Know your bounds in advance
- How big can a single part get before overwhelming a node?
- How do we enforce that limit before it sinks a node in prod?
- Then sinks all the other nodes, one by one, as the system rebalances
- Allocating shards to nodes
- Often built in to DB
- Good candidate for ZK, Etcd, and so on
- See Boundary's Ordasity
- Data that never changes is trivial to store
- Never requires coordination
- Cheap replication and recovery
- Minimal repacking on disk
- Useful for Cassandra, Riak, any LSM-tree DB.
- Or for logs like Kafka!
- Easy to reason about: either present or it's not
- Eliminates all kinds of transactional headaches
- Extremely cachable
- Extremely high availability and durability, tunable write latency
- Low read latencies: can respond from closest replica
- Especially valuable for geographic distribution
- Requires garbage collection!
- But there are good ways to do this
- Pointers to immutable values
- Pointers are small! Only metadata!
- Can fit huge numbers of pointers on a small DB
- Good candidate for consensus services or relational DBs
- And typically, not many pointers in the system
- Your entire DB could be represented by a single pointer
- Datomic only has ~5 identities
- Strongly consistent operations over identities can be backed by immutable
HA storage
- Take advantage of AP storage latencies and scale
- Take advantage of strong consistency over small datasets provided by consensus systems
- Write availability limited by identity store
- But, reads eminently cachable if you only need sequential consistency
- Can be even cheaper if you only need serializability
- See Rich Hickey's talks on Datomic architecture
- See Pat Helland's 2013 RICON West keynote on Salesforce's storage
- The problem is composed of interacting logical pieces
- Pieces have distinct code, performance, storage needs
- Monolithic applications are essentially multitenant systems
- Multitenancy is tough
- But its often okay to run multiple logical "services" in the same process
- Monolithic applications are essentially multitenant systems
- Divide your system into logical services for discete parts of the domain
model
- OO approach: each noun is a service
- User service
- Video service
- Index service
- Functional approach: each verb is a service
- Auth service
- Search service
- Dispatch/routing service
- Most big systems I know of use a hybrid
- Services for nouns is a good way to enforce datatype invariants
- Services for verbs is a good way to enforce transformation invariants
- So have a basic User service, which is used by an Auth service
- Where you draw the line... well that's tricky
- Services come with overhead: have as few as possible
- Consider work units
- Separate services which need to scale independently
- Colocate services with tight dependencies and tight latency budgets
- Colocate services which use complementary resources (e.g. disk and CPU)
- By hand: Run memcache on rendering nodes
- Newer shops: Google Borg, Mesos, Kubernetes
- OO approach: each noun is a service
- Production software is a fundamentally social artifact
- Natural alignment: a team or person owns a specific service
- Jo Freeman, "The Tyranny of Structurelessness"
- Responsibility and power should be explicit
- Rotate people through roles to prevent fiefdoms
- Promotes information sharing
- But don't rotate too often
- Ramp-up costs in software are very high
- Jo Freeman, "The Tyranny of Structurelessness"
- As the team grows, its mission and thinking will formalize
- So too will services and their boundaries
- Gradually accruing body of assumptions about service relation to the world
- Punctuated by rewrites to respond to changing external pressures
- Tushman & Romanelli, 1985: Organizational Evolution
- Inter-team communication goes hand in hand with inter-service communication
- Conway's law
- Invest time in good team relationships for neighbor services
- Or conversely, choose teams that work well together to implement dependencies
- Consider team size and codebase complexity when sizing services
- Services can be libraries
- Initially, all your services should be libraries
- Perfectly OK to depend on a user library in multiple services
- Libraries with well-defined boundaries are easy to extract into services later
- Services probably need a client library
- That library might be "Open a socket" or an HTTP client
- Leverage HTTP headers!
- Accept headers for versioning
- Lots of support for caching and proxying
- Haproxy is an excellent router for both HTTP and TCP services
- Leverage HTTP headers!
- Eventually, library might include mock IO
- Service team is responsible for testing that the service provides an API
- When the API is known to be stable, every client can assume it works
- Removes the need for network calls in test suites
- Dramatic reduction in test runtime and dev environment complexity
- That library might be "Open a socket" or an HTTP client
When possible, try to use a single node instead of a distributed system. Accept that some failures are unavoidable: SLAs and apologies can be cost-effective. To handle catastrophic failure, we use backups. To improve reliability, we introduce redundancy. To scale to large problems, we divide the problem into shards. Immutable values are easy to store and cache, and can be referenced by mutable identies, allowing us to build strongly consistent systems at large scale. As software grows, different components must scale independently, and we break out libraries into distinct services. Service structure goes hand-in-hand with teams.
- More than design considerations
- Proofs are important, but real systems do IO
- Understanding a distributed system in production requires close cooperation
of people with many roles
- Development
- QA
- Operations
- Empathy matters
- Developers have to care about production
- Ops has to care about implementation
- Good communication enables faster diagnosis
- Type systems are great for preventing logical errors
- Which reduces your testing burden
- However, they are not great at predicting or controlling runtime performance
- So, you need a solid test suite
- Ideally, you want a slider for rigorousness
- Quick example-based tests that run in a few seconds
- More thorough property-based tests that can run overnight
- Be able to simulate an entire cluster in-process
- Control concurrent interleavings with simulated networks
- Automated hardware faults
- Jeff Hodges: The worst bug you'll ever hear is "it's slow"
- Happens all the time, really difficult to localize
- Because the system is distributed, have to profile multiple nodes
- Not many profilers are built for this
- Sigelman et al, 2010: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
- Zipkin
- Big tooling investment
- Profilers are good at finding CPU problems
- But high latency is often a sign of IO, not CPU
- Disk latency
- Network latency
- GC latency
- Queue latency
- Try to localize problem using application-level metrics
- Then dig in to process and OS performance
- Slowness (and outright errors) in prod stem from the interactions between
systems
- Why? Because your thorough test suite probably verified that the single system was mostly correct
- So we need a way to understand what the system is doing in prod
- In relation to its dependencies
- Which can, in turn, drive new tests
- In a way, good monitoring is like continuous testing
- But not a replacement: these are distinct domains
- Both provide assurance that your changes are OK
- Want high-frequency monitoring
- Production behaviors can take place on 1ms scales
- TCP incast
- ~1ms resolution ideally
- Ops response time, in the limit, scales linearly with observation latency
- ~1 second end to end latency
- Ideally, millisecond latencies, maybe ms resolution too
- Usually cost-prohibitive; back off to 1s or 10s
- Sometimes you can tolerate 60s
- Production behaviors can take place on 1ms scales
- And for capacity planning, hourly/daily seasonality is more useful
- Instrumentation should be tightly coupled to the app
- Measure only what matters
- Responding to requests is important
- Node CPU doesn't matter as much
- Key metrics for most systems
- Apdex: successful response WITHIN latency SLA
- Latency profiles: 0, 0.5, 0.95, 0.99, 1
- Overall throughput
- Queue statistics
- Subjective experience of other systems latency/throughput
- The DB might think it's healthy, but clients could see it as slow
- Combinatorial explosion--best to use this when drilling into a failure
- You probably have to write this instrumentation yourself
- Invest in a metrics library
- Measure only what matters
- Out-of-the-box monitoring usually doesn't measure what really matters: your
app's behavior
- But it can be really useful in tracking down causes of problems
- Host metrics like CPU, disk, etc
- Where your app does something common (e.g. rails apps) tools like New Relic work well
- Superpower: distributed tracing infra (Zipkin, Dapper, etc)
- Significant time investment
- Load tests are only useful insofar as the simulated load matches the actual load
- Consider dumping production traffic
- Awesome: kill a process with SIGUSR1, it dumps five minutes of request load
- Awesome: tcpdump/tcpreplay harnesses for requests
- Awesome: shadowing live prod traffic to your staging/QA nodes
- Protocol versioning is, as far as I know, a wide-open problem
- Do include a version tag with all messages
- Do include compatibility logic
- Inform clients when their request can't be honored
- And instrument this so you know which systems have to be upgraded
- Rollouts are often how you fix problems
- Spend the time to get automated, reliable deploys
- Amplifies everything else you do
- Have nodes smoothly cycle through to prevent traffic interruption
- This implies you'll have multiple versions of your software running at
once
- Versioning rears its ugly head
- This implies you'll have multiple versions of your software running at
once
- Inform load balancer that they're going out of rotation
- Coordinate to prevent cascading failures
- Roll out only to a fraction of load or fraction of users
- Gradually ramp up number of users on the new software
- Either revert or roll forward when you see errors
- Consider shadowing traffic in prod and comparing old/new versions
- Good way to determine if new code is faster & correct
- A lot of your software's features can degrade piecewise
- When a feature is problematic, disable it
- Use a highly available coordination service to decide which codepaths to
enable
- This service should have minimal dependencies
- Don't use the primary DB
- This service should have minimal dependencies
- When things go wrong, you can tune the system's behavior
- When coordination service is down, fail safe!
- Every queue is a place for things to go horribly, horribly wrong
- No node has unbounded memory. Your queues must be bounded
- But how big? Nobody knows
- Instrument your queues in prod to find out
- Queues exist to smooth out fluctuations in load
- If your load is higher than capacity, no queue will save you
- Shed load or apply backpressure when queues become full
- Instrument this
- When load-shedding occurs, alarm bells should ring
- Backpressure is visible as upstream latency
- Instrument queue depths
- High depths is a clue that you need to add node capacity
- End to end queue latency should be smaller than fluctuation timescales
- Raising the queue size can be tempting, but is a vicious cycle
- High depths is a clue that you need to add node capacity
- All of this is HARD. I don't have good answers for you
- Ask Jeff Hodges why it's hard: see his RICON West 2014 talk
- If your load is higher than capacity, no queue will save you
Running distributed systems requires cooperation between developers, QA, and operations engineers. Static analysis, and a test suite including example- and property-based tests, can help ensure program correctness, but understanding production behavior requires comprehensive instrumentation and alerting. Mature distributed systems teams often invest in tooling: traffic shadowing, versioning, incremental deploys, and feature flags. Finally, queues require special care.