This file contains a list of papers in the field of distributed consensus. Many of the papers fit into more than one section, however, for simplicity, each paper is listed only in the most relevant section.
Theoretical results relating to distributed consensus
- Impossibility of Distributed Consensus with One Faulty Process
- aka the FLP result
- Unreliable Failure Detectors for Reliable Distributed Systems
- Time, Clocks, and the Ordering of Events in a Distributed System
- The Weakest Failure Detector for Solving Consensus
- Omega Meets Paxos: Leader Election and Stability without Eventual Timely Links
- Lower Bounds for Asynchronous Consensus
- On the Minimal Synchronism Needed for Distributed Consensus
- The implementation of reliable distributed multiprocess systems
Surveys, tutorials & evaluations of consensus algorithms
- Classic Paxos vs. Fast Paxos: Caveat Emptor
- Vive La Difference: Paxos vs. Viewstamped Replication vs. Zab
- The ABCD’s of Paxos
- Paxos Made Simple
- describes single-degree Paxos as well as Multi-Paxos from SMR
- Revisiting the PAXOS algorithm
- Total order broadcast and multicast algorithms: Taxonomy and survey
- The Performance of Paxos in the Cloud
- How to Build a Highly Available System Using Consensus
- Consensus in the Cloud: Paxos Systems Demystified
- Paxos Made Moderately Complex
- Tutorial Summary: Paxos Explained from Scratch
- A Modular Approach to Fault-Tolerant Broadcasts and Related Problems
Algorithms for distributed consensus
- Consensus in the Presence of Partial Synchrony
- The Part-Time Parliament
- Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems
- Viewstamped Replication Revisited
- Efficient Message Ordering in Dynamic Networks
- Stoppable Paxos
- Fast Paxos
- Consensus on Transaction Commit
- Generalized Consensus and Paxos
- Vertical Paxos and Primary-Backup Replication
- Disk Paxos
- Cheap Paxos
- Paxos Made Practical
- Reconfiguring a State Machine
- Reliable communication in the presence of failures
- Efficient message ordering in dynamic networks
- On Collision-fast Atomic Broadcast
- Dynamic atomic storage without consensus
- Specifying and Using a Partitionable Group Communication Service
- Active Disk Paxos with infinitely many processes
- CASPaxos: Replicated State Machines without logs
- Multicoordinated Paxos
- Paxos Quorum Leases: Fast Reads Without Sacrificing Writes
- extends the idea of master read leases to allow the master to promise to use a specified subset of acceptors in every majority quorum. Acceptors in this quorum can then serve reads locally.
- similar to master read leases, it relies on clock synchrony.
Implementing consensus using specialist hardware, SDN, IP-multicast, RDMA etc
- Paxos Made Switch-y
- NetPaxos: consensus at network speed
- Consensus in a Box: Inexpensive Coordination in Hardware
- Ring Paxos: A high-throughput atomic broadcast protocol
- Multi-Ring Paxos
- Taming uncertainty in distributed systems with help from the network
- Derecho: Group Communication at the Speed of Light
- Groups, Subgroups and Auto-Sharding in Derecho: A Customizable RDMA Framework for Highly Available Cloud Services
- Derecho: Fast State Machine Replication for Cloud Services
- Just say NO to Paxos Overhead: Replacing Consensus with Network Ordering
- AllConcur: Leaderless Concurrent Atomic Broadcast
- DARE: High-Performance State Machine Replication on RDMA Networks
Implementing consensus for geo-distributed systems
- MDCC: Multi-Data Center Consistency
- Canopus: A Scalable and Massively Parallel Consensus Protocol
- There Is More Consensus in Egalitarian Parliaments
- DPaxos: Managing Data Closer to Users for Low-Latency and Mobile Applications
- SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines
- FleetDB: Follow-the-workload Data Migration for Globe-Spanning Databases
- Mencius: Building Efficient Replicated State Machines for WANs
- Geo-replicated storage with scalable deferred update replication
- Multileader WAN Paxos: Ruling the Archipelago with Fast Consensus
- Scalable Consistency in Scatter
- GlobalFS: A Strongly Consistent Multi-Site File System
- CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems
- Modelling Paxos performance in wide area
- Low-Latency Multi-Datacenter Databases using Replicated Commit
- FaunaDB: An Architectural Overview
Distributed consensus in production
- Paxos Made Live - An Engineering Perspective
- The Chubby lock service for loosely-coupled distributed systems
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- seems to use an unusual definition of Multi-Paxos where each instance is district but the 1a/1b messages for slot i is piggybacked onto 2a2/b for i-1
- uses SMR with witnesses, replicas with participate in log replication but do not run a state machine and read-only replicas which only run a state machine.
- Zab: High-performance broadcast for primary-backup systems
- ZooKeeper: Wait-free coordination for Internet-scale systems
- Widely utilized Apache licensed open source project written in Java [project website]
- Architecture is similar to Google's Chubby but unlike Chubby is described in detail and open source
- Writes are linearizable, reads may be stale unless sync is called first
- Clients may have multiple outstanding requests, they will be handled FIFO
- Uses primary-backup replication instead of state machine replication
- PaxosStore: High-availability Storage Made Practical in WeChat
- Windows Azure Storage: a highly available cloud storage service with strong consistency
- Distributed Coordination Engine (DConE)
- Bizur: A Key-value Consensus Algorithm for Scalable File-systems
Implementations of consensus
- Replication and fault-tolerance in the ISIS system
- The ISIS project: real experience with a fault tolerant programming system
- In Search of an Understandable Consensus Algorithm (Extended Version)
- aka the RAFT consensus paper
- algorithm implemented by etcd
- open sourced and widely utilized including by kubernetes
- written in golang
- implemented by CockroachDB
- Consensus: Bridging Theory and Practice
- PhD thesis describing RAFT consensus in more detail
- S-Paxos: Offloading the Leader for High Throughput State Machine Replication
- Optimizing Paxos with batching and pipelining
- Paxos for System Builders: An Overview
- Partitioned Paxos via the Network Data Plane
- The SMART Way to Migrate Replicated Stateful Services
- Boxwood: Abstractions as the Foundation for Storage Infrastructure
- Replication in the Harp File System
- CORFU: A Distributed Shared Log
- Tango: Distributed data structures over a shared log
- The Farsite Project: A Retrospective
- Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore
- Paxos replicated state machines as the basis of a high-performance data store
- Granola: Low-Overhead Distributed Transaction Coordination
- Making Fast Consensus Generally Faster
- Scalable State-Machine Replication
- Calvin: Fast Distributed Transactions for Partitioned Database Systems
- Paxos made transparent
- Designing Distributed Systems Using Approximate Synchrony in Data Center Networks
- No compromises: distributed transactions with consistency, availability, and performance
- Building Consistent Transactions with Inconsistent Replication
- Weakly consistent key-val store with support for linearizability as requested
- Useful related blog post Lessons learned from TAPIR(s)
- geo-distributed performance is evaluated
- Commodifying Replicated State Machines with OpenReplica
Linearizability & SMR
- Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
- Linearizability: A Correctness Condition for Concurrent Objects
- Implementing Linearizability at Large Scale and Low Latency
- Cheap and Available State Machine Replication
Weaker consistency models
- Existential Consistency: Measuring and Understanding Consistency at Facebook
- What consistency does your key-value store actually provide?
- offline consistency checking of key-value traces
- TAO: Facebook’s Distributed Data Store for the Social Graph
- Consistency in Non-Transactional Distributed Storage Systems
- Dynamo: Amazon’s Highly Available Key-value Store
- Bigtable: A Distributed Storage System for Structured Data
- Spanner: Google’s Globally-Distributed Database
- Provides linearizability but it assumes a bounded clock drift
- Google implement this using Truetime, GPS and atomic clocks in their data centers instead of NTP
- Closed source but now available as a cloud service, Cloud Spanner
- Cassandra - A Decentralized Structured Storage System
- not discussion in paper but Cassandra now uses Paxos for lightweight transactions.
- Spanner, TrueTime & The CAP Theorem
- Towards Robust Distributed Systems
- PODC keynote in which Eric Brewer proposed the now infamous CAP theorem
- Quantifying eventual consistency with PBS
- Eventual Consistency Today: Limitations, Extensions, and Beyond
- Fine-grained consistency for geo-replicated systems
- The many faces of consistency
- Benchmarking Cloud Serving Systems with YCSB
- Popular benchmarking tool for key-values stores
- Actively maintained open source project with support for various data stores
- Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency
- This paper introduced the idea of leases for distributed caches. This idea is used in master leases and read quorum leases
Failures & Bugs
- The Network is Reliable: An informal survey of real-world communications failures
- Communication Costs in Real-world Networks
- What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems
Correctness of consensus algorithms
- IronFleet: Proving Practical Distributed Systems Correct
- Paxos Made EPR: Decidable Reasoning about Distributed Protocols
- Verdi: A framework for implementing and formally verifying distributed systems
- A Proof of Correctness for Egalitarian Paxos
- Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers
- Modeling Paxos and Flexible Paxos in Pluscal and TLA+
Quorum systems
- A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases
- Weighted Voting for Replicated Data
- An Efficient and Fault-tolerant Solution for Distributed Mutual Exclusion
- The Generalized Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data
- The Reliability of Voting Mechanisms
- The grid protocol: a high performance scheme for maintaining replicated data
- How to Assign Votes in a Distributed System
- Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data
- The Load, Capacity, and Availability of Quorum Systems
- A √N algorithm for mutual exclusion in decentralized systems
- The Availability of Quorum Systems
- Coterie Availability in Sites
- The virtue of dependent failures in multi-site systems
- Crumbling Walls: A Class of Practical and Efficient Quorum Systems
Reading lists
- Testing Distributed Systems by Andrey Satarin
- An introduction to distributed systems by Kyle Kingsbury
The following lists contain places to watch for new writings in the field of distributed consensus.
Blogroll
- Jepsen by Kyle Kingsbury
- Aphyr by Kyle Kingsbury
- The Paper Trail
- Brave new geek by Tyler Treat
- Highly Available, Seldom Consistent by Peter Bailis
- Christopher Meiklejohn
- Denis Rystsov
- Metadata by Murat Demirbas
- Slash dev slash null
- David Turner
- Aleksey Charapko
- The Morning Paper by Adrian Colyer
- Hacking, Distributed by Emin Gün Sirer
- All Things Distributed by Werner Vogels
Academic conferences & symposiums
- ACM Symposium on Principles of Distributed Computing (PODC)
- IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
- IEEE International Conference on Distributed Computing Systems (ICDCS)
- International Conference on Principles of Distributed Systems (OPODIS)
- International Symposium on Distributed Computing (DISC)
- International Symposium on Reliable Distributed Systems (SRDS)
- ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)
- USENIX Symposium on Networked Systems Design and Implementation (NSDI)
- USENIX Symposium on Operating Systems Design and Implementation (OSDI) - Biennial evens
- ACM Symposium on Operating Systems Principles (SOSP) - Biennial odds
- USENIX Annual Technical Conference (ATC)
- European Conference on Computer Systems (EuroSys)
- ACM SIGMOD/PODS Conference
- ACM SIGMETRICS / IFIP Performance
- ACM Annual Conference of the Special Interest Group on Data Communication (SIGCOMM)
- USENIX Conference on File and Storage Technologies (FAST)
- ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
- International Conference on Very Large Data Bases (VLDB)
- ACM Symposium on Cloud Computing (SoCC)
- ACM Symposium on Theory of Computing (STOC)
Workshops
- Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC)
- ACM SIGOPS Workshop on Large-Scale Distributed Systems and Middleware (LADIS)
- USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage)
- Workshop on Hot Topics in Operating Systems (HotOS)
- ACM Workshop on Hot Topics in Networks (HotNets)
- USENIX Workshop on Hot Topics in Cloud Computing (HotCloud)
- USENIX Workshop on Hot Topics in Edge Computing (HotEdge)
Journals & Magazines
- ACM Transactions on Computer Systems (TOCS)
- Journal of the ACM (JACM)
- Communications of the ACM (CACM)
- SIGOPS Operating Systems Review (OSR)
- ACM Computing Surveys (CSUR)
- ACM Transactions on Database Systems (TODS)
- ACM Queue
- ACM SIGACT News
- IEEE Transactions on Dependable and Secure Computing (TDSC)
- IEEE Transactions on Parallel and Distributed Systems (TPDS)
- IEEE Transactions on Computers (TC)
- IEEE Transactions on Software Engineering