Distributed Consensus Reading List 📚

Since its inception in the 1980s, distributed consensus has been the subject of extensive academic research. Whilst definitions vary, distributed consensus (or equivalently, atomic broadcast) most often refers to the problem of how to decide an ordered sequence of values between a set of distributed nodes. This can be used to implement an append-only replicated log which can be utilized either directly or indirectly, to provide services such as primary backup replication or state machine replication. These abstractions can, in turn, form the building blocks of new abstractions, such as a distributed key-value store. Some consensus algorithms instead decide only a single value or a partially ordered sequence of values. What unifies distributed consensus algorithms is the fact that they are always safe, regardless of delays and crashes (though they are not necessarily Byzantine fault tolerance), and are guaranteed to make progress provided sufficient liveness.

This is a long list of papers relating to distributed consensus. Many of the papers listed below fit into more than one section. However, for simplicity, each paper is listed only in the most relevant section. Where possible, open access links for each paper have been provided.

Contributions via pull requests are welcome.

⭐️ Influential papers - If you are looking for a starting point, a subset of the most influential papers on distributed consensus are highlighted using a yellow star. ⭐️

💎 Hidden gems - Papers which I personally love but are not as highly cited as the influential papers 💎

Key: acmdl = ACM Digital Library

The sections are as follows:

Distributed Consensus

Theoretical results

This section lists theoretical results relating to distributed consensus.

  • ⭐️ Time, Clocks, and the Ordering of Events in a Distributed System, CACM 1978 [acmdl,pdf]
    • Easily one of the most influential papers in distributed computing. Introduces the "happens-before" relation and Lamport clocks.
  • The implementation of reliable distributed multiprocess systems, Computer Networks 1978 [pdf]
    • Precursor to Paxos, notes on achieving fault-tolerance with some degree of clock synchronization.
  • ⭐️ Impossibility of Distributed Consensus with One Faulty Process, JACM 1985 [acmdl,pdf]
    • The famous FLP result, proving that distributed consensus algorithms need synchrony to guarantee progress.
  • On the Minimal Synchronism Needed for Distributed Consensus, JACM 1987 [acmdl,pdf]
    • Follow up to the FLP result with some extra considerations
  • Unreliable Failure Detectors for Reliable Distributed Systems, JACM 1996 [acmdl,pdf]
  • The Weakest Failure Detector for Solving Consensus, JACM 1996 [acmdl,pdf]
  • Omega Meets Paxos: Leader Election and Stability without Eventual Timely Links, DISC 2005 [acmdl,pdf]
  • Lower Bounds for Asynchronous Consensus, Distributed Computing 2006 [acmdl,pdf]
  • The Heard-Of Model: Computing in Distributed Systems with Benign Failures, Distributed Computing 2009 [acmdl,pdf]
  • Virtually Synchronous Methodology for Dynamic Service Replication, MS Tech report 2010 [pdf]

Surveys

This section lists surveys, tutorials, and systemization of knowledge papers covering distributed consensus algorithms.

  • A Modular Approach to Fault-Tolerant Broadcasts and Related Problems, Tech Report 1994 [acmdl,pdf]
  • How to Build a Highly Available System Using Consensus, WDAG 1996 [acmdl,pdf]
  • Revisiting the PAXOS algorithm, WDAG 1997 [acmdl,pdf]
  • The ABCD’s of Paxos, PODC 2001 [acmdl,pdf]
  • ⭐️ Paxos Made Simple, SIGACT News 2001 [pdf]
    • Describes single-degree Paxos as well as Multi-Paxos for SMR
    • Is much more readable than the original Paxos paper
    • Featured in the morning paper
  • Deconstructing paxos, SIGACT News 2003 [pdf,acmdl]
  • Total order broadcast and multicast algorithms: Taxonomy and survey, CSUR 2004 [acmdl,pdf]
  • 💎 Vive La Difference: Paxos vs. Viewstamped Replication vs. Zab, TDSC 2005 [pdf]
  • The Alpha of Indulgent Consensus, Comp Journal 2007 [acmdl,pdf]
  • The Paxos Register, SRDS 2007 [acmdl,pdf]
  • Classic Paxos vs. Fast Paxos: Caveat Emptor, HotDep 2007 [acmdl,pdf]
  • Tutorial Summary: Paxos Explained from Scratch, OPODIS 2013 [acmdl,pdf]
  • 💎 Paxos Made Moderately Complex, CSUR 2015 [acmdl,pdf]
  • On the Parallels between Paxos and Raft, and how to Port Optimizations, PODC 2019 [acmdl,pdf]
  • Paxos vs Raft: Have we reached consensus on distributed consensus?, PaPoC 2020 [acmdl,arxiv]
  • 60 Years of Mastering Concurrent Computing through Sequential Thinking, SIGACT News 2020 [acmdl]
  • What's Live? Understanding Distributed Consensus, PODC 2021 [acmdl,arxiv]
  • SoK: A Generalized Multi-Leader State Machine Replication Tutorial, JSys 2021 [pdf]

Algorithms for consensus

This section lists papers describing algorithms for distributed consensus. These papers tend to be theory papers (venues such as PODC, DISC, OPODIS) whereas the Implementations of consensus section focuses on systems papers.

  • Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols, PODC 1983 [acmdl]
    • As known as the Ben-Or algorithm
  • Reliable communication in the presence of failures, TOCS 1987 [acmdl,pdf]
  • ⭐️ Consensus in the Presence of Partial Synchrony, JACM 1988 [acmdl,pdf]
  • Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, PODC 1988 [acmdl,pdf]
  • Efficient Message Ordering in Dynamic Networks, PODC 1996 [acmdl,pdf]
  • ⭐️ The Part-Time Parliament, TOCS 1998 [acmdl,pdf]
  • Disk Paxos, DISC 2000 [acmdl,pdf]
    • This paper describes how to replace acceptors in Paxos with disks
    • Each disk is divided into blocks, one for each proposer. Each proposer may only write to its own block and read from other blocks, which they do in each of the two usual Paxos phases
    • Each block contains the rough equivalent to last promised ballot number and last accepted proposal for the assigned proposer
  • Specifying and Using a Partitionable Group Communication Service, TOCS 2001 [acmdl,pdf]
  • Active Disk Paxos with infinitely many processes, PODC 2002 [acmdl,pdf]
    • This paper makes Disk Paxos more “Paxos like” by assuming the disks support more operations e.g. conditional write
    • ADP claims that Disk Paxos requires a fixed set of proposers and that ADP fixes this.
  • Cheap Paxos, DSN 2004 [acmdl,pdf]
  • Generalized Consensus and Paxos, Tech Report 2005 [pdf]
    • Introduces the idea of deciding a partial ordering of values instead of a total ordering
  • Fast Paxos, Distributed Computing 2006 [pdf]
    • Variant of Paxos where proposers can bypass the leader by allowing multiple values to be proposed in the same ballot. This requires stronger quorums intersection, e.g. fast paxos needs 3/4 of acceptors (instead of a simple majority) to provide the same liveness guarantees as classic Paxos.
  • Consensus on Transaction Commit, TODS 2006 [acmdl,pdf]
  • Multicoordinated Paxos, PODC 2007 [acmdl,pdf]
    • Variant of Paxos which replaces the one leader with a group of leaders. Clients send operations to all leaders and they all propose values to the acceptors. Acceptors only accept a value if they have received proposals from a quorum of leaders. Similar to the non-equivocation phase in BFT. Liveness now does not depend on the leader.
  • Stoppable Paxos, Tech Report 2008 [pdf]
  • Yet Another Visit to Paxos, Tech report 2009 [pdf]
  • Dynamic atomic storage without consensus, JACM 2011 [acmdl,pdf]
  • Fast Genuine Generalized Consensus, SRDS 2011 [acmdl,pdf]
  • 💎 Viewstamped Replication Revisited, Tech Report 2012 [pdf]
  • On Collision-fast Atomic Broadcast, AINA 2014 [pdf]
  • Paxos Quorum Leases: Fast Reads Without Sacrificing Writes, SOCC 2014 [acmdl,pdf]
    • Extends the idea of master read leases to allow the master to promise to use a specified subset of acceptors in every majority quorum. Acceptors in this quorum can then serve reads locally.
    • Similar to master read leases, it relies on clock synchrony.
  • Consus: Taming the Paxi, Unpublished 2016 [arxiv]
  • Flexible Paxos: Quorum Intersection Revisited, OPODIS 2016 [arxiv,pdf]
  • 💎 CASPaxos: Replicated State Machines without logs, Unpublished 2018 [arxiv]
  • Fast Flexible Paxos: Relaxing Quorum Intersection for Fast Paxos, ICDCN 2021 [acmdl,arxiv]
  • Spire: A Cooperative, Phase-Symmetric Solution to Distributed Consensus, IEEE Access 2021 [pdf]
    • Consensus algorithm which permits multiple proposals in the same round (similar to Fast Paxos) but uses two phases instead of larger quorums.
  • Paxos Made Practical, Unpublished [pdf]
  • Relaxed Paxos: Quorum intersection revisited (again), PaPoC 2022 [acmdl,arvix]
  • Live SMR in partitionable networks [pdf]
  • QuePaxa: Escaping the tyranny of timeouts in consensus, SOSP 2023 [pdf]

Consensus for specialist hardware

This section lists papers describing consensus algorithms using specialist hardware such as SDN, IP-multicast, or RDMA.

  • Ring Paxos: A high-throughput atomic broadcast protocol, DSN 2010 [pdf,code]
  • Multi-Ring Paxos, DSN 2012 [acmdl,pdf]
  • NetPaxos: consensus at network speed, SOSR 2015 [acmdl,pdf]
  • Taming uncertainty in distributed systems with help from the network, Eurosys 2015 [acmdl,pdf]
  • DARE: High-Performance State Machine Replication on RDMA Networks, HPDC 2015 [acmdl,pdf]
  • Paxos Made Switch-y, CCR 2016 [acmdl,pdf]
  • Consensus in a Box: Inexpensive Coordination in Hardware, NSDI 2016 [acmdl,pdf]
  • Distributed Consensus and Implications of NVM on Database Management Systems, ACM Queue 2016 [acmdl,html]
  • AllConcur: Leaderless Concurrent Atomic Broadcast, HPDC 2017 [acmdl,pdf]
  • APUS: Fast and Scalable Paxos on RDMA, SoCC 2017 [acmdl,pdf]
  • When Raft Meets SDN: How to Elect a Leader and Reach Consensus in an Unruly Network, APNet 2017 [acmdl,pdf]
  • P4xos: Consensus as a Network Service, Tech Report 2018 [pdf]
    • P4xos is also evaluated in The Case For In-Network Computing On Demand [acmdl,pdf,code]
  • Derecho: Fast State Machine Replication for Cloud Services, TOCS 2019 [acmdl,pdf,code]
    • Derecho: Group Communication at the Speed of Light, Unpublished [pdf]
    • Groups, Subgroups and Auto-Sharding in Derecho: A Customizable RDMA Framework for Highly Available Cloud Services, Unpublished [pdf]
  • NetChain: Scale-Free Sub-RTT Coordination, NSDI 2018 [acmdl,pdf]
  • Kernel Paxos, SRDS 2018 [pdf]
  • Partitioned Paxos via the Network Data Plane, Tech Report 2019 [pdf]
  • The Impact of RDMA on Agreement, PODC 2019 [pdf]
  • CCF: A Framework for Building Confidential Verifiable Replicated Service, Whitepaper 2019 [pdf,code]
  • HovercRaft: Achieving Scalability and Fault-tolerance for microsecond-scale Datacenter Services, Eurosys 2020 [acmdl]
  • FLAIR: Accelerating Reads with Consistency-Aware Network Routing, NSDI 2020 [acmdl,pdf]
  • Microsecond Consensus for Microsecond Applications, OSDI 2020 [arxiv]
  • High availability in cheap distributed key value storage, SoCC 2020 [acmdl]
  • Odyssey: The Impact of Modern Hardware on Strongly-Consistent Replication Protocols, Eurosys 2021 [acmdl, pdf,techreport,thesis]

Consensus for geo-distributed systems

This section covers papers describing consensus algorithms for WANs and/or geo-replicated systems. Many of these algorithms (such as EPaxos) are leaderless and decide a partial-ordering over values instead of the more traditional total-ordering approach.

  • Mencius: Building Efficient Replicated State Machines for WANs, OSDI 2008 [acmdl,pdf]
  • Scalable Consistency in Scatter, SOSP 2011 [acmdl,pdf]
  • MDCC: Multi-Data Center Consistency, Eurosys 2013 [acmdl,pdf]
  • There Is More Consensus in Egalitarian Parliaments, SOSP 2013 [acmdl,pdf]
    • This paper describes EPaxos, which realizes Generalized Paxos and makes some further improvements (e.g. reducing the size of fast quorums by 1).
  • Geo-replicated storage with scalable deferred update replication, DSN 2013 [acmdl,pdf]
  • Low-Latency Multi-Datacenter Databases using Replicated Commit, VLDB 2013 [acmdl,pdf]
  • Be General and Don’t Give Up Consistency in Geo-Replicated Transactional Systems, OPODIS 2014 [pdf]
  • CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems, FAST 2015 [acmdl,pdf]
  • GlobalFS: A Strongly Consistent Multi-Site File System, SRDS 2016 [pdf]
  • Canopus: A Scalable and Massively Parallel Consensus Protocol, CoNEXT 2017 [acmdl,pdf]
  • Multileader WAN Paxos: Ruling the Archipelago with Fast Consensus, Tech report 2017 [pdf]
  • WPaxos: Wide Area Network Flexible Consensus, Unpublished 2017 [pdf]
  • Speeding up Consensus by Chasing Fast Decisions, DSN 2017 [pdf]
    • Implements an optimization to EPaxos
  • Leader Set Selection for Low-Latency Geo-Replicated State Machine, IEEE TPDS 2017 [pdf]
  • DPaxos: Managing Data Closer to Users for Low-Latency and Mobile Applications, SIGMOD 2018 [acmdl,pdf]
  • SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines, SoCC 2018 [acmdl,pdf]
  • FleetDB: Follow-the-workload Data Migration for Globe-Spanning Databases, Tech report 2018 [pdf]
  • Geographic State Machine Replication, SRDS 2018 [pdf]
  • Session Guarantees with Raft and Hybrid Logical Clocks, ICDCN 2019 [acmdl]
  • Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed Storage, NSDI 2020 [pdf]
  • State-Machine Replication for Planet-Scale Systems, Eurosys 2020 [acmdl,arxiv]
  • Low-Latency Geo-Replicated State Machines with Guaranteed Writes, PaPoC 2020 [acmdl]
  • EPaxos Revisited, NSDI 2021 [pdf]
  • Efficient Replication via Timestamp Stability, Eurosys 2021 [acmdl,arxiv]
    • Describes Tempo, a leaderless partial ordering protocol that uses timestamp ordering.
  • Reducing the Latency of Dependent Operations in Large-Scale Geo-Distributed Systems, PhD Thesis 2021 [pdf]
  • LEGOStore: A Linearizable Geo-Distributed Store Combining Replication and Erasure Coding, Preprint 2021 [arxiv]
  • RACS and SADL: Towards Robust SMR in the Wide-Area Network, Preprint 2024 [arxiv]

Consensus in production

This section lists papers describing experiences of deploying distributed consensus in production.

  • ⭐️ The Chubby lock service for loosely-coupled distributed systems, OSDI 2006 [acmdl,pdf]
  • ⭐️ Paxos Made Live - An Engineering Perspective, PODC 2007 [acmdl,pdf]
  • ⭐️ ZooKeeper: Wait-free coordination for Internet-scale systems, ATC 2010 [acmdl,pdf]
  • Windows Azure Storage: a highly available cloud storage service with strong consistency, SOSP 2011 [acmdl,pdf]
  • Megastore: Providing Scalable, Highly Available Storage for Interactive Services, CIDR 2011 [pdf]
    • Megastore uses SMR with witnesses, replicas that participate in log replication but do not run a state machine and read-only replicas that only run a state machine. This paper seems to use an unusual definition of Multi-Paxos where each instance is district but the 1a/1b messages for slot i is piggybacked onto 2a2/b for slot i-1.
  • Zab: High-performance broadcast for primary-backup systems, DSN 2011 [acmdl,pdf]
    • Widely utilized Apache licensed open source project written in Java project website
    • Apache Kafka uses Zookeeper, as well as its own replication protocol, by described here. This is no longer true.
    • Architecture is similar to Google's Chubby but unlike Chubby is described in detail and is open source
    • Writes are linearizable, reads may be stale. Note: calling sync before a read doesn't make it linearizable
    • Clients may have multiple outstanding requests, they will be handled FIFO
    • Uses primary-backup replication instead of state machine replication
    • Featured in the morning paper
  • Large-scale cluster management at Google with Borg, Eurosys 2015 [acmdl,pdf]
  • PaxosStore: High-availability Storage Made Practical in WeChat, VLDB 2017 [acmdl,pdf]
  • Bizur: A Key-value Consensus Algorithm for Scalable File-systems, Unpublished 2017 [pdf]
  • SLOG: Serializable, Low-latency, Geo-replicated Transactions, VLDB 2019 [acmdl,pdf]
  • CockroachDB: The Resilient Geo-Distributed SQL Database, ICMD 2020 [acmdl]
  • Millions of Tiny Databases, NSDI 2020 [pdf]
  • Virtual Consensus in Delos, OSDI 2020 [pdf]
  • Log-structured Protocols in Delos, SOSP 2021 [pdf]

Implementations of consensus

This section lists papers describing implementations of distributed consensus algorithms.

  • Replication and fault-tolerance in the ISIS system, Tech Report 1985 [acmdl,pdf]
  • The ISIS project: real experience with a fault tolerant programming system, OSR 1991 [acmdl,pdf]
  • Replication in the Harp File System, SOSP 1991 [acmdl,pdf]
  • Boxwood: Abstractions as the Foundation for Storage Infrastructure, OSDI 2004 [acmdl,pdf]
  • The Farsite Project: A Retrospective, OSR 2007 [acmdl,pdf]
  • Paxos for System Builders: An Overview, LADIS 2008 [acmdl,pdf]
  • Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore, VLDB 2011 [acmdl,pdf]
  • Paxos replicated state machines as the basis of a high-performance data store, NSDI 2011 [acmdl,pdf]
  • Granola: Low-Overhead Distributed Transaction Coordination, ATC 2012 [acmdl,pdf]
  • S-Paxos: Offloading the Leader for High Throughput State Machine Replication, SRDS 2012 [acmdl,pdf]
  • Calvin: Fast Distributed Transactions for Partitioned Database Systems, SIGMOD 2012 [acmdl,pdf]
  • Commodifying Replicated State Machines with OpenReplica, Tech report 2012 [pdf]
  • Optimizing Paxos with batching and pipelining, Theoretical Computer Science 2013 [acmdl,pdf]
  • Tango: Distributed data structures over a shared log, SOSP 2013 [acmdl,pdf]
  • CORFU: A Distributed Shared Log, TOCS 2013 [acmdl,pdf]
  • Scalable State-Machine Replication, DSN 2014 [acmdl,pdf]
  • When Paxos Meets Erasure Code: Reduce Network and Storage Cost in State Machine Replication, HPDC 2014 [acmdl]
  • ⭐️ In Search of an Understandable Consensus Algorithm, ATC 2014 [acmdl,pdf,code,thesis]
  • Paxos made transparent, SOSP 2015 [acmdl,pdf]
  • Designing Distributed Systems Using Approximate Synchrony in Data Center Networks, NSDI 2015 [acmdl,pdf]
  • No compromises: distributed transactions with consistency, availability, and performance, SOSP 2015 [acmdl,pdf]
  • Building Consistent Transactions with Inconsistent Replication, SOSP 2015 [acmdl,pdf]
  • MetaSync: File Synchronization Across Multiple Untrusted Storage Services, ATC 2015 [pdf,acmdl]
  • Making Fast Consensus Generally Faster, DSN 2016 [pdf]
  • Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics, SIGMOD 2017 [acmdl,pdf]
  • Leader or Majority: Why have one when you can have both? Improving Read Scalability in Raft-like consensus protocols, HotCloud 2017 [pdf,acmdl,slides]
  • Bolt-On Global Consistency for the Cloud, SoCC 2018 [acmdl,pdf]
  • Stable and Consistent Membership at Scale with Rapid, ATC 2018 [pdf]
    • Uses Fast Paxos to decide on membership changes. Conflicts are rare as the proposed value is the output of a membership algorithm so proposers usually propose the same proposal.
    • Fast Paxos implementation is here and here
  • The FuzzyLog: A Partially Ordered Shared Log, OSDI 2018 [pdf]
  • Aegean: Replication beyond the client-server model, SOSP 2019 [acmdl]
    • Also supports BFT
  • Exploiting Commutativity For Practical Fast Replication, NSDI 2019 [acmdl,pdf,thesis]
  • Unifying Consensus and Atomic Commitment for Effective Cloud Data Management, VLDB 2019 [acmdl,pdf]
  • Linearizable Quorum Reads in Paxos, HotStorage 2019 [pdf,slides]
    • A two phase quorum read algorithm which does not require the leader and does not rely on bounded clock drift like read leases.
  • RMWPaxos: Fault-Tolerant In-Place Consensus Sequences, Unpublished 2020 [arxiv]
  • Bipartisan Paxos: A Modular State Machine Replication Protocol, Unpublished [pdf]
  • Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log, NSDI 2020 [pdf]
  • Hermes: A Fast, Fault-Tolerant and Linearizable Replication Protocol, ASPLOS 2020 [acmdl,arxiv,thesis]
  • PigPaxos: Devouring the communication bottlenecks in distributed consensus, ICMD 2020 [arxiv,acmdl]
  • CRaft: An Erasure-coding-supported Version of Raft for Reducing Storage Cost and Network Cost, FAST 2020 [pdf]
  • Scaling Replicated State Machines with Compartmentalization, VLDB 2021 [acmdl,arxiv,pdf]
  • Rabia: Simplifying State-Machine Replication Through Randomization, SOSP 2021 [arxiv]
  • Boki: Stateful Serverless Computing with Shared Logs, SOSP 2021 [pdf]
    • Latest edition in the line of papers on shared totally-ordered logs: Tango, Corfu, vCorfu, Scalog, Delos
  • Gossip Consensus, Middleware 2021 [pdf]
    • Looks at using gossip to reduce communication overhead of Multi-Paxos. An unstructured version of PigPaxos/Canopus.
  • Baxos: Backing off for Robust and Efficient Consensus, Preprint 2022 [arxiv]

Evaluations of consensus

This section lists papers describing standalone evaluations of consensus algorithms.

  • The Performance of Paxos in the Cloud, SRDS 2014 [acmdl,pdf]
  • Consensus in the Cloud: Paxos Systems Demystified, Tech report 2016 [pdf]
  • Spectrum: A Framework for Adapting Consensus Protocols, Unpublished 2019 [pdf]
  • Dissecting the Performance of Strongly-Consistent Replication Protocols, SIGMOD 2019 [acmdl,pdf]
  • Blockchains and Distributed Databases: a Twin Study [arxiv]
    • Performance anaylsis of 5 consensus systems, 3 non-byzantine algorithms (including etcd) and 2 byzantine consensus algorithms
  • Scalable but Wasteful: Current State of Replication in the Cloud, HotStorage 2021 [acmdl]
    • Study of the efficiency (CPU utilization) of Multi-Paxos vs EPaxos, finding that EPaxos provides better throughput than Multi-Paxos at the cost of much worse efficiency.

State machine replication

This section lists papers about the application of consensus to State Machine Replication (SMR/RSMs) and Linearizability.

  • ⭐️ Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial, CSUR 1990 [acmdl,pdf]
  • ⭐️ Linearizability: A Correctness Condition for Concurrent Objects, TOPLAS 1990 [acmdl,pdf]
  • Implementing Linearizability at Large Scale and Low Latency, SOSP 2015 [acmdl,pdf]
  • Cheap and Available State Machine Replication, ATC 2016 [acmdl,pdf]
  • Fine-Grained Replicated State Machines for a Cluster Storage System, NSDI 2020 [pdf]
  • Rolis: A software approach to efficiently replicating multi-core transactions, Eurosys 2022 [odf]
    • Optimistic execution of transactions before ordering in CFT-SMR.
  • State Machine Replication Scalability Made Simple, Eurosys 2022 [pdf,extended version]

Reconfiguration

This section lists papers on reconfiguration & leader election.

  • The SMART Way to Migrate Replicated Stateful Services, EuroSys 2006 [acmdl,pdf]
  • 💎 Vertical Paxos and Primary-Backup Replication, PODC 2009 [acmdl,pdf]
  • Reconfiguring a State Machine, SIGACT News 2010 [acmdl,pdf]
  • Dynamic Reconfiguration of Primary/Backup Clusters, ATC 2012 [acmdl,pdf]
  • Take me to your leader! Online Optimization of Distributed Storage Configurations, VLDB 2015 [pdf]
  • Unbounded Pipelining in Dynamically Reconfigurable Paxos Clusters, Unpublished 2016 [pdf]
  • Matchmaker Paxos: A Reconfigurable Consensus Protocol, JSys 2021 [pdf,arxiv]

Related Topics

Weaker consistency models

This section lists papers that discuss alternative consistency models to linearizability and/or systems that depend upon synchrony for correctness (not just liveness).

  • Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency, SOSP 1989 [acmdl,pdf]
    • This paper introduced the idea of leases for distributed caches. This idea is used in master leases and read quorum leases.
  • ⭐️ Towards Robust Distributed Systems, PODC 2000 [acmdl,pdf]
  • Chain replication for supporting high throughput and availability, OSDI 2004 [acmdl,pdf]
  • Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 [acmdl,pdf]
  • Bigtable: A Distributed Storage System for Structured Data, TOCS 2008 [acmdl,pdf]
  • What consistency does your key-value store actually provide?, HotDep 2010 [acmdl,pdf]
    • Offline consistency checking of key-value traces
  • Cassandra - A Decentralized Structured Storage System, OSR 2010 [acmdl,pdf]
  • Benchmarking Cloud Serving Systems with YCSB, SoCC 2010 [acmdl,pdf,code]
    • Popular benchmarking tool for key-values stores, actively maintained with support for various data stores.
  • Spanner: Google’s Globally-Distributed Database, OSDI 2012 [acmdl,pdf]
    • Provides linearizability but it assumes a bounded clock drift
    • Google implements this using Truetime, GPS, and atomic clocks in their data centers instead of NTP.
    • Closed source but now available as a cloud service called Cloud Spanner.
  • TAO: Facebook’s Distributed Data Store for the Social Graph, ATC 2013 [acmdl,pdf]
  • Highly Available Transactions: Virtues and Limitations, VLDB 2013 [pdf]
  • Eventual Consistency Today: Limitations, Extensions, and Beyond, ACM Queue 2013 [acmdl,pdf]
  • Quantifying eventual consistency with PBS, CACM 2014 [acmdl,pdf]
  • Existential Consistency: Measuring and Understanding Consistency at Facebook, SOSP 2015 [acmdl,pdf]
  • Minimizing coordination in replicated systems, PaPoC 2015 [acmdl,pdf]
  • Consistency in Non-Transactional Distributed Storage Systems, CSUR 2016 [acmdl,pdf]
  • Just say NO to Paxos Overhead: Replacing Consensus with Network Ordering, OSDI 2016 [acmdl,pdf]
  • The many faces of consistency, DE 2016 [pdf]
  • Spanner, TrueTime & The CAP Theorem, Tech Report 2017 [pdf]
  • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases, SIGMOD 2017 [acmdl]
  • Fine-grained consistency for geo-replicated systems, ATC 2018 [pdf]
  • Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes, SIGMOD 2018 [acmdl]
  • Sharding the Shards: Managing Datastore Locality at Scale with Akkio, OSDI 2018 [acmdl,pdf]
  • Tunable Consistency in MongoDB, VLDB 2019 [pdf]
  • On mixing eventual and strong consistency: Bayou revisited, PODC 2019 [arxiv,pdf]
  • Harmonia: Near-Linear Scalability for Replicated Storage with In-Network Conflict Detection, VLDB 2020 [pdf]
    • Implemented on programmable switches
    • Comes with TLA+ spec in tech report
  • Strong and Efficient Consistency with Consistency-Aware Durability, FAST 2020 [pdf]
  • Regular Sequential Serializability and Regular Sequential Consistency, SOSP 2021 [pdf]
    • New consistency models which are invariant equivalent to linearizability.
  • Making CRDTs Byzantine Fault Tolerant, PaPoC 2022 [pdf,acmdl]
  • Stabilizer: Geo-Replication with User-defined Consistency, ISDCS 2022 [pdf]

Failures

This section lists papers that analyze and/or handle real-world failures of distributed systems.

  • Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications, SIGCOMM 2011 [acmdl,pdf]
  • The Network is Reliable: An informal survey of real-world communications failures, ACM Queue 2014 [acmdl,pdf]
  • What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, SOCC 2014 [acmdl,pdf]
  • All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI 2014 [acmdl,pdf]
  • Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, HotOS 2017 [acmdl]
  • Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions, FAST 2017 [acmdl,pdf]
  • An Analysis of Network-Partitioning Failures in Cloud Systems, OSDI 2018 [acmdl,pdf]
  • CrashTuner: Detecting Crash-Recovery Bugs in Cloud Systems via Meta-Info Analysis, SOSP 2019 [acmdl]
  • The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure, SOSP 2019 [acmdl]
  • Toward a Generic Fault Tolerance Technique for Partial Network Partitioning, OSDI 2020 [pdf]
  • Tolerating Slowdowns in Replicated State Machines using Copilots, OSDI 2020 [pdf]
  • Metastable Failures in Distributed Systems, HotOS 2021 [acmdl]
  • Immunizing Systems from Distant Failures by Limiting Lamport Exposure, HotNets 2021 [acmdl]
  • Cores That Don’t Count, HotOS 2021 [acmdl,pdf,talk]
  • Metastable Failures in the Wild, OSDI 2022 [pdf]

Clocks

The liveness of distributed consensus depends on some degree of clock synchronization. The following section lists papers on the topic of clock synchronization.

  • IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, Standard 1588-2008 [ieee]
  • Globally Synchronized Time via Datacenter Networks, SIGCOMM 2016 [acmdl,pdf]
  • Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization, NSDI 2018 [acmdl,pdf]
  • Sundial: Fault-tolerant Clock Synchronization for Datacenters, OSDI 2020 [pdf]
  • Systems Research is Running out of Time, HotOS 2021 [acmdl,pdf,talk]
    • Some great examples of things that can go wrong with clocks.
  • Graham: Synchronizing Clocks by Leveraging Local Clock Properties, NSDI 2022 [pdf]
    • Substantially reduced clock drift when time synchronization (such as NTP, PTP, Sundial) fails. Uses only commodity hardware. Won best paper award.

Correctness of consensus algorithms

This section lists papers on proving or testing the correctness of consensus algorithms.

  • Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers, Book 2002 [acmdl,pdf,website,amazon]
    • Directory of TLA+ specs, including many for consensus algorithms TLA+ Examples
  • I Do Declare: Consensus in a Logic Language, NetDB 2009 [pdf]
  • A Proof of Correctness for Egalitarian Paxos, Tech report 2013 [pdf]
  • Verdi: A framework for implementing and formally verifying distributed systems, PLDI 2015 [acmdl,pdf]
  • IronFleet: Proving Practical Distributed Systems Correct, SOSP 2015 [acmdl,pdf]
  • Lineage-driven Fault Injection, SIGMOD 2015 [acmdl,pdf]
  • How Amazon web services uses formal methods, CACM 2015 [acmdl,html]
  • PSYNC: A partially synchronous language for fault-tolerant distributed algorithms, POPL 2016 [acmdl,pdf]
  • Ivy: safety verification by interactive generalization, PLDI 2016 [acmdl,pdf,code]
  • Brief Announcement: A Family of Leaderless Generalized-Consensus Algorithms, PODC 2016 [acmdl,pdf]
  • Paxos Made EPR: Decidable Reasoning about Distributed Protocols, OOPSLA 2017 [acmdl,arxiv]
  • Growing a protocol, HotCloud 2017 [acmdl,pdf]
  • An Empirical Study on the Correctness of Formally Verified Distributed Systems, Eurosys 2017 [acmdl,pdf]
  • Teaching Rigorous Distributed Systems With Efficient Model Checking, EuroSys 2019 [acmdl,pdf]
    • Describes DSLabs, an alternative to TLA+, for model checking distributed protocols.
    • Featured in the morning paper
  • FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems, Eurosys 2019 [acmdl,pdf]
  • Proving the Correctness of Disk Paxos in Isabelle/HOL, Unpublished 2019 [pdf]
  • I4: Incremental Inference of Inductive Invariants for Verification of Distributed Protocols, SOSP 2019 [acmdl,code]
  • Scaling symbolic evaluation for automated verification of systems code with Serval, SOSP 2019 [acmdl]
  • WormSpace: A Modular Foundation for Simple, Verifiable Distributed Systems, SoCC 2019 [acmdl]
  • TLA+ model checking made symbolic, OOPSLA 2019 [acmdl]
  • Towards an Automatic Proof of Lamport’s Paxos, FMCAD 2021 [arxiv]
    • Automatic inference of Paxos's inductive invariants
  • DistAI: Data-Driven Automated Invariant Learning for Distributed Protocols, OSDI 2021 [code,pdf]
    • Next step in the inference of inductive invariants for distributed protocols, following on from Ivy and I4. Still does not support Paxos.
  • Much ADO about Failures: A Fault-Aware Model for Compositional Verification of Strongly Consistent Distributed Systems, OOPSLA 2021 [pdf]
    • Formal proofs of distributed protocols in Coq including Multi-Paxos, produces verified C executables.
  • Adore: Atomic Distributed Objects with Certified Reconfiguration, PLDI 2022 [pdf]
  • Formal Verification of a Distributed Dynamic Reconfiguration Protocol, CCP 2022 [acmdl,arxiv,pdf,code,talk]
    • Also see: Design and Analysis of a Logless Dynamic Reconfiguration Protocol, OPODIS 2021 [pdf]
  • Plain and Simple Inductive Invariant Inference for Distributed Protocols in TLA+, FMCAD 2022 [pdf]
  • Towards Formal Verification of HotStuff-based Byzantine Fault Tolerant Consensus in Agda, NFM 2022 [arxiv]

Quorum systems

This section lists papers on quorum systems.

  • ⭐️ A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases, TODS 1979 [acmdl,pdf]
  • ⭐️ Weighted Voting for Replicated Data, SOSP 1979 [acmdl,pdf]
  • How to Assign Votes in a Distributed System, JACM 1985 [acmdl,pdf]
  • A √N algorithm for mutual exclusion in decentralized systems, TOCS 1985 [acmdl,pdf]
  • A Quorum-Consensus Replication Method for Abstract Data Types, TOCS 1984 [acmdl]
  • The Reliability of Voting Mechanisms, TC 1987 [acmdl,pdf]
  • The Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data, VLDB 1990 [pdf]
  • An Efficient and Fault-tolerant Solution for Distributed Mutual Exclusion, TOCS 1991 [acmdl,pdf]
  • Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data, TC 1991 [acmdl,pdf]
  • The Generalized Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data, TODS 1992 [acmdl,pdf]
  • The Grid Protocol: A High Performance Scheme for Maintaining Replicated Data, TKDE 1992 [acmdl,pdf]
  • Enhancing concurrency and availability for database systems, Thesis [acmdl]
  • The Availability of Quorum Systems, Tech report 1993 [acmdl,pdf]
  • Crumbling Walls: A Class of Practical and Efficient Quorum Systems, PODC 1995 [acmdl,pdf]
  • Evaluating quorum systems over the Internet, PODC 1996 [acmdl]
  • An Adaptive Data Replication Algorithm, TODC 1997 [acmdl]
  • 💎 The Load, Capacity, and Availability of Quorum Systems, SIAM 1998 [acmdl,pdf]
  • Optimal availability quorum systems: Theory and practice, IPL 1998 [pdf]
  • Are Quorums an Alternative for Data Replication?, TODS 2003 [acmdl]
  • Coterie Availability in Sites, DISC 2005 [acmdl,pdf]
  • The virtue of dependent failures in multi-site systems, HotDep 2005 [acmdl,pdf]
  • Read-Write Quorum Systems Made Practical, PaPoC 2021 [acmdl,arxiv,code,pdf,talk]

Byzantine fault tolerance

This section lists papers on Byzantine Fault Tolerance (BFT), often used as the basis of permissioned blockchains.

BFT surveys

  • SoK: Consensus in the Age of Blockchains, AFT 2019 [acmdl,arxiv]
  • BFT in Blockchains: From Protocols to Use Cases, ACM Computing Surveys 2021 [acmdl]
    • New survey paper on BFT, more up-to-date than "Consensus in the Age of Blockchains".

BFT in theory

  • ⭐️ Reaching Agreement in the Presence of Faults, JACM 1980 [pdf]
    • Considered to be the first proof that Byzantine agreement requires at least 3f+1 nodes to tolerate f faults.
  • ⭐️ The Byzantine Generals Problem, TPLS 1982 [acmdl,pdf]
    • Famous Lamport paper which popularized the Byzantine agreement problem
  • Asynchronous consensus and broadcast protocols, JACM 1985 [acmdl]
    • Another proof that crash fault tolerance requires 2f+1 nodes and BFT requires 3f+1 nodes.
  • Byzantine quorum systems, STOC 1997 [acmdl,pdf]
  • The load and availability of Byzantine quorum systems, PODC 1997 [acmdl]
    • Follow up to Byzantine quorum systems paper.
  • Byzantine disk paxos: optimal resilience with byzantine shared memory, PODC 2004 [acmdl,pdf]
  • Fast Byzantine Consensus, IEEE TDSC 2006 [acmdl,pdf]
    • Describes FaB, similar in nature to Q/U.
  • Bosco: One-Step Byzantine Asynchronous Consensus, DISC 2008 [acmdl,pdf]
  • Byzantine consensus in 1 round trip (instead of the usual three) using quorums of 4f+1 from 5f+1 nodes.
  • Matrix Signatures: From MACs to Digital Signatures in Distributed Systems, DISC 2008 [pdf]
  • Leaderless Byzantine Paxos, DISC 2011 [pdf]
  • Byzantizing Paxos by Refinement, DISC 2011 [acmdl,pdf]
  • Revisiting Fast Practical Byzantine Fault Tolerance, Unpublished 2017 [arxiv]
    • Describes bugs in Zyzzyva and FaB
  • Making Byzantine Consensus Live, DISC 2020 [arvix,talk]
    • Liveness proof for various BFT protocols including view synchtronization
  • Order-Fairness for Byzantine Consensus, Crypto 2020 [acmdl,pdf]
  • Quadratic worst-case message complexity for State Machine Replication in the partial synchrony model, Preprint 2022 [arxiv]
  • Liveness and Latency of Byzantine State-Machine Replication, Preprint 2022 [arxiv]
  • Byzantine Agreement in Polynomial Time with Near-Optimal Resilience. Preprint 2022 [arxiv]
  • On the Correctness of Speculative Consensus, Preprint 2022 [arxiv]
  • Basilic: Resilient Optimal Consensus Protocols With Benign and Deceitful Faults, Preprint 2022 [arxiv]

BFT in practice

  • ⭐️ Practical Byzantine Fault Tolerance, OSDI 1999 [acmdl,pdf,proof,talk]
    • Considered to be the first practical BFT-SMR protocol.
  • Separating agreement from execution for byzantine fault tolerant services, SOSP 2003 [acmdl,pdf]
    • Proposes decoupling consensus from state machine execution, similar to the distinction in Paxos between proposers/acceptors and learners.
    • Byzantized variant of Disk Paxos.
  • Fault-Scalable Byzantine Fault-Tolerant Services, SOSP 2005 [acmdl]
    • Describes the Q/U protocol, leaderless but requires 5f+1 nodes instead of 3f+1 nodes
  • HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault Tolerance, OSDI 2006 [acmdl,pdf]
  • Zyzzyva: speculative byzantine fault tolerance, SOSP 2007 [acmdl,pdf]
  • Attested Append-Only Memory: Making Adversaries Stick to their Word, SOSP 2007 [acmdl]
  • Tolerating Byzantine Faults in Transaction Processing Systems using Commit Barrier Scheduling, OSR 2007 [acmdl]
  • Upright cluster services, SOSP 2009 [acmdl,pdf,code]
    • Develops a BFT fork of Zookeeper and HDFS, source code does not seem to be used/maintained
  • Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults, NSDI 2009 [acmdl,pdf]
  • TrInc: Small Trusted Hardware for Large Distributed Systems, NSDI 2009 [pdf,acmdl]
  • Zzyzx: Scalable Fault Tolerance through Byzantine Locking, DSN 2010 [pdf]
  • Byzantine Chain Replication, OPODIS 2012 [pdf]
  • Automatic Reconfiguration for Large-Scale Reliable Storage Systems, TDSC 2012 [pdf]
    • Describes an approach to reconfigure BFT systems
  • State Machine Replication for the Masses with BFT-SMART, DSN 2014 [code,pdf]
    • BFT-SMR implementation, similar to PBFT. Often used as a benchmark against which new BFT protocols are evaluated.
  • The Next 700 BFT Protocols, TOCS 2015 [acmdl,pdf]
  • Algorand: Scaling Byzantine Agreements for Cryptocurrencies, SOSP 2017 [acmdl,pdf]
  • Hybrids on Steroids: SGX-Based High Performance BFT, Eurosys 2017 [acmdl,pdf]
  • Hardening Cassandra Against Byzantine Failures, OPODIS 2017 [pdf]
  • Casper the Friendly Finality Gadget, Tech report 2017 [arxiv]
    • Casper FFG, the proposed PoS alternative for Ethereum (aka Eth2)
  • BFT Protocols Under Fire, NSDI 2008 [pdf]
  • Algorand: A secure and efficient distributed ledger, TCS 2019 [code,pdf]
    • 2nd of the two Algorand papers
  • HotStuff: BFT Consensus with Linearity and Responsiveness, PODC 2019 [acmdl,arxiv]
  • The latest gossip on BFT consensus, Unpublished 2018 [arxiv]
  • SBFT: a Scalable and Decentralized Trust Infrastructure, DSN 2019 [arxiv,code]
  • Stellar Consensus by Instantiation, DISC 2019 [pdf]
    • Includes Isabelle/HOL proof in AFP
  • Fast and secure global payments with Stellar, SOSP 2019 [acmdl]
    • Formal verification in Ivy and Isabelle/HOL
  • Flexible Byzantine Fault Tolerance, CCS 2019 [acmdl,pdf]
  • Byzantine Ordered Consensus without Byzantine Oligarchy, OSDI 2020 [acmdl,pdf,talk]
  • Making Reads in BFT State Machine Replication Fast, Linearizable, and Live, SRDS 2021 [arxiv]
    • Identifies and fixes a liveness issue in PBFT's fast path for non-linearizable read-only operations
  • Be Aware of Your Leaders, Unpublished 2021 [pdf]
    • Reputation-based leader rotation algorithm as an alternative to simple round robin.
  • Basil: Breaking up BFT with ACID (transactions), SOSP 2021 [acmdl,arxiv,pdf]
  • BigBFT: A Multileader Byzantine Fault Tolerance Protocol for High Throughput, 2021 [arxiv]
  • Scaling Membership of Byzantine Consensus, TOCS 2021 [acmdl]
  • DiemBFT v4: State Machine Replication in the Diem Blockchain, White paper 2021 [pdf]
    • Describes the latest version of DiemBFT, based on a variant of HotStuff with 2-phases and quadratic view changes.
  • Dissecting the Performance of Chained-BFT, Preprint 2021 [arxiv]
  • Crime and Punishment in Distributed Byzantine Decision Tasks, Preprint 2022 [arxiv]
  • Dissecting BFT Consensus: In Trusted Components we Trust!, Preprint 2022 [arxiv]
  • Scalable Byzantine Fault Tolerance via Partial Decentralization, Preprint 2022 [arxiv]
  • Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing, Preprint 2022 [arxiv]
    • Parallel execution of transactions in BFT-SMR, implemented in DiemBFT.
  • Hierarchical Consensus: A Horizontal Scaling Framework for Blockchains, Preprint 2022 [pdf]
  • IA-CCF: Individual Accountability for Permissioned Ledgers, NSDI 2022 [arxiv,pdf]
  • DispersedLedger: High-Throughput Byzantine Consensus on Variable Bandwidth Networks, NSDI 2022 [pdf]
  • DAMYSUS: Streamlined BFT Consensus Leveraging, Eurosys 2022 [acmdl
  • State Machine Replication Scalability Made Simple, Eurosys 2022 [acmdl]
  • Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus, Eurosys 2022 [acmdl]
  • UTT: Decentralized Ecash with Accountable Privacy, Preprint 2022 [pdf]
  • Treaty: Secure Distributed Transactions, DSN 2022 [pdf]
  • SplitBFT: Improving Byzantine Fault Tolerance Safety Using Trusted Compartments, Middleware 2022 [acmdl],arxiv]

Alternative fault models in distributed consensus

Most of these papers handle crash faults or byzantine faults. This section considers the fault models between crash and byzantine.

  • Practical Hardening of Crash-Tolerant Systems, ATC 2012 [acmdl,pdf]
  • Visigoth Fault Tolerance, EuroSys 2015 [acmdl,pdf]
  • XFT: Practical Fault Tolerance beyond Crashes, OSDI 2016 [acmdl,pdf]
  • 💎 Protocol-Aware Recovery for Consensus-Based Storage, FAST 2018 [acmdl,pdf]
    • Enabling nodes who lose persistent storage (e.g. due to corruption) to rejoin consensus systems without reconfiguration.
    • Implemented & evaluated in LogCabin and Zookeeper, but no source code is available
    • Best paper award at FAST 2018
    • Authors' claim to have model checked with TLA+ but no spec is available
    • Featured in the morning paper

Misc

Blog posts, books, talks, dissertations, etc...


Future reading list

The following lists contain places to watch for new writings in the field of distributed consensus. They are in no particular order.

Blogroll

Reading lists

Academic conferences & symposiums

Dan Tsafrir maintains a useful list of systems conferences by deadline.

Academic workshops

Academic journals & magazines