DO NOT DISTRIBUTE OR PUBLICLY POST SOLUTIONS TO THESE LABS. MAKE ALL FORKS OF THIS REPOSITORY WITH SOLUTION CODE PRIVATE.
Ellis Michael
University of Washington
DSLabs is a new framework for creating, testing, model checking, visualizing, and debugging distributed systems lab assignments.
The best way to understand distributed systems is by implementing them. And as the old saying goes, "practice doesn't make perfect; perfect practice makes perfect." That is, it's one thing to write code which usually works; it's another thing entirely to write code which works in all cases. The latter endeavor is far more useful for understanding the complexities of the distributed programming model and specific distributed protocols.
Testing distributed systems, however, is notoriously difficult. One thing we found in previous iterations of the distributed systems class at UW is that many students would write implementations which passed all of our automated tests but nevertheless were incorrect, often in non-trivial ways. Some of these bugs would only manifest themselves in later assignments, while others would go entirely unnoticed by our tests. We were able to manually inspect students' submissions and uncover some of these errors, but this approach to grading does not scale and does not provide the immediate feedback of automated tests.
The DSLabs framework and labs are engineered around the goal of helping students understand and correctly implement distributed systems. The framework provides a suite of tools for creating automated tests, including model checking tests which systematically explore the state space of students' implementations. These tests are far more likely to catch common distributed systems bugs, especially bugs which depend on precise orderings of messages. Moreover, when a bug is found, these search-based tests output a trace which reproduces the error, making debugging dramatically simpler. Finally, DSLabs is integrated with a tool for visualizing executions of distributed systems, which allows students to graphically explore executions of their systems and visualize invariant-violating traces found by the model-checker.
The DSLabs framework is built around message-passing state machines (also known as I/O automata or distributed actors), which we call nodes. These basic units of a distributed system consist of a set of message and timer handlers; these handlers define how the node updates its internal state, sends messages, and sets timers in response to an incoming message or timer. These nodes are run in single-threaded event loops, which take messages from the network and timers from the node's timer queue and call the node's handlers for those events.
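As a rough illustration of this programming model, a node might look something like the sketch below. The class, message, and handler names here are hypothetical and the `send`/`set` primitives are stubbed out; this is not the framework's actual API, just the node/handler style it is built around.

```java
// Illustrative sketch only: a hypothetical ping client written in the
// node/handler style described above. Not the framework's actual API.
import java.util.Objects;

interface Message {}
interface Timer {}

class Ping implements Message {}
class Pong implements Message {}
class PingTimer implements Timer {}

class PingClient {
    private final String serverAddress;
    private boolean done = false;

    PingClient(String serverAddress) {
        this.serverAddress = Objects.requireNonNull(serverAddress);
    }

    // Called once at startup: send the first Ping and set a retry timer.
    void init() {
        send(new Ping(), serverAddress);
        set(new PingTimer(), 100);
    }

    // Message handler: the single-threaded event loop calls this when a Pong
    // is delivered from the network.
    void handlePong(Pong m, String sender) {
        done = true;
    }

    // Timer handler: if no Pong has arrived yet, resend and re-set the timer.
    void onPingTimer(PingTimer t) {
        if (!done) {
            send(new Ping(), serverAddress);
            set(t, 100);
        }
    }

    // In the real framework these primitives are provided to nodes; they are
    // stubbed here only to keep the sketch self-contained.
    private void send(Message m, String to) { /* hand the message to the network */ }
    private void set(Timer t, int millisFromNow) { /* add to this node's timer queue */ }
}
```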
This model of computation is typically the one we use when introducing distributed systems for the first time, and the one we use when we want to reason about distributed protocols and prove their correctness. The philosophy behind this framework is that by giving students a programming environment which mirrors the mathematical model distributed protocols are described in, we put them on the best footing to reason about their own implementations.
The lab infrastructure has a suite of tools for creating automated test cases for distributed systems. These tools make it easy to express the scenarios the system should be tested against (e.g., varying client workloads, network conditions, failure patterns, etc.) and then run students' implementations on an emulated network (it is also possible to replace the emulated network interface with an interface to the actual network).
While executing specific scenarios is useful for uncovering bugs in students' implementations, it is difficult to test all possible scenarios that might occur. Moreover, once these tests uncover a problem, it can be a challenge to discover its root cause. Because the DSLabs framework takes a node-centric view of distributed computation, it enables a more thorough form of testing: model checking.
Model checking a distributed system is conceptually simple. First, the initial state of the system is configured. Then, we say that one state of the system, s₂ (consisting of the internal state of all nodes, the state of their timer queues, and the state of the network), is the successor of another state s₁ if it can be obtained from s₁ by delivering a single message or timer that is pending in s₁. A state might have multiple successor states. Model checking is the systematic exploration of this state graph, the simplest approach being breadth-first search. The DSLabs model-checker lets us define invariants that should be preserved (e.g., linearizability) and then search through all possible orderings of events to make sure those invariants are preserved in students' implementations. When an invariant violation is found, the model-checker can produce a minimal trace which leads to the violation.
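To make this concrete, here is a toy breadth-first model checker over an abstract state graph. `successors` stands in for "deliver one pending message or timer," and none of these names come from the framework itself; it is only a sketch of the idea above.

```java
// Toy breadth-first model checker, purely for illustration. S must implement
// equals/hashCode so duplicate states are detected.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class ToyModelChecker {
    interface Expander<S> {
        // All states reachable from `state` by delivering one pending message or timer.
        List<S> successors(S state);
    }

    // Returns a shortest trace from the initial state to an invariant-violating
    // state, or null if no violation is found within maxDepth.
    static <S> List<S> findViolation(S initial, Expander<S> expander,
                                     Predicate<S> invariant, int maxDepth) {
        Map<S, S> parent = new HashMap<>();     // child -> predecessor, for trace reconstruction
        Map<S, Integer> depth = new HashMap<>();
        Deque<S> frontier = new ArrayDeque<>();
        parent.put(initial, null);
        depth.put(initial, 0);
        frontier.add(initial);

        while (!frontier.isEmpty()) {
            S s = frontier.poll();
            if (!invariant.test(s)) {
                // Walk parent pointers back to the initial state; BFS makes this trace minimal.
                LinkedList<S> trace = new LinkedList<>();
                for (S cur = s; cur != null; cur = parent.get(cur)) {
                    trace.addFirst(cur);
                }
                return trace;
            }
            if (depth.get(s) >= maxDepth) {
                continue;
            }
            for (S next : expander.successors(s)) {
                if (!parent.containsKey(next)) {
                    parent.put(next, s);
                    depth.put(next, depth.get(s) + 1);
                    frontier.add(next);
                }
            }
        }
        return null;  // the invariant held on every state explored
    }
}
```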
While model checking is useful and has been used extensively in industry and academia to find bugs in distributed systems, exploration of the state graph is still a fundamentally hard problem: the size of the graph is typically exponential in the depth of the search. To extend the usefulness of model checking even further, the test infrastructure lets us prune the portion of the state graph we explore for an individual test, guiding the search towards common problems while still exploring all possible executions in the remaining portion of the state space.
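Continuing the toy sketch above (again, this is not the framework's actual pruning API), pruning can be thought of as filtering which successors the search is allowed to follow:

```java
// Hypothetical pruning wrapper around the Expander from the toy checker above.
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class PrunedExpander<S> implements ToyModelChecker.Expander<S> {
    private final ToyModelChecker.Expander<S> inner;
    private final Predicate<S> keep;  // e.g., "a particular server stays crashed"

    PrunedExpander(ToyModelChecker.Expander<S> inner, Predicate<S> keep) {
        this.inner = inner;
        this.keep = keep;
    }

    @Override
    public List<S> successors(S state) {
        // Drop successors outside the slice of the state graph this test targets,
        // guiding the search toward the scenario of interest.
        return inner.successors(state).stream().filter(keep).collect(Collectors.toList());
    }
}
```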
This framework is integrated with the Oddity distributed systems visualization tool, created by Doug Woos. This tool allows students to interactively explore executions of the distributed systems they build. By exploring these executions, students can quickly test hypotheses about how their nodes should behave, helping them discover bugs in their protocols and gain a deeper understanding of the way their systems work. Additionally, the tool is used to visualize the invariant-violating traces produced by the model-checker.
We currently have four individual assignments in this framework. In these projects, students incrementally build a distributed, fault-tolerant, sharded, transactional key/value store!
- Lab 0 provides a simple ping protocol as an example.
- Lab 1 has students implement an exactly-once RPC protocol on top of an asynchronous network. The pieces they build in this lab are re-used in later labs.
- Lab 2 introduces students to fault-tolerance by having them implement a primary-backup protocol.
- Lab 3 asks students to take the lessons learned in lab 2 and implement Paxos.
- Lab 4 has students build a sharded key/value store out of multiple replica groups, each of which uses Paxos internally for replication. They finish by implementing a two-phase commit protocol to handle multi-key updates.
Parts of this sequence of assignments (especially labs 2 and 4) are adapted from the MIT 6.824 Labs. The finished product is a system whose core design is very similar to production storage systems like Google's Spanner.
We have used the DSLabs framework and assignments in distributed systems classes at the University of Washington.
- `framework/src` contains the interface students program against.
- `framework/tst` contains the testing infrastructure.
- `framework/tst-self` contains the tests for the interface and testing infrastructure.
- `labs` contains a subdirectory for each lab. The lab directories each have a `src` directory initialized with skeleton code where students write their implementations, as well as a `tst` directory containing the tests for that lab.
- `handout-files` contains files to be directly copied into the student handout, including the main `README` and `run-tests.py`.
- `grading` contains scripts created by previous TAs for the course to batch grade submissions.
The `master` branch of this repository is not set up to be distributed to students as-is. The `Makefile` has targets to build the `handout` directory and `handout.tar.gz`, which contain a single JAR with the compiled framework, testing infrastructure, and all dependencies. The `handout` branch of this repository is an auto-built version of the handout.
The framework and assignments are ready to be used, but they are also works in progress. There are more features planned, and we welcome pull requests.
Bug reports and feature requests should be submitted using the GitHub issues tool. Email Ellis Michael (emichael@cs.washington.edu) with any other questions.
If you use these labs in a course you teach, I'd love to hear from you!