
15-859HH Final Project: Log-Time Text CRDT

See the project presentation for background.

src/crdt/text_logn.ts contains the text CRDT implementation. It has an API similar to Collabs's built-in CText, but a different implementation. It performs all operations in O(log(n) + c) time, where n is the text length including deleted elements and c <= numUsers is the maximum degree of concurrency (the number of concurrent edits at the same position).
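As a rough illustration of the intended API, here is a hypothetical usage sketch. The import path, class name, and method signatures are assumptions modeled on the CText comparison above, not the repo's actual exports; in particular, real Collabs collaborative objects are registered with a runtime/document rather than constructed directly.

```ts
// Hypothetical usage sketch: the import path, class name, and signatures are
// assumptions modeled on Collabs's CText, not the repo's actual exports.
import { TextLogN } from "./crdt/text_logn";

const text = new TextLogN();    // real Collabs objects attach to a runtime
text.insert(0, "Hello world!"); // send insert: O(log(n) + c)
text.delete(5, 6);              // send delete of " world": O(log(n) + c)
console.log(text.toString());   // "Hello!"
```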

As a demo, npm start runs the collaborative plain text editor from Collabs's demos, but using this text CRDT instead of CText.

The src/site/ folder and this repo's build setup are mostly copied from Collabs's plaintext editor demo, while the src/benchmarks/ folder reuses some code from Collabs's benchmarks. However, the src/crdt folder was written from scratch for this project.

Usage

Installation

First, install Node.js. Then run npm i.

npm run build

Build the project.

npm run benchmarks

Run benchmarks, putting results in outfolder/single.csv and outfolder/multi.csv. These compare the per-operation processing time for four implementations:

  • collabs: The Collabs library's built-in text CRDT, CText.
  • ground_truth: A simple implementation that uses the ground-truth tree with one character per tree node, with no optimizations (practical or asymptotic).
  • balanced: Same as ground_truth, but with an extra balanced view of the text that is used to convert between nodes and indices in O(log(n)) time (see the sketch after this list).
  • logn: Final implementation with O(log(n) + c) asymptotic time for all operations (send insert, receive insert, send delete, receive delete), including the time to emit events describing received operations and their indices.
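The node-to-index conversion used by the balanced and logn implementations is the classic order-statistic technique: augment each tree node with its subtree size. The sketch below is not taken from the repo; it illustrates the idea, with rebalancing (e.g. AVL rotations, which keep the height O(log(n))) omitted for brevity.

```ts
// Size-augmented tree over all nodes (including deleted ones), so that
// node <-> index conversion walks one root-to-node path: O(height) time.

interface ListNode {
  left: ListNode | null;
  right: ListNode | null;
  parent: ListNode | null;
  size: number; // number of nodes in this subtree
}

// Index of `node` in the in-order traversal.
function nodeToIndex(node: ListNode): number {
  let index = node.left ? node.left.size : 0;
  let cur = node;
  while (cur.parent !== null) {
    if (cur === cur.parent.right) {
      // Everything in the parent's left subtree, plus the parent itself,
      // precedes cur in the in-order traversal.
      index += 1 + (cur.parent.left ? cur.parent.left.size : 0);
    }
    cur = cur.parent;
  }
  return index;
}

// Node at position `index` in the in-order traversal.
function indexToNode(root: ListNode, index: number): ListNode {
  let cur = root;
  while (true) {
    const leftSize = cur.left ? cur.left.size : 0;
    if (index < leftSize) cur = cur.left!;
    else if (index === leftSize) return cur;
    else {
      index -= leftSize + 1;
      cur = cur.right!;
    }
  }
}
```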

Each output file corresponds to one benchmark:

  • single.csv shows the times to perform (send) all operations in Martin Kleppmann's automerge-perf text editing trace. Times are averaged across groups of 1000 operations; the reported value is the average time per op, in ns (a minimal timing sketch follows this list).
  • multi.csv shows the time to process (receive) messages generated by 10 users each typing 100 left-to-right characters concurrently.
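For reference, single.csv-style numbers can be produced with per-group timing along the following lines. This is a minimal sketch, not the repo's harness: the trace entry shape and the applyEdit callback are assumptions for illustration; see src/benchmarks/ for the real code.

```ts
// Minimal timing sketch. The Edit shape and applyEdit are hypothetical.
import { performance } from "perf_hooks";

// Roughly the automerge-perf edit shape: [position, numDeleted, inserted?].
type Edit = [number, number, string?];

function averagePerOpNs(
  trace: Edit[],
  applyEdit: (e: Edit) => void
): number[] {
  const GROUP = 1000;
  const result: number[] = [];
  for (let start = 0; start < trace.length; start += GROUP) {
    const group = trace.slice(start, start + GROUP);
    const t0 = performance.now();
    for (const edit of group) applyEdit(edit);
    const elapsedMs = performance.now() - t0;
    // performance.now() returns ms; report the average ns per op.
    result.push((elapsedMs * 1e6) / group.length);
  }
  return result;
}
```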

npm start

Run the demo server. Open http://localhost:3000/ to view. Use multiple browser windows at once to test collaboration.

See @collabs/container-testing-server for usage info.

npm run clean

Delete dist/ and build/.

Algorithm

The algorithm uses the simple binary-ish tree CRDT described in these slides. Each device maintains three views of the tree:

  1. A "ground-truth" tree: the literal tree described in those slides, with one character per node.
  2. A balanced (AVL) tree representing the ordered list of nodes, including deleted nodes. This is used to convert between nodes and indices in O(log(n)) time. Note that this tree is local, not replicated: different devices may balance the tree in different ways.
  3. A means of mapping from each node to its leftmost (respectively, rightmost) descendant in O(log(n)) time. Concretely, this is a collection of AVL trees, one per contiguous leftmost path. To find a node's leftmost descendant, go up its AVL tree to the root, then down the AVL tree's right side to its last element. When a node is inserted, if it is a leftmost child, it is appended to its parent's AVL tree; if it breaks an existing leftmost path, that path's AVL tree is split. The code calls this set of trees a "Split-Append List Manager (SALM)"; see the sketch below. Note that these trees are again local, not replicated.
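To make view 3 concrete, here is a simplified sketch of the SALM idea for leftmost paths (not the repo's code). Plain arrays stand in for the per-path AVL trees, so append and split cost O(n) here; the real implementation's balanced trees make them O(log(n)).

```ts
// Simplified Split-Append List Manager (SALM) sketch. Arrays stand in for
// the per-path AVL trees, so append/split are O(n) here instead of O(log(n)).

interface TreeNode {
  id: string;
  path: Path | null;   // the leftmost path containing this node
  indexInPath: number; // this node's position (depth) within that path
}

class Path {
  nodes: TreeNode[] = [];
  append(node: TreeNode): void {
    node.path = this;
    node.indexInPath = this.nodes.length;
    this.nodes.push(node);
  }
}

class SALM {
  // A node that is not a leftmost child (the root, or a non-first sibling)
  // starts a fresh singleton path.
  insertOther(node: TreeNode): void {
    new Path().append(node);
  }

  // Called when node is inserted as the new leftmost child of parent.
  insertLeftmostChild(parent: TreeNode, node: TreeNode): void {
    const path = parent.path!;
    if (parent.indexInPath === path.nodes.length - 1) {
      // Parent was the end of its path: extend the path.
      path.append(node);
    } else {
      // Node breaks an existing leftmost path: split off the old
      // continuation below parent, then append node after parent.
      const tail = new Path();
      for (const moved of path.nodes.splice(parent.indexInPath + 1)) {
        tail.append(moved);
      }
      path.append(node);
    }
  }

  // A node's leftmost descendant is the last (deepest) element of its path.
  // (With AVL trees: walk up to the tree root, then down its right spine.)
  leftmostDescendant(node: TreeNode): TreeNode {
    const nodes = node.path!.nodes;
    return nodes[nodes.length - 1];
  }
}
```

A symmetric structure over rightmost paths handles rightmost descendants.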

Future work

  • Benchmark right-to-left insertions. This is the worst case for Collabs (it devolves to linear-time insertions) but should be fine for the log-time algorithm.
  • Better multi.csv benchmark. I was hoping to see processing time spike every 100 ops when you insert the first character from a new user (necessitating a linear-time search to its nearest sibling's rightmost/leftmost descendant), but all I can see are random spikes, probably due to garbage collection.
  • Investigate whether the 3rd tree view is worthwhile. It is most likely not worth it in practice, i.e., it will increase the average cost per op in order to make a few rare ops (concurrent insertions) faster. Perhaps it is also not worth it asymptotically, i.e., one might be able to show that the balanced algorithm (first two views only) has amortized O(log(n) + c) time per op.
  • Combine these algorithmic optimizations with the practical optimizations in Collabs's CText. This is important for making memory usage practical: currently, the log-time algorithm uses 600 bytes per character!