A distributed database for Yjs documents.
Ydb is the ideal endpoint for shared editing apps based on Yjs. It implements a unique way to persist shared documents and implements a new syncing mechanism that allows users to sync thousands of documents in a single call. While Ydb is an endpoint for Yjs, Ydb itself does not understand the content of a shared document.
The goal of Yjs was always to bring a reliable approach to edit documents in real-time and while offline. Shared documents are stored in indexeddb to allow faster access and sync offline content to other clients when a connection is available. However, syncing thousands of documents is not feasible with Yjs sync, because it requires to load the shared document to memory.
Previous implementations of Yjs endpoints are based on key-value storages like LevelDB. However, storage for shared documents based on Yjs is unique for several reasons:
- A Yjs document is an append-only log of operations. It does not require indexing or querying that is an integral part of conventional databases. In Ydb, a client fetches changes based on the byte-size of the shared document. If the size doesn't match, Ydb computes the unknown content as the tail of the document and sends it to the client. More about offset-based syncing and shrinking of shared documents in #Syncing
- Shared documents are edited over a short period. There is a good chance that the shared document is never modified after that. When a document is not edited over an extended period, Ydb can store the document in Object Storage instead. Depending on the provider you use, Object Storage may be 10x times cheaper than storage on a local HDD.
- While users are active, Ydb also shares presence information like names of active users on a document/collection, cursor location, and availability status. Presence information is not persisted but shared among all active users.
In terms of the CAP theorem, Ydb is an available (A), partition-tolerant (P) distributed database. This means that the communication between the servers may fail, but clients can always sync their documents. It has been shown that it is impossible to implement a distributed database, that also provides consistency (C) - clients always receive the latest content. However, since operations in Yjs are commutative, Ydb also ensures eventual consistency (C) among the Ydb instances and clients.
Document: A shared document that is uniquely defined by its UUID. A document must be part of a collection. Internally it is handled as an append-only log of changes. A document may be shrunk by doing garbage collection on the shared document with Yjs.
Collection: A collection is a set of documents. A collection must be associated with a user account.
(document) Host: Each document is assigned to a Ydb instance (see [#Data distribution and hashing](#Data distribution and hashing)). The Ydb instance that owns a document is the document host. It will handle the distribution of updates and persistence information. This is not necessarily the instance we connect to. Hence the distinction.
Ydb stores documents as a simple append-only file on the hard drive. The document
instance also holds the amount of information ever created on the document as offset and a unique identifier documentSessionID that is randomly generated by a Ydb instance. The documentSessionID guarantees that client and host talk about the same document. In case a Ydb instance closes unexpectedly, Ydb elects new hosts for the documents hosted on the failed instance. In this case, the documentSessionID changes because Ydb can't ensure that the clients talk about the same document anymore. Since the session id changes, clients are forced to resync the document by either doing a [Yjs based sync](#Yjs based sync) or a [Brute-force sync](#Brute-force sync).
TODO: sync protocol here
Yjs based sync is a minimal sync in terms of data exchanged. However, it requires both client and server to load the document to memory.
- Sync Step 1: Client computes the state vector of the local document and sends it to the server. This is basically an array of integer pairs:
[ [userid1, amountOfContentCreatedByUserId1], ..]
- Sync step 2: Server loads document to memory. Then it computes all missing operations based on sync step 1 and sends it to the client. It also computes a state vector and sends it to the client. Then the document is removed from memory. 3 Update server: The client computers all missing changes based on the state vector provided by the server and sends it to the server.
- The server appends missing operations from the client to the local document file. Client and server are in sync.
Ydb does not understand the Yjs model at the moment. Therefore we use [Brute-force sync](#Brute-force sync) as it is easier to implement and more reliable for now. Ydb will support Yjs based sync in the future.
Simple. If the ``documentSessionID` does not match, the client enforces a brute-force sync. Both client and server exchange the local version of the document and append it to the local document structure.
The client will be able to recognize duplicate operations. However, the size of the document stored on the server essentially doubles until it is shrunk.
Keep in mind that brute-force sync is only enforced when a Ydb instance dies unexpectedly. The advantage of brute-force sync is that it does not require Ydb to load the Yjs structure to memory. However, the disadvantage is that potentially vast amount of content needs to be exchanged between client and server until they sync. Brute-force sync will ensure eventual consistency until we implemented a Go version of the Yjs sync protocol.
https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8
https://arxiv.org/pdf/1406.2294.pdf
- rewrite writeVaruint to only accept 32 uints as js does only support 32 bit encoding