Improve OOB change handling
dlqqq opened this issue · 1 comments
Introduction
(this section is mostly copied from #240)
An out-of-band (abbrev. OOB) change is an edit of a collaborative file not done through JupyterLab's UI. For example, suppose user A runs git reset --hard <other-branch>
while another user B is editing a collaborative document within the same repository. A's git reset
deletes all edits made by B and sets the contents of that document to some fixed value. Then the YStore table containing B's edits needs to be dropped, and this event must be somehow communicated to all Yjs clients connected to the document. Here, we say that an out-of-band change occurred on B's document.
Please see #240 for information on how OOB changes are detected & signaled. This issue focuses exclusively on how OOB changes are handled by the backend.
Description
The current implementation of how OOB changes can be improved to potentially reduce data loss and to greatly improve performance on long-running collaboration sessions.
Context: how OOB changes are handled currently
The handling of OOB changes is implemented in DocumentRoom
, mostly in the _on_outofband_change()
method. The relevant portions of its source are:
class DocumentRoom(YRoom):
"""A Y room for a possibly stored document (e.g. a notebook)."""
def __init__(
self, ...
):
...
# Listen for document changes
self._document.observe(self._on_document_change)
self._file.observe(self.room_id, self._on_outofband_change)
async def _on_outofband_change(self) -> None:
"""
Called when the file got out-of-band changes.
"""
self.log.info("Out-of-band changes. Overwriting the content in room %s", self._room_id)
self._emit(LogLevel.INFO, "overwrite", "Out-of-band changes. Overwriting the room.")
try:
model = await self._file.load_content(self._file_format, self._file_type)
except Exception as e:
msg = f"Error loading content from file: {self._file.path}\n{e!r}"
self.log.error(msg, exc_info=e)
self._emit(LogLevel.ERROR, None, msg)
return
async with self._update_lock:
self._document.source = model["content"]
self._document.dirty = False
From the above source, we see that when an OOB change occurs, DocumentRoom
will fetch the latest file contents. If that does not raise an exception, then the most important block is reached, which is highlighted below. When an OOB change occurs, this block sets the source of the YDoc to the latest file contents:
async with self._update_lock:
self._document.source = model["content"]
self._document.dirty = False
To understand this, we need to know what A) what self._document
is, and B) what happens when self._document.source
is set.
self._document
is a Ydoc from jupyter_ydoc
. jupyter_ydoc
provides YDoc-like classes that are wrappers around pycrdt.Doc
, each of which implement a unique interface for a specific file type.
jupyter_ydoc
currently provides three types of Ydoc-like classes:
YUnicode/YFile
: a YDoc that represents a UTF-8 plaintext fileYNotebook
: a YDoc that represents a notebook fileYBlob
: a YDoc that represents a blob file
We will refer to these classes provided by jupyter_ydoc
as Jupyter Ydocs, for specificity.
We now know what self._document
is. It is one of the 3 Jupyter Ydocs, all of which inherit from jupyter_ydoc.YBaseDoc
. Now we need to know what happens when self._document.source
is set.
Here are the relevant portions of YBaseDoc
's source that determine the behavior of setting its source
attribute:
class YBaseDoc(ABC):
"""
Base YDoc class.
This class, defines the minimum API that any document must provide
to be able to get and set the content of the document as well as
subscribe to changes in the document.
"""
...
@property
def source(self) -> Any:
"""
Returns the content of the document.
:return: The content of the document.
:rtype: Any
"""
return self.get()
@source.setter
def source(self, value: Any):
"""
Sets the content of the document.
:param value: The content of the document.
:type value: Any
"""
return self.set(value)
@abstractmethod
def get(self) -> Any:
"""
Returns the content of the document.
:return: Document's content.
:rtype: Any
"""
@abstractmethod
def set(self, value: Any) -> None:
"""
Sets the content of the document.
:param value: The content of the document.
:type value: Any
"""
As shown above, get()
and set()
are implemented in the YDoc class directly inheriting from YBaseDoc
. This means that ultimately, the handling of an OOB change depends on the file type, and is determined by the definition of set()
in the corresponding Jupyter Ydoc.
Issue part 1: Verify that OOB change handling doesn't lead to data loss
I have looked through the latest implementation of the set()
method on YUnicode
, YNotebook
, YBlob
, and could not find any immediate concerns. Each implementation flags all of the existing items for deletion, and then inserts the new content from the beginning. This does not delete the Ydoc and therefore should not result in any data loss. I've also tested the OOB change handling in the latest version of jupyter_collaboration
with this one-liner shell script:
sleep 5; cat Untitled.ipynb | jq '.cells[0].source[0] = "i am an out-of-band edit!"' > temp.ipynb && mv temp.ipynb Untitled.ipynb
This script sleeps 5 seconds, then makes an OOB change that sets the content of the first cell to a fixed string. I ran this script and immediately began making edits to the first cell until the OOB change occurred. This did not lead to any data loss for me locally.
While I have not yet found any specific code paths that lead to data loss in the latest implementation, I think that it remains debatable whether the current OOB change handling leads to data loss in certain scenarios. This is supported by the fact that myself and others have noticed OOB changes being logged by the extension immediately prior to data loss incidents. See #219 as an example.
Issue part 2: Each OOB change grows the Ydoc by at least the new file size
The implementations of set()
, while correct, have the unfortunate consequence of growing the Ydoc's size by a factor of the new file size. That is, an OOB change on a 0.5 MB notebook will grow the Ydoc by at least 0.5 MB.
You can confirm this locally with these steps:
- Save
1kb.txt
locally. - Open JupyterLab with
jupyter_collaboration
installed. - Open
1kb.txt
through JupyterLab. - Record the file size of
.jupyter_ystore.db
.- You can do so via
ls -lah . | grep 'ystore.db'
.
- You can do so via
- Make 5 OOB changes by running this one-liner script in your terminal:
touchit() { sleep 1; touch -m 1kb.txt }; for i in {1..5}; do touchit; done
- Wait a few seconds, then record the file size of
.jupyter_ystore.db
again.
On my machine, the Ystore starts at 12K and grows to 20K after making 5 OOB changes. I expected a growth in size of at least 5K, and observed a growth of 8K. This test confirms that currently, each OOB change grows the Ydoc and Ystore by at least the new file size.
Larger Ydocs lead to poorer performance. The most significant consequence is that a larger Ydoc results in a slower connection for new users, as the entire Ydoc is streamed to new clients. I will elaborate on this in a future issue concerning the YStore specifically.
Proposed next steps
-
Increase confidence in the claim that OOB changes do not lead to data loss. To do so, we should:
- Add test coverage of OOB change handling.
- Simplify OOB change handling code.
-
Reduce the growth in YStore size on OOB changes
- For example, for "small" changes, it is preferable to compute a diff and only apply that diff as a Yjs item. This would drastically reduce the growth in YStore size per OOB change. This may be implemented as a boolean configurable to avoid a breaking change, though I personally do not see this as a breaking change.
- It is also worth considering a more general configurable that allows for other ways of handling OOB changes. For example, we may want
DocumentRoom
either to a) completely ignore OOB changes, or b) to delete the YDoc from memory, drop the YStore table, and begin anew. - Credit to @ellisonbg for some of these suggestions. Thank you!
Reduce the growth in YStore size on OOB changes
- For example, for "small" changes, it is preferable to compute a diff and only apply that diff as a Yjs item. This would drastically reduce the growth in YStore size per OOB change. This may be implemented as a boolean configurable to avoid a breaking change, though I personally do not see this as a breaking change.
What you call "computing a diff" would require interpreting OOB changes and map them to e.g. a notebook schema, which is not an easy task. In this issue I proposed a bridge between different document structures, that is basically solving the same issue.
To give you an idea of the complexity, say an OOB inserts a character in a notebook cell. Now you need to infer the cell index where the change applies, the position of the added character in the cell source, and translate that into a Y operation (fake API here):
ycell = ynotebook["cells"][idx]
ytext = ycell["source"]
ytext.insert(pos, char)
For more complex changes, it will even be ambiguous to infer changes. See for instance how git diff
cannot sometimes infer that something has moved, and it will just mark the change as a deletion and an addition (and git
works with lines of text, which is a very basic document structure).