Fleece deltas on UTF-8 strings produce invalid JSON strings

Question

Fleece deltas on UTF-8 strings produce invalid JSON strings

bbrks opened this issue 6 years ago · 4 comments

Fleece deltas generated on multi-byte UTF-8 strings create invalid UTF-8 sequences in the delta, which cannot be represented by JSON strings without first being encoded. This causes patching of strings to fail when the Go side consumes a delta generated by Fleece.

Given the following documents to diff, the raw bytes of the delta generated by Fleece are as follows (encoded as hexadecimal):

v1.json: {"mystr": "யாமறிந்த மொழிகளிலே தமிழ்மொழி போல் இனிதாவது எங்கும் காணோம், பாமரராய் விலங்குகளாய், உலகனைத்தும் இகழ்ச்சிசொலப் பான்மை கெட்டு, நாமமது தமிழரெனக் கொண்டு இங்கு வாழ்ந்திடுதல் நன்றோ? சொல்லீர்! தேமதுரத் இகழ்ச்சிசொலப் உலகமெலாம் பரவும்வகை செய்தல் வேண்டும்."}
v2.json: {"mystr": "யாமறிந்த மொழிகளிலே தமிழ்மொழி போல் இனிதாவது எங்கும் காணோம், பாமரராய் விலங்குகளாய், உலகனைத்தும் இகழ்ச்சிசொலப் பான்மை கெட்டு, நாமமது தமிழரெனக் கொண்டு இங்கு வாழ்ந்திடுதல் நன்றோ? கொண்டு! தேமதுரத் தமிழோசை உலகமெலாம் பரவும்வகை செய்தல் வேண்டும்."}
delta:   \x7B\x6D\x79\x73\x74\x72\x3A\x5B\x22\x34\x37\x38\x3D\x31\x2D\x31\x2B\x95\x7C\x35\x3D\x31\x2D\x31\x2B\xA3\x7C\x35\x3D\x31\x30\x2D\x34\x2B\x9F\xE0\xAF\x81\x7C\x32\x39\x3D\x31\x39\x2D\x34\x2B\xA4\xE0\xAE\xAE\x7C\x35\x3D\x31\x33\x2D\x31\x30\x2B\xB4\xE0\xAF\x8B\xE0\xAE\x9A\xE0\xAF\x88\x7C\x31\x30\x34\x3D\x22\x2C\x30\x2C\x32\x5D\x7D

We can pass the raw bytes above into this tool as hexadecimal and see multiple "Unexpected continuation byte." appearances where the problems are occurring.

The delta works to patch the document when fed directly back into Fleece as raw bytes (e.g. in a unit test), but when unmarshalled as a UTF-8 string, the invalid UTF-8 codepoints are replaced with U+FFFD in Go, which ends up looking like this, and doesn't work for obvious reasons:

{"mystr":["478=1-1+�|5=1-1+�|5=10-4+�ு|29=19-4+�ம|5=13-10+�ோசை|104=",0,2]}

Ideally the string diffs wouldn't be splitting bytes inside UTF-8 codepoints, and things would "just work" with valid UTF-8 codepoints, but given the string deltas are being produced by diffmatchpatch outside of Fleece, maybe we just don't do string deltas for any strings containing characters outside of the ASCII range?

This issue can also be seen in the commented out unit test inside #33

Answer 1 · 2019-01-07T19:18:54.000Z

diff_match_patch doesn't appear to have any support for UTF-8 input; instead it supports Unicode by allowing C++ wide strings. So one solution to the problem is to convert both strings to wstring, do the diff on those, then convert the character offsets back to byte offsets when writing the string form of the delta.

...which seems kind of expensive, for a situation that's probably(?) not going to be super common. We can optimize by doing the diff the existing way, then checking for invalid UTF-8 sequences, and if one is found falling back to the expensive way.

Or I wonder if we can detect the invalid range when Fleece processes the diff_match_patch output, and adjust it to be valid? The only case we need to fix is the + operation, since that's where bytes from the input string appear in the delta.

Answer 2 · 2019-01-07T19:21:41.000Z

Another option: encode invalid byte sequences in the delta. For example, URL-style %-encoding. The only bytes that need to be encoded are invalid UTF-8 sequences, and of course a literal %.

Answer 3 · 2019-01-07T19:31:27.000Z

@snej myself and @adamcfraser came up with a similar solution to your last comment, where we preserve these partial bytes with some encoding on both sides, and I decode it before patching. Seems to be the least amount of hassle.

I was initially thinking of base64 encoding the full string to preserve any raw bytes, but escaping them individually would also work.

The associated work for Sync Gateway is being tracked here: https://issues.couchbase.com/browse/CBG-174

Answer 4 · 2019-01-08T18:01:55.000Z

I fixed it by editing the patch ranges produced by diff_match_patch. For each patch that represents an insertion or deletion, the ends of the range are pushed outwards so they fall on character boundaries. This is done by looking at the first byte in the range and the first byte past the range — if either's most significant 2 bits are 10 it is a UTF-8 continuation byte, so the range needs to be pushed out past it (by decrementing its start or incrementing its end.)