Support Symbols in the Astral Plane
roark opened this issue · 12 comments
Hi @jhchen,
I've been happily enjoying your fast-diff library in a custom editor that I'm building.
However, I came across an issue when trying to support code points outside of the BMP. In its current implementation, fast-diff compares each JS "character", i.e. each code unit. Because ES5 JavaScript iterates over code units rather than code points, this creates issues with symbols outside the BMP.
A diff of two different symbols such as "🐶" and "🐯" results in the first code unit matching and the second code unit not matching, thereby splitting up the surrogate pairs and potentially causing unwanted side effects.
Fast-Diff Produces:
[ [ 0, '\uD83D' ], [ -1, '\uDC36' ], [ 1, '\uDC2F' ] ]
Rather than:
[ [ -1, '🐶' ], [ 1, '🐯' ] ]
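The split can be reproduced with plain JavaScript string indexing, since JS strings expose UTF-16 code units rather than code points (a minimal illustration; 🐶 is U+1F436):

```javascript
// A JS string is a sequence of UTF-16 code units, not code points.
var dog = '\uD83D\uDC36'; // 🐶 (U+1F436) as a surrogate pair

console.log(dog.length);          // 2 -- two code units, one symbol
console.log(dog.charCodeAt(0));   // 55357 (0xD83D, high surrogate)
console.log(dog.charCodeAt(1));   // 56374 (0xDC36, low surrogate)
console.log(dog[0]);              // a lone high surrogate, not a character
```

Any character-by-character algorithm built on indexing or charCodeAt will therefore see two separate "characters" per astral symbol.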
I've modified your library to handle surrogate pairs as single "symbols". You can find my changes at fast-diff-astral. If you would like I can submit a pull request.
The method it uses is to break the string up into an array of symbols, thereby allowing iteration over each symbol and comparison of each surrogate pair as a single code point.
This may add some overhead in terms of processing speed, which, along with the scale of the changes, is why I broke it out into a separate library. That said, it still seems quick, but I haven't measured the speed difference.
I've chosen to use a custom method to find and group surrogate pairs into an array rather than using the new ES6 String iterator methods. I did so only to avoid introducing a transpiler to the build process.
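The general shape of such a grouping pass in plain ES5 looks like this (a minimal sketch, not the actual fast-diff-astral code; `toSymbols` is a hypothetical helper name):

```javascript
// Minimal ES5 sketch: split a string into an array of "symbols",
// keeping each surrogate pair together as a single entry.
function toSymbols(str) {
  var symbols = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    // High surrogate followed by a low surrogate: emit both units as one symbol.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        symbols.push(str.substr(i, 2));
        i++;
        continue;
      }
    }
    symbols.push(str[i]);
  }
  return symbols;
}

toSymbols('a🐶b'); // → [ 'a', '🐶', 'b' ]
```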
Hope you get a chance to look over it. Let me know if you'd like a pull request.
Cheers!
Refs: Great article by Mathias Bynens on JS unicode support, if you haven't already read it.
Thank you for the detailed report.
A diff of two different symbols such as "🐶" and "🐯" results in the first code unit matching and the second code unit not matching, thereby splitting up the surrogate pairs and potentially causing unwanted side effects.
Have you found this to be an issue in practice? The diff already breaks up characters within words, for example, and that has not been an issue.
Performance would be the other concern. The diff has to work on very large strings (pages and pages of text). Right now it scales with the number of changes, so it works very well even on large strings. I would be concerned that this is lost when converting a large string to an array and back. Intuitively this overhead is not trivial, but numbers would trump everything.
Great question and valid concerns. In its current form the diff generated, even with surrogate pairs being broken up, remains accurate and can be applied to the original string to generate the second string.
Have you found this to be an issue in practice?
I did encounter an issue when passing the diff object to a server over websocket (npm ws).
The diff goes through JSON.stringify (client side) and then JSON.parse (server side). On parsing, when only half of a surrogate pair is present, it is parsed into the Replacement Character � (\uFFFD, code 65533) rather than the original char code.
"🐶" // is '\uD83D\uDC36' or 55357 56374
"🐯" // is '\uD83D\uDC2F' or 55357 56367
// produces diff of:
[ [0,"\uD83D"], [-1,"\uDC36"], [1,"\uDC2F"] ]
// after JSON.stringify on client & JSON.parse on server:
[ [0,"\uFFFD"], [-1,"\uFFFD"], [1,"\uFFFD"] ]
This does seem unusual, because if you run JSON.stringify and then JSON.parse locally, with the surrogate halves broken up, the char codes still remain accurate.
However, once passed over the websocket, lone surrogate halves are parsed as "\uFFFD". I assume this is because a surrogate half on its own is not a valid code point.
see: ( \uD83D \uDC36 )
When the surrogate pairs are sent together and go through the stringify / parse process, they remain accurate and valid.
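This matches how UTF-8 encoders handle unpaired surrogates in general: a lone surrogate is not a valid Unicode scalar value, so encoding substitutes U+FFFD. A quick sketch using Node's Buffer API shows the same behavior:

```javascript
// Encoding a lone surrogate half to UTF-8 replaces it with U+FFFD,
// while a complete surrogate pair round-trips intact.
var lone = '\uD83D';                // high surrogate on its own
var pair = '\uD83D\uDC36';          // full surrogate pair (🐶)

var loneOut = Buffer.from(lone, 'utf8').toString('utf8');
var pairOut = Buffer.from(pair, 'utf8').toString('utf8');

console.log(loneOut === '\uFFFD');  // true -- replaced on encode
console.log(pairOut === pair);      // true -- survives the round trip
```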
I don't yet know if this is a protocol-level decision or possibly an issue with the ws package. I also haven't tested whether this is an issue with other websocket libraries or over other protocols.
In regards to performance...
Intuitively this overhead is not trivial but numbers would trump everything.
Your intuition is generally correct. The extra overhead of parsing each string into an array of code-point symbols is non-trivial, and it really shows on strings of 1,000 characters or fewer.
Interestingly, and much to my surprise, there seem to be some cases where fast-diff-astral actually performs slightly faster, even with the extra overhead of the up-front conversion from string to array (presumably because arrays perform better than strings here?).
Even more interesting, those cases occur when the string lengths exceed 6,000 characters. So even with the large up-front conversion cost, the better performance on arrays apparently makes up for it.
However, it isn't all good news: there are only minimal gains on very long strings but large losses on shorter ones. So depending on the workload and the number of iterations being performed, the differences can really add up.
I put together a little script to performance test fast-diff vs fast-diff-astral. You can try it yourself and play with the number of iterations / lengths if you'd like.
Performance results (similar on multiple runs):
Computing 100 diffs, string length 10, with seed 5888...
fast-diff: 2ms
fast-diff-astral: 9ms
Computing 100 diffs, string length 100, with seed 5359...
fast-diff: 35ms
fast-diff-astral: 110ms
Computing 100 diffs, string length 1000, with seed 155...
fast-diff: 1135ms
fast-diff-astral: 1843ms
Computing 100 diffs, string length 10000, with seed 7132...
fast-diff: 112460ms
fast-diff-astral: 100477ms
Computing 10 diffs, string length 100000, with seed 4213...
fast-diff: 818379ms
fast-diff-astral: 686297ms
All that said, when running with larger iteration counts, I sometimes saw fast-diff-astral's gains on long strings diminish, and I don't know what to make of that.
Possibly, with the new ES6 string spread operator coming, the up-front overhead of converting a string into an array of symbols would be minimized and a speed increase could be gained.
(Just out of curiosity, I will try Babel's transpiled version of the string spread operator to see if it performs better than the very manual version built in right now... though I would imagine the eventual native ES6 implementation would be even faster.)
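For comparison, the ES6 constructs are code-point aware out of the box, so no manual surrogate grouping is needed:

```javascript
// ES6 string iteration works on code points, keeping surrogate pairs intact.
var str = 'a\uD83D\uDC36b';      // 'a🐶b'

var spread = [...str];           // spread operator
var from = Array.from(str);      // Array.from uses the same string iterator

console.log(spread);             // [ 'a', '🐶', 'b' ]
console.log(from.length);        // 3 -- three symbols, not four code units
```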
That's interesting there are cases where the array runs faster.
One testing variant to focus on, though, is diffs where there are a small number of changes. This allows applications to be very responsive and emit change events very often, even on large documents or strings. In fact, the more responsive the application, the smaller the number of batched changes and the faster the overall system. If we make this modification to your test (the else branch):
function generateRandomStrings(seed, length, count, changes = 3) {
  var strings = [];
  var random = seedrandom(seed);
  for (var i = 0; i <= count; ++i) {
    if (strings.length === 0) {
      var chars = [];
      for (var l = 0; l < length; ++l) {
        var letter = ALPHABET.substr(Math.floor(random() * ALPHABET.length), 1);
        chars.push(letter);
      }
      strings.push(chars.join(''));
    } else {
      var next = strings[strings.length - 1];
      for (var j = 0; j < changes; ++j) {
        var index = Math.floor(random() * next.length);
        var letter = ALPHABET.substr(Math.floor(random() * ALPHABET.length), 1);
        next = next.slice(0, index) + letter + next.slice(index + 1);
      }
      strings.push(next);
    }
  }
  return strings;
}
You can see where the performance diverges, regardless of document size.
Computing 1000 diffs, string length 10, with seed 9119...
fast-diff: 17.482ms
fast-diff-astral: 55.444ms
Computing 1000 diffs, string length 100, with seed 6004...
fast-diff: 8.382ms
fast-diff-astral: 70.755ms
Computing 1000 diffs, string length 1000, with seed 3335...
fast-diff: 12.816ms
fast-diff-astral: 174.641ms
Computing 1000 diffs, string length 10000, with seed 522...
fast-diff: 46.190ms
fast-diff-astral: 1509.164ms
Computing 10000 diffs, string length 10, with seed 46...
fast-diff: 68.268ms
fast-diff-astral: 315.805ms
Computing 10000 diffs, string length 100, with seed 1207...
fast-diff: 103.920ms
fast-diff-astral: 585.409ms
Computing 10000 diffs, string length 1000, with seed 3921...
fast-diff: 153.088ms
fast-diff-astral: 1656.690ms
Computing 10000 diffs, string length 10000, with seed 682...
fast-diff: 406.333ms
fast-diff-astral: 15454.475ms
The diff library has an optimization for 1-2 changes so it's even faster in those cases.
Interesting... It makes a lot of sense that your fast-diff library would be much faster in cases where there are a small number of differences between the strings (which is probably the most common case).
I did keep your optimization of a binary search for common prefixes and suffixes, and it works quickly with arrays as well. That said, in those cases the majority of the diffing time for the fast-diff-astral library is the initial conversion from string to array, which leads to the massive performance divergence.
I doubt there is any way to make fast-diff-astral nearly as fast in those cases, unless the native ES6 string spread operator were almost instantaneous... or unless it took a different approach: not converting to an array at all, and instead making major changes to watch for surrogate pairs. (That would still add some extra time, since the library would be repeatedly checking surrounding characters for pairs.)
I think there are still some use cases for performing diffs that don't split the surrogate pairs, but ultimately it will likely never be as fast.
Therefore, it probably makes the most sense to keep these libraries separate, since for many use cases the speed of diffing large, similar strings is most important and splitting surrogate pairs has little consequence. I'll also add a note to the fast-diff-astral README explaining the speed difference and mentioning that the library should only be used when not splitting surrogate pairs is exactly the functionality needed.
Sounds good, thanks for looking into this.
I did run into this issue in practice when the back end is not Javascript. In those cases converting between UTF-16 and UTF-8 will fail if strings contain partial surrogate pairs. So this is now fixed in fast-diff itself.
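One way such a fix can work (a simplified sketch of the idea, not necessarily the exact code now in fast-diff; `fixSplitSurrogates` is a hypothetical helper name) is a post-processing pass over the diff tuples: when an equality ends in a lone high surrogate, shift that code unit down into the following delete/insert runs so every emitted string stays valid UTF-16:

```javascript
// Simplified sketch of a post-processing pass: if an "equal" run ends in a
// lone high surrogate, push that unit down into the delete/insert runs that
// follow, so no tuple contains half of a surrogate pair.
function fixSplitSurrogates(diffs) {
  for (var i = 0; i < diffs.length; i++) {
    var text = diffs[i][1];
    var last = text.charCodeAt(text.length - 1);
    if (diffs[i][0] === 0 && last >= 0xD800 && last <= 0xDBFF) {
      var high = text.slice(-1);
      diffs[i][1] = text.slice(0, -1);              // trim the lone high surrogate
      for (var j = i + 1; j < diffs.length && diffs[j][0] !== 0; j++) {
        diffs[j][1] = high + diffs[j][1];           // re-attach it to each change
      }
      if (diffs[i][1] === '') diffs.splice(i--, 1); // drop an emptied equality
    }
  }
  return diffs;
}

fixSplitSurrogates([[0, '\uD83D'], [-1, '\uDC36'], [1, '\uDC2F']]);
// → [ [ -1, '🐶' ], [ 1, '🐯' ] ]
```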
Does this fix affect slab/quill#1230?
The different length of "🐶" in JavaScript and other languages breaks OT.
It should help the approach where the back end tries to count characters in UTF-16, but I'm not sure if it completely solves it yet.
I'm currently using UTF-8 on the server and I don't see a difference after this change.
I'll try to use UTF-16.
Thanks.
To be clear, when it reaches the back end it has to be UTF-8 off the network, but the counting in the OT code would treat strings as UTF-16. Consider this:
Initial Doc: "Go 🐶!"
Delta: new Delta().retain(3).insert("🐯").delete(2)
Result in JS: "Go 🐯!"
The server will get the Delta, where the insert string's length is only 1 if measured with native string length functions. But note the delete count is still 2. Because the back end can't know that we are deleting two code units that are both part of the same surrogate pair, I would claim the only way for the back end to work correctly with this delta is to count strings using UTF-16. It presumably has the initial doc in memory; however it counts, it needs to see that string as having length 6, not 5, or else the delete(2) instruction will fail.
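The length mismatch is easy to demonstrate from JavaScript itself (a small illustration; 🐶 stands in for any astral-plane character):

```javascript
// JS .length counts UTF-16 code units. Many back ends instead count
// code points (e.g. Python's len) or raw UTF-8 bytes.
var doc = 'Go \uD83D\uDC36!';                 // "Go 🐶!"

console.log(doc.length);                      // 6 -- UTF-16 code units
console.log(Array.from(doc).length);          // 5 -- code points (Python's len)
console.log(Buffer.byteLength(doc, 'utf8'));  // 8 -- UTF-8 bytes
```

A back end applying retain/delete counts produced by a JS client has to interpret those integers as UTF-16 code units, whichever encoding it stores text in.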
I did run into this issue in practice when the back end is not Javascript. In those cases converting between UTF-16 and UTF-8 will fail if strings contain partial surrogate pairs. So this is now fixed in fast-diff itself.
I use Python on the back end of my project. After reading everyone's discussion, I still don't have a solution. How can I ensure that JavaScript and Python compute the same emoji length? Thank you.