aaubry/YamlDotNet

YamlStream.Load with JSON with emojis (even escaped) fails: "While scanning a quoted scalar, found invalid Unicode character escape code."

cmeeren opened this issue ยท 13 comments

I am serializing something to JSON with System.Text.Json, and then converting it to YAML using YamlStream. However, if the JSON contains an emoji, even if it's escaped, YamlStream.Load throws:

(Line: 1, Col: 1, Idx: 0) - (Line: 1, Col: 4, Idx: 3): While scanning a quoted scalar, found invalid Unicode character escape code.

Code to reproduce:

open System.IO
open System.Text.Json
open YamlDotNet.RepresentationModel

let input = "๐Ÿ‘"

let json = JsonSerializer.Serialize input

assert (json = "\"\\uD83D\\uDC4D\"")

let yamlStream = YamlStream()
yamlStream.Load(new StringReader(json))

Those utf-8 codes are actually invalid codes for utf-8. According to Wikipedia anyways. I wonder if the dot net core json library sees your character as 2 separate characters because it might be utf-16. Iโ€™m not able to actually dig in to this too much right now, but Iโ€™ll try and take a closer look tonight. I wonder if you can specify the encoding in the json serializer and set it to utf-16 and see what happens.

UTF-16? But JSON is always UTF-8 by definition, isn't it? In any case, I can find nothing in System.Text.Json about UTF-16. Let me know if I'm wrong.

Also, AFAIK many emojis are composed of 2 or even more unicode characters. This SO question and answers may be helpful.

Finally, I highly doubt that System.Text.Json (the official .NET JSON serialization API) is doing things wrong. That would require very concrete proof.

This will also help me in getting a fix in

dotnet/runtime#42847
Thereโ€™s some other linked issues that should help clarify the nuances in Unicode.

not sure when Iโ€™ll get it done though.

Is there any workaround I can apply now, for converting JSON to YAML? The emojis etc. doesn't have to be unescaped; I'm OK with anything that preserves the escape codes.

I simply want my JSON content converted to YAML (tweaked using a YAML visitor). Now it seems that the YAML conversion is attempting to decode escaped stuff, and is failing at that.

Possibly relevant: The Wikipedia article on JSON, section "Character encoding", says:

JSON exchange in an open ecosystem must be encoded in UTF-8. The encoding supports the full Unicode character set, including those characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). However, if escaped, those characters must be written using UTF-16 surrogate pairs. For example, to include the Emoji character U+1F610 ๐Ÿ˜ NEUTRAL FACE in JSON:

{ "face": "๐Ÿ˜" }
// or
{ "face": "\uD83D\uDE10" }

The latter is exactly what System.Text.Json does. And since YAML is a superset of JSON, I would expect any YAML implementation, such as YamlDotNet, to support such surrogate pairs.

Ideally I would like unescaped output (i.e., emojis in the YAML), but at least the following workaround lets me preserve the surrogate pair escape codes:

let unicodeEscapeCodePlaceholder = Guid.NewGuid().ToString()

let escapeUnicodeEscapeCodes (str: string) =
    str.Replace(@"\u", unicodeEscapeCodePlaceholder)

let unEscapeUnicodeEscapeCodes (str: string) =
    str.Replace(unicodeEscapeCodePlaceholder, @"\u")

let formatAsYaml json =
    let json = escapeUnicodeEscapeCodes json
    let yaml = (* load with YamlStream and transform to YAML *)
    unEscapeUnicodeEscapeCodes yaml

I definitely do agree that utf8 surrogate pairs should be supported. Iโ€™m hoping to have some time this weekend to look at it. I did find where it was throwing a fit though.

Happy to hear it! ๐Ÿ˜

I've been doing a lot of research on this, through the unicode spec and all that. The surrogate pairs are a way of putting utf-16 and utf-32 characters in a utf-8 file. JSON does this by escaping them, in raw UTF-8 files, its done at the byte level (from my understanding). We just need to support the escaped version, just like the YAML spec shows (if I recall). It has a lot to deal with bit masking and what not, it's pretty complicated so it may take some time, but it may go quicker than I think. I'll let you know when I have a PR ready for it by linking it to this issue.

Thanks! I have no idea what the internals of YamlDotNet is doing, but wouldn't it work to just keep the loaded text as-is and not attempting to decode the escape sequences, which is seems like it's doing now?

This fix is going to be released in the latest nuget package which should be available in about 10-15 minutes.

It works. Thanks a lot! ๐Ÿ˜Š