Invalid parsing of escaped unicode values
bmcminn opened this issue · 4 comments
I'm currently using this library in a Grunt task and ran into the following issue:
```
// JSON file data being linted
{
    "copyright": "\u2117 & \u00a9 2014 {{sitename}}"
}
```

```
// BASH error...
Invalid Reverse Solidus '\' declaration.
```
Just tested the above snippet against jsonlint.com, jsonlint pro, and jsoneditoronline, and they all accept the escaped unicode characters and parse the snippet as valid JSON.
This snippet lives in a much deeper part of my JSON data, which is compiled via PHP's `json_encode` function; the raw escaped unicode values cause this linter to throw the above error. Escaping the reverse solidus (`"\\u2117 & \\u00a9 2014 {{sitename}}"`) "fixes" the issue, though that's inconvenient since most systems emit escaped unicode values in this form by default.
Further testing shows that `rvalidsolidus` on `jsonlint.js:7` is improperly regexing for the appropriate `u[0-9]` combination. Changing it as described below remedies the problem, though it doesn't make sense that the explicit length of `[0-9]{4}` would break like this:
```js
// ...
rvalidsolidus = /\\("|\\|\/|b|f|n|r|t|u[0-9]{4})/, // original version
rvalidsolidus = /\\("|\\|\/|b|f|n|r|t|u[0-9]+)/,   // my change
```
Edit: regex demo showing that the `{4}` should work...
Just figured out why it invalidates: the regex is ONLY matching numeric unicode values. Updating `jsonlint.js:7` as follows corrects the problem.
```js
// ...
rvalidsolidus = /\\("|\\|\/|b|f|n|r|t|u[0-9]{4})/,     // original version
rvalidsolidus = /\\("|\\|\/|b|f|n|r|t|u[0-9A-F]{4})/i, // my change
// > catches \u1234 AND \u12aE
```
@codenothing I just finished updating the test on my fork and it passes. I had to modify the json-lint
dependency for nlint
because it uses jsonlint and had the same regex problem I'm trying to fix :P
In reading up on the Unicode spec, under Architecture and terminology, it specifies that the Basic Multilingual Plane occupies the range `0000`–`FFFF`, so my changes reflect this standard; outside of Plane 0 you get into larger byte sets that the regex is not handling.
In any case, this is a pretty involved issue, because I have no idea what your goal was in supporting a particular unicode spec or how robust that validation should be. You can review my changes here (https://github.com/bmcminn/jsonlint) and see what you think, though I plan to issue a pull request to resolve this issue.
EDIT: Pull request issued #3
Finally have a spec to reference for implementation validation: http://rfc7159.net/rfc7159#unichars