openlink/virtuoso-opensource

JSON Resultset UTF-8 encoding issues when escaped with \u

fgiasson opened this issue · 5 comments

Hi,

It appears that UTF-8 characters returned in SPARQL JSON resultsets are not properly encoded with \u.

Here is a DBpedia query that fails:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+%3Fo+%3Falt%0D%0Awhere%0D%0A%7B%0D%0A++%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FwikiPageRedirects%3E+%3Fo+.%0D%0A%0D%0A++%7B%0D%0A++++%3Fs+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label%3E+%3Falt+.%0D%0A++%7D%0D%0A%7D%0D%0Alimit+10000%0D%0Aoffset+10000&format=application%2Fsparql-results%2Bjson&timeout=30000&debug=on#

Encoded characters such as "\U0001B000" should probably be encoded as "\uD82C\uDC00" instead.

knoan commented

Spot on… JSON only supports 4-digit Unicode escape sequences. Unicode characters outside the BMP must be emitted directly as a UTF-8 sequence (allowed by JSON production char) or encoded as surrogate pairs.
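For reference, a minimal sketch of the surrogate-pair arithmetic described above. The function name `toJsonEscape` is made up for illustration and is not part of Virtuoso or any library; it only shows how a code point such as U+1B000 maps to the `\uD82C\uDC00` form suggested in the report.

```javascript
// Hypothetical helper: returns the JSON escape sequence for one Unicode code point.
function toJsonEscape(codePoint) {
  // Format a 16-bit value as four uppercase hex digits.
  const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, "0");

  // BMP code points fit in a single 4-digit escape.
  if (codePoint <= 0xFFFF) {
    return "\\u" + hex4(codePoint);
  }

  // Supplementary code points are split into a high and a low surrogate.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return "\\u" + hex4(high) + "\\u" + hex4(low);
}

console.log(toJsonEscape(0x1B000)); // \uD82C\uDC00
```

Wrapping the result in double quotes gives a JSON string that JSON.parse() accepts and decodes back to the original character.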

This is a serious bug as browser-provided JSON.parse() doesn't support lenient parsing and breaks on illegal escape sequences, as in

JSON.parse('"\\U0001B000"')

May be reproduced by the following query on the DBpedia endpoint:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select * {
   <http://dbpedia.org/resource/Ancient_Carthage> rdfs:comment ?c filter (lang(?c) = 'en')
}

knoan commented

The following should work as a stopgap measure:

    JSON.parse(text.replace(/\\U([0-9A-Fa-f]{8})/g, function ($0, $1) {

        // Split the supplementary code point into a UTF-16 surrogate pair
        var c = parseInt($1, 16) - 0x010000;
        var h = (c >> 10) + 0xD800;
        var l = (c & 0x3FF) + 0xDC00;

        return String.fromCharCode(h, l);

    }));
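
A slightly more defensive variant of the stopgap above, as a sketch only. It assumes the response may also contain an escaped backslash directly before a literal "U" (which must not be rewritten) or 8-digit escapes for BMP characters; whether Virtuoso ever emits either is not confirmed here. The name `fixLongEscapes` is hypothetical.

```javascript
// Hypothetical helper: rewrites non-standard \UXXXXXXXX escapes so that
// JSON.parse() accepts the text. Everything else is left untouched.
function fixLongEscapes(text) {
  // The first alternative consumes escaped backslashes (\\) so that a
  // following "U" is not mistaken for the start of a \U escape.
  return text.replace(/\\\\|\\U([0-9A-Fa-f]{8})/g, function (match, hex) {
    if (hex === undefined) return match;   // escaped backslash: keep as-is

    var cp = parseInt(hex, 16);
    if (cp > 0x10FFFF) return match;       // not a valid code point: keep as-is

    if (cp <= 0xFFFF) {
      // BMP: use a standard 4-digit escape (safe even for quotes and controls)
      return "\\u" + cp.toString(16).padStart(4, "0");
    }
    // Supplementary plane: the raw character is legal inside a JSON string
    return String.fromCodePoint(cp);
  });
}

var data = JSON.parse(fixLongEscapes('{"c": "\\U0001B000"}'));
console.log(data.c === String.fromCodePoint(0x1B000)); // true
```

Inputs the helper does not recognize are passed through unchanged, so JSON.parse() still reports them as errors instead of silently altering the data.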

This issue was fixed a few days ago, and the fix will be making its way to the commercial and open source archives, DBpedia included, in the coming days.

The fix for this issue has been pushed to the open source develop/7 branch:

http://sourceforge.net/p/virtuoso/virtuoso-opensource/ci/e0f65ec67f980251579fbd614be1fb0ac6b18786

Thanks!