Invalid JSON in output
Closed this issue · 1 comment
dandelionred commented
Sample input I will use below:
$ echo -n $'Th\x10e' | xxd
00000000: 5468 1065                                Th.e
Let's use the tokenize,ssplit annotators:
$ echo $'Th\x10e' | /usr/bin/java -mx1g -cp \* edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -annotators tokenize,ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
Entering interactive shell. Type q RETURN or EOF to quit.
NLP> Untokenizable: � (U+10, decimal: 16)
{
  "sentences": [
    {
      "index": 0,
      "tokens": [
        {
          "index": 1,
          "word": "Th",
          "originalText": "Th",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 2,
          "before": "",
          "after": "�"            <------ ascii 0x10
        },
        {
          "index": 2,
          "word": "e",
          "originalText": "e",
          "characterOffsetBegin": 3,
          "characterOffsetEnd": 4,
          "before": "�",          <------ ascii 0x10
          "after": ""
        }
      ]
    }
  ]
}
NLP>
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 2 tokens at 38.5 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.1 sec.
We've got raw ASCII 0x10 dumped into the JSON under the after and before keys. I'm not sure whether emitting raw ASCII control characters in JSON is valid, but PHP's json_decode() returns null for any JSON containing one, so I guess it is not common practice at the very least.
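For reference, RFC 8259 does require control characters U+0000 through U+001F to be escaped inside JSON strings, so strict parsers reject this output just like PHP does. A minimal Java sketch that reproduces the rejection, assuming the Jackson library is on the classpath; the class name and the embedded string are illustrative only:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrative demo: feeds a string value containing a raw (unescaped)
// 0x10 control character, like the CoreNLP output above, to a strict parser.
public class ControlCharDemo {
    public static void main(String[] args) {
        // "\u0010" compiles to the raw DLE character inside the literal,
        // so this JSON contains an unescaped control character.
        String json = "{\"after\": \"\u0010\"}";
        try {
            new ObjectMapper().readTree(json);
            System.out.println("parsed OK");
        } catch (JsonProcessingException e) {
            // Jackson rejects unescaped control characters by default
            // (ALLOW_UNQUOTED_CONTROL_CHARS is off), mirroring PHP's
            // json_decode() returning null.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}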
manning commented
Yes, indeed, we got burned by this ourselves over the summer; previously the JSONOutputter only escaped "common" text control characters rather than all control characters. This was fixed in:
commit d6318a0cb06dba635550477bc843952cc5a5f868
If you check out HEAD or cherry-pick that commit, you should be good.
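For anyone stuck on an older checkout: the gist of the fix is that JSON requires every control character below U+0020 to be escaped, not just the common ones like \n and \t. A minimal sketch of that kind of escaping in Java (the method name is hypothetical, not the actual JSONOutputter code):

// Hypothetical escape routine: uses the short escapes JSON defines,
// and \u00XX-escapes every other control character below U+0020
// (the case the old JSONOutputter missed).
static String escapeJsonString(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (c) {
            case '"':  out.append("\\\""); break;
            case '\\': out.append("\\\\"); break;
            case '\b': out.append("\\b");  break;
            case '\f': out.append("\\f");  break;
            case '\n': out.append("\\n");  break;
            case '\r': out.append("\\r");  break;
            case '\t': out.append("\\t");  break;
            default:
                if (c < 0x20) {
                    // e.g. 0x10 becomes \u0010
                    out.append(String.format("\\u%04x", (int) c));
                } else {
                    out.append(c);
                }
        }
    }
    return out.toString();
}

With escaping like this, the sample above serializes the 0x10 byte as \u0010, which PHP's json_decode() accepts.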