stanfordnlp/CoreNLP

Invalid json in output


Sample input I will use below:

$ echo -n $'Th\x10e' | xxd
0000000: 5468 1065                                Th.e

Let's use the tokenize,ssplit annotators:

$ echo $'Th\x10e' | /usr/bin/java -mx1g -cp \* edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json  -annotators tokenize,ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit

Entering interactive shell. Type q RETURN or EOF to quit.
NLP> Untokenizable: � (U+10, decimal: 16)
{
  "sentences": [
    {
      "index": 0,
      "tokens": [
        {
          "index": 1,
          "word": "Th",
          "originalText": "Th",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 2,
          "before": "",
          "after": "�"  <------ ascii 0x10
        },
        {
          "index": 2,
          "word": "e",
          "originalText": "e",
          "characterOffsetBegin": 3,
          "characterOffsetEnd": 4,
          "before": "�",  <------ ascii 0x10
          "after": ""
        }
      ]
    }
  ]
}
NLP> 
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 2 tokens at 38.5 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.1 sec.

The raw ASCII 0x10 character is dumped unescaped into the JSON under the "after" and "before" keys. I'm not sure whether raw ASCII control characters are valid in JSON, but PHP's json_decode() returns null for any JSON containing one, so I'd guess it is not common practice at the very least.
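For what it's worth, the JSON grammar (RFC 8259) requires characters U+0000 through U+001F to be escaped inside strings, so a strict parser should reject the output above. The following is a minimal check using Python's standard json module rather than PHP; the helper name parses is made up for illustration, and the two inputs are trimmed-down stand-ins for the "after" field, not actual CoreNLP output.

```python
import json

# Raw U+0010 inside a JSON string, as in the "after" field above.
raw = '{"after": "\x10"}'
# The same value with the control character escaped as \u0010.
escaped = '{"after": "\\u0010"}'


def parses(text):
    """Return True if Python's (strict by default) JSON parser accepts text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(raw))      # False: raw control character is rejected
print(parses(escaped))  # True: escaped form is accepted
```

The escaped form decodes back to the original single character, so no information is lost by escaping.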

Yes, indeed, we got burned ourselves by this over the summer; previously the JSONOutputter only escaped "common" text control characters rather than all control characters. This got fixed in:

commit d6318a0cb06dba635550477bc843952cc5a5f868

If you check out HEAD or cherry-pick that commit, then you should be good.
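To illustrate the difference between escaping only the "common" control characters and escaping all of them, here is a rough sketch in Python. It is not the JSONOutputter code from that commit; escape_json_string and SHORT_ESCAPES are hypothetical names, and the logic only assumes the JSON rule that every character below U+0020 must be escaped.

```python
import json

# Characters with a short escape form in JSON. An outputter that handles
# only these would let something like U+0010 through unescaped.
SHORT_ESCAPES = {
    '"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
    '\n': '\\n', '\r': '\\r', '\t': '\\t',
}


def escape_json_string(text):
    """Return text as a quoted JSON string with all control chars escaped."""
    out = []
    for ch in text:
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Remaining control characters get the generic \uXXXX form.
            out.append('\\u%04x' % ord(ch))
        else:
            out.append(ch)
    return '"' + ''.join(out) + '"'


encoded = escape_json_string('Th\x10e')
print(encoded)  # "Th\u0010e"
# Round trip: a strict parser accepts it and recovers the original text.
assert json.loads(encoded) == 'Th\x10e'
```

With this kind of escaping, the sample input from the report would appear as \u0010 in the "before" and "after" fields instead of a raw byte.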