stanfordnlp/python-stanford-corenlp

Set americanize false for tokenizer

toth12 opened this issue · 3 comments

@arunchaganty I am processing the following sentence:

Test sentence (this is a try).

As output from print(corenlp.to_text(sentence)) I get:

'Test sentence -LRB-this is a try-RRB-.'

To get 'Test sentence (this is a try).' I would need to set -tokenizerOptions "americanize=false" as described under question 33 here: https://nlp.stanford.edu/software/parser-faq.html

Can you please let me know how to do that? I have tried:

with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma".split(), properties={'timeout': '50000', 'americanize': 'False'}) as client:

but this did not work.

J38 commented

You would want to use a command like this:

ann = client.annotate(u"Test sentence (this is a try).", output_format="text", properties={"tokenize.options":"ptb3Escaping=false"})
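If you cannot change the tokenizer options, or you are post-processing output that already contains the escapes, the standard PTB bracket escapes can also be reversed after the fact. This is a convenience sketch, not part of the CoreNLP client API; the mapping below covers the usual bracket escapes only.

```python
# Reverse the standard PTB bracket escapes found in CoreNLP text output.
# Post-processing helper only -- not part of the client API.
PTB_UNESCAPE = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}

def ptb_unescape(text: str) -> str:
    """Replace PTB bracket escape tokens with their literal characters."""
    for token, char in PTB_UNESCAPE.items():
        text = text.replace(token, char)
    return text

print(ptb_unescape("Test sentence -LRB-this is a try-RRB-."))
# → Test sentence (this is a try).
```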

A related question: when setting 'tokenize.whitespace': 'true', even if I then do 'tokenize.options': 'ptb3Escaping=true', PTB-style conversions are not done.

E.g.:

with corenlp.CoreNLPClient() as client:
  print(client.annotate('I ( a student ) like apples .', properties={
        'annotators': 'depparse',
        'inputFormat': 'text',
        'outputFormat': 'text',
        'tokenize.whitespace': 'true',
        'tokenize.options': 'ptb3Escaping=true'
    }))

The parentheses in this example are not converted to -LRB- and -RRB-. Do you know how I can make it work? Thanks!

Looks like it might be ignored on the Java end, so I also opened an issue in the CoreNLP repo here.
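Until that is resolved on the Java side, one workaround is to apply the PTB escapes yourself before sending whitespace-pre-tokenized text, so the parser still sees the -LRB-/-RRB- tokens it was trained on. A hypothetical helper (not part of the client), assuming the input is already split on whitespace:

```python
# Escape PTB bracket characters in whitespace-tokenized input ourselves,
# since tokenize.whitespace=true appears to skip ptb3Escaping.
PTB_ESCAPE = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

def ptb_escape_tokens(sentence: str) -> str:
    """Replace bracket tokens in an already-whitespace-tokenized sentence."""
    return " ".join(PTB_ESCAPE.get(tok, tok) for tok in sentence.split())

print(ptb_escape_tokens("I ( a student ) like apples ."))
# → I -LRB- a student -RRB- like apples .
```

The escaped string can then be passed to client.annotate() with 'tokenize.whitespace': 'true' as in the example above.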