stanfordnlp/python-stanford-corenlp

Set americanize false for tokenizer

toth12 opened this issue · 3 comments

@arunchaganty I am processing the following sentence:

Test sentence (this is a try).

As output from print(corenlp.to_text(sentence)) I get:

'Test sentence -LRB-this is a try-RRB-.'

To get 'Test sentence (this is a try).' I would need to set -tokenizerOptions "americanize=false" as described under question 33 here: https://nlp.stanford.edu/software/parser-faq.html

Can you please let me know how to do that? I have tried:

with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma".split(), properties={'timeout': '50000', 'americanize': 'False'}) as client:

but this did not work.

J38 commented

You would want to use a command like this:

ann = client.annotate(u"Test sentence (this is a try).", output_format="text", properties={"tokenize.options":"ptb3Escaping=false"})
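If you cannot change the tokenizer options, or you are post-processing output that already contains the escapes, the standard PTB bracket escapes can also be reversed after the fact. This is a convenience sketch, not part of the CoreNLP client API; the mapping below covers the usual bracket escapes only.

```python
# Reverse the standard PTB bracket escapes found in CoreNLP text output.
# Post-processing helper only -- not part of the client API.
PTB_UNESCAPE = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}

def ptb_unescape(text: str) -> str:
    """Replace PTB bracket escape tokens with their literal characters."""
    for token, char in PTB_UNESCAPE.items():
        text = text.replace(token, char)
    return text

print(ptb_unescape("Test sentence -LRB-this is a try-RRB-."))
# → Test sentence (this is a try).
```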

A related question: when setting 'tokenize.whitespace': 'true', even if I then do 'tokenize.options': 'ptb3Escaping=true', PTB-style conversions are not done.

E.g.:

with corenlp.CoreNLPClient() as client:
  print(client.annotate('I ( a student ) like apples .', properties={
        'annotators': 'depparse',
        'inputFormat': 'text',
        'outputFormat': 'text',
        'tokenize.whitespace': 'true',
        'tokenize.options': 'ptb3Escaping=true'
    }))

The parentheses in this example are not converted to -LRB- and -RRB-. Do you know how I can make it work? Thanks!

Looks like it might be ignored on the Java end, so I also opened an issue in the CoreNLP repo here.
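Until that is resolved on the Java side, one workaround is to apply the PTB escapes yourself before sending whitespace-pre-tokenized text, so the parser still sees the -LRB-/-RRB- tokens it was trained on. A hypothetical helper (not part of the client), assuming the input is already split on whitespace:

```python
# Escape PTB bracket characters in whitespace-tokenized input ourselves,
# since tokenize.whitespace=true appears to skip ptb3Escaping.
PTB_ESCAPE = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

def ptb_escape_tokens(sentence: str) -> str:
    """Replace bracket tokens in an already-whitespace-tokenized sentence."""
    return " ".join(PTB_ESCAPE.get(tok, tok) for tok in sentence.split())

print(ptb_escape_tokens("I ( a student ) like apples ."))
# → I -LRB- a student -RRB- like apples .
```

The escaped string can then be passed to client.annotate() with 'tokenize.whitespace': 'true' as in the example above.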