Set americanize false for tokenizer
toth12 opened this issue · 3 comments
@arunchaganty I am processing the following sentence:
Test sentence (this is a try).
Printing the output with `print(corenlp.to_text(sentence))` gives:
'Test sentence -LRB-this is a try-RRB-.'
To get 'Test sentence (this is a try).' I would need to set -tokenizerOptions "americanize=false" as described under question 33 here: https://nlp.stanford.edu/software/parser-faq.html
Can you please let me know how to do that?
I have tried:

```python
with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma".split(),
                           properties={'timeout': '50000', 'americanize': 'False'}) as client:
```

but this did not work.
You would want to use a command like this:
```python
ann = client.annotate(u"Test sentence (this is a try).",
                      output_format="text",
                      properties={"tokenize.options": "ptb3Escaping=false"})
```
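As a side note, `tokenize.options` takes a single comma-separated string of `flag=value` pairs, so several tokenizer flags (e.g. `ptb3Escaping` and `americanize`) can be combined in one property. A minimal sketch of building that string; the `build_tokenizer_options` helper is my own illustration, not part of the client API:

```python
def build_tokenizer_options(**flags):
    """Join tokenizer flags into the comma-separated string that
    CoreNLP expects as the value of the "tokenize.options" property."""
    return ",".join(
        "{}={}".format(name, str(value).lower()) for name, value in flags.items()
    )

# Turn off both bracket escaping and americanization:
props = {"tokenize.options": build_tokenizer_options(ptb3Escaping=False,
                                                     americanize=False)}
# props["tokenize.options"] is "ptb3Escaping=false,americanize=false",
# which could then be passed as properties=props to client.annotate(...).
```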
A related question: when setting 'tokenize.whitespace': 'true', the PTB-style conversions are not applied even if I also set 'tokenize.options': 'ptb3Escaping=true'.
E.g.:
```python
with corenlp.CoreNLPClient() as client:
    print(client.annotate('I ( a student ) like apples .', properties={
        'annotators': 'depparse',
        'inputFormat': 'text',
        'outputFormat': 'text',
        'tokenize.whitespace': 'true',
        'tokenize.options': 'ptb3Escaping=true'
    }))
```

The parentheses in this example are not converted to -LRB- and -RRB-. Do you know how I can make it work? Thanks!
Looks like it might be ignored on the Java end, so I also opened an issue in the CoreNLP repo here.
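Until that is resolved on the Java side, one possible workaround is to apply the PTB bracket escaping yourself before sending already-whitespace-tokenized text to the server. A minimal sketch; this is my own workaround covering only the standard bracket substitutions, not an official CoreNLP feature:

```python
# PTB escape sequences for brackets, as used by Penn Treebank tokenization.
PTB_ESCAPES = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

def ptb_escape(text):
    """Replace bracket tokens in whitespace-tokenized text with their
    PTB escape sequences, leaving all other tokens unchanged."""
    return " ".join(PTB_ESCAPES.get(tok, tok) for tok in text.split())

# ptb_escape('I ( a student ) like apples .')
# → 'I -LRB- a student -RRB- like apples .'
```

The escaped string can then be annotated with 'tokenize.whitespace': 'true' as above, since the brackets are already in the form the downstream annotators expect.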