Handling quotes
jofatmofn opened this issue · 4 comments
Given the text
John said, "Welcome to the heaven".
rrp.simple_parse gives
(S1 (S (NP (NNP John)) (VP (VBD said) (, ,) (`` ``) (INTJ (UH Welcome) (PP (TO to) (NP (DT the) (NN heaven)))) ('' '')) (. .)))
If I use rrp.parse_tagged with the following tokens and postags
tokens=[u'John', u'said', u',', u'"', u'Welcome', u'to', u'the', u'heaven', u'"', u'.']
postags={0: u'NNP', 1: u'VBD', 2: u',', 3: u'``', 4: u'UH', 5: u'TO', 6: u'DT', 7: u'NN', 8: u"''", 9: u'.'}
it returns an empty list.
Workaround: if I change the opening double quote in tokens to two backticks and the closing double quote to two apostrophes, as in
tokens=[u'John', u'said', u',', u'``', u'Welcome', u'to', u'the', u'heaven', u"''", u'.']
then it works.
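The workaround above can be automated before calling rrp.parse_tagged. The following is only a minimal sketch: normalize_ptb_quotes is a hypothetical helper, not part of BLLIP, and it assumes that straight double quotes are the only tokens needing replacement. It uses the quote tag from postags when one is available for that position, and otherwise assumes quotes simply alternate between opening and closing.

```python
# Hypothetical helper (not part of BLLIP): rewrite straight double-quote
# tokens as Penn Treebank-style quote tokens (`` to open, '' to close).
OPEN_QUOTE = "``"
CLOSE_QUOTE = "''"


def normalize_ptb_quotes(tokens, postags=None):
    """Return a copy of tokens with each '"' replaced by `` or ''."""
    postags = postags or {}
    normalized = []
    inside_quote = False  # True while an opening quote has not been closed
    for index, token in enumerate(tokens):
        if token != '"':
            normalized.append(token)
            continue
        tag = postags.get(index)
        if tag in (OPEN_QUOTE, CLOSE_QUOTE):
            # Trust the tag supplied by the upstream tagger.
            replacement = tag
        else:
            # Assumption: without a tag, quotes alternate open/close.
            replacement = CLOSE_QUOTE if inside_quote else OPEN_QUOTE
        inside_quote = replacement == OPEN_QUOTE
        normalized.append(replacement)
    return normalized


tokens = ['John', 'said', ',', '"', 'Welcome', 'to', 'the', 'heaven', '"', '.']
postags = {0: 'NNP', 1: 'VBD', 2: ',', 3: '``', 4: 'UH',
           5: 'TO', 6: 'DT', 7: 'NN', 8: "''", 9: '.'}
fixed_tokens = normalize_ptb_quotes(tokens, postags)
print(fixed_tokens)
```

The resulting fixed_tokens list can then be passed to rrp.parse_tagged together with the unchanged postags.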
Sure. Thanks. Could you please direct me to any reference (document or code) that lists such replacements? I need to use tokens and postags from another parser, and I can apply these replacements before calling BLLIP.
Thanks.