Parsing tagged text
thorunna opened this issue · 5 comments
Hi,
I'm trying to parse input text which has already been tagged using a model that includes a tagger. For this experiment, I'd like to disregard the tagger included in the parsing model but make the parser use the existing tags for tagging the text. Is this possible?
As of benepar 0.2.0a0, there is a new API integrated with NLTK that can more easily handle parsing text with existing tags. If the tags field of benepar.InputSentence is not None, the provided tags will be passed through to the output (but if the tags field is None, benepar will do its own POS tagging).
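A minimal sketch of that NLTK-style usage, assuming the "benepar_en3" model has already been downloaded and that benepar.InputSentence accepts words and tags as keyword arguments with the other fields left at their defaults. The helper names split_tagged and parse_pretagged are made up for illustration and are not part of benepar.

```python
from typing import List, Sequence, Tuple


def split_tagged(tagged: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (word, tag) pairs into parallel word and tag lists."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    return words, tags


def parse_pretagged(parser, tagged: Sequence[Tuple[str, str]]):
    """Parse one pre-tagged sentence with an already constructed benepar.Parser.

    Assumption: because tags is not None, benepar passes the given tags
    through instead of running its own tagger.
    """
    import benepar  # imported lazily so split_tagged works without benepar installed

    words, tags = split_tagged(tagged)
    return parser.parse(benepar.InputSentence(words=words, tags=tags))


# Hypothetical usage (requires benepar and the downloaded model):
#   import benepar
#   parser = benepar.Parser("benepar_en3")
#   tree = parse_pretagged(parser, [("Fly", "VB"), ("safely", "RB"), (".", ".")])
#   print(tree)
```

The parser is built once and passed in so that the model is not reloaded for every sentence.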
Hi Nikita,
I see that version 0.2.0a0 does not have this feature with spaCy integration, which is the recommended way to use benepar. Are there any benefits to using spaCy integration if I am parsing English corpus data that is already tokenized and POS tagged by a human? I just want to make sure--will I have better results if I use the existing tags but integrate with NLTK instead of starting with raw text and using spaCy?
With spaCy, you should be able to do the following to disable benepar's POS tagger and fall back on spaCy's instead.
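import benepar, spacy

nlp = spacy.load("en_core_web_md")  # assumed pipeline; any spaCy English model with a tagger
if spacy.__version__.startswith('2'):
    # spaCy 2.x takes a component instance
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3", disable_tagger=True))
else:
    # spaCy 3.x takes a registered factory name plus a config dict
    nlp.add_pipe("benepar", config={"model": "benepar_en3", "disable_tagger": True})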
You can also inject your own POS tags into spaCy:
for i in range(len(spacy_sent)):
    spacy_sent[i].tag_ = my_tags[i]  # my_tags[i] is a string, e.g. "NN"
But the only thing the spaCy integration offers over NLTK is non-destructive and better tokenization, as well as better sentence segmentation. If sentence segmentation, tokenization, and tagging have already been done by a human, I don't think spaCy offers anything (unless you like its API more than NLTK's).
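If you do go the spaCy route with hand-assigned tags, the injection loop above silently misbehaves when the two tokenizations disagree in length. Below is a small sketch of a guarded version. inject_tags is a hypothetical helper, not a spaCy or benepar function, and it only assumes that each token exposes a writable tag_ attribute, as spaCy tokens do.

```python
from types import SimpleNamespace
from typing import Sequence


def inject_tags(tokens, my_tags: Sequence[str]) -> None:
    """Copy externally supplied POS tags onto tokens, one tag per token.

    Raises ValueError when the counts differ, which usually means the
    external tokenization does not match spaCy's.
    """
    if len(tokens) != len(my_tags):
        raise ValueError(
            f"{len(tokens)} tokens but {len(my_tags)} tags; tokenizations differ"
        )
    for token, tag in zip(tokens, my_tags):
        token.tag_ = tag


# Stand-in objects so the sketch runs without spaCy installed;
# with spaCy you would pass a sentence span such as spacy_sent instead.
stand_in = [SimpleNamespace(tag_="") for _ in range(3)]
inject_tags(stand_in, ["VB", "RB", "."])
```

The tags need to be in place before the benepar component runs, so with a real pipeline the injection has to happen between the earlier pipeline stages and the parser.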