nikitakit/self-attentive-parser

Parsing tagged text

thorunna opened this issue · 5 comments

Hi,

I'm trying to parse input text which has already been tagged using a model that includes a tagger. For this experiment, I'd like to disregard the tagger included in the parsing model but make the parser use the existing tags for tagging the text. Is this possible?

@thorunna Did you have any luck finding out if this is possible?

@jkallini No I didn't, but please let me know if you have any!

As of benepar 0.2.0a0, there is a new API integrated with NLTK that can more easily handle parsing text with existing tags. If the tags field of benepar.InputSentence is not None, the provided tags will be passed through to the output (but if the tags field is None, benepar will do its own POS tagging).
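For anyone landing here, a minimal sketch of what that could look like for one pre-tagged sentence. The helper names `split_tagged` and `parse_with_existing_tags` are made up for illustration; `benepar.Parser`, `benepar.InputSentence`, and the `benepar_en3` model name are the parts that come from benepar itself. It assumes benepar 0.2.0a0 or later is installed and the model has already been downloaded.

```python
from typing import List, Sequence, Tuple


def split_tagged(pairs: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (word, tag) pairs into parallel word and tag lists (hypothetical helper)."""
    words = [word for word, _ in pairs]
    tags = [tag for _, tag in pairs]
    return words, tags


def parse_with_existing_tags(pairs, model_name="benepar_en3"):
    """Parse one pre-tagged sentence so the given tags are passed through (hypothetical helper)."""
    # Imported lazily so the helper above can be used without benepar installed.
    import benepar

    words, tags = split_tagged(pairs)
    parser = benepar.Parser(model_name)
    # tags is not None, so benepar should skip its own POS tagging for this sentence.
    sentence = benepar.InputSentence(words=words, tags=tags)
    return parser.parse(sentence)
```

Calling `parse_with_existing_tags([("Fly", "VB"), ("safely", "RB"), (".", ".")])` should then give back a tree whose preterminals carry the supplied tags rather than benepar's predictions.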

Hi Nikita,
I see that version 0.2.0a0 does not have this feature with spaCy integration, which is the recommended way to use benepar. Are there any benefits to using spaCy integration if I am parsing English corpus data that is already tokenized and POS tagged by a human? I just want to make sure--will I have better results if I use the existing tags but integrate with NLTK instead of starting with raw text and using spaCy?

With spaCy, you should be able to do the following to disable benepar's POS tagger and fall back on spaCy's instead.

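import benepar, spacy

nlp = spacy.load("en_core_web_md")  # example model; any English spaCy pipeline with a tagger
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3", disable_tagger=True))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3", "disable_tagger": True})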

You can also inject your own POS tags into spaCy:

for i in range(len(spacy_sent)):
    spacy_sent[i].tag_ = my_tags[i]  # my_tags[i] is a string, e.g. NN
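Since spaCy does its own tokenization, its tokens may not line up one-to-one with a hand-tagged corpus, so it may be worth checking lengths before overwriting tags. A small sketch of that idea follows; `inject_tags` is a hypothetical helper, not part of spaCy or benepar, and the stand-in objects are there only so the snippet runs without spaCy installed.

```python
from types import SimpleNamespace


def inject_tags(tokens, tags):
    """Overwrite tag_ on each token, refusing to guess when lengths disagree."""
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    for token, tag in zip(tokens, tags):
        token.tag_ = tag  # same attribute as spaCy's Token.tag_


# Stand-in tokens for demonstration; in practice `tokens` would be
# a spaCy Doc or Span such as spacy_sent above.
demo = [SimpleNamespace(text=w, tag_="") for w in ["Fly", "safely", "."]]
inject_tags(demo, ["VB", "RB", "."])
```

With disable_tagger=True the tags presumably need to be in place before the benepar component processes the sentence, so this would go before that step rather than after it.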

But the only thing the spaCy integration offers over NLTK is that it has non-destructive and better tokenization, as well as better sentence segmentation. If sentence segmentation, tokenization, and tagging are already done by a human I don't think spaCy offers anything (unless you like its API more than NLTK's).