explosion/spaCy

Error from adding arbitrary fixup rules to pipeline

cchu613 opened this issue · 3 comments

Hello! I'm a newbie to natural language processing and am trying to use spaCy for an information extraction project. So far everything has been great, except that in sentences like "One killed in Bucks County shooting", shooting gets tagged as a verb instead of a noun.

Here is my code (only slightly modified from the tutorial titled Customizing the Pipeline):

import spacy

def arbitrary_fixup_rules(doc):
    for token in doc:
        # token.lower is an integer ID; token.lower_ is the lowercased string
        if token.lower_ == u'shooting':
            token.tag_ = u'NN'

def custom_pipeline(nlp):
    return (nlp.tagger, arbitrary_fixup_rules, nlp.parser, nlp.entity)

nlp = spacy.load('en', create_pipeline=custom_pipeline)

However, running

doc = nlp(u'One dead in Bucks County shooting.')

resulted in
AttributeError: attribute 'tag_' of 'spacy.tokens.token.Token' objects is not writable

python 2.7, spacy version 1.1.2

Hm! There's a gap in the API there — a missing attribute setter. Thanks.
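To illustrate what a missing setter means, here is a minimal plain-Python sketch. It is not spaCy's actual implementation (the real Token is a compiled extension type); `ReadOnlyToken`, `WritableToken` and `try_set_tag` are hypothetical names used only for this example. A property with a getter but no setter raises AttributeError on assignment, and adding the setter makes the same assignment succeed.

```python
class ReadOnlyToken(object):
    """Hypothetical stand-in for a token whose tag_ has a getter only."""

    def __init__(self, text, tag):
        self.text = text
        self._tag = tag

    @property
    def tag_(self):
        return self._tag


class WritableToken(ReadOnlyToken):
    """Same stand-in, with the missing setter added."""

    @property
    def tag_(self):
        return self._tag

    @tag_.setter
    def tag_(self, value):
        self._tag = value


def try_set_tag(token, value):
    """Return True if the assignment worked, False if it raised AttributeError."""
    try:
        token.tag_ = value
    except AttributeError:
        return False
    return True


read_only = ReadOnlyToken(u'shooting', u'VBG')
writable = WritableToken(u'shooting', u'VBG')

read_only_result = try_set_tag(read_only, u'NN')  # getter only: assignment fails
writable_result = try_set_tag(writable, u'NN')    # setter present: assignment works
print(read_only_result, read_only.tag_)
print(writable_result, writable.tag_)
```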

This should be fixed in master. We also noticed a page missing from the docs, which we've just put up.
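Once tag_ is writable, a fix-up step like the one in the question can sit between the tagger and the later components. The sketch below shows only the idea of a pipeline as an ordered sequence of callables that modify the document in place. It does not use spaCy: `SimpleToken`, `toy_tagger` and `run_pipeline` are hypothetical stand-ins, and the toy tagger's rule is invented for the example.

```python
class SimpleToken(object):
    """Hypothetical stand-in for a token with a writable tag_ attribute."""

    def __init__(self, text):
        self.text = text
        self.lower_ = text.lower()
        self.tag_ = u''


def toy_tagger(doc):
    # Stand-in for a statistical tagger: labels every "-ing" word as a
    # verb (VBG) and everything else as a noun (NN).
    for token in doc:
        token.tag_ = u'VBG' if token.lower_.endswith(u'ing') else u'NN'


def arbitrary_fixup_rules(doc):
    # Runs after the tagger and overrides its decision for one word.
    for token in doc:
        if token.lower_ == u'shooting':
            token.tag_ = u'NN'


def run_pipeline(words, pipeline):
    doc = [SimpleToken(word) for word in words]
    for component in pipeline:
        component(doc)  # each component modifies the doc in place
    return doc


words = u'One dead in Bucks County shooting'.split()
tagged_only = run_pipeline(words, (toy_tagger,))
fixed = run_pipeline(words, (toy_tagger, arbitrary_fixup_rules))
print([(t.text, t.tag_) for t in fixed])
```

The order of the tuple matters: the fix-up has to come after the tagger, otherwise the tagger overwrites it.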

The missing page describes the API for the tokenizer. It's relevant because the tokenizer offers another way to do what you want. The tokenizer.add_special_case() method lets you add a rule saying how to segment some string into component tokens. You can then add custom attributes to these tokens.

For instance, you can do something like this:

nlp.tokenizer.add_special_case('shooting', [{"F": "shooting", "pos": "NN"}])

The attribute keys are currently a bit idiosyncratic. The tokenizer recognises:

  • F: The string of the subtoken.
  • pos: The part-of-speech to assign to the subtoken.
  • L: The lemma (base form) to assign to the subtoken.

Soon this will be fixed, and it'll support the same token attributes as the rest of the library.
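For readers who want to see how a special-case rule of this shape behaves, here is a toy model in plain Python. It is not spaCy's tokenizer: `ToyTokenizer` is a hypothetical whitespace splitter, and the check that the F strings spell out the original string is an assumption of this sketch. It reuses the F, pos and L keys listed above.

```python
class ToyTokenizer(object):
    """Hypothetical whitespace tokenizer with special-case rules."""

    def __init__(self):
        self.special_cases = {}

    def add_special_case(self, string, subtokens):
        # Assumption of this sketch: the "F" strings must spell out the
        # original string when joined together.
        if ''.join(tok['F'] for tok in subtokens) != string:
            raise ValueError('subtokens do not spell out %r' % string)
        self.special_cases[string] = subtokens

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk in self.special_cases:
                # Copy each rule dict so callers cannot mutate the rule.
                tokens.extend(dict(tok) for tok in self.special_cases[chunk])
            else:
                tokens.append({'F': chunk})
        return tokens


tokenizer = ToyTokenizer()
tokenizer.add_special_case('shooting', [{'F': 'shooting', 'pos': 'NN'}])
tokenizer.add_special_case("don't", [{'F': 'do', 'L': 'do'},
                                     {'F': "n't", 'L': 'not'}])

headline = tokenizer('One dead in Bucks County shooting')
contraction = tokenizer("I don't")
print(headline[-1])
print(contraction)
```

A rule like this attaches the attribute to every occurrence of the string, including sentences where "shooting" really is a verb, so it is a blunter tool than a fix-up step in the pipeline.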
