explosion/spaCy

Tokenisation exceptions for English pronoun "She"

liane-brainnwave opened this issue · 9 comments

tokenizer_exceptions.py for English contains several entries for the pronoun 'She' with the ORTH value 'i'. For example:

"She's": [
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
]

Unless I am missing something, this should be:

"She's": [
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
]
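
As a quick sanity check (just a minimal sketch using the public add_special_case API, and only setting ORTH, since which attributes a special case may carry differs between spaCy versions), the corrected entry gives the expected split:

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# register the corrected exception: "She's" -> ["She", "'s"]
nlp.tokenizer.add_special_case("She's", [{ORTH: "She"}, {ORTH: "'s"}])

doc = nlp("She's here")
print([t.text for t in doc])   # expected: ['She', "'s", 'here']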

ines commented

Thanks – you're absolutely correct. Looks like this was a copy-paste mistake that wasn't spotted. Fixing this now!

Also currently working on making the English exceptions a little easier to read – for example, ordering them properly etc.

liane-brainnwave commented

Thanks, ines. Re-ordering / grouping the exceptions would certainly help.

ines commented

For the pronouns, I'm thinking about doing something like:

for pron in ["you", "You", "we", "We", "they", "They"]:
    EXC[pron + "'re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "'re", LEMMA: "be", NORM: "are"}
    ]

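    # and the same split for the spelling without the apostrophe, e.g. "theyre"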
    EXC[pron + "re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "re", LEMMA: "be", NORM: "are"}
    ]

This obviously doesn't work for all exceptions and languages, and not everything follows such a nice pattern. But for the tokens that do, I think it could help with making sure everything's consistent. (Plus, it'd easily cut a couple of hundred lines.)
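
Roughly how the generated entries would then be merged with the hand-written ones (a sketch only: it assumes a helper like spacy.util.update_exc, and the attribute values are placeholders):

from spacy.symbols import ORTH, LEMMA, TAG, NORM
from spacy.util import update_exc

PRON_LEMMA = "-PRON-"   # shared pronoun lemma used in the English data

# one entry as produced by a loop like the one above
EXC = {
    "They're": [
        {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "'re", LEMMA: "be", NORM: "are"},
    ]
}

# exceptions that don't follow a pattern stay as literal, hand-written entries
HAND_WRITTEN = {"and/or": [{ORTH: "and/or", TAG: "CC"}]}

# merged dict containing both the hand-written and the generated entries
TOKENIZER_EXCEPTIONS = update_exc(HAND_WRITTEN, EXC)
print(TOKENIZER_EXCEPTIONS["They're"])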

oroszgy commented

I know that @honnibal doesn't want to overuse regexps in spaCy, but I think that in this case they would not only be more efficient but also easier to maintain.

If we can utilize the return value of rule_match from #700, we could simplify the handling of such exceptions significantly. What do you think?

ines commented

@oroszgy Hm, so I agree regexes would be more elegant, but even with the new rule_match, all we can do is match the exception – there's no way to actually return a list of split tokens for each individual exception. Or am I misunderstanding your suggestion?

oroszgy commented

@ines I meant that we could simply change the semantics of the token_match thing so that it can return tokens if there is a match. That way, the tokenizer exception handling would be a bit easier.

Instead of token_match(text: str) -> bool, we could have token_match(text: str) -> Union[None, List[TokenAnalysis]], where TokenAnalysis would be something like this for "and/or":

{ORTH: "and/or", LEMMA: "and/or", TAG: "CC"}

honnibal commented

I really think it's better to express the exceptions as literal strings. We're hashing these, so checking the exceptions is a (super fast) O(1) operation per chunk. I do think we'd hit a performance problem from moving this to regular expressions.
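
In other words, the exception check is essentially one dict lookup per whitespace-delimited chunk, along these lines (a toy sketch, not spaCy's actual implementation):

EXC = {"She's": ["She", "'s"], "don't": ["do", "n't"]}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        # one hash lookup per chunk; fall back to the chunk itself
        tokens.extend(EXC.get(chunk, [chunk]))
    return tokens

print(tokenize("don't worry She's here"))   # ['do', "n't", 'worry', 'She', "'s", 'here']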

The expansion logic also gets quite tricky if you're writing an expression. Let's say you write a pattern like this:

(I|he|she) ('d) (nt|n't)

You need to look up the regex capture groups to fill out the ORTH fields, and ultimately the lemmas. In this case there's a natural 1-to-1 mapping between the groups and the tokens, but this need not be so. If you have character classes for case sensitivity, you've got some problems to deal with.

In order to make the expansion logic sane, you'll be tempted to separate out the matches into different expressions. Once you do that, you're not really any further ahead on generality/compactness than you are with the explicit exceptions.
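
For illustration, the expansion code for the pattern above would have to look roughly like this (the lemma and tag choices are made up; it only works because the groups happen to map one-to-one onto the tokens):

import re

PATTERN = re.compile(r"^(I|he|she)('d)(nt|n't)$")

def expand(chunk):
    match = PATTERN.match(chunk)
    if match is None:
        return None
    pron, contraction, negation = match.groups()
    # each capture group is mapped back onto an ORTH value plus lemma/norm
    # attributes; as soon as groups and tokens no longer line up, this breaks
    return [
        {"ORTH": pron, "LEMMA": "-PRON-", "TAG": "PRP"},
        {"ORTH": contraction, "LEMMA": "would", "NORM": "would"},
        {"ORTH": negation, "LEMMA": "not", "NORM": "not"},
    ]

print(expand("she'dn't"))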

oroszgy commented

@honnibal, thanks for the explanation! For some reason I didn't notice that the for loops are executed at load time, not at run time. Of course, in this case hashing is much more efficient than regular expressions.
