explosion/spaCy

How should `is_oov` be used?

DeNeutoy opened this issue · 7 comments

I had a question (allenai/scispacy#136) in scispacy regarding the usage of `is_oov`, and looking into it confused me:

How should the `is_oov` flag be used? Initially, I thought it would correspond to tokens that do not have a vector, but it seems like it should correspond to existence in `nlp.vocab`, given this line: https://github.com/explosion/spaCy/blob/master/spacy/cli/init_model.py#L142

In [28]: x = spacy.load("en_core_sci_sm")

In [29]: doc = x("hello this word smelling is oov.")

In [30]: [t.is_oov for t in doc]
Out[30]: [True, False, False, True, False, True, False]

In [31]: x = spacy.load("en_core_web_sm")

In [32]: doc = x("hello this word smelling is oov.")

In [33]: [t.is_oov for t in doc]
Out[33]: [True, True, True, True, True, True, True]

I've seen previously in #3986 that this was an issue in v2.0, so I double-checked that the model is fresh:

In [36]: x.meta
Out[36]:
{'accuracy': {'ents_f': 85.8587845242,
  'ents_p': 86.3317889027,
  'ents_r': 85.3909350025,
  'las': 89.6616629074,
  'tags_acc': 96.7783856079,
  'token_acc': 99.0697323163,
  'uas': 91.5287392082},
 'author': 'Explosion AI',
 'description': 'English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.',
 'email': 'contact@explosion.ai',
 'lang': 'en',
 'license': 'MIT',
 'name': 'core_web_sm',
 'parent_package': 'spacy',
 'pipeline': ['tagger', 'parser', 'ner'],
 'sources': ['OntoNotes 5'],
 'spacy_version': '>=2.1.0',
 'speed': {'cpu': 6684.8046553827, 'gpu': None, 'nwords': 291314},
 'url': 'https://explosion.ai',
 'version': '2.1.0',
 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}}

Additionally, I'm not quite sure how I've managed to get this behaviour in one of the scispacy models:

In [51]: x = spacy.load("en_core_sci_sm")

In [52]: doc = x("hello this word smelling is oov.")

In [53]: for t in doc:
    ...:     print(t.is_oov, t.text in x.vocab)
    ...:
True True
False True
False True
True True
False True
True False
False True

In [54]: x = spacy.load("en_core_web_sm")

In [55]: doc = x("hello this word smelling is oov.")

In [56]: for t in doc:
    ...:     print(t.is_oov, t.text in x.vocab)
    ...:
True True
True True
True True
True True
True True
True True
True True

So basically, I'm just wondering what the correct interpretation of `Token.is_oov` is 😄

Thanks!

Which page or section is this issue related to?

https://spacy.io/api/token

The intended interpretation is "tokens that don't have a meaningful `.prob` value", which corresponded to words that weren't in the vocab.

This isn't very useful in the `_sm` models, and might not work well in other data packs if a large frequency count wasn't used to build the vocab. So I'm not sure it's always useful.
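For concreteness, a minimal sketch of that intended relationship (spaCy v2.x; whether the probability table is actually populated depends on the model and version, so in a `_sm` model every token may fall back to the same default value):

```python
import spacy

# Sketch: tokens flagged is_oov have no frequency-based probability and fall
# back to the vocab's default OOV value. In models without a probability
# table, every token ends up looking OOV.
nlp = spacy.load("en_core_web_sm")
doc = nlp("hello this word smelling is oov.")
for token in doc:
    print(token.text, token.is_oov, token.prob)
```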

Cool, thanks - I'm not quite sure I understand why this is not useful in the small models? I thought that the small models still use frequency statistics when building the vocab (but don't use vectors). Is the vocabulary of the small models different (aside from being smaller) as well? I think understanding this would help me get to the bottom of this case:

  • t in vocab == True, t.is_oov == True # seems wrong by definition

Is this just an artifact of the way the small models work?

Maybe I'm wrong, but I thought the small models didn't have much vocab?

t in vocab == True, t.is_oov == True

Agree this is confusing :(. The problem is that we do add entries to the vocab during processing, as the vocab also acts as a cache, but if we didn't have an initial probability value for the word, we still mark it as OOV. Open to suggestions for how to improve this.
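A small sketch of that caching behaviour (spaCy v2.x; the exact membership results can differ between models, as the outputs above show, but the point is that vocab membership and `is_oov` are tracked separately):

```python
import spacy

# Sketch of the vocab-as-cache behaviour: membership in nlp.vocab can change
# simply because the pipeline has processed a string, while is_oov keeps
# reflecting whether a probability was loaded when the model was built.
nlp = spacy.load("en_core_web_sm")

print("smelling" in nlp.vocab)                # membership before processing

doc = nlp("hello this word smelling is oov.")
token = doc[3]                                # "smelling"
print(token.text in nlp.vocab, token.is_oov)  # can be (True, True)
```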

Cool, I think I understand how that could happen now, thanks!

I think the utility of knowing whether a token has a `.prob` is not huge - perhaps `t.is_oov == t not in vocab` would be a better definition, but that's probably tricky to change now!
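For reference, a hypothetical helper following that alternative definition (the name `looks_oov` is made up here, not part of spaCy), with the caveat that the vocab caches strings it has seen, so the answer can change once a text has been processed:

```python
import spacy

# Hypothetical helper: treat a token as OOV only if its string is not in the
# vocab at all. Because the vocab caches seen strings, this can flip to False
# for any token once the text has been processed.
def looks_oov(token, vocab):
    return token.text not in vocab

nlp = spacy.load("en_core_web_sm")
doc = nlp("hello this word smelling is oov.")
print([(t.text, looks_oov(t, nlp.vocab)) for t in doc])
```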

This does seem kind of strange. For my use case, I am training the NER with new entities/labels. However, after the training process the words are not added to the vocab, but the NER does correctly label them. One would assume that if the NER can correctly label a word, it should be in the vocab.

@mdgilene Well... the embedding tables use the hashing trick, so they don't require a fixed-size vocabulary to be computed ahead of time. I understand that it's confusing, but I'll close this, as changing the behaviour would introduce a lot of backwards incompatibilities.
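For illustration only, a toy version of the hashing-trick idea (not spaCy's actual implementation): any string, whether or not it was seen in training, hashes to a row of a fixed-size table, so no vocabulary has to be computed ahead of time.

```python
import zlib
import numpy as np

# Toy hashing-trick embedding: a fixed-size table plus a hash function gives
# every possible string a vector, with unrelated words occasionally sharing
# a row (a collision) instead of failing as "out of vocabulary".
num_rows, width = 1024, 8
table = np.random.default_rng(0).normal(size=(num_rows, width))

def embed(word: str) -> np.ndarray:
    row = zlib.crc32(word.encode("utf8")) % num_rows
    return table[row]

print(embed("smelling").shape)  # (8,) even for a word never seen before
```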
