explosion/spaCy

Lexemes are unhashable (v0.101.0)

bwegge opened this issue · 6 comments

When I try to add Lexemes to a set or dict, it fails since Lexemes are unhashable:

cat = nlp.vocab['cat']
dog = nlp.vocab['dog']
my_animals = {cat, dog}

Traceback (most recent call last):

  File "<ipython-input-30-8ffec97fae23>", line 1, in <module>
    my_animals = {cat, dog}

TypeError: unhashable type: 'spacy.lexeme.Lexeme'

Maybe lexeme.orth can be used (together with lexeme.lang) as hash value?
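Until a __hash__ exists, one possible workaround is to key the collection on the integer lexeme.orth rather than on the Lexeme itself. The sketch below is a hedged illustration only. StandInLexeme is a hypothetical class, not spaCy's. It just mimics an object that defines __eq__ without __hash__, which is what makes it unhashable in Python 3.

```python
class StandInLexeme:
    """Hypothetical stand-in for spacy.lexeme.Lexeme: comparable but unhashable."""

    def __init__(self, orth, orth_):
        self.orth = orth    # integer ID of the word form (assumed unique per word)
        self.orth_ = orth_  # the word as a string

    def __eq__(self, other):
        # Defining __eq__ without __hash__ sets __hash__ to None in Python 3.
        return isinstance(other, StandInLexeme) and self.orth == other.orth


cat = StandInLexeme(1, "cat")
dog = StandInLexeme(2, "dog")
cat_again = StandInLexeme(1, "cat")

# Putting the objects straight into a set fails, as in the report above.
try:
    {cat, dog}
    hashable = True
except TypeError:
    hashable = False

# Workaround: use the integer orth as the key and keep the object as the value.
animals_by_orth = {lex.orth: lex for lex in (cat, dog, cat_again)}
```

With a real vocabulary, the same pattern would be {lex.orth: lex for lex in ...}, assuming orth uniquely identifies the word within one language.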

Another odd observation: looking up the same word repeatedly through nlp.vocab[word] returns Lexeme objects at different addresses (comparison still works, thanks to the newly implemented rich comparison):

nlp.vocab['cat']
Out[17]: <spacy.lexeme.Lexeme at 0xe865401e10>

nlp.vocab['cat']
Out[18]: <spacy.lexeme.Lexeme at 0xe865401d80>

To save memory, the Lexeme class is a wrapper around the LexemeC struct. So the Python objects are indeed created afresh each time. You can see the implementation here: https://github.com/spacy-io/spaCy/blob/master/spacy/lexeme.pyx#L31

Adding a __hash__ method is a good idea though. Will do.
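To make the two points above concrete, here is a toy model, not the spaCy implementation. LexemeData, LexemeView and ToyVocab are hypothetical names, and hashing on (orth, lang) is only the suggestion from the original report, assumed here for illustration. Each lookup builds a fresh wrapper around shared data, so identities differ, while __eq__ and __hash__ defined on the same fields let the wrappers work in sets.

```python
class LexemeData:
    """Hypothetical shared per-word record (plays the role of the C struct)."""

    def __init__(self, orth, lang):
        self.orth = orth
        self.lang = lang


class LexemeView:
    """Hypothetical thin wrapper, created anew on every vocabulary lookup."""

    def __init__(self, data):
        self._data = data

    @property
    def orth(self):
        return self._data.orth

    @property
    def lang(self):
        return self._data.lang

    def __eq__(self, other):
        if not isinstance(other, LexemeView):
            return NotImplemented
        return (self.orth, self.lang) == (other.orth, other.lang)

    def __hash__(self):
        # Hash the same fields that __eq__ compares, so equal views hash equally.
        return hash((self.orth, self.lang))


class ToyVocab:
    """Stores one record per word and hands out a new view on each lookup."""

    def __init__(self, lang="en"):
        self._lang = lang
        self._records = {}

    def __getitem__(self, word):
        record = self._records.get(word)
        if record is None:
            record = LexemeData(orth=len(self._records) + 1, lang=self._lang)
            self._records[word] = record
        return LexemeView(record)  # fresh Python object every time


vocab = ToyVocab()
first = vocab["cat"]
second = vocab["cat"]
my_animals = {first, second, vocab["dog"]}
```

Here first and second are distinct objects that compare equal, and the set ends up with one entry per word.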

Sounds reasonable, thanks for the explanation!

Is there a workaround for this in the meantime? I'm new to NLP and trying to follow this guide, specifically the part where it mentions word vector representations.

jr-pe commented

@lylebrown
Replace the curly braces ({ }) with square brackets ([ ]) in the following line, so that it builds a list instead of a set and never has to hash a Lexeme:

allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != "nasa"})

Btw the line should probably be:

allWords = [w for w in parser.vocab if w.has_vector and w.is_lower and w.lower_ != "nasa"]

The old .repvec property is now named .vector, too.

The __hash__ method will be there in the next release.
