Lexemes are unhashable (v0.101.0)
bwegge opened this issue · 6 comments
When I try to add Lexemes to a set or dict, it fails since Lexemes are unhashable:
cat = nlp.vocab['cat']
dog = nlp.vocab['dog']
my_animals = {cat, dog}
Traceback (most recent call last):
File "<ipython-input-30-8ffec97fae23>", line 1, in <module>
my_animals = {cat, dog}
TypeError: unhashable type: 'spacy.lexeme.Lexeme'
Maybe lexeme.orth can be used (together with lexeme.lang) as hash value?
Another funny observation is that looking up the same word multiple times through nlp.vocab[word] produces Lexemes at different addresses (although comparison works thanks to the newly implemented rich comparison):
nlp.vocab['cat']
Out[17]: <spacy.lexeme.Lexeme at 0xe865401e10>
nlp.vocab['cat']
Out[18]: <spacy.lexeme.Lexeme at 0xe865401d80>
To save memory, the Lexeme class is a wrapper around the LexemeC struct. So the Python objects are indeed created afresh each time. You can see the implementation here: https://github.com/spacy-io/spaCy/blob/master/spacy/lexeme.pyx#L31
Adding a __hash__ method is a good idea though. Will do.
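For illustration only, here is a rough sketch of how hashing on orth and lang could behave. It uses a pure-Python stand-in, not the real Cython class: `FakeLexeme`, its constructor, and the integer IDs are all hypothetical, and the actual implementation may well differ. The point is that two wrapper objects created separately for the same word can still collapse to one entry in a set, as long as `__eq__` and `__hash__` agree.

```python
class FakeLexeme:
    """Hypothetical stand-in for spacy.lexeme.Lexeme (not the real class)."""

    def __init__(self, orth, lang):
        self.orth = orth  # integer ID of the word's string (made-up values below)
        self.lang = lang  # integer ID of the language (made-up values below)

    def __eq__(self, other):
        if not isinstance(other, FakeLexeme):
            return NotImplemented
        return (self.orth, self.lang) == (other.orth, other.lang)

    def __hash__(self):
        # Hash on the same fields that __eq__ compares, so equal
        # objects always land in the same bucket.
        return hash((self.orth, self.lang))


# Two separate wrapper objects for "cat", as if looked up twice.
cat_a = FakeLexeme(orth=1001, lang=1)
cat_b = FakeLexeme(orth=1001, lang=1)
dog = FakeLexeme(orth=1002, lang=1)

my_animals = {cat_a, cat_b, dog}
```

Here `cat_a` and `cat_b` live at different addresses, but the set ends up holding only two entries.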
Sounds reasonable, thanks for the explanation!
Is there a workaround for this in the meantime? I'm new to NLP and trying to follow this guide, specifically the part where it mentions word vector representations.
@lylebrown
Replace the curly braces ({ }) with square brackets ([ ]) in the following line:
allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != "nasa"})
Btw the line should probably be:
allWords = [w for w in parser.vocab if w.has_vector and w.is_lower and w.lower_ != "nasa"]
The old .repvec property is now named .vector, too.
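If you do need set-like deduplication before the fix lands, one possible approach is to key a dict on the integer .orth attribute instead of on the Lexeme itself. The sketch below is only an illustration: `unique_by_orth` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with made-up IDs so the snippet runs without spaCy. With real lexemes you would pass the list built above.

```python
from types import SimpleNamespace


def unique_by_orth(lexemes):
    """Deduplicate by the integer .orth ID, keeping the first occurrence of each."""
    seen = {}
    for lex in lexemes:
        # Integers are hashable, so .orth can serve as the dict key
        # even when the lexeme object itself cannot.
        seen.setdefault(lex.orth, lex)
    return list(seen.values())


# Stand-in objects; the orth values are invented for this example.
words = [
    SimpleNamespace(orth=1001, orth_="cat"),
    SimpleNamespace(orth=1002, orth_="dog"),
    SimpleNamespace(orth=1001, orth_="cat"),
]
unique_words = unique_by_orth(words)
```

The same idea works for lookups: store `{w.orth: w}` and index it with `some_lexeme.orth`.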
The __hash__ method will be there in the next release.