Problem loading back the saved FastTextKeyedVectors
robinp opened this issue · 6 comments
Hello! I tried to compress a fasttext model and then load back the saved gensim model. When loading it, I got this exception:
Python 3.9.7 (default, Sep 10 2021, 14:59:43)
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim
>>> sm = gensim.models.fasttext.FastTextKeyedVectors.load('/root/py/train/eng-small')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 995, in load
return super(FastTextKeyedVectors, cls).load(fname_or_handle, **kwargs)
File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/utils.py", line 487, in load
obj._load_specials(fname, mmap, compress, subname)
File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1019, in _load_specials
self.adjust_vectors() # recompose full-word vectors
File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1177, in adjust_vectors
self.vectors = self.vectors_vocab[:].copy()
TypeError: 'NoneType' object is not subscriptable
Note: saw this warning while compressing:
/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/scipy/cluster/vq.py:607: UserWarning: One of the clusters is empty. Re-run kmeans with a different initialization.
warnings.warn("One of the clusters is empty. "
However, the issue persists on reruns where the warning is not printed.
Pipfile:
...
[packages]
gensim = "==4.1.2"
compress-fasttext = "==0.1.1"
pqkmeans = "*"
python-Levenshtein = "*"
...
The same error also occurs with gensim==4.0.0.
Thank you!
It seems the unpickled object doesn't have the vectors field, which is why adjust_vectors is called; that in turn tries to access vectors_vocab, which is missing (the code at https://github.com/avidale/compress-fasttext/blob/master/compress_fasttext/compress.py#L27 didn't set it).
Why could that field be missing when unpickling? It is there on the model before it is saved.
Hello! Could you please provide a complete code snippet with loading the full model, compressing it, saving the small model and loading it?
If I could reproduce the problem, it would be much easier to solve it.
Hm, https://github.com/RaRe-Technologies/gensim/blob/4.0.0/gensim/models/fasttext.py#L1072 seems to ignore "vectors" on saving. But then how could this work? Or maybe no one has tried to load it back yet.
Re example, yeah, missed it, sorry:
from gensim.models import fasttext
from gensim.test.utils import datapath
import compress_fasttext
""" original to gensim - can skip
print("Loading")
big_model = fasttext.load_facebook_model(datapath("/root/py/train/eng.bin"))
print("Saving back")
big_model.wv.save("/root/py/train/orig.gensim")
"""
print("Load gensim vecs")
loaded = fasttext.FastTextKeyedVectors.load("/root/py/train/orig.gensim")
print("Compressing")
small_model = compress_fasttext.prune_ft_freq(loaded)
print("Saving")
small_model.save('/root/py/train/eng-small2')
print("Load back saved")
sm = fasttext.FastTextKeyedVectors.load('/root/py/train/eng-small2')
Thanks, I think I got it!
The old Gensim models had two equivalent attributes, vectors and vectors_vocab (vectors is calculated from vectors_vocab and vectors_ngrams). This is obviously redundant, so I kept only vectors in the model. In the Gensim update, its developers resolved the redundancy in an alternative way: they decided to save only vectors_vocab, and recompute vectors each time the model is loaded.
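The load-time behaviour described above can be pictured with a toy class. This is only a minimal sketch, not gensim's code: ToyKeyedVectors, its save/load methods, round_trip and the list-based vectors are hypothetical stand-ins, and only the names vectors, vectors_vocab and adjust_vectors come from this thread. It assumes that vectors is dropped on save and rebuilt on load, and it leaves out the n-gram contribution entirely.

```python
import copy
import os
import pickle
import tempfile


class ToyKeyedVectors:
    """Hypothetical stand-in for a keyed-vectors class; not gensim's implementation."""

    def __init__(self, vectors_vocab=None, vectors=None):
        self.vectors_vocab = vectors_vocab  # per-word vectors that get persisted
        self.vectors = vectors              # derived full-word vectors

    def adjust_vectors(self):
        # Rebuild the derived attribute from the persisted one.
        # (The n-gram contribution is left out of this sketch.)
        self.vectors = copy.deepcopy(self.vectors_vocab[:])

    def save(self, path):
        state = dict(self.__dict__)
        state["vectors"] = None  # assumption: the derived attribute is not written
        with open(path, "wb") as handle:
            pickle.dump(state, handle)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as handle:
            state = pickle.load(handle)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        obj.adjust_vectors()  # recompute `vectors` on every load
        return obj


def round_trip(obj):
    """Save `obj` to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.pkl")
        obj.save(path)
        return ToyKeyedVectors.load(path)


# An object that keeps `vectors_vocab` survives the round trip.
ok = round_trip(ToyKeyedVectors(vectors_vocab=[[1.0, 2.0]], vectors=[[1.0, 2.0]]))

# An object that only keeps `vectors` (as the compressed model did) does not.
try:
    round_trip(ToyKeyedVectors(vectors_vocab=None, vectors=[[1.0, 2.0]]))
    error = None
except TypeError as exc:
    error = str(exc)
```

In the second case the stored state has neither attribute filled in, so the recomputation step has nothing to slice, which matches the kind of TypeError shown in the traceback.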
I don't want to store both vectors and vectors_vocab, as in the old Gensim (because it takes disk space). But I also don't want to recompute vectors each time the model loads (because it takes CPU and makes the model load slower).
I will think carefully about how to resolve this. Maybe I will just override _save_specials. Suggestions are welcome.
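To make the trade-off concrete, here is a rough sketch of the "store only vectors, never recompute" idea. It is purely illustrative and makes no claim about how compress-fasttext or gensim's _save_specials actually work: ToyCompressedVectors, to_bytes and from_bytes are hypothetical names, and plain pickle over a state dict stands in for the real save machinery.

```python
import pickle


class ToyCompressedVectors:
    """Hypothetical illustration: persist `vectors` only, skip recomputation on load."""

    def __init__(self, vectors):
        self.vectors = vectors
        self.vectors_vocab = None  # deliberately unused, to save disk space

    def to_bytes(self):
        # Hypothetical counterpart of an overridden save hook:
        # write everything except the redundant attribute.
        state = {k: v for k, v in self.__dict__.items() if k != "vectors_vocab"}
        return pickle.dumps(state)

    @classmethod
    def from_bytes(cls, payload):
        # Restore the state as-is; no adjust_vectors()-style step runs here.
        obj = cls.__new__(cls)
        obj.__dict__.update(pickle.loads(payload))
        obj.vectors_vocab = None
        return obj


original = ToyCompressedVectors(vectors=[[0.5, 0.25], [1.0, 0.0]])
restored = ToyCompressedVectors.from_bytes(original.to_bytes())
```

The cost of this design is that the loader has to be the class's own one: a generic loader that always recomputes vectors from vectors_vocab would still fail on such a file.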
@robinp, I have updated the package so that the models are saved and loaded correctly.
Please update it to compress-fasttext>=0.1.2 and check that the problem is gone. You need to replace the line
sm = fasttext.FastTextKeyedVectors.load('/root/py/train/eng-small2')
with
sm = compress_fasttext.CompressedFastTextKeyedVectors.load('/root/py/train/eng-small2')
because compressed models use optimizations that are not present in FastTextKeyedVectors (or in gensim in general).
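Before switching loaders, it may help to confirm which compress-fasttext version is actually installed in the environment. The snippet below is a small sketch using only the standard library; the helper names version_tuple, is_new_enough and installed_version are made up for illustration, and the version parsing is a simplification that assumes plain dotted numbers such as 0.1.2.

```python
from importlib import metadata


def version_tuple(version):
    """Turn a dotted version such as '0.1.2' into a tuple of ints."""
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break  # stop at suffixes such as 'post1'; a simplification
        parts.append(int(part))
    return tuple(parts)


def is_new_enough(installed, minimum="0.1.2"):
    """Compare two dotted versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package="compress-fasttext"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("compress-fasttext is not installed")
elif is_new_enough(found):
    print(f"compress-fasttext {found} is at least 0.1.2")
else:
    print(f"compress-fasttext {found} predates 0.1.2; please upgrade")
```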
Works like a charm, thank you!