avidale/compress-fasttext

Supervised fastText models are not supported

danieldaeschle opened this issue · 12 comments

I want to load the model using gensim:

from gensim.models.fasttext import FastText
FastText.load_fasttext_format("cc.de.300.compressed.bin")

But I get the error:

File "C:\Users\dd\Projects\wordembeddingservice\venv\lib\site-packages\gensim\models\_fasttext_bin.py", line 194, in _load_vocab
    raise NotImplementedError("Supervised fastText models are not supported")
NotImplementedError: Supervised fastText models are not supported

Is there a way to get it working?

As far as I know, there is no way. Supervised FastText embeddings are not supported by Gensim, and this library is only a wrapper around Gensim.

In principle, we could write some code to support supervised embeddings, by bypassing Gensim and loading them directly into some pythonic model. But this would require some work, and we need to justify it.

So why do you want to compress supervised FastText embeddings that are already compressed natively?

I used an unsupervised embedding (cc.de.300.bin). It is 7 GB, and I was not aware that it was already compressed. Is it?

I'm just wondering why it turns from unsupervised into supervised.

Please give me the link to the embeddings you use (and some meta information about them, if available).

Judging by the name "cc.de.300.compressed.bin", the embeddings have already been compressed by someone. And because the native FastText library by Facebook supports compressing only supervised embeddings (that is, embeddings from a classifier model), I guess that this is the case.

The base model does not contain "compressed" in its name. I compressed this one: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz from this list: https://fasttext.cc/docs/en/crawl-vectors.html

It's unsupervised and afaik not compressed. Compressing it results in a 14 MB supervised model (and gensim can't load it anymore). Loading the 7 GB model works with gensim.

Additional question: What happens with words that were not seen during training? Does the compressed model process them the same way the uncompressed one does?

About the model formats.

There are three formats of FastText models, and they have to be loaded in different ways (see the combined sketch after the list):

  1. Facebook native models, such as cc.de.300.bin. They can be loaded with gensim.models.fasttext.FastText.load_fasttext_format(path).
  2. Gensim models. They can be loaded with gensim.models.fasttext.FastTextKeyedVectors.load(path).
  3. Gensim-like compressed models. They can be loaded with compress_fasttext.models.CompressedFastTextKeyedVectors.load(path).
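
For illustration, a minimal sketch with the three calls side by side ('model.gensim' is a hypothetical placeholder path; the other two file names come from this thread):

import gensim
import compress_fasttext

# 1. Facebook native model
native_model = gensim.models.fasttext.FastText.load_fasttext_format('cc.de.300.bin')

# 2. Model saved in gensim's own format ('model.gensim' is a placeholder)
keyed_vectors = gensim.models.fasttext.FastTextKeyedVectors.load('model.gensim')

# 3. Model compressed with this library
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('cc.de.300.compressed.bin')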

In your case, if you have compressed the model with compress_fasttext, you should also load it with this library.
Therefore, the command

compress_fasttext.models.CompressedFastTextKeyedVectors.load("cc.de.300.compressed.bin")

should do the job.

Okay, thank you for your fast reply. I was just wondering because the Gensim FastText class has properties like wv and vector_size. It would have been good to have the same interface on the compressed model. Do you know if this is possible?

Do you know something about my additional question from the post before?

Additional question: What happens with words that were not seen during training? Does the compressed model process them the same way the uncompressed one does?

Yes, all compressed models can process unknown words. However, if you use the compression method based on aggressive pruning (prune_ft_freq), then many word n-grams will lose their embeddings, and for some unknown words the resulting embedding could become zero. If this is unacceptable, please use other, less aggressive, compression methods.
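
To make this concrete, here is a minimal sketch that checks whether an out-of-vocabulary word still receives a non-zero vector; 'my-compressed-model.bin' is a hypothetical path and the test word is an arbitrary rare compound:

import numpy as np
import compress_fasttext

# 'my-compressed-model.bin' is a placeholder for a model compressed with this library
model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('my-compressed-model.bin')

# An OOV word's vector is composed from the embeddings of its surviving n-grams
vector = model['Donaudampfschifffahrtsgesellschaft']
if np.allclose(vector, 0):
    print('All n-grams of this word were pruned away; the vector is zero.')
else:
    print('The OOV word still has a usable vector.')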

I was just wondering because the Gensim FastText class has properties like wv and vector_size. It would have been good to have the same interface on the compressed model. Do you know if this is possible?

The property FastText.wv returns a FastTextKeyedVectors object that can be used to extract word vectors. And CompressedFastTextKeyedVectors is a subclass of FastTextKeyedVectors, so they have nearly identical interfaces.

In other words, you can do everything with a CompressedFastTextKeyedVectors that you used to do with FastText.wv.
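
For instance, the usual gensim KeyedVectors operations should carry over; a sketch using the compressed English model from this project's releases (the query words are arbitrary examples):

import compress_fasttext

small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
# The same operations you would call on FastText.wv:
print(small_model['berlin'][:5])                   # first components of a word vector
print(small_model.most_similar('berlin', topn=3))  # nearest in-vocabulary neighbours
print(small_model.similarity('berlin', 'munich'))  # cosine similarity of two words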

Okay, thank you. Any way to get vector_size using CompressedFastTextKeyedVectors?

Actually, this class does have a vector_size property. Just use it:

# !pip install compress-fasttext
import compress_fasttext
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model.vector_size)  # 300

Which algorithm do you suggest to not lose embeddings for unknown words?

Which algorithm do you suggest to not lose embeddings for unknown words?

Both quantize_ft and prune_ft do try to preserve embeddings for unknown words, but for obvious reasons they still perform worse for out-of-vocabulary words than for in-vocabulary words.
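
As a rough sketch of how these two methods would be applied (assuming quantize_ft and prune_ft are exposed as top-level functions, as their names in this thread suggest; the output paths are placeholders):

import gensim
import compress_fasttext

# Load the big native model; .wv extracts the FastTextKeyedVectors part
big_model = gensim.models.fasttext.FastText.load_fasttext_format('cc.de.300.bin').wv

# quantize_ft: quantization, which tries to preserve embeddings for unknown words
small_q = compress_fasttext.quantize_ft(big_model)
small_q.save('cc.de.300.quantized.bin')

# prune_ft: pruning without the aggressive n-gram removal of prune_ft_freq
small_p = compress_fasttext.prune_ft(big_model)
small_p.save('cc.de.300.pruned.bin')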

Probably, if you train a small FastText model from scratch, as opposed to compressing an existing big one, it will work better for OOV words, but I have not tried this comparison.