fasttext is a Python interface for Facebook fastText.
fasttext supports Python 2.6 or newer. It requires Cython to build the C++ extension.
pip install fasttext
This package has two main use cases: word representation learning and text classification.
These are described in papers [1] and [2], respectively.
To learn word vectors as described in [1], use the fasttext.skipgram and fasttext.cbow functions:
import fasttext
# Skipgram model
model = fasttext.skipgram('data.txt', 'model')
print(model.words)  # list of words in the dictionary
# CBOW model
model = fasttext.cbow('data.txt', 'model')
print(model.words)  # list of words in the dictionary
where data.txt is a training file containing UTF-8 encoded text.
By default, the word vectors take into account character n-grams of 3 to 6 characters.
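To make the n-gram range concrete, here is a minimal sketch of extracting character n-grams from a single word. The `char_ngrams` helper is a hypothetical name used only for illustration and is not part of the fasttext API; it assumes the word is wrapped in `<` and `>` boundary markers before extraction, as described in [1].

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn.

    Assumption: the word is wrapped in '<' and '>' boundary markers
    before the n-grams are extracted, following paper [1].
    """
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Only the 3-grams of "where":
print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

With the default range of 3 to 6, the same word yields 14 n-grams, so short words contribute only a handful of subword units while long words contribute many.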
At the end of optimization the program saves two files: model.bin and model.vec.
model.vec is a text file containing the word vectors, one per line.
model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
The binary file can be used later to compute word vectors or to restart the optimization.
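Because model.vec is plain text, it can be read without the library. The sketch below parses that layout into a dict; `read_vec` is a hypothetical helper, and it assumes the first line is a header holding the word count and the vector dimension, followed by one word and its components per line. The sample data is made up.

```python
import io


def read_vec(stream):
    """Parse word vectors in the textual .vec layout into a dict.

    Assumption: the first line is a header "<word count> <dimension>";
    every following line is a word followed by its vector components.
    """
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Tiny made-up sample standing in for the contents of model.vec.
sample = io.StringIO(u"2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
count, dim, vectors = read_vec(sample)
print(dim, sorted(vectors))
```

For a real file, the same function could be called on a handle opened with `io.open('model.vec', encoding='utf-8')`.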
The following fasttext(1) commands are equivalent:
# Skipgram model
./fasttext skipgram -input data.txt -output model
# CBOW model
./fasttext cbow -input data.txt -output model
The previously trained model can be used to compute word vectors for out-of-vocabulary words.
print(model['king'])  # get the vector of the word 'king'
The following fasttext(1) command is equivalent:
echo "king" | ./fasttext print-vectors model.bin
This writes the vector of the word king to standard output.
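Out-of-vocabulary words can still get a vector because, in the subword model of [1], a word is represented through its character n-grams. The sketch below averages the vectors of the n-grams it finds. `oov_vector` and the `ngram_vectors` table are hypothetical names with made-up values, not part of the fasttext API, and the real library looks n-grams up through hashed buckets rather than a dict.

```python
def oov_vector(word, ngram_vectors, dim, minn=3, maxn=6):
    """Average the vectors of the character n-grams of `word`.

    `ngram_vectors` is a hypothetical dict mapping an n-gram string to a
    list of floats; n-grams missing from the dict are ignored.
    """
    wrapped = '<' + word + '>'
    total = [0.0] * dim
    found = 0
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            vec = ngram_vectors.get(wrapped[start:start + n])
            if vec is not None:
                total = [a + b for a, b in zip(total, vec)]
                found += 1
    if found == 0:
        return total  # no known n-gram: fall back to the zero vector
    return [value / found for value in total]


# Toy 2-dimensional table; the values are made up for illustration.
table = {'<ki': [1.0, 0.0], 'ing': [0.0, 1.0], 'ng>': [1.0, 1.0]}
print(oov_vector('kings', table, dim=2, minn=3, maxn=3))
# [0.5, 0.5]
```

Here "kings" shares the n-grams `<ki` and `ing` with "king", so it receives a vector even though it never appeared as a whole word in the toy table.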
Use fasttext.load_model to load a pre-trained model:
model = fasttext.load_model('model.bin')
print(model.words)  # list of words in the dictionary
print(model['king'])  # get the vector of the word 'king'
This package can also be used to train supervised text classifiers and to load pre-trained classifiers produced by fastText.
To train a text classifier using the method described in [2], use the following function:
classifier = fasttext.supervised('data.train.txt', 'model')
The equivalent fasttext(1) command is:
./fasttext supervised -input data.train.txt -output model
where data.train.txt is a text file containing one training sentence per line along with its labels. By default, labels are assumed to be words prefixed by the string __label__.
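To illustrate the expected line format, the following sketch separates the labels from the text of one training line. `split_labels` is a hypothetical helper written for this example, not a function of the package, and the sample line is made up.

```python
def split_labels(line, label_prefix='__label__'):
    """Split one training line into (labels, text).

    Tokens starting with `label_prefix` are treated as labels (with the
    prefix stripped); all remaining tokens form the text.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(label_prefix):
            labels.append(token[len(label_prefix):])
        else:
            words.append(token)
    return labels, ' '.join(words)


# Made-up example line in the default format.
line = '__label__sports __label__news the match ended in a draw'
print(split_labels(line))
# (['sports', 'news'], 'the match ended in a draw')
```

A quick pass like this over a training file can help confirm that every line carries at least one label before training starts.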
The label prefix can be changed with the label_prefix parameter:
classifier = fasttext.supervised('data.train.txt', 'model', label_prefix='__label__')
The equivalent fasttext(1) command is:
./fasttext supervised -input data.train.txt -output model -label '__label__'
This will output two files: model.bin and model.vec.
Once the model is trained, we can evaluate it by computing the precision at 1 (P@1) and the recall at 1 (R@1) on a test set using the classifier.test function:
result = classifier.test('test.txt')
print('P@1: %s' % result.precision)
print('R@1: %s' % result.recall)
print('Number of examples: %s' % result.nexamples)
This will print the same output to stdout as:
./fasttext test model.bin test.txt
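For readers unfamiliar with these metrics, here is a small sketch of how precision and recall at k can be computed from gold labels and ranked predictions. It is not the library's implementation: `precision_recall_at_k` is a hypothetical helper, the data is made up, and the definitions used (correct predictions divided by predictions made, and by gold labels, respectively) are stated as assumptions in the code.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Compute precision and recall at k over a test set.

    gold:      list of label sets, one per example
    predicted: list of ranked label lists, one per example
    Assumption: precision = correct / number of predictions made and
    recall = correct / number of gold labels, summed over all examples.
    """
    correct = n_predicted = n_gold = 0
    for truth, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    precision = correct / float(n_predicted) if n_predicted else 0.0
    recall = correct / float(n_gold) if n_gold else 0.0
    return precision, recall


# Made-up test set: the second example has two gold labels.
gold = [{'sports'}, {'politics', 'economy'}]
predicted = [['sports', 'news'], ['economy', 'sports']]
print(precision_recall_at_k(gold, predicted, k=1))
```

In this toy data both top predictions are correct, so precision is 1.0, while recall is 2/3 because one of the three gold labels is never predicted at k=1. When every example has exactly one label, the two values coincide.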
To obtain the most likely label for each text in a list, use the classifier.predict method:
texts = ['example very long text 1', 'example very long text 2']
labels = classifier.predict(texts)
print(labels)
# Or with the probability
labels = classifier.predict_proba(texts)
print(labels)
We can specify a value of k to get the k best labels from the classifier:
labels = classifier.predict(texts, k=3)
print(labels)
# Or with the probability
labels = classifier.predict_proba(texts, k=3)
print(labels)
This interface is equivalent to the fasttext(1) predict command. The same model with the same input set will produce the same predictions.
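The idea of "k best labels" can be sketched independently of the library: given a score per label, keep the k highest. The `k_best` helper and the `scores` dict below are hypothetical and use made-up probabilities; they do not reproduce the exact return format of classifier.predict_proba.

```python
import heapq


def k_best(scores, k=1):
    """Return the k highest-scoring (label, probability) pairs.

    `scores` is a hypothetical dict mapping label -> probability for a
    single text; the result is ordered from most to least likely.
    """
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])


# Made-up probabilities for one text.
scores = {'sports': 0.7, 'news': 0.2, 'politics': 0.1}
print(k_best(scores, k=2))
# [('sports', 0.7), ('news', 0.2)]
```

With k=1 the result reduces to the single most likely label, which matches the default behaviour described for the k parameter.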
Train & load skipgram model
model = fasttext.skipgram(params)
List of available params and their default values:
input_file      training file path (required)
output          output file path (required)
lr              learning rate [0.05]
lr_update_rate  change the rate of updates for the learning rate [100]
dim             size of word vectors [100]
ws              size of the context window [5]
epoch           number of epochs [5]
min_count       minimal number of word occurrences [5]
neg             number of negatives sampled [5]
word_ngrams     max length of word ngram [1]
loss            loss function {ns, hs, softmax} [ns]
bucket          number of buckets [2000000]
minn            min length of char ngram [3]
maxn            max length of char ngram [6]
thread          number of threads [12]
t               sampling threshold [0.0001]
silent          disable the log output from the C++ extension [1]
encoding        specify input_file encoding [utf-8]
Example usage:
model = fasttext.skipgram('train.txt', 'model', lr=0.1, dim=300)
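The bucket parameter listed above caps the number of slots that character n-grams are hashed into, which keeps memory bounded no matter how many distinct n-grams the corpus contains. The sketch below shows the general idea, assuming an FNV-1a style 32-bit hash; `fnv1a_32` and `bucket_index` are hypothetical helpers for illustration and are not exposed by the package.

```python
def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261  # FNV offset basis
    for byte in bytearray(text.encode('utf-8')):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, wrapped to 32 bits
    return h


def bucket_index(ngram, bucket=2000000):
    """Map an n-gram to one of `bucket` slots (assumed scheme)."""
    return fnv1a_32(ngram) % bucket


# Two n-grams of "where" mapped into the default number of buckets.
print(bucket_index('<wh'), bucket_index('whe'))
```

A smaller bucket value saves memory at the cost of more collisions, where unrelated n-grams end up sharing one vector.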
Train & load CBOW model
model = fasttext.cbow(params)
List of available params and their default values:
input_file      training file path (required)
output          output file path (required)
lr              learning rate [0.05]
lr_update_rate  change the rate of updates for the learning rate [100]
dim             size of word vectors [100]
ws              size of the context window [5]
epoch           number of epochs [5]
min_count       minimal number of word occurrences [5]
neg             number of negatives sampled [5]
word_ngrams     max length of word ngram [1]
loss            loss function {ns, hs, softmax} [ns]
bucket          number of buckets [2000000]
minn            min length of char ngram [3]
maxn            max length of char ngram [6]
thread          number of threads [12]
t               sampling threshold [0.0001]
silent          disable the log output from the C++ extension [1]
encoding        specify input_file encoding [utf-8]
Example usage:
model = fasttext.cbow('train.txt', 'model', lr=0.1, dim=300)
A .bin file previously trained or generated by fastText can be loaded with this function:
model = fasttext.load_model('model.bin', encoding='utf-8')
Skipgram and CBOW models have the following attributes & methods:
model.model_name # Model name
model.words # List of words in the dictionary
model.dim # Size of word vector
model.ws # Size of context window
model.epoch # Number of epochs
model.min_count # Minimal number of word occurrences
model.neg # Number of negatives sampled
model.word_ngrams # Max length of word ngram
model.loss_name # Loss function name
model.bucket # Number of buckets
model.minn # Min length of char ngram
model.maxn # Max length of char ngram
model.lr_update_rate # Rate of updates for the learning rate
model.t # Value of sampling threshold
model.encoding # Encoding of the model
model[word] # Get the vector of specified word
Train & load the classifier
classifier = fasttext.supervised(params)
List of available params and their default values:
input_file          training file path (required)
output              output file path (required)
label_prefix        label prefix ['__label__']
lr                  learning rate [0.1]
lr_update_rate      change the rate of updates for the learning rate [100]
dim                 size of word vectors [100]
ws                  size of the context window [5]
epoch               number of epochs [5]
min_count           minimal number of word occurrences [1]
neg                 number of negatives sampled [5]
word_ngrams         max length of word ngram [1]
loss                loss function {ns, hs, softmax} [softmax]
bucket              number of buckets [0]
minn                min length of char ngram [0]
maxn                max length of char ngram [0]
thread              number of threads [12]
t                   sampling threshold [0.0001]
silent              disable the log output from the C++ extension [1]
encoding            specify input_file encoding [utf-8]
pretrained_vectors  pretrained word vectors (.vec file) for supervised learning []
Example usage:
classifier = fasttext.supervised('train.txt', 'model', label_prefix='__myprefix__',
                                 thread=4)
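A .bin file previously trained or generated by fastText can be loaded with this function. For example, a classifier trained with the fasttext(1) command below can be loaded by passing the same label prefix: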
./fasttext supervised -input train.txt -output classifier -label 'some_prefix'
classifier = fasttext.load_model('classifier.bin', label_prefix='some_prefix')
This is equivalent to the fasttext(1) test command. Testing with the same model and test set produces the same values for the precision at one and the number of examples.
result = classifier.test(filename, k)
# Properties
result.precision # Precision at one
result.recall # Recall at one
result.nexamples # Number of test examples
The param k is optional and equal to 1 by default.
This interface is equivalent to the fasttext(1) predict command.
texts is a list of strings.
labels = classifier.predict(texts, k)
# Or with probability
labels = classifier.predict_proba(texts, k)
The param k is optional and equal to 1 by default.
The classifier has the following attributes & methods:
classifier.labels # List of labels
classifier.label_prefix # Prefix of the label
classifier.dim # Size of word vector
classifier.ws # Size of context window
classifier.epoch # Number of epochs
classifier.min_count # Minimal number of word occurrences
classifier.neg # Number of negatives sampled
classifier.word_ngrams # Max length of word ngram
classifier.loss_name # Loss function name
classifier.bucket # Number of buckets
classifier.minn # Min length of char ngram
classifier.maxn # Max length of char ngram
classifier.lr_update_rate # Rate of updates for the learning rate
classifier.t # Value of sampling threshold
classifier.encoding # Encoding used by the classifier
classifier.test(filename, k) # Test the classifier
classifier.predict(texts, k) # Predict the most likely label
classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities
The param k for classifier.test, classifier.predict and classifier.predict_proba is optional and equal to 1 by default.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
(* These authors contributed equally.)
- Facebook page: https://www.facebook.com/groups/1174547215919768
- Google group: https://groups.google.com/forum/#!forum/fasttext-library