FastText is a library for text representation and classification from Facebook Research; it implements text classification and word-embedding learning.
fastText4j is a Java implementation of fastText, written in Kotlin. Features:
- 100% Java
- Compatible with the original C++ models: it can directly read any model trained by fastText
- Compatible with the original product-quantizer compression models
- Provides APIs for training, with performance equivalent to the original
- Supports customizing the storage format
- Supports reading model files via mmap
The official model for Chinese Wikipedia is about 2.8 GB, which in general requires at least 4 GB of JVM heap, and loading the model file takes longer than in the C++ version.
With mmap, however, loading can be greatly optimized: it needs only limited RAM and about 3 seconds.
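What mmap loading buys can be illustrated with plain java.nio, which is what memory-mapped reading amounts to on the JVM. This is a minimal sketch of the mechanism, not the library's actual loader:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        // A small temp file standing in for a model file.
        Path path = Files.createTempFile("model", ".bin");
        Files.write(path, "hello fasttext".getBytes(StandardCharsets.UTF_8));

        // Map the file into memory instead of copying it onto the heap:
        // the OS pages data in on demand, so load time and JVM heap stay small.
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }
        Files.delete(path);
    }
}
```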
compile 'com.mayabot:fastText4j:1.2.2'
<dependency>
<groupId>com.mayabot</groupId>
<artifactId>fastText4j</artifactId>
<version>1.2.2</version>
</dependency>
File file = new File("data/fasttext/data.txt");
FastText fastText = FastText.train(file, ModelName.sg);
fastText.saveModel("data/fasttext/model.bin");
data.txt is the training file, encoded in UTF-8. Before training, the text must be segmented into words. By default, character n-grams of length 3-6 are used. Besides skipgram, the cbow algorithm is implemented as well. For further parameter tuning, pass a TrainArgs object.
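The character n-gram scheme can be illustrated with a short sketch (a toy reimplementation, not the library's code). fastText wraps each word in '<' and '>' boundary markers before extracting n-grams:

```java
import java.util.ArrayList;
import java.util.List;

public class CharNgrams {
    // Sketch of fastText-style subword extraction: the word is wrapped in
    // boundary markers '<' and '>' before n-grams are taken.
    static List<String> ngrams(String word, int minn, int maxn) {
        String w = "<" + word + ">";
        List<String> out = new ArrayList<>();
        for (int n = minn; n <= maxn; n++) {
            for (int i = 0; i + n <= w.length(); i++) {
                out.add(w.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // minn=3, maxn=4 for brevity (the text above defaults to 3-6).
        System.out.println(ngrams("cat", 3, 4));
    }
}
```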
File file = new File("data/fasttext/data.txt");
FastText fastText = FastText.train(file, ModelName.sup);
fastText.saveModel("data/fasttext/model.bin");
data.txt is also UTF-8 encoded, with one sample per line, and it must be word-segmented beforehand as well. Each line contains a string starting with __label__, such as __label__正面, representing the classification target; a sample may have multiple labels. The label prefix can be customized through the 'label' attribute of TrainArgs.
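The line format can be illustrated with a small sketch. The extractLabels helper below is hypothetical, written only to show how the prefix marks the targets:

```java
import java.util.ArrayList;
import java.util.List;

public class LabeledLine {
    // Hypothetical helper: collect the targets that start with the label
    // prefix (default "__label__") from one pre-segmented training line.
    static List<String> extractLabels(String line, String prefix) {
        List<String> labels = new ArrayList<>();
        for (String token : line.split(" ")) {
            if (token.startsWith(prefix)) {
                labels.add(token.substring(prefix.length()));
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // One sample per line; this sample carries two labels.
        String line = "__label__正面 __label__评论 这 部 电影 很 好看";
        System.out.println(extractLabels(line, "__label__"));
    }
}
```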
After the model is trained, invoke the predict method to classify.
FastText fastText = FastText.loadFasttextBinModel("data/fasttext/wiki.zh.bin");
fastText.saveModel("data/fasttext/wiki.model");
//predict the labels of a piece of segmented text
FastText fastText = FastText.loadCModel("data/fasttext/wiki.zh.bin");
List<FloatStringPair> predict = fastText.predict(Arrays.asList("fastText在预测标签时使用了非线性激活函数".split(" ")), 5);
FastText fastText = FastText.loadCModel("data/fasttext/wiki.zh.bin");
List<FloatStringPair> predict = fastText.nearestNeighbor("**",5);
Given three words A, B, and C, return the list of nearest words by semantic distance, together with their similarity scores, under the condition (A - B + C).
FastText fastText = FastText.loadCModel("data/fasttext/wiki.zh.bin");
List<FloatStringPair> predict = fastText.analogies("国王","皇后","男",5);
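The analogy query above boils down to vector arithmetic plus a cosine-similarity ranking, which can be shown on toy vectors (the 3-dimensional embeddings below are made up; a real model would supply them):

```java
public class AnalogyDemo {
    // Cosine similarity between two vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Hypothetical toy embeddings; real models use e.g. 100 dimensions.
        double[] king  = {0.9, 0.8, 0.1};
        double[] queen = {0.9, 0.1, 0.8};
        double[] man   = {0.1, 0.9, 0.1};
        double[] woman = {0.1, 0.1, 0.9};

        // Query vector for (king - man + woman).
        double[] q = new double[3];
        for (int i = 0; i < 3; i++) q[i] = king[i] - man[i] + woman[i];

        // The candidate closest to the query wins the analogy.
        System.out.println(cosine(q, queen) > cosine(q, man) ? "queen" : "man");
    }
}
```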
/**
 * Predict labels with a supervised (sup) model
 */
List<FloatStringPair> predict(Iterable<String> tokens, int k)
/**
* nearest neighbor search
* @param word
* @param k k most similar words
*/
List<FloatStringPair> nearestNeighbor(String word, int k)
/**
* Analogies search
* Query triplet (A - B + C)?
*/
List<FloatStringPair> analogies(String A, String B, String C, int k)
/**
* The vector of a certain word
*/
Vector getWordVector(String word)
/**
* The vector of a certain phrase
*/
Vector getSentenceVector(Iterable<String> tokens)
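A common scheme for sentence vectors is averaging the vectors of the tokens; the sketch below assumes that scheme and is not necessarily the library's exact implementation (which may normalize each word vector first):

```java
import java.util.List;
import java.util.Map;

public class SentenceVectorSketch {
    // Average the token vectors; out-of-vocabulary tokens are skipped.
    static double[] average(List<String> tokens, Map<String, double[]> vecs, int dim) {
        double[] sum = new double[dim];
        int count = 0;
        for (String t : tokens) {
            double[] v = vecs.get(t);
            if (v == null) continue; // unknown token
            for (int i = 0; i < dim; i++) sum[i] += v[i];
            count++;
        }
        if (count > 0) for (int i = 0; i < dim; i++) sum[i] /= count;
        return sum;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional word vectors.
        Map<String, double[]> vecs = Map.of(
                "hello", new double[]{1.0, 0.0},
                "world", new double[]{0.0, 1.0});
        double[] s = average(List.of("hello", "world"), vecs, 2);
        System.out.println(s[0] + " " + s[1]);
    }
}
```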
/**
* Save the vector in text format
*/
void saveVectors(String fileName)
/**
* Save the model in binary format
*/
void saveModel(String file)
/**
* Model Training
 * @param trainFile the training file
 * @param model_name
 *   sg  skipgram   use the skipgram algorithm
 *   cow cbow       use the cbow algorithm
 *   sup supervised text classification
 * @param args training parameters
**/
FastText FastText.train(File trainFile, ModelName model_name, TrainArgs args)
/**
 * Load a model saved by the saveModel method
 * @param file model file
 * @param mmap load the model via mmap to speed up loading and save RAM
 */
FastText.loadModel(String file, boolean mmap)
/**
 * Load a model generated by the C++ version (supports both bin and ftz formats).
 */
FastText.loadFasttextBinModel(String binFile)
The parameters are consistent with the C++ version:
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
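What -dsub controls can be shown with a toy product quantizer: the vector is split into sub-vectors of size dsub, and each sub-vector is replaced by the index of its nearest centroid. The codebook below is made up for illustration (the real quantizer learns its centroids, e.g. via k-means):

```java
import java.util.Arrays;

public class PqSketch {
    // Return the index of the centroid nearest to the sub-vector
    // (squared Euclidean distance).
    static int nearest(double[] sub, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < sub.length; i++) {
                double diff = sub[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        int dsub = 2; // as in the -dsub default above
        double[] vec = {0.9, 1.1, -1.0, -0.9}; // one word vector, dim = 4
        // Hypothetical shared codebook: two centroids per sub-space.
        double[][] centroids = {{1.0, 1.0}, {-1.0, -1.0}};

        int[] codes = new int[vec.length / dsub];
        for (int s = 0; s < codes.length; s++) {
            double[] sub = Arrays.copyOfRange(vec, s * dsub, (s + 1) * dsub);
            codes[s] = nearest(sub, centroids);
        }
        // Each pair of doubles is now stored as one small code.
        System.out.println(Arrays.toString(codes));
    }
}
```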
Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.
Please cite [1] if using this code for learning word representations, or [2] if using it for text classification.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2017enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={Transactions of the Association for Computational Linguistics},
volume={5},
year={2017},
issn={2307-387X},
pages={135--146}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@InProceedings{joulin2017bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
month={April},
year={2017},
publisher={Association for Computational Linguistics},
pages={427--431},
}
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
(* These authors contributed equally.)
fastText is BSD-licensed. Facebook provides an additional patent grant.