This repository contains the source code for Spherical Text Embedding, published at NeurIPS 2019. The code structure (especially the file reading and saving functions) is adapted from the Word2Vec implementation.
- GCC compiler (used to compile the source C file): see the guide for installing GCC.
We provide JoSE embeddings pre-trained on the Wikipedia dump.
Unlike Euclidean embeddings such as Word2Vec and GloVe, spherical embeddings do not necessarily benefit from higher-dimensional space, so it might be a good idea to start with lower-dimensional ones first.
We provide a shell script run.sh for compiling the source file and training embeddings.
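For reference, a minimal sketch of the two steps (the compiler flags and file names here are illustrative assumptions, not necessarily what run.sh uses) could look like:
$ gcc src/jose.c -o src/jose -lm -pthread -O3 -funroll-loops
$ ./src/jose -train text.txt -word-output jose.txt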
Note: When preparing the training text corpus, make sure each line in the file is one document/paragraph (see the example below).
Note: It is recommended to use the default hyperparameters, especially the number of negative samples (-negative) and the loss function margin (-margin).
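As noted above, each line of the training file should hold one full document. For illustration, a tiny hypothetical corpus with two documents would look like:
the quick brown fox jumps over the lazy dog
word and document embeddings are trained jointly on the unit sphere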
Invoke the command without arguments for a list of hyperparameters and their meanings:
$ ./src/jose
Parameters:
-train <file> (mandatory argument)
Use text data from <file> to train the model
-word-output <file>
Use <file> to save the resulting word vectors
-context-output <file>
Use <file> to save the resulting word context vectors
-doc-output <file>
Use <file> to save the resulting document vectors
-size <int>
Set size of word vectors; default is 100
-window <int>
Set max skip length between words; default is 5
-sample <float>
Set threshold for occurrence of words. Those that appear with higher frequency in the
training data will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-3)
-negative <int>
Number of negative examples; default is 2
-threads <int>
Use <int> threads; default is 20
-margin <float>
Margin used in loss function to separate positive samples from negative samples; default is 0.15
-iter <int>
Number of training iterations to run; default is 10
-min-count <int>
This will discard words that appear fewer than <int> times; default is 5
-alpha <float>
Set the starting learning rate; default is 0.04
-debug <int>
Set the debug mode (default = 2 = more info during training)
-save-vocab <file>
The vocabulary will be saved to <file>
-read-vocab <file>
The vocabulary will be read from <file>, not constructed from the training data
-load-emb <file>
The pretrained embeddings will be read from <file>
Examples:
$ ./src/jose -train text.txt -word-output jose.txt -size 100 -margin 0.15 -window 5 -sample 1e-3 -negative 2 -iter 10
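To also save document and context vectors (the output file names below are just placeholders), one could run something like:
$ ./src/jose -train text.txt -word-output jose_word.txt -context-output jose_context.txt -doc-output jose_doc.txt -size 100 -margin 0.15 -window 5 -sample 1e-3 -negative 2 -iter 10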
We provide a shell script eval_sim.sh for word similarity evaluation of trained spherical word embeddings on the Wikipedia dump. The script first downloads a zipped file of the pre-processed Wikipedia dump (retrieved May 2019; roughly 4GB zipped and 13GB unzipped; see the dataset's README file for a detailed description) and then runs JoSE on it. Finally, the trained embeddings are evaluated on three benchmark word similarity datasets: WordSim-353, MEN and SimLex-999.
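Assuming the script sits at the repository root, it can be launched as below; it may need to be made executable first, and keep the ~4GB download in mind:
$ chmod +x eval_sim.sh
$ ./eval_sim.sh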
We provide a shell script eval_cluster.sh for document clustering evaluation of trained spherical document embeddings on the 20 Newsgroups dataset. The script performs K-Means and Spherical K-Means clustering on the trained document embeddings.
We provide a shell script eval_classify.sh for document classification evaluation of trained spherical document embeddings on the 20 Newsgroups dataset. The script performs KNN classification with the trained document embeddings as features, following the original 20 Newsgroups train/test split.
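Likewise, assuming the two scripts are at the repository root and executable, the 20 Newsgroups evaluations can be run with:
$ ./eval_cluster.sh
$ ./eval_classify.sh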
Please cite the following paper if you find the code helpful for your research.
@inproceedings{meng2019spherical,
title={Spherical Text Embedding},
author={Meng, Yu and Huang, Jiaxin and Wang, Guangyuan and Zhang, Chao and Zhuang, Honglei and Kaplan, Lance and Han, Jiawei},
booktitle={Advances in Neural Information Processing Systems},
year={2019}
}