An autoencoder for computing word embeddings, as described in the Lebret and Collobert (2015) paper.
Runs the model described in http://arxiv.org/abs/1412.4930 on a set of text files stored in a directory, with GPU support.
With GPU support, the code is 5-8x faster than its CPU version.
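At a high level, the model builds a word co-occurrence matrix from the corpus and learns a low-dimensional representation of each word's context distribution. The following is an illustrative numpy sketch of that idea only; it uses a truncated SVD in place of the autoencoder and is not tyrion's actual implementation:

# Illustrative sketch only (not tyrion's implementation): build a word
# co-occurrence matrix within a fixed context window, then compress each
# word's context distribution into a few dimensions (here via truncated SVD,
# standing in for the paper's autoencoder).
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=5):
    # Count how often each vocabulary word appears within `window` positions of another.
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        if w not in index:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in index:
                counts[index[w], index[tokens[j]]] += 1
    return counts

tokens = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(tokens))
C = cooccurrence_matrix(tokens, vocab, window=2)

# Row-normalize to context probabilities, then keep a 2-dimensional projection.
P = C / np.maximum(C.sum(axis=1, keepdims=True), 1.0)
U, S, Vt = np.linalg.svd(P, full_matrices=False)
embeddings = U[:, :2] * S[:2]   # one 2-d vector per vocabulary word
print(dict(zip(vocab, embeddings.round(3))))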
Requirements:
- Numpy http://www.numpy.org/
To install the package, follow the instructions below.
To install from source, run the setup.py file; it will install the necessary files in the system Python directory:
$ python setup.py install
Tyrion can also be installed with pip. Run the following command from the console:
$ pip install tyrion
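After either install, you can verify that the package is importable from a Python console (the module names match the usage examples later in this README):
>> import tyrion
>> from tyrion import train, utils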
To train the model on a set of text files, run the following commands in a Python console.
There is a sample data folder in the project directory that can be used for sanity checking:
>> from tyrion import train
>> train.train('/path/to/text/corpus')
Default hyperparameters are set in the training module. To set hyperparameters manually, pass them explicitly when training (a complete call with sample values is shown after the argument list below):
>> from tyrion import train
>> train.train('/path/to/text/corpus', contextSize=size, min_count=count, newdims=dims, ntimes=ntimes, lr=learningrate)
Arguments explained:
- contextSize is the context window size used for constructing the co-occurrence matrix
- min_count is the minimum frequency threshold for words; garbage words below this frequency are removed
- newdims is the desired dimension of the embeddings
- ntimes is the number of epochs for training the model
- lr is the learning rate that drives the optimization routine
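For example, a complete call with explicit values might look like this (the values below are placeholders for illustration, not recommended settings):
>> from tyrion import train
>> train.train('/path/to/text/corpus', contextSize=5, min_count=10, newdims=100, ntimes=5, lr=0.01)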
To generate word embeddings and to find the closest words to a given word, use the utils module in tyrion. For example:
>> from tyrion import utils
>> embedding = utils.gen_embedding('word')
>> close_words = utils.closest_words('word',n=10)
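The exact return types are not documented here; assuming gen_embedding returns a numeric vector and closest_words returns a list of words, the results can be inspected directly:
>> print(embedding[:5])      # first few embedding dimensions
>> print(close_words)        # the 10 nearest words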
TODO:
- Implement a closely related paper on phrase embeddings (http://arxiv.org/abs/1506.05703) (ICML 2015)
- Try implementing AdaGrad for optimization (a minimal sketch of the AdaGrad update is shown below)
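For reference, AdaGrad keeps a running sum of squared gradients per parameter and scales each parameter's step size by it. A minimal numpy sketch of the update (not tied to tyrion's code):

import numpy as np

def adagrad_step(params, grads, cache, lr=0.01, eps=1e-8):
    # Accumulate squared gradients, then scale the learning rate per parameter.
    cache += grads ** 2
    params -= lr * grads / (np.sqrt(cache) + eps)
    return params, cache

# Example: the cache persists across iterations and shrinks later steps
# along dimensions that have already seen large gradients.
params = np.zeros(3)
cache = np.zeros(3)
grads = np.array([0.5, -0.2, 0.1])
params, cache = adagrad_step(params, grads, cache)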