This is the implementation of the language model from my Master's thesis. If you are interested in getting the paper, please send me an email at m.lankinen@iki.fi.
To run the model, the following prerequisites must be satisfied.
- Install Python 2.7.x, pip and virtualenv
- Create a new virtual environment and install the following packages:
# Theano and keras and their deps
pip install numpy scipy pyyaml
pip install https://github.com/Theano/Theano/archive/rel-0.8.2.zip
pip install https://github.com/fchollet/keras/archive/1.0.5.zip
pip install Cython
pip install h5py
NOTE: If you are using OS X, these native packages must be installed (using
Homebrew) before you can install the actual Python packages:
# if gfortran is missing
brew install gcc
brew tap homebrew/science
brew install hdf5
Run the C2W2C model with the example data from the data folder:
./run_c2w2c.sh
Run the WordLSTM model with the example data from the data folder:
./run_word_lstm.sh
NOTE: If you are using OS X and Xcode 7.x, you may need to install an older
version of Xcode and set the $DEVELOPER_DIR environment variable to point to
the older installation (see the example in the run script).
Usage: ./run <args>

C2W2C language model

optional arguments:
  -h, --help            show this help message and exit
  --training filename   Training dataset filename
  --test filename       Validation dataset filename
  --data-limit training:validation, -l training:validation
                        Limit data size to the given rows (e.g. "10:1")
  --batch-size n        Number of samples in a single training batch
  --learning-rate num, -r num
  --num-epoch n, -e n   Number of epochs to run
  --load-weights filename
                        File containing the initial model weights
  --save-weights filename
                        Filename where model weights will be saved
  --max-word-length n, -w n
                        Maximum word length (longer words will be truncated)
  --d_C n               Character features vector size
  --d_W n               Word features vector size
  --d_Wi n              Intermediate word LSTM state dimension
  --d_L n               Language model state dimension
  --d_D n               W2C Decoder state dimension
  --gen-text n          Generate N sample sentences after each epoch
  --test-only, -T       Run only PP test and (optional) text generation
  --mode c2w2c|word     Select which mode to run
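
As a rough illustration of two of the options above, the sketch below shows how a --data-limit value such as "10:1" and a --max-word-length limit could be interpreted. The helper names parse_data_limit and truncate_words are hypothetical and are not taken from the project's code, and the sample sentence is made up.

```python
from __future__ import print_function  # keeps print() identical on Python 2.7 and 3


def parse_data_limit(value):
    """Split a "training:validation" string such as "10:1" into two ints.

    Hypothetical helper: the real option parsing may differ.
    """
    training, validation = value.split(":")
    return int(training), int(validation)


def truncate_words(sentence, max_word_length):
    """Cut every whitespace-separated word to at most max_word_length characters."""
    return [word[:max_word_length] for word in sentence.split()]


if __name__ == "__main__":
    n_training, n_validation = parse_data_limit("10:1")
    print("training rows:", n_training, "validation rows:", n_validation)
    # Made-up sample sentence; words longer than 8 characters are cut.
    print(truncate_words("Euroopan parlamentti kokoontuu", 8))
```

Truncation here simply drops the characters past the limit. Whether the model treats truncated words in any special way is not stated in the help text.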
The example data files are taken from the Europarl V7 Finnish corpus and pre-processed with the Apache OpenNLP tokenizer and a Finnish tokenizer model.
- training.txt: 700 sentences / 15k tokens
- validation.txt: 30 sentences / 600 tokens
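
Sentence and token counts like the ones above can be checked with a few lines of Python. This is a minimal sketch that assumes one sentence per line and whitespace-separated tokens, which is a common layout for tokenized corpora but is not confirmed for these files. The function name corpus_stats and the sample lines are made up for illustration.

```python
from __future__ import print_function  # keeps print() identical on Python 2.7 and 3


def corpus_stats(lines):
    """Return (sentence_count, token_count) for an iterable of text lines.

    Assumes one sentence per line and tokens separated by whitespace.
    """
    sentences = 0
    tokens = 0
    for line in lines:
        words = line.split()
        if words:  # ignore blank lines
            sentences += 1
            tokens += len(words)
    return sentences, tokens


if __name__ == "__main__":
    # Made-up sample, not taken from the actual data files.
    sample = ["Istunto on avattu .", "Kiitos , arvoisa puhemies ."]
    print(corpus_stats(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly in place of the sample list.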
License: MIT