tf-lm

This repository contains scripts for recurrent neural network language modeling with TensorFlow, and a link to pre-trained language models for several benchmarks. The main purpose of tf-lm is to provide a toolkit for researchers who want to use a language model as is, or who do not have much experience with language modeling/neural networks and would like to get started with it.

A description of the toolkit can be found in this paper:

Verwimp, Lyan, Van hamme, Hugo and Patrick Wambacq. 2018. TF-LM: TensorFlow-based Language Modeling Toolkit. In Proceedings of LREC, Miyazaki, Japan, 9-11 May 2018.

The poster presented at LREC 2018 can be found here.

Last update (24/05/2018): made the scripts compatible with TF version 1.8.

!!! Disclaimer: This project is still under development and not everything has been tested very thoroughly yet.

Installation and setup

  • Python version used: 2.7.5.
  • Install TensorFlow. These scripts are compatible with version 1.8.
  • Modify the config files in config/: change the pathnames and optionally the parameters.
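
The README in config/ documents the actual format and the full set of options. Purely as a hypothetical illustration (the option names below are placeholders, not necessarily the toolkit's exact keys), a configuration file is a plain text file mapping options to values:

data_path   /path/to/penn-treebank/
save_path   /path/to/save/the/trained/model/

followed by the model and training parameters (sizes, batching mode, learning rate schedule, etc.).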

Options

For more information on how to specify these options in a configuration file, see the README in config/.

  • Input units: words, characters, character n-grams, or concatenated word and character embeddings [1].
  • Train on the sentence level ('sentence'), with all sentences padded to the length of the longest sentence in the dataset, or train on batches that may contain multiple sentences ('discourse'); see the sketch after this list.
    • e.g. across sentence boundaries (default): "owned by <unk> & <unk> co. was under contract with <unk> to make the cigarette filters <eos> the finding probably"
    • e.g. sentence-level:
      • "<bos> the plant which is owned by <unk> & <unk> co. was under contract with <unk> to make the cigarette filters <eos> @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @"
      • "<bos> the finding probably will support those who argue that the u.s. should regulate the class of asbestos including <\unk> more <unk> than the common kind of asbestos <unk> found in most schools and other buildings dr. <unk> said <eos> @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @"
  • Training schedules:
    • Fixed training schedule
    • Early stopping based on comparison with previous n validation perplexities
    • Learning rate decay
  • Optimizers: stochastic gradient descent, adam, adagrad.
  • Full softmax or sampled softmax.
  • Testing options:
    • Perplexity (of the standard validation set or test set: use the same configuration file as for training, but skip training and testing with --notrain --notest to evaluate the validation set, or skip training and validation with --notrain --novalid to evaluate the test set)
    • Re-scoring: log probabilities per sentence
    • Predicting the next word(s) given a prefix
    • Generate debugging file similar to SRILM's -debug 2 option: can be used to calculate interpolation weights
  • Reading the data all at once or streaming it sentence by sentence.
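
To make the difference between the two batching modes concrete, here is a small standalone Python sketch (not the toolkit's own code; lm_data.py implements the real batching). The '@' padding symbol and the <bos>/<eos> markers follow the examples above:

def sentence_batch(sentences, pad='@', bos='<bos>', eos='<eos>'):
    # 'sentence' mode: add <bos>/<eos> and pad every sentence with '@'
    # up to the length of the longest sentence in the set
    marked = [[bos] + s + [eos] for s in sentences]
    max_len = max(len(s) for s in marked)
    return [s + [pad] * (max_len - len(s)) for s in marked]

def discourse_batches(tokens, batch_size, num_steps):
    # 'discourse' mode: cut the running token stream into batch_size
    # contiguous rows and yield consecutive slices of num_steps tokens,
    # so a single row may run across sentence boundaries
    row_len = len(tokens) // batch_size
    rows = [tokens[i * row_len:(i + 1) * row_len] for i in range(batch_size)]
    for start in range(0, row_len, num_steps):
        yield [row[start:start + num_steps] for row in rows]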

Code overview

Main script: by default, training, validation and testing are all performed.

  • main.py:
    • --config: configuration file specifying all options
    • --notrain: skip training the model
    • --novalid: skip validating the model
    • --notest: skip testing the model
    • --device: pass 'cpu' to run explicitly on the CPU; otherwise the script will try to run on the GPU
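
For example, to validate and test an already trained model on the CPU, skipping training (the configuration file is just one of the examples from config/):

python main.py --config ../config/en-ptb_word_discourse.config --notrain --device cpu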

Other scripts:

  • configuration.py: handles configuration files
  • lm.py: classes for language models
  • lm_data.py: contains several classes for handling the language model data in different ways
  • multiple_lm_data.py: class that handles several lm_data classes (for models with concatenated word and character embedding [1])
  • run_epoch.py: calls lm_data to get batches of data, feeds the batches to the language model and calculates the probability/perplexity
  • trainer.py: classes for training the model
  • writer.py: for writing to multiple output streams

Auxiliary scripts:

  • data_prep: several data preparation scripts, see the readme.
  • make_char_feat_files.py: to create character feature files used for the models described in [1].
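
For reference, the character-word models of [1] feed the LSTM a concatenation of a word embedding and the embeddings of a fixed number of characters of that word. A minimal TensorFlow 1.x sketch of the idea (the sizes, variable names and the choice of 9 characters per word, suggested by the wordchar9 config name, are illustrative assumptions, not the toolkit's own code):

import tensorflow as tf

num_chars = 9                                                  # characters taken per word (assumption)
word_ids = tf.placeholder(tf.int32, [None, None])              # [batch, time]
char_ids = tf.placeholder(tf.int32, [None, None, num_chars])   # [batch, time, num_chars]

word_emb = tf.get_variable("word_emb", [10000, 256])           # [word vocabulary, word embedding size]
char_emb = tf.get_variable("char_emb", [50, 16])               # [character vocabulary, character embedding size]

words = tf.nn.embedding_lookup(word_emb, word_ids)             # [batch, time, 256]
chars = tf.nn.embedding_lookup(char_emb, char_ids)             # [batch, time, num_chars, 16]
batch, time = tf.shape(char_ids)[0], tf.shape(char_ids)[1]
chars = tf.reshape(chars, [batch, time, num_chars * 16])       # flatten the character embeddings per word
lstm_input = tf.concat([words, chars], axis=2)                 # concatenated word-character input to the LSTM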

Possible combinations

Input unit | Batching | Model | Testing options | Example (arguments only)
Word | Discourse | Unidirectional | Perplexity | --config ../config/en-ptb_word_discourse.config (--notrain --novalid)
Word | Discourse | Unidirectional | Rescore | --config ../config/en-ptb_word_discourse_rescore.config
Word | Discourse | Unidirectional | Predict next word(s) | --config ../config/en-ptb_word_discourse_predict.config
Word | Discourse | Unidirectional | Generate debug file | --config ../config/en-ptb_word_discourse_debug2.config
Word | Sentence | Unidirectional | Perplexity | --config ../config/en-ptb_word_sentence.config (--notrain --novalid)
Word | Sentence | Unidirectional | Rescore | --config ../config/en-ptb_word_sentence_rescore.config
Word | Sentence | Unidirectional | Predict next word(s) | --config ../config/en-ptb_word_sentence_predict.config
Word | Sentence | Unidirectional | Generate debug file | --config ../config/en-ptb_word_sentence_debug2.config
Word | Sentence | Bidirectional | Perplexity | --config ../config/en-ptb_word_sentence_bidir.config (--notrain --novalid)
Character | Discourse | Unidirectional | Perplexity | --config ../config/en-ptb_char_discourse.config (--notrain --novalid)
Character | Discourse | Unidirectional | Rescore |
Character | Discourse | Unidirectional | Predict next word(s) |
Word-Character | Discourse | Unidirectional | Perplexity | --config ../config/en-ptb_wordchar9-invert_discourse.config (--notrain --novalid)
Character n-gram | Discourse | Unidirectional | Perplexity | --config ../config/en-ptb_char2gram_discourse.config (--notrain --novalid)

Example commands

For these examples, you can download the Penn TreeBank, WikiText or use your own dataset. The data should be split into train.txt, valid.txt and test.txt, and the correct data path should be specified in the configuration file ('data_path').

Train and evaluate a word-level language model on Penn Treebank:

python main.py --config ../config/en-ptb_word_discourse.config
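
To compute the test set perplexity of the trained model afterwards with the same configuration file, skipping training and validation:

python main.py --config ../config/en-ptb_word_discourse.config --notrain --novalid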

Pre-trained models

We release models pre-trained on Penn TreeBank, WikiText, the Corpus of Spoken Dutch and a corpus of Dutch subtitles.

This work is still in progress, but the models can be found here.

The config files used for the models are:

Contact

If you have any questions, mail to lyan.verwimp [at] esat.kuleuven.be.

[1] Verwimp, L., Pelemans, J., Van hamme, H., and Wambacq, P. (2017). Character-Word LSTM Language Models. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pages 417–427.