Hierarchical Encoder Decoder RNN (HRED) with Truncated Backpropagation Through Time (Truncated BPTT) for Dialog Modeling.
The truncated computation splits each document into shorter sequences (e.g. 80 tokens) and computes gradients for each sequence separately, with the hidden states of the RNNs initialized from the preceding sequences (i.e. the hidden states are forward propagated through the previous sequences).
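The splitting scheme can be illustrated with a minimal sketch. It uses a toy scalar RNN in plain Python rather than the repository's Theano model, and `split_into_chunks`, `rnn_forward`, `forward_in_chunks` and the weights `w` and `u` are hypothetical names chosen for illustration. Only the forward pass is shown: the hidden state carried across chunks matches the state obtained from the unsplit sequence, whereas the gradient for each chunk would be computed without backpropagating into earlier chunks.

```python
import math

def split_into_chunks(tokens, max_len=80):
    """Split one document into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def rnn_forward(inputs, h0, w=0.5, u=0.1):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t). Returns the final state."""
    h = h0
    for x in inputs:
        h = math.tanh(w * h + u * x)
    return h

def forward_in_chunks(inputs, max_len=80):
    """Process chunks in order, initializing each from the previous final state."""
    h = 0.0
    for chunk in split_into_chunks(inputs, max_len):
        # In truncated BPTT, the gradient for this chunk would be computed here,
        # treating the incoming h as a constant (no backpropagation into
        # earlier chunks).
        h = rnn_forward(chunk, h)
    return h

document = [float(i % 7) for i in range(200)]  # stand-in for 200 token inputs
chunks = split_into_chunks(document, max_len=80)
print([len(c) for c in chunks])  # [80, 80, 40]
```

Because each chunk starts from the final state of the previous one, `forward_in_chunks(document)` and `rnn_forward(document, 0.0)` yield the same final hidden state; only the backward pass is truncated.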
The script convert-text2dict.py can be used to generate model datasets based on text files with dialogues. It only requires that the document contains end-of-utterance tokens </s> which are used to construct the model graph, since the utterance encoder is only connected to the dialogue encoder at the end of each utterance.
Prepare your dataset as a text file with one document (e.g. movie script or subtitle) per line. The dialogues are assumed to be tokenized. If you have validation and test sets, they must satisfy the same requirements.
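Before converting, it may help to check that a file meets these requirements. The following is a minimal sketch, assuming whitespace-separated tokens; `split_utterances` and `lines_missing_eou` are hypothetical helpers and are not part of convert-text2dict.py.

```python
END_OF_UTTERANCE = "</s>"

def split_utterances(document):
    """Split one tokenized document (a single line) into utterances at </s>."""
    utterances, current = [], []
    for token in document.split():
        if token == END_OF_UTTERANCE:
            utterances.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens with no closing </s>
        utterances.append(current)
    return utterances

def lines_missing_eou(lines):
    """Return 1-based numbers of non-empty lines that contain no </s> token."""
    return [n for n, line in enumerate(lines, start=1)
            if line.strip() and END_OF_UTTERANCE not in line.split()]

sample = [
    "hello there . </s> hi , how are you ? </s>",
    "this line has no utterance boundary",
]
print(split_utterances(sample[0]))
print(lines_missing_eou(sample))  # [2]
```

To check a real file, pass its lines to `lines_missing_eou`, for example `lines_missing_eou(open(path, encoding="utf-8"))`, where `path` is your own dataset file.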
Once you're ready, you can create the model dataset files by running:
python convert-text2dict.py <training_file> --cutoff <vocabulary_size> Training
python convert-text2dict.py <validation_file> --dict=Training.dict.pkl Validation
python convert-text2dict.py <test_file> --dict=Training.dict.pkl Test
where <training_file>, <validation_file> and <test_file> are the training, validation and test files, and <vocabulary_size> is the number of tokens that you want to train on (all tokens except the <vocabulary_size> most frequent ones will be converted to <unk> symbols).
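The effect of the vocabulary cutoff can be sketched as follows. This is an illustration of the idea only, not the logic of convert-text2dict.py: `build_vocab` and `apply_vocab` are hypothetical helpers, and the sketch ignores any special tokens the script adds.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(documents, cutoff):
    """Keep the `cutoff` most frequent tokens across all documents."""
    counts = Counter(tok for doc in documents for tok in doc.split())
    return {tok for tok, _ in counts.most_common(cutoff)}

def apply_vocab(document, vocab):
    """Replace every token outside the vocabulary with <unk>."""
    return " ".join(tok if tok in vocab else UNK for tok in document.split())

docs = ["a b a </s> c a </s>", "b a </s> d </s>"]
vocab = build_vocab(docs, cutoff=3)   # keeps "a", "</s>" and "b"
print(apply_vocab(docs[1], vocab))    # b a </s> <unk> </s>
```

The vocabulary is built from the training documents only and then reused for the validation and test sets, which mirrors the role of the --dict=Training.dict.pkl argument in the commands above.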
NOTE: The script automatically adds the following special tokens specific to movie scripts:
- end-of-utterance: </s>
- end-of-dialogue: </d>
- first speaker: <first_speaker>
- second speaker: <second_speaker>
- third speaker: <third_speaker>
- minor speaker: <minor_speaker>
- voice over: <voice_over>
- off screen: <off_screen>
- pause: <pause>
If these do not exist in your dataset, you can safely ignore them, but remember that your vocabulary will still contain them.
If you have Theano with GPU installed (bleeding edge version), you can train the model as follows:
- Clone the Github repository
- Create new "Output" and "Data" directories inside it.
- Unpack your dataset files into the "Data" directory.
- Create a new prototype inside state.py (look at prototype_movies or prototype_test as examples)
- From the terminal, cd into the code directory and run:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python train.py --prototype <prototype_name> &> Model_Output.txt
For a 7M word dataset, such as the Movie-Scriptolog dataset, training without any pretraining takes about 24 hours to reach convergence.
To test the model afterwards, you can run:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python evaluate.py --exclude-sos --plot-graphs Output/<model_name> --document_ids Data/Test_Shuffled_Dataset_Labels.txt &> Model_Evaluation.txt
where <model_name> is the name automatically generated by train.py.
If your GPU runs out of memory, you can reduce the bs (batch size) parameter inside state.py, but training will be slower. You can also play around with the other parameters inside state.py.