Enhanced word representation for Out-of-Vocabulary on Ubuntu Dialogue Corpus

This is a TensorFlow implementation of the model described in:

Jianxiong Dong, Jim Huang. Enhance Word Representation For Out-of-Vocabulary on Ubuntu Dialogue Corpus.

The model has achieved state-of-the-art performance on the Ubuntu Dialogue Corpus V2 and the Douban Chinese dialogue corpus.

Contact

Code author: Jianxiong Dong

Requirements
Dataset
Training a model
Evaluating a model

Requirements

Install the TensorFlow library (instructions). For example:

virtualenv --system-site-packages tensorflow_dev
source tensorflow_dev/bin/activate
pip install --upgrade pip
pip install tensorflow-gpu==1.4.0

At least 16GB of RAM; 32GB is recommended.
A machine with an NVIDIA GPU (with large GPU memory) is preferable. The code has been tested with an NVIDIA Titan Xp (12GB).

Dataset

We used the Ubuntu Dialogue Corpus V2. To make the results in the above paper easy to reproduce, the processed dataset is provided and can be downloaded as follows:

cd data
sh download.sh

Training a model

Execute the following commands to start the training script. By default it will run for 230k steps to achieve maximum mean reciprocal rank on the validation set.

cd bin
nohup sh ubuntu_train.sh &
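
For readers unfamiliar with the validation metric mentioned above, the following is a minimal sketch of how mean reciprocal rank can be computed for next-utterance selection. It is not taken from this repository: the function name `mean_reciprocal_rank`, the `true_index` parameter, and the assumption that each context comes with a list of candidate scores whose ground-truth response sits at a known position are all illustrative.

```python
def mean_reciprocal_rank(score_lists, true_index=0):
    """Average of 1/rank of the ground-truth response over all contexts.

    score_lists: one list of candidate scores per context (hypothetical layout).
    true_index:  position of the ground-truth response in each list (assumed).
    """
    total = 0.0
    for scores in score_lists:
        true_score = scores[true_index]
        # Rank is 1 plus the number of candidates scored strictly higher
        # than the ground-truth response (ties are resolved optimistically).
        rank = 1 + sum(1 for s in scores if s > true_score)
        total += 1.0 / rank
    return total / len(score_lists)


# Three contexts, ground truth at index 0, ranked 1st, 3rd and 2nd.
example_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.4, 0.6, 0.1],
]
mrr = mean_reciprocal_rank(example_scores)
```

Here the reciprocal ranks are 1, 1/3 and 1/2, so `mrr` is their average, 11/18. The training script may compute the metric differently, for instance in how it breaks ties.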

Evaluating a model

If several runs exist in the 'runs' folder, the checkpoints of the latest run are used to evaluate the model performance.

cd bin
sh ubuntu_test.sh
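
To check which run would count as the latest before evaluating, a sketch along the following lines may help. It assumes that "latest" means the most recently modified subdirectory of 'runs'; the actual test script may decide differently (for example by a timestamp in the directory name), and the helper name `latest_run_dir` is hypothetical.

```python
import os


def latest_run_dir(runs_root="runs"):
    """Return the most recently modified subdirectory of runs_root, or None.

    Assumption: recency is judged by modification time, which may not match
    how the repository's own scripts pick the run.
    """
    if not os.path.isdir(runs_root):
        return None
    candidates = [os.path.join(runs_root, name) for name in os.listdir(runs_root)]
    run_dirs = [path for path in candidates if os.path.isdir(path)]
    if not run_dirs:
        return None
    return max(run_dirs, key=os.path.getmtime)


if __name__ == "__main__":
    print(latest_run_dir("runs"))
```

Run it from the directory that contains the 'runs' folder; it prints `None` when no run exists yet.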
