
CVAE-GPT2 open-domain chatbot



This project combines NeuralDialog-CVAE, proposed in (Zhao et al., 2017), with the pretrained GPT2 model released by Hugging Face to implement an open-domain chatbot.



  • python == 3.6.8
  • pytorch==1.2.0
  • transformers==2.5.1
  • jsonlines
  • tqdm

To install the required packages with conda, run the following steps:

  1. clone the repo
git clone https://github.com/ssxy00/CVAE-Chatbot
cd CVAE-Chatbot
  2. create virtual environment (optional)
conda create -n cvae_chatbot python==3.6.8
conda activate cvae_chatbot
  3. install packages
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0
python -m pip install transformers==2.5.1
python -m pip install -r requirements.txt
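
After installation, a quick check of the imported versions can catch a mismatched environment early. The following is a minimal sketch, not part of this repo: the helper names `installed_version` and `report` are hypothetical, and it assumes each package exposes a `__version__` attribute.

```python
import importlib

# Versions pinned by this README; adjust if you deliberately use other releases.
EXPECTED = {"torch": "1.2.0", "transformers": "2.5.1"}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def report(expected=EXPECTED):
    """Map each module name to a (found, wanted) pair for a quick visual check."""
    return {name: (installed_version(name), wanted) for name, wanted in expected.items()}


if __name__ == "__main__":
    for name, (found, wanted) in report().items():
        status = "ok" if found is not None and found.startswith(wanted) else "check"
        print(f"{name}: found {found}, expected {wanted} [{status}]")
```

The prefix comparison tolerates build suffixes that some PyTorch builds append to the version string.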


prepare data

The dataset used in this project is the PersonaChat dataset provided in ConvAI2. train_self_original_no_cands.txt is used to train the model and valid_self_original_no_cands.txt is used to evaluate it.

The easiest way to prepare the data is to download my processed dataset here. After that, please unzip the files into ./datasets.

The PersonaChat dataset also provides other files, such as train_self_revised_no_cands.txt, in which the personas are revised. If you want to use these files, you need to:

  • download ConvAI2 dataset: the dataset is available in ParlAI, so first install ParlAI:
git clone https://github.com/facebookresearch/ParlAI
cd ParlAI
# ParlAI now requires PyTorch==1.4, so revert to an earlier version
git reset --hard 1e905fec8ef4876a07305f19c3bbae633e8b33af
# then download data
python examples/display_data.py --task convai2 --datatype train

After running this script, a folder ConvAI2 containing dataset files will be created in ParlAI/data/.

  • process data: then process the dataset with the following script:
# Run this script in the root directory of this repo
export PYTHONPATH=./
python ./prepare_data/preprocess_data.py --raw_data $RAW_DATA --cache_data $CACHE_DATA --gpt2_vocab_path $GPT2_VOCAB_PATH
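
To give a feel for what the raw files contain before preprocessing, here is a minimal sketch of a reader for the `*_no_cands.txt` files. It is not the repo's `preprocess_data.py` (which also takes a GPT2 vocab path and writes a cache file). It assumes the usual ConvAI2 layout: every line starts with a counter that restarts at 1 for each dialogue, persona lines start with `your persona: `, and dialogue lines hold a query and a response separated by a tab. The function name `parse_personachat` is hypothetical.

```python
PERSONA_PREFIX = "your persona: "


def parse_personachat(lines):
    """Group raw lines into dialogues of persona sentences and (query, response) turns."""
    dialogues = []
    current = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        index, _, text = raw.partition(" ")
        # The line counter restarts at 1 whenever a new dialogue begins.
        if index == "1":
            current = {"persona": [], "turns": []}
            dialogues.append(current)
        if current is None:
            continue  # ignore anything that precedes the first dialogue
        if text.startswith(PERSONA_PREFIX):
            current["persona"].append(text[len(PERSONA_PREFIX):])
        else:
            # "no_cands" files hold "query<TAB>response" without a candidate list.
            query, _, response = text.partition("\t")
            current["turns"].append((query, response))
    return dialogues


if __name__ == "__main__":
    sample = [
        "1 your persona: i like to remodel homes.",
        "2 hi , how are you ?\ti am fine .",
    ]
    print(parse_personachat(sample))
```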

prepare pretrained gpt2 model

This project uses the pretrained GPT2 model provided by Hugging Face, so you need to download it in advance. Run the following script to download the gpt2 model:

# Run this script in the root directory of this repo
mkdir -p gpt2/model
mkdir -p gpt2/tokenizer

cd gpt2/model
# download model files
wget https://cdn.huggingface.co/gpt2-pytorch_model.bin
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json
# rename files
mv gpt2-pytorch_model.bin pytorch_model.bin
mv gpt2-config.json config.json

cd ../tokenizer
# download tokenizer files
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
# rename files
mv gpt2-vocab.json vocab.json
mv gpt2-merges.txt merges.txt
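
Since the download involves four files and two renames, it can help to verify the resulting layout before training. The sketch below only checks that the renamed files exist under `./gpt2`. The helper name `missing_gpt2_files` is hypothetical and the check is not part of this repo.

```python
from pathlib import Path

# File names produced by the download-and-rename steps above.
EXPECTED_FILES = {
    "model": ["pytorch_model.bin", "config.json"],
    "tokenizer": ["vocab.json", "merges.txt"],
}


def missing_gpt2_files(root="./gpt2"):
    """Return the expected files that are absent under root, as relative paths."""
    root = Path(root)
    missing = []
    for subdir, names in EXPECTED_FILES.items():
        for name in names:
            if not (root / subdir / name).is_file():
                missing.append(f"{subdir}/{name}")
    return missing


if __name__ == "__main__":
    missing = missing_gpt2_files()
    print("all files present" if not missing else f"missing: {missing}")
```

With all four files in place, the two directories should be loadable through transformers, for example with `GPT2LMHeadModel.from_pretrained("./gpt2/model")` and `GPT2Tokenizer.from_pretrained("./gpt2/tokenizer")`.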


The training script only supports a single-GPU setting:

# Run this script in the root directory of this repo
python train.py --save_model_dir $MODEL_DIR

You can set the following training arguments:

| Argument | Type | Default value | Description |
| --- | --- | --- | --- |
| gpt2_model_dir | str | "./gpt2/model" | path to GPT2 pretrained model parameters |
| gpt2_vocab_path | str | "./gpt2/tokenizer" | path to GPT2 tokenizer vocab file |
| train_dataset | str | "./datasets/train_self_original_no_cands.cache" | cached train dataset path |
| valid_dataset | str | "./datasets/valid_self_original_no_cands.cache" | cached valid dataset path |
| max_seq_len | int | 60 | max sequence length fed into GPT2 |
| max_history | int | 2 | max number of historical conversation turns to use |
| max_context_len | int | 100 | max context sequence length for sentence embedding |
| max_persona_len | int | 70 | max persona sequence length for sentence embedding |
| max_response_len | int | 30 | max response sequence length for sentence embedding |
| seed | int | 0 | random seed |
| device | str | 'cuda' if torch.cuda.is_available() else 'cpu' | "cpu" or "cuda" |
| z_dim | int | 200 | latent hidden state dim (z) |
| n_epochs | int | 1 | number of training epochs |
| batch_size | int | 2 | batch size for training |
| lr | float | 6.25e-5 | learning rate |
| gradient_accumulate_steps | int | 1 | accumulate gradients over several steps |
| clip_grad | float | 1.0 | gradient clipping threshold |
| save_model_dir | str | "./checkpoints" | path to save model checkpoints |
| save_interval | int | 1 | save checkpoint every N epochs |
| model_type | str | "compressed_cvae" | "decoder", "cvae_memory", "cvae_embedding", "compressed_decoder" or "compressed_cvae", see here for detailed description |
| bow | bool | False | add bow loss or not, refer to (Zhao et al., 2017) for detailed explanation |
| kl_coef | float | 1.0 | KL loss coefficient |
| bow_coef | float | 1.0 | bow loss coefficient |
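
To make the roles of `kl_coef`, `bow_coef` and `bow` concrete, here is a minimal sketch of how a CVAE objective in the style of (Zhao et al., 2017) can be assembled. It is an assumption about the general shape of the loss, not a copy of this repo's training code: the function names are hypothetical, plain Python lists stand in for tensors, and the KL term is the closed form between two diagonal Gaussians (recognition network q and prior network p).

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians given as lists of means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
    return 0.5 * total


def total_loss(lm_loss, kl_loss, bow_loss=0.0, kl_coef=1.0, bow_coef=1.0, bow=False):
    """Weighted sum of the terms; the bag-of-words term only counts when bow is enabled."""
    loss = lm_loss + kl_coef * kl_loss
    if bow:
        loss += bow_coef * bow_loss
    return loss


if __name__ == "__main__":
    kl = gaussian_kl([1.0], [0.0], [0.0], [0.0])
    print(total_loss(lm_loss=2.0, kl_loss=kl, bow_loss=1.0, bow=True))
```

With both coefficients left at 1.0 the terms are simply added. Lowering `kl_coef` weakens the pull of the posterior toward the prior.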


After training the model, you can run the following script to evaluate it. The script writes the prediction results to the file you specify. For each test sample, you will get a JSON-format result:

{
"persona": persona_string,
"context": context_string,
"golden_response": target_response_string,
"predict_responses": [candidate_1_string, ..., candidate_n_string],
"predict_f1s": [candidate_1_f1, ..., candidate_n_f1]
}

When the evaluation ends, the average perplexity (ppl) and the average F1 are printed to the terminal. For each sample, the F1 that enters the average is the maximum F1 among its candidates.
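
As an illustration of these two numbers, the sketch below computes a word-overlap F1, takes the maximum over candidates, and derives perplexity from per-token negative log-likelihoods. It assumes plain whitespace tokenization and natural-log losses. The repo's own evaluation may normalize text differently, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_f1(prediction, reference):
    """Harmonic mean of word-level precision and recall between two strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def max_f1(candidates, reference):
    """Best F1 among all candidates of one sample (0.0 if there are none)."""
    return max((word_f1(c, reference) for c in candidates), default=0.0)


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    print(max_f1(["i like dogs", "i love cats"], "i love my cats"))
```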

Set n_outputs to change the number of candidates predicted per sample. A decoder-type model runs beam search (beam_size=n_outputs) and returns all beams. A cvae-type model samples z n_outputs times and runs greedy search for each sample.
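
The cvae-type procedure can be pictured with the following minimal sketch. It is an assumption-laden illustration, not this repo's decoder: it assumes the prior is a diagonal Gaussian given by `mu` and `logvar`, uses plain Python lists instead of tensors, and takes a hypothetical `step_fn(z, tokens)` that returns a dict of next-token scores.

```python
import math
import random


def sample_latents(mu, logvar, n_outputs, seed=0):
    """Draw n_outputs latent vectors z = mu + sigma * eps with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    std = [math.exp(0.5 * lv) for lv in logvar]
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, std)]
            for _ in range(n_outputs)]


def greedy_decode(step_fn, z, max_predict_len=32, eos="<eos>"):
    """Pick the highest-scoring token at every step until eos or the length limit."""
    tokens = []
    for _ in range(max_predict_len):
        scores = step_fn(z, tokens)  # hypothetical: maps token -> score
        token = max(scores, key=scores.get)
        if token == eos:
            break
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    def toy_step(z, tokens):
        # Toy scorer that always prefers "hi", then "there", then the end token.
        order = ["hi", "there", "<eos>"]
        nxt = order[min(len(tokens), 2)]
        return {t: (1.0 if t == nxt else 0.0) for t in order}

    for z in sample_latents([0.0, 0.0], [0.0, 0.0], n_outputs=3):
        print(z, greedy_decode(toy_step, z))
```

Because each candidate starts from a different z, greedy search can still yield different responses, whereas beam search in a decoder-type model gets its variety from the beams themselves.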

# Run this script in the root directory of this repo
export PYTHONPATH=./
MODEL_TYPE={type of model trained}
N_OUTPUTS={number of candidates to predict}
python evaluation/predict.py \
--checkpoint_path $CHECKPOINT_PATH \
--model_type $MODEL_TYPE \
--output_path $OUTPUT_PATH \
--n_outputs $N_OUTPUTS
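
Once the script has finished, the output file can be inspected with a few lines of Python. This sketch assumes the file holds one JSON object per line (as the default name `./result.jsonl` suggests) with the fields listed above. The helper name `summarize` is hypothetical.

```python
import json


def summarize(path):
    """Count the samples and average the best candidate F1 over all of them."""
    n_samples = 0
    f1_sum = 0.0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            f1_sum += max(record["predict_f1s"])
            n_samples += 1
    avg = f1_sum / n_samples if n_samples else 0.0
    return {"samples": n_samples, "avg_max_f1": avg}


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(summarize(sys.argv[1]))
```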

You can set the following evaluation arguments:

| Argument | Type | Default value | Description |
| --- | --- | --- | --- |
| gpt2_model_dir | str | "./gpt2/model" | path to GPT2 pretrained model parameters |
| gpt2_vocab_path | str | "./gpt2/tokenizer" | path to GPT2 tokenizer vocab file |
| valid_dataset | str | "./datasets/valid_self_original_no_cands.cache" | cached valid dataset path |
| output_path | str | "./result.jsonl" | path to output prediction results |
| batch_size | int | 2 | batch size for evaluation |
| max_seq_len | int | 60 | max sequence length fed into GPT2 |
| max_history | int | 2 | max number of historical conversation turns to use |
| max_context_len | int | 100 | max context sequence length for sentence embedding |
| max_persona_len | int | 70 | max persona sequence length for sentence embedding |
| max_response_len | int | 30 | max response sequence length for sentence embedding |
| max_predict_len | int | 32 | max predicted response sequence length |
| n_outputs | int | 3 | how many candidates to generate |
| seed | int | 0 | random seed |
| device | str | 'cuda' if torch.cuda.is_available() else 'cpu' | "cpu" or "cuda" |
| z_dim | int | 200 | latent hidden state dim (z) |
| checkpoint_path | str | "" | path to load model checkpoint |
| model_type | str | "compressed_cvae" | "decoder", "cvae_memory", "cvae_embedding", "compressed_decoder" or "compressed_cvae" |

After getting the prediction results, you can run the following script to get diversity metrics:

python evaluation/evaluate_diversity.py --eval_file $OUTPUT_PATH

This script will output distinct-1, distinct-2 and entropy-4.
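
For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly defined. It is not the code in `evaluation/evaluate_diversity.py`. It assumes whitespace tokenization, n-grams pooled over all responses, and natural-log entropy, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(responses, n):
    """Pool n-gram counts over every response string."""
    return Counter(g for r in responses for g in ngrams(r.split(), n))


def distinct_n(responses, n):
    """Number of unique n-grams divided by the total number of n-grams."""
    counts = ngram_counts(responses, n)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def entropy_n(responses, n):
    """Shannon entropy of the empirical n-gram frequency distribution."""
    counts = ngram_counts(responses, n)
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    responses = ["i like to go hunting", "i like to remodel homes"]
    print(distinct_n(responses, 1), distinct_n(responses, 2), entropy_n(responses, 4))
```

Higher values indicate more varied output: distinct-n rewards responses that do not reuse n-grams, while entropy also takes into account how evenly the n-grams are distributed.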