This project combines the NeuralDialog-CVAE proposed in (Zhao et al., 2017) with the GPT2 pretrained model released by Huggingface to implement an open-domain chatbot.
- (Zhao et al., 2017) for the CVAE architecture
- (Li et al., 2020) for the combination method
- transfer-learning-conv-ai for the baseline
- (Ippolito et al., 2019) for diversity evaluation
- (transformer_chatbot) for beam search
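As a rough mental model of the combination (a minimal, hypothetical sketch, not the repo's actual code): a prior/recognition network maps a pooled encoding of persona and context to a diagonal Gaussian, and the sampled latent z is then injected into the GPT2 decoder, e.g. prepended as an extra memory slot or added to the token embeddings:

```python
# Hypothetical sketch of the CVAE latent bottleneck; the repo's real model
# and the injection into GPT2 (memory vs. embedding) may differ in detail.
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Map a condition vector to a diagonal Gaussian and sample z."""
    def __init__(self, cond_dim, z_dim):
        super().__init__()
        self.to_mu_logvar = nn.Linear(cond_dim, 2 * z_dim)

    def forward(self, cond):
        mu, logvar = self.to_mu_logvar(cond).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar

sampler = LatentSampler(cond_dim=768, z_dim=200)  # 768 = GPT2 hidden size, 200 = default z_dim
cond = torch.randn(2, 768)                        # stand-in for pooled persona+context encodings
z, mu, logvar = sampler(cond)
# KL(q(z|x) || N(0, I)) -- the term weighted by kl_coef during training
kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
print(z.shape, kl.item())
```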
- python==3.6.8
- pytorch==1.2.0
- transformers==2.5.1
- jsonlines
- tqdm
To install the required packages with conda, you can run the following script:
```bash
# clone the repo
git clone https://github.com/ssxy00/CVAE-Chatbot
cd CVAE-Chatbot

# create a virtual environment (optional)
conda create -n cvae_chatbot python==3.6.8
conda activate cvae_chatbot

# install packages
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
python -m pip install transformers==2.5.1
python -m pip install -r requirements.txt
```
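A quick optional check that the pinned versions are in place:

```python
# Optional: verify the environment after installation.
import torch
import transformers

print(torch.__version__)          # expect 1.2.0
print(transformers.__version__)   # expect 2.5.1
print(torch.cuda.is_available())  # True if the CUDA 10.0 build works
```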
The dataset used in this project is the PersonaChat dataset provided in ConvAI2. `train_self_original_no_cands.txt` is used to train the model and `valid_self_original_no_cands.txt` is used for evaluation.
The easiest way to prepare the data is to download my processed dataset here. After that, please unzip the files into `./datasets`.
The PersonaChat dataset also provides other variants, such as `train_self_revised_no_cands.txt`, in which the personas are revised. If you want to use these datasets, you need to:
- download the ConvAI2 dataset. The dataset is available in ParlAI, so first install ParlAI:

```bash
git clone https://github.com/facebookresearch/ParlAI
cd ParlAI
# ParlAI now requires PyTorch==1.4, so revert to an earlier version
git reset --hard 1e905fec8ef4876a07305f19c3bbae633e8b33af
# then download the data
python examples/display_data.py --task convai2 --datatype train
```
After running this script, a folder `ConvAI2` containing the dataset files will be created in `ParlAI/data/`.
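For reference, the raw `no_cands` files use ParlAI's plain-text dialogue format: numbered persona lines followed by numbered turns, with the partner's utterance and the reply separated by a tab. An illustrative (not verbatim) excerpt:

```text
1 your persona: i like to go hunting.
2 your persona: i have four sisters.
3 hi , how are you doing today ?	i am spending time with my sisters . what are you up to ?
```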
- process the data. You can then process the dataset with the following script:
```bash
# Run this script in the root directory of this repo
export PYTHONPATH=./
RAW_DATA=/path/to/raw/dataset/file
CACHE_DATA=/path/to/save/processed/dataset/file
GPT2_VOCAB_PATH=/path/to/gpt2/tokenizer/files
python ./prepare_data/preprocess_data.py --raw_data $RAW_DATA --cache_data $CACHE_DATA --gpt2_vocab_path $GPT2_VOCAB_PATH
```
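Conceptually, this step tokenizes each sample and caches the resulting id lists so training can skip tokenization; a hypothetical illustration (the actual `preprocess_data.py` may differ):

```python
# Hypothetical illustration only; the real preprocess_data.py may differ.
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./gpt2/tokenizer")
sample = {
    "persona": "i like to ski . i have four sisters .",
    "history": ["hello what are you doing today ?"],
    "response": "i am spending time with my sisters .",
}
cache_entry = {
    "persona": tokenizer.encode(sample["persona"]),
    "history": [tokenizer.encode(u) for u in sample["history"]],
    "response": tokenizer.encode(sample["response"]),
}
torch.save([cache_entry], "example.cache")  # list with one entry per turn
```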
The project uses the GPT2 pretrained model provided by Huggingface, so you need to download it in advance. You can run the following script to download the `gpt2` model:
```bash
# Run this script in the root directory of this repo
mkdir -p gpt2/model
mkdir -p gpt2/tokenizer
cd gpt2/model
# download model files
wget https://cdn.huggingface.co/gpt2-pytorch_model.bin
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json
# rename files
mv gpt2-pytorch_model.bin pytorch_model.bin
mv gpt2-config.json config.json
cd ../tokenizer
# download tokenizer files
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
# rename files
mv gpt2-vocab.json vocab.json
mv gpt2-merges.txt merges.txt
```
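Once downloaded and renamed, the files should load directly via `from_pretrained`; a quick optional check:

```python
# Optional sanity check: load the downloaded files with transformers==2.5.1.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("./gpt2/model")        # pytorch_model.bin + config.json
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2/tokenizer")  # vocab.json + merges.txt
print(model.config.n_embd)                                     # 768 for the base gpt2 model
print(tokenizer.decode(tokenizer.encode("hello there")))
```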
The training script can only be used in a single-GPU setting:
```bash
# Run this script in the root directory of this repo
MODEL_DIR=/path/to/save/model/checkpoints
python train.py --save_model_dir $MODEL_DIR
```
You can set the following training arguments:
Argument | Type | Default value | Description |
---|---|---|---|
gpt2_model_dir | str | "./gpt2/model" | path to GPT2 pretrained model parameters |
gpt2_vocab_path | str | "./gpt2/tokenizer" | path to GPT2 tokenizer vocab file |
train_dataset | str | "./datasets/train_self_original_no_cands.cache" | path to the cached train dataset |
valid_dataset | str | "./datasets/valid_self_original_no_cands.cache" | path to the cached valid dataset |
max_seq_len | int | 60 | max sequence length fed into GPT2 |
max_history | int | 2 | max number of historical conversation turns to use |
max_context_len | int | 100 | max context sequence length for sentence embedding |
max_persona_len | int | 70 | max persona sequence length for sentence embedding |
max_response_len | int | 30 | max response sequence length for sentence embedding |
seed | int | 0 | random seed |
device | str | 'cuda' if torch.cuda.is_available() else 'cpu' | "cpu" or "cuda" |
z_dim | int | 200 | latent hidden state (z) dimension |
n_epochs | int | 1 | number of training epochs |
batch_size | int | 2 | batch size for training |
lr | float | 6.25e-5 | learning rate |
gradient_accumulate_steps | int | 1 | number of steps to accumulate gradients over |
clip_grad | float | 1.0 | gradient clipping threshold |
save_model_dir | str | "./checkpoints" | path to save model checkpoints |
save_interval | int | 1 | save a checkpoint every N epochs |
model_type | str | "compressed_cvae" | "decoder", "cvae_memory", "cvae_embedding", "compressed_decoder" or "compressed_cvae"; see here for a detailed description |
bow | bool | False | whether to add the BoW loss; refer to (Zhao et al., 2017) for a detailed explanation |
kl_coef | float | 1.0 | KL loss coefficient |
bow_coef | float | 1.0 | BoW loss coefficient |
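As a generic illustration of how gradient_accumulate_steps and clip_grad usually interact (a hypothetical loop, not the repo's actual train.py):

```python
# Hypothetical training-loop skeleton; train.py may structure this differently.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=6.25e-5)
accumulate_steps, clip_grad = 4, 1.0

batches = [torch.randn(2, 10) for _ in range(8)]  # dummy batches, batch_size=2
optimizer.zero_grad()
for i, batch in enumerate(batches):
    loss = model(batch).pow(2).mean() / accumulate_steps  # scale so gradients average
    loss.backward()
    if (i + 1) % accumulate_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
        optimizer.step()
        optimizer.zero_grad()
```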
After training the model, you can run the following script to evaluate it. The script writes the prediction results to the file you specify. For each test sample, you will get a JSON-format result:
```
{
    "persona": persona_string,
    "context": context_string,
    "golden_response": target_response_string,
    "predict_responses": [candidate_1_string, ..., candidate_n_string],
    "predict_f1s": [candidate_1_f1, ..., candidate_n_f1]
}
```
When evaluation ends, the average PPL and average F1 (the maximum F1 among each sample's candidates) are printed to the terminal.
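For example, the average max-F1 can be recomputed from the output file with a few lines (a hypothetical helper that assumes the format shown above):

```python
# Hypothetical helper: recompute average max-F1 from the .jsonl predictions.
import jsonlines

def average_max_f1(path):
    scores = []
    with jsonlines.open(path) as reader:
        for sample in reader:
            scores.append(max(sample["predict_f1s"]))  # best candidate per sample
    return sum(scores) / len(scores)

print(average_max_f1("./result.jsonl"))
```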
You can set n_outputs to modify the number of candidates to predict. For decoder-type models, the model performs beam search (beam_size=n_outputs) and returns all beams. For cvae-type models, the model samples z n_outputs times and performs greedy search.
```bash
# Run this script in the root directory of this repo
export PYTHONPATH=./
CHECKPOINT_PATH=/path/to/model/checkpoint
MODEL_TYPE={type of model trained}
OUTPUT_PATH=/path/to/save/predicted/results
N_OUTPUTS={number of candidates to predict}
python evaluation/predict.py \
    --checkpoint_path $CHECKPOINT_PATH \
    --model_type $MODEL_TYPE \
    --output_path $OUTPUT_PATH \
    --n_outputs $N_OUTPUTS
```
You can set the following evaluation arguments:
Argument | Type | Default value | Description |
---|---|---|---|
gpt2_model_dir | str | "./gpt2/model" | path to GPT2 pretrained model parameters |
gpt2_vocab_path | str | "./gpt2/tokenizer" | path to GPT2 tokenizer vocab file |
valid_dataset | str | "./datasets/valid_self_original_no_cands.cache" | path to the cached valid dataset |
output_path | str | "./result.jsonl" | path to output prediction results |
batch_size | int | 2 | batch size for evaluation |
max_seq_len | int | 60 | max sequence length fed into GPT2 |
max_history | int | 2 | max number of historical conversation turns to use |
max_context_len | int | 100 | max context sequence length for sentence embedding |
max_persona_len | int | 70 | max persona sequence length for sentence embedding |
max_response_len | int | 30 | max response sequence length for sentence embedding |
max_predict_len | int | 32 | max predicted response sequence length |
n_outputs | int | 3 | number of candidates to generate |
seed | int | 0 | random seed |
device | str | 'cuda' if torch.cuda.is_available() else 'cpu' | "cpu" or "cuda" |
z_dim | int | 200 | latent hidden state (z) dimension |
checkpoint_path | str | "" | path to load the model checkpoint |
model_type | str | "compressed_cvae" | "decoder", "cvae_memory", "cvae_embedding", "compressed_decoder" or "compressed_cvae" |
After getting the prediction results, you can run the following script to get diversity metrics:
```bash
OUTPUT_PATH=/path/of/generated/results
python evaluation/evaluate_diversity.py --eval_file $OUTPUT_PATH
```
This script will output distinct-1, distinct-2 and entropy-4.
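For reference, distinct-n measures the ratio of unique n-grams to total n-grams over all generated candidates, and entropy-n is, roughly, the Shannon entropy of the n-gram frequency distribution; a minimal hypothetical sketch of distinct-n (evaluate_diversity.py may differ in details such as tokenization):

```python
# Hypothetical sketch of distinct-n; the repo's script may differ in details
# such as tokenization or whether counts are pooled across samples.
from collections import Counter

def distinct_n(texts, n):
    ngrams = Counter()
    for text in texts:
        tokens = text.split()
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

print(distinct_n(["i like cats", "i like dogs"], 2))  # 3 unique / 4 total = 0.75
```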