The input format should be a JSON Lines file, where each line is a list of utterances:
["hello", "hi", "how are you?", "I am good", "do you like movies?", "No, I don't"]
Each line will then be processed in Python into the following format:
"A:hello\n\n\nB:hi\n\n\nA:how are you?\n\n\nB:I am good\n\n\nA:do you like movies?\n\n\nB:No, I don't\n\n\n"
Download the 30 GB training data:
https://drive.google.com/open?id=1TC4uliw5QqcL8zeDyfKxwV6E5V0LLTCF
Install jsonlines:
pip install jsonlines
Install the apex fork that fixes the build error caused by the CUDA version:
git clone https://github.com/qywu/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Install TorchFly from Qingyang Wu's GitHub:
git clone https://github.com/qywu/TorchFly.git
cd TorchFly
pip install -e .
pip install -r requirements.txt
Install TensorBoard:
pip install tensorboard
Install transformers from Hugging Face:
pip install transformers
Install AllenNLP:
pip install allennlp
Run bot.py to chat with the bot. You can use a different model by changing the model location; the default location is Checkpoint/best.pth.
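For reference, a minimal illustrative chat loop, assuming the checkpoint is a Hugging Face GPT-2 state dict and the dialogue history uses the "A:...\n\n\nB:..." format above (bot.py's actual implementation may differ):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # "gpt2" as the base architecture is an assumption, not confirmed by the repo
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.load_state_dict(torch.load("Checkpoint/best.pth", map_location="cpu"))
    model.eval()

    history = ""
    while True:
        history += f"A:{input('You: ')}\n\n\nB:"
        input_ids = tokenizer.encode(history, return_tensors="pt")
        output = model.generate(
            input_ids, max_length=input_ids.shape[1] + 50,
            do_sample=True, top_p=0.9, pad_token_id=tokenizer.eos_token_id,
        )
        reply = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
        reply = reply.split("\n")[0].strip()
        print("Bot:", reply)
        history += reply + "\n\n\n"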
The model is about 500 MB and the optimizer state is about 1000 MB. We keep the ten most recent model+optimizer checkpoints (about 15 GB) plus one model-only checkpoint every 4 hours, which comes to roughly 50 GB in total.
If training is interrupted, find the most recent checkpoint and reload both the model and the optimizer. You also need to find the rough position in the dataset and reset the learning-rate schedule.
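A sketch of resuming after an interruption (the checkpoint path and the keys "model", "optimizer", and "step" are assumptions about the saved format):

    import torch

    def resume_from_checkpoint(model, optimizer, scheduler, path="Checkpoint/latest.pth"):
        checkpoint = torch.load(path, map_location="cpu")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
        # reset the LR scheduler, then fast-forward it to the saved global step
        for _ in range(checkpoint["step"]):
            scheduler.step()
        # use the returned step to skip to the rough position in the dataset
        return checkpoint["step"]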
Note: an out-of-memory error on one GPU will not crash training, but it is still not acceptable, because losing a GPU means a different hyperparameter set is needed. Make sure all GPUs are working correctly at all times.
- Use HDF5 or another method so the model can train under limited CPU memory (a sketch is included after this list): https://towardsdatascience.com/hdf5-datasets-for-pytorch-631ff1d750f5
- MultiWOZ, PersuasionForGood, CamRest
- Double-role training or single-role training
- Ablation study on randomized start positions
- The current hyperparameter setting may not be good. It may be better to use the hyperparameters from Microsoft's DialoGPT (https://github.com/microsoft/DialoGPT/blob/master/LSP_train.py); they mention that the learning rate depends on model size, meaning they tried different learning rates?
- More data processing? e.g., deleting repetitive n-gram dialogues?
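A minimal sketch of an HDF5-backed dataset in the spirit of the linked article (the file name dialogues.h5 and the key "dialogues" are hypothetical):

    import h5py
    import torch
    from torch.utils.data import Dataset

    class H5DialogueDataset(Dataset):
        """Reads tokenized dialogues lazily from disk instead of holding them all in CPU memory."""

        def __init__(self, path="dialogues.h5", key="dialogues"):
            self.path, self.key, self._file = path, key, None
            with h5py.File(path, "r") as f:
                self._len = len(f[key])

        def __len__(self):
            return self._len

        def __getitem__(self, idx):
            # open lazily so each DataLoader worker gets its own file handle
            if self._file is None:
                self._file = h5py.File(self.path, "r")
            return torch.as_tensor(self._file[self.key][idx])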
We will run three experiments: PersuasionForGood, MultiWOZ, and CamRest.