Codebase for the EMNLP 2021 Paper "Distilling the Knowledge of Large-scale Generative Models into Retrieval Models for Efficient Open-domain Conversation".

G2R: Distilling the Knowledge of Large-Scale Generative Models into Retrieval Models for Efficient Open-domain Conversation

This is a codebase for the EMNLP 2021 (Findings) Paper, "Distilling the Knowledge of Large-scale Generative Models into Retrieval Models for Efficient Open-domain Conversation".

Paper Link

arXiv preprint: arXiv:2108.12582

Dataset

We provide a link for downloading the dataset used in the paper: the augmented dialogue dataset generated by the data-level G2R, along with model scores generated by the model-level G2R.

Training

Preliminaries

  • Extract the dataset zip file into the datasets/ folder.
  • Activate a virtualenv/conda Python environment and install the requirements.

Dataset preparation

# Assuming that the Blended Skill Talk dataset is already built in ParlAI

PARLAI_DIR=/your/parlai/library/dir
mkdir -p ${PARLAI_DIR}/data/bst_distill
ln -s ${PARLAI_DIR}/data/blended_skill_talk/valid.txt ${PARLAI_DIR}/data/bst_distill/valid.txt
ln -s ${PARLAI_DIR}/data/blended_skill_talk/test.txt ${PARLAI_DIR}/data/bst_distill/test.txt

python3 score_result_to_parlai.py \
  --input-path ./datasets/emnlp_2021_g2r_dataset/bst_data_level_g2r_dialogue.jsonl \
  --output-parlai-path ${PARLAI_DIR}/data/bst_distill/train-g2r-ll.txt \
  --score-name ll

python3 score_result_to_parlai.py \
  --input-path ./datasets/emnlp_2021_g2r_dataset/bst_data_level_g2r_dialogue.jsonl \
  --output-parlai-path ${PARLAI_DIR}/data/bst_distill/train-g2r-mi.txt \
  --score-name mi
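
Before starting training, it can help to confirm that the four files produced by the steps above are in place. The following is a minimal sketch, not part of this repository: the helper name `missing_bst_distill_files` is hypothetical, and it only assumes the file names and the `data/bst_distill` layout used in the commands above.

```python
from pathlib import Path
from typing import List

# File names taken from the dataset preparation commands in this README.
EXPECTED_FILES = ("valid.txt", "test.txt", "train-g2r-ll.txt", "train-g2r-mi.txt")


def missing_bst_distill_files(parlai_dir: str) -> List[str]:
    """Return the expected files that are absent or empty (hypothetical helper)."""
    data_dir = Path(parlai_dir) / "data" / "bst_distill"
    missing = []
    for name in EXPECTED_FILES:
        path = data_dir / name
        # Path.exists() follows symlinks, so a dangling link counts as missing.
        if not path.exists() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_bst_distill_files("/your/parlai/library/dir")` should return an empty list once both symlinks and both converted training files exist.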

Training

  • See scripts/training for the scripts that train the data-level G2R model and the model-level G2R models (LL score / MI score).
  • We assume that INIT_MODEL_PATH contains the ParlAI model path used to initialize the model. Otherwise, training starts from the model trained on the Pushshift dataset.

Inference

See scripts/inference for scripts that generate responses with G2R models and other baselines.

# Inference of G2R based models
./scripts/generate/generate_g2r.sh trained_biencoder_model_path

Automatic Evaluation

Compute the automatic evaluation metrics (Dist-2, Dist-3, and response length) for the generated results.

python3 auto_evaluation.py --result-paths /path/for/generation/result
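
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It is an illustration under stated assumptions, not the code in `auto_evaluation.py`: Dist-n is taken here as the number of unique n-grams divided by the total number of n-grams over all responses, tokens are split on whitespace after lowercasing, and length is the mean token count. The function names `distinct_n` and `average_length` are hypothetical, and the script in this repository may tokenize or aggregate differently.

```python
from collections import Counter
from typing import List


def distinct_n(responses: List[str], n: int) -> float:
    """Corpus-level Dist-n: unique n-grams / total n-grams (illustrative)."""
    ngrams = Counter()
    for response in responses:
        # Assumption: lowercase + whitespace tokenization.
        tokens = response.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


def average_length(responses: List[str]) -> float:
    """Mean number of whitespace-separated tokens per response."""
    lengths = [len(response.split()) for response in responses]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

For example, with the two responses `"i like cats"` and `"i like dogs"`, three of the four bigrams are unique, so `distinct_n(..., 2)` gives 0.75, while both trigrams are unique, so `distinct_n(..., 3)` gives 1.0.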

Citation

If you find our paper or this project helpful for your research, please consider citing our paper in your publications.

@article{kim2021distilling,
  title={Distilling the Knowledge of Large-scale Generative Models into Retrieval Models for Efficient Open-domain Conversation},
  author={Kim, Beomsu and Seo, Seokjun and Han, Seungju and Erdenee, Enkhbayar and Chang, Buru},
  journal={arXiv preprint arXiv:2108.12582},
  year={2021}
}