Improving Neural Topic Models using Knowledge Distillation

Repository for our EMNLP 2020 paper. We plan to clean up the implementation for ease of use; in the meantime, we provide the code from our original submission.

If you use this code, please use the following citation:

@inproceedings{hoyle-etal-2020-improving,
    title = "Improving Neural Topic Models Using Knowledge Distillation",
    author = "Hoyle, Alexander Miserlis  and
      Goel, Pranav  and
      Resnik, Philip",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.137",
    pages = "1752--1771",
}

Rough Steps

  1. As of now, you'll need two conda environments to run both the BERT teacher and topic modeling student (which is a modification of Scholar). The environment files are defined in teacher/teacher.yml and scholar/scholar.yml for the teacher and topic model, respectively. For example: conda env create -f teacher/teacher.yml (edit the first line in the yml file if you want to change the name of the resulting environment; the default is transformers28).

  2. We use the data processing pipeline from Scholar, with the IMDb data as a running example (preprocessing scripts for the Wikitext and 20ng data are also included for replication purposes, but they aren't general-purpose). A sketch of the expected input format follows the commands below:

conda activate scholar
python data/imdb/download_imdb.py

# main preprocessing script
python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist
# create a dev split from the train data (change filenames if using different data)
python create_dev_split.py
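
For reference, here is a minimal sketch of the input the preprocessing script consumes, assuming each line of the .jsonlist file is a JSON object with at least a "text" field (any other fields shown are illustrative):

import json

# Peek at the first few training documents; preprocess_data.py is expected to
# read one JSON object per line and build the vocabulary from the "text" field.
with open("data/imdb/train.jsonlist") as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        print(sorted(doc.keys()), doc["text"][:80])
        if i >= 2:
            break
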
  3. Run the teacher model; below is an example using IMDb (a sketch of what --document-split-pooling does follows the command).
conda activate transformers28

python teacher/bert_reconstruction.py \
    --input-dir ./data/imdb/processed-dev \
    --output-dir ./data/imdb/processed-dev/logits \
    --do-train \
    --evaluate-during-training \
    --truncate-dev-set-for-eval 120 \
    --logging-steps 200 \
    --save-steps 1000 \
    --num-train-epochs 6 \
    --seed 42 \
    --num-workers 4 \
    --batch-size 20 \
    --gradient-accumulation-steps 8 \
    --document-split-pooling mean-over-logits
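
The --document-split-pooling mean-over-logits option governs how the teacher's predictions are combined when a document is split into multiple chunks to fit BERT's input length. A minimal illustrative sketch (tensor shapes and variable names are assumptions, not the actual implementation):

import torch

# Suppose one long document was split into 3 chunks, and the teacher produced
# a vocabulary-sized logit vector for each chunk (vocab size matches the
# --vocab_size 5000 used during preprocessing).
chunk_logits = torch.randn(3, 5000)

# "mean-over-logits": average across chunks to get a single per-document
# logit vector, which is later distilled into the topic model.
doc_logits = chunk_logits.mean(dim=0)
print(doc_logits.shape)  # torch.Size([5000])
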
  4. Collect the logits from the teacher model (the --checkpoint-folder-pattern argument accepts glob pattern matching in case you want to create logits for different stages of training; be sure to enclose the pattern in double quotes). A sketch of the pattern matching follows the command.
conda activate transformers28

python teacher/bert_reconstruction.py \
    --output-dir ./data/imdb/processed-dev/logits \
    --seed 42 \
    --num-workers 6 \
    --get-reps \
    --checkpoint-folder-pattern "checkpoint-9000" \
    --save-doc-logits \
    --no-dev
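
For example, to preview which checkpoint folders a pattern would match before passing it to --checkpoint-folder-pattern (a minimal sketch; the directory layout is an assumption based on the --save-steps checkpoints written above):

from pathlib import Path

# List every saved teacher checkpoint under the output directory; the same
# pattern (in double quotes) can be passed to --checkpoint-folder-pattern.
output_dir = Path("./data/imdb/processed-dev/logits")
for ckpt in sorted(output_dir.glob("checkpoint-*")):
    print(ckpt.name)
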
  5. Run the topic model (there are a number of extraneous experimental arguments in run_scholar.py, which we intend to strip out in a future version). A sketch of how the distillation-related flags interact follows the command.
conda activate scholar

python scholar/run_scholar.py \
    ./data/imdb/processed-dev \
    --dev-metric npmi \
    -k 50 \
    --epochs 500 \
    --patience 500 \
    --batch-size 200 \
    --background-embeddings \
    --device 0 \
    --dev-prefix dev \
    -lr 0.002 \
    --alpha 0.5 \
    --eta-bn-anneal-step-const 0.25 \
    --doc-reps-dir ./data/imdb/processed-dev/logits/checkpoint-9000/doc_logits \
    --use-doc-layer \
    --no-bow-reconstruction-loss \
    --doc-reconstruction-weight 0.5 \
    --doc-reconstruction-temp 1.0 \
    --doc-reconstruction-logit-clipping 10.0 \
    -o ./outputs/imdb
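
To clarify what the distillation-related flags above control, here is an illustrative sketch of how a clipped, temperature-scaled teacher distribution can be combined with a reconstruction term (a simplification under assumed names and shapes, not the exact objective in run_scholar.py):

import torch
import torch.nn.functional as F

def distillation_target(doc_logits, temp=1.0, clip=10.0):
    # --doc-reconstruction-logit-clipping bounds the teacher logits, and
    # --doc-reconstruction-temp softens them into a probability distribution.
    clipped = doc_logits.clamp(min=-clip, max=clip)
    return F.softmax(clipped / temp, dim=-1)

def reconstruction_loss(recon_log_probs, bow_counts, teacher_probs, weight=0.5):
    # --doc-reconstruction-weight trades off the usual bag-of-words term
    # against the teacher's soft targets; --no-bow-reconstruction-loss
    # corresponds to dropping the bag-of-words term entirely.
    bow_term = -(bow_counts * recon_log_probs).sum(-1)
    kd_term = -(teacher_probs * recon_log_probs).sum(-1)
    return ((1 - weight) * bow_term + weight * kd_term).mean()

# Example with random tensors (batch of 2 documents, vocabulary of 5000)
logits = torch.randn(2, 5000)
target = distillation_target(logits, temp=1.0, clip=10.0)
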