lit-seq


This repository is based on Lightning Transformers.

Table of contents

  1. Related papers
  2. Installing from source
  3. Using our CT objective in your work
  4. Test or interact with our pretrained models
  5. Training
    1. Language modeling task
    2. Dialogue task
    3. Evaluation task
  6. Test or interact with your trained model

Related papers

This repository contains the official source code for the following papers:

[1] A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration

If you use this work, please cite our paper:

@article{jiang2022contrastive,
  doi = {10.48550/ARXIV.2205.02517},
  url = {https://arxiv.org/abs/2205.02517},
  author = {Jiang, Shaojie and Zhang, Ruqing and Vakulenko, Svitlana and de Rijke, Maarten},
  title = {A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration},
  publisher = {arXiv},
  year = {2022},
}

[2] Weakly Supervised Turn-level Engagingness Evaluator for Dialogues

@inproceedings{jiang2023weakly,
  author = {Jiang, Shaojie and Vakulenko, Svitlana and de Rijke, Maarten},
  title = {Weakly Supervised Turn-level Engagingness Evaluator for Dialogues},
  year = {2023},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3576840.3578319},
  doi = {10.1145/3576840.3578319},
  keywords = {Conversation analysis, engagingness, user experience},
  location = {Austin, TX, USA},
  series = {CHIIR '23}
}

Installing from source

Clone and change your working directory to this repo's root dir

git clone https://github.com/ShaojieJiang/lit-seq.git
cd lit-seq
pip install . # Tested with Python >= 3.7.0

Using our CT objective in your work

This repo depends on our Python package ct-loss, a PyTorch loss function for reducing generative repetition in auto-regressive language models. Using ct-loss in your work is very simple; please take a look at that repo.
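For a rough picture of what the objective does, here is a minimal sketch under assumptions: it is not the ct-loss package's API, and the function name, signature, and the exact pairwise softplus formulation are illustrative only. The idea is to treat tokens that already appeared in the preceding context as negatives and push their logits below the logit of the ground-truth token, alongside the usual cross-entropy loss:

```python
# Minimal sketch of a token-level contrastive penalty in the spirit of CT.
# NOTE: this is NOT the ct-loss package's API; the function name, signature, and
# the pairwise softplus formulation below are illustrative assumptions only.
import torch
import torch.nn.functional as F


def contrastive_token_penalty(logits, targets, preced_m_negatives=60, pad_id=-100):
    """logits: (batch, seq, vocab), aligned so logits[:, t] scores targets[:, t]."""
    _, seq_len, _ = logits.size()
    valid = targets.ne(pad_id)                 # positions that contribute to the loss
    tgt = targets.clamp(min=0)                 # keep gather() indices in range at pad positions
    pos = logits.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)  # logit of the ground-truth token

    total, count = logits.new_zeros(()), 0
    for t in range(1, seq_len):
        lo = max(0, t - preced_m_negatives) if preced_m_negatives >= 0 else 0
        neg_ids = tgt[:, lo:t]                              # preceding tokens act as negatives
        neg_logits = logits[:, t].gather(-1, neg_ids)
        # Ignore pad positions and never penalize the current ground-truth token itself.
        mask = valid[:, lo:t] & valid[:, t:t + 1] & neg_ids.ne(tgt[:, t:t + 1])
        # Pairwise logistic term: push each negative logit below the positive logit.
        total = total + (F.softplus(neg_logits - pos[:, t:t + 1]) * mask).sum()
        count += mask.sum().item()
    return total / max(count, 1)
```

In practice this penalty would be added to the standard cross-entropy term, and the window of preceding negatives corresponds to the preced_m_negatives option described in the training tables below. For the exact formulation and a tested implementation, rely on the ct-loss package itself.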

Test or interact with our pretrained models

The pretrained checkpoints used in paper [1] are now on the Hugging Face Hub, so you can easily reproduce the results reported in our paper or interact with our pretrained models.

Here is the notebook to interact with our models on Google Colab.

To reproduce the test results on your local server, or to interact with the GPT-2 small model fine-tuned on WikiText-103:

python lit.py --config-name lm backbone.pretrained_model_name_or_path=NeuralNotwork/gpt2-ct stage=[test | interact]

When interacting with a language model, you get continuations of your input prefix.

For the BlenderBot dialogue model:

python lit.py --config-name dialogue_multi backbone.pretrained_model_name_or_path=NeuralNotwork/blenderbot-400M-ct stage=[test | interact]

When interacting with a dialogue model, you get responses to your input message.

If you don't need the W&B logging, add log=False to the above commands.
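As a concrete example, the bracketed stage=[test | interact] notation expands to one choice per run. For instance, to evaluate the GPT-2 checkpoint on the test set without W&B logging (combining the options already shown above):

python lit.py --config-name lm backbone.pretrained_model_name_or_path=NeuralNotwork/gpt2-ct stage=test log=False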

Training

You can also reproduce our training using the instructions below. All the data downloading and preprocessing are taken care of automatically. All default hyper-parameters for reproducing our results are already in their corresponding conf/*.yaml configuration files. Simply run the following commands.

NOTE: Preprocessing large datasets such as WikiText-103 and DSTC8-Reddit may take a long time and require substantial CPU memory and many CPU cores on the first run. Thanks to Hugging Face Datasets, once the datasets are preprocessed and cached locally, subsequent runs need much less memory (25 GB or less) and fewer CPU cores (usually two are enough), and the data loads almost instantly.

Language modeling task

python lit.py --config-name lm dataset.cfg.dataset_config_name=wikitext-103-raw-v1 [OPTIONS]

For customising the training, consider the following options:

| Optional argument | Values | Explanation |
|---|---|---|
| task.cfg.ct_seq_len | Positive integer | Suggested to be 1/4 (rounded) of the cross-entropy sequence length (maximum training length). Defaults to 150. |
| task.cfg.preced_m_negatives | Integer >= -1 | -1 uses all preceding tokens as negatives, 0 uses none, k > 0 uses the preceding k tokens. Suggested to be 1/8 of the cross-entropy sequence length (maximum training length). Defaults to 60. |
| task.cfg.negative_method | ct, ul, nce, simctg | Method used for penalizing negative tokens. ct: contrastive token; ul: unlikelihood training; nce: noise-contrastive estimation; simctg: SimCTG (training objective only). Defaults to ct. |
| task.cfg.ul_seq | True, False | Whether to use sequence-level UL. Defaults to False. |
| task.cfg.simctg | True, False | Whether to use the SimCTG loss. Defaults to False. |
| training.lr | Float | Learning rate. Defaults to 1e-5. |
| trainer.default_root_dir | Path to your checkpoint location | Defaults to ${HOME}/storage/trained/lit/${task.cfg.task_name}/${backbone.pretrained_model_name_or_path}_${dataset.cfg.pretrained_dataset_name}. |
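For example, to fine-tune with unlikelihood training instead of CT and write checkpoints to a custom location (the path below is a placeholder), append the overrides from the table to the base command:

python lit.py --config-name lm dataset.cfg.dataset_config_name=wikitext-103-raw-v1 task.cfg.negative_method=ul trainer.default_root_dir=/path/to/your/checkpoints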

Dialogue task

python lit.py --config-name dialogue_multi [OPTIONS]

For customising the training, consider these options:

| Optional argument | Values | Explanation |
|---|---|---|
| task.cfg.ct_seq_len | Positive integer | Suggested to be 1/4 (rounded) of the cross-entropy sequence length (maximum training length). Defaults to 30. |
| task.cfg.preced_m_negatives | Integer >= -1 | -1 uses all preceding tokens as negatives, 0 uses none, k > 0 uses the preceding k tokens. Suggested to be 1/8 of the cross-entropy sequence length (maximum training length). Defaults to 15. |
| task.cfg.negative_method | ct, ul, nce, simctg | Method used for penalizing negative tokens. ct: contrastive token; ul: unlikelihood training; nce: noise-contrastive estimation; simctg: SimCTG (training objective only). Defaults to ct. |
| task.cfg.ul_seq | True, False | Whether to use sequence-level UL. Defaults to False. |
| task.cfg.simctg | True, False | Whether to use the SimCTG loss. Defaults to False. |
| training.lr | Float | Learning rate. Defaults to 1e-5. |
| trainer.default_root_dir | Path to your checkpoint location | Defaults to ${HOME}/storage/trained/lit/${task.cfg.task_name}/${backbone.pretrained_model_name_or_path}_${dataset.cfg.pretrained_dataset_name}. |
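For example, the CT-related options can be set explicitly to illustrate the override syntax (30 and 15 are simply the documented defaults for this task):

python lit.py --config-name dialogue_multi task.cfg.ct_seq_len=30 task.cfg.preced_m_negatives=15 training.lr=1e-5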

Evaluation task

To reproduce the evaluator training in paper [2]:

python lit.py --config-name rdep_hier_multi dataset.cfg.history_size=3 trainer.default_root_dir='your_path_to_save_checkpoints'

Test or interact with your trained model

To test or interact with the models trained by yourself:

python lit.py --config-name [lm | dialogue_multi] trainer.default_root_dir='your_path_to_saved_checkpoints' stage=[test | interact]

To test the trained evaluator on the FED dataset:

export DATASET=fed # or daily_dialog_engaging

python lit.py --config-name rdep_hier dataset.cfg.history_size=3 trainer.default_root_dir='your_path_to_save_checkpoints' stage=test log=False dataset=nlp/text_regression/${DATASET}

License

Please observe the Apache 2.0 license that is listed in this repository.

Changes to the original repo

A detailed list of changes is coming soon. The code has been tested on the following tasks:

  • Language modeling
  • Conversation