Latent Diffusion for Language Generation
This is the official code release for
Latent Diffusion for Language Generation.
by Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q. Weinberger
Abstract
Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have presented diffusion as an alternative to autoregressive language generation. We instead view diffusion as a complementary method that can augment the generative capabilities of existing pre-trained language models. We demonstrate that continuous diffusion models can be learned in the latent space of a pre-trained encoder-decoder model, enabling us to sample continuous latent representations that can be decoded into natural language with the pre-trained decoder. We show that our latent diffusion models are more effective at sampling novel text from data distributions than a strong autoregressive baseline and also enable controllable generation.
Citation
@article{lovelace2022latent,
title={Latent Diffusion for Language Generation},
author={Lovelace, Justin and Kishore, Varsha and Wan, Chao and Shekhtman, Eliot and Weinberger, Kilian},
journal={arXiv preprint arXiv:2212.09462},
year={2022}
}
Environment
A suitable environment can be created with the following commands.
conda env create -f environment.yml
python -m spacy download en_core_web_sm
Datasets
The dataset files for the E2E and ROCStories datasets are included in the datasets/
directory and do not require any additional processing. The SST and AG News datasets are loaded from the HuggingFace Hub.
Training
We provide scripts to train the diffusion models for each dataset with our default hyperparameters. Train a model with the command
./scripts/diffusion/text_diffusion_{dataset}.sh
where dataset is one of {roc, e2e, sst2, ag_news}
.
Evaluation
To evaluate a trained model on the validation set, see the scripts/diffusion/eval_text_diffusion.sh
script for an example. The --resume_dir
argument should be updated with the path of a trained model.
Different sampling configurations can be explored by changing the {num_samples, sampling_timesteps, ddim_sampling_eta}
arguments. We utilize 1,000 random samples for computing the metrics in our work. Note that MAUVE scores computed with different numbers of samples are not directly comparable (see here for more information about MAUVE scores).
To evaluate a trained model on the test set with 5 random seeds, see the scripts/diffusion/test_eval_text_diffusion.sh
script for an example. The only difference is that the eval_test
flag is used instead of the eval
flag. The --resume_dir
argument will need to be updated as before.
Contact
Please open an issue if you have any questions about using this repo. I will be updating the repo with the code for the classification experiment and the autoregressive baseline after the holiday season.
Acknowledgement
This work built upon excellent open-source implementations from Lucidrains. Specifically, we adapted his Pytorch DDPM implementation (link) and built upon his transformer implementation (link).