Literature Review on Autoregressive VAEs
Papers marked with "(Cited)" are cited by the previous work Learning to Drop Out: An Adversarial Approach to Training Sequence VAEs, so some of us may have already read them.
Papers on combining Transformer/Seq2Seq models with VAEs
- A Transformer-Based Variational Autoencoder for Sentence Generation
- one of the first attempts to combine Transformers with VAEs
- done in a way that differs from Optimus
- Implicit Deep Latent Variable Models for Text Generation (recommended)
- iVAE: turns the explicit (Gaussian) VAE into an implicit one and relies on sampling.
- also modifies the original KL loss using mutual information (see the ELBO sketch after this list).
- more details in the author's PhD thesis, Towards Effective and Controllable Neural Text Generation
- Variational Transformers for Diverse Response Generation (recommended)
- proposes two ways of combining Transformers and VAEs: GVT and SVT
- GVT encodes the input sentence into a latent vector that is fed to the decoder as its first token, but it still suffers from posterior collapse (see the sketch after this list)
- SVT is interesting: the latent codes are generated autoregressively by the prior network and non-autoregressively by the posterior network.
- Addressing Posterior Collapse with Mutual Information for Improved Variational Neural Machine Translation (Recommended!)
- designs a new objective that avoids weakening the decoder.
- Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space (need to read)
- one of the first attempts to combine BERT + GPT-2 into a VAE model.
- FlowPrior: Learning Expressive Priors for Latent Variable Sentence Models
- data-driven expressive prior
- Less is More: Pretrain a Strong Siamese Encoder for Dense Text Retrieval Using a Weak Decoder
- some analysis of OPTIMUS in the area of Dense Retrieval, claiming OPTIMUS is not as good as the original BERT
- Finetuning Pretrained Transformers into Variational Autoencoders
- very similar to OPTIMUS, except that instead of injecting the latent through GPT-2's self-attention, they inject it through cross-attention in the Transformer
- Transformer-based Conditional Variational Autoencoder for Controllable Story Generation
- this work considers a Conditional VAE instead of a plain VAE; the goal is a little different, so it is slightly off-topic.
- they design a pseudo attention, a new way to inject the encoder-decoder dependency.
- Controlled Text Generation Using Dictionary Prior in Variational Autoencoders (recommended)
- also a data-driven prior, improving on Optimus.
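
Both iVAE and the mutual-information NMT paper above modify how the KL term of the ELBO is penalized. As a hedged reference point (this is the standard aggregate-posterior decomposition, not necessarily the exact objective of either paper), the data-averaged KL term splits into a mutual-information part and an aggregate-posterior KL:

```latex
% Standard ELBO and the decomposition of its (data-averaged) KL term.
% MI-based objectives typically keep or reweight the I_q(x; z) part rather
% than penalizing it; the exact forms differ per paper.
\begin{align}
  \mathcal{L}_{\mathrm{ELBO}}
    &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
       - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big), \\
  \mathbb{E}_{p_d(x)}\Big[\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)\Big]
    &= I_q(x; z) + \mathrm{KL}\big(q_\phi(z) \,\|\, p(z)\big),
  \qquad q_\phi(z) = \mathbb{E}_{p_d(x)}\big[q_\phi(z \mid x)\big].
\end{align}
```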
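
For the GVT-style latent injection mentioned above, here is a minimal PyTorch sketch of the general idea; the module names, dimensions, and the use of nn.TransformerDecoder are my assumptions, not the authors' implementation. The latent vector z is projected into the embedding space and prepended as the first decoder position, so every generated token can attend to it.

```python
import torch
import torch.nn as nn

class LatentAsFirstToken(nn.Module):
    """Decoder conditioned on a sentence-level latent z by prepending it as the
    first position of the target sequence (positional encodings omitted)."""
    def __init__(self, vocab_size=32000, d_model=512, latent_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.z_proj = nn.Linear(latent_dim, d_model)   # map z into token space
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, z, tgt_tokens, memory):
        # z: (B, latent_dim), tgt_tokens: (B, T), memory: (B, S, d_model)
        # memory = encoded source/context states (if the model uses them).
        z_tok = self.z_proj(z).unsqueeze(1)            # (B, 1, d_model)
        tgt = torch.cat([z_tok, self.embed(tgt_tokens)], dim=1)
        T = tgt.size(1)
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=z.device), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        # position i predicts target token i; the latent slot predicts the first word
        return self.lm_head(h)[:, :-1]
```

A cross-entropy loss over these logits plus the KL term then gives the usual ELBO-style training objective.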
Empirical methods to prevent posterior collapse
- (Cited, where everything began) Generating Sentences from a Continuous Space
- introduces KL cost annealing to prevent collapse (see the KL-annealing sketch after this list)
- (Cited) Semi-Amortized Variational Autoencoders
- (Cited) Preventing Posterior Collapse with delta-VAEs
- (Cited) Lagging Inference Networks and Posterior Collapse in Variational Autoencoders
- Preventing posterior collapse in variational autoencoders for text generation via decoder regularization
- uses dropout as augmentation, like SimCSE, to make the hidden space robust (see the dropout-consistency sketch after this list).
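
A minimal sketch of the KL cost annealing trick from Generating Sentences from a Continuous Space, as I understand it; the linear schedule and the warmup_steps value are illustrative assumptions, not the paper's exact settings:

```python
def kl_weight(step, warmup_steps=10000):
    """Linearly anneal the KL weight from 0 to 1 over the first warmup_steps."""
    return min(1.0, step / warmup_steps)

def vae_loss(recon_nll, kl, step):
    # recon_nll: reconstruction negative log-likelihood; kl: KL(q(z|x) || p(z)).
    # Early in training the KL is barely penalized, so the model is pushed to
    # actually encode information in z before the full KL pressure kicks in.
    return recon_nll + kl_weight(step) * kl
```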
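
And a minimal sketch of the dropout-as-augmentation idea in the decoder-regularization paper above, in the SimCSE spirit; the encoder interface, the cosine consistency term, and how it would be weighted into the ELBO are my assumptions, not the paper's exact regularizer:

```python
import torch.nn.functional as F

def dropout_consistency_loss(encoder, input_ids, attention_mask):
    # Two stochastic forward passes over the same sentence give two latent means
    # under different dropout masks (the encoder must be in train mode).
    mu1, _ = encoder(input_ids, attention_mask)
    mu2, _ = encoder(input_ids, attention_mask)
    # Pull the two views together so the latent code is robust to dropout noise;
    # a contrastive (InfoNCE) form with in-batch negatives is another option.
    return (1.0 - F.cosine_similarity(mu1, mu2, dim=-1)).mean()
```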
Theoretical analysis on general posterior collapse
- (Cited) Variational Lossy Autoencoder
- analyzes posterior collapse from the perspective of information theory and coding.
- replaces the original Gaussian prior with an autoregressive flow and claims better results.
- (Cited) Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse
- The Usual Suspects? Reassessing Blame for VAE Posterior Collapse
- this paper gives a good taxonomy of various kinds of posterior collapse.
- not specifically about posterior collapse in autoregressive models
Analysis of posterior collapse specific to autoregressive models
- Variational Attention for Sequence-to-Sequence Models
- Bypass phenomenon: whenever there is a deterministic information pathway (e.g., the decoder attending directly to encoder states), the variational part loses its function and the latent is ignored.
Not very related, but might be insightful
- INSET: Sentence Infilling with INter-SEntential Transformer
- not about VAEs, but contains many training tricks that might be useful.
- Improving Text Generation with Student-Forcing Optimal Transport
- a little off-topic; it is about loss design for autoregressive models and might be useful during training.
- Non-Autoregressive Neural Dialogue Generation
- maybe Transformers and VAEs can be combined in a non-autoregressive way.
- Variational Transformer Networks for Layout Generation
- not on text, but on layout generation
- both autoregressive (GPT-style) and non-autoregressive (BERT-style) variants are studied.
- uses a learned prior distribution.
- Diffusion-LM Improves Controllable Text Generation
- a diffusion model on text! They also generate text non-autoregressively.