An Hourglass Transformer VQ-VAE architecture.
The LatentLM project requires a first-stage model capable of compressing very long sequences. We achieve this by combining the Hourglass Transformer with FSQ (finite scalar quantization) and Contrastive Weight Tying to construct an attention-only VQ-VAE architecture.
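
For reference, below is a minimal sketch of an FSQ bottleneck: each latent channel is bounded with tanh and rounded to a small number of levels, with a straight-through estimator for gradients, so no codebook or commitment loss is needed. The level configuration and class name are illustrative, not the project's actual settings.

```python
import torch
import torch.nn as nn


class FSQ(nn.Module):
    """Minimal finite scalar quantization sketch (illustrative only).

    Odd level counts are used here for simplicity; the full FSQ
    formulation adds a half-step offset for even level counts.
    """

    def __init__(self, levels=(7, 5, 5, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def forward(self, z):
        # z: (..., len(levels)); bound each channel to (-half, half),
        # where half is the per-channel number of levels minus one, over two.
        half = (self.levels - 1) / 2
        bounded = torch.tanh(z) * half
        quantized = torch.round(bounded)
        # Straight-through estimator: quantized values in the forward pass,
        # gradients copied from the unquantized path.
        return bounded + (quantized - bounded).detach()
```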
- Linear attention.
- GQA (grouped-query attention).
- FlashAttention2 with sliding window to replace linear attention.
- Attention upsampling to replace linear upsampling (see the sketch after this list).
- (Optional) experiment with adversarial losses (Hourglass VQ-GAN).
- Hyperparameter tuning.
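
As referenced in the upsampling item above, here is a minimal sketch of attention-based upsampling, assuming the Hourglass Transformer scheme: the shortened sequence is naively repeated back to full length, combined with a skip connection from the pre-shortening activations, and used as queries attending over the shortened tokens as keys/values. All names, the shortening factor, and the skip-connection handling are assumptions, not the project's implementation.

```python
import torch
import torch.nn as nn


class AttentionUpsample(nn.Module):
    """Sketch of attention upsampling for an hourglass decoder."""

    def __init__(self, dim, shorten_factor, num_heads=8):
        super().__init__()
        self.shorten_factor = shorten_factor
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, shortened, residual):
        # shortened: (B, T // s, D) compressed sequence from the bottleneck
        # residual:  (B, T, D) pre-shortening activations (U-Net-style skip)
        upsampled = shortened.repeat_interleave(self.shorten_factor, dim=1)
        queries = upsampled + residual
        # Full-resolution queries attend over the shortened keys/values.
        out, _ = self.attn(queries, shortened, shortened)
        return out + queries
```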