[arXiv:2406.07548] Image and Video Tokenization with Binary Spherical Quantization

🍰 BSQ-ViT

You can pronounce BSQ-ViT like "biskvit" (a kind of Russian sponge cake) or simply "biscuit".

Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao¹, Yuanjun Xiong², Philipp Krähenbühl¹
¹UT Austin, ²Predera
arxiv | bibtex
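
For orientation, the quantizer named in the title can be sketched in a few lines of plain Python. This is a minimal illustration of the idea as described in the paper (project the latent vector onto the unit hypersphere, then keep only the sign of each coordinate, scaled by 1/sqrt(L)), not the code in this repository. The function names, the choice to treat zero as positive, and the bit order used for the token id are assumptions made for this example.

```python
import math

def bsq_quantize(v):
    """Project v onto the unit sphere, then snap each coordinate to +/- 1/sqrt(L)."""
    L = len(v)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    u = [x / norm for x in v]                       # point on the unit hypersphere
    bits = [1 if x >= 0 else 0 for x in u]          # one bit per dimension (assumption: 0 counts as positive)
    scale = 1.0 / math.sqrt(L)
    u_hat = [scale if b else -scale for b in bits]  # quantized point, again unit norm
    return bits, u_hat

def bits_to_index(bits):
    """Pack sign bits into one integer token id (assumption: first bit is most significant)."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index

# Toy 4-dimensional latent; the models in this README use 18 or 36 dimensions.
bits, u_hat = bsq_quantize([0.3, -1.2, 0.5, -0.1])
token = bits_to_index(bits)
```

Because each dimension contributes exactly one bit, an 18-bit tokenizer has an implicit codebook of 2^18 = 262,144 entries and a 36-bit one has 2^36, without storing any codebook vectors.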

Installation

  1. Install Miniforge3

  2. Create the environment

mamba env create -f bsqvit-env.yaml
mamba activate bsqvit

Main Results

Image Reconstruction (IN-1K val 256x256)

| Method | Use approx. (Eq 8) | #bits | PSNR↑ | SSIM↑ | LPIPS↓ | rFID↓ | config & ckpt | md5sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SDXL-VAE | N/A | 64 | 25.38 | .7276 | .0666 | 0.72 | External | N/A |
| BSQ-ViT | | 18 | 24.79 | .7319 | .0836 | 1.34 | UTBox | 7abf5a |
| BSQ-ViT (EMA) | | 18 | 24.80 | .7314 | .0820 | 1.23 | UTBox | 7abf5a |
| BSQ-ViT | ✓ | 18 | 25.36 | .7578 | .0761 | 1.14 | UTBox | 8f5422 |
| BSQ-ViT (EMA) | ✓ | 18 | 25.80 | .7680 | .0729 | 1.30 | UTBox | 8f5422 |
| BSQ-ViT | ✓ | 36 | 27.88 | .8410 | .0432 | 0.41 | UTBox | b5ce5f |
| BSQ-ViT (EMA) | ✓ | 36 | 28.14 | .8448 | .0400 | 0.45 | UTBox | b5ce5f |
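
The md5sum column can be used to check that a checkpoint downloaded intact. The following is a minimal sketch, assuming the six-character values in the table are the leading hex digits of the file's MD5 digest; the helper names are hypothetical and are not part of this repository.

```python
import hashlib

def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_prefix(path, expected_prefix):
    """True if the file's MD5 digest starts with the given prefix (case-insensitive)."""
    return md5_hex(path).startswith(expected_prefix.lower())
```

For example, `matches_prefix("checkpoint.pt", "7abf5a")` would check a file against the first 18-bit row, where `checkpoint.pt` stands in for whatever name the downloaded file actually has.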

Video Reconstruction (UCF-101 16x128x128)

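| Method | #bits | PSNR↑ | SSIM↑ | LPIPS↓ | rFVD↓ | config & ckpt |
| --- | --- | --- | --- | --- | --- | --- |
| MAGVIT-L | 10 | 22.0 | .7010 | .0990 | 25 | N/A |
| MAGVITv2 | 18 | - | - | .0694 | 16.12 | N/A |
| MAGVITv2 (deeper) | 18 | - | - | .0537 | 8.62 | N/A |
| BSQ-bcViT | 18 | 32.08 | .9421 | .0244 | 8.08 | TBA |
| BSQ-bcViT | 36 | 33.80 | .9606 | .0159 | 4.10 | TBA |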
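
As a reminder of what the PSNR columns in the reconstruction tables measure, here is a minimal sketch of the standard definition, 10 * log10(MAX^2 / MSE), in decibels. The function name, the flat lists of pixel values, and the 8-bit peak value of 255 are assumptions for illustration; the evaluation code behind the reported numbers may use a different value range or averaging scheme.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better: because the scale is logarithmic, each 10 dB gain corresponds to a tenfold reduction in mean squared error.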

Image Synthesis (IN-1K 128x128)

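| Method | FID↓ | IS↑ | Prec↑ | Rec↑ | pre-computed samples | config & ckpt |
| --- | --- | --- | --- | --- | --- | --- |
| BigGAN | 6.02 | 145.8 | 0.86 | 0.35 | External | External |
| ADM | 5.91 | 93.3 | 0.70 | 0.65 | External | External |
| BSQ-ViT + Masked-LM (ours) | 5.44 | 139.6 | 0.80 | 0.50 | UTBox | TBA |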

Video Compression (MCL-JCV 360P)

License

MIT License.

Citing BSQ-ViT

@article{zhao2024bsqvit,
  title={Image and Video Tokenization with Binary Spherical Quantization},
  author={Zhao, Yue and Xiong, Yuanjun and Kr{\"a}henb{\"u}hl, Philipp},
  journal={arXiv preprint arXiv:2406.07548},
  year={2024}
}