You can pronounce BSQ-ViT like "biskvit" (a kind of Russian sponge cake) or simply "biscuit".
Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao¹, Yuanjun Xiong², Philipp Krähenbühl¹
¹ UT Austin, ² Predera
arxiv | bibtex
- Install Miniforge3
- Create the environment
mamba env create -f bsqvit-env.yaml
mamba activate bsqvit
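After activation, one quick way to confirm that a Python process is running inside the new environment is to inspect the `CONDA_DEFAULT_ENV` variable, which conda-compatible tools normally set when an environment is activated. The following is a minimal sketch; the helper name `active_env_matches` is hypothetical and not part of this repository.

```python
import os


def active_env_matches(expected="bsqvit", environ=None):
    """Return True if the active conda/mamba environment is named `expected`.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    return env.get("CONDA_DEFAULT_ENV") == expected


if __name__ == "__main__":
    print("bsqvit environment active:", active_env_matches())
```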
| | Use approx. (Eq 8) | #bits | PSNR↑ | SSIM↑ | LPIPS↓ | rFID↓ | config & ckpt | md5sum |
|---|---|---|---|---|---|---|---|---|
| SDXL-VAE | N/A | 64 | 25.38 | .7276 | .0666 | 0.72 | External | N/A |
| BSQ-ViT | | 18 | 24.79 | .7319 | .0836 | 1.34 | UTBox | 7abf5a |
| BSQ-ViT (EMA) | | 18 | 24.80 | .7314 | .0820 | 1.23 | UTBox | 7abf5a |
| BSQ-ViT | ✓ | 18 | 25.36 | .7578 | .0761 | 1.14 | UTBox | 8f5422 |
| BSQ-ViT (EMA) | ✓ | 18 | 25.80 | .7680 | .0729 | 1.30 | UTBox | 8f5422 |
| BSQ-ViT | ✓ | 36 | 27.88 | .8410 | .0432 | 0.41 | UTBox | b5ce5f |
| BSQ-ViT (EMA) | ✓ | 36 | 28.14 | .8448 | .0400 | 0.45 | UTBox | b5ce5f |
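
The `md5sum` column appears to list the first six hexadecimal characters of each checkpoint's MD5 digest. A downloaded file can be checked against it with a short sketch like the one below. The function names and the example path are hypothetical, and reading the column as a digest prefix is an assumption.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path, expected_prefix):
    """Return True if the file's MD5 digest starts with the listed prefix."""
    return md5_hex(path).startswith(expected_prefix.lower())


# Example with a hypothetical local path:
# matches_prefix("checkpoints/bsqvit_18bit.pt", "8f5422")
```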
| | #bits | PSNR↑ | SSIM↑ | LPIPS↓ | rFVD↓ | config & ckpt |
|---|---|---|---|---|---|---|
| MAGVIT-L | 10 | 22.0 | .7010 | .0990 | 25 | N/A |
| MAGVITv2 | 18 | - | - | .0694 | 16.12 | N/A |
| MAGVITv2 (deeper) | 18 | - | - | .0537 | 8.62 | N/A |
| BSQ-bcViT | 18 | 32.08 | .9421 | .0244 | 8.08 | TBA |
| BSQ-bcViT | 36 | 33.80 | .9606 | .0159 | 4.10 | TBA |
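
The `#bits` column gives the number of binary code dimensions per token, so an 18-bit tokenizer has an implicit codebook of 2^18 = 262,144 entries and a 36-bit tokenizer has 2^36. As a rough illustration of the quantization step named in the title, the sketch below takes an already projected latent vector of length L, normalizes it onto the unit sphere, keeps the sign of each coordinate, and rescales by 1/sqrt(L) so the quantized code also has unit norm. It is a plain-Python toy for a single vector, not the repository's implementation; the function names, the handling of exact zeros, and the bit ordering of the token index are assumptions.

```python
import math


def bsq_quantize(v):
    """Quantize one latent vector to a binary code on the unit sphere.

    Returns (bits, code): `bits` is a list of 0/1 values, `code` is the
    quantized vector with entries +/- 1/sqrt(L).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    u = [x / norm for x in v]                      # project onto the unit sphere
    bits = [1 if x > 0 else 0 for x in u]          # sign per coordinate (assumption: 0 maps to bit 0)
    scale = 1.0 / math.sqrt(len(v))
    code = [scale if b else -scale for b in bits]  # unit-norm binary code
    return bits, code


def bits_to_index(bits):
    """Pack bits into an integer token id (assumed: first bit is least significant)."""
    return sum(b << k for k, b in enumerate(bits))


def codebook_size(num_bits):
    """Number of distinct codes representable with `num_bits` binary dimensions."""
    return 2 ** num_bits
```

Because every code is determined by its sign pattern, no explicit codebook needs to be stored; the token id is just the packed bit string.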
| | FID↓ | IS↑ | Prec↑ | Rec↑ | pre-computed samples | config & ckpt |
|---|---|---|---|---|---|---|
| BigGAN | 6.02 | 145.8 | 0.86 | 0.35 | External | External |
| ADM | 5.91 | 93.3 | 0.70 | 0.65 | External | External |
| Ours BSQ-ViT + Masked-LM | 5.44 | 139.6 | 0.80 | 0.50 | UTBox | TBA |
@article{zhao2024bsqvit,
title={Image and Video Tokenization with Binary Spherical Quantization},
author={Zhao, Yue and Xiong, Yuanjun and Kr{\"a}henb{\"u}hl, Philipp},
journal={arXiv preprint arXiv:2406.07548},
year={2024}
}