
Latent Compression Learning (LCL)

The official implementation of the paper "Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning".

We introduce Latent Compression Learning (LCL), a method for pre-training vision models from scratch on interleaved image-text data. Compared to existing methods (e.g., CLIP, auto-regressive text generation), LCL is the first to achieve both

  • Learning vision models from scratch
  • Training on interleaved image-text data

(Figure: method overview of LCL)

📈 Results

Pre-training on MMC4 Dataset

(Figure: results of pre-training on the interleaved MMC4 dataset)

Our LCL pre-training significantly outperforms all other methods on caption tasks and is on par with the best paired pre-training methods on classification and retrieval tasks.

Comparison with OpenCLIP

(Figure: transfer evaluation results compared with OpenCLIP)

(Figure: multi-modal evaluation results compared with OpenCLIP)

When both methods are trained on LAION-400M data, our LCL pre-training achieves performance similar to OpenCLIP. When MMC4 data is added, LCL pre-training outperforms OpenCLIP, especially on caption and multi-modal dialogue tasks. For a fair comparison, the total number of images seen during pre-training is kept at 13B for all settings.

🛠️ Usage

Install

This code is built upon OpenCLIP; refer to that repository for environment setup.
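
A minimal sanity check of the environment, assuming a standard OpenCLIP installation is importable (this repository's actual package layout may differ):

    # Quick sanity check of the training environment.
    # Assumes the standard open_clip package from OpenCLIP is importable;
    # adapt to this repository's package layout if it differs.
    import torch
    import open_clip

    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("ViT-B-32 registered:", "ViT-B-32" in open_clip.list_models())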

Training LCL

Example training scripts are provided in ./scripts. Refer to OpenCLIP for more ways to launch training.

Training on LAION-400M: Run ./scripts/lcl_vit_b_32_laion.sh. The corresponding model config is here.

Training on MMC4: We provide a simple dataloader that supports the original MMC4 dataset. Organize the data folder as follows:

  /path/to/mmc4/
      ├── images/
      │   └── ...
      └── data/ 
          ├── docs_shard_0_v2.jsonl.zip
          ├── docs_shard_1_v2.jsonl.zip
          └── ...

Run ./scripts/lcl_vit_b_32_mmc4.sh. The corresponding model config is here.
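
Before launching training, it can help to confirm that the shards under data/ follow the expected MMC4 layout. The sketch below is a hypothetical helper (not part of this repo); the field names (text_list, image_info, image_name) follow the public MMC4 v2 release and may need adjusting for other copies:

    # Hypothetical helper (not part of this repo) for peeking at one MMC4 shard.
    import json
    import zipfile

    def iter_mmc4_docs(shard_path):
        # Each docs_shard_*_v2.jsonl.zip contains a single .jsonl member.
        with zipfile.ZipFile(shard_path) as zf:
            member = zf.namelist()[0]
            with zf.open(member) as f:
                for line in f:
                    yield json.loads(line)

    for doc in iter_mmc4_docs("/path/to/mmc4/data/docs_shard_0_v2.jsonl.zip"):
        # Interleaved documents pair a list of sentences with image records
        # whose files are expected under /path/to/mmc4/images/.
        print(doc["text_list"][:2])
        print([img["image_name"] for img in doc["image_info"]])
        break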

Pre-trained Checkpoints

The following are checkpoints of our pre-trained vision encoders. (Some checkpoints will be released later, according to the schedule below.)

Model      Data                        Epochs   Download
ViT-B/32   LAION-400M                  32       ETA: 2024/06/30
ViT-B/32   LAION-400M + MMC4           32       TBD
ViT-B/32   LAION-2B + CC-Interleaved   15       ETA: 2024/06/30
ViT-L/14   LAION-400M                  32       TBD
ViT-L/14   LAION-400M + MMC4           32       TBD
ViT-L/14   LAION-2B + CC-Interleaved   15       TBD

CC-Interleaved is a newly collected interleaved image-text dataset with over one billion images, which will be released soon.

NOTE: Our large-scale pre-training was conducted with efficient internal code that will not be released for intellectual-property reasons. The released version has been verified to reproduce the ViT-B/32 results on the LAION-400M dataset.
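
Once released, the checkpoints are expected to be loadable through the OpenCLIP API. The sketch below assumes an OpenCLIP-compatible checkpoint format; the file path is a placeholder:

    # Sketch of loading a pre-trained vision encoder, assuming the released
    # checkpoint is OpenCLIP-compatible; the path below is a placeholder.
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="/path/to/lcl_vit_b_32.pt"
    )
    model.eval()

    # Encode a dummy image with the pre-trained vision encoder.
    image = preprocess(Image.new("RGB", (256, 256))).unsqueeze(0)
    with torch.no_grad():
        features = model.encode_image(image)
    print(features.shape)  # (1, embed_dim)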

📅 Schedule

  • basic code of LCL
  • checkpoints of more models and datasets
  • transfer evaluation code

🖊️ Citation

If you find this work helpful in your research, please consider citing:

@article{yang2024vision,
  title={Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning},
  author={Yang, Chenyu and Zhu, Xizhou and Zhu, Jinguo and Su, Weijie and Wang, Junjie and Dong, Xuan and Wang, Wenhai and Li, Bin and Zhou, Jie and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.07543},
  year={2024}
}

📃 License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

🙏 Acknowledgements

Our code is built with reference to the following project: OpenCLIP.