The official implementation of the paper "Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning".
We introduce Latent Compression Learning (LCL) to pre-train vision models from scratch on interleaved image-text data. Compared with existing methods (e.g., CLIP, auto-regressive text generation), our proposed LCL is the first to achieve both of the following:
- Learning vision models from scratch
- Training on interleaved image-text data
Our LCL pre-training significantly outperforms all other methods on captioning tasks and is on par with the best paired pre-training methods on classification and retrieval tasks.
When both are trained on LAION-400M data, our LCL pre-training achieves performance similar to OpenCLIP. When MMC4 data is added, LCL pre-training outperforms OpenCLIP, especially on captioning and multi-modal dialogue tasks. For a fair comparison, the total number of images seen during pre-training is 13B in all settings.
This code is built upon OpenCLIP; please refer to their repository for environment setup.
Example training scripts are provided in `./scripts`. You can refer to OpenCLIP for more ways to launch training.
Training on LAION-400M: run `./scripts/lcl_vit_b_32_laion.sh`. The corresponding model config is here.
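As context, OpenCLIP-style training reads LAION-400M as webdataset tar shards of paired images and captions. The snippet below is only an illustrative sanity check under that assumption; the shard pattern is a placeholder, and the training script itself handles the actual data loading.

```python
# Illustrative sanity check for LAION-400M webdataset shards.
# The shard pattern below is a placeholder, not something this repo provides.
import webdataset as wds

shards = "/path/to/laion400m/{00000..00009}.tar"  # placeholder shard pattern

dataset = (
    wds.WebDataset(shards)
    .decode("pil")               # decode stored images to PIL
    .to_tuple("jpg;png", "txt")  # yield (image, caption) pairs
)

image, caption = next(iter(dataset))
print(image.size, caption[:80])
```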
Training on MMC4: We provide a simple dataloader that supports the original MMC4 dataset. Organize the data folder as follows:
/path/to/mmc4/
├── images/
│   └── ...
└── data/
    ├── docs_shard_0_v2.jsonl.zip
    ├── docs_shard_1_v2.jsonl.zip
    └── ...
Run `./scripts/lcl_vit_b_32_mmc4.sh`. The corresponding model config is here.
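For reference, MMC4 stores its annotations as zipped JSONL shards (`docs_shard_*_v2.jsonl.zip`), one document per line. The sketch below shows how such a shard can be inspected; it is not the dataloader shipped with this repo, and the field names (`text_list`, `image_info`, `image_name`) follow the public MMC4 format, so treat them as assumptions.

```python
# Illustrative reader for a single MMC4 shard (not this repo's dataloader).
# Field names follow the public MMC4 release and are assumptions here.
import json
import zipfile

shard_path = "/path/to/mmc4/data/docs_shard_0_v2.jsonl.zip"  # placeholder path

with zipfile.ZipFile(shard_path) as zf:
    # Each zip archive holds one .jsonl member with one document per line.
    jsonl_name = zf.namelist()[0]
    with zf.open(jsonl_name) as f:
        for line in f:
            doc = json.loads(line)
            sentences = doc["text_list"]                         # list of text pieces
            image_names = [im["image_name"] for im in doc["image_info"]]
            print(f"{len(sentences)} sentences, {len(image_names)} images")
            break  # inspect only the first document
```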
The following are the checkpoints of our pre-trained vision encoders. (Some checkpoints will become available in the future according to the release schedule.)
| model | data | epochs | download |
| --- | --- | --- | --- |
| ViT-B/32 | LAION-400M | 32 | ETA: 2024/06/30 |
| ViT-B/32 | LAION-400M + MMC4 | 32 | TBD |
| ViT-B/32 | LAION-2B + CC-Interleaved | 15 | ETA: 2024/06/30 |
| ViT-L/14 | LAION-400M | 32 | TBD |
| ViT-L/14 | LAION-400M + MMC4 | 32 | TBD |
| ViT-L/14 | LAION-2B + CC-Interleaved | 15 | TBD |
CC-Interleaved is a newly collected interleaved image-text dataset with over one billion images, which will be released soon.
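Once a checkpoint is downloaded, it should be loadable through the OpenCLIP interface this code builds on. The sketch below is a hedged example: the model name `ViT-B-32`, the checkpoint filename, and the sample image are placeholders, and the actual model/config name may differ (see the model configs referenced above).

```python
# Hedged example of loading a released vision encoder via open_clip.
# The model name, checkpoint path, and image file are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32",
    pretrained="/path/to/lcl_vit_b_32.pt",  # downloaded checkpoint (placeholder)
)
model.eval()

# Use the vision tower as a frozen feature extractor.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
print(image_features.shape)  # e.g. torch.Size([1, 512]) for ViT-B/32
```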
NOTE: Large-scale pre-training was conducted with our internal, efficiency-optimized code, which will not be released for intellectual-property reasons. This released version has been verified and reproduces the ViT-B/32 results on the LAION-400M dataset.
Release roadmap:
- basic code of LCL
- checkpoints of more models and datasets
- transfer evaluation code
If you find this work helpful in your research, please consider citing:
@article{yang2024vision,
  title={Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning},
  author={Yang, Chenyu and Zhu, Xizhou and Zhu, Jinguo and Su, Weijie and Wang, Junjie and Dong, Xuan and Wang, Wenhai and Li, Bin and Zhou, Jie and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.07543},
  year={2024}
}
This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.
Our code is built with reference to the following project: OpenCLIP.