LoRS_Distill

Code for our ICML'24 paper on multimodal dataset distillation

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

LoRS: Low-Rank Similarity Mining

This repo contains the code for our ICML'24 work LoRS: Low-Rank Similarity Mining for Multimodal Dataset Distillation. LoRS proposes learning the similarity matrix jointly with the distilled images and texts. This simple, plug-and-play method yields a significant performance gain. Please check our paper for more analysis.
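To make the idea of a learnable low-rank similarity matrix concrete, here is a minimal, dependency-free sketch. It assumes the matrix is parametrized as a diagonal plus a product of two thin factors, S = diag(d) + L Rᵀ; this form, the initialization, and the function names are illustrative assumptions, not the repository's implementation. Please refer to the paper and the code for the actual formulation.

```python
import random


def init_low_rank_params(n, rank, seed=0):
    """Return (diag, left, right) for an n x n similarity matrix.

    Assumption for illustration: S = diag(d) + L @ R^T, with R starting at
    zero so that S starts out as the identity (each image matches only its
    own text).
    """
    rng = random.Random(seed)
    diag = [1.0] * n
    left = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(n)]
    right = [[0.0] * rank for _ in range(n)]
    return diag, left, right


def build_similarity(diag, left, right):
    """Materialize S = diag(d) + L @ R^T as a nested list."""
    n = len(diag)
    rank = len(left[0]) if left else 0
    sim = [
        [sum(left[i][k] * right[j][k] for k in range(rank)) for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        sim[i][i] += diag[i]
    return sim


def param_counts(n, rank):
    """Compare parameter counts: (full matrix, low-rank form)."""
    return n * n, n + 2 * n * rank
```

For example, `param_counts(1000, 10)` shows that a full 1000 x 1000 matrix has 1,000,000 entries, while the assumed low-rank form has 21,000 learnable values. In a real training loop the three parameter groups would be tensors optimized together with the synthetic data.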

Method

Getting Started

Requirements: please see requirements.txt.

Pretrained model checkpoints: manually download the BERT and NFNet (from TIMM) checkpoints and place them here:

distill_utils/checkpoints/
├── bert-base-uncased/
│   ├── config.json
│   ├── LICENSE.txt
│   ├── model.onnx
│   ├── pytorch_model.bin
│   ├── vocab.txt
│   └── ......
└── nfnet_l0_ra2-45c6688d.pth
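Before launching a long run, it can help to confirm that the checkpoints are where the tree above expects them. The following is a small sketch of such a check; the helper name `missing_checkpoints` is hypothetical and not part of the repository, and the list covers only the files named explicitly in the tree.

```python
from pathlib import Path

# Files named explicitly in the layout above (the "......" entry is not checked).
EXPECTED_FILES = [
    "bert-base-uncased/config.json",
    "bert-base-uncased/pytorch_model.bin",
    "bert-base-uncased/vocab.txt",
    "nfnet_l0_ra2-45c6688d.pth",
]


def missing_checkpoints(root="distill_utils/checkpoints"):
    """Return the expected checkpoint files that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files found.")
```

Run it from the repository root so that the relative path resolves correctly.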

Datasets: please download the Flickr30K ([Train][Val][Test][Images]) and COCO ([Train][Val][Test][Images]) datasets and place them here:

./distill_utils/data/
├── Flickr30k/
│   ├── flickr30k-images/
│   │   ├── 1234.jpg
│   │   └── ......
│   ├── results_20130124.token
│   └── readme.txt
└── COCO/
    ├── train2014/
    ├── val2014/
    └── test2014/
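As an optional sanity check on the Flickr30k download, the sketch below reads results_20130124.token and reports captioned images that have no file in flickr30k-images/. It assumes the standard Flickr30k token format, where each line is `<image>.jpg#<index>`, a tab, then the caption. The helper names are hypothetical, and the repository's own data loader may read annotations from the [Train]/[Val]/[Test] files instead.

```python
from collections import defaultdict
from pathlib import Path


def load_flickr_captions(token_path):
    """Map image file name -> list of captions from a .token file.

    Assumed line format: "<image>.jpg#<index>\t<caption>".
    """
    captions = defaultdict(list)
    with open(token_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, caption = line.partition("\t")
            image_name, _, _index = key.partition("#")
            captions[image_name].append(caption.strip())
    return dict(captions)


def images_without_files(captions, image_dir):
    """Return captioned image names that are missing from image_dir."""
    image_dir = Path(image_dir)
    return sorted(name for name in captions if not (image_dir / name).is_file())


if __name__ == "__main__":
    root = Path("distill_utils/data/Flickr30k")
    token_file = root / "results_20130124.token"
    if token_file.is_file():
        caps = load_flickr_captions(token_file)
        missing = images_without_files(caps, root / "flickr30k-images")
        print(f"{len(caps)} captioned images, {len(missing)} missing files")
    else:
        print(f"{token_file} not found")
```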

Training the expert buffer: e.g. run sh sh/buffer_flickr.sh. Expert training takes days, so you may want to manually split num_experts across multiple processes that run in parallel.
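If you split the expert training by hand, a helper like the one below can compute how many experts each process should train. It is only a sketch: `split_experts` is a hypothetical name, and how each share is passed to the buffer script (and how output files are kept from colliding) depends on the script's actual arguments, which are not described here.

```python
def split_experts(num_experts, num_processes):
    """Divide num_experts as evenly as possible across num_processes.

    Returns a list with one expert count per process; earlier processes
    receive the remainder.
    """
    if num_experts < 0:
        raise ValueError("num_experts must be non-negative")
    if num_processes <= 0:
        raise ValueError("num_processes must be positive")
    base, extra = divmod(num_experts, num_processes)
    return [base + (1 if i < extra else 0) for i in range(num_processes)]


if __name__ == "__main__":
    # Example: 20 experts on 3 GPUs/processes.
    print(split_experts(20, 3))  # [7, 7, 6]
```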

Distilling with LoRS: e.g. run sh sh/distill_flickr_lors_100.sh. Thanks to TESLA, the distillation can run on a single RTX 3090/4090.

Citation

If you find our work useful and inspiring, please cite our paper:

@article{xu2024lors,
  title={Low-Rank Similarity Mining for Multimodal Dataset Distillation},
  author={Xu, Yue and Lin, Zhilin and Qiu, Yusong and Lu, Cewu and Li, Yong-Lu},
  journal={arXiv e-prints},
  pages={arXiv--2406},
  year={2024}
}

Acknowledgement

We follow the settings and code of VL-Distill and re-implement the algorithm with TESLA. We deeply appreciate their valuable contributions!