Official code for CVPR 2024 paper, "SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models"

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

Tongtian Yue1,3* ,  Jie Cheng2,3* ,  Longteng Guo1,3* ,  Xingyuan Dai2,3 ,  Zijia Zhao1,3 ,  Xingjian He1,3 ,  Gang Xiong2,3 ,  Yisheng Lv2,3 ,  Jing Liu1,3†  
1Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA   
2State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA   
3School of Artificial Intelligence, University of Chinese Academy of Sciences   

CVPR, 2024

Requirements

Installation

Create a conda environment and install dependencies:

conda create -n sc_tune python=3.10
conda activate sc_tune
pip install -r requirements.txt

Data

Download the Qwen-VL-Chat checkpoint (10 *.bin files in total) into Qwen-VL-Chat/, and download the Objects365 images.
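
Because the checkpoint is split across several files, an interrupted download is easy to miss. The following is a minimal sketch of a sanity check, assuming only what is stated above (10 *.bin files under Qwen-VL-Chat/). The helper name `count_checkpoint_shards` is hypothetical and is not part of this repo.

```python
from pathlib import Path


def count_checkpoint_shards(checkpoint_dir, expected=10):
    """Return (found, ok): the number of *.bin files directly inside
    checkpoint_dir and whether that number equals `expected`."""
    shards = sorted(Path(checkpoint_dir).glob("*.bin"))
    return len(shards), len(shards) == expected


if __name__ == "__main__":
    # "Qwen-VL-Chat" is the checkpoint path named in this README.
    found, ok = count_checkpoint_shards("Qwen-VL-Chat")
    status = "OK" if ok else "download may be incomplete"
    print(f"found {found} *.bin files: {status}")
```

The check only counts files. It does not verify file sizes or hashes, so it catches missing shards but not corrupted ones.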

Note

We have modified the code in Qwen-VL-Chat/visual.py. If your copy of this file is the original version, replace it with the one in this repo.
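
If you are unsure whether your copy of visual.py is already the modified one, a byte-level comparison settles it. Below is a minimal sketch that overwrites the target only when the two files differ and keeps a backup first. The function name `replace_if_different` and the `.orig` backup suffix are assumptions made for illustration, and the two paths are supplied by the caller.

```python
import filecmp
import shutil
from pathlib import Path


def replace_if_different(repo_file, target_file):
    """Copy repo_file over target_file unless the contents already match.

    An existing target is backed up next to itself with a ".orig" suffix
    before being overwritten. Returns True if a copy was made.
    """
    repo_file, target_file = Path(repo_file), Path(target_file)
    if target_file.exists():
        # shallow=False forces a content comparison, not just a stat() check.
        if filecmp.cmp(repo_file, target_file, shallow=False):
            return False
        backup = target_file.with_name(target_file.name + ".orig")
        shutil.copy2(target_file, backup)
    shutil.copy2(repo_file, target_file)
    return True
```

Call it with the path to this repo's visual.py as the first argument and the path to the copy inside your checkpoint directory as the second.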

Get Started

Configs

Set the path to the Objects365 images in scripts/finetune_ds.sh. The other hyperparameters are also set in this file.

Running

sh scripts/finetune_ds.sh

Main code

The main code implementing the SC-Tune method is in transformers/trainer.py and transformers/trainer_utils.py.

Acknowledgement

This repo benefits from Qwen-VL, TRL, and MOSS. Thanks for their wonderful work.

Citation

@article{yue2024sc,
  title={SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models},
  author={Yue, Tongtian and Cheng, Jie and Guo, Longteng and Dai, Xingyuan and Zhao, Zijia and He, Xingjian and Xiong, Gang and Lv, Yisheng and Liu, Jing},
  journal={arXiv preprint arXiv:2403.13263},
  year={2024}
}