[ICML 2024] Official implementation for "Image Fusion via Vision-Language Model".

FILM: image Fusion via vIsion-Language Model

Code and dataset for Image Fusion via Vision-Language Model (ICML 2024).

Information

Zixiang Zhao, Lilun Deng, Haowen Bai, Yukun Cui, Zhipeng Zhang, Yulun Zhang, Haotong Qin, Dongdong Chen, Jiangshe Zhang, Peng Wang, Luc Van Gool.

Update

  • [2024/07] The Vision-Language Fusion (VLF) Dataset is publicly available.
  • [2024/07] Code and config files of FILM are publicly available.
  • [2024/06] Release Project Page for FILM.

Citation

@inproceedings{Zhao_2024_ICML,
    title={Image Fusion via Vision-Language Model},
    author={Zixiang Zhao and Lilun Deng and Haowen Bai and Yukun Cui and Zhipeng Zhang 
            and Yulun Zhang and Haotong Qin and Dongdong Chen and Jiangshe Zhang 
            and Peng Wang and Luc Van Gool},
    booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
    year={2024},
}

Abstract

Image fusion integrates essential information from multiple images into a single composite, enhancing structures, textures, and refining imperfections. Existing methods predominantly focus on pixel-level and semantic visual features for recognition, but often overlook the deeper text-level semantic information beyond vision. Therefore, we introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM), for the first time, utilizing explicit textual information from source images to guide the fusion process. Specifically, FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions. These descriptions are fused within the textual domain and guide the visual information fusion, enhancing feature extraction and contextual understanding, directed by textual semantic information via cross-attention. FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for the eight image fusion datasets across four fusion tasks, facilitating future research in vision-language model-based image fusion.

🌐 Usage

⚙ Network Architecture

Our FILM is implemented in net/Film.py.
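The abstract describes textual semantic information guiding the visual features via cross-attention. As a rough illustration of that general mechanism only, here is a minimal single-head cross-attention sketch in plain Python. It is not the code in net/Film.py: the function names are hypothetical, there are no learned projections, and treating image tokens as queries and text tokens as keys and values is an assumption made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Single-head scaled dot-product cross-attention (hypothetical sketch).

    image_tokens: list of d-dimensional vectors used as queries (assumption).
    text_tokens:  list of d-dimensional vectors used as keys and values (assumption).
    Returns one d-dimensional vector per image token.
    """
    d = len(text_tokens[0])
    scale = 1.0 / math.sqrt(d)
    output = []
    for q in image_tokens:
        # Similarity between this image token and every text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        weights = softmax(scores)
        # Weighted sum of the text tokens (used here as values).
        attended = [sum(w * v[j] for w, v in zip(weights, text_tokens)) for j in range(d)]
        output.append(attended)
    return output
```

Each output vector is a weighted average of the text tokens, so text tokens that align more strongly with an image token contribute more to that token's result.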

🏊 Training

1. Virtual Environment

# create virtual environment
conda create -n FILM python=3.8.17
conda activate FILM
# select pytorch version yourself
# install FILM requirements
pip install -r requirements.txt

2. Data Preparation

Download the datasets corresponding to four different tasks provided in our paper from this link. These datasets contain images, texts, and implicit features corresponding to the texts. Place these datasets in the './VLFDataset/' folder.

3. Pre-Processing

Run

python data_process.py

The processed training dataset will then be available at './VLFDataset_h5/MSRS_train.h5'.

4. FILM Training

Run

python train.py

Training results are stored in the './exp/' folder, in a subfolder whose name can be changed through the save_path variable in 'train.py'. Inside it, each run gets its own folder named after the training start time. A run folder contains three folders ('code', 'model', and 'pic_fusion'), a log file, and a JSON file recording the parameters. The 'code' folder saves the model and training files for that session, the 'model' folder saves the model weights for each epoch, and the 'pic_fusion' folder saves the original images and fusion results of the first two batches from each training epoch.
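To pick up the weights of your most recent run, something like the following sketch may help. It only relies on the folder layout described above ('./exp/' / subfolder / run folder / 'model'). The helper names are hypothetical, and choosing the newest run by modification time rather than by parsing the folder name is an assumption made to avoid guessing the timestamp format.

```python
from pathlib import Path


def latest_run(exp_root="./exp"):
    """Return the most recently modified run folder that has a 'model' subfolder.

    Returns None if no such folder exists under exp_root.
    """
    root = Path(exp_root)
    # Layout described in this README: exp_root/<save_path name>/<start time>/model
    runs = [p for p in root.glob("*/*") if (p / "model").is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def list_weights(run_dir):
    """Return the files stored in the run's 'model' folder, sorted by name."""
    model_dir = Path(run_dir) / "model"
    return sorted(p for p in model_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    run = latest_run()
    if run is None:
        print("No training runs found under ./exp")
    else:
        for weight_file in list_weights(run):
            print(weight_file)
```

One of the listed files can then be used as 'ckpt_path' in 'test.py'.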

🏄 Testing

1. Pretrained models

The pre-trained models can be found at './models/IVF.pth', './models/MEF.pth', './models/MFF.pth', and './models/MIF.pth'. These models are responsible for infrared-visible fusion (IVF), multi-exposure image fusion (MEF), multi-focus image fusion (MFF), and medical image fusion (MIF) tasks, respectively.

2. Test datasets

The test datasets used in the paper are provided in the format './VLFDataset/{Task_name}/{Dataset_name}/test.txt'. Here, the provided 'Task_name' includes IVF, MEF, MFF, and MIF, and 'Dataset_name' corresponds to the dataset names included for each task.
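As a quick sanity check after downloading, the path pattern above can be expanded for every task and dataset named in this README. This is a sketch under assumptions: the helper names are hypothetical, and the on-disk folder names are assumed to match the spellings used here (for example 'M3FD' for the M³FD dataset).

```python
from pathlib import Path

# Task -> datasets, as listed in this README.
# Assumption: folder names on disk use exactly these spellings.
VLF_DATASETS = {
    "IVF": ["MSRS", "M3FD", "RoadScene"],
    "MIF": ["Harvard"],
    "MEF": ["SICE", "MEFB"],
    "MFF": ["RealMFF", "Lytro"],
}


def list_file_path(task, dataset, root="./VLFDataset"):
    """Build '<root>/<task>/<dataset>/test.txt' after validating the names."""
    if dataset not in VLF_DATASETS.get(task, []):
        raise ValueError(f"Unknown task/dataset combination: {task}/{dataset}")
    return Path(root) / task / dataset / "test.txt"


def missing_test_lists(root="./VLFDataset"):
    """Return the expected test.txt paths that do not exist under root."""
    missing = []
    for task, names in VLF_DATASETS.items():
        for name in names:
            path = list_file_path(task, name, root)
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_test_lists():
        print("missing:", path)
```

An empty result from missing_test_lists() suggests the download was unpacked into the expected layout.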

Because the datasets used in the paper exceed 4GB, we cannot host them in this repository. You can download them via this link. The download includes the images, the texts, and the implicit features corresponding to the texts (via BLIP2) for all datasets.

3. Results in Our Paper

If you want to infer with our FILM and obtain the fusion results in our paper, please run

python test.py

to perform image fusion. The output fusion results will be saved in the './test_output/{Dataset_name}/Gray' folder. If you want to test using your own trained model, you can set the path of the model you want to load as 'ckpt_path' before the model weights are loaded in 'test.py'.

The output for IVF is:

================================================================================
The test result of MSRS:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            6.72	43.64	11.55	3.72	1.06	0.70    
================================================================================

================================================================================
The test result of RoadScene:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            7.43	49.26	17.33	6.59	0.69	0.62    
================================================================================

================================================================================
The test result of M3FD:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            7.09	41.52	16.77	5.55	0.83	0.67    
================================================================================

which matches the results in Table 1 of our original paper.

The output for MIF is:

================================================================================
The test result of Harvard:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            4.74	65.23	23.36	6.19	0.78	0.76    
================================================================================

which matches the results in Table 2 of our original paper.

The output for MEF is:

================================================================================
The test result of SICE:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            7.07	54.22	19.39	5.14	1.05	0.79    
================================================================================

================================================================================
The test result of MEFB:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            7.32	68.98	20.94	6.14	0.98	0.77    
================================================================================

which matches the results in Table 3 of our original paper.

The output for MFF is:

================================================================================
The test result of RealMFF:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            7.11	54.97	15.62	5.43	1.11	0.76   
================================================================================

================================================================================
The test result of Lytro:
                 EN      SD      SF      AG     VIFF     Qabf    
FILM            7.56	59.15	19.57	6.97	0.98	0.74    
================================================================================

which matches the results in Table 4 of our original paper.
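If you want to compare your own runs against these numbers programmatically, the printed blocks can be parsed into a dictionary. The sketch below assumes the output keeps exactly the layout shown above (a "The test result of ..." line, then a metric header line, then one row starting with the method name); parse_results is a hypothetical helper, not part of the repository.

```python
def parse_results(text):
    """Parse printed result blocks into {dataset: {metric: value}}.

    Assumes each block has the form shown in this README:
        The test result of <dataset>:
        <metric names separated by whitespace>
        <method name> <one value per metric>
    """
    prefix = "The test result of "
    lines = [line.strip() for line in text.splitlines()]
    results = {}
    for i, line in enumerate(lines):
        if line.startswith(prefix) and i + 2 < len(lines):
            dataset = line[len(prefix):].rstrip(":")
            metrics = lines[i + 1].split()
            row = lines[i + 2].split()  # first token is the method name
            results[dataset] = dict(zip(metrics, map(float, row[1:])))
    return results


if __name__ == "__main__":
    sample = (
        "The test result of MSRS:\n"
        "                 EN      SD      SF      AG     VIFF     Qabf\n"
        "FILM            6.72\t43.64\t11.55\t3.72\t1.06\t0.70\n"
    )
    print(parse_results(sample))
```

The sample in the `__main__` block reuses the MSRS row from the IVF output above.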

📁 Vision-Language Fusion (VLF) Dataset

Considering the high computational cost of invoking various vision-language components, and to facilitate subsequent research on image fusion based on vision-language models, we propose the VLF Dataset. This dataset contains paired paragraph descriptions generated by ChatGPT, covering all image pairs from the training and test sets of eight widely used fusion datasets.

These include

  • Infrared-visible image fusion (IVF): MSRS, M³FD, and RoadScene datasets;
  • Medical image fusion (MIF): Harvard dataset;
  • Multi-exposure image fusion (MEF): SICE and MEFB datasets;
  • Multi-focus image fusion (MFF): RealMFF and Lytro datasets.

The dataset is available for download via Google Drive.

More visualizations and illustrations of the VLF Dataset can be found on our Project Homepage.

[Notice]:
Considering the immense workload involved in creating this dataset, we have opened a Google Form for error correction feedback. Please provide your suggestions for correcting any errors in the VLF dataset. If you have any questions regarding the Google Form, please contact Zixiang via email.

🙌 Fusion via vIsion-Language Model (FILM)

[Figure: Workflow for our FILM]

[Figure: Detailed Architecture of FILM]

[Figure: Visualization of the VLF dataset]

📝 Experimental Results

Infrared-visible image fusion (IVF)

[Figures: qualitative and quantitative fusion results]

Medical image fusion (MIF)

[Figures: qualitative and quantitative fusion results]

Multi-exposure image fusion (MEF)

[Figures: qualitative and quantitative fusion results]

Multi-focus image fusion (MFF)

[Figures: qualitative and quantitative fusion results]

📖 Related Work