A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models

Dependencies

python == 3.7.5
torch == 1.13.1
transformers == 4.28.1
numpy == 1.21.6
nltk == 3.8.1
pandas == 1.3.5
pytorch-lightning == 1.9.5
pytorch-transformers == 1.2.0
einops == 0.6.1
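
To confirm that an environment matches the pinned versions above, a small check like the following may help. It is a minimal sketch and not part of this repository: the names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical, and local build tags such as `+cu117` are assumed to be irrelevant to the comparison.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Python 3.7 needs the importlib-metadata backport package.
    from importlib_metadata import PackageNotFoundError, version

# Pinned versions copied from the Dependencies list.
PINNED = {
    "torch": "1.13.1",
    "transformers": "4.28.1",
    "numpy": "1.21.6",
    "nltk": "3.8.1",
    "pandas": "1.3.5",
    "pytorch-lightning": "1.9.5",
    "pytorch-transformers": "1.2.0",
    "einops": "0.6.1",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to (expected, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Drop local build tags such as "+cu117" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.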

Feature Extraction & Caption Generation

Before running the Warm-up tasks and fine-tuning, preprocess the Image-Chat data: extract image features and generate captions.

The Image-Chat dataset can be downloaded via Image-Chat

For this dataset, image features are extracted using CLIP, and captions are generated through BLIP-2.
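
As a rough illustration of CLIP feature extraction, the sketch below uses the Hugging Face `transformers` CLIP classes. It is not the repository's script: the checkpoint `openai/clip-vit-base-patch32`, the helper names, the batch size, and the use of Pillow (not listed under Dependencies) are all assumptions.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(image_dir):
    """Return image paths directly under image_dir in sorted order."""
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )


def extract_clip_features(image_paths,
                          model_name="openai/clip-vit-base-patch32",
                          batch_size=32):
    """Return {image file stem: CLIP image embedding as a NumPy array}."""
    # Heavy imports stay local so list_images works without them.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    features = {}
    for start in range(0, len(image_paths), batch_size):
        batch = image_paths[start:start + batch_size]
        images = [Image.open(p).convert("RGB") for p in batch]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            embeds = model.get_image_features(**inputs)
        for path, vec in zip(batch, embeds):
            features[Path(path).stem] = vec.numpy()
    return features
```

A call such as `extract_clip_features(list_images("images/"))` would then produce one embedding per image, where `images/` is a placeholder directory.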

clip_feature_extraction.py : Extracts features from images.

captioning_blip2-opt-2.7b.py : Generates captions from images with the BLIP-2 OPT-2.7B model.

captioning_blip2-flan-t5-xl-coco.py : Generates captions from images with the BLIP-2 Flan-T5-XL (COCO) model.
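
For orientation, BLIP-2 captioning through `transformers` could look roughly like the sketch below. The checkpoint name `Salesforce/blip2-opt-2.7b` is inferred from the script name, and the helper names, `max_new_tokens` value, and Pillow dependency are assumptions; the actual scripts may differ.

```python
def clean_caption(text):
    """Collapse runs of whitespace and trim the decoded caption."""
    return " ".join(text.split())


def generate_captions(image_paths,
                      model_name="Salesforce/blip2-opt-2.7b",
                      max_new_tokens=30):
    """Return {image path: generated caption}."""
    # Heavy imports stay local so clean_caption works without them.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_name)
    model = Blip2ForConditionalGeneration.from_pretrained(model_name).eval()

    captions = {}
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        # No text prompt is passed, so the model produces a plain caption.
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
        captions[str(path)] = clean_caption(text)
    return captions
```

Passing `model_name="Salesforce/blip2-flan-t5-xl-coco"` would presumably correspond to the second captioning script.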

Warm-up Tasks

Run the Warm-up tasks with pretrain_preprocessing.py.

Fine-tuning Model

After training on the Warm-up tasks, load the trained weights and proceed with fine-tuning.

Fine-tuning can be carried out using training_ablation.py.
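
Loading warm-up weights before fine-tuning might be sketched as follows. This assumes the warm-up run saved either a PyTorch Lightning checkpoint (weights nested under `state_dict`) or a plain `torch.save` of a state dict; the function names and the checkpoint path are hypothetical, and how training_ablation.py actually loads weights is not shown here.

```python
def extract_state_dict(checkpoint):
    """Return the weight mapping from a Lightning or plain checkpoint."""
    # PyTorch Lightning nests model weights under the "state_dict" key.
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]
    return checkpoint


def load_warmup_weights(model, checkpoint_path):
    """Copy warm-up weights into model; return (missing, unexpected) keys."""
    import torch  # local import so extract_state_dict works without torch

    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    state_dict = extract_state_dict(checkpoint)
    # strict=False tolerates layers that exist in only one of the two stages.
    result = model.load_state_dict(state_dict, strict=False)
    return result.missing_keys, result.unexpected_keys
```

Inspecting the returned key lists is a quick way to notice when parameter names in the checkpoint do not line up with the fine-tuning model.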

Reference