Exp-BLIP

Official PyTorch Implementation of Exp-BLIP (BMVC 2023 Oral).

Figure: Exp-BLIP training framework

Describe Your Facial Expressions by Linking Image Encoders and Large Language Models
Yujian Yuan, Jiabei Zeng, Shiguang Shan
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

📰 News

[2023.12.11] Training and test code of Exp-BLIP is available.
[2023.11.24] The video presentation of Exp-BLIP is available at BMVC 2023.
[2023.11.6] The paper of Exp-BLIP is available (BMVC 2023, GitHub).
[2023.10.27] Synthesized captions used for training are available.
[2023.9.12] Exp-BLIP is selected for an Oral presentation at BMVC 2023! 🎉
[2023.8.25] Exp-BLIP is accepted by BMVC 2023! 🎉
[2023.8.20] Code and trained models will be released here. Watch this repository for the latest updates.

➡️ Datasets

Statistics of the training and test data. The captions for each image can be downloaded in the Synthesized captions section below.

(1) AU datasets

|  | BP4D | DISFA | GFT | RAF-AU | EmotioNet |
|---|---|---|---|---|---|
| Train (#image/#sub) | 16627*/28 | 14814*/24 | 17719*/78 | 3733/- | 19046/- |
| Test (#image/#sub) | 45805/13 | 14535/3 | 4034*/18 | 868/- | 2117/- |

*: sampled sets

(2) Emotion datasets

|  | AffectNet | RAF-DB | FaceME |
|---|---|---|---|
| Train (#image/#sub) | 287618/- | 3162/- | 10052/- |
| Test (#image/#sub) | 4000/- | 792/- | - |

⬇️ Captions and Models Download

(1) Synthesized captions

| Caption type | Link |
|---|---|
| AU captions | OneDrive |
| Emotion captions | OneDrive |
| Facial expression captions* | OneDrive |

*: pseudo AU/emotion captions are generated by AU/Emot-BLIP (ViT-G, OPT6.7B).

(2) Trained models

| Model | Link |
|---|---|
| AU-BLIP (ViT-G, OPT6.7B) | OneDrive |
| Emot-BLIP (ViT-G, OPT6.7B) | OneDrive |
| Exp-BLIP (ViT-G, OPT6.7B) | OneDrive |

🔨 Installation

1. (Optional) Create a conda environment:

```bash
conda create -n expblip python=3.8.12
conda activate expblip
```

2. Clone this repo:

```bash
git clone https://github.com/Yujianyuan/Exp-BLIP.git
cd Exp-BLIP
```

3. Install the packages listed in requirements.txt:

```bash
pip install -r requirements.txt
```
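
After installation, a quick sanity check can confirm that PyTorch sees your GPUs. This is a minimal sketch, assuming PyTorch was installed via requirements.txt; it is not part of the released code:

```python
# Minimal environment check (assumes PyTorch was installed via requirements.txt).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```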

🚀 Getting started

(1) Training

Complete the following two training stages in order.

1. Fill in the blanks labeled 'TODO' in Exp-BLIP/mylavis/projects/blip2/train/pretrain_stage1_vitg.yaml (a helper sketch for locating the TODO fields follows these steps).

2. Run stage-1 training:

```bash
python -m torch.distributed.run --nproc_per_node=4 train.py --cfg-path mylavis/projects/blip2/train/pretrain_stage1_vitg.yaml
```

3. Fill in the blanks labeled 'TODO' in Exp-BLIP/mylavis/projects/blip2/train/caption_exp_ft.yaml.

4. Run stage-2 training:

```bash
python -m torch.distributed.run --nproc_per_node=4 train.py --cfg-path mylavis/projects/blip2/train/caption_exp_ft.yaml
```

The commands above assume 4 GPUs per node; adjust --nproc_per_node to match your setup.
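
Before launching each stage, it can help to list the placeholders that still need values. The snippet below is a small helper sketch, not part of the released code; it only does a plain-text search for the literal string 'TODO' in the two config files:

```python
# Helper sketch: list remaining 'TODO' placeholders in the training configs.
# Not part of the released code; it only performs a plain-text search.
from pathlib import Path

CONFIGS = [
    "mylavis/projects/blip2/train/pretrain_stage1_vitg.yaml",
    "mylavis/projects/blip2/train/caption_exp_ft.yaml",
]

for cfg in CONFIGS:
    text = Path(cfg).read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "TODO" in line:
            print(f"{cfg}:{lineno}: {line.strip()}")
```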

(2) Test

1. In test.py, fill in the image path and the model checkpoint path:
```python
import torch
from PIL import Image
from mylavis.models import my_load_model_and_preprocess

# load sample image
raw_image = Image.open("figs/happy.png").convert("RGB")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# set max output length
max_len = 200
# load the AU/Emot/Exp-BLIP model checkpoint
# this also loads the associated image processors
dict_path = './exp_blip_vitg_opt6.7b_trimmed.pth'
model, vis_processors, _ = my_load_model_and_preprocess(name="blip2_opt",
                model_type="caption_coco_opt6.7b", dict_path=dict_path, is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate a caption
print('[1 caption]:', model.generate({"image": image}, max_length=max_len))

# use nucleus sampling for diverse outputs
print('[3 captions]:', model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3, max_length=max_len))
```

2. Then run the script to get the captions:

```bash
python test.py
```
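
To caption many images at once, the loaded model and processors can be reused in a loop. The sketch below is only an illustration: the image folder and file pattern are assumptions, and it reuses `model`, `vis_processors`, `device`, and `max_len` from the test.py snippet above.

```python
# Hypothetical batch-captioning sketch reusing `model`, `vis_processors`,
# `device`, and `max_len` defined in test.py above. Paths are example values.
from pathlib import Path
from PIL import Image

image_dir = Path("figs")  # assumed folder of face images
results = {}

for img_path in sorted(image_dir.glob("*.png")):
    raw_image = Image.open(img_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    caption = model.generate({"image": image}, max_length=max_len)[0]
    results[img_path.name] = caption
    print(img_path.name, "->", caption)
```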

✏️ Citation

If you find this work useful for your research, please feel free to leave a star⭐️ and cite our paper:

```bibtex
@inproceedings{yuan2023describe,
  title={Describe Your Facial Expressions by Linking Image Encoders and Large Language Models},
  author={Yuan, Yujian and Zeng, Jiabei and Shan, Shiguang},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2023}
}
```

🤝 Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 62176248). We also thank the ICT computing platform for providing GPUs. We thank Salesforce Research for sharing the code of BLIP-2 via LAVIS. Our code is based on LAVIS.