Rabiul Awal ✨, Saba Ahmadi ✨, Le Zhang ✨, Aishwarya Agrawal
Mila - Quebec AI Institute, University of Montreal
✨ indicates equal contribution
The training set contains 64,392 samples and the VisMin benchmark contains 2,084 samples. Both are stored in JSON format. Each entry contains the image path, the caption, and a list of negative examples; each negative example consists of an edited image path and an edited caption.
- Training Data: 64,392 samples from VSR and COCO 2017 training split.
- Benchmark Data: 2,084 samples from COCO 2017 validation split, human-verified.
Example of a dataset entry:
{
  "image_path": "/coco/images/train2017/000000234136.jpg",
  "caption": "Two men holding a brown and white dog in a van.",
  "negatives": [
    {
      "edited_image_path": "/edited/coco/234136/0.png",
      "edited_caption": "Three men holding a brown and white dog in a van."
    }
  ]
}
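A minimal sketch of loading and iterating over the dataset, assuming the JSON file is a list of entries in the format above (the file name used here is a placeholder):

```python
import json

# "vismin_train.json" is a placeholder file name; the file is assumed to be
# a list of entries in the format shown above.
with open("vismin_train.json") as f:
    entries = json.load(f)

for entry in entries:
    image_path = entry["image_path"]
    caption = entry["caption"]
    for negative in entry["negatives"]:
        edited_image_path = negative["edited_image_path"]
        edited_caption = negative["edited_caption"]
        # each (image, caption) pair comes with its minimal-change counterpart
```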
To fine-tune models, such as pre-trained CLIP, using the hard-negative contrastive loss on the curated dataset, follow these steps:
- Clone the CVPR 2024 paper's codebase, Enhance-FineGrained.
- Specify training parameters such as `--gres=gpu:a100:2` and `batch_size` in `scripts/run_all.sh`; refer to that script for more details.
- To start training, run the following commands:
cd scripts/
bash run_multiple_nodes.sh
The resulting checkpoint will be saved in the `Enhance-FineGrained/src/Outputs` directory.
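For intuition, here is a minimal sketch of a hard-negative contrastive loss of the kind used for such fine-tuning; it illustrates the idea rather than the exact loss implemented in Enhance-FineGrained, and the tensor shapes and temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(image_feats, text_feats, neg_text_feats, temperature=0.07):
    """InfoNCE-style loss where each image is contrasted against all captions
    in the batch plus the hard-negative (minimally edited) captions.

    image_feats:    (B, D) L2-normalized image embeddings
    text_feats:     (B, D) L2-normalized embeddings of the original captions
    neg_text_feats: (B, D) L2-normalized embeddings of the edited captions
    """
    # Candidate texts: original captions followed by their hard negatives -> (2B, D)
    all_texts = torch.cat([text_feats, neg_text_feats], dim=0)
    logits = image_feats @ all_texts.t() / temperature        # (B, 2B)
    # Each image's positive caption sits at the same index in the first half.
    targets = torch.arange(image_feats.size(0), device=image_feats.device)
    return F.cross_entropy(logits, targets)
```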
Models, including CLIP and multimodal LMs, can be evaluated on our VisMin benchmark, which consists of image-text matching tasks. We also support evaluation on a pool of diagnostic datasets such as VALSE, Winoground, and ARO.
# To evaluate two-tower models such as CLIP
python -m evals.contrastive_inference --dataset <dataset_name> --model_name <path_to_model> --pretrained <pretrained_model_name>
# To evaluate generative models such as Idefics2 (https://huggingface.co/blog/idefics2)
python -m evals.mllm_inference --dataset <dataset_name> --model_name <path_to_model>
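For two-tower models, the image-text matching protocol can be sketched as below; this assumes Winoground-style text/image/group scoring over each (image, caption, edited image, edited caption) quadruple, and the function name is illustrative rather than taken from `evals`:

```python
import torch

@torch.no_grad()
def vismin_itm_scores(img_emb, txt_emb, edited_img_emb, edited_txt_emb):
    """Matching on one minimal-change example.

    Each argument is an L2-normalized (D,) embedding from a two-tower model
    such as CLIP. Returns whether the text, image, and combined (group)
    matchings are correct.
    """
    sim = lambda a, b: float(a @ b)  # cosine similarity for normalized vectors
    # Text score: each image must prefer its own caption over the edited one.
    text_ok = (sim(img_emb, txt_emb) > sim(img_emb, edited_txt_emb)
               and sim(edited_img_emb, edited_txt_emb) > sim(edited_img_emb, txt_emb))
    # Image score: each caption must prefer its own image over the edited one.
    image_ok = (sim(img_emb, txt_emb) > sim(edited_img_emb, txt_emb)
                and sim(edited_img_emb, edited_txt_emb) > sim(img_emb, edited_txt_emb))
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}
```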
We use an LLM to generate edit instructions. There are two approaches: for object and attribute changes, we prompt the LLM with the source caption and in-context demonstrations so it suggests a minimal edit; for spatial-relation and counting changes, we prompt the LLM with in-context demonstrations to produce edit instructions together with layouts.
Example of an LLM-generated edit instruction (object attribute category):
{
  "InputCaption": "A glass of ice water sitting next to a wine glass.",
  "SelectedPhrase": "glass of ice water",
  "EditedPhrase": "glass of milk",
  "EditedRegionPhrase": "A glass of milk",
  "EditedCaption": "A glass of milk sitting next to a wine glass.",
  "Category": "object"
}
Example of an LLM-generated edit instruction (spatial and counting category):
"A paint brush is to the left of a palette.": [
"[('a paint brush', [50, 200, 100, 312]), ('a palette', [362, 150, 150, 362])]\nBackground prompt: A realistic scene\nNegative prompt:\nCategory: relation(left of)"
]
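Downstream code needs to split this newline-separated string back into boxes, prompts, and category; here is a minimal sketch of such parsing (the field names returned are illustrative, not the ones used in the codebase):

```python
import ast

def parse_layout_instruction(raw: str) -> dict:
    """Split an LLM-generated layout instruction of the form shown above."""
    lines = raw.split("\n")
    # First line is a Python-literal list of (phrase, [x, y, w, h]) tuples.
    boxes = ast.literal_eval(lines[0])
    fields = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return {
        "objects": boxes,
        "background_prompt": fields.get("Background prompt", ""),
        "negative_prompt": fields.get("Negative prompt", ""),
        "category": fields.get("Category", ""),
    }

example = ("[('a paint brush', [50, 200, 100, 312]), ('a palette', [362, 150, 150, 362])]\n"
           "Background prompt: A realistic scene\nNegative prompt:\nCategory: relation(left of)")
print(parse_layout_instruction(example)["objects"][0])  # ('a paint brush', [50, 200, 100, 312])
```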
To run the script, execute the following from the directory containing `ctrl_edit/`:
# for object attribute category
# requires dataset name to be specified for source of captions
python -m llm_agent.minchange_text_pairs_gen --dataset <name_of_dataset> --prompt_type edit_instructgen_from_caption --language_model_name <name_of_language_model>
# for spatial and counting category
python -m llm_agent.minchange_text_pairs_gen --prompt_type edit_instructgen_from_caption --language_model_name <name_of_language_model>
Generate a magic prompt (to be appended to, e.g., the object name) to give the diffusion model better guidance on the input prompt:
# for object attribute category (e.g. coco dataset)
python -m llm_agent.magic_prompt --dataset coco --language_model_name <name_of_language_model>
# for spatial and counting category
python -m llm_agent.magic_prompt --dataset relation --language_model_name <name_of_language_model>
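For illustration, a tiny sketch of how such a magic prompt might be appended to an object phrase before it is sent to the diffusion model; the suffix shown is a hypothetical example, not an actual model output:

```python
# Hypothetical example: the magic prompt is an LLM-generated suffix that adds
# scene and style detail to the bare object phrase for stronger diffusion guidance.
object_phrase = "a glass of milk"
magic_prompt = "on a wooden kitchen table, natural lighting, photorealistic"  # assumed LLM output
diffusion_prompt = f"{object_phrase}, {magic_prompt}"
print(diffusion_prompt)
```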
We have two approaches to generate minimal-change images:
- Masking and Inpainting: First, we mask the object to be edited in the source image using the Grounding-DINO model. Then, we use diffusion inpainting to generate minimal-change images.
- Layout Swapping: We use GLIGEN layout-diffusion to swap objects in the source image to generate edited images. For counting changes, we remove objects using their bounding boxes and create edited images.
Run the following command:
# for object attribute category (e.g. coco dataset)
# this script loads segmentation model, the diffusion model and vqa model
python -m ctrl_edit.inpaint_with_mask --language_model_name <llm_used_to_generate_edit_instructions> --dataset <dataset_name> --output <path_to_edited_image>
# for spatial and counting category (generated from scratch using layout diffusion model)
# dataset name can be "relation" or "counting"
# this script loads the layout diffusion model and vqa model
python3 -m ctrl_edit.diffusion_llm_grounded_old --repeats 3 \
--frozen_step_ratio 0.5 --no-scale-boxes-default \
--sdxl --sdxl-step-ratio 0.4 \
--dataset <dataset_name> \
--split <split_name>
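To illustrate the masking-and-inpainting approach described above, here is a minimal sketch using Hugging Face `diffusers`; the checkpoint name, file names, and the pre-computed mask are assumptions, and the actual `ctrl_edit.inpaint_with_mask` script additionally wires in the segmentation and VQA models:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Checkpoint name is an assumption; any inpainting-capable checkpoint works.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# Source image and a binary mask over the object to edit; the mask is assumed
# to have been produced by the Grounding-DINO + segmentation step.
image = Image.open("source.jpg").convert("RGB").resize((512, 512))
mask = Image.open("object_mask.png").convert("L").resize((512, 512))

# Inpaint the masked region so that it depicts the edited phrase.
edited = pipe(prompt="a glass of milk", image=image, mask_image=mask).images[0]
edited.save("edited.png")
```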
Verify the edited images using the VQA-filter approach. First, generate the local-global VQA questions and answers with an LLM, following the edit instructions.
# To create the local-global VQA questions and answers using LLM-generated edit instructions from one of the previous steps:
python -m ctrl_edit.llm_agent.auto_filter_question_gen --language_model_name <name_of_language_model>
# Automatically filter out bad edited images using the local-global VQA approach:
python -m ctrl_edit.filters.tifa_filter --dataset <dataset_name>
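The filtering idea can be sketched as follows: ask a VQA model the generated local and global questions about each edited image and keep the image only if every answer matches the expected one. The `vqa_model` interface and the question format below are assumptions, not the exact `tifa_filter` implementation:

```python
def passes_vqa_filter(vqa_model, edited_image, qa_pairs) -> bool:
    """qa_pairs: list of dicts like
    {"question": "Is there a glass of milk?", "expected_answer": "yes", "scope": "local"}.
    A local question checks the edited region; a global question checks overall
    scene consistency. The edited image is kept only if all answers match.
    """
    for qa in qa_pairs:
        predicted = vqa_model(image=edited_image, question=qa["question"])  # assumed interface
        if predicted.strip().lower() != qa["expected_answer"].strip().lower():
            return False
    return True
```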
git clone https://github.com/rabiulcste/vismin
cd vismin
pip install -r requirements.txt
The codebase is built on top of the following repositories: