The official source code for LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation, accepted at CVPR 2024.
We address two issues inherent in the conventional approach (Parser + Knowledge Base (WordNet)):
- Semantic Over-simplification (Step 2)
The standard scene graph parser commonly converts fine-grained predicates into coarse-grained ones, which we refer to as semantic over-simplification. For example, in Figure (c), the informative predicate lying on in the image caption is undesirably converted into the less informative predicate on, because the rule-based scene parser fails to capture the predicate lying on, and its heuristic rules fall short of accommodating the diverse range of caption structures. As a result, in Figure (b), the predicate distribution becomes long-tailed. To make matters worse, 12 out of 50 predicates are non-existent, which means that these 12 predicates can never be predicted.
- Low-density Scene Graph (Step 3)
The triplet alignment based on the knowledge base (i.e., WordNet) leads to low-density scene graphs, i.e., the number of triplets remaining after Step 3 is small. Specifically, a triplet is discarded if any of its three components (i.e., subject, predicate, object), or their synonyms/hypernyms/hyponyms, fails to align with the entity or predicate classes in the target data. For example, in Figure (d), the triplet <elephant, carrying, log> is discarded because neither log nor its synonyms/hypernyms exist in the target data, even though elephant and carrying do. As a result, a large number of predicates are discarded, resulting in poor generalization and performance degradation. This is attributed to the fact that the static structured knowledge of the KB is insufficient to cover the semantic relationships among a wide range of words.
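For concreteness, the kind of KB-based alignment performed in Step 3 can be mimicked with a small WordNet lookup. The sketch below is illustrative only (it uses nltk with toy target class lists, not the repository's code) and shows how a single unalignable component, such as log, causes the whole triplet to be discarded.

```python
# Illustrative sketch (not the repository's code): how WordNet-based alignment
# in Step 3 can discard an entire triplet.
# Requires: nltk + the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def related_lemmas(word):
    """Collect the word itself plus synonyms, hypernyms, and hyponyms from WordNet."""
    lemmas = {word}
    for synset in wn.synsets(word.replace(" ", "_")):
        lemmas.update(l.name().replace("_", " ") for l in synset.lemmas())
        for rel in synset.hypernyms() + synset.hyponyms():
            lemmas.update(l.name().replace("_", " ") for l in rel.lemmas())
    return lemmas

def align(word, target_classes):
    """Return a matching target class, or None if the word cannot be aligned."""
    candidates = related_lemmas(word)
    return next((c for c in target_classes if c in candidates), None)

entity_classes = {"elephant", "tree", "man"}    # toy subset of target entity classes
predicate_classes = {"carrying", "on", "has"}   # toy subset of target predicate classes

subj, pred, obj = "elephant", "carrying", "log"
aligned = (align(subj, entity_classes), align(pred, predicate_classes), align(obj, entity_classes))
# If any component fails to align (here, "log"), the whole triplet is discarded.
keep = all(a is not None for a in aligned)
print(aligned, "-> kept" if keep else "-> discarded")
```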
To alleviate the two aforementioned issues, we adopt a pre-trained Large Language Model (LLM). Inspired by Chain-of-Thought (CoT), which arrives at an answer in a stepwise manner, we separate the triplet formation process into two chains, each of which replaces the rule-based parser in Step 2 (i.e., Chain-1) and the KB in Step 3 (i.e., Chain-2).
As the LLM, we employ gpt-3.5-turbo in ChatGPT.
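The exact prompts and extraction code are released in triplet_extraction_process/ (see below). Purely as an illustration of the two-chain idea, the sketch below shows how Chain-1 and Chain-2 could be issued to gpt-3.5-turbo; the prompt texts, the ask helper, the example caption, and the openai>=1.0 client usage are simplifying assumptions, not the prompts used in the paper.

```python
# Rough illustration only (assumes the openai>=1.0 client; the actual prompts
# live in triplet_extraction_process/ and differ from these placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt):
    """Send a single prompt to gpt-3.5-turbo and return the text reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

caption = "An elephant is lying on the grass while carrying a log."

# Chain-1: replaces the rule-based parser -- extract <subject, predicate, object>
# triplets directly from the caption.
chain1 = ask(
    "Extract <subject, predicate, object> triplets from the following caption. "
    f"Caption: {caption}"
)

# Chain-2: replaces the knowledge base -- align the extracted entities and
# predicates with the entity/predicate classes of the target dataset (e.g., VG).
chain2 = ask(
    "Align each entity and predicate in these triplets with the closest class "
    f"from the target lexicon, or answer 'none'. Triplets: {chain1}"
)

print(chain2)
```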
- Release prompts and code for training the model with the Conceptual Caption dataset
- Release enhanced scene graph datasets of Conceptual Caption
- Release prompts and code for training the model with the Visual Genome caption dataset
- Release enhanced scene graph datasets of Visual Genome caption
conda create -n llm4sgg python=3.9.0 -y
conda activate llm4sgg
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install openai einops shapely timm yacs tensorboardX ftfy prettytable pymongo tqdm numpy python-magic pandas
pip install transformers==4.35.0
Once the packages have been installed, please run the setup.py file:
python setup.py build develop --user
Please refer to dataset/README.md
You can find a detailed explanation of the triplet extraction process in triplet_extraction_process/README.md.
The detailed paths of the localized triplets are in the maskrcnn_benchmark/config/paths_catalog.py file.
Models trained on caption datasets (e.g., COCO, CC, and VG Caption) are evaluated on the VG test dataset.
The required file (i.e., localized triplets made by LLM4SGG) and pre-trained model (i.e., GLIP) will be automatically downloaded to facilitate your implementation. Simply change the DATASET name as needed.
# DATASET: coco, cc, and vgcaption
bash scripts/single_gpu/train_{DATASET}4vg.sh
# DATASET: coco, cc, and vgcaption
bash scripts/multi_gpu/train_{DATASET}4vg.sh
If you want to train the model with the reweighting strategy, run the command below (a generic sketch of the reweighting idea follows these commands).
# Training data: COCO
bash scripts/{multi_gpu or single_gpu}/train_coco4vg_rwt.sh
bash scripts/{multi_gpu or single_gpu}/train_coco4gqa.sh
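For reference, reweighting strategies of this kind typically scale the predicate classification loss inversely to predicate frequency so that tail predicates are not dominated by head ones. The sketch below is a generic illustration of that idea in PyTorch with made-up counts; it is not the loss implemented in this repository.

```python
# Generic illustration of frequency-based predicate reweighting (PyTorch);
# this is NOT the repository's implementation.
import torch
import torch.nn as nn

# Hypothetical predicate counts gathered from the weakly supervised triplets.
predicate_counts = torch.tensor([50000.0, 1200.0, 300.0, 45.0])  # head -> tail

# Inverse-frequency weights, normalized so the average weight is 1.
weights = 1.0 / predicate_counts
weights = weights * (len(weights) / weights.sum())

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 4)            # batch of predicate logits
labels = torch.randint(0, 4, (8,))    # ground-truth predicate indices
loss = criterion(logits, labels)      # tail predicates contribute more per sample
```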
# Please change the model checkpoint in the test.sh file
bash scripts/test.sh
We also provide pre-trained models and other results.
- model_VG_VS3.pth, config.yml, evaluation_res.txt
- model_VG_VS3_Rwt.pth, config.yml, evaluation_res.txt
- Chain 1 Output: misaligned_triplets_vg_caption.json
- Chain 2 Output: aligned_entity_dict_vg_caption4vg.pkl, aligned_predicate_dict_vg_caption4vg.pkl
- Grounded Scene Graphs: aligned_triplet_vgcaption4vg_grounded.json
- Training Result: evaluation_res.txt
- Chain 1 Output: misaligned_triplets_cc.json
- Chain 2 Output: aligned_entity_dict_cc4vg.pkl, aligned_predicate_dict_cc4vg.pkl
- Grounded Scene Graphs: aligned_triplet_cc4vg_grounded.json
- Training Result: model_CC4VG.pth, evaluation_res.txt
@InProceedings{Kim_2024_CVPR,
author = {Kim, Kibum and Yoon, Kanghoon and Jeon, Jaehyeong and In, Yeonjun and Moon, Jinyoung and Kim, Donghyun and Park, Chanyoung},
title = {LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {28306-28316}
}
The code is developed on top of VS3.