The official source code for LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation, accepted at CVPR 2024.
We address two issues inherent in the conventional approach (Parser + Knowledge Base (WordNet)):
- Semantic Over-simplification (Step 2)
The standard scene graph parser commonly converts fine-grained predicates into coarse-grained ones, which we refer to as semantic over-simplification. For example, in Figure (c), the informative predicate lying on in the image caption is undesirably converted into the less informative predicate on, because the rule-based scene parser fails to capture the predicate lying on, and its heuristic rules fall short of accommodating the diverse range of caption structures. As a result, in Figure (b), the predicate distribution becomes long-tailed. To make matters worse, 12 out of 50 predicates are non-existent, which means that these 12 predicates can never be predicted.
- Low-density Scene Graph (Step 3)
The triplet alignment based on the knowledge base (i.e., WordNet) leads to low-density scene graphs, i.e., the number of triplets remaining after Step 3 is small. Specifically, a triplet is discarded if any of its three components (i.e., subject, predicate, object), or their synonyms/hypernyms/hyponyms, fails to align with the entity or predicate classes in the target data. For example, in Figure (d), the triplet <elephant, carrying, log> is discarded because neither log nor its synonyms/hypernyms exist in the target data, even though elephant and carrying do. As a result, a large number of predicates are discarded, resulting in poor generalization and performance degradation. This is attributed to the fact that the static structured knowledge of the KB is insufficient to cover the semantic relationships among a wide range of words.
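For concreteness, the kind of KB-based alignment performed in Step 3 can be mimicked with a small WordNet lookup. The sketch below is illustrative only (it uses nltk with toy target class lists, not the repository's code) and shows how a single unalignable component, such as log, causes the whole triplet to be discarded.

```python
# Illustrative sketch (not the repository's code): how WordNet-based alignment
# in Step 3 can discard an entire triplet.
# Requires: nltk + the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def related_lemmas(word):
    """Collect the word itself plus synonyms, hypernyms, and hyponyms from WordNet."""
    lemmas = {word}
    for synset in wn.synsets(word.replace(" ", "_")):
        lemmas.update(l.name().replace("_", " ") for l in synset.lemmas())
        for rel in synset.hypernyms() + synset.hyponyms():
            lemmas.update(l.name().replace("_", " ") for l in rel.lemmas())
    return lemmas

def align(word, target_classes):
    """Return a matching target class, or None if the word cannot be aligned."""
    candidates = related_lemmas(word)
    return next((c for c in target_classes if c in candidates), None)

entity_classes = {"elephant", "tree", "man"}    # toy subset of target entity classes
predicate_classes = {"carrying", "on", "has"}   # toy subset of target predicate classes

subj, pred, obj = "elephant", "carrying", "log"
aligned = (align(subj, entity_classes), align(pred, predicate_classes), align(obj, entity_classes))
# If any component fails to align (here, "log"), the whole triplet is discarded.
keep = all(a is not None for a in aligned)
print(aligned, "-> kept" if keep else "-> discarded")
```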
To alleviate the two aforementioned issues, we adopt a pre-trained Large Language Model (LLM). Inspired by Chain-of-Thought (CoT), which arrives at an answer in a stepwise manner, we separate the triplet formation process into two chains, each of which replaces the rule-based parser in Step 2 (i.e., Chain-1) and the KB in Step 3 (i.e., Chain-2).
As the LLM, we employ gpt-3.5-turbo in ChatGPT.
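The exact prompts and extraction code are released in triplet_extraction_process/ (see below). Purely as an illustration of the two-chain idea, the sketch below shows how Chain-1 and Chain-2 could be issued to gpt-3.5-turbo; the prompt texts, the ask helper, the example caption, and the openai>=1.0 client usage are simplifying assumptions, not the prompts used in the paper.

```python
# Rough illustration only (assumes the openai>=1.0 client; the actual prompts
# live in triplet_extraction_process/ and differ from these placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt):
    """Send a single prompt to gpt-3.5-turbo and return the text reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

caption = "An elephant is lying on the grass while carrying a log."

# Chain-1: replaces the rule-based parser -- extract <subject, predicate, object>
# triplets directly from the caption.
chain1 = ask(
    "Extract <subject, predicate, object> triplets from the following caption. "
    f"Caption: {caption}"
)

# Chain-2: replaces the knowledge base -- align the extracted entities and
# predicates with the entity/predicate classes of the target dataset (e.g., VG).
chain2 = ask(
    "Align each entity and predicate in these triplets with the closest class "
    f"from the target lexicon, or answer 'none'. Triplets: {chain1}"
)

print(chain2)
```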
- Release prompts and code for training the model with the Conceptual Caption dataset
- Release enhanced scene graph datasets of Conceptual Caption
- Release prompts and code for training the model with the Visual Genome caption dataset
- Release enhanced scene graph datasets of Visual Genome caption
conda create -n llm4sgg python=3.9.0 -y
conda activate llm4sgg
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install openai einops shapely timm yacs tensorboardX ftfy prettytable pymongo tqdm numpy python-magic pandas
pip install transformers==4.35.0
Once the packages have been installed, please run the setup.py file:
python setup.py build develop --user
Please refer to dataset/README.md
You can find a detailed explanation of the triplet extraction process in triplet_extraction_process/README.md.
The detailed paths of the localized triplets are in the maskrcnn_benchmark/config/paths_catalog.py file.
Models trained on caption datasets (e.g., COCO, CC, and VG Caption) are evaluated on the VG test dataset.
The required file (i.e., localized triplets made by LLM4SGG) and pre-trained model (i.e., GLIP) will be automatically downloaded to facilitate your implementation. Simply change the DATASET name as needed.
# DATASET: coco, cc, and vgcaption
bash scripts/single_gpu/train_{DATASET}4vg.sh
# DATASET: coco, cc, and vgcaption
bash scripts/multi_gpu/train_{DATASET}4vg.sh
If you want to train the model with the reweighting strategy, run the command below (a generic sketch of the reweighting idea follows these commands).
# Training data: COCO
bash scripts/{multi_gpu or single_gpu}/train_coco4vg_rwt.sh
bash scripts/{multi_gpu or single_gpu}/train_coco4gqa.sh
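For reference, reweighting strategies of this kind typically scale the predicate classification loss inversely to predicate frequency so that tail predicates are not dominated by head ones. The sketch below is a generic illustration of that idea in PyTorch with made-up counts; it is not the loss implemented in this repository.

```python
# Generic illustration of frequency-based predicate reweighting (PyTorch);
# this is NOT the repository's implementation.
import torch
import torch.nn as nn

# Hypothetical predicate counts gathered from the weakly supervised triplets.
predicate_counts = torch.tensor([50000.0, 1200.0, 300.0, 45.0])  # head -> tail

# Inverse-frequency weights, normalized so the average weight is 1.
weights = 1.0 / predicate_counts
weights = weights * (len(weights) / weights.sum())

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 4)            # batch of predicate logits
labels = torch.randint(0, 4, (8,))    # ground-truth predicate indices
loss = criterion(logits, labels)      # tail predicates contribute more per sample
```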
# Please change the model checkpoint in the test.sh file
bash scripts/test.sh
We also provide pre-trained models and other results.
- model_VG_VS3.pth, config.yml, evaluation_res.txt
- model_VG_VS3_Rwt.pth, config.yml, evaluation_res.txt
- Chain 1 Output: misaligned_triplets_vg_caption.json
- Chain 2 Output: aligned_entity_dict_vg_caption4vg.pkl, aligned_predicate_dict_vg_caption4vg.pkl
- Grounded Scene Graphs: aligned_triplet_vgcaption4vg_grounded.json
- Training Result: evaluation_res.txt
- Chain 1 Output: misaligned_triplets_cc.json
- Chain 2 Output: aligned_entity_dict_cc4vg.pkl, aligned_predicate_dict_cc4vg.pkl
- Grounded Scene Graphs: aligned_triplet_cc4vg_grounded.json
- Training Result: model_CC4VG.pth, evaluation_res.txt
@InProceedings{Kim_2024_CVPR,
author = {Kim, Kibum and Yoon, Kanghoon and Jeon, Jaehyeong and In, Yeonjun and Moon, Jinyoung and Kim, Donghyun and Park, Chanyoung},
title = {LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {28306-28316}
}
The code is developed on top of VS3.