LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation

The official source code for LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation, accepted at CVPR 2024.

Overview

We address two issues inherent in the conventional approach (Parser + Knowledge Base (WordNet)):

  • Semantic Over-simplification (Step 2)
    The standard scene graph parser commonly converts fine-grained predicates into coarse-grained ones, which we refer to as semantic over-simplification. For example, in Figure (c), the informative predicate lying on in the image caption is undesirably converted into the less informative predicate on, because the rule-based scene parser fails to capture the predicate lying on as a whole, and its heuristic rules fall short of accommodating the diverse range of caption structures. As a result, in Figure (b), the predicate distribution is long-tailed. To make matters worse, 12 out of 50 predicates are non-existent, which means that these 12 predicates can never be predicted.

  • Low-density Scene Graph (Step 3)
    The triplet alignment based on a knowledge base (i.e., WordNet) leads to low-density scene graphs, i.e., the number of triplets remaining after Step 3 is small. Specifically, a triplet is discarded if any of its three components (i.e., subject, predicate, object), or their synonyms/hypernyms/hyponyms, fails to align with the entity or predicate classes in the target data (see the sketch below). For example, in Figure (d), the triplet <elephant, carrying, log> is discarded because neither log nor any of its synonyms/hypernyms exists in the target data, even though elephant and carrying do. As a result, a large number of predicates are discarded, resulting in poor generalization and performance degradation. This is attributed to the fact that the static structured knowledge of a KB is insufficient to cover the semantic relationships among a wide range of words.
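
To make the low-density issue concrete, here is a minimal sketch of the KB-based alignment in Step 3 under stated assumptions: it uses nltk's WordNet interface, the toy vocabularies and helper names are ours (not the actual Visual Genome class lists), and it is not the code of this repository.

# Illustrative sketch of Step 3 (KB-based triplet alignment) -- not the repository's code.
# Assumes nltk with the WordNet corpus; the toy vocabularies below are hypothetical.
from nltk.corpus import wordnet as wn

TARGET_ENTITIES = {"elephant", "tree", "man"}      # toy subset of VG entity classes
TARGET_PREDICATES = {"carrying", "on", "holding"}  # toy subset of VG predicate classes

def related_words(word):
    """The word itself plus its WordNet synonyms, hypernyms, and hyponyms."""
    words = {word}
    for syn in wn.synsets(word):
        words.update(l.lower() for l in syn.lemma_names())
        for rel in syn.hypernyms() + syn.hyponyms():
            words.update(l.lower() for l in rel.lemma_names())
    return words

def align(triplet):
    """Keep a <subject, predicate, object> triplet only if every component
    (or one of its related words) matches a target class; otherwise discard it."""
    subj, pred, obj = triplet
    if not related_words(subj) & TARGET_ENTITIES:
        return None
    if not related_words(pred) & TARGET_PREDICATES:
        return None
    if not related_words(obj) & TARGET_ENTITIES:
        return None
    return triplet

# <elephant, carrying, log> is discarded because "log" and its related words are
# absent from the toy target entity classes, even though the other two components match.
print(align(("elephant", "carrying", "log")))  # None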

Proposed Approach: LLM4SGG

To alleviate the two aforementioned issues, we adopt a pre-trained Large Language Model (LLM). Inspired by the idea of Chain-of-Thought (CoT), which arrives at an answer in a stepwise manner, we separate the triplet formation process into two chains, each of which replaces the rule-based parser in Step 2 (i.e., Chain-1) and the KB in Step 3 (i.e., Chain-2).

As the LLM, we employ gpt-3.5-turbo in ChatGPT.
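
As a rough illustration of the two chains (not the exact prompts or parsing code of this repository, which live in triplet_extraction_process/), the following sketch queries gpt-3.5-turbo via the openai>=1.0 client; the prompt wording, caption, and helper names are assumptions.

# Rough sketch of the two LLM chains -- the prompt wording is illustrative, not the paper's.
# Assumes the openai>=1.0 client and an API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

caption = "An elephant lying on the grass is carrying a log."

# Chain-1: replace the rule-based parser (Step 2) with the LLM.
triplets = ask(
    "Extract <subject, predicate, object> triplets from the caption, keeping "
    f"fine-grained predicates such as 'lying on'.\nCaption: {caption}"
)

# Chain-2: replace the KB-based alignment (Step 3) with the LLM.
aligned = ask(
    "Align each triplet's subject, object, and predicate with the Visual Genome "
    "entity and predicate classes, or answer 'None' if no reasonable alignment "
    f"exists.\nTriplets: {triplets}"
)
print(aligned)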

TODO List

  • Release prompts and code for training the model with the Conceptual Captions dataset
  • Release enhanced scene graph datasets of Conceptual Captions
  • Release prompts and code for training the model with the Visual Genome caption dataset
  • Release enhanced scene graph datasets of Visual Genome caption

Installation

conda create -n llm4sgg python=3.9.0 -y
conda activate llm4sgg

pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install openai einops shapely timm yacs tensorboardX ftfy prettytable pymongo tqdm numpy python-magic pandas
pip install transformers==4.35.0

Once the packages have been installed, please run the setup.py file.

python setup.py build develop --user
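
Optionally (this check is our suggestion, not part of the official instructions), you can verify that the CUDA-enabled PyTorch build is importable before training:

# Optional sanity check (our suggestion, not part of the official instructions):
# confirm the CUDA-enabled builds installed above are importable and see a GPU.
import torch, torchvision

print(torch.__version__, torchvision.__version__)  # expected: 1.10.0+cu111 0.11.0+cu111
print(torch.cuda.is_available())                    # expected: True on a CUDA machine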

Dataset

Please refer to dataset/README.md

Triplet Extraction Process

You can find a detailed explanation of the triplet extraction process in triplet_extraction_process/README.md

Train

The detailed paths of the localized triplets are listed in the maskrcnn_benchmark/config/paths_catalog.py file.

Test set: VG

Models trained on caption datasets (e.g., COCO, CC, and VG Caption) are evaluated on the VG test set.

The required files (i.e., the localized triplets produced by LLM4SGG) and the pre-trained model (i.e., GLIP) will be downloaded automatically to facilitate your implementation. Simply change the DATASET name as needed.

Single GPU

# DATASET: coco, cc, or vgcaption
bash scripts/single_gpu/train_{DATASET}4vg.sh

Multi GPU

# DATASET: coco, cc, or vgcaption
bash scripts/multi_gpu/train_{DATASET}4vg.sh

If you want to train the model with the reweighting strategy, please run the following code.

# Training data: COCO
bash scripts/{multi_gpu or single_gpu}/train_coco4vg_rwt.sh

Test set: GQA

bash scripts/{multi_gpu or single_gpu}/train_coco4gqa.sh

Test

# Please change the model checkpoint in the test.sh file
bash scripts/test.sh 

We also provide pre-trained models and other results.

COCO → VG test

VG Caption → VG test

CC → VG test

COCO → GQA test

Citation

@InProceedings{Kim_2024_CVPR,
    author    = {Kim, Kibum and Yoon, Kanghoon and Jeon, Jaehyeong and In, Yeonjun and Moon, Jinyoung and Kim, Donghyun and Park, Chanyoung},
    title     = {LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {28306-28316}
}

Acknowledgement

The code is developed on top of VS3.