
Code for GLAT (Global-Local Transformer), from the ECCV 2020 paper "Learning Visual Commonsense for Robust Scene Graph Generation"


Global Local Transformer for Scene Graph Generation

[Caution: This repository is still under development and not cleanly documented yet. We recommend using it only as a reference.]

In this repository, we build our Global-Local Transformer (GLAT) model on top of a selection of base scene graph generator models, including KERN, Neural Motifs, and Stanford (iterative message passing), to improve scene graph generation by leveraging visual commonsense.

The corresponding paper was accepted at ECCV 2020 (arXiv preprint arXiv:2006.09623): Alireza Zareian*, Zhecan Wang*, Haoxuan You*, Shih-Fu Chang, "Learning Visual Commonsense for Robust Scene Graph Generation", ECCV, 2020. (* co-first authors) [manuscript]

For pretraining and independent finetuning of GLAT, please refer to the companion repository: https://github.com/ZhecanJamesWang/GLAT_Visual_Commonsense

References to Base Scene Graph Generators

Knowledge-Embedded Routing Network for Scene Graph Generation

Tianshui Chen*, Weihao Yu*, Riquan Chen, and Liang Lin, “Knowledge-Embedded Routing Network for Scene Graph Generation”, CVPR, 2019. (* co-first authors) [manuscript]

Neural Motifs: Scene Graph Parsing with Global Context

Zellers R, Yatskar M, Thomson S, Choi Y. "Neural motifs: Scene graph parsing with global context". CVPR, 2018.

Scene Graph Generation by Iterative Message Passing

Xu D, Zhu Y, Choy CB, Fei-Fei L. "Scene graph generation by iterative message passing". CVPR, 2017.

Evaluation Metrics

In the validation/test dataset, assume there are $Y$ images. For each image, a model generates the top $X$ predicted relationship triplets. For image $I_y$, there are $G_y$ ground truth relationship triplets, of which $T_y$ triplets are predicted successfully by the model. We can calculate:

$$R@X = \frac{1}{Y} \sum_{y=1}^{Y} \frac{T_y}{G_y}$$

For image $I_y$, among its $G_y$ ground truth relationship triplets, there are $G_y^k$ ground truth triplets with relationship $k$ (except $k=1$, meaning no relationship; the number of relationship classes is $K$, including no relationship), of which $T_y^k$ triplets are predicted successfully by the model. In the $Y$ images of the validation/test dataset, for relationship $k$, there are $Y_k$ images which contain at least one ground truth triplet with this relationship. The R@X of relationship $k$ can be calculated:

$$R@X_k = \frac{1}{Y_k} \sum_{y=1}^{Y} \mathbb{1}\left[G_y^k > 0\right] \frac{T_y^k}{G_y^k}$$

Then we can calculate the mean recall over the $K-1$ real relationship classes:

$$mR@X = \frac{1}{K-1} \sum_{k=2}^{K} R@X_k$$
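The metrics above can be sketched in plain Python. This is a minimal illustration, not the evaluation code in this repository: it assumes ground-truth and predicted triplets are given as sets of `(subject_id, relationship, object_id)` tuples per image, with predictions already truncated to the top X, and it averages the per-relationship recall over the classes that actually occur in the ground truth (i.e. those with $Y_k > 0$).

```python
from collections import defaultdict

def recall_metrics(gt_triplets_per_image, pred_triplets_per_image):
    """Compute (R@X, mR@X) from per-image triplet sets.

    Each list entry is a set of (subject_id, relationship, object_id)
    tuples; the predicted sets are assumed to already be the model's
    top-X predictions for that image.
    """
    per_image_recall = []                 # T_y / G_y for each image
    per_rel_recalls = defaultdict(list)   # k -> T_y^k / G_y^k over images containing k

    for gt, pred in zip(gt_triplets_per_image, pred_triplets_per_image):
        matched = gt & pred               # T_y: ground-truth triplets the model recovered
        per_image_recall.append(len(matched) / len(gt))

        gt_by_rel = defaultdict(int)      # G_y^k
        hit_by_rel = defaultdict(int)     # T_y^k
        for (_, k, _) in gt:
            gt_by_rel[k] += 1
        for (_, k, _) in matched:
            hit_by_rel[k] += 1
        for k, g in gt_by_rel.items():
            per_rel_recalls[k].append(hit_by_rel[k] / g)

    r_at_x = sum(per_image_recall) / len(per_image_recall)
    # R@X_k averaged over the Y_k images containing relationship k,
    # then mR@X averaged over the observed relationship classes.
    per_rel = {k: sum(v) / len(v) for k, v in per_rel_recalls.items()}
    m_r_at_x = sum(per_rel.values()) / len(per_rel)
    return r_at_x, m_r_at_x
```

For example, with two images where the model recovers one of two triplets in the first and the single triplet in the second, both R@X and mR@X come out to 0.75.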