1 Ludwig Maximilian University, Munich, Germany, 2 Sapienza University of Rome, Italy
3 Siemens AG, Munich, Germany
Visual relation detection methods rely on object information extracted from RGB images such as 2D bounding boxes, feature maps, and predicted class probabilities. We argue that depth maps can additionally provide valuable information on object relations, e.g. helping to detect not only spatial relations, such as standing behind, but also non-spatial relations, such as holding. In this work, we study the effect of using different object features with a focus on depth maps. To enable this study, we release a new synthetic dataset of depth maps, VG-Depth, as an extension to Visual Genome (VG). We also note that given the highly imbalanced distribution of relations in VG, typical evaluation metrics for visual relation detection cannot reveal improvements of under-represented relations. To address this problem, we propose using an additional metric, calling it Macro Recall@K, and demonstrate its remarkable performance on VG. Finally, our experiments confirm that by effective utilization of depth maps within a simple, yet competitive framework, the performance of visual relation detection can be improved by a margin of up to 8%.
-
We perform an extensive study on the effect of using different sources of object information in visual relation detection. We show in our empirical evaluations using the VG dataset, that our model can outperform competing methods by a margin of up to 8% points, even those using external language sources or contextualization.
-
We release a new synthetic dataset VG-Depth, to compensate for the lack of depth maps in Visual Genome.
-
We propose Macro Recall@K as a competitive metric for evaluating the visual relation detection performance in highly imbalanced datasets such as Visual Genome.
We release a new dataset called VG-Depth as an extension to Visual Genome. This dataset contains synthetically generated depth maps from Visual Genome images and can be downloaded from the following link: VG-Depth
Here are some examples of the Visual Genome images and their corresponding depth maps provided in our dataset:
Some of the qualitative results from our model’s predictions. Green arrows indicate the successfully detected predicates (true positives), orange arrows indicate the false negatives and gray arrows indicate predicted links which are not annotated in the ground truth.
The main requirements of our code are as follows:
- Python >= 3.6 (for Python 3.7 please check "Python3.7-beta" branch)
- PyTorch >= 1.1
- TorchVision >= 0.2.0
- TensorboardX
- CUDA Toolkit 10.0
- Pandas
- Overrides
- Gdown
Please make sure you have Cuda Toolkit 10.0 installed, then you can setup the environment by calling the following script:
./setup_env.sh
This script will perform the following operations:
- Install the required libraries.
- Download Visual-Genome.
- Download the Depth version of VG.
- Download the necessary checkpoints.
- Compile the CUDA libraries.
- Prepare the environment.
If you have already installed some of the libraries or downloaded the datasets, you can run the following scripts individually:
./data/fetch_dataset.sh # Download Visual-Genome
./data/fetch_depth_1024.sh # Download VG-Depth
./checkpoints/fetch_checkpoints.sh # Download the Checkpoints
Set the dataset path in the config.py file, and adjust your PYTHONPATH (e.g. export PYTHONPATH=/home/sina/Depth-VRD).
To train or evaluate the networks, select the configuration index and run the following scripts: (e.g. to evaluate "Ours-v,d" model run ./scripts/shz_models/eval_scripts.sh 7)
- To train the depth model separately (without the fusion layer):
./scripts/shz_models/train_depth.sh <configuration index>
- To train each feature separately (e.g. class features):
./scripts/shz_models/train_individual.sh <configuration index>
- To train the fusion models:
./scripts/shz_models/train_fusion.sh <configuration index>
- To evaluate the models:
./scripts/shz_models/eval_scripts.sh <configuration index>
The training scripts will be performed for eight different random seeds and 25 epochs each.
You can also separately download the full model (LCVD) here. Please note that this checkpoint generates results that are slightly different that the ones reported in our paper. The reason is that for a fair comparison, we have reported the mean values from evaluations over several models.
This repository is created and maintained by Sahand Sharifzadeh, Sina Moayed Baharlou, and Max Berrendorf.
The skeleton of our code is built on top of the nicely organized Neural-Motifs framework (incl. the rgb data loading pipeline and part of the evaluation code). We have upgraded these parts to be compatible with PyTorch 1. To enable a fair comparison of models, our object detection backbone also uses the same Faster-RCNN weights as that work.
@INPROCEEDINGS{9412945,
author={Sharifzadeh, Sahand and Baharlou, Sina Moayed and Berrendorf, Max and Koner, Rajat and Tresp, Volker},
booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
title={Improving Visual Relation Detection using Depth Maps},
year={2021},
volume={},
number={},
pages={3597-3604},
doi={10.1109/ICPR48806.2021.9412945}
}