The official dataset of the paper "Advancing Visual Grounding with Scene Knowledge: Benchmark and Method" (CVPR 2023).
We introduce a challenging task that requires visual grounding (VG) models to reason over (image, scene knowledge, query) triples, and we build a new dataset, SK-VG, on top of real images through manual annotation. In SK-VG, the image content and the referring expression alone are not sufficient to ground the target object, which forces models to reason over the long-form scene knowledge. An example annotation looks like this:
```json
{
    "image_name": "3853.jpg",
    "knowledge": "The man on the far right of the image is Spider-Man Bruce. A spider is painted on his back. His enemy Brandon is floating in the air across from him, wearing sunglasses. Brandon's servant Tom is behind Brandon, holding a cane in his hand. Bruce comes to destroy them today.",
    "ref_exp": "Bruce's enemy Brandon",
    "bbox": {
        "x": 1063.1217116217117,
        "y": 385.6505161505161,
        "width": 430.26939726939736,
        "height": 705.6600066600066
    }
}
```
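For reference, here is a minimal Python sketch for reading annotations in the format shown above and converting each `bbox` from `{x, y, width, height}` to the common `[x1, y1, x2, y2]` form. The file name `annotations.json` and the assumption that the file holds a list of such entries are illustrative only, not the official loading code.

```python
import json


def load_annotations(path):
    """Load SK-VG-style annotations from a JSON file.

    Assumes the file contains a list of entries shaped like the example
    above; the actual release may organize its files differently.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def bbox_to_xyxy(bbox):
    """Convert an {x, y, width, height} box to (x1, y1, x2, y2)."""
    x1, y1 = bbox["x"], bbox["y"]
    return x1, y1, x1 + bbox["width"], y1 + bbox["height"]


if __name__ == "__main__":
    # "annotations.json" is a hypothetical file name for illustration.
    for entry in load_annotations("annotations.json"):
        x1, y1, x2, y2 = bbox_to_xyxy(entry["bbox"])
        print(entry["image_name"], entry["ref_exp"], (x1, y1, x2, y2))
```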
You can download the dataset from Google Drive.
If you find this dataset helpful, please cite the paper below.
```bibtex
@inproceedings{chen2023advancing,
    title={Advancing Visual Grounding With Scene Knowledge: Benchmark and Method},
    author={Chen, Zhihong and Zhang, Ruifei and Song, Yibing and Wan, Xiang and Li, Guanbin},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={15039--15049},
    year={2023}
}
```