
Spatial Commonsense

Source code and data for Things not Written in Text: Exploring Spatial Commonsense from Visual Signals (ACL 2022 main conference paper).


Dependencies

  • Python>=3.7

For pre-trained language model probing:

  • transformers
  • PyTorch
  • scikit-learn

For image synthesis:

  • torchvision
  • kornia
  • CLIP
  • taming-transformers

For object detection and vision-language model:

  • scene_graph_benchmark (VinVL)
  • Oscar
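
For a quick start, the core probing dependencies can be installed with pip (versions are not pinned by this README, so treat these commands as a sketch); taming-transformers, scene_graph_benchmark, and Oscar are installed from their own repositories:

pip install torch torchvision transformers scikit-learn kornia
pip install git+https://github.com/openai/CLIP.git
git clone https://github.com/CompVis/taming-transformers
git clone https://github.com/microsoft/scene_graph_benchmark
git clone https://github.com/microsoft/Oscar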

Data

Our datasets are in the data/ folder.

Size/Height: The objects, text prompts, questions, and labels are in data.json. An additional pickle file contains the objects grouped into levels.

PosRel: The objects, text prompts, and labels for probing are in data.json. The questions and answers are in data_qa.json.
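
A minimal sketch for inspecting a subtask's data, assuming one data.json per subtask under data/ (inferred from the directory layout used elsewhere in this README; field names are not assumed):

import json

# Load one subtask's data (path is an assumption: data/<subtask>/data.json)
with open("data/size/data.json") as f:
    data = json.load(f)

# Peek at the top-level structure without assuming field names
print(type(data))
first = data[0] if isinstance(data, list) else next(iter(data.items()))
print(first)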

Code

The code is in the code/ folder.

Image Synthesis

The image synthesis code is adapted from code by Ryan Murdoch (@advadnoun on Twitter).

python image_synthesis.py

The variables clip_path and taming_path in the script must be set before running.
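
For example (placeholder paths; point them at your local checkouts):

# At the top of image_synthesis.py -- placeholder values, adjust to your setup
clip_path = "/path/to/CLIP"
taming_path = "/path/to/taming-transformers"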

Images are generated in data/{size, height, posrel}/images, where {size, height, posrel} stands for the name of the current subtask.

Object Detection

Scene_graph_benchmark (VinVL) does not directly provide code for running object detection on custom images.

We therefore first modify scene_graph_benchmark/tools/mini_tsv/tsv_demo.py to generate TSV files for our image directory, and then run:

python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "tools/mini_tsv/{size, height, posrel}" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True
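
A minimal sketch of that tsv_demo.py modification, assuming the two-column (image id, base64-encoded image) row layout of the repository's demo; verify against the actual tsv_demo.py before use:

import base64
import os

# Write one (image_id, base64 image) row per image in a subtask directory
img_dir = "data/size/images"  # or height/posrel
os.makedirs("tools/mini_tsv/size", exist_ok=True)
with open("tools/mini_tsv/size/train.tsv", "w") as out:
    for name in sorted(os.listdir(img_dir)):
        with open(os.path.join(img_dir, name), "rb") as img:
            encoded = base64.b64encode(img.read()).decode("utf-8")
        out.write(f"{os.path.splitext(name)[0]}\t{encoded}\n")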

The object detection results are written to predictions.tsv, and the features of the bounding boxes to feature.tsv.
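
To consume the detections downstream, a hedged reader sketch (the row layout, an image key plus a JSON payload of objects, is an assumption about VinVL's output format):

import json

with open("predictions.tsv") as f:
    for line in f:
        key, payload = line.rstrip("\n").split("\t", 1)
        objects = json.loads(payload).get("objects", [])
        # Each object is expected to carry at least a class label and a confidence
        print(key, [(obj.get("class"), obj.get("conf")) for obj in objects])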

Probing Spatial Commonsense

  1. (For Size/Height) Make the depth prediction for each image:
python depth_prediction.py
  2. Probe the image synthesis model with bounding boxes in the images:
python image_probing_box.py

Solving Natural Language Questions

Reasoning based on the generated images:

  1. Generate the files required by Oscar+:
python build_oscar_data.py

Create directories {size, height, posrel} under Oscar/vinvl/datasets, and place oscar_data.json and feats.pt under the corresponding directory.

  2. Place run_vqa.py in Oscar/oscar, and run:
python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir vinvl/datasets/{size, height, posrel}/  --model_type bert --model_name_or_path best/best  --task_name vqa_text --do_train --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 256 --per_gpu_train_batch_size 32 --learning_rate 5e-05 --num_train_epochs 25 --output_dir results --label_file vinvl/datasets/vqa/vqa/trainval_ans2label.pkl --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce --img_feat_format pt --classifier linear --cls_hidden_scale 3 --txt_data_dir vinvl/datasets/{size, height, posrel}
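
Before launching the run above, it can help to sanity-check the files produced in step 1; a minimal sketch (size subtask as an example; the internal structure of the files is not documented here, so only types and lengths are printed):

import json
import torch

feats = torch.load("Oscar/vinvl/datasets/size/feats.pt")
with open("Oscar/vinvl/datasets/size/oscar_data.json") as f:
    examples = json.load(f)

print(type(feats))
print(len(examples) if hasattr(examples, "__len__") else examples)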

Citation

Please cite our paper if this repository inspires your work.

@inproceedings{liu2022things,
  title={Things not Written in Text: Exploring Spatial Commonsense from Visual Signals},
  author={Liu, Xiao and Yin, Da and Feng, Yansong and Zhao, Dongyan},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2365--2376},
  year={2022}
}

Contact

If you have any questions regarding the code, please create an issue or contact the owner of this repository.