/what-is-where-by-looking

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Primary LanguageJupyter Notebook

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs - NIPS 22'

Get Started

  • datasets format are followed by link

For training a model :

python train_grounding.py -bs 32 -nW 8 -nW_eval 1 -task vg_train -data_path /path_to/vg -val_path /path_to/flicker
python train_grounding.py -bs 32 -nW 8 -nW_eval 1 -task coco -data_path /path_to/coco -val_path /path_to/flicker

For Grounding evaluation with our model [XX is the number of the results folder i.e 'gpu22' - XX == 22]:

python inference_grounding.py -task grounding -dataset refit -val_path /path_to/RefIt -Isize 224 -clip_eval 0 -path_ae XX -nW 1
python inference_grounding.py -task grounding -dataset flicker -val_path /path_to/flicker -Isize 224 -clip_eval 0 -path_ae XX -nW 1
python inference_grounding.py -task grounding -dataset vg -val_path /path_to/VG -Isize 224 -clip_eval 0 -path_ae XX -nW 1

For Grounding evaluation with CLIP model:

python inference_grounding.py -task grounding -dataset refit -val_path /path_to/RefIt -Isize 224 -clip_eval 1 -nW 1
python inference_grounding.py -task grounding -dataset flicker -val_path /path_to/flicker -Isize 224 -clip_eval 1 -nW 1
python inference_grounding.py -task grounding -dataset vg -val_path /path_to/VG -Isize 224 -clip_eval 1 -nW 1

For WWbL evaluation with our model:

python inference_grounding.py -task app -dataset refit -val_path /path_to/RefIt -Isize 224 -clip_eval 0 -path_ae XX -nW 1 --start 0 --end 9983
python wwbl_algo1_point_metric.py -nW 1 -predictions_path YY -val_path /path_to/RefIt --dataset refit

python inference_grounding.py -task app -dataset flicker -val_path /path_to/flicker -Isize 224 -clip_eval 0 -path_ae XX -nW 1 -start 0 -end 1000
python wwbl_algo1_point_metric.py -nW 1 -predictions_path YY -val_path /path_to/flicker --dataset flicker

python inference_grounding.py -task app -dataset vg -val_path /path_to/VG -Isize 224 -clip_eval 0 -path_ae XX -nW 1 -start 0 -end 17478
python wwbl_algo1_point_metric.py -nW 1 -predictions_path YY -val_path /path_to/VG --dataset VG

Phrase Grounding Results - Point Accuracy Metric

COCO weights

VG weights

Method Backbone VG(VGtrained/COCO) Flicker(VGtrained/COCO) ReferIt(VGtrained/COCO)
Baseline Random 11.15 27.24 24.30
Baseline Center 20.55 47.40 30.30
GAE CLIP 54.72 72.47 56.76
FCVC VGG -/14.03 -/29.03 -/33.52
VGLS VGG 24.40/- -/- -/-
TD Inception-2 19.31/- 42.40/- 31.97/-
SSS VGG 30.03/- 49.10/- 39.98/-
MG BiLSTM+VGG 50.18/46.99 57.91/53.29 62.76/47.89
MG ELMo+VGG 48.76/47.94 60.08/61.66 60.01/47.52
GbS VGG 53.40/52.00 70.48/72.60 59.44/56.10
ours CLIP+VGG 62.31/59.09 75.63/75.43 65.95/61.03