VCR-shortcut-effects-study

Introduction
Prerequisites
How simple to beat the random baseline
Validation data for verifying the shortcuts
- Rule-based modification
- Adversarial modification
  - Score the effect of removing tokens
  - Generate adversarial settings
Qualitative examples of the validation data
Model training
Our paper

Introduction

Code and data for our AAAI 2021 paper "A Case Study of the Shortcut Effects in Visual Commonsense Reasoning".

Prerequisites

Python 3.6.4
Python packages: pip install -r requirements.txt
Download the VCR annotations and unzip them to data/vcr1annots
Download the VCR images zip file and place it at data/vcr1images.zip

How simple to beat the random baseline

Please refer to our cheat.py. We count the overlap between the question and the answer options and achieve 29% accuracy (vs. 25% for random guessing). Moreover, we found that more than 65% of the correct options have the most overlapping tags. This means that the connection between the question and the answer options provides a strong hint about the correct answer. Thus, a model could exploit this connection to answer questions without performing true reasoning.
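
The overlap heuristic described above can be sketched as follows. This is a minimal illustration, not the contents of cheat.py: it assumes VCR-style token lists in which a tag is a nested list of object indices (for example [2] or [1, 2]), and the helper names person_tags and guess_by_tag_overlap, the tie-breaking rule, and the toy question and options are all hypothetical.

```python
from typing import List, Set, Union

# Assumption: a token is either a word or a list of object indices (a tag).
Token = Union[str, List[int]]


def person_tags(tokens: List[Token]) -> Set[int]:
    """Collect the object indices referenced by tag tokens such as [2] or [1, 2]."""
    tags: Set[int] = set()
    for token in tokens:
        if isinstance(token, list):
            tags.update(token)
    return tags


def guess_by_tag_overlap(question: List[Token], options: List[List[Token]]) -> int:
    """Return the index of the option sharing the most tags with the question.

    Ties resolve to the earliest option (an arbitrary choice for this sketch).
    """
    question_tags = person_tags(question)
    overlaps = [len(question_tags & person_tags(option)) for option in options]
    return overlaps.index(max(overlaps))


# Toy example (hypothetical data, loosely modeled on the qualitative examples).
question = ["Where", "is", [2], "going", "?"]
options = [
    [[1], "is", "going", "to", "the", "bathroom", "."],
    [[2], "is", "going", "into", "the", "store", "."],
    ["He", "is", "leaving", "the", "room", "."],
    [[3], "is", "getting", "into", "a", "carriage", "."],
]
print(guess_by_tag_overlap(question, options))
```

Only the second option mentions the tag that appears in the question, so the heuristic selects it without looking at the image or the meaning of the sentence.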

Validation data for verifying the shortcuts

We provide validation data for verifying the shortcut effects. Please use the links below to download the data. The methodology for generating these data is described in the section "Methods to Evaluate the Shortcut Effects" of our paper. We also provide more details below.

	Setting (link)	Count	Used in
Rule-based Modified	Rule-Singular	16,154	Paper Table 3, Row 2
	Rule-Plural	3,657	Paper Table 3, Row 3
Adversarially Modified	AdvTop-1	26,534	Paper Table 4, Column 4
	KeepTop-1	26,534	Paper Table 4, Column 5
	KeepTop-3	26,534	Paper Table 4, Column 6
	KeepTop-5	26,534	Paper Table 4, Column 7

Rule-based modification

This method measures how much different methods rely on person-tag shortcuts. Please refer to the code under the "tools" directory for details. rephrase_choice_singular.py and rephrase_choice_plural.py generate the Rule-Singular and Rule-Plural validation data, respectively. To run them with default arguments, type python rephrase_choice_singular.py and python rephrase_choice_plural.py.

Adversarial modification

This method searches for the token whose masking maximizes the cross-entropy loss.
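
As a rough illustration of this idea, the sketch below masks one token at a time and keeps the position that yields the highest loss. It is not the implementation in shortcut_main.py: the function most_damaging_token, the stand-in toy_loss, and its probabilities are hypothetical, and a real run would compute the loss with the trained answering or rationale model.

```python
import math
from typing import Callable, List, Tuple

MASK = "[MASK]"


def most_damaging_token(
    tokens: List[str], loss_fn: Callable[[List[str]], float]
) -> Tuple[int, float]:
    """Return (position, loss) for the single token whose masking maximizes the loss."""
    best_pos, best_loss = -1, float("-inf")
    for pos in range(len(tokens)):
        # Replace exactly one token with the mask and re-score the sequence.
        masked = tokens[:pos] + [MASK] + tokens[pos + 1:]
        loss = loss_fn(masked)
        if loss > best_loss:
            best_pos, best_loss = pos, loss
    return best_pos, best_loss


def toy_loss(tokens: List[str]) -> float:
    """Stand-in for a model: the correct label gets probability 0.9 while the
    word "store" is visible and 0.3 once it is masked (made-up numbers)."""
    prob_correct = 0.9 if "store" in tokens else 0.3
    return -math.log(prob_correct)  # cross-entropy for the correct label


answer = ["[2]", "is", "going", "into", "the", "store", "."]
position, loss = most_damaging_token(answer, toy_loss)
print(position, answer[position], round(loss, 3))
```

In this toy setup, masking "store" is the only change that lowers the probability of the correct label, so that position is returned.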

We use shortcut_main.py to score the effect of removing individual tokens in the answer/rationale. Then, we use format_adversarial_annotations.py to merge the results from both the answering model and the rationale model. Finally, we use merge_adversarial_annotations.py to generate the AdvTop-1, KeepTop-1, KeepTop-3, and KeepTop-5 settings, which are used in Table 4 of our paper.

Score the effect of removing tokens

We need the answering/rationale models and the encoded TF record files to infer the effect of removing individual tokens. To get the trained models and the TF record files, please refer to Model training. After training the models, we use shortcut_main.py to score the effect of removing individual tokens in the answer/rationale. Here are two example usages:

python tools/shortcut_main.py \
  --logtostderr \
  --model_dir "PATH_TO_ANSWERING_MODEL" \
  --pipeline_proto "PATH_TO_ANSWERING_MODEL/pipeline.pbtxt" \
  --output_jsonl_file "ANSWERING_SCORING.jsonl"
  
python tools/shortcut_main.py \
  --logtostderr \
  --model_dir "PATH_TO_RATIONALE_MODEL" \
  --pipeline_proto "PATH_TO_RATIONALE_MODEL/pipeline.pbtxt" \
  --output_jsonl_file "RATIONALE_SCORING.jsonl" \
  --rationale

We then use format_adversarial_annotations.py to merge the two output jsonl files: simply run python tools/format_adversarial_annotations.py. We have uploaded the final scoring file to data/adversarial_based/shortcut_scores.jsonl.

Generate adversarial settings

Here is an example pipeline for generating the AdvTop-1 setting. We assume the original VCR validation data is located at data/vcr1annots/val.jsonl and the scoring of the shortcut effects is located at data/adversarial_based/shortcut_scores.jsonl. The following command generates the data/adversarial_based/val_adv_rmtop1.jsonl.v2 file required for the AdvTop-1 setting. The same tool can generate the other adversarial settings; run python tools/merge_adversarial_annotations.py --help to see all options.

python "tools/merge_adversarial_annotations.py" \
  --logtostderr \
  --annotations_jsonl_file "data/vcr1annots/val.jsonl" \
  --adversarial_annotations_jsonl_file "data/adversarial_based/shortcut_scores.jsonl" \
  --output_jsonl_file "data/adversarial_based/val_adv_rmtop1.jsonl.v2" \
  --name "remove_shortcut"
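
After the command finishes, a quick sanity check is to count the records in the output file. The helper below is a small sketch that assumes the output holds one JSON object per non-empty line; count_jsonl_records is a hypothetical name and is not part of the repository's tools.

```python
import json


def count_jsonl_records(path: str) -> int:
    """Count the JSON records in a JSONL file, skipping blank lines.

    Each non-empty line is parsed, so a malformed line raises an error
    instead of being counted silently.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)
                count += 1
    return count
```

For example, calling count_jsonl_records("data/adversarial_based/val_adv_rmtop1.jsonl.v2") should, if the file holds one example per line, match the count of 26,534 listed for AdvTop-1 in the table above.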

Qualitative examples of the validation data

[val-54] Where is [2] going ?

original	rule-singular	adv-rmtop1
[2] is going into the store . [2] is getting into a carriage . [1] is going to the bathroom . [1] is going outside to play after the conversation with [2] is over .	He is going into the store . [2] is getting into a carriage . [2] is going to the bathroom . [1] is going outside to play after the conversation with [2] is over .	[MASK] is going into the store . [2] is getting into a [MASK] . [MASK] is going to the bathroom . [1] is [MASK] outside to play after the conversation with [2] is over

[val-270] What are [1,2] feeling ?

original	rule-plural	adv-rmtop1
[1,2] do not like the restaurant . They are apprehensive . They are both feeling happy . [1,2] are feeling drunk .	[1,2] do not like the restaurant . [1,2] are apprehensive . They are both feeling happy . [1,2] are feeling drunk .	[1,2] do not like the [MASK] . They are apprehensive [MASK] They are [MASK] feeling happy . [1,2] are feeling [MASK] .

[val-0] How is [1] feeling ?

original	rule-singular	adv-rmtop1
[1] is feeling amused . [1] is upset and disgusted . [1] is feeling very scared . [1] is feeling uncomfortable with [3] .	[1] is feeling amused . He is upset and disgusted . [1] is feeling very scared . [1] is feeling uncomfortable with [3] .	[1] is feeling [MASK] . [MASK] is upset and disgusted . [1] is feeling very [MASK] . [1] is feeling [MASK] with [3] .

[val-22] How does [1] feel right now ?

original	rule-singular	adv-rmtop1
[1] is tired and concerned . [1] feels admonished . [2] is a bit upset . [1] is aroused and happy .	He is tired and concerned . [1] feels admonished . [1] is a bit upset . [1] is aroused and happy .	[MASK] is tired and concerned . [1] feels [MASK] . [MASK] is a bit upset . [1] is aroused and [MASK] .

[val-171] What do [2,3] appear to be doing ?

original	rule-plural	adv-rmtop1
[2,3] are opening the ambulance doors to assist someone . [2,3] are dancing . They are working as security for the building . They are hiding among [1,4] .	[2,3] are opening the ambulance doors to assist someone . They are dancing . [2,3] are working as security for the building . [2,3] are hiding among [1,4] .	[2,3] are opening the [MASK] doors to assist someone . [2,3] are [MASK] . They are working as security for the [MASK] . They are [MASK] among [1,4] .

[val-391] What is [1] planning on doing ?

original	rule-singular	adv-rmtop1
She wants to go to sleep . She is planning to get on [6] . [1] is looking to punch [2] in her arm . She is going to play a game with the other children .	[1] wants to go to sleep . She is planning to get on [6] . [1] is looking to punch [2] in her arm . [1] is going to play a game with the other children .	She wants to go to [MASK] . [MASK] is planning to get on [6] . [1] is looking to [MASK] [2] in her arm . She is going to play a game with the other [MASK] .

Model training

tbd...

Our paper

If you find this repository useful or use our data for evaluation, please cite our paper:

@InProceedings{Ye_2021_AAAI,
  author = {Ye, Keren and Kovashka, Adriana},
  title = {A Case Study of the Shortcut Effects in Visual Commonsense Reasoning},
  booktitle = {Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)},
  month = {February},
  year = {2021}
}
