PyTorch code for the Findings of EMNLP 2021 paper "Does Vision-and-Language Pretraining Improve Lexical Grounding?" (Tian Yun, Chen Sun, and Ellie Pavlick).
If you find this project useful, please cite our paper:
@misc{yun2021does,
      title={Does Vision-and-Language Pretraining Improve Lexical Grounding?},
      author={Tian Yun and Chen Sun and Ellie Pavlick},
      year={2021},
      eprint={2109.10246},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Outline
pip install -r requirements.txt
Requires Python 3.6+ (to support Hugging Face transformers).
In this section (corresponding to Section 4.1 of the paper), we explore whether VL pretraining yields gains on an extrinsic task that does not explicitly require representing non-text inputs, but intuitively requires physical commonsense knowledge.
Download PIQA:
mkdir -p data/piqa
wget https://yonatanbisk.com/piqa/data/train.jsonl -P data/piqa
wget https://yonatanbisk.com/piqa/data/train-labels.lst -P data/piqa
wget https://yonatanbisk.com/piqa/data/valid.jsonl -P data/piqa
wget https://yonatanbisk.com/piqa/data/valid-labels.lst -P data/piqa
wget https://yonatanbisk.com/piqa/data/tests.jsonl -P data/piqa
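Each downloaded .jsonl line pairs a goal with two candidate solutions, and the .lst files hold the gold labels. A minimal loader sketch (the helper name is ours, not part of this repo):

```python
import json

def load_piqa(jsonl_path, labels_path=None):
    """Load PIQA examples; optionally attach gold labels (0 -> sol1, 1 -> sol2)."""
    with open(jsonl_path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    if labels_path is not None:
        with open(labels_path) as f:
            labels = [int(line) for line in f if line.strip()]
        for example, label in zip(examples, labels):
            example["label"] = label
    return examples
```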
We precompute the sentence embeddings to speed up the probing experiments.
# Available `embedder` are:
# - BERT
# - VideoBERT_randmask_text
# - VideoBERT_randmask_vt
# - VideoBERT_topmask_text
# - VideoBERT_topmask_vt
# - VisualBERT_text
# - VisualBERT_vt
bash scripts/piqa/precompute_sentence_embedding.sh -e [embedder]
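Under the hood, precomputation amounts to encoding every sentence once and caching the result, so the probes never re-run the encoder. A sketch of that pattern (the function name and `embed_fn` are illustrative, not this repo's API):

```python
import torch

def precompute_embeddings(sentences, embed_fn, out_path, batch_size=32):
    """Encode sentences in batches and cache the stacked embeddings to disk.

    `embed_fn` maps a list of strings to a (batch, dim) tensor; in this repo
    it would wrap a BERT/VideoBERT/VisualBERT forward pass (illustrative here).
    """
    chunks = []
    with torch.no_grad():
        for start in range(0, len(sentences), batch_size):
            chunks.append(embed_fn(sentences[start:start + batch_size]))
    embeddings = torch.cat(chunks, dim=0)
    torch.save(embeddings, out_path)
    return embeddings
```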
We measure the quality of the representations with three probing heads: a linear probe, an MLP probe, and a Transformer probe. The Transformer probe finetunes the last transformer encoder layer together with a linear layer on top of it.
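The two frozen-feature heads can be sketched as follows (the builder name and dimensions are our assumptions, not this repo's exact code):

```python
import torch.nn as nn

def build_probe(cls_type, in_dim=768, hidden_dim=256, num_classes=2):
    """Probe heads trained on top of frozen sentence embeddings."""
    if cls_type == "linear":
        return nn.Linear(in_dim, num_classes)
    if cls_type == "mlp":
        return nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),
        )
    raise ValueError(f"unknown cls_type: {cls_type}")
```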
# For probing experiments of "linear/MLP" probes in the paper
# `cls_type` can either be `linear` or `mlp`.
bash scripts/piqa/piqa_probing.sh -e [embedder] -c [cls_type]
# For probing experiment of "transformer" probe in the paper
bash scripts/piqa/piqa_transformer_probing.sh -e [embedder]
By default, both commands run each experiment 5 times and log the averaged performance metric; you can modify num_runs in the scripts to control the number of runs. Logs are written to logs/piqa/, and outputs (i.e., predictions on the PIQA validation set) are written to outputs/piqa/.
This section corresponds to Section 4.3 of the paper, where we explore whether multimodal pretraining impacts conceptual structure at the lexical level. To investigate this, we focus on adjective-noun composition, which provides a simple way of defining a space of visually groundable objects and properties that we expect conceptual representations to encode.
We pick WikiHow, a dataset of step-by-step instructions for everyday tasks. We first split the instructions into single sentences, and then run a bigram search over all sentences to extract adjective-noun pairs.
We also use the "visually groundable" adjectives in the MIT States dataset as our adjective filter.
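The bigram search itself reduces to scanning adjacent token pairs and keeping those whose first token passes the adjective filter. A sketch (the function and the word sets are illustrative; the real pipeline derives them from MIT States and a POS tagger):

```python
def extract_adj_noun_pairs(tokens, adjectives, nouns):
    """Return (adjective, noun) bigrams found in a tokenized sentence.

    `adjectives` and `nouns` are plain sets here for illustration; in the
    real pipeline the adjective set comes from the MIT States filter.
    """
    pairs = []
    for first, second in zip(tokens, tokens[1:]):
        if first.lower() in adjectives and second.lower() in nouns:
            pairs.append((first.lower(), second.lower()))
    return pairs
```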
- Download WikiHow from this link:
# Download WikiHow
mkdir -p data/wikiHow
mv wikihowAll.csv data/wikiHow

# Preprocess WikiHow
python3 -m vlm_lexical_grounding.adj_noun_composition.wikihow_preprocess
- Download MIT States data:
mkdir -p data/mit_states
wget http://wednesday.csail.mit.edu/joseph_result/state_and_transformation/release_dataset.zip -P data/mit_states
unzip data/mit_states/release_dataset.zip -d data/mit_states
We first find adjective-noun pairs, and then precompute the noun representations used by the K-Means clustering and adjective probing experiments. This step is required before running either experiment.
# Find `adjective noun` candidate pairs
bash scripts/adj_noun_composition/general_statistics.sh
# Precompute noun embeddings
# Available `embedder` are:
# - BERT
# - VideoBERT_randmask_text
# - VideoBERT_randmask_vt
# - VideoBERT_topmask_text
# - VideoBERT_topmask_vt
# - VisualBERT_text
# - VisualBERT_vt
bash scripts/adj_noun_composition/get_target_embs.sh -e [embedder]
We use K-Means to cluster the representations of each noun, with K equal to the number of unique adjectives that modify the noun in our dataset.
bash scripts/adj_noun_composition/kmeans_clustering.sh -e [embedder]
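The per-noun clustering step can be sketched with scikit-learn as follows (the helper name is ours; the repo's script may differ in details):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_noun_occurrences(noun_embeddings, modifying_adjectives, seed=0):
    """Cluster one noun's occurrence embeddings with K = #unique adjectives."""
    k = len(set(modifying_adjectives))
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return kmeans.fit_predict(np.asarray(noun_embeddings))
```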
We evaluate how much adjective information is linearly encoded in the noun representations.
bash scripts/adj_noun_composition/adjective_probing.sh -e [embedder]
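Concretely, this is a multi-class linear classifier predicting the modifying adjective from the frozen noun embedding; a scikit-learn sketch (illustrative, not the repo's training loop):

```python
from sklearn.linear_model import LogisticRegression

def adjective_probe_accuracy(train_embs, train_adjs, test_embs, test_adjs):
    """Fit a linear probe on noun embeddings and report adjective accuracy."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_embs, train_adjs)
    return probe.score(test_embs, test_adjs)
```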
After downloading the zip files, move them to models/ and unzip them:
mv *.zip models/
unzip *.zip
We thank the reviewers and Liunian (Harold) Li for their helpful discussions. Part of the code is built on top of Hugging Face transformers.