

TARA: There’s a Time and Place for Reasoning Beyond the Image

[Example images] Can you tell the time and location when the images were taken?

In this work, we identify and formulate the problem of spatio-temporal grounding of images: identifying the time and location at which a given image was taken. Specifically, we develop TARA (Time and plAce for Reasoning beyond the imAge), a challenging new dataset of 16k images with their associated news text, time, and location automatically extracted from the New York Times (NYT), plus an additional 61k examples from WIT used as distant supervision. On top of these extractions, we present a crowdsourced subset, used for evaluation, containing images whose spatio-temporal information human annotators judged to be recoverable. We show that a sizable gap remains between a state-of-the-art joint vision-language model and human performance; our proposed model, which uses segment-wise reasoning, closes this gap only slightly, motivating higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge.

In this repository, we provide the dataset for TARA: There’s a Time and Place for Reasoning Beyond the Image, accepted to ACL 2022, along with the PyTorch implementation of the baseline model variants.

Datasets

Download here. We provide the train, dev, and test sets in the input folder. In addition, we provide an HTML file to better demonstrate the data.
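
The snippet below is a minimal sketch for inspecting the downloaded splits; the file names and the JSON format are assumptions, so adjust them to whatever the input folder actually contains.

import json
import os

INPUT_DIR = "input"  # folder from the downloaded dataset

for split in ("train", "dev", "test"):
    path = os.path.join(INPUT_DIR, f"{split}.json")  # hypothetical file name
    if not os.path.exists(path):
        print(f"{split}: not found at {path}, check the actual file name")
        continue
    with open(path) as f:
        examples = json.load(f)
    print(f"{split}: {len(examples)} examples")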

Finetuned CLIP Models

We used the 'ViT-B/32' model provided in the original CLIP repo in our experiments, and used their code to load and fine-tune it. Please make sure you satisfy all the requirements listed in the original CLIP repo.

Our fine-tuned models can be found here. The fine-tuning code is in time-reasoning.
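
If you want to load one of the fine-tuned checkpoints yourself (outside eval.py), a minimal sketch is below, assuming the .pth file stores either a state dict or a pickled model; adjust the loading step if the actual checkpoint format differs.

import torch
import clip  # original OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base ViT-B/32 architecture with the original CLIP code.
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

# Load the fine-tuned weights on top. Whether the checkpoint stores a plain
# state dict, a dict with a "state_dict" key, or a pickled model is an
# assumption here; adapt as needed.
checkpoint = torch.load("finetune_segment_joint.pth", map_location=device)
if isinstance(checkpoint, dict):
    state_dict = checkpoint.get("state_dict", checkpoint)
else:
    state_dict = checkpoint.state_dict()
model.load_state_dict(state_dict)
model.eval()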

To evaluate any CLIP-based model on our dataset, use the following command:

python eval.py --clip_model_name /YOUR_PATH_TO_SAVE_FINETUNED_CLIP_MODELS/finetune_segment_joint.pth --label_name gold_time_suggest
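
To evaluate location predictions instead of time, swap the label argument; the checkpoint path is only a placeholder for whichever fine-tuned model you want to test:

python eval.py --clip_model_name /YOUR_PATH_TO_SAVE_FINETUNED_CLIP_MODELS/finetune_segment_joint.pth --label_name gold_location_suggest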

Please note that our dataset contains four kinds of labels: gold_location, gold_location_suggest, gold_time, and gold_time_suggest. In all our experiments, we use only gold_location_suggest and gold_time_suggest.

The only difference between gold_LABEL and gold_LABEL_suggest is granularity: gold_LABEL_suggest is gold_LABEL adjusted after MTurk annotation. gold_LABEL is the most precise label we extracted, e.g. 2017-5-23. During MTurk annotation, human annotators may find that they can only reason about the time down to a year rather than a specific date; in that case, gold_LABEL_suggest becomes 2017.
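
For illustration, here is a toy example of the two granularities and how they are used in evaluation; the field names come from the description above, but the concrete record structure and the location values are assumptions.

# Illustrative record; field names are from the dataset description above,
# the location values are hypothetical.
example = {
    "gold_time": "2017-5-23",          # most precise extracted time
    "gold_time_suggest": "2017",       # granularity annotators judged recoverable from the image
    "gold_location": "New York, USA",  # hypothetical value
    "gold_location_suggest": "USA",    # hypothetical value
}

# In our experiments, predictions are compared against the *_suggest labels.
prediction = "2017"
print(prediction == example["gold_time_suggest"])  # True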

Citation

@inproceedings{FZCVR22,
    author = {Xingyu Fu and Ben Zhou and Ishaan Preetam Chandratreya and Carl Vondrick and Dan Roth},
    title = {{There’s a Time and Place for Reasoning Beyond the Image}},
    booktitle = {Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL)},
    year = {2022},
    url = "https://cogcomp.seas.upenn.edu/papers/paper-to-come.pdf",
    funding = {KAIROS},
}