ViLMA

ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models [website] [arxiv]

Introduction

This repository contains all necessary information for the ViLMA benchmark, including data setup, model setup and execution, and evaluation procedures. ViLMA (Video Language Model Assessment) presents a comprehensive benchmark for Video-Language Models (VidLMs) to evaluate their linguistic and temporal grounding capabilities in five dimensions: action counting, situation awareness, change of state, rare actions and spatial relations. ViLMA also includes a two stage evaluation procedure as (i) proficiency test (P) that assesses fundamental capabilities deemed essential before solving the five tests, (ii) main test (T) which evaluates the model under the proposed five diverse tests, and (iii) a combined score of these two tasks (P+T).

Environment Setup

Execute the following steps,

git clone git@github.com:ilkerkesen/vilma.git  # clone this repo.
cd vilma
conda create -n vilma --file spec-file  # create the environment.
pip install -e .  # install the codebase as an editable package.

Data Resources

Our benchmark is built upon several different video resources. In this section, we share the details on how to setup these data resources.

QUVA Repetition Dataset

This dataset is required to run the experiments for the repetitive action counting task. Use this link to download the data. You can run the following command sequence to setup this data resource,

wget -c http://isis-data.science.uva.nl/tomrunia/QUVARepetitionDataset.tar.gz
tar -xvf QUVARepetitionDataset.tar.gz -C /path/to/extracted/
./bin/normalize_fps /path/to/quva/videos /path/to/quva/normalized_videos
python tasks/counting/normalize_annotations.py --data-dir /path/to/quva

This snippet first downloads the data, extracts it to somewhere, then performs FPS normalization on the data. For more details check the counting documentation.

Something-Something V2 Dataset

This dataset is required to run the experiments for the relation task. Follow the instruction reported to this link to download the data.

YouCook2 Dataset

This dataset is required to run the experiments for the change-state task. Authors only provide the annotations and do not distribute the videos. We have selected a subset of the dataset annotations limiting ourselves to items for which videos are still available on YouTube (last checked 06.04.2023). Videos can be downloaded using this script.

COIN Dataset

This dataset is required to run the experiments for the change-state task. Authors only store the urls of videos and their annotations in JSON format. We have selected a subset of their annotations for which videos are still availble (last checked 06.04.2023). Videos can be downloaded using this script.

RareAct Dataset

This dataset is required to run the experiments for the change-state task. Videos are provided by the original authors and can be downloaded in a single zipped file via the following link. The video names are the YouTube ids of the videos.

STAR Dataset

This dataset is required to run the experiments for the change-state task. Raw videos are not provided by the authors. We have selected a subset of their annotations for which videos are still availble (last checked 06.04.2023). Videos can be downloaded using this script.

VidSitu Dataset

This dataset is required to run the experiments for the Semantic Role Labelling task. Raw videos are not provided by the authors. We have selected a subset of their annotations for which videos are still availble (last checked 16.05.2023). Videos can be downloaded using this script.

Models

We share the details of each model in a seperate documentation file under ./models/{modality}/ directory. We have implemented the following models so far,

Evaluation

Each adapted model produces a result annotation file which contains the predicted scores. We use a script to evaluate these result annotation files.

Result Annotation File Format

Here is a dummy results annotation file,

{
    "0": {
        "scores": [
            0.9998931884765625,
            0.9999251365661621
        ]
    },
    "1": {
        "scores": [
            0.9999557733535767,
            0.999962568283081
        ]
    }
}

These JSON files are actually key/value stores where the keys are the example ids (same with the data annotations). Each value is a dict which has the scores key. The value of this scores key represent the scores for each text input for the corresponding example. The first value always belongs to the true caption score. The remaining values are the scores for the foils in the same order as in the data annotation file. Note that, these scores could be either perplexity values, similarity scores or video-text matching probabilities, and this completely depends on the model.

Running the Evaluation Script

Please do run the ./bin/eval.py script to evaluate the models. It takes one argument which is the file path, and one option which specifies whether the model produces probabilities or scores (TODO: implement for the perplexity scores also as well). Here is an example,

python ./bin/eval.py /path/to/the/results/file.json --mode {similarity,probability,perplexity}

Passing --mode probability option makes the script treat the scores as probabilities, and allows user to produce the scores for the accuracy, precision and AUROC metrics. Similarly, passing --mode perplexity forces script to work with perplexity values.

Citation

If you find ViLMA beneficial for your research, please cite it,

@misc{kesen2023vilma,
    title={ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models},
    author={Ilker Kesen and Andrea Pedrotti and Mustafa Dogan and Michele Cafagna and Emre Can Acikgoz and Letitia Parcalabescu and Iacer Calixto and Anette Frank and Albert Gatt and Aykut Erdem and Erkut Erdem},
    year={2023},
    eprint={2311.07022},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

bpiyush/ViLMA