
Perception Test: A Diagnostic Benchmark for Multimodal Video Models

News

Join the first Perception Test challenge, organised as an ICCV 2023 workshop; see the website at ptchallenge-workshop.github.io.

| Resource | Status |
| --- | --- |
| Download data | Links (see below) |
| Evaluation scripts (including data loader, dummy baseline, evaluation metrics) | Object tracking available; multi-choice vQA, point tracking, action/sound localisation, and grounded vQA coming soon |
| Evaluation server | Coming soon |
| Leaderboard | Coming soon |

Overview

Perception Test: A Diagnostic Benchmark for Multimodal Video Models is a benchmark designed to comprehensively evaluate the perception and reasoning skills of multimodal video models. The dataset introduces real-world videos showing perceptually interesting situations and defines multiple tasks (object and point tracking, action and sound localisation, multiple-choice and grounded video question-answering) that require understanding of memory, abstract patterns, physics, and semantics across the visual, audio, and text modalities.

In this repository, you will find:

  • A summary of the Perception Test and the associated challenge
  • A detailed description of the data and annotations in the Perception Test (interactive demo notebook here)
  • Details about how to download the data and annotations in the Perception Test (download section here)
  • Metrics for evaluating the performance on the different tasks
  • Dummy baselines showcasing how to evaluate models on each of the tasks

5-minute summary of the Perception Test

Perception Test Overview Presentation

Try the Perception Test for yourself by accessing this quiz.

For more example videos in the Perception Test, check out this playlist.

Download the data and annotations

The Perception Test dataset can be downloaded as zip files containing:

  • annotations in JSON format
  • videos (including audio) as MP4 files
  • audio-only files in WAV format
  • pre-computed features for the action localisation and sound localisation tasks.

Links

| Task | Split | Videos | Audio | Labels |
| --- | --- | --- | --- | --- |
| Sample | All | sample_videos.zip (214.9MB) | sample_audios.zip (83.9MB) | sample_annotations.zip (3MB) |
| All tasks | Train | train_videos.zip (26.5GB) | train_audios.zip (12.3GB) | train_annotations.zip (30.6MB) |
| All tasks | Valid | valid_videos.zip (70.2GB) | valid_audios.zip (33.1GB) | valid_annotations.zip (81.5MB) |
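
As a quick orientation, the snippet below sketches how one of the zip files above could be fetched and extracted with the Python standard library. The URL is a placeholder, not the actual hosting link; substitute the link from the table.

```python
# Sketch: download and extract one of the zip files listed above.
# The URL below is a placeholder -- substitute the actual download
# link from the table; file names match those listed.
import urllib.request
import zipfile

ZIP_URL = "https://example.com/sample_annotations.zip"  # placeholder URL
ZIP_PATH = "sample_annotations.zip"

urllib.request.urlretrieve(ZIP_URL, ZIP_PATH)
with zipfile.ZipFile(ZIP_PATH) as zf:
    zf.extractall("data/")
```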

Baselines

In this repo we provide dummy baselines that demonstrate how to load the data, run the evaluation, and recreate some of the baseline results from the paper. For the remaining results in the paper's baselines section, we will add a link to an external repo.

| Computational task | Baseline |
| --- | --- |
| Object tracking | Static baseline |
| Point tracking | Static baseline (available soon) |
| Multi-choice vQA | Frequency baseline (available soon) |
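
To illustrate the idea of a static tracking baseline, here is a minimal sketch: it simply repeats the initial tracking box for every annotated frame. It assumes a `track` dict shaped like the object-track schema described under "Perception Test annotations" below; the official baseline implementation may differ.

```python
# Sketch of a static object-tracking baseline: predict the initial
# tracking box for every annotated frame. Assumes `track` follows the
# object-track annotation schema described later in this README.
def static_track(track):
    # Index of the box flagged as the starting box (one-hot vector).
    start = track["initial_tracking_box"].index(1)
    init_box = track["bounding_boxes"][start]
    # Predict the same box for every annotated frame.
    return [init_box for _ in track["frame_ids"]]
```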

Metrics

| Computational task | Metric |
| --- | --- |
| Object tracking | mean IoU |
| Point tracking | Jaccard |
| Temporal action localisation | mean Average Precision |
| Temporal sound localisation | mean Average Precision |
| Multi-choice vQA | top-1 accuracy |
| Grounded vQA | HOTA |

Code for evaluating performance on the different tasks is coming soon; in the meantime, a sketch of the IoU-based object-tracking metric is shown below.
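
The following sketch computes mean intersection-over-union between predicted and ground-truth boxes, assuming an `(x1, y1, x2, y2)` box format; the box format and aggregation used by the official evaluation scripts may differ.

```python
# Sketch of the object-tracking metric: mean intersection-over-union
# between predicted and ground-truth boxes in (x1, y1, x2, y2) format.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    # Average IoU over all (prediction, ground truth) pairs of a track.
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```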

Perception Test annotations

Explore the annotations: data_visualisation.ipynb

Summary

| Annotation type | Number of videos | Number of annotations |
| --- | --- | --- |
| Object tracks | 11,609 | 189,940 |
| Point tracks | 145 | 8,647 |
| Action segments | 11,353 | 73,503 |
| Sound segments | 11,433 | 137,128 |
| Multi-choice vQA | 10,361 | 38,060 |
| Grounded vQA | 3,063 | 6,086 |

Video metadata

| Field Name | Description |
| --- | --- |
| split | The data split the video belongs to |
| video_id | The ID of the video |
| frame_rate | The frame rate of the video, in frames per second |
| num_frames | The total number of frames in the video |
| resolution | The height and width of the video, in pixels |
| audio_samples | The total number of audio samples in the video |
| audio_sample_rate | The sample rate of the audio, in Hz |
| is_cup_game | Whether the video shows a cups game |
| is_camera_moving | Whether the camera filming the video is moving |
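
The sketch below shows how the annotations could be loaded and a video's metadata inspected. Both the file path and the nesting (a dict keyed by video ID, with a `metadata` entry holding the fields above) are assumptions here; check the demo notebook for the exact structure.

```python
# Sketch: load the annotations and inspect one video's metadata.
# The file path and the top-level layout (dict keyed by video_id,
# metadata under a "metadata" key) are assumptions -- see the demo
# notebook for the authoritative structure.
import json

with open("data/sample_annotations/all_valid.json") as f:  # hypothetical path
    annotations = json.load(f)

video = next(iter(annotations.values()))
meta = video["metadata"]  # assumed key holding the fields above
print(meta["video_id"], meta["frame_rate"], meta["resolution"])
```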

Object tracks

| Field Name | Description |
| --- | --- |
| task_id | A unique annotation ID for each object track |
| label | The name of the object; can also contain object attributes |
| is_occluder | Whether the object occludes other objects in the video |
| bounding_boxes | The coordinates of the object's bounding boxes (collected at 1fps) |
| initial_tracking_box | One-hot vector indicating which box annotation should be used to start tracking |
| frame_ids | The IDs of the annotated frames |
| timestamps | The timestamps of the annotated frames, in ms |
| is_masked | Whether the object is masked in the annotated frame |
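
Since the per-frame fields of a track are stored as parallel lists, a small helper like the one below can pair them up. This assumes a `track` dict following the schema above.

```python
# Sketch: pair each annotated frame of an object track with its box
# and timestamp. Assumes `track` follows the object-track schema above,
# with frame_ids, timestamps, and bounding_boxes as parallel lists.
def track_frames(track):
    return [
        {"frame_id": fid, "time_ms": ts, "box": box}
        for fid, ts, box in zip(
            track["frame_ids"], track["timestamps"], track["bounding_boxes"]
        )
    ]
```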

Point tracks

| Field Name | Description |
| --- | --- |
| task_id | A unique annotation ID for each point track |
| label | The label of the point track |
| parent_objects | The track_id of the object that the point belongs to |
| frame_ids | The IDs of the annotated frames |
| points | The coordinates of the points (collected at 30fps) |

Action segments

| Field Name | Description |
| --- | --- |
| task_id | A unique annotation ID for each action segment |
| label | The templated class of the action segment |
| parent_objects | The task_ids of the objects involved in the action |
| timestamps | The start and end timestamps of the action segment |
| frame_ids | The start and end frame IDs of the action segment |
| label_id | A unique class ID for each label in the dataset |
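
Segments like these are typically matched to predictions via temporal overlap. The sketch below computes temporal IoU between two `(start, end)` segments, the overlap measure commonly underlying mean-Average-Precision for localisation tasks; the official evaluation may use different thresholds or matching rules.

```python
# Sketch: temporal IoU between two segments given as (start, end)
# timestamps, the overlap measure commonly used when computing mean
# Average Precision for action/sound localisation.
def temporal_iou(seg_a, seg_b):
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```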

Sound segments

| Field Name | Description |
| --- | --- |
| id | A unique annotation ID for each sound segment |
| label | The name or class of the sound segment |
| parent_objects | The object task_ids related to this sound segment |
| timestamps | The start and end timestamps of the sound segment |
| frame_ids | The start and end frame IDs of the sound segment |
| is_visible | Whether the objects causing the sound in this segment are visible |
| label_id | A unique class ID for each label in the dataset |

Multi-choice video question-answers

| Field Name | Description |
| --- | --- |
| task_id | A unique annotation ID for each question |
| question | The text of the question |
| options | The 3 possible options for the question; only 1 is correct |
| answer_id | The ID of the correct option for the question |
| area | The skill area the question pertains to |
| reasoning | The type of reasoning required to answer the question |
| tags | The different skills involved in answering the given question |
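
To make the multi-choice task concrete, the sketch below shows the top-1 accuracy metric together with a trivial frequency baseline that always predicts the option index most often correct in training questions. This mirrors the "Frequency baseline" named in the Baselines table, but is our own minimal reading of it, not the official implementation.

```python
# Sketch of the multi-choice vQA metric (top-1 accuracy) and a trivial
# frequency baseline predicting the option index that is most often
# correct in the training questions.
from collections import Counter

def top1_accuracy(pred_ids, answer_ids):
    correct = sum(p == a for p, a in zip(pred_ids, answer_ids))
    return correct / len(answer_ids)

def frequency_baseline(train_answer_ids, num_test_questions):
    most_common = Counter(train_answer_ids).most_common(1)[0][0]
    return [most_common] * num_test_questions
```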

Grounded video question-answers

| Field Name | Description |
| --- | --- |
| task_id | A unique annotation ID for each question |
| question | The text of the question |
| answers | The answer to the question, given as a list of object track_ids (corresponding to object tracks) |
| area | The skill area the question pertains to |
| reasoning | The type of reasoning required to answer the question |

Feedback and support

If you have any questions, feedback, or require support regarding the Perception Test dataset or challenge, please contact us at perception-test@google.com.

Citing this work

@misc{patraucean2023perception,
      title={Perception Test: A Diagnostic Benchmark for Multimodal Video Models}, 
      author={Viorica Pătrăucean and Lucas Smaira and Ankush Gupta and Adrià Recasens Continente and Larisa Markeeva and Dylan Banarse and Skanda Koppula and Joseph Heyward and Mateusz Malinowski and Yi Yang and Carl Doersch and Tatiana Matejovicova and Yury Sulsky and Antoine Miech and Alex Frechette and Hanna Klimczak and Raphael Koster and Junlin Zhang and Stephanie Winkler and Yusuf Aytar and Simon Osindero and Dima Damen and Andrew Zisserman and João Carreira},
      year={2023},
      eprint={2305.13786},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License and disclaimer

Copyright 2022 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.