VSR Benchmark Review

A review of existing VSR benchmarks from a developmental perspective.

Psychometric

| Test | Description | Intrinsic/Extrinsic | Static/Dynamic | Perceptive/Generative | 3D/2D | Labeled Skills | Dataset size | Number of sub-tests | Citation |
|---|---|---|---|---|---|---|---|---|---|
| ARC | The Abstraction and Reasoning Corpus by François Chollet. | Both | Both | Both | Mainly 2D | No | 600 items, ~1k examples | <= 600 | François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. |
| RPM+ | Raven's Progressive Matrices and similarly formatted tests such as Wechsler Matrix Reasoning | Both | Static | Primarily Perceptive | 2D | No | 60 (RPM) | 1 | Raven, J., & Raven, J. (2003). Raven Progressive Matrices. In R. S. McCallum (Ed.), Handbook of nonverbal assessment (pp. 223–237). Kluwer Academic/Plenum Publishers. https://doi.org/10.1007/978-1-4615-0153-4_11 |
| ACE Geometric Analogy | American Council on Education geometric analogy test used with Evans' ANALOGY system | Both | Static | Primarily Perceptive | 2D | No | 30 | 1 | Evans, T. G. (1964, April). A heuristic program to solve geometric-analogy problems. In Proceedings of the April 21-23, 1964, spring joint computer conference (pp. 327-338). |
| CORE | Core Knowledge of Geometry odd-one-out test | Both | Static | Perceptive | 2D | Yes | 45 | 8 | Stanislas Dehaene, Véronique Izard, Pierre Pica, and Elizabeth Spelke. Core knowledge of geometry in an Amazonian indigene group. Science, 311(5759):381–384, 2006. ISSN 0036-8075. doi: 10.1126/science.1121739. URL https://science.sciencemag.org/content/311/5759/381. |
| Leiter | The Leiter-R test of nonverbal intelligence | Both | Static | Both | Mainly 2D | Several major skill types are used as test names; many included skills are unmentioned | Hundreds | 20 | Gale H Roid and Lucy J Miller. Leiter international performance scale-revised (Leiter-R). Wood Dale, IL: Stoelting, 1997. |
| WAIS Block Design | Wechsler Adult Intelligence Scale Block Design Task | Primarily Intrinsic | Both | Both | Both | No | Dozens | 1 | Drozdick, L. W., Wahlstrom, D., Zhu, J., & Weiss, L. G. (2012). The Wechsler Adult Intelligence Scale—Fourth Edition and the Wechsler Memory Scale—Fourth Edition. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 197–223). The Guilford Press. |
| WAIS Picture Completion | Wechsler Adult Intelligence Scale Picture Completion Task | Primarily Intrinsic | Both | Both | 2D | No | Dozens | 1 | Drozdick, L. W., Wahlstrom, D., Zhu, J., & Weiss, L. G. (2012). The Wechsler Adult Intelligence Scale—Fourth Edition and the Wechsler Memory Scale—Fourth Edition. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 197–223). The Guilford Press. |

Basic Skills

| Test | Description | Intrinsic/Extrinsic | Static/Dynamic | Perceptive/Generative | 3D/2D | Labeled Skills | Dataset size | Number of sub-tests | Citation |
|---|---|---|---|---|---|---|---|---|---|
| SPARE3D | Tests focused on 2D/3D understanding | Both | Static | Both | Both | Yes | 21518 objects | 5 | Wenyu Han, Siyuan Xiang, Chenhui Liu, Ruoyu Wang, and Chen Feng. SPARE3D: A dataset for spatial reasoning on three-view line drawings. arXiv preprint arXiv:2003.14034, 2020. |
| Paper Folding | Two-dimensional drawings of "unfolded" cubes with arrows on two edges | Intrinsic | Dynamic | Both | Both | Yes | 164 | 1 | Roger N Shepard and Christine Feng. A chronometric study of mental paper folding. Cognitive Psychology, 3(2):228–243, 1972. |
| Mental Rotation | Line drawings of 10 cubes | Intrinsic | Dynamic | Both | Both | Yes | 1600 | 1 | Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971. |
| Purdue SVT | Tests that use mental rotation to measure spatial visualization skills | Intrinsic | Dynamic | Both | Both | Yes | 32 items | 1 | Guay, R. (1977). Purdue spatial visualization test: Visualization of rotations. W. Lafayette, IN: Purdue Research Foundation; Yoon, S. (2011). Revised Purdue spatial visualization test: Visualization of rotations (Revised PSVT:R). Texas A&M University, College Station, TX, 7 |

Physics Understanding and Video Prediction

| Test | Description | Intrinsic/Extrinsic | Static/Dynamic | Perceptive/Generative | 3D/2D | Labeled Skills | Dataset size | Number of sub-tests | Citation |
|---|---|---|---|---|---|---|---|---|---|
| Tools | Task involves placing objects in 2D environments to achieve goals | Both | Dynamic | Both | 2D | No | 58 items | 20 | Kelsey R Allen, Kevin A Smith, and Joshua B Tenenbaum. The tools challenge: Rapid trial-and-error learning in physical problem solving. arXiv preprint arXiv:1907.09620, 2019. |
| PhyRe | Similar to Tools, but with one or two circles available | Both | Dynamic | Both | 2D | No | 5000 tasks | 2 tiers, 25 templates per tier | Bakhtin, A., van der Maaten, L., Johnson, J., Gustafson, L., & Girshick, R. (2019). PHYRE: A new benchmark for physical reasoning. Advances in Neural Information Processing Systems (pp. 5083–5094). |
| Moving MNIST | Multiple handwritten digits move independently in videos | Both | Both | Both | 2D | No | 10k × 20 frames | 1 | Srivastava, N., Mansimov, E., & Salakhudinov, R. (2015, June). Unsupervised learning of video representations using LSTMs. In International conference on machine learning (pp. 843-852). |
| 20BN SSD | Short videos of objects and actions in the real world | Both | Both | Perceptive | Both | No | 20000000000 frames | 174 | Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 1, page 5, 2017. |
| IntPhys | Benchmark for intuitive physics reasoning | Both | Both | Both | Both | Yes | ~20k video clips | 4 conceptually pure concepts | Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616, 2018. |
| Physics 101 | Video clips of 101 objects annotated with materials and physical properties | Both | Dynamic | Both | Both | Yes | 4352 videos, 101 objects | Several | Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning physical object properties from unlabeled videos. In BMVC, volume 2, page 7, 2016. |
| Physical Bongard | Bongard problems involving physics-like properties | Both | Both | Both | 2D | No | 23 | 1 | Weitnauer, E., & Ritter, H. (2012, September). Physical Bongard problems. In IFIP international conference on artificial intelligence applications and innovations (pp. 157-163). Springer, Berlin, Heidelberg. |

Visual Question Answering

| Test | Description | Intrinsic/Extrinsic | Static/Dynamic | Perceptive/Generative | 3D/2D | Labeled Skills | Dataset size | Number of sub-tests | Citation |
|---|---|---|---|---|---|---|---|---|---|
| Shapestacks | Scenes of 3D objects stacked on top of one another | Both | Both | Both | Both | No | 20000 | 1 | Groth, O., Fuchs, F. B., Posner, I., & Vedaldi, A. (2018). ShapeStacks: Learning vision-based physical intuition for generalised object stacking. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 702-717). |
| CLEVR+ | Includes CLEVR and CLEVRER; synthetic images of objects in a 3D scene | Both | Static | Perceptive | Both | No | 70k images, 700k questions (CLEVR); 20k videos, 300k questions (CLEVRER) | | Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017; Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019. |
| COG | Dataset with artificial and configurable items in the visual question answering format | Both | Static | Perceptive | 2D | No | > 2 trillion instances | 44 | Yang, G. R., Ganichev, I., Wang, X. J., Shlens, J., & Sussillo, D. (2018, September). A dataset and architecture for visual reasoning with a working memory. In European Conference on Computer Vision (pp. 729-745). Springer, Cham. |
| VQA | Images and natural language questions | Both | Static | Both | Both | No | 0.25M images, 0.76M questions, 10M answers | 2 (real and abstract scenes) | Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015. |
| Bongard Problems | Bongard-like problem sets, with two sets of items differentiated by some rule. | Both | Static | Primarily Perceptive | 2D | No | 100+ | 1 | Mikhail Moiseevich Bongard. The recognition problem. Technical report, FOREIGN TECHNOLOGY DIV WRIGHT-PATTERSON AFB OHIO, 1968. |

Egocentric Navigation

| Test | Description | Intrinsic/Extrinsic | Static/Dynamic | Perceptive/Generative | 3D/2D | Labeled Skills | Dataset size | Number of sub-tests | Citation |
|---|---|---|---|---|---|---|---|---|---|
| TOUCHDOWN | A Google-StreetView-based environment and linguistic instructions for tasks | Both | Both | Both | Both | No | 9,326 task/demonstration pairs | 2 | Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019. |

Diagram Understanding

| Test | Description | Intrinsic/Extrinsic | Static/Dynamic | Perceptive/Generative | 3D/2D | Labeled Skills | Dataset size | Number of sub-tests | Citation |
|---|---|---|---|---|---|---|---|---|---|
| AI2D | Diagrams from grade school science, with segmented annotations | Both | Static | Perceptive | Both | No | 5k images, 118k annotations, 15k multiple-choice questions | 1 | Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., & Farhadi, A. (2016). A diagram is worth a dozen images. European Conference on Computer Vision (pp. 235–251). Springer. |
| FigureSeer | Annotated figures from 20k scientific papers | Both | Static | Perceptive | 2D | No | 60k labeled figures, 600 annotated | 7 figure types | Noah Siegel, Zachary Horvitz, Roie Levin, Santosh Divvala, and Ali Farhadi. FigureSeer: Parsing result-figures in research papers. In European Conference on Computer Vision, pages 664–680. Springer, 2016. |
| FigureQA | Images of scientific plots and natural language questions about them | Both | Static | Perceptive | 2D | No | 100k images, 1.3M questions | 5 image classes, 15 question templates | Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017. |