A review of existing VSR benchmarks from a developmental perspective.
Test | Description | Intrinsic/Extrinsic | Static/Dynamic | Perceptive/Generative | 3D/2D | Labeled Skills | Dataset size | Number of sub-tests | Citation |
---|---|---|---|---|---|---|---|---|---|
ARC | The Abstraction and Reasoning Corpus by François Chollet. | Both | Both | Both | Mainly 2D | No | 600 items, ~1k examples | <= 600 | François Chollet. The measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. |
RPM+ | Raven's Progressive Matrices and similarly formatted tests such as Wechsler Matrix Reasoning | Both | Static | Primarily Perceptive | 2D | No | 60 (RPM) | 1 | Raven, J., & Raven, J. (2003). Raven Progressive Matrices. In R. S. McCallum (Ed.), Handbook of nonverbal assessment (p. 223–237). Kluwer Academic/Plenum Publishers. https://doi.org/10.1007/978-1-4615-0153-4\_11 |
ACE Geometric Analogy | American Council on Education geometric analogy test used with Evans' ANALOGY system | Both | Static | Primarily Perceptive | 2D | No | 30 | 1 | Evans, T. G. (1964, April). A heuristic program to solve geometric-analogy problems. In Proceedings of the April 21-23, 1964, spring joint computer conference (pp. 327-338). |
CORE | Core Knowledge of Geometry odd-one-out test | Both | Static | Perceptive | 2D | Yes | 45 | 8 | Stanislas Dehaene, Véronique Izard, Pierre Pica, and Elizabeth Spelke. Core knowledge of geometry in an Amazonian indigene group. Science, 311(5759):381–384, 2006. ISSN 0036-8075. doi: 10.1126/science.1121739. URL https://science.sciencemag.org/content/311/5759/381. |
Leiter | The Leiter-R test of nonverbal intelligence | Both | Static | Both | Mainly 2D | Several major skill types are used as test names; many included skills are unmentioned | Hundreds | 20 | Gale H Roid and Lucy J Miller. Leiter international performance scale-revised (Leiter-R). Wood Dale, IL: Stoelting, 1997. |
WAIS Block Design | Wechsler Adult Intelligence Scale Block Design Task | Primarily Intrinsic | Both | Both | Both | No | Dozens | 1 | Drozdick, L. W., Wahlstrom, D., Zhu, J., & Weiss, L. G. (2012). The Wechsler Adult Intelligence Scale—Fourth Edition and the Wechsler Memory Scale—Fourth Edition. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (p. 197–223). The Guilford Press. |
WAIS Picture Completion | Wechsler Adult Intelligence Scale Picture Completion Task | Primarily Intrinsic | Both | Both | 2D | No | Dozens | 1 | Drozdick, L. W., Wahlstrom, D., Zhu, J., & Weiss, L. G. (2012). The Wechsler Adult Intelligence Scale—Fourth Edition and the Wechsler Memory Scale—Fourth Edition. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (p. 197–223). The Guilford Press. |
SPARE3D | Tests focused on 2D/3D understanding | Both | Static | Both | Both | Yes | 21518 objects | 5 | Wenyu Han, Siyuan Xiang, Chenhui Liu, Ruoyu Wang, and Chen Feng. Spare3d: A dataset for spatial reasoning on three-view line drawings. arXiv preprint arXiv:2003.14034, 2020. |
Paper Folding | Two-dimensional drawings of "unfolded" cubes with arrows on two edges | Intrinsic | Dynamic | Both | Both | Yes | 164 | 1 | Roger N Shepard and Christine Feng. A chronometric study of mental paper folding. Cognitive psychology, 3(2):228–243, 1972. |
Mental Rotation | Line-drawings of 10 cubes | Intrinsic | Dynamic | Both | Both | Yes | 1600 | 1 | Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971. |
Purdue SVT | Tests that use mental rotation to measure spatial visualization skills | Intrinsic | Dynamic | Both | Both | Yes | 32 items | 1 | Guay, R. (1977). Purdue spatial visualization test-visualization of rotations. W. Lafayette, IN. Purdue Research Foundation; Yoon, S. (2011). Revised Purdue spatial visualization test: Visualization of rotations (revised PSVT:R). Texas A&M University, College Station, TX,7 |
Tools | Task involves placing objects in 2D environments to achieve goals | Both | Dynamic | Both | 2D | No | 58 items | 20 | Kelsey R Allen, Kevin A Smith, and Joshua B Tenenbaum. The tools challenge: Rapid trial-and-error learning in physical problem solving. arXiv preprint arXiv:1907.09620, 2019. |
PHYRE | Similar to Tools, but with one or two circles available | Both | Dynamic | Both | 2D | No | 5000 tasks | 2 tiers, 25 templates per tier | Bakhtin, A., van der Maaten, L., Johnson, J., Gustafson, L., & Girshick, R. (2019). PHYRE: A new benchmark for physical reasoning. Advances in Neural Information Processing Systems (pp. 5083–5094). |
Moving MNIST | Multiple handwritten digits move independently in videos | Both | Both | Both | 2D | No | 10k x 20 frames | 1 | Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015, June). Unsupervised learning of video representations using LSTMs. In International conference on machine learning (pp. 843-852). |
20BN SSD | Short videos of objects and actions in the real world | Both | Both | Perceptive | Both | No | 20000000000 frames | 174 | Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 1, page 5, 2017. |
IntPhys | Benchmark for intuitive physics reasoning | Both | Both | Both | Both | Yes | ~20k video clips | 4 conceptually pure concepts | Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616, 2018. |
Physics 101 | Video clips of 101 objects annotated with materials and physical properties | Both | Dynamic | Both | Both | Yes | 4352 videos, 101 objs | Several | Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning physical object properties from unlabeled videos. In BMVC, volume 2, page 7, 2016. |
Physical Bongard | Bongard problems involving physics-like properties | Both | Both | Both | 2D | No | 23 | 1 | Weitnauer, E., & Ritter, H. (2012, September). Physical Bongard problems. In IFIP international conference on artificial intelligence applications and innovations (pp. 157-163). Springer, Berlin, Heidelberg. |
Shapestacks | Scenes of 3D objects stacked on top of one another | Both | Both | Both | Both | No | 20000 | 1 | Groth, O., Fuchs, F. B., Posner, I., & Vedaldi, A. (2018). Shapestacks: Learning vision-based physical intuition for generalised object stacking. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 702-717). |
CLEVR+ | Includes CLEVR and CLEVRER; synthetic images of objects in a 3D scene | Both | Static | Perceptive | Both | No | 70k images, 700k questions (CLEVR), 20k/300k (CLEVRER) | | Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.; Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019. |
COG | Dataset with artificial and configurable items in the visual question answering format | Both | Static | Perceptive | 2D | No | > 2 trillion instances | 44 | Yang, G. R., Ganichev, I., Wang, X. J., Shlens, J., & Sussillo, D. (2018, September). A dataset and architecture for visual reasoning with a working memory. In European Conference on Computer Vision (pp. 729-745). Springer, Cham. |
VQA | Images and natural language questions | Both | Static | Both | Both | No | 0.25M images, 0.76M questions, 10M answers | 2 (real and abstract scenes) | Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015. |
Bongard Problems | Bongard-like problem sets, with two sets of items differentiated by some rule. | Both | Static | Primarily Perceptive | 2D | No | 100+ | 1 | Mikhail Moiseevich Bongard. The recognition problem. Technical report, FOREIGN TECHNOLOGY DIV WRIGHT-PATTERSON AFB OHIO, 1968. |
TOUCHDOWN | A Google-StreetView-based environment and linguistic instructions for tasks | Both | Both | Both | Both | No | 9,326 task/demonstration pairs | 2 | Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019. |
AI2D | Diagrams from grade school science, with segmented annotations | Both | Static | Perceptive | Both | No | 5k images, 118k annotations, 15000 MC questions | 1 | Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., & Farhadi, A. (2016). A diagram is worth a dozen images. European Conference on Computer Vision (pp. 235–251). Springer. |
FigureSeer | Annotated figures from 20k scientific papers | Both | Static | Perceptive | 2D | No | 60k labeled figures, 600 annotated | 7 figure types | Noah Siegel, Zachary Horvitz, Roie Levin, Santosh Divvala, and Ali Farhadi. FigureSeer: Parsing result-figures in research papers. In European Conference on Computer Vision, pages 664–680. Springer, 2016. |
FigureQA | Images of scientific plots and natural language questions about them | Both | Static | Perceptive | 2D | No | 100k images, 1.3M questions | 5 image classes, 15 question templates | Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017. |