We're releasing CortexBench and our first Visual Cortex model: VC-1. CortexBench is a collection of 17 different Embodied AI (EAI) tasks spanning locomotion, navigation, and dexterous and mobile manipulation. We performed the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) for EAI to date, and found that none of the existing PVRs perform well across all tasks. Next, we trained VC-1 on a combination of over 4,000 hours of egocentric videos from 7 different sources and ImageNet, totaling over 5.6 million images. We show that when VC-1 is adapted (through task-specific losses or a small amount of in-domain data), it is competitive with or outperforms the state of the art on all benchmark tasks.
We're open-sourcing two Visual Cortex models (model cards):
- VC-1 (ViT-L): our best model, which uses a ViT-L backbone and is also referred to simply as VC-1 | Download
- VC-1-base (ViT-B): pre-trained on the same data as VC-1 but with a smaller backbone (ViT-B) | Download
To install our visual cortex models and CortexBench, please follow the instructions in INSTALLATION.md.
The repository is organized as follows:
- `vc_models`: config files for the visual cortex models, the model-loading code, and some project utilities. See its README for more details.
- `cortexbench`: embodied AI downstream tasks used to evaluate pre-trained representations.
- `third_party`: third-party submodules which aren't expected to change often.
- `data`: a gitignored directory that must be created by the user. It is used by some downstream tasks to find (symlinks to) datasets, models, etc.
To use the VC-1 model, you can install the `vc_models` module with pip. Then, you can load the model with code such as the following, or follow our tutorial:
```python
import vc_models
from vc_models.models.vit import model_utils

model, embd_size, model_transforms, model_info = model_utils.load_model(model_utils.VC1_LARGE_NAME)
# To use the smaller VC-1-base model, use model_utils.VC1_BASE_NAME.

# The loaded img should be a Bx3x250x250 tensor
img = your_function_here ...

# Output will be of size Bx3x224x224
transformed_img = model_transforms(img)
# Embedding will be of size B x embd_size (1024 for VC-1, 768 for VC-1-base)
embedding = model(transformed_img)
```
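As a quick sanity check, the sketch below feeds a random tensor through the encoder in place of real images; the batch size of 4 and the `torch.rand` input are illustrative stand-ins, not part of the official API.

```python
import torch
from vc_models.models.vit import model_utils

# Load the pre-trained VC-1 (ViT-L) encoder together with its image transforms.
model, embd_size, model_transforms, model_info = model_utils.load_model(model_utils.VC1_LARGE_NAME)

# A random batch stands in for real egocentric frames (B x 3 x 250 x 250).
img = torch.rand(4, 3, 250, 250)

# The transforms resize/normalize to the resolution the encoder expects.
transformed_img = model_transforms(img)

# One embedding per image; no gradients are needed for feature extraction.
with torch.no_grad():
    embedding = model(transformed_img)

print(embedding.shape)  # (4, embd_size)
```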
To reproduce the results with the VC-1 model, please follow the README instructions for each of the benchmarks in `cortexbench`.
To load your own encoder model and run it across all benchmarks, follow these steps (a sketch of a custom loader is shown after this list):
- Create a configuration for your model, `<your_model>.yaml`, in the model configs folder of the `vc_models` module.
- In the config, specify the custom method for loading your encoder model as the `_target_` field.
- Then, you can load the model as follows:

```python
import vc_models
from vc_models.models.vit import model_utils

model, embd_size, model_transforms, model_info = model_utils.load_model(<your_model>)
```

- To run the CortexBench evaluation for your model, pass your model config as a parameter (`embedding=<your_model>`) for each of the benchmarks in `cortexbench`.
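For illustration, here is a minimal sketch of the kind of loading function a `<your_model>.yaml` could point to through its `_target_` field. Everything in it (the function name, the toy encoder, the `checkpoint_path` argument) is a hypothetical stand-in, and it assumes your loader mirrors the `(model, embedding size, transforms, model info)` convention returned by `model_utils.load_model`; check an existing model config in `vc_models` for the authoritative field layout.

```python
# Hypothetical example of a custom encoder loader that a <your_model>.yaml config
# could reference via its _target_ field. All names here are illustrative.
import torch
import torch.nn as nn
import torchvision.transforms as T


def load_my_encoder(checkpoint_path=None, embedding_dim=512):
    # Replace this toy CNN with your actual visual encoder.
    model = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, embedding_dim),
    )
    if checkpoint_path is not None:
        model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

    # Transforms mapping raw image tensors to whatever your encoder expects.
    transforms = T.Compose(
        [
            T.Resize((224, 224)),
            T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    model_info = {"name": "my_encoder"}
    return model, embedding_dim, transforms, model_info
```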
If you would like to contribute to Visual Cortex and CortexBench, please see CONTRIBUTING.md.
If you use Visual Cortex in your research, please cite the following paper:
```bibtex
@inproceedings{majumdar2023vc1,
  title     = {Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?},
  author    = {Arjun Majumdar and Karmesh Yadav and Sergio Arnaud and Yecheng Jason Ma and Claire Chen and Sneha Silwal and Aryan Jain and Vincent-Pierre Berges and Pieter Abbeel and Jitendra Malik and Dhruv Batra and Yixin Lin and Oleksandr Maksymets and Aravind Rajeswaran and Franziska Meier},
  publisher = {arXiv},
  year      = {2023}
}
```
The majority of Visual Cortex and CortexBench code is licensed under CC-BY-NC (see the LICENSE file for details); however, portions of the project are available under separate license terms: trifinger_simulation is licensed under the BSD 3.0 license; mj_envs and mjrl are licensed under the Apache 2.0 license; Habitat Lab, dmc2gym, and mujoco-py are licensed under the MIT license.
The trained policy models and the task datasets are considered data derived from the corresponding scene datasets.
- Matterport3D-based task datasets and trained models are distributed with the Matterport3D Terms of Use and under the CC BY-NC-SA 3.0 US license.
- Gibson-based task datasets, the code for generating such datasets, and trained models are distributed with the Gibson Terms of Use and under the CC BY-NC-SA 3.0 US license.