vidar: A Python repository from Toyota Research Institute - Machine Learning - Toyota Research Institute

TRI-VIDAR

Official PyTorch repository for some of TRI's latest publications, including self-supervised learning, multi-view geometry, and depth estimation. Our goal is to provide a clean environment to reproduce our results and facilitate further research in this field. This repository is an updated version of PackNet-SfM, our previous monocular depth estimation repository, featuring a different license.

Models

(Experimental) For convenient inference, we provide a growing list of our models (ZeroDepth, PackNet, DeFiNe) model over torchhub without installation.

(New!) ZeroDepth

import torch
zerodepth_model = torch.hub.load("TRI-ML/vidar", "ZeroDepth", pretrained=True, trust_repo=True)

intrinsics = torch.tensor(np.load('examples/ddad_intrinsics.npy')).unsqueeze(0)
rgb = torch.tensor(imread('examples/ddad_sample.png')).permute(2,0,1).unsqueeze(0)/255.

depth_pred = zerodepth_model(rgb, intrinsics)

PackNet

PackNet is a self-supervised monocular depth estimation model, to load a model trained on KITTI and run inference on an RGB image:

import torch
packnet_model = torch.hub.load("TRI-ML/vidar", "PackNet", pretrained=True, trust_repo=True)
rgb = torch.tensor(imread('examples/ddad_sample.png')).permute(2,0,1).unsqueeze(0)/255.

depth_pred = model(rgb_image)

DeFiNe

DeFiNe is a multi-view depth estimation model, to load a model trained on Scannet and run inference on multiple posed RGB images:

import torch
define_model = torch.hub.load("TRI-ML/vidar", "DeFiNe", pretrained=True, trust_repo=True)
frames = {} 
frames["rgb"] = # a list of frames as 13HW torch.tensors
frames["intrinsics"] = # a list of 133 torch.tensor intrinsics matrices (one for each image)
frames["pose"] = # a batch of 144 relative poses to reference frame (one will be identity)
depth_preds = define_model(frames) # list of depths, one for each frame

Installation

We recommend using our provided dockerfile (see nvidia-docker2 instructions) to have a reproducible environment. To set up the repository, type in a terminal (only tested in Ubuntu 18.04):

git clone --recurse-submodules https://github.com/TRI-ML/vidar.git   # Clone repository with submodules
cd vidar                                                             # Move to repository folder
make docker-build                                                    # Build the docker image (recommended)

To start our docker container, simply type make docker-interactive. From inside the docker, you can run scripts with the following command pattern:

python scripts/launch.py <config.yaml>         # Single CPU/GPU  
python scripts/launch_ddp.py <config.yaml>     # Distributed Data Parallel (DDP) multi-GPU

To verify that the environment is set up correctly, you can run a simple overfit test:

# Download a tiny subset of KITTI
mkdir /data/datasets 
curl -s https://tri-ml-public.s3.amazonaws.com/github/vidar/datasets/KITTI_tiny.tar | tar xv -C /data/datasets/
# Inside docker
python scripts/launch.py configs/overfit/kitti/selfsup_resnet18.yaml

Once training is over (which takes around 1 minute), you should achieve results similar to this:

If you want to use features related to AWS (for dataset access) and WandB (for experiment management), you can create associated accounts and configure your shell with the following environment variables:

export AWS_SECRET_ACCESS_KEY=something    # AWS secret key
export AWS_ACCESS_KEY_ID=something        # AWS access key
export AWS_DEFAULT_REGION=something       # AWS default region
export WANDB_ENTITY=something             # WANDB entity
export WANDB_API_KEY=something            # WANDB API key

Configuration

Configuration files (stored in the configs folder) are the entry points for training and inference. The basic structure of a configuration file is:

wrapper:                            # Training parameters 
    <parameters>
arch:                               # Architecture used
    model:                          # Model file and parameters
        file: <model_file>
        <parameters>
    networks:                       # Networks available to the model
        network1:                   # Network1 file and parameters 
            file: <network1_file>
            <parameters>
        network2:                   # Network2 file and parameters 
            file: <network2_file>
            <parameters>
        ...
    losses:                         # Losses available to the model 
        loss1:                      # Loss1 file and parameters
            file: <loss1_file>      
            <parameters>
        loss2:                      # Loss2 file and parameters
            file: <loss2_file>      
            <parameters>
        ...        
evaluation:                         # Evaluation metrics for different tasks
    evaluation1:                    # Evaluation1 and parameters
        <parameters>
    evaluation2:                    # Evaluation2 and parameters
        <parameters>
    ...
optimizers:                         # Optimizers used to train the networks
    network1:                       # Optimizer for network1 and parameters
        <parameters>
    network2:                       # Optimizer for network2 and parameters
        <parameters>
    ...
datasets:                           # Datasets used
    train:                          # Training dataset and parameters
        <parameters>                
        augmentation:               # Training augmentations and parameters 
            <parameters>
        dataloader:                 # Training dataloader and parameters
            <parameters>
    validation:                     # Validation dataset and parameters
        <parameters>                
        augmentation:               # Validation augmentations and parameters
            <parameters>
        dataloader:                 # Validation dataloader and parameters
            <parameters>

To enable WandB logging, you can set these additional parameters in your configuration file:

wandb:
    folder: /data/wandb     # Where the wandb run is stored
    entity: your_entity           # Wandb entity
    project: your_project         # Wandb project
    num_validation_logs: X        # Number of visualization logs
    tags: [tag1,tag2,...]         # Wandb tags
    notes: note                   # Wandb notes

To enable checkpoint saving, you can set these additional parameters in your configuration file:

checkpoint:
    folder: /data/checkpoints       # Local folder to store checkpoints
    save_code: True                       # Save repository folder as well
    keep_top: 5                           # How many checkpoints should be stored
    s3_bucket: s3://path/to/s3/bucket     # [optional] AWS folder to store checkpoints        
    dataset: [0]                          # [optional] Validation dataset index to track
    monitor: [depth|abs_rel_pp_gt(0)_0]   # [optional] Validation metric to track
    mode: [min]                           # [optional] If the metric is minimized (min) or maximized (max)

To facilitate the reutilization of configuration files, we also provide a recipe functionality, that enables parameter sharing. To use a recipe, simply type recipe: <path/to/recipe>|<entry> as an additional parameter, to copy all entries from that recipe onto that section. For example:

wrapper: 
  recipe: wrapper|default

will insert all parameters from section default of configs/recipes/wrapper.yaml onto the wrapper section of the configuration file. Parameters added after the recipe will overwrite those copied over, to facilitate customization.

Datasets

In our provided configuration files, datasets are assumed to be downloaded in /data/datasets/<dataset-name>. We have a separate repository for dataset management, that is a submodule of this repository and can be found here. It contains dataloaders for all datasets used in our works, as well as visualization tools that build upon our CamViz library.

cd externals/efm_datasets
python scripts/display_datasets/display_datasets.py scripts/config.yaml <dataset>

Note that you need to execute xhost +local: before entering the docker with make docker-interactive, to enable local visualization. Some examples of visualization results you will generate for KITTI and DDAD are shown below:

Publications

3D Packing for Self-Supervised Monocular Depth Estimation (CVPR 2020, oral)

Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, Adrien Gaidon

Abstract: Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self, semi, and fully supervised methods on the KITTI benchmark. The 3D inductive bias in PackNet enables it to scale with input resolution and number of parameters without overfitting, generalizing better on out-of-domain data such as the NuScenes dataset. Furthermore, it does not require large-scale supervised pretraining on ImageNet and can run in real-time. Finally, we release DDAD (Dense Depth for Automated Driving), a new urban driving dataset with more challenging and accurate depth evaluation, thanks to longer-range and denser ground-truth depth generated from high-density LiDARs mounted on a fleet of self-driving cars operating world-wide.

GT depth	Abs.Rel.	Sq.Rel.	RMSE	RMSElog	SILog	d_1.25	d_1.25²	d_1.25³
ResNet18 \| Self-Supervised \| 192x640 \| ImageNet → KITTI
Original	0.116	0.811	4.902	0.198	19.259	0.865	0.957	0.981
Improved	0.087	0.471	3.947	0.135	12.879	0.913	0.983	0.996
PackNet \| Self-Supervised \| 192x640 \| KITTI
Original	0.111	0.800	4.576	0.189	18.504	0.880	0.960	0.982
Improved	0.078	0.420	3.485	0.121	11.725	0.931	0.986	0.996

@inproceedings{tri-packnet,
  title = {3D Packing for Self-Supervised Monocular Depth Estimation},
  author = {Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, Adrien Gaidon},
  booktitle = {Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR)}
  year = {2020},
}

Multi-Frame Self-Supervised Depth Estimation with Transformers (CVPR 2022)

Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon

Abstract: Multi-frame depth estimation improves over single-frame approaches by also leveraging geometric relationships between images via feature matching, in addition to learning appearance-based features. In this paper we revisit feature matching for self-supervised monocular depth estimation, and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates, and refine predictions through a series of self- and cross-attention layers. These layers sharpen the matching probability between pixel features, improving over standard similarity metrics prone to ambiguities and local minima. The refined cost volume is decoded into depth estimates, and the whole pipeline is trained end-to-end from videos using only a photometric objective. Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures. We also show that our learned cross-attention network yields representations transferable across datasets, increasing the effectiveness of pre-training strategies.

GT depth	Frames	Abs.Rel.	Sq.Rel.	RMSE	RMSElog	SILog	d_1.25	d_1.25²	d_1.25³
DepthFormer \| Self-Supervised \| 192x640 \| ImageNet → KITTI
Original	Single (t)	0.117	0.876	4.692	0.193	18.940	0.874	0.960	0.981
Original	Multi (t-1,t)	0.090	0.661	4.149	0.175	17.260	0.905	0.963	0.982
Improved	Single (t)	0.083	0.464	3.591	0.126	12.156	0.926	0.986	0.996
Improved	Multi (t-1,t)	0.055	0.271	2.917	0.095	9.160	0.955	0.991	0.998

@inproceedings{tri-depthformer,
  title = {Multi-Frame Self-Supervised Depth with Transformers},
  author = {Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon},
  booktitle = {Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR)}
  year = {2022},
}

Full Surround Monodepth from Multiple Cameras (RA-L + ICRA 2022)

Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon

Abstract: Self-supervised monocular depth and ego-motion estimation is a promising approach to replace or supplement expensive depth sensors such as LiDAR for robotics applications like autonomous driving. However, most research in this area focuses on a single monocular camera or stereo pairs that cover only a fraction of the scene around the vehicle. In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs. Using generalized spatio-temporal contexts, pose consistency constraints, and carefully designed photometric loss masking, we learn a single network generating dense, consistent, and scale-aware point clouds that cover the same full surround 360 degree field of view as a typical LiDAR scanner. We also propose a new scale-consistent evaluation metric more suitable to multi-camera settings. Experiments on two challenging benchmarks illustrate the benefits of our approach over strong baselines.

Camera	Abs.Rel.	Sq.Rel.	RMSE	RMSElog	SILog	d_1.25	d_1.25²	d_1.25³
FSM \| Self-Supervised \| 384x640 \| ImageNet → DDAD
Front	0.131	2.940	14.252	0.237	22.226	0.824	0.935	0.969
Front Right	0.205	3.349	13.677	0.353	30.777	0.667	0.852	0.922
Back Right	0.243	3.493	12.266	0.394	33.842	0.594	0.821	0.907
Back	0.194	3.743	16.436	0.348	29.901	0.669	0.850	0.926
Back Left	0.235	3.641	13.570	0.387	31.765	0.594	0.816	0.907
Front Left	0.226	3.861	12.957	0.378	32.795	0.652	0.836	0.909

@inproceedings{tri-fsm,
  title = {Full Surround Monodepth from Multiple Cameras},
  author = {Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon},
  booktitle = {Robotics and Automation Letters (RA-L)}
  year = {2022},
}

Self-Supervised Camera Self-Calibration from Videos (ICRA 2022)

Jiading Fang, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon, Matthew R.Walter

Abstract: Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.

@inproceedings{tri-self_calibration,
  title = {Self-Supervised Camera Self-Calibration from Video},
  author = {Jiading Fang, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon, Matthew Walter},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)}
  year = {2022},
}

Depth Field Networks for Generalizable Multi-view Scene Representation (ECCV 2022)

Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Greg Shakhnarovich, Matthew Walter, Adrien Gaidon

Abstract: Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.

@inproceedings{tri-define,
  title={Depth Field Networks For Generalizable Multi-view Scene Representation},
  author={Guizilini, Vitor and Vasiljevic, Igor and Fang, Jiading and Ambrus, Rares and Shakhnarovich, Greg and Walter, Matthew R and Gaidon, Adrien},
  booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXII},
  pages={245--262},
  year={2022},
  organization={Springer}
}

Towards Zero-Shot Scale-Aware Monocular Depth Estimation (ICCV 2023)

Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rares Ambrus, Adrien Gaidon

Abstract: Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models will be geometry-specific, with learned scales that cannot be directly transferred across domains. Because of that, recent works focus instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages, via a variational latent representation that is conditioned on single frame information. We evaluated ZeroDepth targeting both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, and achieved a new state-of-the-art in both settings using the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates.

@inproceedings{tri-zerodepth,
  title={Towards Zero-Shot Scale-Aware Monocular Depth Estimation},
  author={Guizilini, Vitor and Vasiljevic, Igor and Chen, Dian and Ambrus, Rares and Gaidon, Adrien},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month={October},
  year={2023},
}

Robust Self-Supervised Extrinsic Self-Calibration (IROS 2023)

Takayuki Kanai, Igor Vasiljevic, Vitor Guizilini, Adrien Gaidon, Rares Ambrus

Abstract: Autonomous vehicles and robots need to operate over a wide variety of scenarios in order to complete tasks efficiently and safely. Multi-camera self-supervised monocular depth estimation from videos is a promising way to reason about the environment, as it generates metrically scaled geometric predictions from visual data without requiring additional sensors. However, most works assume well-calibrated extrinsics to fully leverage this multi-camera setup, even though accurate and efficient calibration is still a challenging problem. In this work, we introduce a novel method for extrinsic calibration that builds upon the principles of self-supervised monocular depth and ego-motion learning. Our proposed curriculum learning strategy uses monocular depth and pose estimators with velocity supervision to estimate extrinsics, and then jointly learns extrinsic calibration along with depth and pose for a set of overlapping cameras rigidly attached to a moving vehicle. Experiments on a benchmark multi-camera dataset (DDAD) demonstrate that our method enables self-calibration in various scenes robustly and efficiently compared to a traditional vision-based pose estimation pipeline. Furthermore, we demonstrate the benefits of extrinsics self-calibration as a way to improve depth prediction via joint optimization.

Abs.Rel.	Front	F.Left	F.Right	B.Left	B.Right	Back
SESC \| Self-Supervised \| 384x640 \| ImageNet → DDAD (GT-extrinsics)
Metric (PP)	0.159	0.194	0.213	0.212	0.220	0.205
Scaled (PP_MD)	0.156	0.200	0.225	0.217	0.234	0.216

@inproceedings{tri_sesc_iros23,
  title = {Robust Self-Supervised Extrinsic Self-Calibration},
  author = {Takayuki Kanai and Igor Vasiljevic and Vitor Guizilini and Adrien Gaidon and Rares Ambrus},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year = {2023},
}

License

This repository is released under the CC BY-NC 4.0 license.