Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

This is the official PyTorch implementation for the system proposed in the paper :

Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

Seokju Lee, Sunghoon Im, Stephen Lin, and In So Kweon

AAAI-21 [PDF] [Project]

⟹ Unified Visual Odometry : Our holistic visualization of depth and motion estimation from self-supervised monocular training.

If you find our work useful in your research, please consider citing our paper :

@inproceedings{lee2021learning,
  title={Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency},
  author={Lee, Seokju and Im, Sunghoon and Lin, Stephen and Kweon, In So},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  year={2021}
}

Install

Our code is tested with CUDA 10.2/11.0, Python 3.7.x (conda environment), and PyTorch 1.4.0/1.7.0.

At least 2 GPUs (each 12 GB) are required to train the models with batch_size=4 and maximum_number_of_instances_per_frame=3.

Create a conda environment with the PyTorch library as follows :

conda create -n my_env python=3.7.4 pytorch=1.7.0 torchvision torchaudio cudatoolkit=11.0 -c pytorch
conda activate my_env

Install the prerequisite packages listed in requirements.txt :

pip3 install -r requirements.txt

or manually install the following packages :

opencv-python
imageio
matplotlib
scipy==1.1.0
scikit-image
argparse
tensorboardX
blessings
progressbar2
path
tqdm
pypng
open3d==0.8.0.0

Please install torch-scatter and torch-sparse following this link.

pip3 install torch-scatter torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html

Datasets

We provide our KITTI-VIS and Cityscapes-VIS datasets (download link), which are composed of pre-processed images, auto-annotated instance segmentation, and optical flow.

Images are pre-processed with SC-SfMLearner.
Instance segmentation is pre-processed with PANet.
Optical flow is pre-processed with PWC-Net.

We associate these annotations across frames to perform video instance segmentation, as implemented in datasets/sequence_folders.py.

Please arrange the dataset in the following file structure :

kitti_256 (or cityscapes_256)
    ├ image
    │   └ $SCENE_DIR
    ├ segmentation
    │   └ $SCENE_DIR
    ├ flow_f
    │   └ $SCENE_DIR
    ├ flow_b
    │   └ $SCENE_DIR
    ├ train.txt
    └ val.txt
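
As a quick sanity check before training, the layout above can be verified with a short script. This is a minimal sketch and not part of the repository: the function name check_layout is hypothetical, and it assumes only what the tree shows, namely that every scene folder under image also appears under segmentation, flow_f, and flow_b.

```python
from pathlib import Path

# Sub-directories named in the layout above; each should hold one folder per scene.
EXPECTED_SUBDIRS = ("image", "segmentation", "flow_f", "flow_b")


def check_layout(root):
    """Return a list of problems found under `root` (empty if the layout looks fine).

    Hypothetical helper for illustration; it is not part of the repository.
    """
    root = Path(root)
    problems = []
    for name in EXPECTED_SUBDIRS:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}")
    for name in ("train.txt", "val.txt"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    # Every scene under image/ should also exist under the other sub-directories.
    image_dir = root / "image"
    if image_dir.is_dir():
        scenes = sorted(p.name for p in image_dir.iterdir() if p.is_dir())
        for name in EXPECTED_SUBDIRS[1:]:
            if not (root / name).is_dir():
                continue  # already reported as a missing directory
            for scene in scenes:
                if not (root / name / scene).is_dir():
                    problems.append(f"scene {scene} missing under {name}")
    return problems
```

Calling check_layout("kitti_256") returns an empty list when all four sub-directories and both split files are present and every scene is covered.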

The training and validation scene lists in train.txt and val.txt can be generated by randomly splitting the scenes.
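
A random split could be produced along the following lines. This is a sketch under stated assumptions rather than the authors' script: it assumes that each split file lists one scene folder name per line (the convention used by SC-SfMLearner), and the function name write_split and the parameters val_ratio and seed are hypothetical.

```python
import random
from pathlib import Path


def write_split(root, val_ratio=0.1, seed=0):
    """Randomly split the scene folders under `root`/image into train.txt and val.txt.

    Assumption: each split file holds one scene folder name per line.
    """
    root = Path(root)
    scenes = sorted(p.name for p in (root / "image").iterdir() if p.is_dir())
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(scenes)
    n_val = round(len(scenes) * val_ratio)
    val, train = sorted(scenes[:n_val]), sorted(scenes[n_val:])
    (root / "train.txt").write_text("".join(name + "\n" for name in train))
    (root / "val.txt").write_text("".join(name + "\n" for name in val))
    return train, val
```

For example, write_split("kitti_256", val_ratio=0.1) writes both files and returns the two scene lists.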

Training

You can train the models on KITTI-VIS by running :

sh scripts/train_resnet_256_kt.sh

You can train the models on Cityscapes-VIS by running :

sh scripts/train_resnet_256_cs.sh

Please indicate the location of the dataset with $TRAIN_SET.

The hyperparameters (batch size, learning rate, loss weight, etc.) are defined in each script file and as default arguments in train.py. Please also refer to our main paper.

During training, checkpoints will be saved in checkpoints/.

You can also start a tensorboard session by running :

tensorboard --logdir=checkpoints/ --port 8080 --bind_all

and visualize the training progress by opening http://localhost:8080 in your browser.

For convenience, we provide two breakpoints (supported with pdb), commented as BREAKPOINT in train.py. Each breakpoint marks an important stage of the object projection.

BREAKPOINT-1 : Breakpoint after the 1st projection with camera motion. Visualize ego-warped images.
BREAKPOINT-2 : Breakpoint after the 2nd projection with each object motion. Visualize fully-warped images and motion fields.
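
The two projection stages can be illustrated for a single pixel. The sketch below is purely geometric and is not the code in train.py, which warps whole image tensors: the function names, the intrinsics tuple (fx, fy, cx, cy), and the representation of each motion as a rotation matrix plus translation are all assumptions made for illustration. A pixel is back-projected with its depth, moved by the camera motion (stage 1, the ego-warped position), and, if it belongs to a moving instance, moved again by that object's motion (stage 2, the fully-warped position).

```python
def backproject(u, v, depth, K):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = K
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def transform(point, R, t):
    """Apply a rigid motion (3x3 rotation R, translation t) to a 3D point."""
    return [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]


def project(point, K):
    """Project a 3D point in the camera frame back to pixel coordinates."""
    fx, fy, cx, cy = K
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def two_stage_projection(u, v, depth, K, ego, obj=None):
    """Return (ego-warped pixel, fully-warped pixel) for one source pixel.

    `ego` and `obj` are (R, t) pairs; `obj` is None for background pixels,
    in which case both outputs coincide.
    """
    p = backproject(u, v, depth, K)
    p_ego = transform(p, *ego)          # stage 1: camera motion only
    ego_pixel = project(p_ego, K)
    p_full = transform(p_ego, *obj) if obj is not None else p_ego  # stage 2: object motion
    return ego_pixel, project(p_full, K)
```

With K = (100, 100, 50, 50), identity rotations, an ego translation of (1, 0, 0), and an object translation of (0, 0, 10), the pixel (50, 50) at depth 10 lands at (60, 50) after the first stage and at (55, 50) after the second.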

You can visualize the intermediate outputs with the commented code, which makes the code easier to debug.

Models

We provide KITTI-VIS and Cityscapes-VIS pretrained models (download link).

The architectures are based on the ResNet18 encoder. Please see models/ for details.

Models trained under three different conditions are released :

KITTI : Trained on KITTI-VIS using ImageNet (ResNet18) pretrained model.
CS : Trained on Cityscapes-VIS using ImageNet (ResNet18) pretrained model. This model is intended only for pretraining and the demo.
CS+KITTI : Pretrained on Cityscapes-VIS, and finetuned on KITTI-VIS.

Evaluation

We evaluate our depth estimation following the KITTI Eigen split. Evaluation requires the KITTI raw dataset, which is available from the official website. The test scenes are listed in kitti_eval/test_files_eigen.txt.

You can evaluate the models by running :

sh scripts/run_eigen_test.sh

Please indicate the location of the raw dataset with $DATA_ROOT, and the models with $DISP_NET.

Our results are as follows :

Models	Abs Rel	Sq Rel	RMSE	RMSE log	Acc 1	Acc 2	Acc 3
ResNet18, 832x256, ImageNet → KITTI	0.112	0.777	4.772	0.191	0.872	0.959	0.982
ResNet18, 832x256, Cityscapes → KITTI	0.109	0.740	4.547	0.184	0.883	0.962	0.983
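
For reference, the columns of the table follow the standard monocular depth metrics, where Acc 1, Acc 2, and Acc 3 are the fractions of pixels whose ratio max(pred/gt, gt/pred) is below 1.25, 1.25², and 1.25³. The sketch below computes them on plain lists of positive depths. It is not the repository's evaluation code, which may additionally apply steps such as depth capping, cropping, or scale alignment; the function name depth_metrics is hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Compute the standard depth error and accuracy metrics.

    `pred` and `gt` are equal-length sequences of positive depths (valid pixels only).
    """
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Accuracy under threshold: share of pixels with max(p/g, g/p) < 1.25 ** k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    acc1, acc2, acc3 = (sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))
    return {
        "abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "rmse_log": rmse_log,
        "acc1": acc1, "acc2": acc2, "acc3": acc3,
    }
```

Lower is better for the four error metrics, and higher is better for the three accuracy metrics.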

For convenience, we also provide precomputed depth maps in this link.

Demo

We demonstrate Unified Visual Odometry, which shows the results of depth, ego-motion, and object motion holistically.

You can visualize them by running :

sh scripts/run_demo.sh

Please indicate the location of the image samples with $SCENE. We recommend visualizing Cityscapes scenes, since they contain more dynamic objects than KITTI scenes.

More results are demonstrated in this link.

References

SC-SfMLearner (NeurIPS 2019, our baseline framework)
PANet (CVPR 2018, instance segmentation for data pre-processing)
PWC-Net (CVPR 2018, optical flow for data pre-processing)
PyTorch-Sparse (PyTorch library for sparse tensor representation)
Struct2Depth (AAAI 2019, object scale loss)
Depth from Video in the Wild (ICCV 2019, motion field representation)

License

The source code is released under the MIT license.

SeokjuLee/Insta-DM