[SIGGRAPH 2020] Consistent Video Depth Estimation
Paper] [Project Website] [Google Colab]
[We present an algorithm for reconstructing dense, geometrically consistent depth for all pixels in a monocular video. We leverage a conventional structure-from-motion reconstruction to establish geometric constraints on pixels in the video. Unlike the ad-hoc priors in classical reconstruction, we use a learning-based prior, i.e., a convolutional neural network trained for single-image depth estimation. At test time, we fine-tune this network to satisfy the geometric constraints of a particular input video, while retaining its ability to synthesize plausible depth details in parts of the video that are less constrained. We show through quantitative validation that our method achieves higher accuracy and a higher degree of geometric consistency than previous monocular reconstruction methods. Visually, our results appear more stable. Our algorithm is able to handle challenging hand-held captured input videos with a moderate degree of dynamic motion. The improved quality of the reconstruction enables several applications, such as scene reconstruction and advanced video-based visual effects.
Consistent Video Despth Estimation
Xuan Luo,
Jia-Bin Huang,
Richard Szeliski,
Kevin Matzen, and
Johannes Kopf
In SIGGRAPH 2020.
Prerequisite
- Pull third-party packages.
git submodule update --init --recursive
- Install python packages.
conda create -n consistent_depth python=3.6 conda activate consistent_depth ./scripts/install.sh
- FFmpeg
- Install COLMAP following https://colmap.github.io/install.html. Note COLMAP >= 3.6 is required to exclude extracting features on dynamic objects.
If you are using Ubuntu, you can install COLMAP by
./scripts/install_colmap_ubuntu.sh
.
Quick Start
You can run the following demo without installing COLMAP. The demo takes 37 min when tested on one NVIDIA GeForce RTX 2080 GPU.
- Download models and the demo video together with its precomputed COLMAP results.
./scripts/download_model.sh ./scripts/download_demo.sh results/ayush
- Run
where
python main.py --video_file data/videos/ayush.mp4 --path results/ayush \ --camera_params "1671.770118, 540, 960" --camera_model "SIMPLE_PINHOLE" \ --make_video
1671.770118, 540, 960
is camera intrinsics (f, cx, cy
) andSIMPLE_PINHOLE
is the camera model. - You can inspect the test-time training process by
tensorboard --logdir results/ayush/R_hierarchical2_mc/B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam/tensorboard/
- You can find your results as below.
Expected output can be found here. Your results can be different due to randomness in the test-time training process.
results/ayush/R_hierarchical2_mc videos/ color_depth_mc_depth_colmap_dense_B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam.mp4 # comparison of disparity maps from mannequin challenge, COLMAP and ours B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam/ depth/ # final disparity maps checkpoints/0020.pth # final checkpoint eval/ # disparity maps and losses after each epoch of training
The demo runs everything including flow estimation, test-time training, etc. except the COLMAP part for quick demonstration and ease of installation.
To enable testing the COLMAP part, you can delete results/ayush/colmap_dense
and results/ayush/depth_colmap_dense
.
And then run the python command above again.
Customized Run:
Please refer to params.py
or run python main.py --help
for the full list of parameters.
Here I demonstrate some examples for common usage of the system.
Run on Your Own Videos
- Place your video file at
$video_file_path
. - [Optional] Calibrate camera using
PINHOLE
(fx, fy, cx, cy) orSIMPLE_PINHOLE
(f, cx, cy) model. Camera intrinsics calibration is optional but suggested for more accurate and faster camera registration. We typically calibrate the camera by capturing a video of a textured plane with really slow camera motion while trying to let target features cover the full field of view, selecting non-blurry frames, running COLMAP on these images. - Run
- Run without camera calibration.
python main.py --video_file $video_file_path --path $output_path --make_video
- Run with camera calibration. For instance, run with
PINHOLE
model andfx, fy, cx, cy = 1660.161322, 1600, 540, 960
python main.py --video_file $video_file_path --path $output_path \ --camera_model "PINHOLE" --camera_params "1660.161322, 1600, 540, 960" \ --make_video
- You can also specify backend monocular depth estimation network by
The supported model types are
python main.py --video_file $video_file_path --path $output_path \ --camera_model "PINHOLE" --camera_params "1660.161322, 1600, 540, 960" \ --make_video --model_type "${model_type}"
mc
(Mannequin Challenge by Zhang et al. 2019), ,midas2
(MiDaS by Ranftl el al. 2019) andmonodepth2
(Monodepth2 by Godard et al. 2019).
- Run without camera calibration.
Run with Precomputed Camera Poses
We rely on COLMAP to for camera pose registration. If you have precomputed camera poses instead,
you can provide them to the system in folder $path
as follows.
(Example file structure of $path
see here.)
- Save your color images as
color_full/frame_%06d.png
. - Create
frame.txt
of format (example see here):number_of_frames width height frame_000000_timestamp_in_seconds frame_000001_timestamp_in_seconds ...
- Convert your camera pose to COLMAP sparse reconstruction format following this.
Put your
images.txt
,cameras.txt
andpoints3D.txt
(or.bin
) undercolmap_dense/pose_init/
. Note that thePOINTS2D
inimages.txt
and thepoints3D.txt
can be empty. - Run.
python main.py --path $path --initialize_pose
Mask out Dynamic Object for Camera Pose Estimation
To get better pose for dynamic scene, you can mask out dynamic objects when extracting features with COLMAP. Note COLMAP >= 3.6 is required to extract features in masked regions.
-
Extract frames
python main.py --video_file $video_file_path --path $output_path --op extract_frames
-
Run your favourite segmentation method (e.g., Mask-RCNN) on images in
$output_path/color_full
to extract binary mask for dynamic objects (e.g., human). No features will be extracted in regions, where the mask image is black (pixel intensity value 0 in grayscale). Following COLMAP document, save the mask of frame$output_path/color_full/frame_000010.png
, for instance, at$output_path/mask/frame_000010.png.png
. -
Run the rest of the pipeline.
python main.py --path $output_path --mask_path $output_path/mask \ --camera_model "${camera_model}" --camera_params "${camera_intrinsics}" \ --make_video
Result Folder Structure
The result folder is of the following structure. Lots of files are saved only for debugging purposes.
frames.txt # meta data about number of frames, image resolution and timestamps for each frame
color_full/ # extracted frames in the original resolution
color_down/ # extracted frames in the resolution for disparity estimation
color_down_png/
color_flow/ # extracted frames in the resolution for flow estimation
flow_list.json # indices of frame pairs to finetune the model with
flow/ # optical flow
mask/ # mask of consistent flow estimation between frame pairs.
vis_flow/ # optical flow visualization. Green regions contain inconsistent flow.
vis_flow_warped/ # visualzing flow accuracy by warping one frame to another using the estimated flow. e.g., frame_000000_000032_warped.png warps frame_000032 to frame_000000.
colmap_dense/ # COLMAP results
metadata.npz # camera intrinsics and extrinsics converted from COLMAP sparse reconstruction.
sparse/ # COLMAP sparse reconstruction
dense/ # COLMAP dense reconstruction
depth_colmap_dense/ # COLMAP dense depth maps converted to disparity maps in .raw format
depth_${model_type}/ # initial disparity estimation using the original monocular depth model before test-time training
R_hierarchical2_${model_type}/
flow_list_0.20.json # indices of frame pairs passing overlap ratio test of threshold 0.2. Same content as ../flow_list.json.
metadata_scaled.npz # camera intrinsics and extrinsics after scale calibration. It is the camera parameters used in the test-time training process.
scales.csv # frame indices and corresponding scales between initial monocular disparity estimation and COLMAP dense disparity maps.
depth_scaled_by_colmap_dense/ # monocular disparity estimation scaled to match COLMAP disparity results
vis_calibration_dense/ # for debugging scale calibration. frame_000000_warped_to_000029.png warps frame_000000 to frame_000029 by scaled camera translations and disparity maps from initial monocular depth estimation.
videos/ # video visualization of results
B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam/
checkpoints/ # checkpoint after each epoch
depth/ # final disparity map results after finishing test-time training
eval/ # intermediate losses and disparity maps after each epoch
tensorboard/ # tensorboard log for the test-time training process
Citation
If you find our code useful, please consider citing our paper:
@article{Luo-VideoDepth-2020,
author = {Luo, Xuan and Huang, Jia{-}Bin and Szeliski, Richard and Matzen, Kevin and Kopf, Johannes},
title = {Consistent Video Depth Estimation},
booktitle = {ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH)},
publisher = {ACM},
volume = {39},
number = {4},
year = {2020}
}
License
This work is licensed under MIT License. See LICENSE for details.
Acknowledgments
We would like to thank Patricio Gonzales Vivo, Dionisio Blanco, and Ocean Quigley for creating the artistic effects in the accompanying video. We thank True Price for his practical and insightful advice on reconstruction and Ayush Saraf for his suggestions in engineering.