🎉🎉🎉 Welcome to the NVDS GitHub repository! 🎉🎉🎉
This repository is the official PyTorch implementation of the ICCV 2023 paper "Neural Video Depth Stabilizer" (NVDS).
Authors: Yiran Wang1, Min Shi1, Jiaqi Li1, Zihao Huang1, Zhiguo Cao1, Jianming Zhang2, Ke Xian3*, Guosheng Lin3
Institutes: 1Huazhong University of Science and Technology, 2Adobe Research, 3Nanyang Technological University
Project Page | Arxiv | Video | Video (Chinese) | Poster | Supp | VDW Dataset
NVDS is the first plug-and-play stabilizer that can remove flickers from any single-image depth model without extra effort. Besides, we also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Don't forget to star this repo if you find it interesting!
Our VDW dataset is quite large (2.23 million frames, over 8TB on hard drive), so preparing it for open-source release takes considerable work. The VDW dataset can only be used for academic and research purposes. We will gradually release it to the community. Stay tuned!
- [2023.07.16] Our work is accepted by ICCV2023.
- [2023.07.18] The Arxiv version of our NVDS paper is released.
- [2023.07.18] Our Project Page is built and released.
- [2023.07.21] We present the NVDS checkpoint and demo (inference) code.
- [2023.08.05] Update license of VDW dataset: CC BY-NC-SA 4.0.
- [2023.08.10] Update the camera ready version of NVDS paper and supplementary.
- [2023.08.11] Release evaluation code and checkpoint of NYUDV2-finetuned NVDS.
- [2023.09.09] VDW official website and application mailbox (vdw.dataset@gmail.com) go online. Refer to the website for usage and applications.
- [2023.09.09] Evaluation code on VDW test set is released.
- [2023.09.17] Upload NVDS Poster for ICCV2023.
- [TODO] We will gradually update our VDW training set. Stay tuned!
-
VDW dataset.
We plan to release the VDW dataset under strict conditions. The VDW dataset is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) and cannot be used for any commercial purposes. For our video sequences, we will gradually release IMDB numbers, start times, end times, movie durations, resolutions, cropping areas, and some data processing tools for using the data. We will provide an application template and mailbox. If you need our processed dataset for your research, send an application to our mailbox. The application form asks for your name, institution, purpose for using our data, and agreement to our license. We will review your application and send you feedback within 3-5 weekdays. Overall, we will follow the practices of the community (previous open-source datasets with movie data, e.g., Hollywood 3D, MovieNet, etc.) to release the VDW dataset legally. Please refer to our VDW official website for data usage, download, and applications.
-
NVDS code and model.
Following MiDaS and CVD, the NVDS model adopts the widely used MIT License. We will gradually release our code and model as scheduled.
Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research.
-
Basic environment.
Our code is based on python=3.8.13 and pytorch==1.9.0. Refer to requirements.txt for installation.

    conda create -n NVDS python=3.8.13
    conda activate NVDS
    conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c conda-forge
    pip install numpy imageio opencv-python scipy tensorboard timm scikit-image tqdm glob h5py
-
Installation of GMFlow.
We utilize the state-of-the-art optical flow model GMFlow in the temporal loss and the OPW metric. The temporal loss is used to enhance consistency during training. The OPW metric is evaluated in our demo (inference) code to showcase quantitative improvements.
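To make the role of optical flow in the OPW metric more concrete, here is a minimal NumPy sketch of a flow-based warping error between two consecutive depth maps. It is an illustration only: the function names, the nearest-neighbor warping, and the (H, W, 2) flow layout with (dx, dy) channels are assumptions, and the OPW code in this repository, which relies on GMFlow, may differ in masking and normalization.

```python
import numpy as np


def warp_with_flow(depth_next, flow):
    """Bring depth_next (frame t+1) into frame t using forward flow t -> t+1.

    flow[..., 0] is the horizontal and flow[..., 1] the vertical displacement
    (assumed layout). Nearest-neighbor sampling keeps the sketch short.
    """
    h, w = depth_next.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_src = np.rint(xs + flow[..., 0]).astype(int)
    y_src = np.rint(ys + flow[..., 1]).astype(int)
    # Pixels whose correspondence falls outside the image are ignored.
    valid = (x_src >= 0) & (x_src < w) & (y_src >= 0) & (y_src < h)
    warped = np.zeros_like(depth_next)
    warped[valid] = depth_next[y_src[valid], x_src[valid]]
    return warped, valid


def warping_error(depth_t, depth_next, flow, mask=None):
    """Mean absolute difference between depth_t and the warped depth_next."""
    warped, valid = warp_with_flow(depth_next, flow)
    if mask is not None:
        valid = valid & mask.astype(bool)
    if not valid.any():
        return 0.0
    return float(np.abs(depth_t[valid] - warped[valid]).mean())


# Toy example: the scene shifts one pixel to the right between frames.
depth_t = np.arange(12, dtype=np.float64).reshape(3, 4)
depth_next = np.roll(depth_t, 1, axis=1)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0
print(warping_error(depth_t, depth_next, flow))
```

A temporally consistent prediction gives a small error once motion is compensated, while flickering depth gives a larger one.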
Please refer to the GMFlow Official Repo for the installation.
-
Installation of mmcv and mmseg.
Cross attention in our stabilization network contains functions based on mmcv-full==1.3.0 and mmseg==0.11.0. Please refer to MMSegmentation-v0.11.0 and its official documentation for step-by-step installation instructions. The key is to match the versions of mmcv-full and mmsegmentation with the versions of CUDA and PyTorch on your server. For instance, I have CUDA 11.1 and PyTorch 1.9.0 on my server, so mmcv-full 1.3.x and mmseg 0.11.0 (as in our installation instructions) are compatible with my environment (confirmed by mmcv-full 1.3.x). Different servers use different CUDA versions, so I cannot specify one installation that works for everyone. Check which versions match your own server in the official documentation of mmcv-full and mmseg, where you can select different versions and inspect their compatibility. If you read and follow those documents closely, the installation should be straightforward. You can also refer to Issue #1 for some discussions.

Besides, we suggest installing mmcv-full==1.x.x, because some APIs and functions are removed in mmcv-full==2.x.x (you would need to adjust our code for mmcv-full==2.x.x).
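Before running the demo, it can help to confirm that the installed packages match the versions mentioned above. The following is a small sketch using only the standard library. The distribution names (in particular `mmsegmentation` for mmseg) and the prefix-matching rule are assumptions, and a package installed in an unusual way may not be visible to `importlib.metadata`.

```python
from importlib import metadata

# Version prefixes taken from the installation notes above.
EXPECTED = {
    "torch": "1.9.",
    "torchvision": "0.10.",
    "mmcv-full": "1.",          # 2.x.x removes APIs used by the code
    "mmsegmentation": "0.11.",  # assumed distribution name for mmseg
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, prefix):
    """True if a version string starts with the expected prefix."""
    return installed is not None and installed.startswith(prefix)


def report(expected=EXPECTED, lookup=installed_version):
    """Map each package to (installed version, whether it matches)."""
    result = {}
    for package, prefix in expected.items():
        version = lookup(package)
        result[package] = (version, matches(version, prefix))
    return result


if __name__ == "__main__":
    for package, (version, ok) in report().items():
        status = "ok" if ok else "check"
        print(f"{package:16s} {version or 'not installed':16s} {status}")
```

A "check" line does not necessarily mean the setup is broken; it only flags a version that differs from the one described here.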
-
Preparing Demo Videos.
We put 8 demo input videos in the demo_videos folder, in which bandage_1 and market_6 are examples from the MPI Sintel dataset, and motocross-jump is from the DAVIS dataset. The others are a few examples from our VDW test dataset. You can also prepare your own test sequences in the same way.
-
Downloading checkpoints of depth predictors.
In our demo, we adopt MiDaS and DPT as different depth predictors. We use midas_v21-f6b98070.pt and dpt_large-midas-2f21e586.pt. Download those checkpoints and put them in the dpt/checkpoints/ folder. You may need to rename the MiDaS checkpoint (midas_v21_384.pt) or change the name used in our code (midas_v21-f6b98070.pt), since the file has been renamed in the MiDaS repo.
-
Preparing checkpoint of NVDS Stabilizer.
Download NVDS_Stabilizer.pth and put it in the NVDS_checkpoints/ folder.
-
Running NVDS Inference Demo.
infer_NVDS_dpt_bi.py and infer_NVDS_midas_bi.py use DPT and Midas as depth predictors, respectively. Those scripts contain: (1) NVDS bidirectional inference; (2) OPW metric evaluations with GMFlow. The only difference between the two scripts is the depth predictor. Taking DPT as an example, the basic command is:

    CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir /XXX/XXX --vnum XXX --infer_w XXX --infer_h XXX

--base_dir is the folder for saving results. --vnum refers to the video number or name. --infer_w and --infer_h are the width and height for inference. We use --infer_h 384 by default, and --infer_w is set to maintain the aspect ratio of the original video. Both --infer_w and --infer_h should be integer multiples of 32 so that resolutions align in the up-sampling and down-sampling processes.

Specifically, for the videos of the VDW test dataset (000423 as an example):

    CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/000423/ --vnum 000423 --infer_w 896 --infer_h 384
    CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/000423/ --vnum 000423 --infer_w 896 --infer_h 384
For the videos of the Sintel dataset (market_6 as an example):

    CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/market_6/ --vnum market_6 --infer_w 896 --infer_h 384
    CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/market_6/ --vnum market_6 --infer_w 896 --infer_h 384

For the videos of the DAVIS dataset (motocross-jump as an example):

    CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/motocross-jump/ --vnum motocross-jump --infer_w 672 --infer_h 384
    CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/motocross-jump/ --vnum motocross-jump --infer_w 672 --infer_h 384
Under the resolution of $896\times384$, the inference of DPT-Large and our stabilizer takes about 20G and 5G of GPU memory, respectively (RTX-A6000). If the memory occupancy is too large for your server, you can: (1) run the DPT/Midas initial depth inference and our NVDS separately; (2) reduce the inference resolution (e.g., $384\times384$); (3) if not needed, remove the OPW evaluations, in which the inference of GMFlow also brings some computational cost; (4) if not needed, remove the bidirectional (backward and mixing) inference. The forward inference alone can also produce satisfactory results, while bidirectional inference further improves consistency.

After running the inference code, the result folder --base_dir will be organized as follows:

    demo_outputs/dpt_init/000423/
    ├── result.txt
    ├── initial/
    │   ├── color/
    │   └── gray/
    ├── 1/
    │   ├── color/
    │   └── gray/
    ├── 2/
    │   ├── color/
    │   └── gray/
    └── mix/
        ├── color/
        └── gray/
result.txt contains the OPW evaluations of the initial depth (depth predictor, initial/), NVDS forward predictions (1/), backward predictions (2/), and final bidirectional results (mix/). color contains depth visualizations and gray contains depth results in uint16 format (0-65535).
-
Video Comparisons.
After getting the results, video comparisons can be generated and saved in demo_outputs_videos/:

    python pic2v.py --vnum 000423 --infer_w 896 --infer_h 384
    python pic2v.py --vnum market_6 --infer_w 896 --infer_h 384
    python pic2v.py --vnum motocross-jump --infer_w 672 --infer_h 384
We show 8 video comparisons in demo_outputs_videos/. The first row is the RGB video, the second row is the initial depth (DPT and MiDaS), and the third row is the NVDS results with DPT and MiDaS as depth predictors. To check that your results are correct, you can compare them with demo_outputs_videos and demo_outputs (png results). We show png results of the 8 videos by LINK. You are also encouraged to modify our code to stabilize your own depth predictors and discuss the results with us. We hope our work can serve as a solid baseline for future work in video depth estimation and other relevant tasks.
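For a numeric check in addition to the visual comparison, the uint16 PNG files in the gray folders can be compared frame by frame. The sketch below is one possible way to do that. The function names are hypothetical, and it assumes the PNGs are single-channel 16-bit images that OpenCV can read unchanged.

```python
import numpy as np


def load_depth_png(path):
    """Read a 16-bit depth PNG without converting it to 8 bits."""
    import cv2  # opencv-python, listed in the installation notes

    depth = cv2.imread(str(path), cv2.IMREAD_UNCHANGED)
    if depth is None:
        raise FileNotFoundError(path)
    return depth


def mean_abs_difference(depth_a, depth_b):
    """Mean absolute difference of two uint16 depth maps, scaled to [0, 1]."""
    if depth_a.shape != depth_b.shape:
        raise ValueError("depth maps must have the same shape")
    # Cast before subtracting: uint16 arithmetic would wrap around.
    a = depth_a.astype(np.float64) / 65535.0
    b = depth_b.astype(np.float64) / 65535.0
    return float(np.abs(a - b).mean())


# Toy example with synthetic data instead of files.
mine = np.array([[0, 65535]], dtype=np.uint16)
reference = np.array([[65535, 65535]], dtype=np.uint16)
print(mean_abs_difference(mine, reference))  # 0.5
```

Small non-zero differences can be expected across GPUs and library versions, so a tolerance is more useful than an exact equality test.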
-
Preparing 654 testing sequences.
Download the 654 testing sequences from LINK. Put the sequences in the ./test_nyu_data folder. The ./test_nyu_data folder should only contain the 654 folders of all testing sequences. The folder of each sequence is organized as follows:

    test_nyu_data/1/
    ├── rgb/
    │   └── 000000.png 000001.png 000002.png 000003.png
    └── gt/
        └── 000003.png

We follow the commonly used Eigen split with 654 images for testing. In our case, we locate each image (000003.png) in its video and use its previous three frames (000000.png, 000001.png, and 000002.png) as reference frames.
-
Preparing NVDS checkpoint finetuned on NYUDV2.
Download NVDS_Stabilizer_NYUDV2_Finetuned.pth and put it in the NVDS_checkpoints/ folder.
-
Evaluations with Midas and DPT as different depth predictors.
Run test_NYU_depth_metrics.py with the specified depth predictor (--initial_type dpt or midas).

    CUDA_VISIBLE_DEVICES=0 python test_NYU_depth_metrics.py --initial_type dpt
    CUDA_VISIBLE_DEVICES=1 python test_NYU_depth_metrics.py --initial_type midas
test_NYU_depth_metrics.py contains three parts: (1) inference of the depth predictor, producing initial results of Midas or DPT; (2) inference of NVDS based on the initial results; (3) metric evaluations of the depth predictor and NVDS. All inference is conducted at a resolution of $384\times384$, as in Midas and DPT. For simplicity, we only adopt the NVDS forward prediction in this code. By running the code, you can reproduce results similar to those in our paper:

| Methods | $\delta_1$ | $Rel$ | Methods | $\delta_1$ | $Rel$ |
|---|---|---|---|---|---|
| Midas | 0.910 | 0.095 | DPT | 0.928 | 0.084 |
| NVDS (Midas) | 0.941 | 0.076 | NVDS (DPT) | 0.950 | 0.072 |

After running the evaluation code, test_nyu_data will be organized as follows:

    test_nyu_data/1/
    ├── rgb/
    │   └── 000000.png 000001.png 000002.png 000003.png
    ├── gt/
    │   └── 000003.png
    ├── initial_midas/
    │   └── 000000.png 000001.png 000002.png 000003.png
    ├── initial_dpt/
    │   └── 000000.png 000001.png 000002.png 000003.png
    ├── NVDS_midas/
    │   └── 000003.png
    └── NVDS_dpt/
        └── 000003.png
We evaluate depth metrics of all methods only using the 654 images in the Eigen split, i.e., 000003.png of each sequence. 000000.png, 000001.png, and 000002.png are produced by depth predictors as the input of the stabilization network.
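For readers unfamiliar with the two depth metrics reported above, the sketch below computes $\delta_1$ and $Rel$ with their commonly used definitions: $\delta_1$ is the fraction of valid pixels where $\max(\hat{d}/d, d/\hat{d}) < 1.25$, and $Rel$ is the mean of $|\hat{d} - d| / d$. This is not the code from test_NYU_depth_metrics.py. The function name is hypothetical, and any alignment between predictions and ground truth (for example scale and shift fitting) is assumed to have been applied beforehand.

```python
import numpy as np


def depth_metrics(pred, gt, mask=None, threshold=1.25, eps=1e-8):
    """Return (delta_1, Rel) over pixels with valid ground truth."""
    valid = gt > 0
    if mask is not None:
        valid = valid & mask.astype(bool)
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")
    p = np.maximum(pred[valid].astype(np.float64), eps)  # avoid division by zero
    g = gt[valid].astype(np.float64)
    ratio = np.maximum(p / g, g / p)
    delta1 = float((ratio < threshold).mean())
    rel = float((np.abs(p - g) / g).mean())
    return delta1, rel


# Toy example: the last pixel has no ground truth and is ignored.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.0, 2.0, 6.0, 5.0])
delta1, rel = depth_metrics(pred, gt)
print(delta1, rel)
```

Higher $\delta_1$ and lower $Rel$ indicate better accuracy, which is how the table above should be read.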
-
Applying for the VDW test set.
Please refer to our VDW official website for data usage, download, and applications. If your application is approved by us, you can download the VDW test set and put it in a folder of your choice. Here we take /xxx/vdw_test as an example. The VDW test set contains 90 videos with 12,622 frames. For each video (e.g., /xxx/vdw_test/000008/), the test set is organized as follows. The left and right folders contain the RGB video frames of the left and right views, while the gt folders hold the disparity annotations and the mask folders hold the valid masks.

    /xxx/vdw_test/000008/
    ├── left/
    │   └── frame_000000.png frame_000001.png frame_000002.png ...
    ├── left_gt/
    │   └── frame_000000.png frame_000001.png frame_000002.png ...
    ├── left_mask/
    │   └── frame_000000.png frame_000001.png frame_000002.png ...
    ├── right/
    │   └── frame_000000.png frame_000001.png frame_000002.png ...
    ├── right_gt/
    │   └── frame_000000.png frame_000001.png frame_000002.png ...
    └── right_mask/
        └── frame_000000.png frame_000001.png frame_000002.png ...
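If you write your own data loader for this layout, the frames, disparity annotations, and masks of one view can be paired by file name. The sketch below illustrates this with `pathlib`. The function name is hypothetical, and it assumes that the three folders of a view use identical file names, as the listing above suggests.

```python
import tempfile
from pathlib import Path


def list_view_samples(video_dir, view="left"):
    """Return sorted (rgb, gt, mask) path triples for one view of a video."""
    video_dir = Path(video_dir)
    samples = []
    for rgb in sorted((video_dir / view).glob("frame_*.png")):
        gt = video_dir / f"{view}_gt" / rgb.name
        mask = video_dir / f"{view}_mask" / rgb.name
        # Skip frames that lack an annotation or a mask.
        if gt.exists() and mask.exists():
            samples.append((rgb, gt, mask))
    return samples


# Self-check on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    video = Path(tmp) / "000008"
    for folder in ("left", "left_gt", "left_mask"):
        (video / folder).mkdir(parents=True)
        for i in range(3):
            (video / folder / f"frame_{i:06d}.png").touch()
    demo_names = [rgb.name for rgb, _, _ in list_view_samples(video)]

print(demo_names)
```

Calling `list_view_samples("/xxx/vdw_test/000008", view="right")` would do the same for the right view.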
-
Inference and evaluations for each test video.
For each test video, the evaluation consists of two steps: (1) inference; and (2) depth metric evaluation. We provide write_sh.py to generate the evaluation scripts (for Midas and DPT). You should modify some details in write_sh.py (e.g., GPU number, VDW test set path, directories for saving NVDS results with Midas/DPT) in order to generate test_VDW_NVDS_Midas.sh and test_VDW_NVDS_DPT.sh. We provide the two example scripts with /xxx/ for those directories.

To be specific, (1) the inference step is the same as in the previous Demo & Inference part, using infer_NVDS_dpt_bi.py and infer_NVDS_midas_bi.py. In this step, the temporal metric OPW is automatically evaluated and saved in result.txt. (2) Depth metric evaluation uses vdw_test_metric.py to calculate $\delta_1$ and $Rel$ for each video. Taking ./vdw_test/000008/ as an example, --gt_dir specifies the path of vdw_test, --result_dir specifies your directory for saving results, and --vnum is the video number.

    python vdw_test_metric.py --gt_dir /xxx/vdw_test/ --result_dir /xxx/NVDS_VDW_Test/Midas/ --vnum 000008
    python vdw_test_metric.py --gt_dir /xxx/vdw_test/ --result_dir /xxx/NVDS_VDW_Test/DPT/ --vnum 000008
After generating test_VDW_NVDS_Midas.sh and test_VDW_NVDS_DPT.sh, you can run inference and evaluations for all the videos by:

    bash test_VDW_NVDS_Midas.sh
    bash test_VDW_NVDS_DPT.sh
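To give an idea of what such a generated script contains, the sketch below builds the two commands for one video from the flags documented above. It is not the write_sh.py shipped in this repository. The function name, the assumption that --base_dir is a per-video subfolder of --result_dir, and the fixed inference size are all illustrative guesses, and the real script may pass additional arguments.

```python
def build_commands(vnum, gt_dir, result_dir,
                   script="infer_NVDS_midas_bi.py",
                   infer_w=896, infer_h=384, gpu=0):
    """Return [inference command, metric command] for one test video."""
    # Assumed layout: results of each video go to <result_dir>/<vnum>/.
    base_dir = f"{result_dir.rstrip('/')}/{vnum}/"
    infer = (
        f"CUDA_VISIBLE_DEVICES={gpu} python {script} "
        f"--base_dir {base_dir} --vnum {vnum} "
        f"--infer_w {infer_w} --infer_h {infer_h}"
    )
    metric = (
        f"python vdw_test_metric.py --gt_dir {gt_dir} "
        f"--result_dir {result_dir} --vnum {vnum}"
    )
    return [infer, metric]


def build_script(vnums, gt_dir, result_dir, **kwargs):
    """Join the commands of several videos into the text of a shell script."""
    lines = []
    for vnum in vnums:
        lines.extend(build_commands(vnum, gt_dir, result_dir, **kwargs))
    return "\n".join(lines) + "\n"


print(build_script(["000008"], "/xxx/vdw_test/", "/xxx/NVDS_VDW_Test/Midas/"))
```

Writing the returned text to a .sh file and running it with bash would then process the listed videos one after another.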
-
Average metrics calculations for all 90 videos.
When the scripts have finished for all videos, the NVDS_VDW_Test folder will contain the results of the 90 test videos with Midas/DPT as depth predictors (/xxx/NVDS_VDW_Test/Midas/ and /xxx/NVDS_VDW_Test/DPT/). For each video, there will be an accuracy.txt that stores the depth metrics. The last step is to calculate the average temporal and depth metrics over all 90 videos. You can simply run cal_mean_vdw_metric.py for the final results.

    python cal_mean_vdw_metric.py --test_dir /xxx/NVDS_VDW_Test/Midas/
    python cal_mean_vdw_metric.py --test_dir /xxx/NVDS_VDW_Test/DPT/
Finally, you can get the same results as our paper. This also serves as an example to conduct evaluations on the VDW test set.
| Methods | $\delta_1$ | $Rel$ | $OPW$ | Methods | $\delta_1$ | $Rel$ | $OPW$ |
|---|---|---|---|---|---|---|---|
| Midas | 0.651 | 0.288 | 0.676 | DPT | 0.730 | 0.215 | 0.470 |
| NVDS-Forward (Midas) | 0.700 | 0.240 | 0.207 | NVDS-Forward (DPT) | 0.741 | 0.208 | 0.165 |
| NVDS-Backward (Midas) | 0.699 | 0.240 | 0.218 | NVDS-Backward (DPT) | 0.741 | 0.208 | 0.174 |
| NVDS-Final (Midas) | 0.700 | 0.240 | 0.180 | NVDS-Final (DPT) | 0.742 | 0.208 | 0.147 |
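The averaging step behind such a table can be sketched as follows. This is a simplified illustration, not cal_mean_vdw_metric.py: it assumes the per-video metrics have already been read into dictionaries (the key names here are hypothetical) and that every video counts equally, whereas the repository's script may parse accuracy.txt and result.txt directly and could weight videos differently.

```python
def average_metrics(per_video):
    """Average each metric over a list of per-video metric dictionaries."""
    if not per_video:
        raise ValueError("no videos to average")
    keys = per_video[0].keys()
    return {key: sum(video[key] for video in per_video) / len(per_video)
            for key in keys}


# Toy example with two videos and made-up numbers.
videos = [
    {"delta1": 0.5, "rel": 0.25, "opw": 0.125},
    {"delta1": 1.0, "rel": 0.75, "opw": 0.375},
]
print(average_metrics(videos))
```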
We thank the authors for releasing PyTorch, MiDaS, DPT, GMFlow, SegFormer, VSS-CFFM, Mask2Former, PySceneDetect, and FFmpeg. Thanks for their solid contributions and cheers to the community.
@article{wang2023neural,
title={Neural Video Depth Stabilizer},
author={Wang, Yiran and Shi, Min and Li, Jiaqi and Huang, Zihao and Cao, Zhiguo and Zhang, Jianming and Xian, Ke and Lin, Guosheng},
journal={arXiv preprint arXiv:2307.08695},
year={2023}
}