/tfoptflow

Optical Flow Estimation with TensorFlow. Implements "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," by Deqing Sun et al. (CVPR 2018)

Primary LanguageJupyter NotebookMIT LicenseMIT

Optical Flow Estimation with Tensorflow

This repo provides a TensorFlow-based implementation of the wonderful paper "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," by Deqing Sun et al. (CVPR 2018).

There are already a few attempts at implementing PWC-Net using TensorFlow out there. However, they either use outdated architectures of the paper's CNN networks, only provide TF inference (no TF training), only work on Linux platforms, and do not support multi-GPU training.

This implementation provides both TF-based training and inference. It is portable: because it doesn't use any dynamically loaded CUDA-based TensorFlow user ops, it works on Linux and Windows. It also supports multi-GPU training (the notebooks and results shown here were collected on a GTX 1080 Ti paired with a Titan X). The code also allows for mixed-precision training.

Finally, as shown in the "Links to pre-trained models" section, we achieve better results than the ones reported in the official paper on the challenging MPI-Sintel 'final' dataset.

Table of Contents

Background

The purpose of optical flow estimation is to generate a dense 2D real-valued (u,v vector) map of the motion occurring from one video frame to the next. This information can be very useful when trying to solve computer vision problems such as object tracking, action recognition, video object segmentation, etc.

Figure [2017a] (a) below shows training pairs (black and white frames 0 and 1) from the Middlebury Optical Flow dataset as well as the their color-coded optical flow ground truth. Figure (b) indicates the color coding used for easy visualization of the (u,v) flow fields. Usually, vector orientation is represented by color hue while vector length is encoded by color saturation:

The most common measures used to evaluate the quality of optical flow estimation are angular error (AE) and endpoint error (EPE). The angular error between two optical flow vectors (u0, v0) and (u1, v1) is defined as arccos((u0, v0) . (u1, v1)). The endpoint error measures the distance between the endpoints of two optical flow vectors (u0, v0) and (u1, v1) and is defined as sqrt((u0 - u1)2 + (v0 - v1)2).

Environment Setup

The code in this repo was developed and tested using Anaconda3 v.5.2.0. To reproduce our conda environment, please refer to the following files:

On Ubuntu:

On Windows:

Links to pre-trained models

Pre-trained models can be found here. They come in two flavors: "small" (sm) models don't use dense connections or residual connections, "large" (lg) models do. They are all built with a 6-level pyramid, upsampling level 2 by 4 in each dimension to generate the final prediction, and construct an 81-channel cost volume at each level from a search range (maximum displacement) of 4.

Please note that we trained these models using slightly different dataset and learning rate schedules. The official multistep schedule discussed in [2018a] is as follows: Slong 1.2M iters training, batch size 8 + Sfine 500k iters finetuning, batch size 4). Ours is Slong only, 1.2M iters, batch size 8, on a mix of FlyingChairs and FlyingThings3DHalfRes. FlyingThings3DHalfRes is our own version of FlyingThings3 where every input image pair and groundtruth flow has been downsampled by two in each dimension. We also use a different set of augmentation techniques.

Model name Notebooks FlyingChairs (384x512) AEPE Sintel clean (436x1024) AEPE Sintel final (436x1024) AEPE
pwcnet-lg-6-2-multisteps-chairsthingsmix train 1.44 (notebook) 2.60 (notebook) 3.70 (notebook)
pwcnet-sm-6-2-multisteps-chairsthingsmix coming soon coming soon coming soon coming soon

As a reference, here are the official, reported results:

We also measured the following MPI-Sintel (436 x 1024) inference times on a few GPUs:

Model name Titan X GTX 1080 GTX 1080 Ti
pwcnet-lg-6-2-cyclic-chairsthingsmix 90ms 81ms 68ms
pwcnet-sm-6-2-cyclic-chairsthingsmix 68.5ms 64.4ms 53.8ms

PWC-Net

Basic Idea

Per [2018a], PWC Net improves on FlowNet2 [2016a] by adding domain knowledge into the design of the network. The basic idea behind optical flow estimation it that a pixel will retain most of its brightness over time despite a positional change from one frame to the next ("brightness" constancy). We can grab a small patch around a pixel in video frame 1 and find another small patch in video frame 2 that will maximize some function (e.g., normalized cross correlation) of the two patches. Sliding that patch over the entire frame 1, looking for a peak, generates what's called a cost volume (the C in PWC). This techniques is fairly robust (invariant to color change) but is expensive to compute. In some cases, you may need a fairly large patch to reduce the number of false positives in frame1, raising the complexity even more.

To alleviate the cost of generating the cost volume, the first optimization is to use pyramidal processing (the P in PWC). Using a lower resolution image lets you perform the search sliding a smaller patch from frame 1 over a smaller version of frame 2, yielding a smaller motion vector, then use that information as a hint to perform a more targeted search at the next level of resolution in the pyramid. That multiscale motion estimation can be performed in the image domain or in the feature domain (i.e., using the downscaled feature maps generated by a convnet). In practice, PWC warps (the W in PWC) frame 1 using an upsampled version of the motion flow estimated at a lower resolution because this will lead to searching for a smaller motion increment in the next higher resolution level of the pyramid (hence, allowing for a smaller search range). Here's a screenshot of a talk given by Deqing Sun that illustrates this process using a 2-level pyramid:

Note that none of the three optimizations used here (P/W/C) are unique to PWC-Net. These are techniques that were also used in SpyNet [2016b] and FlowNet2 [2016a]. However, here, they are used on the CNN features, rather than on an image pyramid:

The authors also acknowledge the fact that careful data augmentation (e.g., adding horizontal flipping) was necessary to reach best performance. To improve robustness, the authors also recommend training on multiple datasets (Sintel+KITTI+HD1K, for example) with careful class imbalance rebalancing.

Since this algorithm only works on two continuous frames at a time, it has the same limitations as methods that only use image pairs (instead of n frames with n>2). Namely, if an object moves out of frame, the predicted flow will likely have a large EPE. As the authors remark, techniques that use a larger number of frames can accommodate for this limitation by propagating motion information over time. The model also sometimes fails for small, fast moving objects.

Network

Here's a picture of the network architecture described in [2018a]:

Jupyter Notebooks

The recommended way to test this implementation is to use the following Jupyter notebooks:

Training

Multisteps learning rate schedule

Differently from the original paper, we do not train on FlyingChairs and FlyingThings3D sequentially (i.e, pre-train on FlyingChairs then finetune on FlyingThings3D). This is because the average flow magnitude on the MPI-Sintel dataset is only 13.5, while the average flow magnitudes on FlyingChairs and FlyingThings3D are 11.1 and 38, respectively. In our experiments, finetuning on FlyingThings3D would only yield worse results on MPI-Sintel.

We got more stable results by using a half-resolution version of the FlyingThings3D dataset with an average flow magnitude of 19, much closer to FlyingChairs and MPI-Sintel in that respect. We then trained on a mix of the FlyingChairs and FlyingThings3DHalfRes datasets. This mix, of course, could be extended with additional datasets.

Here are the training curves for the Slong training notebooks listed above:

Note that, if you click on the IMAGE tab in Tensorboard while running the training notebooks above, you will be able to visualize the progress of the training on a few validation samples (including the levels of the feature pyramids), as demonstrated here:

Cyclic learning rate schedule

If you don't want to use the long training schedule, but still would like to play with this code, try our very short cyclic learning rate schedule (100k iters, batch size 8). The results are nowhere near as good, but they allow for quick experimentation:

Model name Notebooks FlyingChairs (384x512) AEPE Sintel clean (436x1024) AEPE Sintel final (436x1024) AEPE
pwcnet-lg-6-2-cyclic-chairsthingsmix train 2.67 (notebook) 3.99 (notebook) 5.08 (notebook)
pwcnet-sm-6-2-cyclic-chairsthingsmix train 2.79 (notebook) 4.34 (notebook) 5.3 (notebook)

Below are the training curves for the Cyclicshort training notebooks:

Mixed-precision training

You can speed up training even further by using mixed-precision training. But, again, don't expect the same level of accuracy:

Model name Notebooks FlyingChairs (384x512) AEPE Sintel clean (436x1024) AEPE Sintel final (436x1024) AEPE
pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16 train 3.14 (notebook) 5.15 (notebook) 6.10 (notebook)

Evaluation

As shown in the evaluation notebooks, and as expected, it becomes harder for the PWC-Net models to deliver accurate flow predictions if the average flow magnitude from one frame to the next is high:

It is especially hard for this -- and any other 2-frame based motion estimator! -- model to generate accurate predictions when picture elements simply disappear out-of-frame or suddenly fly-in:

Still, when the average motion is moderate, both the small and large models generate remarkable results:

Datasets

Datasets most commonly used for optical flow estimation include:

Additional optical flow datasets (not used here):

  • Middlebury Optical Flow [web]
  • Heidelberg HD1K Flow [web]

Per [2018a], KITTI and Sintel are currently the most challenging and widely-used benchmarks for optical flow. The KITTI benchmark is targeted at autonomous driving applications and its semi-dense ground truth is collected using LIDAR. The 2012 set only consists of static scenes. The 2015 set is extended to dynamic scenes via human annotations and more challenging to existing methods because of the large motion, severe illumination changes, and occlusions.

The Sintel benchmark is created using the open source graphics movie "Sintel" with two passes, clean and final. The final pass contains strong atmospheric effects, motion blur, and camera noise, which cause severe problems to existing methods.

References

2018

2017

  • [2017a] Baghaie et al. 2017. Dense Descriptors for Optical Flow Estimation: A Comparative Study. [web]

2016

2015

Acknowledgments

Other TensorFlow implementations we are indebted to:

@InProceedings{Sun2018PWC-Net,
  author    = {Deqing Sun and Xiaodong Yang and Ming-Yu Liu and Jan Kautz},
  title     = {{PWC-Net}: {CNNs} for Optical Flow Using Pyramid, Warping, and Cost Volume},
  booktitle = CVPR,
  year      = {2018},
}
@InProceedings\{DFIB15,
  author       = "A. Dosovitskiy and P. Fischer and E. Ilg and P. H{\"a}usser and C. Hazirbas and V. Golkov and P. v.d. Smagt and D. Cremers and T. Brox",
  title        = "FlowNet: Learning Optical Flow with Convolutional Networks",
  booktitle    = "IEEE International Conference on Computer Vision (ICCV)",
  month        = "Dec",
  year         = "2015",
  url          = "http://lmb.informatik.uni-freiburg.de//Publications/2015/DFIB15"
}

Contact Info

If you have any questions about this work, please feel free to contact us here:

https://www.linkedin.com/in/philferriere