Computer vision streamlined for the GPU

GPU-Accelerated Obstruction Free Photography

Hacked together by David Liao and Zhan Xiong Chin

Tested on: AWS g2.2xlarge instance (Linux ip-172-31-53-81 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux, Nvidia GK104GL [GRID 520])

Technical Presentation

Presentation

Build instructions

Just run make. Requires CUDA, OpenCV, and CImg.

Overview

The field of computer vision is, in many ways, the dual of computer graphics: the unknowns we solve for in optical models are swapped around. For example, the graphics problem "given camera motion and lighting + scene conditions, generate a photorealistic image" has the counterpart vision problem "given photorealistic images and estimated lighting + scene conditions, calculate the camera motion".

In this project, we take on the challenge of recovering a background scene from images that contain both the background and an occluding layer (e.g. a fence, a glass pane reflection, non-symmetric objects). The input to the algorithm is a series of images taken by a camera panning across the scene in planar motion. The intuition driving the solution is to leverage the parallax between the foreground and the background to separate the two layers. Because the computation is largely data parallel (per-pixel calculations), the algorithm is a perfect target for GPU acceleration. While the main focus is the performance gained by implementing the algorithm on the GPU (as opposed to the CPU), we will briefly go over the algorithm itself.

Algorithm Outline

Let's briefly go over the algorithm to get an idea of what's going on under the hood.

Description

Input: a sequence of images (we use 5) with the middle one being the "reference frame"

Assumptions:

  • There needs to be a distinct occluding layer and background layer
  • The majority of the background and foreground motion needs to be uniform
  • Every background pixel is visible in at least one frame

Output: A background image without obstruction

Pipeline

  1. Initialization
  2. Optimization

Implementation

Here we describe how we implement each step in the pipeline.

Edge Detection

We use Canny's edge detection algorithm, which has the following sub-pipeline:

  1. Grayscale conversion
  2. Noise removal (smoothing)
  3. Gradient calculation
  4. Non-maximum edge suppression
  5. Hysteresis

Step 1 is an intensity average and steps 2 + 3 are convolutions using a Gaussian kernel and Sobel operators.

Step 4 takes the edge direction and inspects the pixels on both sides of the edge (in the forward and reverse directions of the gradient); if the pixel is not a local maximum (a "mountain"), it is suppressed. The end result is thinned edges that are one pixel thick.

Step 5 is a technique to reduce noise in the edges when thresholding. With a single intensity threshold, an edge can dance around the boundary of that threshold, producing a noisy, dotted edge that we don't want. Instead, we set a lower threshold and an upper threshold: an edge that starts above the upper threshold but dips below it, while staying above the lower threshold, is still considered an edge. The image below, borrowed from the OpenCV docs, gives a good idea of what this step does. The vertical axis represents intensity, while the horizontal axis is a 1D projection along the edge. Edge A stays above the upper threshold; edge C dips below the upper threshold but stays above the lower one and is connected to A, so it is still considered an edge. Edge B, however, is not, since it begins, ends, and stays entirely between the two thresholds without ever connecting to a strong edge.
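Our project implements these stages as CUDA kernels; as a rough CPU-side illustration of the same sub-pipeline (the threshold and kernel-size choices here are ours, not tuned project values), an equivalent OpenCV sketch looks like:

```cpp
#include <opencv2/opencv.hpp>

// Illustrative CPU sketch of the edge-detection sub-pipeline; the project
// itself runs custom CUDA kernels for each stage.
cv::Mat detect_edges(const cv::Mat& bgr) {
    cv::Mat gray, blurred, edges;
    cv::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY);            // step 1: grayscale
    cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 1.4);   // step 2: smoothing
    // Steps 3-5 (Sobel gradients, non-maximum suppression, hysteresis) are
    // folded into cv::Canny; the two thresholds are the hysteresis bounds.
    cv::Canny(blurred, edges, 50, 150);
    return edges;
}
```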

Sparse Flow

We then need to calculate how the image moves. The original paper defines a Markov Random Field over the edges, constraining each edge pixel to resemble the location it flows to and its flow to be similar to that of neighboring edge pixels. We were not able to replicate this method, instead opting to approximate it with a more common optical flow algorithm, the Lucas-Kanade method. We made use of the OpenCV implementation for this step.

We run the Lucas-Kanade algorithm on the set of pixels that make up the edges. For each pixel, the Lucas-Kanade method assumes that the displacement of a small window (e.g. 21x21 pixels) around the pixel is constant between two frames, and sets up a system of linear equations on the spatial and temporal gradients within that window. It then uses least squares to solve this system, with the result being the optical flow of the chosen pixel. While this is also ideal for GPU acceleration, as each pixel can be treated independently, we did not have time to write a kernel for this step. Also, this method no longer enforces similarity between adjacent pixels, so the flow may not be as continuous as with the original paper's method.
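A minimal sketch of this step, assuming the Canny edge locations have already been collected into a point list (the window size and pyramid depth are our own choices):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Track edge pixels from the reference frame into another frame with
// pyramidal Lucas-Kanade (OpenCV's implementation).
std::vector<cv::Point2f> track_edges(const cv::Mat& ref_gray,
                                     const cv::Mat& next_gray,
                                     const std::vector<cv::Point2f>& edge_pts) {
    std::vector<cv::Point2f> next_pts;
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(ref_gray, next_gray, edge_pts, next_pts,
                             status, err, cv::Size(21, 21), 3);
    // Points with status == 0 failed to track and should be dropped
    // before the layer separation step.
    return next_pts;
}
```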

Layer Separation

To figure out which edges are part of the occluding layer and which are part of the background layer, we use RANSAC to separate the edge pixels' motion vectors into two clusters.

For this, we make the assumption that both the occluding layer and the background layer are translated in the same way between two frames. This is also a simplification from the original paper's method, which estimates a homography rather than a simple translation.

Then, we fix a percentage P% of pixels that will be classified as part of the background layer. We pick a random initial starting vector v, take the nearest P% of motion vectors to it, and classify those as our background layer. We then take the mean of this newly computed background layer's optical flow and treat it as our new vector v. Repeating the process for a fixed number of iterations, we end up with a separation of pixels into background and foreground layers.

This did not always give stable results, so we augmented it with additional heuristics, such as separating based on the intensity of the pixels (e.g. when the occluding layer is a fence or racket in front of wall or sky, "whiter" areas tend to be part of the occluding layer).
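A simplified sketch of this clustering loop (the fraction P, the iteration count, and the function name are illustrative choices, not the project's exact parameters):

```cpp
#include <opencv2/core.hpp>
#include <algorithm>
#include <vector>

// Separate sparse motion vectors into background/foreground: repeatedly take
// the P fraction of vectors nearest to a candidate background motion v and
// re-estimate v as their mean. Returns the indices of the background set.
std::vector<int> separate_background(const std::vector<cv::Point2f>& flow,
                                     float P = 0.5f, int iters = 50) {
    if (flow.empty()) return {};
    cv::Point2f v = flow[0];  // initial guess (could also be picked at random)
    std::vector<int> bg;
    std::vector<int> idx(flow.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    auto dist2 = [](const cv::Point2f& a, const cv::Point2f& b) {
        cv::Point2f d = a - b;
        return d.x * d.x + d.y * d.y;
    };
    for (int it = 0; it < iters; ++it) {
        // Sort vectors by distance to the current background motion estimate.
        std::sort(idx.begin(), idx.end(), [&](int a, int b) {
            return dist2(flow[a], v) < dist2(flow[b], v);
        });
        size_t n = std::max<size_t>(1, static_cast<size_t>(P * idx.size()));
        bg.assign(idx.begin(), idx.begin() + n);
        // Re-estimate v as the mean flow of the background cluster.
        cv::Point2f mean(0.f, 0.f);
        for (int i : bg) mean += flow[i];
        v = mean * (1.0f / static_cast<float>(bg.size()));
    }
    return bg;
}
```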

Dense Flow Interpolation

After we figure out how the edges move and separate them into layers, we interpolate the sparse motion vectors of the background layer into a dense motion field: we take the edge motion vectors and interpolate to figure out how every other pixel moves. We used two methods to compute this. One was a K-nearest-neighbors weighted interpolation; the other was a naive average used to generate an affine transformation matrix. As it turns out, the latter approximation works fine in the context of an initial estimate.
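The naive-average variant is easy to sketch: collapse the sparse background flow into a single mean translation and express it as a 2x3 affine matrix (the KNN-weighted variant interpolates a per-pixel field instead). The function name here is ours:

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Naive-average dense flow: use the mean background motion as a global
// translation, written as the 2x3 affine matrix that cv::warpAffine expects.
cv::Mat mean_flow_affine(const std::vector<cv::Point2f>& bg_flow) {
    cv::Point2f mean(0.f, 0.f);
    for (const auto& f : bg_flow) mean += f;
    if (!bg_flow.empty()) mean *= 1.0f / static_cast<float>(bg_flow.size());
    // [1 0 tx; 0 1 ty] shifts every pixel by the mean background motion.
    cv::Mat M = (cv::Mat_<double>(2, 3) << 1, 0, mean.x,
                                           0, 1, mean.y);
    return M;
}
```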

Warp Matrix

We take the background motion field from the last step and generate a warp matrix (basically inverting the motion matrix), then apply it to all the non-reference images to warp them onto the reference image. This moves all the images so that their backgrounds are stacked on top of each other; the foreground is then the only moving layer.
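Under the translation/affine simplification above, warping a non-reference frame onto the reference frame is a single cv::warpAffine call. A sketch, assuming M maps reference coordinates into the frame's coordinates (so its inverse is applied):

```cpp
#include <opencv2/opencv.hpp>

// Align a non-reference frame to the reference frame. M is the estimated
// background motion (reference -> frame); WARP_INVERSE_MAP tells OpenCV to
// apply M as the dst-to-src mapping, so the backgrounds stack up.
cv::Mat align_to_reference(const cv::Mat& frame, const cv::Mat& M) {
    cv::Mat aligned;
    cv::warpAffine(frame, aligned, M, frame.size(),
                   cv::INTER_LINEAR | cv::WARP_INVERSE_MAP);
    return aligned;
}
```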

Background and Foreground Initial Estimation

After the last step, we've reduced the problem to a state where the camera perspective is static. We can then estimate a pixel for each layer based on this information.

Mean based estimation

If the occluding layer is opaque, we can take the average of all the pixels at a location. If it is reflective, we can take the minimum (since reflection is additive light).

Spatial coherence based estimation

Alternatively, we can employ smarter techniques that take spatial coherence into account. In this case, we select the pixel that has the most neighbors with a similar pixel intensity.
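A rough sketch of the mean/min initial estimate over the aligned frames (assuming 8-bit color images; the spatial coherence variant would instead score each candidate pixel by how many neighbors share its intensity):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Initial background estimate from the warped (aligned) frames.
// Opaque obstruction: per-pixel mean. Reflective obstruction: per-pixel
// minimum, since a reflection only adds light on top of the background.
cv::Mat estimate_background(const std::vector<cv::Mat>& aligned, bool reflective) {
    CV_Assert(!aligned.empty());
    if (reflective) {
        cv::Mat result = aligned[0].clone();
        for (size_t i = 1; i < aligned.size(); ++i)
            result = cv::min(result, aligned[i]);      // keep the darkest observation
        return result;
    }
    cv::Mat acc = cv::Mat::zeros(aligned[0].size(), CV_32FC3);
    for (const cv::Mat& img : aligned) {
        cv::Mat f;
        img.convertTo(f, CV_32FC3);                    // accumulate in float
        acc += f;
    }
    acc /= static_cast<double>(aligned.size());        // per-pixel mean
    cv::Mat result;
    acc.convertTo(result, aligned[0].type());
    return result;
}
```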

Optimization

The quality of the image can be measured by an objective function defined over the foreground and background layers and motion fields, as well as the alpha mask between the two layers. Then, optimizing this objective function will (in principle) give a better image.

The original paper made use of an objective function that penalizes the following terms:

  1. Differences between the original image and the combination of the estimated background and foreground
  2. Noisy alpha mask
  3. Noisy background/foreground images
  4. High gradients on the same pixel in both the background and foreground images (i.e. highly detailed parts of the image probably belong to only one layer rather than a mix of both)
  5. Noisy motion fields
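Schematically, in our own notation (which simplifies the paper's exact formulation), with I_B and I_O the background and obstruction layers, α the alpha mask, V_B and V_O the motion fields, and W_t^B, W_t^O the warps they induce for frame t, an objective with these five terms looks like:

$$
E = \sum_t \left\| I_t - \mathcal{W}_t^{O}(\alpha \odot I_O) - \mathcal{W}_t^{B}\big((1-\alpha) \odot I_B\big) \right\|_1
+ \lambda_1 \|\nabla \alpha\|_1
+ \lambda_2 \left( \|\nabla I_B\|_1 + \|\nabla I_O\|_1 \right)
+ \lambda_3 \sum_p \|\nabla I_B(p)\|^2 \, \|\nabla I_O(p)\|^2
+ \lambda_4 \left( \|\nabla V_B\|^2 + \|\nabla V_O\|^2 \right)
$$

The first term is the reconstruction error, the next three penalize noisy alpha masks and layers and shared high gradients, and the last keeps the motion fields smooth.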

They optimized this using iteratively reweighted least squares, using the pyramid method to start from a coarse image and optimize on successively finer ones.

For our code, we made use of the same objective function, but we performed a naive gradient descent to optimize the objective function. Tweaking the parameters of the optimization, we sometimes were able to obtain slightly better results, but in general, we did not find a significant improvement in image quality, possibly due to the slow convergence of our naive gradient descent approach.
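A bare-bones sketch of that descent loop (the gradient callback stands in for whatever evaluates the objective's gradient; it is an assumption of this sketch, not project API):

```cpp
#include <opencv2/core.hpp>
#include <functional>

// Naive gradient descent on the background layer estimate: step opposite the
// objective's gradient for a fixed number of iterations. `grad_fn` is a
// placeholder for the routine that evaluates dE/dI_B at the current estimate.
void gradient_descent(cv::Mat& background,
                      const std::function<cv::Mat(const cv::Mat&)>& grad_fn,
                      float step_size, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        cv::Mat grad = grad_fn(background);   // dE/dI_B, same size/type as background
        background -= step_size * grad;       // I_B <- I_B - eta * dE/dI_B
    }
}
```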

Results

The pipeline was benchmarked on the individual kernel invocations only. Overhead costs were excluded from this analysis, as they are not relevant to kernel performance. The input used was img/hanoiX.png, where X is replaced by the image number. Please look at the entire recorded data set here. Note the tabs at the bottom!

Performance Analysis

We note some of the more important numbers here. At a glance, it's pretty obvious that the pipeline hugely benefited from GPU acceleration; the most notable is the gradient descent optimization step, which received a huge speedup. Edge detection, generating the foreground and background estimates, calculating spatial coherence, and gradient descent are all low-cost computations (<5 ms). This makes sense, as most of the work they do is purely arithmetic: they all have low branch counts and perform only the most basic logical evaluations.

Since the pipeline is sequential, performance bottlenecks are distributed across the pipeline steps, and each can be a target for future improvement. The most notable bottleneck is the RANSAC algorithm. Since it is an iterative method, tweaking the iteration parameters would speed it up at the cost of accuracy. In its current benchmarked state, it runs ~50 iterations, so each iteration takes about ~2 ms.

Time Breakdown

The bulk of the GPU compute time was spent in RANSAC and gradient descent. This is because both are iterative methods that require many steps to converge; as a result, we want a high iteration count for both to get the best images. On the other hand, if we were to make this a real-time pipeline, we would choose to do ~5 iterations to achieve sub-100 ms times.

Other images

Reflection removal as well. Top left: original image, top right: separation, bottom left: without gradient descent, bottom right: with gradient descent

Removing the spokes of a bike wheel from Harrison College House.

Bloopers

When Canny edge detection goes wrong...

A bug early in the pipeline propagates and magnifies downstream.

A Maya scene where we have ground-truth camera movements as well as distances, but still the wrong output!

Acknowledgements

We want to give a big thanks to Patrick Cozzi for teaching the CIS 565 course as well as Benedict Brown for giving us general CV tips.