Parallel Implementation of Canny Edge Detection

A parallel implementation of the Canny Edge Detection algorithm using OpenMP, CUDA and OpenCL.

There are four Visual Studio 2019 projects, one per implementation: OpenMP, CUDA, OpenCL, and serial.

Prerequisites

CUDA Toolkit
OpenMP
OpenCL
Visual Studio

Speedup Result

More detailed documentation can be found in canny_doc.pdf

Limitations and Future Work

Profiling with the NVIDIA Visual Profiler shows that the apply_gaussian_filter kernel accounts for 59.1% of the computation time and the apply_sobel_filter kernel for 33.1%. The profile also shows no kernel concurrency. Concurrency could be introduced by splitting apply_sobel_filter into two kernels, for example sobel_seperable_pass_x and sobel_seperable_pass_y, one for each direction of the Sobel filter. Neither kernel depends on the other's output, so the two can execute concurrently.
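The independence of the two Sobel directions can be illustrated on the CPU. The following is a minimal C++ sketch, not the project's CUDA code: the names Gray, sobel_pass_x, sobel_pass_y and sobel_magnitude are hypothetical stand-ins, and clamp-to-edge border handling is an assumption. Each pass only reads the input and writes its own buffer, which is why they can run side by side (here as OpenMP sections; on the GPU the analogous step would be launching the two kernels so that they may overlap).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major grayscale image (hypothetical helper type for this sketch).
struct Gray {
    int width;
    int height;
    std::vector<float> px;
};

// Read a pixel with clamp-to-edge borders (an assumption of this sketch).
static float clamped(const Gray& g, int x, int y) {
    x = x < 0 ? 0 : (x >= g.width ? g.width - 1 : x);
    y = y < 0 ? 0 : (y >= g.height ? g.height - 1 : y);
    return g.px[static_cast<std::size_t>(y) * g.width + x];
}

// Horizontal gradient: window [-1 0 1; -2 0 2; -1 0 1]. Reads `in`, writes only `gx`.
void sobel_pass_x(const Gray& in, std::vector<float>& gx) {
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x)
            gx[static_cast<std::size_t>(y) * in.width + x] =
                (clamped(in, x + 1, y - 1) + 2.0f * clamped(in, x + 1, y) + clamped(in, x + 1, y + 1)) -
                (clamped(in, x - 1, y - 1) + 2.0f * clamped(in, x - 1, y) + clamped(in, x - 1, y + 1));
}

// Vertical gradient: window [-1 -2 -1; 0 0 0; 1 2 1]. Reads `in`, writes only `gy`.
void sobel_pass_y(const Gray& in, std::vector<float>& gy) {
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x)
            gy[static_cast<std::size_t>(y) * in.width + x] =
                (clamped(in, x - 1, y + 1) + 2.0f * clamped(in, x, y + 1) + clamped(in, x + 1, y + 1)) -
                (clamped(in, x - 1, y - 1) + 2.0f * clamped(in, x, y - 1) + clamped(in, x + 1, y - 1));
}

// Gradient magnitude. The two passes share no output, so they may run concurrently.
std::vector<float> sobel_magnitude(const Gray& in) {
    const std::size_t n = static_cast<std::size_t>(in.width) * in.height;
    assert(in.px.size() == n);
    std::vector<float> gx(n), gy(n), mag(n);

    // Without OpenMP enabled the pragmas are ignored and the passes simply
    // run one after the other, producing the same result.
    #pragma omp parallel sections
    {
        #pragma omp section
        sobel_pass_x(in, gx);
        #pragma omp section
        sobel_pass_y(in, gy);
    }

    // The magnitude step depends on both passes, so it runs after they finish.
    for (std::size_t i = 0; i < n; ++i)
        mag[i] = std::hypot(gx[i], gy[i]);
    return mag;
}
```

Only the final magnitude step needs both results, so it is the single synchronisation point in this sketch.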

In addition, we found that the Sobel and Gaussian filters are separable. The current implementation does not exploit this; therefore, a filter with an M×M window computes M² operations per pixel. With separable filters, the cost would be reduced to M + M = 2M operations per pixel. This is a two-step process: the intermediate result of the first one-dimensional convolution is stored and then convolved with the second one-dimensional filter to produce the output. We believe performance would improve significantly by using separable filters.
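The cost argument can be checked with a small CPU-side experiment. The following is a minimal C++ sketch, not the project's code: Image, gaussian_taps, filter_2d and filter_separable are hypothetical names, clamp-to-edge borders are an assumption, and the counter counts one multiply-add per filter tap. It applies the same Gaussian once as a full M×M window and once as a horizontal pass followed by a vertical pass.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major grayscale image (hypothetical helper type for this sketch).
struct Image {
    int width;
    int height;
    std::vector<float> px;
    Image(int w, int h) : width(w), height(h), px(static_cast<std::size_t>(w) * h, 0.0f) {}
    float& operator()(int x, int y) { return px[static_cast<std::size_t>(y) * width + x]; }
    // Clamp-to-edge read (an assumption, not necessarily the project's border policy).
    float clamped(int x, int y) const {
        x = x < 0 ? 0 : (x >= width ? width - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return px[static_cast<std::size_t>(y) * width + x];
    }
};

// Normalised 1D Gaussian taps of odd length m.
std::vector<float> gaussian_taps(int m, float sigma) {
    assert(m > 0 && m % 2 == 1);
    std::vector<float> k(m);
    float sum = 0.0f;
    for (int i = 0; i < m; ++i) {
        const float d = static_cast<float>(i - m / 2);
        k[i] = std::exp(-(d * d) / (2.0f * sigma * sigma));
        sum += k[i];
    }
    for (float& v : k) v /= sum;
    return k;
}

// Full M x M window (outer product of the taps): M*M multiply-adds per pixel.
Image filter_2d(const Image& in, const std::vector<float>& taps, long long& macs) {
    const int m = static_cast<int>(taps.size());
    const int r = m / 2;
    std::vector<float> window(static_cast<std::size_t>(m) * m);
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < m; ++i)
            window[static_cast<std::size_t>(j) * m + i] = taps[j] * taps[i];
    Image out(in.width, in.height);
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x) {
            float acc = 0.0f;
            for (int j = 0; j < m; ++j)
                for (int i = 0; i < m; ++i) {
                    acc += window[static_cast<std::size_t>(j) * m + i] * in.clamped(x + i - r, y + j - r);
                    ++macs;
                }
            out(x, y) = acc;
        }
    return out;
}

// Two 1D passes with a stored intermediate image: M + M multiply-adds per pixel.
Image filter_separable(const Image& in, const std::vector<float>& taps, long long& macs) {
    const int m = static_cast<int>(taps.size());
    const int r = m / 2;
    Image tmp(in.width, in.height);
    Image out(in.width, in.height);
    for (int y = 0; y < in.height; ++y)          // horizontal pass
        for (int x = 0; x < in.width; ++x) {
            float acc = 0.0f;
            for (int i = 0; i < m; ++i) { acc += taps[i] * in.clamped(x + i - r, y); ++macs; }
            tmp(x, y) = acc;
        }
    for (int y = 0; y < in.height; ++y)          // vertical pass over the intermediate
        for (int x = 0; x < in.width; ++x) {
            float acc = 0.0f;
            for (int j = 0; j < m; ++j) { acc += taps[j] * tmp.clamped(x, y + j - r); ++macs; }
            out(x, y) = acc;
        }
    return out;
}
```

Both functions produce the same image up to floating-point rounding. For a 5×5 window, the per-pixel count drops from 25 to 10 multiply-adds, at the price of one extra intermediate buffer.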

We could also use multiple streams to overlap memcpy with kernel execution. The improvement may not be large, but it is one more step towards pushing performance to its limit.

Contribute

Fork the project.
Make your feature addition or bug fix.
Report issues.
Send me a pull request.

justgam3/canny-edge-parallel
