CUDA Image Blurring

Implementation of an image blurring kernel in CUDA, with and without shared memory.

System Specifications

• Azure NC6
• Cores: 6
• GPU: Tesla K80
• Memory: 56 GB
• Disk: 380 GB SSD
The Tesla K80 delivers 4992 CUDA cores with a dual-GPU design, up to 2.91 Teraflops of double-precision and up to 8.73 Teraflops of single-precision performance.

Implementation Details

We implement a simple blur, also known as a box blur (a linear box filter). For each output pixel, the operation samples the neighboring pixels of the input image and writes their average value to the output image. In this implementation, values located outside the bounds of the input image are treated as zero. The implementation works with images of any size, and the blur box size can be changed before running the code. To run the code, pass the command line arguments given below.
• argv[1]: IMAGENAME
• argv[2]: BLURTYPE
BLURTYPE selects the memory type. To run the code with unshared memory, pass '0' as the second argument. To run it with shared memory, pass '1' as the second argument. An example command is given below.
• ./imageBlur 1.ppm 0
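
The box filter described in this section can be sketched as a kernel that reads every neighbor directly from global memory (the unshared variant). This is a minimal sketch, not the repository's code: the names blurGlobal, blurOnGpu and BLUR_SIZE are hypothetical, the image is assumed to be a single 8-bit channel (an RGB PPM could be processed one channel at a time), and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical constant: box radius, so 1 gives a 3x3 window.
constexpr int BLUR_SIZE = 1;

// One thread per output pixel; every neighbor is read from global memory.
__global__ void blurGlobal(const unsigned char* in, unsigned char* out,
                           int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    int sum = 0;
    for (int dy = -BLUR_SIZE; dy <= BLUR_SIZE; ++dy) {
        for (int dx = -BLUR_SIZE; dx <= BLUR_SIZE; ++dx) {
            int r = row + dy;
            int c = col + dx;
            // Neighbors outside the image contribute zero.
            if (r >= 0 && r < height && c >= 0 && c < width)
                sum += in[r * width + c];
        }
    }
    // Divide by the full window size, so border pixels come out darker.
    const int window = (2 * BLUR_SIZE + 1) * (2 * BLUR_SIZE + 1);
    out[row * width + col] = static_cast<unsigned char>(sum / window);
}

// Hypothetical host wrapper: copy in, launch with 16x16 blocks, copy out.
std::vector<unsigned char> blurOnGpu(const std::vector<unsigned char>& img,
                                     int width, int height)
{
    std::vector<unsigned char> result(img.size());
    unsigned char* dIn = nullptr;
    unsigned char* dOut = nullptr;
    cudaMalloc(&dIn, img.size());
    cudaMalloc(&dOut, img.size());
    cudaMemcpy(dIn, img.data(), img.size(), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    blurGlobal<<<grid, block>>>(dIn, dOut, width, height);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(result.data(), dOut, result.size(), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because out-of-bounds neighbors count as zero while the divisor stays at the full window size, a uniform image keeps its value in the interior and gets darker along the border.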

Results and Graphs

The test images are checkerboard images of different sizes.
• 1.ppm = 800x600
• 2.ppm = 1600x1200
• 3.ppm = 2400x1800
• 4.ppm = 3200x2400
• 5.ppm = 4000x3000
• 6.ppm = 4800x3600
• 7.ppm = 5600x4200
• 8.ppm = 6400x4800
• 9.ppm = 7200x5400
• 10.ppm = 8000x6000

In the graph above, the kernel block size is 16x16 and the blur size is 1, so the blur window spans three rows (Row-1, Row, Row+1). Both the shared and unshared memory implementations finish in similar times for the first three images. After 3.ppm, the gap widens: the process time of the unshared memory approach grows much faster than that of the shared memory approach. The unshared memory time increases by almost 2x with each image, while the shared memory time increases by almost 1.7x. The shared memory version also starts from a smaller computation time, so its total time stays lower. On average, the shared memory approach is about 10 times faster than the unshared memory approach.
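
One plausible reason for the gap is the number of global memory reads. With a 16x16 block and blur size 1, an unshared kernel issues 256 x 9 = 2304 global loads per block, while a tiled kernel loads an 18x18 = 324 pixel tile once and reuses it from shared memory. The sketch below illustrates that tiling idea under stated assumptions: blurShared, blurSharedOnGpu, TILE and BLUR_SIZE are hypothetical names, the image is a single 8-bit channel, the kernel must be launched with TILE x TILE threads per block, and error checking is omitted. It is not the repository's kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLUR_SIZE = 1;                  // hypothetical box radius
constexpr int TILE = 16;                      // threads per block side
constexpr int PADDED = TILE + 2 * BLUR_SIZE;  // tile plus halo on each side

__global__ void blurShared(const unsigned char* in, unsigned char* out,
                           int width, int height)
{
    __shared__ unsigned char tile[PADDED][PADDED];

    // Image coordinates of the top-left corner of the padded tile.
    int originCol = blockIdx.x * TILE - BLUR_SIZE;
    int originRow = blockIdx.y * TILE - BLUR_SIZE;

    // Cooperative load: the block's threads stride over the padded tile.
    int tid = threadIdx.y * TILE + threadIdx.x;
    for (int i = tid; i < PADDED * PADDED; i += TILE * TILE) {
        int ty = i / PADDED;
        int tx = i % PADDED;
        int r = originRow + ty;
        int c = originCol + tx;
        bool inside = (r >= 0 && r < height && c >= 0 && c < width);
        // Pixels outside the image are stored as zero.
        tile[ty][tx] = inside ? in[r * width + c] : 0;
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    if (col >= width || row >= height) return;

    // The window for this thread starts at its own index in the padded tile.
    int sum = 0;
    for (int dy = 0; dy <= 2 * BLUR_SIZE; ++dy)
        for (int dx = 0; dx <= 2 * BLUR_SIZE; ++dx)
            sum += tile[threadIdx.y + dy][threadIdx.x + dx];

    const int window = (2 * BLUR_SIZE + 1) * (2 * BLUR_SIZE + 1);
    out[row * width + col] = static_cast<unsigned char>(sum / window);
}

// Hypothetical host wrapper for the tiled kernel.
std::vector<unsigned char> blurSharedOnGpu(const std::vector<unsigned char>& img,
                                           int width, int height)
{
    std::vector<unsigned char> result(img.size());
    unsigned char* dIn = nullptr;
    unsigned char* dOut = nullptr;
    cudaMalloc(&dIn, img.size());
    cudaMalloc(&dOut, img.size());
    cudaMemcpy(dIn, img.data(), img.size(), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    blurShared<<<grid, block>>>(dIn, dOut, width, height);

    cudaMemcpy(result.data(), dOut, result.size(), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The 2304 versus 324 figure counts load instructions only; hardware caches absorb part of the repeated reads in the unshared case, so the measured speedup need not match that ratio.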

In the graph above, the blur size is 1, so the blur window spans three rows (Row-1, Row, Row+1). Increasing the block size decreases the process time for both approaches, but the effect is stronger for the unshared memory approach. For unshared memory, going from 4x4 to 32x32 reduces the process time to about half of the 4x4 time. For shared memory, going from 4x4 to 32x32 reduces the process time by almost a third of the 4x4 time. For shared memory the improvement becomes very small after 8x8, and after 16x16 the time barely changes. Hence, increasing the block size has more effect on the unshared memory approach.
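
A block size sweep like this one can be measured with CUDA events. The sketch below is one way to do it, not the method used for the graph: gridFor, timeBlurMs and the compact blur3x3 kernel are hypothetical, the image is assumed to be a single 8-bit channel already resident on the device, and error checking is omitted. Note that a 4x4 block has only 16 threads, fewer than the 32 threads of a warp, so half of each warp's lanes sit idle, which is one likely contributor to the slow 4x4 results. A 32x32 block has 1024 threads, the per-block limit on the K80.

```cuda
#include <cuda_runtime.h>

// Compact 3x3 box blur used only as the workload being timed.
__global__ void blur3x3(const unsigned char* in, unsigned char* out,
                        int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int r = row + dy;
            int c = col + dx;
            if (r >= 0 && r < height && c >= 0 && c < width)
                sum += in[r * width + c];
        }
    out[row * width + col] = static_cast<unsigned char>(sum / 9);
}

// Number of blocks needed to cover the image, rounding up on each axis.
dim3 gridFor(int width, int height, int blockSide)
{
    return dim3((width + blockSide - 1) / blockSide,
                (height + blockSide - 1) / blockSide);
}

// Returns the kernel time in milliseconds for a blockSide x blockSide launch.
float timeBlurMs(const unsigned char* dIn, unsigned char* dOut,
                 int width, int height, int blockSide)
{
    dim3 block(blockSide, blockSide);
    dim3 grid = gridFor(width, height, blockSide);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    blur3x3<<<grid, block>>>(dIn, dOut, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

For an 800x600 image, gridFor gives 200x150 blocks at 4x4, 50x38 at 16x16 and 25x19 at 32x32. Calling timeBlurMs once per block side (4, 8, 16, 32) and averaging several runs gives one point per block size.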

In the graph above, the kernel block size is 16x16. Changing the blur size produces the largest absolute increase in the unshared memory approach, whose process time grows by almost 4x with the first change and then by almost 2.5x. The shared memory process time also grows by almost 4x with the first change, nearly the same as the unshared one, and then by almost 2.6x. The relative changes are therefore very close for both memory approaches, and we can say that changing the blur size scales both approaches by about the same factor.

nuwandda/cuda-image-blurring
