grayscale-conversion

grayscale conversion optimized with OpenMP, SIMD and CUDA


Report by Dennis Weggenmann and Xiang Rong Lin for the lecture "High Performance Computing" during the winter semester 21/22 at "Hochschule für Technik Stuttgart"

https://github.com/sspeiser/hpc-uebungen

Motivation

RGB to grayscale conversion is an embarrassingly parallel problem, so it is perfect for multithreaded programs. Each pixel can be calculated/converted independently.

GPU

Problems

  • memory transfer between host and device
  • when to start measuring time? Before the kernel launch, before the memcpy or before the allocation?
  • The first CUDA function call takes a lot of time. For benchmarking reasons I just call cudaFree(0) first (see the sketch below this list). Now the first "real" CUDA call (most of the time it is cudaMalloc) takes for example 0.023 seconds instead of 0.153 seconds.
  • cudaHostRegister was not able to pin enough memory, but after trying it 10 times it worked.
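A minimal sketch of the warm-up trick from the third point (device_grey and pixel_size follow the naming used later in this report):

// warm-up: force CUDA context creation before any timed call
cudaFree(0);

// from here on the first "real" CUDA call (usually cudaMalloc) no longer pays
// the context creation cost
unsigned char *device_grey;
cudaMalloc(&device_grey, sizeof(unsigned char) * pixel_size);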

Implementation

GPU Workflow

Host = CPU

Device = GPU

The GPU needs to access the data from main memory. The CPU instructs the GPU, the calculation is executed in parallel on the GPU cores and the result is copied back to main memory.

(Figure: GPU workflow; see also https://www.academia.edu/20415057/Parallel_Implementation_of_Grayscale_Conversion_in_Graphics_Hardware)

Thread

Each thread gets mapped to one CUDA core.

Blocks:

Threads are grouped into blocks.

Grids:

Blocks are grouped into a grid.

Each kernel launch creates one single grid.
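As an illustration, the launch configuration for the kernel below could be set up like this (one thread per pixel; the 32x32 block size is an example value, not necessarily the one used in greyscale.cu):

// 32x32 threads per block, enough blocks to cover the whole image
dim3 Block(32, 32);
dim3 Grid((columns + Block.x - 1) / Block.x, (rows + Block.y - 1) / Block.y);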

(Figure: kernel execution on the GPU, source: https://developer-blogs.nvidia.com/wp-content/uploads/2020/06/kernel-execution-on-gpu-1-625x438.png)

We have an input image which is loaded with stbi_load, which returns an unsigned char pointer. Then we allocate host memory for our grayscale image, into which the result will be copied.
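A sketch of this host-side setup (the file name is illustrative; greyscale.cu may differ in detail):

#include "stb_image.h"

int columns, rows, channels;
// load the image as 3-channel RGB; stbi_load returns the pixels as rgbrgb...
unsigned char *Image = stbi_load("input.png", &columns, &rows, &channels, 3);
int pixel_size = rows * columns;

// host buffer that the grayscale result will be copied into later
unsigned char *host_grey = (unsigned char *)malloc(sizeof(unsigned char) * pixel_size);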

Now we need to allocate memory for our RGB image on the device. We do this with cudaMalloc:

cudaMalloc(&device_rgb, sizeof(uchar3) * pixel_size);

We also allocate memory for our grey image on the device with

cudaMalloc(&device_grey, sizeof(unsigned char) * pixel_size);

With cudaMemcpy the image data is copied into the memory we allocated for our RGB image. We also have to pass the size of the copied data and the direction in which we copy:

cudaMemcpy(device_rgb, Image, sizeof(unsigned char) * pixel_size*3 , cudaMemcpyHostToDevice);

Now the data is in the GPU memory and we are able to launch our kernel.

We are on the Device now

A kernel function looks like a normal function but has the __global__ keyword before it. With the __global__ identifier we define a function that will run on the device.

ConvertToGrey<<<Grid, Block>>>(device_rgb, device_grey, rows, columns);

As parameters we pass our already allocated device_rgb and device_grey pointers as well as rows and columns, which are basically the height and width of our image. If we take a look at the kernel function, the first thing we see is this:

int index_x = threadIdx.x + blockIdx.x * blockDim.x;

threadIdx, blockIdx and blockDim are built-in CUDA variables that we can access when running on the device. threadIdx: thread index within the block. blockIdx: block index within the grid. blockDim: number of threads per block.

We want the unique grid index of a thread, because threadIdx is only unique within its own thread block. So we multiply the block index with the block dimension and add the thread index. We do the same for the y index, and then we have the current pixel location.
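For the y index accordingly:

int index_y = threadIdx.y + blockIdx.y * blockDim.y;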

The 1D coordinate in the greyscale image:

int output_offset = index_y * columns + index_x;

Now we write the result into the output image:

output[output_offset] = rgb.x * 0.299f + rgb.y * 0.587f + rgb.z * 0.114f;
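Putting the pieces together, the whole kernel could look roughly like this (the bounds check is an assumption that is needed when the grid is rounded up; greyscale.cu may differ in detail):

__global__ void ConvertToGrey(const uchar3 *input, unsigned char *output, int rows, int columns)
{
    int index_x = threadIdx.x + blockIdx.x * blockDim.x;
    int index_y = threadIdx.y + blockIdx.y * blockDim.y;
    if (index_x >= columns || index_y >= rows)
        return; // threads outside the image do nothing

    int output_offset = index_y * columns + index_x;
    uchar3 rgb = input[output_offset];
    output[output_offset] = rgb.x * 0.299f + rgb.y * 0.587f + rgb.z * 0.114f;
}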

We are back on the Host now

The kernel call is asynchronous, but in our case this doesn't bother us: we only use one stream, so cudaMemcpy waits until the GPU has finished. Now we copy the data back from the device to the host:

cudaMemcpy(host_grey, device_grey, sizeof(unsigned char) * pixel_size, cudaMemcpyDeviceToHost);

In the end we need to free the allocated memory on the device:

cudaFree(device_rgb);
cudaFree(device_grey);

Review

Looking at the performance it is interesting: for a 27000x6000 pixel image the calculation takes 0.007 seconds, but here is the catch. With nvprof or nvvp we can see that the GPU is busy with computing only 1.8% of the time. The rest is allocation and memory transfer.

| Allocation (s) | memcpy HtoD (s) | memcpy DtoH (s) | Kernel (s) |
| --- | --- | --- | --- |
| 0.018321 | 0.054598 | 0.048913 | 0.00754453 |

(Figure: nvvp profiling timeline)

The orange bars are all the called CUDA API functions like cudaMalloc. The blue bar is the kernel activity.

There is also a huge overhead between the duration of the cudaMemcpy call and the actual memcpy operation.

Memory on the GPU

The GPU has different kinds of memory. The biggest one is the global memory, located in the device DRAM. There are other types of memory (like local memory and constant memory) located in DRAM, but we don't need them here. For optimisation the on-chip memory is more interesting, especially the shared memory. It is very fast, so why didn't we use it to make the kernel function even faster? For this specific task there is no performance gain from using shared memory instead of global memory, because the kernel does not reuse any data, so the number of global memory reads stays the same. Also, the calculation only takes around 2% of the total time, so it is better to focus on the other 98%.

Memory transfer between GPU and CPU

There is no free lunch. In general, avoid memory transfer between device and host. It is recommended to copy the data to the device, then calculate on the device and then copy the data back,

like in greyscale.cu.

Pinned memory

Generally pinned memory is recommended if we want to overlap copy and compute. (Figure: pageable vs. pinned data transfer, source: https://developer-blogs.nvidia.com/wp-content/uploads/2012/12/pinned-1024x541.jpg)

Instead of malloc() we can use cudaMallocHost(), which allocates the data in pinned memory. I didn't find a way to directly allocate the input image in pinned memory through the stb_image.h functionality, but luckily CUDA offers cudaHostRegister, which pins memory that is already allocated. For host_grey, instead of allocating it with malloc we use cudaMallocHost, which allocates it in pinned memory.
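A minimal sketch of both variants (the cudaHostRegister flag is an assumption; the pinned-memory variant in the repository may differ):

// the grayscale output buffer is allocated directly in pinned memory
unsigned char *host_grey;
cudaMallocHost(&host_grey, sizeof(unsigned char) * pixel_size);

// the RGB input was already allocated by stbi_load, so we pin it afterwards
cudaHostRegister(Image, sizeof(unsigned char) * pixel_size * 3, cudaHostRegisterDefault);

// ... kernel launch and cudaMemcpy as before ...

// release the pinned allocations at the end
cudaHostUnregister(Image);
cudaFreeHost(host_grey);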

So the memory transfer between host and device should be faster.

(Figure: nvvp profiling timeline of the pinned-memory version)

Well, yes and no. The memcpy HtoD is faster compared to the non-pinned version (see greyscale.cu): 37 ms vs 55 ms, but the cudaMemcpy call still takes 100 ms (instead of 120 ms in the non-pinned version). The memcpy DtoH is also faster (12 ms vs 37 ms), as is its cudaMemcpy call (19.8 ms vs 58 ms).

So we get a minimal transfer speed-up, but the host allocation now takes longer (27 ms).

| Allocation (s) | memcpy HtoD (s) | memcpy DtoH (s) | Kernel (s) |
| --- | --- | --- | --- |
| 0.292566 | 0.032401 | 0.0121894 | 0.0075523 |

Final runtime with CUDA (allocation, memory transfer and kernel execution) in seconds, measured on a GTX 1060 6GB:

| greyscale | greyscalePinnedMemory | greyscaleV2 (using pinned memory too) |
| --- | --- | --- |
| 0.16000 | 0.16200 | 0.1100 |

Possible Solution and idea for future research

It would be interesting to see how a grayscale conversion performs on an M1 with its unified memory architecture. It could be faster since the GPU and CPU use a common pool of memory, so there should be no need to transfer the data. The GPU will let the CPU know when it is done.

CPU

Problem

  • the naive solution is single threaded
  • processors can calculate on 128/256 bits at once, but only a part of that is used in a single iteration
  • the data is in rgbrgbrgbrgb format, but rrrrggggbbbb is needed
  • the memory is not aligned, meaning that in order to read 8 bytes we may actually need to read 16 bytes

Solution attempt

Multithreaded

The easiest place to optimize is utilizing all cores of the CPU and thus converting the program into a multithreaded application. This is done with OpenMP by adding the pragma #pragma omp parallel for collapse(2) (see openmp_baseline.c). omp parallel for parallelizes the for loop, and collapse(2) collapses both loops so that both are parallelized. This gives a more than 6 times performance boost.
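A minimal sketch of this variant, assuming a rows x columns RGB image laid out as rgbrgb... in img and a single-channel output buffer grey (the variable names are illustrative; openmp_baseline.c may differ):

#pragma omp parallel for collapse(2)
for (int y = 0; y < rows; y++) {
    for (int x = 0; x < columns; x++) {
        int i = (y * columns + x) * 3;
        grey[y * columns + x] = (unsigned char)(img[i] * 0.299f + img[i + 1] * 0.587f + img[i + 2] * 0.114f);
    }
}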

In the next step the memory access can be optimized. Currently each thread calculates the grey value for a more or less random pixel, depending on how it is scheduled by OpenMP. For this it needs to load 3 unsigned chars, so 3 bytes, from memory. But a CPU preloads more data into the cache, anticipating that it will be needed. This behaviour can be exploited by having each thread operate on a contiguous section, thus using the data that is already in the CPU cache (see memory.c).
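The idea can be sketched like this: the collapse is dropped and each thread gets a contiguous block of rows (the exact partitioning in memory.c may differ):

// static scheduling assigns each thread one contiguous chunk of rows
#pragma omp parallel for schedule(static)
for (int y = 0; y < rows; y++) {
    for (int x = 0; x < columns; x++) {
        int i = (y * columns + x) * 3;
        grey[y * columns + x] = (unsigned char)(img[i] * 0.299f + img[i + 1] * 0.587f + img[i + 2] * 0.114f);
    }
}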

SIMD FMA

All references to intrinsic functions can be looked up here: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

The next place to optimize is utilizing the full register width of the CPU by using Single Instruction Multiple Data (SIMD). For example we could add 16 8-bit integers at once in a 128-bit register instead of only a single one, thus theoretically creating a 16 times speedup. Additionally one can use a dedicated arithmetic unit that multiplies two numbers and adds the result to an accumulator, known as a MAC unit (multiplier-accumulator). This takes the form of "fused multiply add" (FMA), which additionally only rounds at the end, thus combining two operations into one.

Here the problem of the data being in rgbrgbrgbrgb form appears for the first time. For FMA a whole register needs to be filled with only red, green or blue values, meaning we want the data in rrrrggggbbbb format. For now this problem is ignored by just setting the registers with the appropriate values, which comes with its own performance problems, because with the data spread out like this multiple reads may be necessary.

r_vector = _mm_set_ps(img[(i * channels)], img[(i + 1) * channels], img[(i + 2) * channels], img[(i + 3) * channels]);
g_vector = _mm_set_ps(img[(i * channels) + 1], img[(i + 1) * channels + 1], img[(i + 2) * channels + 1], img[(i + 3) * channels + 1]);
b_vector = _mm_set_ps(img[(i * channels) + 2], img[(i + 1) * channels + 2], img[(i + 2) * channels + 2], img[(i + 3) * channels + 2]);

With the data in the correct format the multiplication is very simple

gray_vector = _mm_setzero_ps();
gray_vector = _mm_fmadd_ps(r_vector, r_factor, gray_vector);
gray_vector = _mm_fmadd_ps(g_vector, g_factor, gray_vector);
gray_vector = _mm_fmadd_ps(b_vector, b_factor, gray_vector);
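The four float results then still have to be converted back to bytes and written to the output buffer grey; a sketch of how this could be done (memory_simd_fma.c may do the store differently; _mm_extract_epi32 requires SSE4.1):

// convert the 4 float gray values to 32-bit integers (with rounding)
__m128i gray_int = _mm_cvtps_epi32(gray_vector);
// _mm_set_ps fills from the highest element down, so element 3 holds pixel i
grey[i]     = (unsigned char)_mm_extract_epi32(gray_int, 3);
grey[i + 1] = (unsigned char)_mm_extract_epi32(gray_int, 2);
grey[i + 2] = (unsigned char)_mm_extract_epi32(gray_int, 1);
grey[i + 3] = (unsigned char)_mm_extract_epi32(gray_int, 0);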

For the full code see memory_simd_fma.c.

A problem with FMA is that the basic FMA instruction set only supports working with 32-bit and 64-bit floating point numbers. This means that with a 128-bit register a maximum of 4 pixels can be calculated at once.

SIMD SSE

This implementation is completely copied from a Stack Overflow post by Rotem: https://stackoverflow.com/a/57844027/13516981 The only modification made was making it compatible with pure C, since it was using C++ features (see memory_simd_sse.c).

It has 2 major optimization areas.

First it utilizes shuffle (_mm_shuffle_epi8), concat (_mm_alignr_epi8) and shift (_mm_slli_si128) functions to solve the problem of rearranging the bytes from rgbrgbrgbrgb to rrrrggggbbbb. For example one can group the bytes according to their color like this.

const __m128i shuffle_mask = _mm_set_epi8(9, 6, 3, 0, 11, 8, 5, 2, 10, 7, 4, 1, 9, 6, 3, 0);

__m128i r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0 = _mm_shuffle_epi8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, shuffle_mask);

Or like this

// The 12 is the amount of bytes to shift the result
__m128i b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4 = _mm_alignr_epi8(b7_g7_r7_b6_g6_r6_b5_g5, r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, 12);

If the bytes are not already at the start or the end of the register, they are shifted into place before concatenating, like this

// 8 bytes to the left
__m128i g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 8);
// 4 bytes to the right
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 4);

Additionally it sacrifices some accuracy by calculating the gray value with 16-bit integers instead of 32-bit floats. But the most important part for me is that it showed me how to work with the byte manipulation functions in a structured manner, by naming the variables according to the bytes they contain.

SIMD AVX

With this knowledge the next step is using the AVX instruction set, which operates on 256-bit registers instead of the 128-bit registers in SSE. In theory this allows processing twice the amount of pixels at once. Unfortunately many AVX functions behave slightly differently compared to their SSE counterparts in that they only operate within 128-bit lanes instead of across the whole 256-bit register. This means that it is not possible to shuffle a byte from the lower lane to the upper lane with _mm256_shuffle_epi8. So in a register with gA_rA_b9_g9_r9_b8_g8_r8_b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, where the lane split is between g5 and r5, it is not possible to group all red values, because r6, r7, r8 and r9 are in the upper lane whereas the other red values are in the lower lane.

Because of this, for AVX a different set of functions is used in order to group the bytes of each color. The central function is _mm256_blendv_epi8(__m256i a, __m256i b, __m256i mask), which allows combining parts of the first register with parts of the second one according to the mask.

__m256i g4_g3_g2_g1_g0_b4_b3_b2_b1_b0_r5_r4_r3_r2_r1_r0_b9_b8_b7_b6_b5_rA_r9_r8_r7_r6_r5_r4_r3_r2_r1_r0 =
    _mm256_blendv_epi8(
        g4_g3_g2_g1_g0_b4_b3_b2_b1_b0_r5_r4_r3_r2_r1_r0_b9_b8_b7_b6_b5_rA_r9_r8_r7_r6_gA_g9_g8_g7_g6_g5,
        b9_b8_b7_b6_b5_rA_r9_r8_r7_r6_gA_g9_g8_g7_g6_g5_g4_g3_g2_g1_g0_b4_b3_b2_b1_b0_r5_r4_r3_r2_r1_r0,
        _mm256_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /**/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 128, 128, 128, 128, 128, 128));

In this example we want to combine the group of red bytes r5_r4_r3_r2_r1_r0 of the second register with the group rA_r9_r8_r7_r6 in the first one. They are deliberately lined up so that in the first register, before the group rA_r9_r8_r7_r6, there is exactly enough space to fit the six bytes r5_r4_r3_r2_r1_r0. Accordingly the mask is set to use all values of the first register except the first 6 ones.

This setup is done through _mm256_shuffle_epi8 with which we can still shuffle inside the lanes in order to group the bytes. Afterwards we can use _mm256_permute2x128_si256(__m256i a, __m256i b, 1) in order to swap the lanes so that we can blend them. In some places _mm256_alignr_epi8 is also used. Like in SSE it concatenates registers, although it only works on 128-bit lanes.
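For illustration, the lane swap itself looks like this (reg is a placeholder name):

// swap the lower and upper 128-bit lanes of a 256-bit register
__m256i swapped = _mm256_permute2x128_si256(reg, reg, 1);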

For the exact steps see memory_simd_avx.c where the variable names reflect the outcome of a function.

Another change is aligning the memory of the input and output image. This allows using the aligned load and store functions instead of the unaligned ones (_mm256_load_si256 instead of _mm256_loadu_si256). With this we may avoid loading data that is spread across address boundaries and thus reduce memory transfer. The output is aligned by allocating the memory with aligned_alloc(32, size) instead of malloc(size). For the input image, the library used to load the image, stb_image.h, allows overriding the default malloc function. We do so by defining the following value before the include of that library.

#define STBI_MALLOC(sz)           aligned_alloc(32, sz)

Benchmarks

The following benchmarks were run with the 27000x6000 pixel test image on the AMD Ryzen 5 3600 (see the CPU comparison below for other CPUs).

Baseline

| time in s | megapixel per s |
| --- | --- |
| 2.739997 | 56.3852 |

openmp baseline

| thread number | time in s | megapixel per s |
| --- | --- | --- |
| 12 | 0.411790 | 375.1795 |
| 32 | 0.414154 | 373.0381 |
| 64 | 0.414555 | 372.6776 |
| 128 | 0.430195 | 359.1287 |

memory

| thread number | time in s | megapixel per s |
| --- | --- | --- |
| 32 | 0.032053 | 4820.0608 |
| 64 | 0.031559 | 4895.3717 |
| 128 | 0.030711 | 5030.6157 |
| 256 | 0.032755 | 4716.6270 |

FMA

| thread number | time in s | megapixel per s |
| --- | --- | --- |
| 12 | 0.035387 | 4365.8509 |
| 32 | 0.035855 | 4308.9137 |
| 64 | 0.035242 | 4383.8510 |
| 128 | 0.034176 | 4520.5836 |
| 256 | 0.035138 | 4396.8387 |

SSE

| thread number | time in s | megapixel per s |
| --- | --- | --- |
| 12 | 0.030364 | 5088.0302 |
| 32 | 0.030036 | 5143.7032 |
| 64 | 0.030248 | 5107.6352 |
| 128 | 0.030838 | 5009.9306 |
| 256 | 0.032062 | 4818.7077 |

AVX

| thread number | time in s | megapixel per s |
| --- | --- | --- |
| 12 | 0.030029 | 5144.9279 |
| 32 | 0.029775 | 5188.7483 |
| 64 | 0.030188 | 5117.7360 |
| 128 | 0.030685 | 5034.9111 |
| 256 | 0.032302 | 4782.7716 |

CPU comparison

| CPU | algorithm | thread number | time in s | megapixel per s |
| --- | --- | --- | --- | --- |
| AMD Ryzen 5 3600 (6 Core) | memory | 128 | 0.030711 | 5030.6157 |
| AMD Ryzen 5 3600 (6 Core) | simd_sse | 32 | 0.030036 | 5143.7032 |
| AMD Ryzen 5 3600 (6 Core) | simd_avx | 32 | 0.029775 | 5188.7483 |
| Intel Core i7-4710HQ (4 Core) | memory | 128 | 0.080105 | 1928.6639 |
| Intel Core i7-4710HQ (4 Core) | simd_sse | 128 | 0.056655 | 2726.9456 |
| Intel Core i7-4710HQ (4 Core) | simd_avx | 128 | 0.055232 | 2797.2128 |
| Intel Core i9-9880H (8 Core) | memory | 128 | 0.038570 | 4005.5441 |
| Intel Core i9-9880H (8 Core) | simd_sse | 64 | 0.041884 | 3688.6061 |
| Intel Core i9-9880H (8 Core) | simd_avx | 128 | 0.027279 | 5663.5644 |

Review

A review of the AVX variant

Memory Bottleneck

The memory access should not be the bottleneck. There are several indicators for this.

The first one is that the Intel Core i9-9880H performs better than the AMD Ryzen 5 3600, even though it has a memory bandwidth of only 39.74 GiB/s compared to 47.68 GiB/s.

The second one is that aligning the memory in the AVX step did not change the performance in any noticeable way. One can test this by replacing _mm256_load_si256 with _mm256_loadu_si256 and _mm256_store_si256 with _mm256_storeu_si256 and reverting the changes that align the memory.

Lastly, a calculation of the theoretical amount of transferred data shows that we are below the available bandwidth.

pixel = 27000*6000 = 162000000
duration = 30ms = 0.03s
pixel_per_iteration = 32
bytes_per_pixel = 3
iterations = pixel / pixel_per_iteration = 5062500
bytes_read_per_iteration = 128 // it is not bytes_per_pixel*pixel_per_iteration=3*32=96, because we are effectively reading so many bytes in this sequence: 32 - 16 - 32 - 16. But because each read is 32-bytes, it results in 4*32=128 bytes read. Because the CPU address width is 8-byte there are no considerations here with 32-byte aligned memory.
bytes_read = bytes_read_per_iteration * iterations = 128 * 5062500  = 648000000
bytes_written = pixel * bytes_per_pixel = 162000000 * 3 = 486000000
bytes_transferred = bytes_read + bytes_written = 648000000 + 486000000 = 1134000000
transfer_rate = bytes_transferred / duration = 1134000000 byte / 0.03 s = 37800000000 B/s = 35.20 GiB/s

Latency and Throughput

Each SIMD function has a different latency and throughput. Latency means how many clock cycles it takes until the result of the operation is available, and throughput means how much of a clock cycle the operation occupies. Take _mm256_load_si256 as an example on the Haswell architecture: it has a latency of 1 and a throughput of 0.25, which means four independent loads can be issued in a single clock cycle. This is a source of optimization that was not exploited here; it could also be the source of the different benchmark results of the algorithms between the CPUs. Instead, functions were selected purely by whether they could arrange the bytes in the way that was needed. The linked intrinsics guide is for Intel, so an equivalent one for AMD would need to be found first.
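As an illustration of the kind of optimization this would allow, one could keep two independent accumulators so that consecutive FMA operations do not wait on each other (purely a sketch; none of the repository variants do this, and the *_vector0/*_vector1 names stand for a hypothetical second set of loaded pixels):

// two independent accumulator chains hide the latency of the fmadd instructions
__m128 gray_a = _mm_setzero_ps();
__m128 gray_b = _mm_setzero_ps();
gray_a = _mm_fmadd_ps(r_vector0, r_factor, gray_a);
gray_b = _mm_fmadd_ps(r_vector1, r_factor, gray_b);
gray_a = _mm_fmadd_ps(g_vector0, g_factor, gray_a);
gray_b = _mm_fmadd_ps(g_vector1, g_factor, gray_b);
gray_a = _mm_fmadd_ps(b_vector0, b_factor, gray_a);
gray_b = _mm_fmadd_ps(b_vector1, b_factor, gray_b);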

Number of Threads

This is something that can be further investigated. Currently the thread counts were just found through trial and error without putting much thought behind them.

Conclusion

Grayscale conversion is not well suited to be computed on the GPU, because 4 bytes of data have to be transferred for each pixel while only 5 floating point operations are needed to convert it to grayscale. The CPU on the other hand is very well suited, with the easiest performance improvements being parallelizing the loop and grouping the memory access. This can be done easily because each calculation is independent from the others. SIMD can be used to improve it further, but only by small margins and at the cost of a huge developer overhead.