CUDA_BY_EXAMPLE

  • The textbook: Jason Sanders and Edward Kandrot, 'CUDA by Example: An Introduction to General-Purpose GPU Programming', 1st Edition, 2011

  • The major reference: NVIDIA CUDA Toolkit Documentation v11.7.0

  • The environment

    CUDA version: CUDA 11.7
    OS: Windows 10, 64-bit
    GPU: NVIDIA GeForce GTX 1060 6GB

Unfortunately, all comments and descriptions in the source code are written in Korean to make studying more efficient.

Contents

00_start from Jun 27, 2022
01_Host Code and Device Code
02_Allocating and Using Device Memory
03_Query of Device Properties
04_Use Device Properties_Find appropriate CUDA device
05_Parallel programming by CUDA: Vector Sum
06_Parallel programming by CUDA_Julia Set
07_Multi thread by CUDA_Vector Sum
08_Multi thread by CUDA_Ripple
09_Shared Memory_Vector dot product
10_Constant Memory and Time recording



00_start from June 27 2022

Install CUDA and set up the development environment.


01_Host Code and Device Code

  • keywords: host code, device code, __global__

1. What is host code and device code?

  • Host code is the code that executes on the CPU.
  • Device code is the code that executes on the GPU for parallel programming.

2. What is the __global__ keyword?

__global__ void kernel(void){}
int main(void) {
	kernel<<<1, 1>>>();
	return 0;
}
  • Functions declared with __global__ are handed to the compiler that handles 'device code', such as NVCC (the NVIDIA CUDA Compiler).
  • The parameters in "<<< >>>" are not passed into the function. They are passed to the CUDA Runtime and affect how the CUDA Runtime launches the device code. (More detail is in '05_Parallel programming by CUDA: Vector Sum'.)
  • The parameters in "()" are passed into the function, just as in ordinary C/C++ (see the sketch below).
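
  For instance, a minimal sketch (the kernel name and argument values here are invented for illustration) showing the launch configuration in the angle brackets and ordinary arguments in the parentheses:

#include <cuda_runtime.h>

// hypothetical kernel: its parameters arrive by value, like a normal C function call
__global__ void kernel(int a, float b) {
	// device code may use a and b here
}

int main(void) {
	kernel<<<1, 1>>>(7, 3.14f);   // <<<1, 1>>> goes to the CUDA Runtime; (7, 3.14f) go to the kernel
	cudaDeviceSynchronize();      // wait until the kernel has finished
	return 0;
}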



02_Allocating and Using Device Memory

  • keywords: cudaError, malloc, cudaMalloc, free, cudaFree, memcpy, cudaMemcpy

1. enum cudaError

  • The enum whose values identify CUDA error types; every CUDA Runtime call returns one of them (a small checking sketch follows the list below).

    0: cudaSuccess: The API call returned with no errors.
    1: cudaErrorInvalidValue: This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.
    2: cudaErrorMemoryAllocation: The API call failed because it was unable to allocate enough memory to perform the requested operation.

    etc...
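
  A minimal sketch of the usual checking pattern (the CHECK macro is a hypothetical helper for these notes, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// hypothetical helper: print the readable message for a cudaError_t and stop on failure
#define CHECK(call)                                                  \
	do {                                                             \
		cudaError_t err = (call);                                    \
		if (err != cudaSuccess) {                                    \
			fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
			        cudaGetErrorString(err));                        \
			exit(EXIT_FAILURE);                                      \
		}                                                            \
	} while (0)

int main(void) {
	int* dev_p = nullptr;
	CHECK(cudaMalloc((void**)&dev_p, sizeof(int)));   // returns cudaSuccess or an error code
	CHECK(cudaFree(dev_p));
	return 0;
}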


2. __host____device__ cudaError_t cudaMalloc(void** devPtr, size_t size)

int* dev_c;
cudaMalloc((void**)&dev_c, sizeof(int));
  • cudaMalloc(): the CUDA Runtime allocates size bytes of device memory and returns its address in dev_c.
  • Host code must never dereference a pointer to device memory to read or write it; it may only move the pointer, do arithmetic with it, or cast it to another type. Device code, however, can dereference it.
  • A pointer allocated by cudaMalloc() can be passed to functions that execute on either the device or the host.

3. __host__ cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)

int c;
int* dev_c;
cudaMalloc((void**)&dev_c, sizeof(int));
cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
  • cudaMemcpy(): copies count bytes from the memory area pointed to by src to the memory area pointed to by dst.
  • cudaMemcpyKind is an enum that specifies the direction of the copy.

    0: cudaMemcpyHostToHost // It is equal to call memcpy() of C in host.
    1: cudaMemcpyHostToDevice
    2: cudaMemcpyDeviceToHost
    3: cudaMemcpyDeviceToDevice
    4: cudaMemcpyDefault: Direction of the transfer is inferred from the pointer values. Requires unified virtual addressing.

  • Host code can use the device variable dev_c by copying it back with cudaMemcpy().

4. __host____device__ cudaError_t cudaFree(void* devPtr)

cudaFree(dev_c);
  • Device memory allocated by cudaMalloc() must be deallocated with cudaFree(), not C's free(). (A complete round trip combining these calls is sketched below.)
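
  Putting the three calls together, a minimal round trip might look like this (the kernel set_to_42 is an invented placeholder for whatever device code fills the memory):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void set_to_42(int* out) {
	*out = 42;                          // device code writes into device memory
}

int main(void) {
	int c = 0;
	int* dev_c = nullptr;

	cudaMalloc((void**)&dev_c, sizeof(int));                      // allocate on the device
	set_to_42<<<1, 1>>>(dev_c);                                   // produce a value on the device
	cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);   // copy it back to the host
	cudaFree(dev_c);                                              // release the device memory

	printf("c = %d\n", c);
	return 0;
}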



03_Query of Device Properties

  • keywords: cudaDeviceProp, cudaGetDeviceCount, cudaGetDeviceProperties

1. cudaDeviceProp

  • cudaDeviceProp is a struct that holds the properties of a CUDA device: its name, memory sizes, compute capability, and the various limits queried in later chapters.

2. __host____device__ cudaError_t cudaGetDeviceCount(int* count)

cudaDeviceProp prop;
int count;
cudaGetDeviceCount(&count);
  • cudaGetDeviceCount() returns, in *count, the number of devices with compute capability greater than or equal to 2.0.

3. __host__ cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int device)

cudaDeviceProp prop;
int count;
cudaGetDeviceCount(&count);
for (int i = 0; i < count; i++)
{
	cudaGetDeviceProperties(&prop, i);
	// query //
}
  • cudaGetDeviceProperties() fills *prop with the cudaDeviceProp of the device whose index is device.
  • Using the count from cudaGetDeviceCount(), we can loop over all devices and query each one's properties in turn (a small sketch follows).
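
  A small sketch of such a query loop (only a few of the many cudaDeviceProp fields are printed):

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
	int count = 0;
	cudaGetDeviceCount(&count);

	for (int i = 0; i < count; i++) {
		cudaDeviceProp prop;
		cudaGetDeviceProperties(&prop, i);
		printf("Device %d: %s\n", i, prop.name);
		printf("  compute capability : %d.%d\n", prop.major, prop.minor);
		printf("  global memory      : %zu bytes\n", prop.totalGlobalMem);
		printf("  max threads/block  : %d\n", prop.maxThreadsPerBlock);
	}
	return 0;
}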



04_Use Device Properties_Find appropriate CUDA device

  • keywords: cudaGetDevice, cudaSetDevice, cudaChooseDevice

1. __host____device__ cudaError_t cudaGetDevice(int* device), __host__ cudaError_t cudaSetDevice(int dev)

int dev;
cudaGetDevice(&dev);
cudaSetDevice(dev);
  • cudaGetDevice() returns, in *device, the device on which the active host thread executes device code.
  • cudaSetDevice() sets the device to be used for GPU execution.

2. __host__ cudaError_t cudaChooseDevice(int* device, const cudaDeviceProp* prop)

cudaDeviceProp prop;
int dev;
memset(&prop, 0, sizeof(cudaDeviceProp));
prop.major = 1;
prop.minor = 3;
cudaChooseDevice(&dev, &prop);
  • cudaChooseDevice() asks the CUDA Runtime to return, in *device, the compute device whose properties most closely match the values set in *prop.
  • This lets us find a device that fits our needs without looping over every device ourselves (a small sketch follows).
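
  A minimal sketch tying the two calls together: ask for a device with at least compute capability 1.3 (the book's example requirement) and make it the active device:

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main(void) {
	cudaDeviceProp prop;
	int dev;

	memset(&prop, 0, sizeof(cudaDeviceProp));
	prop.major = 1;                  // desired compute capability: 1.3 or better
	prop.minor = 3;

	cudaChooseDevice(&dev, &prop);   // the runtime picks the closest matching device
	printf("closest matching device id: %d\n", dev);

	cudaSetDevice(dev);              // subsequent CUDA calls run on this device
	return 0;
}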



05_Parallel programming by CUDA: Vector Sum

  • keywords: device function, grid, block, thread, parallel programming, vector sum

// Single-Core Vector Sum
void add(int* a, int* b, int* c) {
	int tid = 0;	// 0th CPU
	while (tid < N) {
		c[tid] = a[tid] + b[tid];
		tid++;		// only one CPU -> increment by one each iteration
	}
}

  The code above is a vector-sum function that runs only on the CPU. If the CPU has two cores, we can use both of them, as in the code below.

// the part of code, Dual-Core Vector Sum
void add(int* a, int* b, int* c) {
	int tid = 0;  // core 0 starts at 0; the other core initializes 'tid' to 1
	while (tid < N) {
		c[tid] = a[tid] + b[tid];
		tid += 2;		// two cores -> each core advances by two
	}
}

   But the code above is still not enough on its own. You have to arrange for the function add to be launched on each core, which means running it in multiple processes or threads. Moreover, you have to watch for race conditions, because scheduling is non-deterministic from the programmer's point of view, and for deadlock, because the three arrays a, b, and c are shared data. (Strictly speaking, there is no write conflict in this particular code, since each core touches a disjoint set of indices.)

1. Device function

  The code below is the function 'add' written as a kernel function.

// Kernel function of Vector Sum using 1 grid that has N blocks
__global__ void add(int* a, int* b, int* c) {
	int tid = blockIdx.x;	// what index this block has
	
	// compute the data of this index, if the block is allocated.
	if(tid < N) c[tid] = a[tid] + b[tid];
}

int main(void) {
	///...///
	add <<<N, 1 >>> (dev_a, dev_b, dev_c);
	///...///
}

kernel<<<number of blocks, number of threads per block>>>()   The first parameter inside the triple angle brackets is how many blocks will be used. Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional unique index, accessible within the kernel through the built-in blockIdx variable; that is why the kernel add above uses blockIdx.x. The second parameter is how many threads run in each block.
  All blocks run in parallel, so there are N copies of add running, one in each block.

Fig 1. Grid of Thread Blocks, CUDA Toolkit


2. Point of Caution

  1. Note the if clause in the kernel add above: it checks whether this block index actually has an element assigned to it. Without that check we might access memory that was never allocated.
  2. Keep the grid, block, thread, and memory attributes of cudaDeviceProp in mind. In particular, never let a grid dimension exceed cudaDeviceProp.maxGridSize.

3. The whole process

  Converting the host function into a device function is not the whole story. In some cases we also have to copy data back and forth between the host and the device.
  If you construct and fill the arrays a and b on the CPU, the full flow is (sketched in code after the list):

  1. Allocate device memory for a, b and c.
  2. Copy the data of a and b from host to device.
  3. Compute the vector sum by calling device function add.
  4. Copy the result data, c, from device to host.
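
  A compressed sketch of that flow (error checking omitted; N = 10 is just an example size):

#include <cuda_runtime.h>

#define N 10

__global__ void add(int* a, int* b, int* c) {
	int tid = blockIdx.x;                         // one block per element
	if (tid < N) c[tid] = a[tid] + b[tid];
}

int main(void) {
	int a[N], b[N], c[N];
	int *dev_a, *dev_b, *dev_c;

	for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * i; }   // fill the inputs on the host

	// 1. allocate device memory
	cudaMalloc((void**)&dev_a, N * sizeof(int));
	cudaMalloc((void**)&dev_b, N * sizeof(int));
	cudaMalloc((void**)&dev_c, N * sizeof(int));

	// 2. copy the inputs host -> device
	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

	// 3. launch N blocks of one thread each
	add<<<N, 1>>>(dev_a, dev_b, dev_c);

	// 4. copy the result device -> host
	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

	cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
	return 0;
}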



06_Parallel programming by CUDA_Julia Set

  • keywords: device function, grid, block, thread, parallel programming, Julia Set

Fig 2. The output of the Julia Set GPU application example

   The difference between the Julia Set CPU application and the Julia Set GPU application is the same kind of change as in the previous example, Vector Sum. So here we only talk about passing the number of blocks as a grid.

__global__ void kernel(unsigned char* ptr) {
	int x = blockIdx.x;
	int y = blockIdx.y;
	int offset = x + y * gridDim.x;
	
	int juliaValue = julia(x, y);
	ptr[offset * 4 + 0] = 255 * juliaValue;
	ptr[offset * 4 + 1] = 0;
	ptr[offset * 4 + 2] = 0;
	ptr[offset * 4 + 3] = 255;
}

#define DIM 1000

int main(void) {
	///...///
	dim3 grid(DIM, DIM);
	kernel <<< grid, 1 >>> (dev_bitmap);
	///...///
}

  You can see that the first launch parameter passed to kernel is of type dim3. dim3 is an integer vector type based on uint3 that is used to specify dimensions; it can hold up to three dimensions. Although the textbook states that three-dimensional grids are not supported (current CUDA versions do support them), the CUDA Runtime expects this type to be passed. Any dimension you do not pass is initialized to 1 (a tiny check of this default follows the list below).
  In the code above, the kernel runs over a two-dimensional grid because the dim3 variable is initialized with two parameters. Each block corresponds to one position of the image, and the kernel of each block runs exactly once for that block position = image position.
  • gridDim: a built-in dim3 variable that contains the dimensions of the grid.
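
  A tiny check of the default-to-1 rule mentioned above:

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
	dim3 grid(1000, 1000);   // z is not given, so grid.z is initialized to 1
	printf("grid = (%u, %u, %u)\n", grid.x, grid.y, grid.z);   // prints: grid = (1000, 1000, 1)
	return 0;
}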



07_Multi thread by CUDA_Vector Sum

  • keywords: multi-thread, multi-core, device function, grid, block, thread, parallel programming, vector sum

1. Limit: The number of threads per block

  In the previous chapter (05_Parallel programming by CUDA: Vector Sum) we used multiple blocks for parallel programming. However, we can use multiple threads for the same thing. The pros and cons will be discussed later.

// Case 1: One block & N threads
__global__ void add(int* a, int* b, int* c)
{
	int tid = threadIdx.x;
	if (tid < N)
		c[tid] = a[tid] + b[tid];
}
///...///
add<<<1, N>>>(dev_a, dev_b, dev_c);

  We know that the second launch parameter is how many threads are in one block, so the code above runs in one block with N threads. But there is a limit on the number of threads per block, which we can find by querying cudaDeviceProp.maxThreadsPerBlock. If we use more threads per block than the limit, the code cannot run (a quick query sketch follows).
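
  A quick sketch of querying that limit (and the grid-size limit used in the next subsection) for device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
	printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
	printf("max grid size (x)     : %d\n", prop.maxGridSize[0]);
	return 0;
}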

// Case 2: Multi blocks & Multi threads - Consider the limit of the number of threads
__global__ void add(int* a, int* b, int* c)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	if (tid < N)
		c[tid] = a[tid] + b[tid];
}
///...///
add<<<(N+127)/128, 128>>>(dev_a, dev_b, dev_c);

  The code above runs with (N+127)/128 blocks and 128 threads per block. The limit on the number of threads per block is much smaller than the limit on the grid size, so we fix the number of threads per block and use more blocks instead. To use an appropriate number of blocks, we have to round the division up; otherwise the total number of threads could be less than the number we need.

Example: we need N = 130 threads and use 128 threads per block.
> (1) N / 128 = 1 => we would use one block, so the total number of threads is only 128, which is not enough.
> (2) (N + 127) / 128 = 2 => we use two blocks, so the total number of threads is 256, which is enough.

  With multiple blocks and threads, computing each thread's index takes a little more work. blockDim is a built-in constant that stores the number of threads in each block. (In the book's CUDA version the grid was limited to two dimensions while a block can have three; current CUDA also supports three-dimensional grids.) Since we give the number of threads per block as one dimension here, we can picture the relation between blocks and threads as a matrix: block indices are the row indices and thread indices are the column indices, which is exactly what tid = blockIdx.x * blockDim.x + threadIdx.x computes.

2. Limit: The number of threads per block & The grid size

   But we already know there is also a limit on the grid size. If (N+127)/128 is greater than the maximum grid dimension, the code above does not work.

__global__ void add(int* a, int* b, int* c)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	while (tid < N)
	{
		c[tid] = a[tid] + b[tid];
		tid += blockDim.x * gridDim.x;
	}
}
///...///
add<<<128, 128>>>(dev_a, dev_b, dev_c);

  As in the code above, we can treat the GPU's threads like a fixed set of cores: here only 128 blocks with 128 threads per block, and each thread strides through the data. These constants can be changed as long as they stay within the limits already discussed. What remains to check is only whether the arrays storing the vectors fit in the device's (GPU's) global memory, cudaDeviceProp.totalGlobalMem.



08_Multi thread by CUDA_Ripple

  • keywords: multi-thread, grid, block, thread, parallel programming, ripple animation

Fig 3. The output of the ripple animation example

   In this chapter we use the provided library, cpu_anim, which handles the operations needed for animation. The main flow is simple and matches what we have already studied. The core practices of this chapter are (1) computing an image by passing the number of blocks and the number of threads per block as two-dimensional dim3 parameters, and (2) what to keep in mind when producing an animation.

struct DataBlock{
	unsigned char *dev_bitmap;
	CPUAnimBitmap *bitmap;
};
///...///
int main(void){
	DataBlock data;
	CPUAnimBitmap bitmap(DIM, DIM, &data);
	data.bitmap = &bitmap;
	cudaMalloc((void**)&data.dev_bitmap, bitmap.image_size());
	bitmap.anim_and_exit((void(*)(void*, int)) generate_frame, (void(*)(void*)) cleanup);
}

   DIM is the width (and height) of the image, in pixels. First, allocate linear device memory (an unsigned char array) large enough for the two-dimensional image. bitmap.anim_and_exit is simply the library function that calls generate_frame once per frame. (cleanup is just the function that deallocates the device memory.)

1. Two dimensional blocks and threads

void generate_frame(DataBlock *d, int ticks) {
	dim3 blocks(DIM / 16, DIM / 16);
	dim3 threads(16, 16);
	kernel<<<blocks, threads>>> (d->dev_bitmap, ticks);
	cudaMemcpy(d->bitmap->get_ptr(), d->dev_bitmap, d->bitmap->image_size(), cudaMemcpyDeviceToHost);
}

  The function generate_frame is called once per frame. kernel is simply the function that computes the color of each pixel. Focus on the two variables blocks and threads: we pass 16 x 16 threads per block and DIM/16 x DIM/16 blocks to kernel, so the total number of threads is DIM * DIM. Thinking of the grid/block hierarchy this way makes the mapping easy: each thread computes the data of one pixel.

2. Compute the position of a thread

__global__ void kernel (unsigned char *ptr, int ticks) {
	// compute the position of pixel computed, using threadIdx and blockIdx.
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int offset = x + y * blockDim.x * gridDim.x;
	///Compute the color of the pixel.///
}

  The part of kernel that computes the color of a pixel is left out, since it only produces the ripple pattern. Focus on the three variables x, y, and offset: x and y are the thread's actual position within the matrix of all threads, and offset is the linear offset of that position. Since the device memory for the image was allocated as a linear array, we need to compute this linear offset.
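
  A tiny host-side sanity check of the same index arithmetic, on an invented 8x8 image:

#include <cstdio>

int main(void) {
	const int width = 8, height = 8;       // hypothetical tiny image
	for (int y = 0; y < height; y++) {
		for (int x = 0; x < width; x++) {
			int offset = x + y * width;    // same formula as in the kernel
			printf("%2d ", offset);
		}
		printf("\n");
	}
	return 0;
}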

Q. What is the parameter ticks?

A. ticks is the elapsed time of the animation. To compute the correct color of each pixel for the current frame, the device needs to know how much time has passed.



09_Shared Memory_Vector dot product

  • keywords: multi-thread, shared memory, race condition, synchronization, grid, block, thread, parallel programming, reduction, vector dot product, __syncthreads

1. Shared memory

__global__ void dot(float *a, float *b, float *c) {
	__shared__ float cache[threadsPerBlock];
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	int cacheIdx = threadIdx.x;
	// compute subsum of dot product
	float temp = 0;
	while(tid < N) {
		temp += a[tid] * b[tid];	// dot product: multiply corresponding elements, then accumulate
		tid += blockDim.x * gridDim.x;
	}
	cache[cacheIdx] = temp;
	__syncthreads();

	/// reduction ///
}

   In this example we compute the dot product (inner product) of two long vectors. The way the grid, blocks, and threads are constructed is the same as in the previous chapter, so the core topic of this chapter is shared memory. The CUDA C compiler treats variables declared with __shared__ specially: a __shared__ variable is memory shared among the threads of the same block. Its latency is much lower than that of other memories, since it resides physically on the GPU, not in DRAM.
  In this example, each thread of a block uses one element of the shared array cache. Each thread multiplies the vector elements assigned to it, accumulates those products, and stores its partial sum into cache.
  Afterwards, dot sums all partial sums to produce the final dot product (the reduction phase). But first we must be sure that every partial sum has already been computed. For that we use __syncthreads(): threads that have finished their computation wait until all threads of their block reach this barrier. To be exact, the code after __syncthreads() runs only after every thread of the block has executed all code before __syncthreads().

2. Reduction

  Reduction is a popular technique in parallel programming: it combines the results of many computations that finished in parallel. In this example, summing all partial sums naively would take time proportional to the number of threads, so the reduction phase itself also uses parallel programming.

__global__ void dot(float *a, float *b, float *c) {
	/// compute all parts of process in parallel ///
	int i = blockDim.x / 2;	// each step folds the upper half of cache onto the lower half
	while(i != 0) {
		if(cacheIdx < i)
			cache[cacheIdx] += cache[cacheIdx + i];
		__syncthreads();
		i /= 2;
	}
	if(cacheIdx == 0)
		c[blockIdx.x] = cache[0];
}

  The code above runs in log2(threads per block) time. In each stage, half of the threads of a block each add two partial sums inside the shared array cache, and the process repeats until only one value is left: the sum of all partial sums of that block. (The number of threads per block has to be a power of two.) Between stages, the threads using cache have to be synchronized. At the end of this phase, thread 0 stores the block's result into c[blockIdx.x], which the host reads back afterwards.
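
  A host-side sketch of the same halving pattern on a toy array of 8 partial sums, just to visualize the log2 steps (plain CPU code, not the kernel):

#include <cstdio>

int main(void) {
	float cache[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };   // pretend these are 8 partial sums

	for (int i = 8 / 2; i != 0; i /= 2) {          // same halving loop as the kernel
		for (int idx = 0; idx < i; idx++)          // in the kernel, each idx is a separate thread
			cache[idx] += cache[idx + i];
		// in the kernel, __syncthreads() sits here, between stages
	}
	printf("sum = %.1f\n", cache[0]);              // prints: sum = 36.0
	return 0;
}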

3. CAUTION: Wrong optimization - deadlock

  You might think that __syncthreads() needs to be called only by the threads that actually sum elements of cache, and revise the code above as below.

int i = blockDim.x;
while(i != 0) {
	if(cacheIdx < i) {
		cache[cacheIdx] += cache[cacheIdx + i];
		__syncthreads();
	}
	i /= 2;

  But this code causes a deadlock: the kernel hangs. Code after __syncthreads() proceeds only once every thread of the block has reached the barrier, but the threads whose index is outside the reduction range (threads that were still used when computing the per-element products) never call it, so the program stalls.



10_Constant Memory and Time recording

  • keywords: constant memory, cudaEvent, time recording, multi-thread, grid, block, thread, parallel programming, ray tracing

Fig 4. The output of the ray tracing example

1. Constant memory

__constant__ Sphere s[SPHERES];

  We can declare constant memory with the __constant__ keyword. NVIDIA hardware provides 64 KB of constant memory. Constant memory is read-only, and using it can reduce memory bandwidth, for the two reasons below (a short declaration/copy sketch follows them).

1. A single read of constant memory can serve up to 16 threads.
  When constant memory is read, NVIDIA hardware broadcasts the value to all threads of the half-warp of the thread that actually issued the read. Since a half-warp consists of 16 threads, the other threads that need the same value do not have to read that constant memory again, which reduces run time. However, because accesses to constant memory are serialized, run time can actually increase when threads of the same half-warp read different addresses: work that could have run in parallel is serialized.

2. Constant memory is served from a GPU cache.
  Since constant memory is never modified, the GPU caches it aggressively, so constant-memory reads hit the cache more often. This reduces the use of the memory bus.
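
  A minimal sketch of declaring constant memory and filling it from the host. cudaMemcpyToSymbol is the Runtime call used for __constant__ variables; the simplified Sphere struct here is only a stand-in for the book's, not its actual definition:

#include <cuda_runtime.h>

#define SPHERES 20

struct Sphere { float x, y, z, radius; };   // simplified stand-in for the book's Sphere

__constant__ Sphere s[SPHERES];             // resides in the 64 KB constant memory space

int main(void) {
	Sphere host_s[SPHERES] = {};            // fill the spheres on the host (left trivial here)
	// copy to the __constant__ symbol; note that no cudaMalloc is needed for constant memory
	cudaMemcpyToSymbol(s, host_s, sizeof(Sphere) * SPHERES);
	return 0;
}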

2. CUDA Event

  When we decide which program, code, or algorithm is better than another, we need some measurement, such as elapsed time. CUDA C provides an API of time events for this. We could use C libraries or an OS timer instead, but the measured time would not necessarily be precise: there are many variables such as OS scheduling and the resolution of the CPU timer, and, most importantly, the host may measure time without being synchronized with the GPU kernel. So we use the CUDA event API. A CUDA event is a GPU timestamp, recorded at a point the programmer specifies.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

///...run...///

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

  In general, the timing flow is 'create cudaEvents -> record -> (synchronize and measure) -> destroy cudaEvents', which is similar to allocating, using, and freeing memory. A cudaEvent_t acts like a marker.

__host__ cudaError_t cudaEventCreate ( cudaEvent_t* event ): creates an event object.
__host____device__ cudaError_t cudaEventRecord ( cudaEvent_t event, cudaStream_t stream = 0 ): records an event.
__host__ cudaError_t cudaEventSynchronize ( cudaEvent_t event ): waits for an event to complete.
__host__ cudaError_t cudaEventElapsedTime ( float* ms, cudaEvent_t start, cudaEvent_t end ): computes the elapsed time between two events and stores it (in ms) into the first parameter.
__host____device__ cudaError_t cudaEventDestroy ( cudaEvent_t event ): destroys an event object.