/vector_addition_cuda

Addition between arrays using a CUDA kernel

Primary LanguageCudaMIT LicenseMIT

Vector Addition (CUDA)

In this tutorial, we will look at a simple vector addition program, which is often used as the "Hello, World!" of GPU computing. We will assume an understanding of basic CUDA concepts, such as kernel functions and thread blocks. If you are not already familiar with such concepts, there are links at the bottom of this page that will show you where to begin.

Let's walk through the following CUDA C vector addition program:

#include <stdio.h>

// Size of array
#define N 1048576

// Kernel
__global__ void add_vectors(double *a, double *b, double *c)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if(id < N) c[id] = a[id] + b[id];
}

// Main program
int main()
{
    // Number of bytes to allocate for N doubles
    size_t bytes = N*sizeof(double);

    // Allocate memory for arrays A, B, and C on host
    double *A = (double*)malloc(bytes);
    double *B = (double*)malloc(bytes);
    double *C = (double*)malloc(bytes);

    // Allocate memory for arrays d_A, d_B, and d_C on device
    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Fill host arrays A and B
    for(int i=0; i<N; i++)
    {
        A[i] = 1.0;
        B[i] = 2.0;
    }

    // Copy data from host arrays A and B to device arrays d_A and d_B
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    // Set execution configuration parameters
    //      thr_per_blk: number of CUDA threads per grid block
    //      blk_in_grid: number of blocks in grid
    int thr_per_blk = 256;
    int blk_in_grid = ceil( float(N) / thr_per_blk );

    // Launch kernel
    add_vectors<<< blk_in_grid, thr_per_blk >>>(d_A, d_B, d_C);

    // Copy data from device array d_C to host array C
    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Verify results
    double tolerance = 1.0e-14;
    for(int i=0; i<N; i++)
    {
        if( fabs(C[i] - 3.0) > tolerance)
        {
            printf("\nError: value of C[%d] = %d instead of 3.0\n\n", i, C[i]);
            exit(1);
        }
    }

    // Free CPU memory
    free(A);
    free(B);
    free(C);

    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    printf("\n---------------------------\n");
    printf("__SUCCESS__\n");
    printf("---------------------------\n");
    printf("N                 = %d\n", N);
    printf("Threads Per Block = %d\n", thr_per_blk);
    printf("Blocks In Grid    = %d\n", blk_in_grid);
    printf("---------------------------\n\n");

    return 0;
}

We'll ignore the add_vectors "kernel" function for the moment and jump down to the main function. Here, we first determine the number of bytes of memory required for an array with N double precision elements:

    // Number of bytes to allocate for N doubles
    size_t bytes = N*sizeof(double);

Next, we allocate memory on the CPU for arrays A, B and C:

    // Allocate memory for arrays A, B, and C on host
    double *A = (double*)malloc(bytes);
    double *B = (double*)malloc(bytes);
    double *C = (double*)malloc(bytes);

Now we allocate memory on the GPU for arrays d_A, d_B, and d_C (these variables are prefixed with "d_" to distinguish them as "device variables", but they can be named differently):

    // Allocate memory for arrays d_A, d_B, and d_C on device
    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

Here, we fill the arrays A and B on the CPU:

    // Fill host arrays A and B
    for(int i=0; i<N; i++)
    {
        A[i] = 1.0;
        B[i] = 2.0;
    }

Now we set the execution configuration parameters in preparation for launching the kernel function on the GPU. The ceil funciton is needed to ensure there are enough blocks in the grid to cover all N elements of the arrays.

    // Set execution configuration parameters
    //      thr_per_blk: number of CUDA threads per grid block
    //      blk_in_grid: number of blocks in grid
    int thr_per_blk = 256;
    int blk_in_grid = ceil( float(N) / thr_per_blk );

Now we launch the kernel function on the GPU:

    // Launch kernel
    add_vectors<<< blk_in_grid, thr_per_blk >>>(d_A, d_B, d_C);

This causes execution to jump up to the add_vectors kernel function (defined before main). The variable id is used to define a unique thread ID among all threads in the grid. The if statement ensures that we do not perform an element-wise addition on an out-of-bounds array element. In this program, blk_in_grid equals 4096, but if thr_per_blk did not divide evenly into N, the ceil function would increase blk_in_grid by 1. For example, if N had 1 extra element, blk_in_grid would be 4097, which would mean a total of 4097 * 256 = 1048832 threads. So without the if statement, element-wise additions would be calculated for elements that we have not allocated memory for.

// Kernel
__global__ void add_vectors(double *a, double *b, double *c)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if(id < N) c[id] = a[id] + b[id];
}

Now that the results of the vection addition have been stored in array d_C on the GPU, we can copy the results back to the CPU copy of C:

    // Copy data from device array d_C to host array C
    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);

Now we check to make sure we received the correct results from the GPU:

    // Verify results
    double tolerance = 1.0e-14;
    for(int i=0; i<N; i++)
    {
        if( fabs(C[i] - 3.0) > tolerance)
        {
            printf("\nError: value of C[%d] = %d instead of 3.0\n\n", i, C[i]);
            exit(1);
        }
    }

And finally, we can free memory on the CPU:

    // Free CPU memory
    free(A);
    free(B);
    free(C);

And free the memory on the GPU:

    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

Compiling and Running on Summit

To run this program on Summit, you must first load the CUDA module:

$ module load cuda

Now compile the program:

$ make

Now submit the job using the submit.lsf batch script (make sure to change PROJ123 to a project you are associated with:

$ bsub submit.lsf

You can check the status of your job with the jobstat command. Once your job has completed, you can view the results in the output file named add_vec.JOBID

$ cat add_vec.JOBID

---------------------------
__SUCCESS__
---------------------------
N                 = 1048576
Threads Per Block = 256
Blocks In Grid    = 4096
---------------------------

...

Helpful Links

CUDA C Programming Model: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

Problems?

If you see a problem with the code or have suggestions to improve it, feel free to open an issue.