This repository provides a CUDA prefix sum implementation for arbitrary even size array.
Include cuda_prefix_sum function in your code
void cuda_prefix_sum(int *input_array,int *output_array,int count)
The args of the function are
- int pointer for elements to be summed
- int pointer to store prefix sum result. This pointer needs to be properly allocated
- count is the size of the array.
Make sure to update the following constants based on your GPU
- MAX_THREADS_PER_BLOCK
- MAX_NUM_GPU_BLOCKS
cuda_prefix_sum function can take any arbitrary even size array. If you have odd array just add one dummy element to the end.
The function passes tests with arrays of size up to 10241024128. Bigger arrays were tested but started to get out of memory error. So as long as you have enough memory the function works fine.
To test the code, simply make the executable using the makefile. Then run ./main.o command. The code includes eight tests
One thing I wish to change in this code is the first loop
for(int i=0,block=0;i<count;i+=max_elements_per_block,block++)
I need to read more about GPU shared memory before doing so. The current function implementation is sufficient for my usage, so I didn't improve it.
If someone have a better fix for this issue, you are welcomed :)