Due Date: 11/13 at midnight
Implement the following approaches and compare their performance:
1. CPU Reduction
   - take various operators (sum, product, min, max)
2. GPU Reduction
   - shared memory
   - multi-kernel, handle arrays > 2x shared mem size (> 2048)
3. GPU Reduction w/ less Thread Divergence
4. CPU Histogram
5. GPU Histogram - non-strided
6. GPU Histogram - strided
For both algorithms (reduction and histogram), collect timing data and analyze the results for increasing array sizes. That is, at what array size does each optimization pay off? You only need to compare within an algorithm (1 v 2 v 3, and 4 v 5 v 6), not across algorithms.
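One possible way to collect the timing data is a small host-side harness like the sketch below. The helper names `time_best_ms` and `doubling_sizes` are hypothetical, and taking the fastest of several repeats is just one reasonable convention. When timing GPU code this way, remember that kernel launches are asynchronous, so the timed callable needs to wait for the device (for example with `cudaDeviceSynchronize()`) before returning; CUDA events are an alternative.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: run `work` `repeats` times and return the fastest
// wall-clock time in milliseconds.
template <typename F>
double time_best_ms(F work, int repeats) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}

// Hypothetical helper: array sizes doubling from `first` while they stay
// at or below `last`. Returns an empty list if `first` is 0.
std::vector<std::size_t> doubling_sizes(std::size_t first, std::size_t last) {
    std::vector<std::size_t> sizes;
    for (std::size_t n = first; n != 0 && n <= last; n *= 2) {
        sizes.push_back(n);
        if (n > last / 2) {
            break;  // the next doubling would exceed `last` (or overflow)
        }
    }
    return sizes;
}
```

Looping over `doubling_sizes(1024, 1 << 24)` and recording `time_best_ms(...)` for each approach at each size yields one row per size for the performance table, and the crossover points can be read off the resulting graph.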
- What went well with this assignment?
- What was difficult?
- How would you approach it differently?
- Anything else you want me to know?
- [ ] Code (1-6)
- [ ] Performance (table, graph)
- [ ] Reflection
- [ ] Peer feedback
If you encounter errors, I recommend two approaches:
- Add print statements both inside and outside your kernel function. To catch any CUDA errors, this can include:
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
- Check that your code is not accessing unallocated memory by using NVIDIA's memory sanitizer tool. You can run it on keroppi using the following line:
  compute-sanitizer --tool memcheck ./your_cuda_executable_not_source
To update this assignment as changes are made, a new PR will be generated. You can find the tab here. On that page you can merge the pull request to get the updated instructions. This may involve rebasing or merging your contributions; reach out if you need help with this.
After the assignment is due, you will be randomly assigned to review another student's submission. The review takes place in a pull request (PR). One is already opened by default (PR #1), so you can leave your comments there. Look at the results, explanations, and code for understandability, readability, and style, and provide any constructive or positive insights. This peer feedback is due one week after the assignment closes.
- Calculate occupancy (see old calculators)
  - 2 v 3
  - 5 v 6
- Implement histogram on GPU without atomics; is it faster?
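For the occupancy item, the basic arithmetic can be sketched as below. Theoretical occupancy is the number of warps resident on a streaming multiprocessor (SM) divided by the maximum the SM supports. This sketch is deliberately simplified: the `SmLimits` struct and `theoretical_occupancy` function are hypothetical, the limits must be looked up for the actual GPU, and register usage and allocation granularity (which the calculators also account for) are ignored.

```cpp
#include <algorithm>

// Hypothetical description of one SM's limits. The real values depend on the
// GPU's compute capability and must be looked up for the target device.
struct SmLimits {
    int max_warps;         // maximum resident warps per SM
    int max_blocks;        // maximum resident blocks per SM
    int shared_mem_bytes;  // shared memory available per SM
    int warp_size;         // threads per warp (32 on NVIDIA GPUs)
};

// Simplified theoretical occupancy: resident warps / maximum warps.
// Assumes threads_per_block > 0; ignores register limits and allocation
// granularity, so treat the result as an upper-bound estimate.
double theoretical_occupancy(const SmLimits& sm, int threads_per_block,
                             int shared_bytes_per_block) {
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    // Blocks that fit under the warp limit and the block limit.
    int blocks = std::min(sm.max_blocks, sm.max_warps / warps_per_block);
    // Blocks that fit under the shared memory limit, if any is used.
    if (shared_bytes_per_block > 0) {
        blocks = std::min(blocks, sm.shared_mem_bytes / shared_bytes_per_block);
    }
    return static_cast<double>(blocks * warps_per_block) / sm.max_warps;
}
```

With illustrative (not device-specific) limits of 64 warps, 32 blocks, and 64 KiB of shared memory per SM, a 256-thread block using 8 KiB of shared memory reaches full occupancy, while the same block using 32 KiB drops to 25% because only two blocks fit.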