Initial investigation into using cuda.parallel.reduce_into in CuPy
Opened this issue · 2 comments
jrhemstad commented
Initial investigation into using cuda.parallel.reduce_into in CuPy
shwina commented
I did a bit of hacking CuPy to use cuda.parallel
to compute sum
.
- This was a relatively easy hack, thanks to CuPy's architecture: shwina/cupy@a694448
__cuda_array_interface__
made it trivial to pass CuPy arrays tocuda.parallel
.- Caching JIT builds is essential to achieving performance on par with existing CuPy code paths: #3001.
Benchmarking
The plot below shows the time taken to do array.sum()
for various input sizes, using three different configurations of CuPy:
- Baseline: this is just the current
main
CuPy branch, which uses custom, pre-built CUB code (which is problematic for the reasons outlined in #2958) - CuPy reduction: this is a custom reduction kernel that CuPy falls back to when CUB is disabled (can be done by setting the environment variable
CUPY_ACCELERATORS=""
cuda.parallel
: this is the version that usescuda.parallel.experimental.reduce_into()
to performsum()
Benchmarking script
import cupy as np
import timeit
# List of input sizes to benchmark
input_sizes = [10, 100, 1000, 10000, 100000, 1000000, 10_000_000, 100_000_000]
# Dictionary to store results (size, avg_time_with_first_run, avg_time_without_first_run)
results = {}
for size in input_sizes:
times = []
# Run the benchmark 10 times for each input size
for _ in range(10):
arr = np.random.rand(size) # Create a random array of the given size
# Record the start time using timeit.default_timer()
start_time = timeit.default_timer()
# Perform the sum operation
result = arr.sum()
print(result)
# Record the end time and calculate the time taken
end_time = timeit.default_timer()
# Calculate the time taken for this run and store it
time_taken = end_time - start_time
times.append(time_taken)
# Calculate the average time including the first run
avg_time_with_first_run = sum(times) / len(times)
# Calculate the average time excluding the first run
avg_time_without_first_run = sum(times[1:]) / (len(times) - 1)
# Store the results
results[size] = (avg_time_with_first_run, avg_time_without_first_run)
# Print out the results
print("Benchmark Results (input size, average time with first run, average time without first run):")
for size, (avg_with_first, avg_without_first) in results.items():
print(f"Input size: {size:10d} | Avg time with first run: {avg_with_first:.6f} seconds | Avg time without first run: {avg_without_first:.6f} seconds")
Things to note:
- For the largest problem sizes,
cuda.parallel
perf matches CuPy's pre-built CUB code. - For smaller problem sizes,
cuda.parallel
performs relatively poorly. There seems to be some fixed cost here that is independent of the problem size. - The custom reduction kernel that CuPy uses when CUB is disabled does poorly for larger problem sizes. I believe this is the definition of the kernel: https://github.com/cupy/cupy/blob/70c4dd33775ba8b4dcd743db8298ebff95a1931c/cupy/_core/_reduction.pyx#L54
shwina commented
Next steps:
- Try and identify and eliminate whatever is making us slower for smaller problem sizes.
- Replace all functionality in
cuda_cub.pyx
withcuda.parallel
algorithms. Study not only performance but also CuPy binary size changes.