NVIDIA/cccl

Initial investigation into using cuda.parallel.reduce_into in CuPy

Opened this issue · 2 comments

Initial investigation into using cuda.parallel.reduce_into in CuPy

I did a bit of hacking CuPy to use cuda.parallel to compute sum.

  • This was a relatively easy hack, thanks to CuPy's architecture: shwina/cupy@a694448
  • __cuda_array_interface__ made it trivial to pass CuPy arrays to cuda.parallel.
  • Caching JIT builds is essential to achieving performance on par with existing CuPy code paths: #3001.

Benchmarking

The plot below shows the time taken to do array.sum() for various input sizes, using three different configurations of CuPy:

  1. Baseline: this is just the current main CuPy branch, which uses custom, pre-built CUB code (which is problematic for the reasons outlined in #2958)
  2. CuPy reduction: this is a custom reduction kernel that CuPy falls back to when CUB is disabled (can be done by setting the environment variable CUPY_ACCELERATORS=""
  3. cuda.parallel: this is the version that uses cuda.parallel.experimental.reduce_into() to perform sum()
Benchmarking script
import cupy as  np
import timeit

# List of input sizes to benchmark
input_sizes = [10, 100, 1000, 10000, 100000, 1000000, 10_000_000, 100_000_000]

# Dictionary to store results (size, avg_time_with_first_run, avg_time_without_first_run)
results = {}

for size in input_sizes:
  times = []
  
  # Run the benchmark 10 times for each input size
  for _ in range(10):
      arr = np.random.rand(size)  # Create a random array of the given size
      
      # Record the start time using timeit.default_timer()
      start_time = timeit.default_timer()
      
      # Perform the sum operation
      result = arr.sum()
      print(result)
      
      # Record the end time and calculate the time taken
      end_time = timeit.default_timer()
      
      # Calculate the time taken for this run and store it
      time_taken = end_time - start_time
      times.append(time_taken)
  
  # Calculate the average time including the first run
  avg_time_with_first_run = sum(times) / len(times)
  
  # Calculate the average time excluding the first run
  avg_time_without_first_run = sum(times[1:]) / (len(times) - 1)
  
  # Store the results
  results[size] = (avg_time_with_first_run, avg_time_without_first_run)

# Print out the results
print("Benchmark Results (input size, average time with first run, average time without first run):")
for size, (avg_with_first, avg_without_first) in results.items():
  print(f"Input size: {size:10d} | Avg time with first run: {avg_with_first:.6f} seconds | Avg time without first run: {avg_without_first:.6f} seconds")

Image

Things to note:

  1. For the largest problem sizes, cuda.parallel perf matches CuPy's pre-built CUB code.
  2. For smaller problem sizes, cuda.parallel performs relatively poorly. There seems to be some fixed cost here that is independent of the problem size.
  3. The custom reduction kernel that CuPy uses when CUB is disabled does poorly for larger problem sizes. I believe this is the definition of the kernel: https://github.com/cupy/cupy/blob/70c4dd33775ba8b4dcd743db8298ebff95a1931c/cupy/_core/_reduction.pyx#L54

Next steps:

  1. Try and identify and eliminate whatever is making us slower for smaller problem sizes.
  2. Replace all functionality in cuda_cub.pyx with cuda.parallel algorithms. Study not only performance but also CuPy binary size changes.