HigherOrderCO/Bend

CUDA vs Bend performance comparison

Opened this issue · 6 comments

Hello, thanks for the great work.

The speed-up between CPU and GPU is obviously great, but what about the speed difference between CUDA and Bend for the same algorithm? A comparison using a tensor operation, like one from a neural-network library, would be great.

You can use a snippet from the examples (e.g. from the README.md) in the official repo: https://github.com/HigherOrderCO/Bend?tab=readme-ov-file#getting-started

def Sum(start, target):
  if start == target:
    return start
  else:
    return start + Sum(start + 1, target)  

def main():
  return Sum(1, 1_000_000)

Run the code above with bend run-cu file.bend

Then you can compare that to a similar example in Python using Numba (PyCUDA or JAX would work as well).
I am not sure a Python CUDA wrapper is a fair comparison, but it's something. I hope @VictorTaelin doesn't hold it against me that I have no idea how many threads per block and blocks per grid Bend uses, so adjust the values below to make the comparison fair.

import numpy as np
from numba import cuda
import time

@cuda.jit
def parallel_sum(arr, result):
    idx = cuda.grid(1)
    if idx < arr.size:
        cuda.atomic.add(result, 0, arr[idx])

def main():
    N = 1_000_000
    # int64: the sum of 1..1_000_000 is 500000500000, which overflows int32
    arr = np.arange(1, N + 1, dtype=np.int64)
    result = np.zeros(1, dtype=np.int64)

    threads_per_block = 256
    blocks_per_grid = (arr.size + (threads_per_block - 1)) // threads_per_block

    # Warm-up launch so JIT compilation is not included in the timing
    parallel_sum[blocks_per_grid, threads_per_block](arr, result)
    cuda.synchronize()
    result[0] = 0

    start_time = time.time()
    parallel_sum[blocks_per_grid, threads_per_block](arr, result)
    cuda.synchronize()
    end_time = time.time()

    print(f"Sum result: {result[0]}")
    print(f"Time taken: {end_time - start_time} seconds")

if __name__ == "__main__":
    main()
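
One thing to watch in any version of this benchmark: the sum of 1..1,000,000 does not fit in a 32-bit integer, so the accumulator needs to be 64-bit. The closed form makes this easy to check (a quick standalone sanity check, not part of the benchmark itself):

```python
# Closed form for the sum 1 + 2 + ... + n
n = 1_000_000
expected = n * (n + 1) // 2
print(expected)               # 500000500000
print(expected > 2**31 - 1)   # True: exceeds the int32 range
```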

Did you run this? What are the results?

I don't have an Nvidia GPU to test, but you can expect Bend to be way way worse than CUDA for trivial arithmetic problems (including all of linear algebra, tensor operations, etc).

We purposefully avoid making benchmarks relative to CUDA because that's not the point of Bend.

What is the point of Bend? (No offense.)
A lot of the appeal, from what I've seen, is that you can seamlessly run on CUDA without having to use huge, complicated CUDA libraries. The rest of Bend seems out of reach (advanced/abstract concepts like affine types). Maybe it could help me find some inspiration if I find out the real point.

@deadsoul44

Did you run this? What are the results?

For this specific test:

user@DESKTOP-C7548H1:~$ bend run-c x.bend -s
Result: 5908768

  • ITRS: 45999971
  • TIME: 0.93s
  • MIPS: 49.37

user@DESKTOP-C7548H1:~$ bend run-cu x.bend -s
Result: 5908768

  • ITRS: 45999971
  • TIME: 0.26s
  • MIPS: 177.96
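
(A side note on the Result line, my own observation rather than part of the thread's output: Bend's native numbers are 24-bit, so the sum wraps around. 5908768 is exactly the true sum taken modulo 2^24, which a quick check confirms:)

```python
# True sum of 1..1_000_000 via the closed form n*(n+1)/2
n = 1_000_000
true_sum = n * (n + 1) // 2   # 500000500000
# Bend's default integers are 24-bit, so results wrap modulo 2**24
print(true_sum % 2**24)       # 5908768 -- matches the Result above
```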

What is the point of Bend? (No offense.) A lot of the appeal, from what I've seen, is that you can seamlessly run on CUDA without having to use huge, complicated CUDA libraries. The rest of Bend seems out of reach (advanced/abstract concepts like affine types). Maybe it could help me find some inspiration if I find out the real point.

To run general programs in a massively parallel way by default. The advanced concepts that make this possible (which are not really that complicated or advanced) are not that important to users.
CUDA is really good for writing a very specific set of programs, but for everything else it becomes so complicated to write an efficient program that it is either done by incredible specialists or just not done at all.

For the programs that CUDA is really good at (like tensor operations), Bend is not a competitor at all. GPUs and CUDA are designed from the ground up to do those very specific things incredibly efficiently.
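
For intuition (my own illustration, not from the thread): the programs Bend parallelizes by default are ones whose recursion splits into independent branches, like a divide-and-conquer sum. A sequential Python sketch of that shape:

```python
def tree_sum(lo, hi):
    # Sum the integers in [lo, hi] by splitting the range in half.
    # The two recursive calls share no state, which is what lets a
    # runtime like Bend's evaluate both branches in parallel
    # automatically, with no explicit threads or kernel launches.
    if lo == hi:
        return lo
    mid = (lo + hi) // 2
    return tree_sum(lo, mid) + tree_sum(mid + 1, hi)

print(tree_sum(1, 1000))  # 500500
```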

I have an RTX 3070 and CUDA is installed, but bend run-cu still doesn't work.