gfx-rs/metal-rs

computation performs slower than cpu version in benchmark

viirya opened this issue · 3 comments

Hi, I'm benchmarking the compute code from metal-rs against a CPU version.

Specifically, I benchmark the compute example, which performs a sum operation, against a CPU version that simply loops over the input slice and sums it up.
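For reference, a CPU baseline of the kind described here might look like the sketch below. This is not the actual benchmark code: the element type `u32`, the wrapping addition, and the names `sum_cpu` and `make_input` are all assumptions made for illustration.

```rust
/// Hypothetical CPU baseline: sum a slice with a plain loop.
/// `u32` elements and wrapping addition are assumptions for this sketch.
fn sum_cpu(data: &[u32]) -> u32 {
    let mut total: u32 = 0;
    for &x in data {
        total = total.wrapping_add(x);
    }
    total
}

/// Build an input of `1024 * factor` elements, all set to 1,
/// so the expected sum equals the number of elements.
fn make_input(factor: usize) -> Vec<u32> {
    vec![1u32; 1024 * factor]
}
```

Calling `sum_cpu(&make_input(90))` sums 92160 ones. A loop this simple is easy for the compiler to vectorize, which is part of why the CPU side is hard to beat at this input size.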

I set the input data size to 1024 * factor. In every case I tried, the metal-rs compute version is slower than the CPU version. For example, with factor 90:

sum (metal), factor: 90 time:   [465.91 µs 479.72 µs 495.96 µs]
                        change: [-2.5272% +1.6099% +6.1871%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

sum (cpu), factor: 90   time:   [32.132 µs 32.140 µs 32.148 µs]

Is this benchmark result expected? I assumed the Metal version would speed up the operation.

Do you have any ideas or suggestions?

You might want to look at benchmarking only the actual compute operation (not device initialization, copying data into buffers, etc.). You generally want to reuse as many GPU resources as possible, so setup costs may be skewing your benchmarks if they are included in the measurement.
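A minimal sketch of that separation, using only the standard library so it runs anywhere. `Resources`, `run_once`, and `bench_reusing` are hypothetical stand-ins: in a real benchmark `Resources` would hold the metal-rs device, pipeline state, and buffers, and `run_once` would encode, commit, and wait for the compute pass.

```rust
use std::time::{Duration, Instant};

/// Stand-in for expensive one-time setup (device, pipeline, buffers).
/// Hypothetical: a real benchmark would hold metal-rs objects here.
struct Resources {
    input: Vec<u32>,
}

impl Resources {
    fn new(len: usize) -> Self {
        Resources { input: vec![1u32; len] }
    }
}

/// Stand-in for the operation under test (the dispatch plus the wait).
fn run_once(res: &Resources) -> u32 {
    res.input.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

/// Time only `run_once`, reusing `res` across all iterations.
/// Returns the mean time per iteration and the last result.
fn bench_reusing(res: &Resources, iters: u32) -> (Duration, u32) {
    assert!(iters > 0, "need at least one iteration");
    let mut last = 0;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the work.
        last = std::hint::black_box(run_once(res));
    }
    (start.elapsed() / iters, last)
}
```

The point is that `Resources::new` runs once, outside the timed loop. If the benchmark uses Criterion (the output format above suggests it does), the equivalent is to build the resources before calling `b.iter` and to do only the dispatch inside the closure.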

Even then, depending on the size, it still might not beat the CPU version. It really depends on the exact kinds of computations you're doing. For sum operations specifically, you might look into how to perform prefix sums on the GPU, then compare against prefix sums on the CPU (e.g., using SIMD).
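To make the prefix-sum idea concrete, here is a sketch of one well-known parallel scheme, the Hillis-Steele inclusive scan, simulated on the CPU. It is only meant to show the shape of the algorithm, not to be a tuned GPU kernel or the approach the example in this repository uses. Each pass of the outer loop corresponds to one step in which, on a GPU, all elements would be updated in parallel. The function names and the `u32` element type are assumptions.

```rust
/// Inclusive prefix sum using the Hillis-Steele scheme, simulated on the CPU.
/// Each outer pass models one parallel step: every element at index
/// `i >= offset` adds the value `offset` positions to its left,
/// read from the previous pass.
fn scan_hillis_steele(input: &[u32]) -> Vec<u32> {
    let mut prev = input.to_vec();
    let mut next = prev.clone();
    let mut offset = 1;
    while offset < prev.len() {
        for i in 0..prev.len() {
            next[i] = if i >= offset {
                prev[i].wrapping_add(prev[i - offset])
            } else {
                prev[i]
            };
        }
        // Double buffering: the values just written become the next input.
        std::mem::swap(&mut prev, &mut next);
        offset *= 2;
    }
    prev
}

/// Sequential reference scan to compare against.
fn scan_sequential(input: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    input
        .iter()
        .map(|&x| {
            acc = acc.wrapping_add(x);
            acc
        })
        .collect()
}
```

The last element of the scan is the total sum. This scheme needs about log2(n) passes but does O(n log n) additions in total, so it only pays off when the passes really run in parallel; work-efficient variants such as the Blelloch scan reduce the total work.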

Since the GPU has a unified memory model, I assumed the cost of copying data into buffers doesn't count here.

I reworked the benchmark to reuse the initialized device across all runs. The good news is that GPU runs improved by about 35%. But the GPU version is still roughly 10x slower than the CPU version.

sum (metal), factor: 90 time:   [295.16 µs 301.82 µs 308.61 µs]
                        change: [-37.171% -34.824% -32.562%] (p = 0.00 < 0.05)
                        Performance has improved.

sum (cpu), factor: 90   time:   [28.919 µs 28.923 µs 28.927 µs]
                        change: [-0.2977% -0.0365% +0.1904%] (p = 0.80 > 0.05)
                        No change in performance detected.

So I guess you're right that the bottleneck is the sum computation itself. I'm looking into prefix sum algorithms on the GPU to see whether they improve performance further.

You will still have to copy into a buffer, since the data needs to be properly aligned.
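To illustrate the alignment point: as far as I know, Metal's no-copy buffer creation (exposed in metal-rs as `new_buffer_with_bytes_no_copy`, if I recall correctly) expects page-aligned memory whose length is a multiple of the page size, which an ordinary `Vec` does not guarantee. The sketch below allocates such memory with the standard library only. The `PageAligned` type is hypothetical, and the 16 KiB page size is an assumption; a real program should query the page size at runtime.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Assumed page size for this sketch (16 KiB).
/// In practice, query the real value at runtime.
const PAGE_SIZE: usize = 16 * 1024;

/// Round `len` up to the next multiple of `PAGE_SIZE`.
fn round_up_to_page(len: usize) -> usize {
    (len + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE
}

/// Owns a zeroed, page-aligned allocation whose length is a
/// multiple of the page size. Hypothetical helper type.
struct PageAligned {
    ptr: *mut u8,
    layout: Layout,
}

impl PageAligned {
    fn new(min_len: usize) -> Self {
        // Allocate at least one page so the layout never has size zero.
        let len = round_up_to_page(min_len.max(1));
        let layout = Layout::from_size_align(len, PAGE_SIZE).expect("invalid layout");
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        PageAligned { ptr, layout }
    }

    fn as_ptr(&self) -> *mut u8 {
        self.ptr
    }

    fn len(&self) -> usize {
        self.layout.size()
    }
}

impl Drop for PageAligned {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}
```

If the input is produced directly into memory like this, the extra copy can in principle be avoided. If the data already lives in an unaligned `Vec`, a copy into an aligned buffer is still needed.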