The timers used for GPU libraries are inaccurate
leofang opened this issue · 13 comments
Since we're using the CPU timer perf_counter()
as the proxy (technically we should use CUDA events, but it's OK), we need to do a device-wide sync before and after the sandwiched code section; that is, a synchronization should also be inserted before line 29, which calls welch:
scipy-2023-presentation/benchmarks/scipy_bench.py
Lines 28 to 33 in 26db2c0
The same applies to the other file: a sync should be inserted before line 43, which calls fit():
scipy-2023-presentation/benchmarks/scikit_learn_bench.py
Lines 41 to 45 in 26db2c0
in case the GPU is still busy (initializing the input data) while the CPU has already reached line 43.
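For illustration only, here is a minimal sketch of the placement being suggested, assuming CuPy (torch.cuda.synchronize() would be the PyTorch equivalent); the input array and the welch call are stand-ins for what the benchmark script actually does:

```python
import time
import cupy as cp
from scipy.signal import welch  # assumes the array-API-enabled SciPy used in these benchmarks

def sync():
    # Device-wide synchronization; torch.cuda.synchronize() is the PyTorch equivalent.
    cp.cuda.Device().synchronize()

x = cp.random.random(2**20)   # stand-in for the benchmark's input data

sync()                        # the GPU may still be initializing the input here
t1 = time.perf_counter()
f, Pxx = welch(x)
sync()                        # wait for the GPU work launched by welch to finish
t2 = time.perf_counter()
print(f"welch: {t2 - t1:.4f} s")
```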
Another note: these benchmarks do not do any warm-up to stabilize the GPU state and prepare the array libraries. This is problematic because the first call is always the slowest (initializing the CUDA context, loading the CUDA libraries, JIT'ing the kernels if any, etc.).
Generally, we want something like this
https://github.com/NVIDIA/cuQuantum/blob/2cc68ede865d1f2028f28c9ad411f85748e20b58/benchmarks/cuquantum_benchmarks/_utils.py#L453-L498
in which the function is first run in a warm-up loop; we then start the benchmark loop, measure the times, and report the average (and confirm the std is low).
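In other words, something along these lines (a sketch only; run_workload, sync, and the repeat counts are placeholders, not names from the benchmark scripts):

```python
import time
import statistics

def benchmark(run_workload, sync, n_warmup=3, n_repeat=10):
    # Warm-up loop: absorbs CUDA context creation, library loading, kernel JIT, caching, ...
    for _ in range(n_warmup):
        run_workload()

    times = []
    for _ in range(n_repeat):
        sync()                         # make sure the GPU is idle before starting the timer
        t1 = time.perf_counter()
        run_workload()
        sync()                         # wait for the GPU before reading the CPU timer again
        t2 = time.perf_counter()
        times.append(t2 - t1)

    print(f"mean {statistics.mean(times):.4f} s, "
          f"std {statistics.stdev(times):.4f} s")   # confirm the std is low
    return times
```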
Regarding a warm-up, yes, I noticed that the scikit-learn fit benchmark is considerably faster if you run it a second time in the same process (PyTorch CUDA goes from 2.4 seconds down to 0.3 seconds; CuPy only goes from 1.1 seconds down to 0.97 seconds). For predict, re-running in the same process doesn't affect the result, and I haven't tested welch yet.
The reason I am running the benchmarks in separate processes is that Tyler had mentioned this as an issue in his benchmarks: https://github.com/tylerjereddy/bench_welch_array_api/blob/main/bench.py#L35-L38. I also did it because I felt a "cold start" was more representative of a real-world application.
I'd love to hear your thoughts on whether this makes sense and what the standard practice is.
Tyler's comment is a bit confusing; I think if the script focuses on one library at a time, it's OK to do a warm-up loop in the same process. There's no context switch, no destruction of the mempools, nothing. But we still want to launch one process per library, yes.
Cold-start data are frowned upon by CUDA ninjas, unfortunately. From CUDA initialization to launching the first kernel there are many unknowns (some of which are being, or have been, addressed by the CUDA team, e.g. lazy loading) that are irrelevant to the quality of a workload implementation, and we should exclude them to reflect the actual status. One can argue "hey, but CUDA takes 10 seconds to initialize on my laptop," and no doubt that is the real-world experience, but it's really irrelevant to our focus here (making an array API compatible implementation run on GPUs). TBH, even for CPU-only workloads a cold start is unfair, as the CPU frequency can adjust dynamically too.
So in my testing, I am already observing this just from running the timings in separate processes. For example, at https://github.com/data-apis/scipy-2023-presentation/blob/main/benchmarks/scipy_timings.csv you can see that the first torch_gpu, False run is slower than the rest, and the same for cupy, False.
The main thing that concerns me is the significant speedup of PyTorch on scikit-learn's LinearDiscriminantAnalysis.fit benchmark. Like I said, if you run it in separate processes, it takes about 2.4 seconds, but if you re-run it in the same process, it subsequently takes 0.3 seconds. I'm not completely clear on why this is, but I suspect the torch JIT is doing something to make things faster. This doesn't happen with CuPy, and it doesn't happen with the other two benchmarks (LinearDiscriminantAnalysis.predict or welch).
By the way, can you confirm that the synchronize calls should both go within the perf_counter block (i.e., be included in the timings)?
So in my testing, I am already observing this just from running the timings in separate processes.
Can't speak for PyTorch, but it's kinda expected for CuPy, for two reasons:
- The CUDA driver would JIT & cache the kernel code on disk
- CuPy would do the same on its own (different JIT)
so for subsequent process launches some JIT overhead is avoided. This further justifies excluding unnecessary noise from the measurements.
The main thing that concerns me is the significant speedup of PyTorch on scikit-learn's LinearDiscriminantAnalysis.fit benchmark.
Too good to be true? I don't know whether PyTorch optimizes for repeated workloads/graphs internally, or whether the PyTorch path somehow triggers another caching mechanism; if so, that would explain it. CuPy has no smart runtime or compute-graph system backing it.
We should fix the timer and retry.
By the way, can you confirm that the synchronize calls should both go within the perf_counter block (i.e., be included in the timings)?
It appears that I wasn't being accurate, sorry. We should do
sync()
t1 = timer()
...
sync()
t2 = timer()
so that between t1 and t2 only the target workload is being executed, since we're using the CPU timer as a proxy here.
You might be wondering why the benchmark helper from cuquantum-benchmarks that I linked above does it differently. It's because we don't use the CPU timer to time the GPU workload there (note there is no sync before the 2nd CPU timer call). The CPU timers are only used to measure the time spent on host, which due to asynchronicity could be drastically different from the GPU elapsed time (measured via CUDA events).
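As a rough sketch of that split (hedged, using CuPy's event API; torch.cuda.Event(enable_timing=True) offers the same on the PyTorch side), note that the second perf_counter reading is deliberately taken without a sync, so it only measures the host-side launch time:

```python
import time
import cupy as cp

start, end = cp.cuda.Event(), cp.cuda.Event()

t1 = time.perf_counter()
start.record()
y = cp.fft.fft(cp.random.random(2**20))   # stand-in for the GPU workload
end.record()
t2 = time.perf_counter()                  # host time only: no sync, kernel launches are asynchronous

end.synchronize()                         # wait for the recorded event (and all work before it)
gpu_ms = cp.cuda.get_elapsed_time(start, end)
print(f"host: {(t2 - t1) * 1e3:.3f} ms, GPU: {gpu_ms:.3f} ms")
```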
Too good to be true?
I honestly don't know. I don't have enough experience with GPUs to know whether this is reasonable or to guess why it is happening. I don't really even know what sorts of things fit is doing under the hood (@thomasjpfan would know more about that).
You might be wondering why the benchmark helper from cuquantum-benchmarks that I linked above does it differently. It's because we don't use the CPU timer to time the GPU workload there (note there is no sync before the 2nd CPU timer call). The CPU timers are only used to measure the time spent on host, which due to asynchronicity could be drastically different from the GPU elapsed time (measured via CUDA events).
Is there a more accurate GPU timer that I should be using? Or is that even worth bothering with?
Note that aside from the first run, the timings appear to have pretty low variance.
I updated the benchmarks script to run all benchmarks in the same process, sync before timing, and do a warm-up run that isn't included in the timings. The biggest difference from before is that PyTorch comes out much faster for the scikit-learn fit run, and a little faster for the SciPy welch benchmark. See https://github.com/data-apis/scipy-2023-presentation/blob/main/benchmarks/bench.py. Let me know if you have any additional concerns.
@asmeurer LGTM! Note that the NumPy results also get slightly better if you compare the perf data before/after the changes (I assumed the benchmarks were done on the same machine).
btw has the plot in the paper been updated?
Yes, I updated the plot in the paper.
FYI, both PyTorch and CuPy note the importance of proper synchronization and warm-up runs in their docs:
- https://pytorch.org/tutorials/recipes/recipes/benchmark.html
- https://docs.cupy.dev/en/stable/user_guide/performance.html#benchmarking
In particular, PyTorch's Timer util essentially does what we do here too (link).
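For example (a sketch assuming a CUDA-capable device; see the linked docs for the exact signatures), both helpers handle synchronization and warm-up internally:

```python
# PyTorch: torch.utils.benchmark.Timer handles CUDA synchronization and warm-up runs.
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(4096, 4096, device="cuda")
t = benchmark.Timer(stmt="x @ x", globals={"x": x})
print(t.blocked_autorange())   # median and IQR over an adaptively chosen number of runs

# CuPy: cupyx.profiler.benchmark reports both CPU times and GPU (CUDA event) times.
import cupy as cp
from cupyx.profiler import benchmark as cp_benchmark

a = cp.random.random((4096, 4096))
print(cp_benchmark(lambda: a @ a, n_repeat=20, n_warmup=3))
```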