data-apis/scipy-2023-presentation

Show the speedup against the original code (without any changes), not the changed code?

leofang opened this issue · 16 comments

Just a thought... Right now, my understanding of the benchmarks is that they all use modified implementations that are not yet upstreamed and that use the standardized APIs, and that we compare the perf of non-NumPy libraries against NumPy, but all using the modified implementations.

I wonder how difficult it would be to compare the results against the current SciPy/scikit-learn implementations, without any changes? IIRC I've seen it mentioned somewhere that the adoption is actually helping improve perf even for NumPy (at least for the non-strict mode, I guess?), so it'd be nice to showcase that finding too. It's a nice story: "by adopting the standard, the project maintainers also discovered a better pattern that improves CPU perf by X%".

Another question: It doesn't seem that any of the benchmarks uses the compat layer? It'd be nice to note this.

The scikit-learn changes are merged but not released; the SciPy changes are indeed just a PR. I can do this, though I suspect the results won't be much different.

Both libraries use the compat layer. It's basically impossible for real-world libraries to code against the array API right now without using the compat layer.
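For reference, the basic compat-layer pattern looks something like this (a minimal sketch; `total_energy` and the input are hypothetical, not code from either library):

```python
# Minimal sketch of consuming arrays via the compat layer.
import numpy as np
from array_api_compat import array_namespace

def total_energy(x):
    # Look up the array API namespace for whichever library `x` comes
    # from (NumPy, CuPy, PyTorch, ...), with compatibility wrappers applied.
    xp = array_namespace(x)
    return xp.sum(x**2)

total_energy(np.arange(10.0))  # the same code path works for cupy/torch arrays
```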

Agreed that it would be better to show timings relative to the original implementation, rather than the refactored implementation with NumPy as an array API compliant backend.

At the least, we should explicitly sanity-check that the original and the refactored version with NumPy as a backend do not significantly differ.
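e.g., something along these lines (a sketch; the file name and tolerance are placeholders):

```python
# Sketch of a cross-version sanity check for welch: run once under the
# released SciPy and once under the refactored branch, then compare outputs.
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
f, Pxx = signal.welch(x, fs=10e3, nperseg=1024)

# np.save("pxx_released.npy", Pxx)       # run under the released SciPy
# Pxx_ref = np.load("pxx_released.npy")  # run under the refactored branch
# np.testing.assert_allclose(Pxx, Pxx_ref, rtol=1e-10)
```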

I ran the NumPy benchmarks against the latest released versions of scikit-learn and scipy (i.e., no array API support).

For scikit-learn against the NumPy backend, the latest release (1.2.2) is actually slower than the array API-supporting version (main) for `fit` (16 seconds vs. 11 seconds). The reason is that @thomasjpfan made some performance optimizations alongside the changes adding array API support, notably this one (the code in the `if is_array_api_compliant` block runs faster than the `bincount` code in the `else` block, which is what was used previously). That confounds things a bit, as this is an optimization that would make sense even outside the context of adding array API support.
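To illustrate the shape of that kind of branch (a hypothetical sketch in plain NumPy, not the actual scikit-learn code):

```python
# Hypothetical sketch of two code paths for per-class means;
# not the actual scikit-learn implementation.
import numpy as np

def class_means(X, y, use_new_path=True):
    classes = np.unique(y)
    if use_new_path:
        # Array-API-friendly path: one masked mean per class.
        return np.stack([X[y == c].mean(axis=0) for c in classes])
    # Older path: bincount-style weighted sums divided by class counts.
    counts = np.bincount(y)
    sums = np.zeros((classes.shape[0], X.shape[1]))
    np.add.at(sums, y, X)
    return sums / counts[:, None]

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = np.repeat(np.arange(4), 250)
assert np.allclose(class_means(X, y, True), class_means(X, y, False))
```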

`predict` is a little bit slower (0.08 seconds in scikit-learn 1.2.2 vs. 0.11 seconds in main). Given the small absolute times, I expect that's mostly attributable to the overhead of wrapping, although I haven't investigated further.

SciPy's welch is also slower than the latest release: the welch benchmark runs in 1.35 seconds on SciPy 1.9.3 vs. 2 seconds on @tylerjereddy's branch. I haven't investigated this. I vaguely remember it being mentioned before, but I can't find it in any existing conversation. Tyler, do you have an idea why this is the case?

Can you clarify which benchmark I'm checking? Are you just running the one I used in my branch or a modified version here somewhere?

It's this benchmark: https://github.com/data-apis/scipy-2023-presentation/blob/main/benchmarks/bench.py. It's the same as the one in https://github.com/tylerjereddy/bench_welch_array_api/blob/main/bench.py, except I had to lower the number of data points from 90 million to 50 million to avoid a memory error on my GPU with CuPy.

I'll make an effort to check on a few GPUs tomorrow (May 31).

I think we're still OK with updated benchmarks (details: tylerjereddy/bench_welch_array_api#1) on another machine/GPU, this time compared against the 1.10.1 wheel as well (see below). I agree we're a bit slower vs. "vanilla NumPy", but maybe that's just a discussion point for the paper / something to refine? I guess we should line-profile this and drill down, but hopefully we don't need that fully ironed out before submitting? Below is the wheel vs. my welch branch for the other benchmarks.

(benchmark results plot attached: single-trial wheel vs. welch-branch timings)

I did those early this morning, so only a single trial for now... let me know if you need something more specific, or with better hardware (I could maybe do an A100 instead of a V100, but the latter is easier to get an allocation on for quick work).

The question here is whether this performance difference matters enough that we should compare the results against the released SciPy vs. the "NumPy relaxed array API" (an especially relevant question considering what's going on with scikit-learn, #15 (comment)).

By the way, another question about the welch benchmark: is there a reason you use a simple signal consisting of all zeros and one non-zero entry, rather than something more complex, like https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.welch.html#r34b375daf612-1? Would that matter for performance?

For the welch benchmarks, the result will still be impressive either way, right? I guess you could be conservative and do that--I agree that this needs to be cleaned up, but of course I'm not a signal maintainer, and those shims haven't even been through code review by a domain expert, so I suspect I've given a floor rather than a ceiling on performance.

Absolutely no reason for picking that weird benchmark, apart from extracting the data structure from one of the unit tests and just making it bigger. Indeed, we probably should play with the signal properties/density if we want to be really thorough, I suppose.

If any of those tasks are important for the paper, let me know as soon as you can--I'm taking off on Saturday for a few days, but will do my best before then.

I'm working on updating the welch benchmark. I just wanted to get your thoughts on it since you did the implementation. If you don't already know why there is a difference in pure numpy, then it's something we can investigate.

By the way, the paper submission deadline is Friday, so we have to have everything ready by then anyway :)

I updated the welch benchmark to use a more realistic signal, based on that documentation example. Everything is the same except for PyTorch GPU, which now has the same performance as CuPy. It seems PyTorch was somehow optimizing the trivial "all zeros" example.
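For reference, the documentation example generates a signal roughly like this (a sketch adapted from the scipy.signal.welch docs; the 50-million size matches the benchmark mentioned above):

```python
# Sketch of the more realistic test signal, adapted from the
# scipy.signal.welch documentation example: a tone plus white noise.
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
fs = 10e3                     # sampling frequency, Hz
n = 50_000_000                # sized to fit in GPU memory (see above)
amp = 2 * np.sqrt(2)
freq = 1234.0                 # tone frequency, Hz
noise_power = 0.001 * fs / 2

t = np.arange(n) / fs
x = amp * np.sin(2 * np.pi * freq * t)
x += rng.normal(scale=np.sqrt(noise_power), size=t.shape)

f, Pxx = signal.welch(x, fs, nperseg=1024)
```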

@tylerjereddy I'm a little surprised your plot #15 (comment) shows CuPy as slower than PyTorch CPU. This is obviously quite hardware dependent, but I would expect your (relaxed) CuPy and Torch CPU bars to be reversed.

Hmm, yeah, I noticed that too--the first-ever import of a fresh install/compile of CuPy might slow things down a bit with .pyc generation and such for a single-trial benchmark run?

I repeated the Torch GPU and CuPy relaxed runs, and they match your expectations over at tylerjereddy/bench_welch_array_api#1. I guess that's part of why I mentioned I only did single trials; I was mostly looking at the "vanilla NumPy" aspect.

Oh yeah, for your own benchmarks, you should make sure to do a warmup run, and also to sync the GPU before and after each run (see the discussion at #11).
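e.g., for the CuPy case, something like this (a sketch; `bench` is a placeholder for the welch call):

```python
# Sketch of a GPU timing loop with a warmup run and explicit device syncs.
import time
import cupy as cp

def time_gpu(bench, n_trials=5):
    bench()                             # warmup: caches, imports, .pyc, etc.
    cp.cuda.Device().synchronize()
    times = []
    for _ in range(n_trials):
        cp.cuda.Device().synchronize()  # ensure prior work has finished
        start = time.perf_counter()
        bench()
        cp.cuda.Device().synchronize()  # wait for the kernels to complete
        times.append(time.perf_counter() - start)
    return min(times)
```

For PyTorch, torch.cuda.synchronize() plays the same role as the CuPy device sync.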

Sounds good. Looks like there's a CuPy developer who caught that--to be fair, I did know about syncing and such, but yeah, great that you're doing that.