Optimize BSS use of FFT with cupy, speed up of up to 3x for full tracks

Question

sevagh opened this issue 3 years ago · 6 comments

Hello,
I have been working on some potential performance optimizations for the BSS evaluation (which is rather slow/compute intensive for full tracks).

Baseline measurement with original museval code (the total execution involves also computing the IRM, adapted from https://github.com/sigsep/sigsep-mus-oracle/blob/master/IRM.py):

museval bss original execution time, 1 track of musdb
pybin: /home/sevagh/venvs/museval-orig/bin/python3
evaluating track AM Contra - Heart Peripheral

real    3m22.702s
user    3m21.577s
sys     0m39.376s

The original code takes ~3:20 minutes.

The second optimization uses cupy and the GPU, which is in my opinion a big cost/burden for end users. Installing the CUDA toolkit etc. is no joke. Here is the code: master...sevagh:feat/cupy-accel
However, the performance is rather good at ~1:20 minutes, roughly 2.5x faster than the original code:

museval bss optimization 2 (cupy on gpu) execution time, 1 track of musdb
pybin: /home/sevagh/venvs/museval-optimization-2/bin/python3
evaluating track AM Contra - Heart Peripheral

real    1m19.801s
user    1m27.077s
sys     0m29.615s

One final note is that the CUDA/cupy version has slight differences in the outputs due to numerical precision differences. It doesn't look too significant to me - here's an excerpt of a diff between the evaluated json files, showing small differences in the BSS scores:

@@ -10459,8 +10459,8 @@
-            "SAR": 30.60528,
-            "ISR": 30.67039
+            "SAR": 30.60525,
+            "ISR": 30.67036
@@ -10469,8 +10469,8 @@
-            "SAR": 30.45440,
-            "ISR": 30.52629
+            "SAR": 30.45438,
+            "ISR": 30.52627
@@ -10480,7 +10480,7 @@
-            "ISR": 20.99668
+            "ISR": 20.99667
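
One way to quantify how small these differences are is to compare the paired scores against an absolute tolerance. The sketch below uses only the values visible in the diff excerpt above; the `max_abs_diff` helper and the 1e-4 dB tolerance are illustrative assumptions, not part of museval.

```python
# Score pairs copied from the diff excerpt: (original, cupy).
score_pairs = [
    (30.60528, 30.60525),  # SAR
    (30.67039, 30.67036),  # ISR
    (30.45440, 30.45438),  # SAR
    (30.52629, 30.52627),  # ISR
    (20.99668, 20.99667),  # ISR
]


def max_abs_diff(pairs):
    """Return the largest absolute difference between paired scores."""
    return max(abs(a - b) for a, b in pairs)


# Hypothetical tolerance in dB, chosen only for illustration.
TOLERANCE_DB = 1e-4

worst = max_abs_diff(score_pairs)
within_tolerance = worst <= TOLERANCE_DB
print(f"largest difference: {worst:.5f} dB, within tolerance: {within_tolerance}")
```

For the scores shown, the largest gap is a few units in the fifth decimal place.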

I'm also trying to find a way to use CPU parallelism with scipy.fft and combining several of the FFTs in a single call, but this isn't really helping as much as the CUDA change. My code attempts can be seen here: master...sevagh:multiple-1d-fft
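
For reference, the idea of combining several 1D FFTs into one call can be sketched with `scipy.fft`, which accepts a stacked 2D array and a `workers` argument for CPU parallelism. This is a minimal illustration on random data, not the code from the branch above; the array shapes, FFT length, and worker count are arbitrary assumptions.

```python
import numpy as np
import scipy.fft

rng = np.random.default_rng(0)

# Illustrative sizes only: 8 signals of 4096 samples each.
signals = rng.standard_normal((8, 4096))
n_fft = 8192  # zero-padded FFT length, chosen arbitrarily

# One FFT call per signal.
looped = np.stack([scipy.fft.rfft(s, n=n_fft) for s in signals])

# A single call over the stacked array: the transform runs along the last
# axis, and `workers` lets SciPy split the batch across CPU threads.
batched = scipy.fft.rfft(signals, n=n_fft, axis=-1, workers=2)

# Both approaches should agree up to floating-point rounding.
print(np.allclose(looped, batched))
```

Whether the batched call is actually faster depends on the signal lengths and the number of cores, which matches the modest gains reported here.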

I'm aware of the separate repo for bss at https://github.com/sigsep/bsseval/ but I wasn't sure which project to discuss it in - I'm using museval because I'm trying to recreate the SiSEC 2018 testbench.

Answer 1 · 2021-04-22T13:16:51.000Z

Also there could be a "super-performant" config with cupy, stacking multiple 1D FFTs (respecting GPU memory allocation limits), and using pinned host/gpu memory and FFT plans - I'll continue working in that direction.

Answer 2 · 2021-04-24T15:21:46.000Z

Optimized every slow line (discovered through kernprof + line_profiler): master...sevagh:feat/cupy-accel

This leads to just about 1 minute to compute the IRM mask and perform a BSS evaluation on 1 full-length MUSDB18 track:

real    1m1.762s
user    0m50.948s
sys     0m13.620s

This is down from the 3+ minutes originally:

real    3m22.702s
user    3m21.577s
sys     0m39.376s
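
The speedup implied by these wall-clock numbers can be worked out directly from the `real` times reported in this thread. The `parse_real_time` helper below is a hypothetical convenience written for this example.

```python
import re


def parse_real_time(text):
    """Convert a `time` duration such as '3m22.702s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


baseline = parse_real_time("3m22.702s")       # original museval
first_cupy = parse_real_time("1m19.801s")     # initial cupy version
profiled_cupy = parse_real_time("1m1.762s")   # after line-by-line optimization

print(f"initial cupy speedup: {baseline / first_cupy:.2f}x")
print(f"optimized cupy speedup: {baseline / profiled_cupy:.2f}x")
```

With these numbers the initial cupy version is roughly 2.5x faster than the baseline, and the profiled version is a little over 3x faster.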

Answer 3 · 2021-08-06T08:29:39.000Z

@sevagh I think this would be great. Do the regression tests pass using this?

Answer 4 · 2021-08-06T09:22:27.000Z

How can I run the tests? python setup.py test?

Answer 5 · 2021-08-10T09:43:50.000Z

Install the test environment with pip install .[tests] and then run

py.test tests/test_regression.py -vs

Answer 6 · 2021-08-10T14:08:44.000Z

OK. My most recent commits get the regression tests passing. Casting explicitly to float32 was creating huge errors in SAR/SIR/ISR, so I removed the casts.

I made the cupy install optional (although fixed to CUDA 11.4, which is rather recent).

Another idiosyncrasy is that it's best to clear the cupy FFT cache between BSS evaluations of large songs. That's why I added this helper function:
master...sevagh:feat/cupy-accel#diff-cc17d32a9d811e616624c2f2699f853dd06b143931ea9e37a6cc0dab6a4b8ab9R75-R88

In real code you would do:

for track in mus.tracks:
    ...
    scores = museval.eval_mus_track(...) # cupy under the hood
    museval.clear_cupy_cache()
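
The linked helper is not reproduced here. As a rough sketch of what clearing cupy's caches can look like, the function below uses cupy's public FFT plan cache and memory pool APIs. The name `release_gpu_caches` is hypothetical, and this is not necessarily what `museval.clear_cupy_cache()` does in the branch. cupy is treated as optional, so the function does nothing when it is not installed.

```python
def release_gpu_caches():
    """Free cupy's cached FFT plans and pooled memory, if cupy is usable.

    Returns True when the caches were cleared and False when cupy is
    missing or no usable GPU is available.
    """
    try:
        import cupy
    except ImportError:
        # cupy is an optional dependency; nothing to clear without it.
        return False

    try:
        # Drop cached cuFFT plans, which hold on to GPU work areas.
        cupy.fft.config.get_plan_cache().clear()
        # Return unused blocks held by the device and pinned-host pools.
        cupy.get_default_memory_pool().free_all_blocks()
        cupy.get_default_pinned_memory_pool().free_all_blocks()
    except Exception:
        # For example, cupy is installed but no CUDA device is present.
        return False
    return True


cleared = release_gpu_caches()
print(f"caches cleared: {cleared}")
```

Calling something like this after each track keeps cached plans and pooled buffers from one large song from accumulating into the next evaluation.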

Passing regression test:

(museval-cupy) sevagh:sigsep-mus-eval $ py.test tests/test_regression.py -vs
===================================================== test session starts =====================================================
platform linux -- Python 3.9.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/sevagh/venvs/museval-cupy/bin/python
cachedir: .pytest_cache
rootdir: /home/sevagh/repos/sigsep-mus-eval, configfile: setup.cfg
collected 4 items

tests/test_regression.py::test_aggregate[Music Delta - 80s Rock]     time         target metric     score                   track
[...]
Aggrated Scores (median over frames, median over tracks)
vocals          ==> SDR: -15.622  SIR:   9.165  ISR:  -8.476  SAR:  -7.327
accompaniment   ==> SDR: -13.290  SIR: -18.765  ISR:  -0.322  SAR:  -7.427

PASSED
tests/test_regression.py::test_track_scores[Music Delta - 80s Rock] PASSED
tests/test_regression.py::test_random_estimate[Music Delta - 80s Rock] PASSED
tests/test_regression.py::test_one_estimate[Music Delta - 80s Rock] PASSED

====================================================== warnings summary =======================================================
../../venvs/museval-cupy/lib/python3.9/site-packages/past/builtins/misc.py:45
  /home/sevagh/venvs/museval-cupy/lib/python3.9/site-packages/past/builtins/misc.py:45: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    from imp import reload

tests/test_regression.py: 12 warnings
  /home/sevagh/repos/sigsep-mus-eval/museval/metrics.py:601: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    eps = np.finfo(np.float).eps

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=============================================== 4 passed, 13 warnings in 46.33s ===============================================