vincefn/pyvkfft

Question regarding VkFFTApp and pyvkfft.fft (OpenCL backend)

Closed this issue · 9 comments

As a sort of learning exercise I'm implementing Richardson-Lucy deconvolution using pyvkfft for the FFTs involved in each iteration.
Originally, I was using VkFFTApp to set up the R2C FFT:
r2c_out = VkFFTApp(psf_c.shape, psf_c.dtype, queue=cq, ndim=3, r2c=True, inplace=False)
Then I use fft = r2c_out.fft(decon, fft) to calculate the R2C (and r2c_out.ifft for the inverse)

After some time, the pyvkfft.fft module was implemented, so I started to play with that, switching to just from pyvkfft.fft import rfftn, irfftn and using those instead of the App et al.
So now I use fft = rfftn(decon) for the R2C (and irfftn for the inverse).

Note: in both cases, I do some initial FFTs outside of my for loop (this is where I set up the VkFFTApp, for example), so I assume the apps are created and cached there, and only the FFTs and iFFTs run inside the loop. Also, for this simple exercise everything has the same shape (3D, float32).
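For reference, the iteration structure described above can be sketched with numpy's rfftn/irfftn standing in for the pyvkfft functions of the same name (all array names here are illustrative, not from the original code):

```python
import numpy as np

def rl_iterate(image, est, otf):
    # Blur the current estimate with the PSF (a multiplication by the
    # OTF in Fourier space), compare with the measured image, and
    # apply the Richardson-Lucy multiplicative update.
    conv = np.fft.irfftn(np.fft.rfftn(est) * otf, s=image.shape)
    ratio = image / conv
    correction = np.fft.irfftn(np.fft.rfftn(ratio) * np.conj(otf), s=image.shape)
    return est * correction

# Toy check: a delta-function PSF has an OTF of 1 everywhere, so a
# single iteration starting from a flat estimate returns the image.
shape = (8, 8, 8)
rng = np.random.default_rng(0)
image = rng.uniform(0.5, 1.5, shape).astype(np.float32)
psf = np.zeros(shape, dtype=np.float32)
psf[0, 0, 0] = 1.0
otf = np.fft.rfftn(psf)
est = rl_iterate(image, np.ones(shape, dtype=np.float32), otf)
```

On the GPU each rfftn/irfftn call would simply be replaced by the pyvkfft.fft equivalents operating on pyopencl arrays.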

Anyhow, I've managed to get everything working (I had various PEBKAC snafus).
Both are considerably faster than clFFT-based or reikna-based Python deconvolution as well!
However, I do notice that 100 iterations using the manual VkFFTApp run some 20% faster than the rfftn/irfftn option.

Is this expected, or is it perhaps because I'm not passing all the extra arguments that VkFFTApp takes when calling rfftn?

Additionally, I'm a bit confused about in-place vs out-of-place for the R2C case.
Looking at the example notebook: https://github.com/vincefn/pyvkfft/blob/master/examples/pyvkfft-fft.ipynb
The in-place example is:

print("R2C transform, inplace")
dr = cla.to_device(cq, ascent().astype(np.float32))
sh = (dr.shape[0], dr.shape[1]//2+1)
dc = cla.empty(cq, sh, dtype=np.complex64)
do_rfft_and_plot(dr, dc)

and the function says:

def do_rfft_and_plot(dr, dc=None):
    # if dc is None, the transform is out-of-place and the destination
    # array is allocated on-the-fly

So if in both cases a dc array is created, what's the difference/advantage of one vs the other? Certainly the out-of-place option is easier to use...

Thanks for any advice and for your efforts in making this library!

> However, I do notice that 100 iterations using the manual VkFFTApp run some 20% faster than the rfftn/irfftn option. Is this expected or perhaps because I'm not passing all the extra elements that are in VkFFTApp when calling rfftn?

No, this is not expected. But if you are using just rfftn(dr) then the creation of the dc array would take a little time, so that probably explains the overhead.

If you pre-allocate the destination array then there should be no overhead. Also, if you use a memory pool then maybe the overhead for using rfftn(dr) would be smaller (I have not tested).
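The pre-allocation pattern can be sketched with numpy stand-ins. numpy's FFTs take no destination argument, so copying into a reused buffer only mimics the rfftn(src, dest) call form mentioned in the thread, but the loop structure is the same:

```python
import numpy as np

shape = (16, 16, 16)
rng = np.random.default_rng(1)
dr = rng.uniform(0, 1, shape).astype(np.float32)

# Allocate the half-spectrum destination once, outside the loop...
dc = np.empty((16, 16, 16 // 2 + 1), dtype=np.complex128)

for _ in range(3):
    # ...and reuse it on every pass instead of allocating a new array
    # (with pyvkfft.fft this would be dc = rfftn(dr, dc)).
    dc[...] = np.fft.rfftn(dr)
```

The point is only that the allocation happens once rather than once per iteration.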

> Additionally, I'm a bit confused about in-place vs out-of-place for the R2C case. Looking at the example notebook, the in-place example is:
>
>     print("R2C transform, inplace")
>     dr = cla.to_device(cq, ascent().astype(np.float32))
>     sh = (dr.shape[0], dr.shape[1]//2+1)
>     dc = cla.empty(cq, sh, dtype=np.complex64)
>     do_rfft_and_plot(dr, dc)

That's a mistake in the example. There should actually be 3 types of transforms:

print("R2C transform, out-of-place in auto-allocated destination array")
d = cla.to_device(cq, ascent().astype(np.float32))
do_rfft_and_plot(d)

print("R2C transform, out-of-place")
dr = cla.to_device(cq, ascent().astype(np.float32))
sh = (dr.shape[0], dr.shape[1]//2+1)
dc = cla.empty(cq, sh, dtype=np.complex64)
do_rfft_and_plot(dr, dc)

print("R2C transform, inplace")
dr = cla.to_device(cq, ascent().astype(np.float32))
do_rfft_and_plot(dr, dr)

.. but if you try this you'll see that the inplace rfftn(dr,dr) does not return the complex64 view of the array, which is a bug.
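For illustration, the buffer layout behind an in-place R2C transform, and the complex64 view it should return, can be reproduced with plain numpy (the shapes here are illustrative):

```python
import numpy as np

n = 8
# In-place R2C layout: the fast axis of the real array is padded from
# n to n + 2 float32 values, so the n//2 + 1 complex64 outputs fit in
# the very same buffer.
dr = np.zeros((n, n + 2), dtype=np.float32)
dr[:, :n] = np.random.default_rng(2).uniform(0, 1, (n, n))
orig = dr[:, :n].copy()

# The complex64 view that an in-place rfftn(dr, dr) should hand back:
dc = dr.view(np.complex64)  # shape (n, n//2 + 1), same memory as dr

# Mimic the transform writing its result into the shared buffer:
dc[...] = np.fft.rfft(orig, axis=1)
```

After this, dr and dc alias the same storage: the real data has been overwritten by the interleaved complex spectrum.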

> No, this is not expected. But if you are using just rfftn(dr) then the creation of the dc array would take a little time, so that probably explains the overhead.

Ok, that makes sense: since it's in a loop for 100+ iterations, the small overhead accumulates over 2 R2C and 2 inverse transforms per iteration. I pre-allocated a real and a complex pyopencl array before my loop, changed my code to dc = rfftn(decon, dc), and now I get identical performance!
Thanks for the tip, I should have considered that!
❤️

> That's a mistake in the example. There should actually be 3 types of transforms: [...] but if you try this you'll see that the inplace rfftn(dr, dr) does not return the complex64 view of the array, which is a bug.

Ok, thanks for explaining that, makes more sense now.

The issue with the incorrect dtype for the array view returned in the case of an inplace r2c transform should be fixed by 1b89580. I'll make a new minor release.

Also I can confirm that using a memory pool helps for an out-of-place auto-allocate transform.

Without a pool:

n = 128
dr = cla.to_device(cq, np.random.uniform(0,1,(n,n,n)).astype(np.float32))

def goout(repeat):
    for i in range(repeat):
        d = rfftn(dr)
        d = irfftn(d)
    junk = d.get().sum()

def goin(repeat):
    for i in range(repeat):
        d = rfftn(dr, dr)
        d = irfftn(d, d)
    junk = d.get().sum()

%timeit goout(10)
%timeit goin(10)
10.6 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.42 ms ± 8.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And with the memory pool:

import pyopencl.tools as cltools
n = 128
mempool = cltools.MemoryPool(cltools.ImmediateAllocator(cq))
dr = cla.to_device(cq, np.random.uniform(0,1,(n,n,n)).astype(np.float32), allocator=mempool)

%timeit goout(10)
%timeit goin(10)
4.37 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.33 ms ± 4.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It may be easier to use the out-of-place transform rather than the in-place transform for Richardson-Lucy, because the handling of the extra 2 real columns in the inplace case would complicate things.
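To illustrate, with hypothetical shapes and a numpy stand-in: an in-place R2C keeps 2 padding columns on the fast axis of the real array, so every elementwise operation against an unpadded array must slice them away first:

```python
import numpy as np

nz, ny, nx = 4, 4, 8
image = np.ones((nz, ny, nx), dtype=np.float32)           # unpadded data
est_padded = np.ones((nz, ny, nx + 2), dtype=np.float32)  # in-place buffer

# Slice the 2 padding columns away, or the shapes won't broadcast:
ratio = image / est_padded[..., :nx]
```

Every multiplication, division, and comparison in the Richardson-Lucy update would need this slicing, which is the complication referred to above.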

Thanks for pointing out these issues!

Thanks a ton!
I'll check out MemoryPool for sure—I'm both a python and (py)OpenCL noob, so this is all a learning exercise. pyvkfft makes this all a lot easier than the alternatives for sure—and FAST! Next I will probably try to implement regularization, but I need to learn a bit more of pyopencl...
Feel free to close this issue, from my side everything is clear now.

FYI, in case you're curious, I can replicate the two auto-allocate code blocks on my M1 MBPro (native arm64 python env, macOS 12.1).
First block—the one without MemoryPool:

28.2 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
19.5 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Second block, with MemoryPool:

19.3 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
19.2 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So you've got 4x better hardware than my laptop 🚀

Edit: I've also checked, and pre-assigning the memory yields the same times as in-place. But in-place should be more memory efficient, right? Less GPU memory blocked off? This could make a difference for larger images...

> But in-place should be more memory efficient, right? Less GPU memory blocked off? This could make a difference for larger images...

It depends what you mean by 'memory efficient': except for very small sizes which may fit in cache, out-of-place should be just as fast as in-place, as it corresponds to the same amount of read & write operations. It only consumes more memory, so in that regard it is less efficient, and yes, that matters when you need large amounts of memory.

Sorry, I meant just in terms of GPU RAM. Out-of-place requires an extra array the size of the image and another the size of the OTF, so in-place could yield significant savings, making it possible to process larger images without tiling.
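As a rough sketch of that accounting for one R2C pair (float32/complex64, using an illustrative 512^3 volume):

```python
nz = ny = nx = 512
real_bytes = nz * ny * nx * 4              # unpadded float32 array
cplx_bytes = nz * ny * (nx // 2 + 1) * 8   # complex64 half-spectrum
inplace_bytes = nz * ny * (nx + 2) * 4     # padded in-place real buffer

# The padded in-place buffer is exactly the size of the complex
# half-spectrum, so out-of-place costs one extra real-sized array.
print((real_bytes + cplx_bytes) / 2**20, inplace_bytes / 2**20)  # MiB: 1026.0 514.0
```

So for a single buffer pair, in-place roughly halves the footprint; how much that helps overall depends on how many other arrays the algorithm has to keep alive.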

Yes; however, note that beyond some array sizes (e.g. nx>8192, see the VkFFT documentation for details), in-place transforms will require an auto-allocated temporary buffer with the same size as the original array. This is still better than cuFFT, which does this even for smaller arrays, I believe.

> It may be easier to use the out-of-place transform rather than the in-place transform for Richardson-Lucy, because the handling of the extra 2 real columns in the inplace case would complicate things.

You were right, it's not so clear cut, because every iteration needs:

  1. the original image, so this can't be overwritten by an in-place rfftn
  2. both the value and the rfftn of the previous decon estimate, so these can't be overwritten by an in-place rfftn either

So it seems I already have to use an out-of-place transform or make a copy, so the memory savings are much reduced; plus I have to make sure the array shapes work.