fjarri/reikna

2d FFT with fast_math: roundtrip fails on GT 750M

maweigert opened this issue · 4 comments

Hi Bogdan,

I was porting some FFT-based code from pyfft to reikna and ran into inaccuracies in the FFT calculations with fast_math, depending on the hardware I am using.

I did the following simple roundtrip comparison

https://gist.github.com/maweigert/0bb5d16b3bb9a3d0659c7d48ee8fd32a

and got very different behaviour depending on the GPU:

Iris Pro:			0.0001746
GeForce GT 750M:		1.1197116

While pyfft on the same input (and fast_math = True) gave

GeForce GT 750M:		0.0000043

So it seems to be specific to reikna rather than to the GPU.

Did you ever see something similar, or can you reproduce this?

Cheers and thanks for the package!

M

Thanks for the report. That's strange: I cannot reproduce it on my old laptop with a GeForce 9400M (I get an error of 9e-7). I will have access to my Mac with a GeForce GT 750M next week when I'm at my office, so I can try it there too. Could you tell me which version/revision of reikna you are using?

Also, try the following reikna-only code:

from __future__ import print_function
import numpy as np
from reikna import cluda
from reikna.fft import FFT

dshape = (128,)*2
np.random.seed(0)
input = (np.random.uniform(-1,1,dshape)).astype(np.complex64)

thr = cluda.ocl_api().Thread.create(interactive=True)
buf_g = thr.to_device(input)
fft = FFT(buf_g).compile(thr, fast_math = True)

fft(buf_g, buf_g)
fft(buf_g, buf_g, inverse = True)

output = buf_g.get()

print("{}:\t\t{}".format(thr._device.name, np.amax(np.abs(input-output))))

Also, could you do some more tests?

  • if you have pycuda installed, replace ocl_api with cuda_api and check if the bug is still there;
  • does the bug still occur with fast_math=False?
  • does the bug still occur for a 1D array? Smaller than 128 elements?
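The checks listed above can be organized as a small sweep. This is only a sketch: `sweep`, `stub_measure`, `reikna_roundtrip_error` and the 1e-3 tolerance are names and values made up for illustration, and the stub returns invented numbers so that the harness itself runs without a GPU. `reikna_roundtrip_error` uses only the calls from the snippet above plus `cluda.cuda_api()`.

```python
from itertools import product


def reikna_roundtrip_error(api_name, shape, fast_math):
    """Forward + inverse FFT through reikna; returns max |input - output|.

    Imports are local so the rest of this sketch runs without a GPU stack.
    """
    import numpy as np
    from reikna import cluda
    from reikna.fft import FFT

    api = cluda.cuda_api() if api_name == "cuda" else cluda.ocl_api()
    # No prompt here; pass interactive=True as in the snippet above
    # to choose the device by hand.
    thr = api.Thread.create()
    data = np.random.RandomState(0).uniform(-1, 1, shape).astype(np.complex64)
    buf = thr.to_device(data)
    fft = FFT(buf).compile(thr, fast_math=fast_math)
    fft(buf, buf)
    fft(buf, buf, inverse=True)
    return float(np.amax(np.abs(data - buf.get())))


def sweep(measure, apis, shapes, fast_math_values, tolerance=1e-3):
    """Call measure(api, shape, fast_math) for every combination.

    Returns rows of (api, shape, fast_math, error, suspicious), where
    suspicious is True when the error exceeds the (assumed) tolerance.
    """
    rows = []
    for api, shape, fast_math in product(apis, shapes, fast_math_values):
        error = measure(api, shape, fast_math)
        rows.append((api, shape, fast_math, error, error > tolerance))
    return rows


def stub_measure(api, shape, fast_math):
    # Stand-in with invented values, NOT real measurements.
    if api == "ocl" and fast_math and shape == (1024,):
        return 1.0
    return 1e-7


for row in sweep(stub_measure, ["ocl", "cuda"], [(128, 128), (1024,)], [True, False]):
    print("{}\tshape = {}\tfast_math = {}\t{}\t{}".format(*row))
```

On a machine with the GPU in question, passing `reikna_roundtrip_error` instead of `stub_measure` would run the actual comparison.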

Thanks for looking into that!

  • system/version: Mac OSX 10.11.6, python 2.7.12, reikna 0.6.7

  • the bug persists with the reikna-only code
    GeForce GT 750M: shape = (128, 128) fast_math = True 1.1197116

  • using pycuda/cuda_api works fine
    GeForce GT 750M: shape = (128, 128) fast_math = True 7.882949e-07

  • switching to fast_math = False works fine
    GeForce GT 750M: shape = (128, 128) fast_math = False 4.618963e-07

  • strangely enough, it fails for some 1D shapes, too:

GeForce GT 750M:	shape = (64,)	fast_math = True	2.27608268233e-07
GeForce GT 750M:	shape = (128,)	fast_math = True	2.01620565576e-07
GeForce GT 750M:	shape = (256,)	fast_math = True	3.03235623278e-07
GeForce GT 750M:	shape = (512,)	fast_math = True	3.45577376493e-07
GeForce GT 750M:	shape = (1024,)	fast_math = True	0.976445674896
GeForce GT 750M:	shape = (2048,)	fast_math = True	3.66386217365e-07
GeForce GT 750M:	shape = (4096,)	fast_math = True	1.1036427021
GeForce GT 750M:	shape = (8192,)	fast_math = True	1.1503098011
  • it seems the culprit is the native_sin/native_cos functions, as removing the following #ifdef switch
    #ifdef COMPILE_FAST_MATH
        res.x = native_cos(theta);
        res.y = native_sin(theta);
    #else

in cluda/functions.mako restores normal behaviour. Yet this is strange, as pyfft does almost the same in the kernels with fast_math=True but runs fine on the same GPU.
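To illustrate why a faulty sin/cos is a plausible cause of errors of order 1, here is a pure-Python sketch of a radix-2 FFT whose twiddle factors come from a pluggable function. `coarse_twiddle` is a hypothetical fault model (angles snapped to a coarse grid) chosen only for demonstration; it is not a claim about what the driver actually does. In OpenCL the accuracy of the `native_` functions is implementation-defined, so a large deviation there is at least conceivable.

```python
import math
import random


def fft(values, twiddle, sign=-1):
    """Recursive radix-2 FFT; twiddle(theta) should return cos(theta) + i*sin(theta)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], twiddle, sign)
    odd = fft(values[1::2], twiddle, sign)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        term = twiddle(sign * 2.0 * math.pi * k / n) * odd[k]
        out[k] = even[k] + term
        out[k + half] = even[k] - term
    return out


def roundtrip_error(values, twiddle):
    """Forward transform, normalized inverse, then max |input - output|."""
    n = len(values)
    spectrum = fft(values, twiddle, sign=-1)
    restored = [v / n for v in fft(spectrum, twiddle, sign=+1)]
    return max(abs(a - b) for a, b in zip(values, restored))


def exact_twiddle(theta):
    return complex(math.cos(theta), math.sin(theta))


def coarse_twiddle(theta, step=0.5):
    # Hypothetical fault: the angle is snapped to a grid of `step` radians
    # before sin/cos are evaluated.
    snapped = round(theta / step) * step
    return complex(math.cos(snapped), math.sin(snapped))


random.seed(0)
signal = [complex(random.uniform(-1, 1), 0.0) for _ in range(1024)]
good = roundtrip_error(signal, exact_twiddle)
bad = roundtrip_error(signal, coarse_twiddle)
print("exact twiddles:  {:.3e}".format(good))
print("coarse twiddles: {:.3e}".format(bad))
```

With exact twiddles the roundtrip error stays at rounding level, while the coarse variant should give a far larger one, which is qualitatively the pattern seen in the failing shapes.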

Yes, it is quite strange. Removing the native_cos()/native_sin() usage pretty much negates any performance benefit from fast_math=True, so I would rather not do that.

I suspect there may be some bug in Apple's OpenCL driver (I have found several over the years myself). It is usually some kind of strange interplay between the exact GPU operations invoked and the global/local size. My general approach in such cases is to isolate the offending kernel and start removing parts until I end up with something that reproduces the bug and is small enough to open an issue in Apple's tracker. It is quite a lengthy process, though, and I completely understand if you don't want to go through it.

I have tested the code on OSX 10.11.3 and could not reproduce the bug, but that was with a FirePro video card, so the local sizes used could be different. Could you try a few more things:

  1. Comment out the #if block in cluda/kernel.mako starting from #if defined(cl_khr_fp64). This seems to be one of the differences from pyfft, which only enables the extension when the array has a double-precision datatype.

  2. Check and tell me which global/local sizes reikna and pyfft use (let's say for the smallest array when the bug is reproduced, that is the 1D one with 1024 elements). For the reikna code, add the following lines:

     for call in fft._kernel_calls:
         print(call._kernel.global_size, call._kernel.local_size)
    

    For pyfft code (add after the actual call, since the kernels are created on the first invocation):

     for k in plan._kernels:
         print(k._func_forward._global_size, k._func_forward._block_size)
    

> I suspect there may be some bug in Apple's OpenCL driver (I have found several over the years myself)

Indeed, that was it!!
After installing the Nvidia Web drivers (346.03.15f02) everything was fine again.
So it seems the default drivers on El Capitan (310.42.25f01) have a bug in the native_sin/native_cos functions.

Thanks for your help!