UCBerkeleySETI/rawspec

cufftXtSetCallback returns cufft_rc = CUFFT_NOT_SUPPORTED - CUDA 11.4.1

radonnachie opened this issue · 12 comments

A segmentation fault was encountered when running rawspec (compiled with a prior CUDA version) on a machine with CUDA 11.4.1:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6244e2380c in cufftResult_t Visitors::Callback::set_dims_cb<fftDimensionClass*>(Config::Legacy const&, device const&, Operation::CallbackInfo*, cufftXtCallbackType_t, fftDimensionClass**, void**, void**, bool, bool) ()
   from /opt/mnt/lib/librawspec.so
[Current thread is 1 (Thread 0x7f6244751000 (LWP 264117))]
(gdb) backtrace
#0  0x00007f6244e2380c in cufftResult_t Visitors::Callback::set_dims_cb<fftDimensionClass*>(Config::Legacy const&, device const&, Operation::CallbackInfo*, cufftXtCallbackType_t, fftDimensionClass**, void**, void**, bool, bool) ()
   from /opt/mnt/lib/librawspec.so
#1  0x00007f6244e21b54 in Visitors::Callback::Set::process(Operation::LegacyFFT::CT_C2C&) ()
   from /opt/mnt/lib/librawspec.so
#2  0x00007f6244e1e754 in Visitors::Callback::Set::operator()(Operation::Queue&) () from /opt/mnt/lib/librawspec.so
#3  0x00007f6244d169aa in cufftXtSetCallback () from /opt/mnt/lib/librawspec.so
#4  0x00007f6244d0e1bb in rawspec_initialize (ctx=0x7ffc3574d580) at rawspec_gpu.cu:1024
#5  0x000055ff73add680 in ?? ()
#6  0x0000000000000000 in ?? ()

After recompiling with CUDA 11.4.1, the segmentation fault is replaced by a less helpful cufft_rc error:
got error CUFFT_NOT_SUPPORTED at rawspec_gpu.cu:1029

This rc comes from the cufftXtSetCallback call for h_cufft_load_callback, and is defined as CUFFT_NOT_SUPPORTED = 16 // Operation is not supported for parameters given.
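For reference, here is a minimal, hypothetical sketch of that call pattern (plan creation omitted; load_cb and d_load_cb are illustrative stand-ins for rawspec's actual callback, and note that cuFFT callbacks require building with relocatable device code and linking libcufft_static):

#include <cstdio>
#include <cufftXt.h>

/* Hypothetical device-side load callback; rawspec's real callbacks do the
   sample expansion, this one just passes complex data through. */
__device__ cufftComplex load_cb(void *in, size_t offset, void *info, void *smem) {
  return ((cufftComplex *)in)[offset];
}
__device__ cufftCallbackLoadC d_load_cb = load_cb;

/* Attach the load callback to an existing plan and report the cufftResult. */
cufftResult set_load_callback(cufftHandle plan) {
  cufftCallbackLoadC h_cufft_load_callback; /* host copy of device fn pointer */
  cudaMemcpyFromSymbol(&h_cufft_load_callback, d_load_cb,
                       sizeof(h_cufft_load_callback));
  cufftResult rc = cufftXtSetCallback(plan, (void **)&h_cufft_load_callback,
                                      CUFFT_CB_LD_COMPLEX, NULL);
  if (rc != CUFFT_SUCCESS) /* CUFFT_NOT_SUPPORTED == 16 lands here */
    fprintf(stderr, "got error %d at %s:%d\n", (int)rc, __FILE__, __LINE__);
  return rc;
}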

I can't immediately see any information in the latest CUDA Toolkit documentation that might help.

FYI: we upgraded our NVIDIA drivers to accommodate the new A6000s, and we did this on all our seti-nodes (i.e. also those with RTX 3090s).
Having said that, @RocketRoss did see the same issue described above when running the same rawspec on an RTX 3090.

Have you rebuilt rawspec with the new CUDA version?

Yes, rawspec was rebuilt with the new NVIDIA driver in place.

Are you sure you're getting matching rawspec and librawspec versions? What does rawspec -v show? If that looks good, can you please cherry-pick f95cb8d and check rawspec -v again after rebuilding?

The cufftXtSetCallback function is still supported. The only recent(ish) change that seems like it might affect it was introduced in CUDA 11.2, but I can run rawspec built with CUDA 11.3.0, so I don't think that's relevant here.

This is the output of rawspec -v

rawspec 2.3.1+90@gab74e5c-dirty
librawspec 2.3.1+90@gab74e5c-dirty

Those match, which is good, but please also cherry-pick the commit I mentioned and run that again. That commit adds the cuFFT version to the librawspec output.

Alternatively, you could run ldd $(type -p rawspec) (or ldd /full/path/to/rawspec) to make sure that you are picking up the librawspec.so file that you are expecting.

Is it possible this error is actually occurring earlier, but only gets reported after calling cufftXtSetCallback?

There is an if cuda_rc != SUCCESS check just before the cufftXtSetCallback call, so it's not a delayed catch.

I'll cherry-pick and find some 8-bit raw files soon. Thanks!

So, f95cb8d produced rawspec reporting librawspec 2.4.1+dirty cuFFT 10.5.1.100
And running it on an 8-bit file showed no issues. Then, running my latest branch on the 8-bit file also showed no issues. So I'll be tracking down what the 4-bit capability introduces that causes the issue and sorting it out on that branch on my fork.

I don't see a way to migrate this issue to my fork (possibly by design), so I'll keep track of progress here.

After scratching around and not seeing how the 4-bit capabilities could possibly affect the plans, I found that the issue is definitively linked to the FFT size passed in. Re-approaching the original 4-bit raw file:

working stem: /mnt/buf1/ENDURANCETEST/GUPPI/./guppi_59447_17004_6377981_frb180916_0001
opening file: /mnt/buf1/ENDURANCETEST/GUPPI/./guppi_59447_17004_6377981_frb180916_0001.0000.raw
number of bits per sample must be 8 or 16 (not 4), using 8 bps
Nc:             6400
Np:             2
Ntpb:           10480
Nbps:           8
buf_size:       268304384
buf_size >> 15: 8188
No:             1
  Nts[0]:       5240
  Nss[0]:       2
got error CUFFT_NOT_SUPPORTED at rawspec_gpu.cu:1056
rawspec initialization failed
output product 0: 0 spectra
(base) sonata@seti-node4:~$ /opt/mnt/bin/rawspec -f 2620 -t 2 -d /mnt/buf1/rawspec/guppi_59447_16475_6365362_frb180916_0001/ /mnt/buf1/ENDURANCETEST/GUPPI/./guppi_59447_17004_6377981_frb180916_0001
working stem: /mnt/buf1/ENDURANCETEST/GUPPI/./guppi_59447_17004_6377981_frb180916_0001
opening file: /mnt/buf1/ENDURANCETEST/GUPPI/./guppi_59447_17004_6377981_frb180916_0001.0000.raw
number of bits per sample must be 8 or 16 (not 4), using 8 bps
Nc:             6400
Np:             2
Ntpb:           10480
Nbps:           8
buf_size:       268304384
buf_size >> 15: 8188
No:             1
  Nts[0]:       2620
  Nss[0]:       4
CUDA memory initialised for 8 bits per sample,
        will expand header specified 4 bits per sample.
... Success

The raw file converted to 8-bit showed the same limitation.

So, I think a question to answer is whether or not this behaviour was introduced in CUDA 11.4 (highly unlikely); the follow-up is how/why these values are an issue.

The PIPERBLK value of the RAW files that provoked the CUFFT_NOT_SUPPORTED error had prime factors of 2 and 2621, while the RAW files that were successfully processed had a PIPERBLK value with prime factors of 2, 7 and 13.

I recall reading somewhere that the largest prime factor allowed is 13, but cannot find such a statement anywhere in the documentation. The cuFFT documentation mentions only that optimisation is made for prime factors up to and including 7, and that powers of 2 are fastest.

Why -f 2621 worked on the former and not -f 5242 is beyond me, but I sought to remove the prime factors other than 2 from the PIPERBLK by adjusting the data-acquisition pipeline and haven't looked back since.

I think the cuFFT reference you're searching for is https://docs.nvidia.com/cuda/cufft/index.html#accuracy-and-performance

Basically, FFT lengths whose prime factors are all 127 or smaller will be (relatively) efficient. For FFT lengths with prime factors larger than that, cuFFT resorts to Bluestein's algorithm, which is less efficient in terms of both time and memory requirements. As you noted, 2621 is prime and 2620 == 2^2 * 5 * 131, so those (and 5240) will (attempt to) use Bluestein's algorithm.
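To make the 127 threshold concrete, here is a quick standalone check (plain C, not part of rawspec) of the largest prime factor for the lengths in question:

#include <stdio.h>

/* Largest prime factor of n by trial division; cuFFT falls back to
   Bluestein's algorithm when this exceeds 127. */
static unsigned long largest_prime_factor(unsigned long n) {
  unsigned long largest = 1;
  for (unsigned long p = 2; p * p <= n; p++)
    while (n % p == 0) { largest = p; n /= p; }
  return n > 1 ? n : largest;
}

int main(void) {
  unsigned long lens[] = { 2620, 2621, 5240, 5242, 10480 };
  for (int i = 0; i < 5; i++) {
    unsigned long lpf = largest_prime_factor(lens[i]);
    printf("N=%5lu  largest prime factor=%4lu  -> %s\n", lens[i], lpf,
           lpf > 127 ? "Bluestein" : "mixed-radix");
  }
  return 0;
}

All five lengths report a largest prime factor above 127 (131 or 2621), so all of them hit the Bluestein path.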

FWIW, at MeerKAT we calculate an "effective block size" that is the largest block size with a power-of-two number of time samples that fits within the Hashpipe data buffer's block size (128 MiB for hpguppi_daq). This becomes the BLOCSIZE in the resultant GUPPI RAW files. The effective block size varies depending on how many antennas and channels the radio telescope subarray is configured with for any given observation. This avoids dimensions with unfortunate factorizations.
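For illustration, a rough sketch of that search under the dimensions from the logs above (the Np * Nc * 2 * Nbps/8 bytes-per-time-sample arithmetic is an assumption for illustration, not hpguppi_daq's actual code):

#include <stdio.h>

/* Sketch of the "effective block size" idea: find the largest power-of-two
   number of time samples per block whose resulting block size still fits in
   the 128 MiB Hashpipe data block. */
int main(void) {
  const unsigned long block_bytes = 128UL * 1024 * 1024;    /* 128 MiB */
  unsigned long nc = 6400, np = 2, nbps = 8;                /* from the logs above */
  unsigned long bytes_per_time = np * nc * 2 * (nbps / 8);  /* complex pairs */

  unsigned long ntpb = 1;
  while (2 * ntpb * bytes_per_time <= block_bytes)
    ntpb *= 2; /* stop at the largest power of two that fits */

  printf("effective Ntpb = %lu, BLOCSIZE = %lu bytes\n",
         ntpb, ntpb * bytes_per_time);
  return 0;
}

With these numbers the search lands on Ntpb = 4096, a pure power of two, which sidesteps awkward factors like 131 entirely.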