Inferred limit on FFT size: res_desc.res.pitch2D.width = 1<<15;
Increasing the FFT size past the PIPERBLK of the RAW data to be processed increases the number of blocks that are processed at a time. At some point, though, the buf_size exceeds 2^15 * 2^15 bytes, which results in an error being raised when the load_callback's texture object is created:
buf_size = ctx->Nb*ctx->Ntpb*ctx->Np*ctx->Nc* 2/*complex*/ *(ctx->Nbps/8);
...
memset(&res_desc, 0, sizeof(res_desc));
res_desc.resType = cudaResourceTypePitch2D;
res_desc.res.pitch2D.devPtr = gpu_ctx->d_fft_in;
res_desc.res.pitch2D.desc.f = cudaChannelFormatKindSigned;
res_desc.res.pitch2D.desc.x = ctx->Nbps; // bits per sample
res_desc.res.pitch2D.width = 1<<15; // elements
res_desc.res.pitch2D.height = buf_size>>15; // elements
res_desc.res.pitch2D.pitchInBytes = (1<<15) * (ctx->Nbps/8); // bytes!
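For concreteness, here is a minimal standalone sketch (not rawspec code) that reproduces the failing geometry, using the ctx values from the debug printouts later in this thread:

#include <stdio.h>

/* Illustrative only: recompute buf_size and the resulting 2D texture
 * shape for the failing run (Nb=8, Ntpb=8192, Np=2, Nc=8192, Nbps=8). */
int main(void) {
  size_t Nb = 8, Ntpb = 8192, Np = 2, Nc = 8192, Nbps = 8;
  size_t buf_size = Nb * Ntpb * Np * Nc * 2 /*complex*/ * (Nbps / 8);
  printf("buf_size       = %zu bytes\n", buf_size);      /* 2147483648 */
  printf("width  (1<<15) = %d elements\n", 1 << 15);     /* 32768      */
  printf("height (>>15)  = %zu rows\n", buf_size >> 15); /* 65536      */
  /* 32768 * 65536 = 2^31 one-byte elements, covering the whole buffer,
   * yet creating the texture object fails with cudaErrorInvalidValue. */
  return 0;
}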
Error:
working stem: ./guppi_59229_47368_006379_Unknown_0001
opening file: ./guppi_59229_47368_006379_Unknown_0001.0000.raw
Splitting output per 2 antennas
number of bits per sample must be 8 or 16 (not 4), using 8 bps
ctx->Nb: 8
buf_size: 2147483648
buf_size: 2147483648
buf_size>>15: 65536
1<<15: 32768
got error cudaErrorInvalidValue at rawspec_gpu.cu:654
Why was the width chosen to be a fixed 2^15?
Would there be any issue in making it dynamic, based on the number of blocks required? It would mean changing the width and the corresponding static adjustments in the load_callback.
Otherwise, it would be nice to have a warning about this limit.
The "free" conversion from 8-bit integer to 32-bit float happens via the "texture object" (aka "texture reference") feature of the CUDA GPU. The texture object needs to cover the entire 8-bit input buffer. Using 1D textures limited the input buffer size to 2^28 (or 2^27 depending on the GPU's compute capability), which was too small so I switched to 2D textures. The dimensionality of the texture mapping is independent from the input buffer dimensionality. I'm not sure why I chose 2^15 as the first dimension. The max first dimension is 2^16 (or 2^17 depending on the GPU's compute capability).
That all said, I'm not sure why 32768 x 65536 is giving an error. That should be acceptable from what I can tell. Maybe it's actually considered a "layered texture", which apparently has a smaller max size. What does deviceQuery show for your GPU? Here's what it shows for an RTX 2080 Ti:
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
The first line suggests that the texture can be four times larger (up to 131072 x 65536) than what you're trying to do (32768 x 65536). The "Layered 2D Texture Size" does have a limit of 32768 x 32768, but up to 2048 layers. Maybe that's another dimension that could be tapped into?
You could try changing the "1<<15" to "1<<16" to see if that helps/works (and make the other corresponding changes you alluded to).
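One thing worth checking programmatically: for texture objects backed by pitched linear memory (cudaResourceTypePitch2D), the applicable limits are reported by cudaDeviceProp::maxTexture2DLinear, which can differ from the plain 2D-texture line that deviceQuery prints. A minimal sketch to dump them:

#include <stdio.h>
#include <cuda_runtime.h>

/* Sketch: print the limits that govern cudaResourceTypePitch2D texture
 * objects (width/height in elements, max pitch in bytes). */
int main(void) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
  printf("maxTexture2DLinear width  = %d elements\n", prop.maxTexture2DLinear[0]);
  printf("maxTexture2DLinear height = %d elements\n", prop.maxTexture2DLinear[1]);
  printf("maxTexture2DLinear pitch  = %d bytes\n",    prop.maxTexture2DLinear[2]);
  return 0;
}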
The 3090 I'm working with has the same limitations as the 2080Ti.
My load_texture_width branch off of master makes the texture's width depend on a single line of code.
I confirmed that the implementation outputs an equivalent filterbank file.
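Something along these lines, presumably (a hypothetical sketch; the actual branch may differ, and LOAD_TEXTURE_WIDTH_LOG2 is an invented name):

#define LOAD_TEXTURE_WIDTH_LOG2 16  /* single control point (hypothetical name) */

res_desc.res.pitch2D.width        = 1 << LOAD_TEXTURE_WIDTH_LOG2;        /* elements */
res_desc.res.pitch2D.height       = buf_size >> LOAD_TEXTURE_WIDTH_LOG2; /* elements */
res_desc.res.pitch2D.pitchInBytes = ((size_t)1 << LOAD_TEXTURE_WIDTH_LOG2)
                                    * (ctx->Nbps / 8);                   /* bytes!  */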
Bumping the width up to 1<<16 did enable the previously troublesome FFT size, though with an odd printout just before it concluded, highlighting a peculiarity I don't quite understand:
~/dev/rawspec/rawspec -f 65536 -t 4 guppi_59229_47368_006379_Unknown_0001 -d ./
working stem: guppi_59229_47368_006379_Unknown_0001
opening file: guppi_59229_47368_006379_Unknown_0001.0000.raw
number of bits per sample must be 8 or 16 (not 4), using 8 bps
buf_size: 8
buf_size: 65536
buf_size: 131072
buf_size: 1073741824
ctx->Nb: 8
ctx->Ntpb: 8192
ctx->Nc: 8192
ctx->Np: 2
ctx->Nbps: 8
buf_size: 2147483648
buf_size: 2147483648
buf_size>>16: 32768
1<<16: 65536
CUDA memory initialised for 8 bits per sample,
will expand header specified 4 bits per sample.
opening file: guppi_59229_47368_006379_Unknown_0001.0001.raw [No such file or directory]
Message from syslogd@seti-node4 at Feb 22 10:30:42 ...
kernel:[593486.411905] watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [rawspec:824681]
output product 0: 4 spectra
I then pushed for a still larger FFT, since the 2D limits weren't being hit yet, but the 1<<16 configuration produced the same cudaErrorInvalidValue. So I moved to 1<<17, which instead produced a cudaErrorMemoryAllocation error, with a similar soft-lockup warning:
~/dev/rawspec/rawspec -f 131072 -t 4 guppi_59229_47368_006379_Unknown_0001 -d ./
working stem: guppi_59229_47368_006379_Unknown_0001
opening file: guppi_59229_47368_006379_Unknown_0001.0000.raw
number of bits per sample must be 8 or 16 (not 4), using 8 bps
ctx->Nb: 16
ctx->Ntpb: 8192
ctx->Nc: 8192
ctx->Np: 2
ctx->Nbps: 8
buf_size: 4294967296
buf_size: 4294967296
buf_size>>17: 32768
1<<17: 131072
got error cudaErrorMemoryAllocation at rawspec_gpu.cu:1104
Message from syslogd@seti-node4 at Feb 22 10:36:42 ...
kernel:[593846.413082] watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [rawspec:825862]
rawspec initialization failed
output product 0: 0 spectra
The 1<<17 configuration did succeed with the originally troublesome FFT size, so 2^17 is a valid width, while 1<<18 always produced the cudaErrorInvalidValue
: of course this is because of the dimension limits on 2D textures.
I think the single-line control point is worth merging, as it controls something that is architecture dependent. The soft lockup seems to correspond to buf_size exceeding the GPU's maximum memory pitch ("Maximum memory pitch: 2147483647 bytes" per deviceQuery), though I don't think there is a logical connection there.
I'd like to implement a warning that explains the imminent failure, but I'm still not sure what is causing it; it's not completely linked to the dimension limits... Wondering if there is a finer-grained error code to be had...
I've been revisiting this issue in order to achieve -f 262144 with a RAW file of 6400 channels (5 antennas of 1280 coarse channels each).
My findings are that the cudaErrorMemoryAllocation arises when the GPU doesn't have enough memory (of course!), and that the CPU soft lockups have been mitigated by disabling the IOMMU in the BIOS.
As a footnote, the above-mentioned rawspec run (with ICS) uses 43 GB of device memory.
#22 not only lessens the memory usage but also provides up-front error messages when it detects that the required memory exceeds the device memory.
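For illustration, a check of that kind might look like the following hedged sketch (check_device_memory and required_bytes are invented names; the actual accounting in #22 may differ):

#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical sketch: fail early with a clear message instead of a bare
 * cudaErrorMemoryAllocation when the planned buffers won't fit. */
int check_device_memory(size_t required_bytes) {
  size_t free_bytes, total_bytes;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) return -1;
  if (required_bytes > free_bytes) {
    fprintf(stderr,
            "need %zu bytes of device memory, but only %zu of %zu are free\n",
            required_bytes, free_bytes, total_bytes);
    return -1;
  }
  return 0;
}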