CUDA 11.3+ Register Usage
ptheywood opened this issue · 4 comments
Register usage under CUDA 11.3 appears to be significantly higher than under previous CUDA versions, especially for the iteration kernel in the boids_bruteforce example.
This is probably worth promoting to an nvbug report.
Steps to reproduce
git clone git@github.com:FLAMEGPU/FLAMEGPU2.git
cd FLAMEGPU2
git checkout c3524e6
mkdir -p build && cd build
# Ensure the correct CUDA version to check is on the path / use module load
cmake .. -DCUDA_ARCH=70 -DSEATBELTS=OFF -DUSE_NVTX=ON
make -j 8 flamegpu2 boids_bruteforce
# Inspect the ptxas register report for `_Z22agent_function_wrapperI14inputdata_impl...`
# e.g. ptxas info : Used 170 registers, 408 bytes cmem[0], 4 bytes cmem[2]
Or to generate a profile:
ncu --set=full -f -o 11-x ./bin/linux-x64/Release/boids_bruteforce -t -s 1
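As a cross-check independent of the ptxas output and Nsight Compute, per-thread register usage can also be queried at runtime with cudaFuncGetAttributes. A minimal sketch (the kernel here is a hypothetical placeholder, not the actual FLAMEGPU agent function wrapper):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, standing in for the real
// agent_function_wrapper instantiation from the FLAMEGPU build.
__global__ void placeholder_kernel(float *out) {
    out[threadIdx.x] = static_cast<float>(threadIdx.x) * 2.0f;
}

int main() {
    cudaFuncAttributes attr;
    // numRegs reports the per-thread register count of the kernel as
    // compiled for the current device.
    cudaError_t status = cudaFuncGetAttributes(&attr, placeholder_kernel);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("registers/thread: %d\n", attr.numRegs);
    return 0;
}
```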
CUDA Version | Reg/thread |
---|---|
11.0 | 60 |
11.1 | 60 |
11.2 | 70 |
11.3 | 170 |
The above results were built for SM 70 as of 70c2e17, although they should match results from c3524e6; 70c2e17 only adds verbose ptxas output so that profiling is not required.
Enabling LTO brings it down a little, but not significantly (~156).
When built for SM 61 instead, 162 registers are used.
CUDA 11.3 introduces a way to dump the device callgraph at link time (via the CMake below). This doesn't provide any useful information, just confirming that the kernel uses 170 reg/thread (its 2 sub-calls both use < 30 registers, so it's not a sub-call issue).
add_link_options("$<DEVICE_LINK:SHELL:-Xnvlink -dump-callgraph>")
By commenting out sections of the inputdata method in examples/boids_bruteforce/src/main.cu, some more insight can be gained into why this kernel has higher register use and where the main source of the difference comes from; a condensed sketch of the function, with each region labelled, follows the table.
Configuration | 11.0 reg/thread | 11.3 reg/thread |
---|---|---|
Just a return statement | 8 | 8 |
Message loop commented out | 46 | 46 |
Message loop with no body | 56 | 112 |
Message loop with just getVariable | 40 | 142 |
Perceived count being updated | 50 | 160 |
Global velocity being updated in loop | 50 | 160 |
Fully enabled (collision update/check) | 60 | 170 |
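For reference, here is a condensed sketch of the shape of the inputdata function, written against the current flamegpu:: API names rather than copied from the source at c3524e6; variable names and the write-back step are illustrative only. The comments mark the regions toggled for each table row.

```cuda
// Condensed sketch, not the exact source in examples/boids_bruteforce/src/main.cu.
FLAMEGPU_AGENT_FUNCTION(inputdata, flamegpu::MessageBruteForce, flamegpu::MessageNone) {
    // Agent variable loads; with the message loop commented out,
    // only these and the write-back below remain.
    const float agent_x = FLAMEGPU->getVariable<float>("x");
    const float agent_y = FLAMEGPU->getVariable<float>("y");
    const float agent_z = FLAMEGPU->getVariable<float>("z");
    int perceived_count = 0;
    float global_velocity_x = 0.0f;  // one of several loop accumulators
    // "Message loop with no body" keeps only the iteration itself:
    for (const auto &message : FLAMEGPU->message_in) {
        // "Message loop with just getVariable":
        const float message_x = message.getVariable<float>("x");
        // "Perceived count being updated":
        perceived_count++;
        // "Global velocity being updated in loop":
        global_velocity_x += message.getVariable<float>("fx");
        // "Fully enabled" adds the separation/collision check and update,
        // using agent_x/agent_y/agent_z against the message position.
    }
    // Illustrative write-back; the real model averages and applies steering rules.
    if (perceived_count)
        global_velocity_x /= perceived_count;
    FLAMEGPU->setVariable<float>("fx", global_velocity_x);
    return flamegpu::ALIVE;
}
```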
An experimental build of the boids_spatial3D model in the rdc_off branch, which builds without relocatable device code, uses 157 registers/thread rather than 170, so RDC has an impact but the register use regression is not RDC specific.
This uses the following to enable compiler output of register use:
add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:SHELL:-Xptxas -v>")
CUDA 11.4 also uses 170 reg/thread.
The experimental shared memory curve implementation (the cineca-experimental-smcurve branch) appears to reduce the register usage back to much more sane levels. boids_bruteforce uses only 64 reg/thread with CUDA 11.4 on that branch (unsure why cmem is still being reported):
ptxas info : Used 64 registers, 440 bytes cmem[0]
11.3 and 11.2 both use 65 reg / thread.
CUDA 11.5, SEATBELTS=OFF is 168 reg/thread for SM 70.
SEATBELTS=ON, SM 70 is 218 reg/thread.
SEATBELTS=ON, SM 61 is 175 reg/thread.
Shared mem curve will still be required to improve perf.
CUDA 11.7, SM_86, SEATBELTS=OFF is 145 reg/thread; 160 for SM_70, so still poor.