CUDA 11.3+ Register Usage
ptheywood opened this issue · 4 comments
Register usage under CUDA 11.3 appears to be significantly higher than under previous CUDA versions, especially for the iteration kernel in the boids_bruteforce example.
This is probably worth promoting to an nvbug report.
Steps to reproduce
git clone git@github.com:FLAMEGPU/FLAMEGPU2.git
cd FLAMEGPU2
git checkout c3524e6
mkdir -p build && cd build
# Ensure the correct CUDA version to check is on the path / use module load
cmake .. -DCUDA_ARCH=70 -DSEATBELTS=OFF -DUSE_NVTX=ON
make -j 8 flamegpu2 boids_bruteforce
# Inspect the ptxas register report for `_Z22agent_function_wrapperI14inputdata_impl...`
# e.g. ptxas info : Used 170 registers, 408 bytes cmem[0], 4 bytes cmem[2]
Or to generate a profile:
ncu --set=full -f -o 11-x ./bin/linux-x64/Release/boids_bruteforce -t -s 1
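As a cross-check independent of the ptxas output and Nsight Compute, per-thread register usage can also be queried at runtime with cudaFuncGetAttributes. A minimal sketch (the kernel here is a hypothetical placeholder, not the actual FLAMEGPU agent function wrapper):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, standing in for the real
// agent_function_wrapper instantiation from the FLAMEGPU build.
__global__ void placeholder_kernel(float *out) {
    out[threadIdx.x] = static_cast<float>(threadIdx.x) * 2.0f;
}

int main() {
    cudaFuncAttributes attr;
    // numRegs reports the per-thread register count of the kernel as
    // compiled for the current device.
    cudaError_t status = cudaFuncGetAttributes(&attr, placeholder_kernel);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("registers/thread: %d\n", attr.numRegs);
    return 0;
}
```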
CUDA Version | Reg/thread |
---|---|
11.0 | 60 |
11.1 | 60 |
11.2 | 70 |
11.3 | 170 |
The above results were built for SM 70 as of 70c2e17, although they should match results from c3524e6; 70c2e17 only adds verbose ptxas output so that profiling is not required.
Enabling LTO brings it down a little, but not significantly (~156).
When built for SM 61 instead, 162 registers are used.
CUDA 11.3 introduces a way to dump the device callgraph at link time (via the CMake below). This doesn't provide any useful information, just confirming that the kernel uses 170 reg/thread (its 2 sub-calls both use < 30 registers, so it's not a sub-call issue).
add_link_options("$<DEVICE_LINK:SHELL:-Xnvlink -dump-callgraph>")
By commenting out sections of the inputdata method in examples/boids_bruteforce/src/main.cu, some more insight can be gained into why this kernel has higher register use and where the main source of the difference comes from; a condensed sketch of the function, with each region labelled, follows the table.
Configuration | 11.0 reg/thread | 11.3 reg/thread |
---|---|---|
Just a return statement | 8 | 8 |
Message loop commented out | 46 | 46 |
Message loop with no body | 56 | 112 |
Message loop with just getVariable | 40 | 142 |
Perceived count being updated | 50 | 160 |
Global velocity being updated in loop | 50 | 160 |
Fully enabled (collision update/check) | 60 | 170 |
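For reference, here is a condensed sketch of the shape of the inputdata function, written against the current flamegpu:: API names rather than copied from the source at c3524e6; variable names and the write-back step are illustrative only. The comments mark the regions toggled for each table row.

```cuda
// Condensed sketch, not the exact source in examples/boids_bruteforce/src/main.cu.
FLAMEGPU_AGENT_FUNCTION(inputdata, flamegpu::MessageBruteForce, flamegpu::MessageNone) {
    // Agent variable loads; with the message loop commented out,
    // only these and the write-back below remain.
    const float agent_x = FLAMEGPU->getVariable<float>("x");
    const float agent_y = FLAMEGPU->getVariable<float>("y");
    const float agent_z = FLAMEGPU->getVariable<float>("z");
    int perceived_count = 0;
    float global_velocity_x = 0.0f;  // one of several loop accumulators
    // "Message loop with no body" keeps only the iteration itself:
    for (const auto &message : FLAMEGPU->message_in) {
        // "Message loop with just getVariable":
        const float message_x = message.getVariable<float>("x");
        // "Perceived count being updated":
        perceived_count++;
        // "Global velocity being updated in loop":
        global_velocity_x += message.getVariable<float>("fx");
        // "Fully enabled" adds the separation/collision check and update,
        // using agent_x/agent_y/agent_z against the message position.
    }
    // Illustrative write-back; the real model averages and applies steering rules.
    if (perceived_count)
        global_velocity_x /= perceived_count;
    FLAMEGPU->setVariable<float>("fx", global_velocity_x);
    return flamegpu::ALIVE;
}
```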
An experimental build of the boids_spatial3D model in the rdc_off branch, which builds without relocatable device code, uses 157 registers/thread rather than 170, so RDC has an impact but the register use regression is not RDC specific.
This uses the following to enable compiler output of register use:
add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:SHELL:-Xptxas -v>")
CUDA 11.4 also uses 170 reg/thread.
The experimental shared memory curve implementation (the cineca-experimental-smcurve branch) appears to reduce the register usage back to much more sane levels. boids_bruteforce uses only 64 reg/thread with CUDA 11.4 on that branch (unsure why cmem is still being reported):
ptxas info : Used 64 registers, 440 bytes cmem[0]
11.3 and 11.2 both use 65 reg / thread.
CUDA 11.5, SEATBELTS=OFF is 168 reg/thread for SM 70.
SEATBELTS=ON, SM 70 is 218 reg/thread.
SEATBELTS=ON, SM 61 is 175 reg/thread.
Shared mem curve will still be required to improve perf.
CUDA 11.7, SM_86, SEATBELTS=OFF is 145 reg/thread; 160 for SM_70, so still poor.