acts-project/traccc

Incompatibility with CUDA 11.3

stephenswat opened this issue · 6 comments

This is a quick continuation of #113, where we found that traccc is currently not compatible with CUDA 11.3. I would like to know why, so I'll keep this issue as a running log of my findings.

This is the compatibility matrix of CUDA versions installed on atspot01:

| CUDA toolkit version | Works |
| --- | --- |
| 10.1.243 | ❌ (expected) |
| 10.2.89 | ❌ (expected) |
| 11.0.3 | ✔️ |
| 11.1.1 | ✔️ |
| 11.2.2 | ✔️ |
| 11.3.1 | ❌ |
| 11.4.3 | ✔️ |
| 11.5.0 | ✔️ |
| 11.5.1 | ✔️ |
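Until the cause is understood, one way to fail early on a toolkit from a failing row is a compile-time guard. The following is only a sketch: the function name `cuda_toolkit_works` is hypothetical, it checks just the major and minor version (so it rejects every 11.3.x release, although only 11.3.1 was tested here), and it assumes that versions newer than those in the matrix work, which is a guess. `__CUDACC_VER_MAJOR__` and `__CUDACC_VER_MINOR__` are the version macros that nvcc defines.

```cpp
// Hypothetical helper: true when the given toolkit version is not in a
// failing row of the matrix above. Only major and minor are compared, so all
// of 11.3.x is rejected. Versions newer than those tested are ASSUMED to work.
constexpr bool cuda_toolkit_works(int major, int minor) {
    return major > 11 || (major == 11 && minor != 3);
}

// nvcc defines these macros; a plain host compiler does not, so the check is
// skipped entirely when this header is compiled without nvcc.
#if defined(__CUDACC_VER_MAJOR__) && defined(__CUDACC_VER_MINOR__)
static_assert(cuda_toolkit_works(__CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__),
              "This CUDA toolkit version is marked as not working in the "
              "compatibility matrix");
#endif
```

A guard like this does not fix anything; it only turns a confusing template error into a readable message.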

Okay, there is some extremely odd behaviour happening inside nvcc, and I think specifically inside cudafe++. It seems that for CUDA toolkit 11.3.1, the cudafe1.cpp translations of counting_grid_capacities.cu and populating_grid.cu incorrectly reference spacepoint_t:

zone_t(scalar v, const spacepoint_t< neighbor_t, 2>  &nhood) const {
dindex_sequence zone(scalar v, const spacepoint_t< dindex, 2U>  &nhood) const {
dindex_sequence zone(scalar v, const spacepoint_t< dindex, 2U>  &nhood) const {

Here are the corresponding lines for CUDA 11.4.3:

zone_t(scalar v, const array_type< neighbor_t, 2>  &nhood) const {
dindex_sequence zone(scalar v, const array_type< dindex, 2U>  &nhood) const {
dindex_sequence zone(scalar v, const array_type< scalar, 2U>  &nhood) const {

The corresponding lines from detray/core/include/detray/grids/axis.hpp are:

zone_t(scalar v, const array_type<neighbor_t, 2> &nhood) const {
dindex_sequence zone(scalar v, const array_type<dindex, 2> &nhood) const {
dindex_sequence zone(scalar v, const array_type<scalar, 2> &nhood) const {

The cudafe1.cpp files are generated from the corresponding .cpp4.ii files by cudafe++.

I can confirm that the .cpp4.ii files have identical versions of these lines.

Invoking the two versions of cudafe++ (11.3.1 and 11.4.3) on exactly the same input (the cudafe1.stub.cpp generated by cicc 11.3.1, and the .cpp4.ii by the 11.3.1 preprocessor) results in the same behaviour: the 11.3.1 version erroneously inserts spacepoint_t where it shouldn't be. Running the 11.2.2 version of cudafe++ produces the same correct output that 11.4.3 does.

Okay, I am sufficiently convinced that this is a bug in cudafe++.

Okay, I can't really debug this any further, because cudafe++ is opaque as hell, and as far as I know there aren't really any changelogs or documentation for it. However, I have narrowed the issue down to the detray::axis::regular type. My guess is that cudafe++ can't cope with the complex kind (* → uint → *) → (*^n → *) → *, that is, a template whose first parameter is itself a template over a type and an unsigned integer, and whose second parameter is a variadic template over types. I suspect the failure is either due to the n-ary nature of the kind of the second type parameter, or because the first type parameter accepts a non-* kind.
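For readers who don't think in kinds, here is a rough sketch of what a template of that shape looks like in C++. This is not the detray code: `regular_axis`, its members, and the binning logic are hypothetical simplifications, and the integer parameter is `std::size_t` rather than an unsigned int so that `std::array` can be passed in directly.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the project's scalar and index types.
using scalar = float;
using dindex = std::size_t;

// Hypothetical, simplified axis. Its two template template parameters give it
// the kind (* -> integer -> *) -> (*^n -> *) -> *.
template <template <typename, std::size_t> class array_type,
          template <typename...> class vector_type>
struct regular_axis {
    using neighborhood = array_type<dindex, 2>;
    using dindex_sequence = vector_type<dindex>;

    dindex n_bins;
    scalar min;
    scalar max;

    // Bin index of v, clamped to [0, n_bins - 1].
    dindex bin(scalar v) const {
        const scalar step = (max - min) / static_cast<scalar>(n_bins);
        const long ibin = static_cast<long>((v - min) / step);
        if (ibin < 0) { return 0; }
        if (ibin >= static_cast<long>(n_bins)) { return n_bins - 1; }
        return static_cast<dindex>(ibin);
    }

    // Same parameter shape as the `zone` overloads quoted above: the second
    // argument is spelled through the template template parameter.
    dindex_sequence zone(scalar v, const array_type<dindex, 2>& nhood) const {
        const dindex centre = bin(v);
        const dindex first = centre > nhood[0] ? centre - nhood[0] : 0;
        const dindex last = std::min(centre + nhood[1], n_bins - 1);
        dindex_sequence bins;
        for (dindex i = first; i <= last; ++i) { bins.push_back(i); }
        return bins;
    }
};

using host_axis = regular_axis<std::array, std::vector>;

// What a correct front end must preserve: inside the class, array_type is
// whatever was passed in, never an unrelated type.
static_assert(std::is_same<host_axis::neighborhood,
                           std::array<dindex, 2>>::value,
              "array_type<dindex, 2> must resolve to std::array<dindex, 2>");
```

With `host_axis axis{10, 0.0f, 10.0f};`, the call `axis.zone(4.5f, {1, 1})` yields the bins 3, 4 and 5. The `static_assert` is the interesting part: it states the property that the 11.3.1 output appears to violate when it writes spacepoint_t in place of the array type.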

The symptom of this is that it starts substituting (seemingly) random (incompatible) types, such as spacepoint_t, where it expects the array type or the vector type. I presume that this might be some kind of indexing error happening at template resolution time, but I don't have enough evidence to make any concrete claims.

To conclude, CUDA 11.3.1 is completely bat-shit insane. The only remaining next step might be to investigate CUDA 11.3.0 and CUDA 11.4.0, the releases directly preceding and following 11.3.1.