GPUOpen-LibrariesAndSDKs/Orochi

oroOccupancyMaxActiveBlocksPerMultiprocessor returns hipErrorInvalidDeviceFunction always

AtsushiYoshimura0302 opened this issue · 7 comments

Hi, the API oroOccupancyMaxActiveBlocksPerMultiprocessor always fails. I tried both the "main" branch and "release/hip6.0_cuda12.2" on an RX7900XTX and an RTX4090, and it failed every time.

This can be reproduced by adding the following to "SimpleDemo":

			int numBlocks = 0;
			oroError_t error = oroOccupancyMaxActiveBlocksPerMultiprocessor( &numBlocks, function, 128, 0 );
			printf( "occupancy api %d %d\n", error, numBlocks ); // shows occupancy api 98 0

Can anyone help?

The function is correctly bound to HIP ( hipOccupancyMaxActiveBlocksPerMultiprocessor ) so I don't think the bug is related to Orochi.
I confirm I'm reproducing the same error code: hipErrorInvalidDeviceFunction (98).

@RichardGe
Thank you for checking, but I found the reason.
The Orochi API has to bind to hipModuleOccupancyMaxActiveBlocksPerMultiprocessor / cuOccupancyMaxActiveBlocksPerMultiprocessor instead of hipOccupancyMaxActiveBlocksPerMultiprocessor / cudaOccupancyMaxActiveBlocksPerMultiprocessor, since Orochi uses runtime compilation. The driver API and the runtime API treat the function pointer differently, and the current binding is for the runtime API.

Note: https://forums.developer.nvidia.com/t/using-cudaoccupancymaxactiveblockspermultiprocessor-with-function-acquired-with-cumodulegetfunction/184191

I think there are more incorrect bindings, e.g. oroFuncGetAttributes().

I confirmed the behavior with HIP SDK6.1 and https://github.com/ROCm/rocm-examples.git (92786e2 - Add source format linting to the GitHub workflows (#140)) and https://github.com/NVIDIA/cuda-samples

Some clarification here:

Which of these two functions to use depends on where the function pointer passed as a parameter comes from.

For example, in the CUDA case:

If the function pointer originally comes from something like cuModuleGetFunction(), the remaining calls should also be bound to "cu" functions rather than "cuda" functions; the two cannot be mixed.

Note: In CUDA, runtime API functions start with "cuda" and driver API functions start with "cu".

The same applies to HIP.
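
To illustrate why mixing the two fails, here is a minimal toy model in plain C++. It is only a sketch under assumptions: every name in it (runtimeRegistry, ModuleFunction, runtimeOccupancy, moduleOccupancy, KernelInfo) is hypothetical and does not exist in HIP, CUDA, or Orochi, and the "occupancy" arithmetic is a placeholder. The assumption it encodes is that a runtime-API style call identifies a kernel by a host-side symbol address it already knows about, while a driver-API style call takes the handle returned by the module loader.

```cpp
#include <cassert>
#include <unordered_map>

// Toy model only: none of these names exist in HIP, CUDA, or Orochi.
constexpr int kSuccess = 0;
constexpr int kInvalidDeviceFunction = 98; // error value reported in this thread

struct KernelInfo {
    int maxThreadsPerSM; // placeholder resource limit for the toy occupancy formula
};

// Runtime-API side (assumption): kernels are looked up by host symbol address.
inline std::unordered_map<const void*, KernelInfo>& runtimeRegistry() {
    static std::unordered_map<const void*, KernelInfo> registry;
    return registry;
}

// Driver-API side (assumption): the handle obtained from a loaded module.
struct ModuleFunction {
    KernelInfo info;
};

// Runtime-style query: fails if the pointer is not a registered host symbol.
int runtimeOccupancy(int* numBlocks, const void* hostSymbol, int blockSize) {
    assert(numBlocks != nullptr && blockSize > 0);
    auto it = runtimeRegistry().find(hostSymbol);
    if (it == runtimeRegistry().end()) {
        *numBlocks = 0;
        return kInvalidDeviceFunction;
    }
    *numBlocks = it->second.maxThreadsPerSM / blockSize;
    return kSuccess;
}

// Driver-style query: reads the kernel description directly from the handle.
int moduleOccupancy(int* numBlocks, const ModuleFunction* function, int blockSize) {
    assert(numBlocks != nullptr && function != nullptr && blockSize > 0);
    *numBlocks = function->info.maxThreadsPerSM / blockSize;
    return kSuccess;
}
```

In this model, passing a ModuleFunction handle to runtimeOccupancy compiles (any pointer converts to const void*) but returns 98 with numBlocks set to 0, which mirrors the "occupancy api 98 0" symptom; passing the same handle to moduleOccupancy succeeds.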

Hi @AtsushiYoshimura0302, we investigated with @KaoCC.
In SimpleDemo, the function is obtained from oroModuleGetFunction, so you need to use oroModuleOccupancyMaxActiveBlocksPerMultiprocessor instead of oroOccupancyMaxActiveBlocksPerMultiprocessor.
I tested it and it worked.
So I think we can close this ticket.

Ah, thank you for finding it out and checking.