oroOccupancyMaxActiveBlocksPerMultiprocessor returns hipErrorInvalidDeviceFunction always

Question

AtsushiYoshimura0302 opened this issue 5 months ago · 7 comments

AtsushiYoshimura0302 commented 5 months ago

Hi, the API oroOccupancyMaxActiveBlocksPerMultiprocessor always fails. I tried both the "main" branch and "release/hip6.0_cuda12.2", on an RX7900XTX and an RTX4090, and it failed every time.

This can be reproduced by adding the following to "SimpleDemo":

			int numBlocks = 0;
			oroError_t error = oroOccupancyMaxActiveBlocksPerMultiprocessor( &numBlocks, function, 128, 0 );
			printf( "occupancy api %d %d\n", error, numBlocks ); // shows occupancy api 98 0

Can anyone help?

Answer 1 · 2024-07-23T13:18:45.000Z

The function is correctly bound to HIP (hipOccupancyMaxActiveBlocksPerMultiprocessor), so I don't think the bug is related to Orochi.
I can confirm that I reproduce the same error code: hipErrorInvalidDeviceFunction (98).

Answer 2 · 2024-07-24T02:23:44.000Z

@RichardGe
Thank you for checking. I found the reason.
The Orochi API has to use hipModuleOccupancyMaxActiveBlocksPerMultiprocessor / cuOccupancyMaxActiveBlocksPerMultiprocessor instead of hipOccupancyMaxActiveBlocksPerMultiprocessor / cudaOccupancyMaxActiveBlocksPerMultiprocessor, since Orochi uses runtime compilation. The driver API and the runtime API treat the function pointer differently, and the current binding is for the runtime API.

note: https://forums.developer.nvidia.com/t/using-cudaoccupancymaxactiveblockspermultiprocessor-with-function-acquired-with-cumodulegetfunction/184191

Answer 3 · 2024-07-24T02:27:34.000Z

I think there are some more incorrect bindings, e.g. oroFuncGetAttributes().

Answer 4 · 2024-07-24T02:36:38.000Z

I confirmed the behavior with HIP SDK 6.1 and https://github.com/ROCm/rocm-examples.git (92786e2 - Add source format linting to the GitHub workflows (#140)), and with https://github.com/NVIDIA/cuda-samples

Answer 5 · 2024-10-17T11:16:56.000Z

Some clarification here:

Which of these two functions to use depends on where the function pointer, which is passed as one of the parameters, comes from.

For example, in the CUDA case:

If the function pointer originally comes from something like cuModuleGetFunction(), the other calls that use it should also be bound to the "cu" functions instead of the "cuda" ones; the two cannot be mixed.

Note: In CUDA, runtime API functions start with "cuda" and driver API functions start with "cu"

The same applies to HIP.

Answer 6 · 2024-10-17T14:02:13.000Z

Hi @AtsushiYoshimura0302, we investigated this with @KaoCC.
In SimpleDemo, the function is obtained from oroModuleGetFunction, so you need to use oroModuleOccupancyMaxActiveBlocksPerMultiprocessor instead of oroOccupancyMaxActiveBlocksPerMultiprocessor.
I tested it, and it worked.
So I think we can close this ticket.

Answer 7 · 2024-10-17T14:39:30.000Z

Ah, thank you for finding it out and checking.