Unique env with mixed # of threads/block and chained CUDA kernels. Is Warp-Drive appropriate?
Hello! I have an unusual environment which I am having difficulty implementing in warp-drive. Essentially, the environment has N agents each placing M units on their own boards. After all agents are done placing units, the boards are matched against each other and intensive computations are performed to determine per-agent rewards.
I was thinking I could have a CUDA `Step` function with N agents (threads) per environment (1 block per env) which would handle the overall state/action. When the agents are done performing actions, a CUDA `BoardStep` function with M units (threads) per board (1 block per board) would run, being fed the mapped `state` -> `board_state` input (the mapping would be done by a separate CUDA function).
I am essentially attempting the following:
```
step():  # 4 agents per env
    CudaEnvStep(_state_, _action_, _done_)  # 4 agents per block
    if (_done_ && !board_done):
        CudaMapEnvToBoard(_state_, board_state)
    while (_done_ && !board_done):
        CudaBoardStep(board_state, board_done, board_reward)  # 24 units per block
    if (_done_ && board_done):
        CudaCombineRewards(board_reward, _reward_)  # 4 agents per block again
```
I have implemented `CudaBoardStep()`. I am not sure if Warp-Drive's `Trainer` can handle multiple `CUDAFunctionManager`s with different threads/block, and whether this impacts Warp-Drive's performance. Looking at the example environments, I do not see a mixed-thread or chained-kernel environment.
Questions:
- Does warp-drive support chained CUDA kernels? Can I make every operation in my step a separate CUDA kernel if necessary, and will warp-drive chain them together, similar to CUDA Graphs?
- Can I have CUDA functions with a different # of threads per block (aka a different # of "agents" per environment) mixed within a `step()` without expecting a significant performance loss?
- Would branch/loop operations like `if`/`while` run on the GPU? I am not sure whether the `if`/`while` operations are running within the PyTorch GPU context or not.
Does warp-drive support chained CUDA kernels? Can I make every operation in my step a separate CUDA kernel if necessary, and will warp-drive chain them together, similar to CUDA Graphs?
- Yes, CUDA naturally supports chaining kernel functions. In fact, WarpDrive already calls `step()`, `reset()`, and many other kernel functions sequentially, one after another.
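As a minimal illustration (the kernel names below just mirror the pseudocode above; they are not part of WarpDrive's API), chaining falls out of ordinary CUDA stream semantics: kernels launched on the same stream execute in issue order, so each one can read what the previous one wrote.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: two kernels launched back-to-back on the default stream.
// The second launch does not start until the first has finished, so it sees the
// per-agent state written by CudaEnvStep.

__global__ void CudaEnvStep(float *state, const int *action) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per agent
    state[idx] += (float)action[idx];                  // placeholder per-agent update
}

__global__ void CudaMapEnvToBoard(const float *state, float *board_state) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    board_state[idx] = state[idx];                     // placeholder state -> board mapping
}

void chained_step(float *state, const int *action, float *board_state,
                  int num_envs, int num_agents) {
    // One block per environment, one thread per agent.
    CudaEnvStep<<<num_envs, num_agents>>>(state, action);
    // Same stream, so this is effectively "chained" after CudaEnvStep.
    CudaMapEnvToBoard<<<num_envs, num_agents>>>(state, board_state);
}
```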
Can I have CUDA functions with a different # of threads per block (aka different # of "agents" per environment) mixed within a step() without expecting a significant performance loss?
- Yes, there is no need for each block to have the same number of threads. This can easily be achieved by checking the validity of both the thread ID and the block ID: if an index is outside the range of your setting, that thread is simply regarded as out of scope. There should be no performance loss in this case, since the GPU always executes the threads of a block in groups running the same code; it does not matter how many threads in your block are actually doing useful work.
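A short sketch of that validity check, using hypothetical kernel names and the block sizes from the pseudocode above (4 agents per env block, 24 units per board block):

```cuda
// Sketch only: each kernel is launched with the block size it needs, and any
// thread whose index falls outside the meaningful range returns immediately.

__global__ void CudaEnvStep(float *state, int num_agents) {
    int agent_id = threadIdx.x;              // one block per environment
    if (agent_id >= num_agents) return;      // out-of-scope thread: do nothing
    // ... per-agent work on state ...
}

__global__ void CudaBoardStep(float *board_state, int num_units) {
    int unit_id = threadIdx.x;               // one block per board
    if (unit_id >= num_units) return;        // out-of-scope thread: do nothing
    // ... per-unit work on board_state ...
}

// Host side: mixing different block sizes within the same step is legal.
//   CudaEnvStep<<<num_envs, 4>>>(state, 4);
//   CudaBoardStep<<<num_boards, 24>>>(board_state, 24);
```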
Would branch/loop operations like if/while run on GPU? I am not sure if the if/while operations are running within PyTorch GPU context or not.
- Yes, conditional and loop operations inside a CUDA kernel work the same way as in a CPU program.
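A sketch with hypothetical names: `if`/`while` written inside the CUDA kernel execute on the GPU, per thread, just like ordinary CPU control flow.

```cuda
// Sketch with hypothetical names: branches and loops written inside a kernel
// run on the device for each thread, just as they would in CPU code.

__global__ void CudaBoardStep(float *board_state, float *board_reward, int num_units) {
    int unit_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x >= num_units) return;        // same out-of-scope guard as above

    // Device-side loop: keep updating this unit's value until it is non-positive.
    while (board_state[unit_id] > 0.0f) {
        board_state[unit_id] -= 1.0f;            // placeholder per-unit computation
    }
    if (threadIdx.x == 0) {
        board_reward[blockIdx.x] = 0.0f;         // placeholder per-board reward write
    }
}
```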
Thank you so much for the quick response! I just found the Slack link, so I will use that for any future questions I have; sorry for creating this issue. Closing this issue.