Question regarding block launch order in CUDA
Snektron opened this issue · 0 comments
The CUDA C Programming Guide states on page 13:
This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.
However, the code for agent scan contains this comment:
Lines 408 to 412 in 5d12837
Does this mean that scan relies on undefined behavior here, or have I missed some part of the CUDA specification?
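For comparison, here is a minimal sketch of how a kernel could avoid depending on block launch order at all. Each block takes a ticket from a global atomic counter when it starts and uses that ticket as its logical index instead of `blockIdx.x`. This is a common workaround, and I am not claiming it is what scan does. The names `next_ticket`, `take_tickets` and `tickets_are_a_permutation` are made up for illustration.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical global counter that hands out tickets in start order.
__device__ unsigned int next_ticket = 0;

// Thread 0 of each block takes a ticket when the block starts running.
// A block holding ticket k knows that the blocks holding tickets 0..k-1
// have already started, whatever blockIdx.x values the scheduler picked.
__global__ void take_tickets(unsigned int *ticket_of_block)
{
    __shared__ unsigned int ticket;
    if (threadIdx.x == 0) {
        ticket = atomicAdd(&next_ticket, 1u);
        ticket_of_block[blockIdx.x] = ticket;  // recorded for inspection only
    }
    __syncthreads();
    // From here on, every thread would use `ticket` as the logical tile
    // index instead of blockIdx.x (assumption: one tile per block).
}

// Host helper: launches the kernel and checks that the tickets form a
// permutation of 0..num_blocks-1. Returns false on any CUDA error.
bool tickets_are_a_permutation(int num_blocks)
{
    unsigned int zero = 0;
    if (cudaMemcpyToSymbol(next_ticket, &zero, sizeof(zero)) != cudaSuccess)
        return false;

    unsigned int *d_tickets = nullptr;
    const std::size_t bytes = static_cast<std::size_t>(num_blocks) * sizeof(unsigned int);
    if (cudaMalloc(&d_tickets, bytes) != cudaSuccess)
        return false;

    take_tickets<<<num_blocks, 32>>>(d_tickets);

    // cudaMemcpy waits for the kernel and also reports launch errors.
    std::vector<unsigned int> h_tickets(num_blocks);
    cudaError_t err = cudaMemcpy(h_tickets.data(), d_tickets, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_tickets);
    if (err != cudaSuccess)
        return false;

    std::vector<bool> seen(num_blocks, false);
    for (unsigned int t : h_tickets) {
        if (t >= static_cast<unsigned int>(num_blocks) || seen[t])
            return false;
        seen[t] = true;
    }
    return true;
}
```

Calling `tickets_are_a_permutation(64)` from a host `main` should return true on a working device. The mapping from `blockIdx.x` to ticket can differ between runs, which is exactly the scheduling freedom the guide describes.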