unordered map creation freezes async processes
Opened this issue · 10 comments
Describe the bug
unordered map creation freezes async processes
Steps to reproduce
runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);
// The line below would only complete when runBuldKernel is done
stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
Expected behavior
The map creation and memory allocation should complete right away, without waiting for runBuldKernel to complete
Actual behavior
The map creation and memory allocation completes only after runBuldKernel is done
System (please complete the following information):
- OS: Windows 11 x64
- Compiler: MSVC Visual Studio 2022
- Backend: CUDA 12
- Library version: master
runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);
printf("1\n");
//stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
Pointer* p;
cudaMalloc(&p, 1 * sizeof(Pointer));
printf("2\n");
This, in contrast, works asynchronously: the allocation happens without waiting for `runBuldKernel` to complete.
This is a known limitation. Although the required parallel algorithms from thrust used in stdgpu, as well as the intermediate interface in stdgpu, all support arbitrary `execution_policy`s (in which a CUDA stream can be encapsulated), most functions (including `stdgpu::unordered_map::createDeviceObject`) fall back to the default stream. Thus, the behavior you observe primarily comes from how the default stream is handled in CUDA, which by default is the "legacy" behavior and forces synchronization.

I think adding explicit support for asynchronous streams would be a good enhancement. Until this feature lands in stdgpu, as a workaround you could possibly 1. move the creation of the map to an earlier stage if this is possible, or 2. enable "per-thread" behavior for the default stream, which can be set with the `--default-stream` compiler flag.
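For reference, a sketch of how the per-thread default stream mode is typically enabled (the build commands are illustrative and `app.cu` is a placeholder file name):

```shell
# Option 1: compiler flag (applies to the whole translation unit)
nvcc --default-stream per-thread -o app app.cu

# Option 2: preprocessor define, set before any CUDA header is included;
# equivalent to the flag on a per-file basis
nvcc -DCUDA_API_PER_THREAD_DEFAULT_STREAM -o app app.cu
```

In this mode each host thread gets its own default stream, which does not implicitly synchronize with other streams, so the legacy blocking behavior described above no longer applies.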
1 is not possible. And I am not sure what 2 does; I need to read about it so that it doesn't break something else.
For reference, #351 tracks all affected functions which currently do not have proper support for custom `execution_policy`s such as `thrust::device.on(stream)`.
@stotko it doesn't seem like the default stream is the issue. The code below works asynchronously:
runBuldKernel<<<block_size_x, thread_size_x>>>(ng, object_size_ui);
printf("1\n");
//stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
Pointer* p;
cudaMalloc(&p, 1 * sizeof(Pointer));
printf("2\n");
`cudaMalloc` and `printf("2\n")` run right away, without waiting for `runBuldKernel` (I made it an infinite one) to finish. As I did not specify a stream, this all goes to the default one. When I uncomment the map part, it is blocked, no matter what comes after it.
Thanks for further testing. I still believe that the issue is related to the default stream. Just to make sure, could you try calling another kernel on the default stream (it could be anything), while `runBuldKernel` uses `build_stream` as before, so that you have the same setup described in the legacy default stream section?
In contrast to a pure `cudaMalloc`, which does not block, `stdgpu::unordered_map::createDeviceObject` additionally has to initialize its internal arrays, which is done by calling kernels on the default stream.
runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);
printf("1\n");
k_2<<<1, 1>>>();
printf("2\n");
stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
printf("3\n");
`k_2` is executed without waiting. And then it blocks in the map creation, i.e. "2" is printed.
I have reproduced your observations. In fact, there are two issues:
- The legacy default stream forces a strict order on the execution of the kernels and performs implicit (weak) synchronization of the involved scheduled kernels, but leaves the CPU asynchronous, hence the non-blocking CPU `printf` statements.
- `thrust`'s algorithms are all synchronous since CUDA 10.1 unless the `thrust::async` versions are used. More recent versions, i.e. thrust 1.16 and newer, introduced the asynchronous policy `thrust::cuda::par_nosync.on(stream)`, which would make the called (by default synchronous) algorithms behave like custom CUDA kernels without CPU blocking.

In that sense, you are right that my initial explanation was not sufficient. Fortunately, adding support for custom `execution_policy`s would still resolve the issue, as the above policy could be used to force the correct behavior on `thrust`'s side.
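A minimal sketch of the distinction, assuming thrust 1.16 or newer (here `thrust::fill_n` merely stands in for the kind of internal array initialization that `createDeviceObject` performs; the function and parameter names are illustrative):

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/fill.h>

void init_on_stream(int* d_data, std::size_t n, cudaStream_t stream)
{
    thrust::device_ptr<int> p(d_data);

    // Synchronous policy: the algorithm runs on "stream", but the CPU
    // blocks until it has finished (thrust's default behavior).
    thrust::fill_n(thrust::cuda::par.on(stream), p, n, 0);

    // Asynchronous policy (thrust 1.16+): enqueued on "stream" and
    // returns immediately, like a plain kernel launch.
    thrust::fill_n(thrust::cuda::par_nosync.on(stream), p, n, 0);
}
```

This requires a CUDA device to run, so it is a sketch rather than a tested snippet.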
So there is currently no way to make this happen asynchronously?
If you are only concerned about the CPU-blocking part and the stream-ordering behavior is acceptable, then a workaround could be to create the `unordered_map` object in a different CPU thread, for instance using `std::async`. Then, the creation would block in the newly created thread while the main thread continues normally.
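A sketch of that workaround, assuming a CUDA device is available and that blocking in a worker thread is acceptable (`stdgpu/unordered_map.cuh` is stdgpu's public header for the container):

```cpp
#include <cstdint>
#include <future>
#include <stdgpu/unordered_map.cuh>

// Kick off the map creation in a separate CPU thread; createDeviceObject
// then blocks in that worker thread instead of the main one.
auto map_future = std::async(std::launch::async, [] {
    return stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
});

// ... the main thread continues, e.g. launching runBuldKernel ...

// Retrieve the map when it is actually needed (waits here if not ready yet).
stdgpu::unordered_map<uint32_t, uint32_t> map = map_future.get();
```

Note that `map_future.get()` still waits for the creation to finish, so the main thread only stays unblocked for as long as it defers that call.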