Should Parla tasks/contexts also set up stream handles for numba?

Question

wlruys opened this issue 4 years ago · 17 comments

Inspired by the recent discussion on Slack around https://gist.github.com/leofang/4a043e5d94b4702d04fde2b9e7dcebbd
and passing the current stream to numba kernels.

Looking forward, would it make sense for Parla to expose the task's stream as a local variable inside the task, so that users don't have to create and destroy these stream wrappers themselves?

Answer 1 · 2021-04-06T22:42:50.000Z

If there is a way to set the default numba CUDA stream, then Parla should have a component that can set that automatically before entering each task. Then the same stream can be automatically used everywhere you would expect it to be.
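A minimal sketch of what such a component might look like, assuming nothing about Parla's or numba's real API. The names `current_stream`, `task_stream`, and `FakeStream` are hypothetical, and the stream is a stand-in object rather than a real CUDA stream. It uses Python's `contextvars` to make a stream implicitly available inside a task body and restores the previous value on exit.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical: holds the stream that the currently running task should use.
_current_stream = contextvars.ContextVar("current_stream", default=None)


class FakeStream:
    """Stand-in for a real CUDA stream wrapper (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.synchronized = False

    def synchronize(self):
        self.synchronized = True


def current_stream():
    """Return the stream installed for the running task, or None."""
    return _current_stream.get()


@contextmanager
def task_stream(stream):
    """Install `stream` for the duration of a task body, then sync and restore."""
    token = _current_stream.set(stream)
    try:
        yield stream
    finally:
        stream.synchronize()          # wait for the task's work when the body finishes
        _current_stream.reset(token)  # restore whatever was installed before


def launch_kernel():
    # Library code picks up the stream implicitly instead of taking an argument.
    return current_stream().name
```

With something like this, a runtime could wrap each task body in `with task_stream(stream):` and code inside the task would call `current_stream()` rather than receive a stream parameter. Whether numba could consume such a value automatically is exactly the open question here.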

Answer 2 · 2021-04-06T23:07:03.000Z

From what I can see, numba follows the CUDA API quite closely, which afaik doesn't have a way of setting a default stream (aside from the usual two default modes: per-thread and legacy).

The only pattern they offer is passing the stream into each kernel call. A settable default would be ideal if it existed, though.

Answer 3 · 2021-04-12T17:38:56.000Z

See also: numba/numba#5137

Answer 4 · 2021-04-12T17:42:07.000Z

Actually that's only semi-related. That's about using the per-thread default streams introduced in CUDA 7 in their CUDA module. It's not suggesting interop between numba kernel calls and cupy stream context management idioms.

Answer 5 · 2021-04-12T17:43:29.000Z

numba/numba#4797 is semi-related. I'll try creating an issue upstream to see what they say though.

Answer 6 · 2021-04-12T17:48:30.000Z

Submitted upstream as numba/numba#6921. We'll see if they can give us some guidance.

Answer 7 · 2021-04-14T16:40:23.000Z

They suggested using the per-thread default stream as a potential workaround. I initially didn't think that'd work, but since we already are synchronizing the stream at the end of each task and since CUDA's per-thread default stream mode runs those default streams in asynchronous mode, I think that may actually do what we need. The catch is that it requires that libraries be compiled with per-thread default stream enabled. CuPy claims to support this if CUPY_CUDA_PER_THREAD_DEFAULT_STREAM is set to 1 (I'm not sure how they do that with the same binary...). HIP doesn't have an equivalent though. Does anyone else have any thoughts on this?

Answer 8 · 2021-04-14T17:27:41.000Z

I worry about requiring specific features in the libraries we use. But without VECs it's kind of hard to avoid, so I think this would be reasonable.

You know, this is what Components were made for. We could have a per-thread default stream component and a newly created stream component. Users who care can set up environments for tasks to run in that have both or either, and can then specify which tasks should run in which context with tags. Tags are a feature (that may or may not be fully implemented, I cannot remember) that allows arbitrary hashable objects to be attached to environments and to tasks; a task can only use a given environment if the environment's tags are a superset of the task's tags. So it provides a form of ad-hoc selector.
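The tag rule described above can be sketched in a few lines. This is only an illustration of the superset check, not Parla's implementation: `Environment`, `eligible_environments`, and the tag values are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    tags: frozenset


def eligible_environments(task_tags, environments):
    """Return the environments whose tags are a superset of the task's tags."""
    required = frozenset(task_tags)
    return [env for env in environments if env.tags >= required]


# Hypothetical tag values; any hashable object would do.
PTDS = "per-thread-default-stream"
NEW_STREAM = "newly-created-stream"

envs = [
    Environment("gpu0-ptds", frozenset({PTDS})),
    Environment("gpu0-stream", frozenset({NEW_STREAM})),
    Environment("gpu0-both", frozenset({PTDS, NEW_STREAM})),
]
```

A task tagged `{PTDS}` could then run in `gpu0-ptds` or `gpu0-both`, while an untagged task could run anywhere.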

Answer 9 · 2021-04-14T17:49:26.000Z

Right, and I'd be somewhat surprised if HIP didn't support this eventually. It's actually a very natural idiom. OTOH, who knows. Sometimes upstream projects still do surprising things.

Answer 10 · 2021-04-14T18:21:28.000Z

Honestly they may already support it and just never have mentioned it. To me, per-thread default streams are the obviously correct option and a shared stream really feels incorrect. So it may be that HIP built per-thread streams into their framework from the beginning. It seems to have tried to solve some of the problems that CUDA has. CUDA is not a very good API since it grew rather carelessly and was never designed as far as I can tell.

Answer 11 · 2021-04-14T18:38:48.000Z

Yah, I don't know to what extent HIP is mimicking CUDA vs fixing its failings.

Answer 12 · 2021-04-14T18:40:05.000Z

https://github.com/cupy/cupy/blob/596d1af53b5793d3d52994c8b493ff42be453a8d/cupy_backends/cuda/stream.pyx#L10 seems to imply that this is something we'd have to set before importing cupy. Other than that, it seems like a reasonable short-term fix until we can get numba/numba#6921 fixed. I suspect that that may take some cross-library coordination though (like Graham mentions there).

Answer 13 · 2021-04-14T18:44:16.000Z

Just having to import parla.gpu before cupy to fix this isn't perfect, but it's way better than having to shuttle around stream objects manually.
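A rough sketch of what that import-order dependency could look like, assuming (as the linked CuPy source seems to imply) that the variable is only read when CuPy is first imported. `enable_ptds` is a hypothetical helper, not something Parla currently provides.

```python
import os
import sys

PTDS_VAR = "CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"


def enable_ptds(environ=os.environ, loaded_modules=sys.modules):
    """Request CuPy's per-thread default stream mode.

    Returns True if the variable was set before CuPy was imported, and
    False if CuPy is already loaded, in which case nothing is changed
    because the setting would presumably come too late.
    """
    if "cupy" in loaded_modules:
        return False
    environ[PTDS_VAR] = "1"
    return True
```

A module such as `parla.gpu` could call a helper like this at import time and warn when it returns `False`, so that a wrong import order is at least visible to the user.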

Answer 14 · 2021-04-14T20:22:31.000Z

Actually it looks like they still don't support per-thread default streams. For some reason I misread their issue as a pull request. It looks like PTDS is the easier fix for them and is just more likely to happen in the short term.

Answer 15 · 2021-04-15T17:31:12.000Z

Relevant discussion in numba/numba#5137. PTDS may be available upstream very soon!

Answer 16 · 2021-04-15T18:10:58.000Z

So you see. I read this "PTSD" and I thought "I've been on open source projects like that". Why can't we have nice things. ;-)

I'm worried about this convergence on per-thread default streams since, as far as I know, there is no way to set the default stream on a thread to a specific stream. I think this is a bad choice. But whatever, not my problem. In the end this is just a result of the fact that the languages we are working in lack proper abstractions for context. I wish more languages had implicit parameters of some kind.

Answer 17 · 2021-04-15T18:18:48.000Z

I agree that the per-thread default stream idea isn't ideal in general without a way to set it. On the other hand, it'll be good enough for our use-case regardless of whether or not we can configure the default, so I'm happy.