ut-parla/Parla.py

Should Parla tasks/contexts also set up stream handles for numba?

wlruys opened this issue · 17 comments

Inspired by the recent discussion on Slack around https://gist.github.com/leofang/4a043e5d94b4702d04fde2b9e7dcebbd
and passing the current stream to numba kernels.

Looking forward, would it make sense to provide a local variable within a task for this as well to prevent the user from creating and destroying these stream wrappers themselves?
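For reference, the wrapper in question could look roughly like the sketch below. It assumes CuPy's `cupy.cuda.get_current_stream()` and Numba's `cuda.external_stream()`; the helper name `numba_stream_for_current_cupy_stream` is made up here and is not part of Parla. The imports are deferred so the helper can be defined on a machine without a GPU.

```python
def numba_stream_for_current_cupy_stream():
    """Wrap CuPy's current stream so it can be passed to a Numba kernel launch.

    Hypothetical helper: the name is illustrative, not a Parla API.
    """
    # Deferred imports: both libraries need a CUDA-capable setup to be useful.
    import cupy
    from numba import cuda

    # CuPy exposes the raw stream handle as an integer pointer.
    ptr = cupy.cuda.get_current_stream().ptr
    # Numba wraps an externally created stream without taking ownership of it.
    return cuda.external_stream(ptr)
```

Inside a task this would be used as `kernel[blocks, threads, numba_stream_for_current_cupy_stream()](args)`, which is exactly the boilerplate a task-local variable would hide.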

If there is a way to set the default numba CUDA stream, then Parla should have a component that can set that automatically before entering each task. Then the same stream can be automatically used everywhere you would expect it to be.
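To make the idea concrete, here is a minimal pure-Python sketch of a task-scoped "current stream" that launches fall back to when no stream is passed explicitly. Everything in it (`task_stream`, `get_current_stream`, `launch`) is hypothetical and only models the behaviour being asked for; it is not an existing Numba or Parla API, and strings stand in for real stream objects.

```python
import contextvars
from contextlib import contextmanager

# Holds the stream that the currently running task should use (None = no task).
_current_stream = contextvars.ContextVar("current_stream", default=None)


@contextmanager
def task_stream(stream):
    """Make `stream` the implicit default for the duration of a task body."""
    token = _current_stream.set(stream)
    try:
        yield stream
    finally:
        # Restore whatever was current before, so nested tasks unwind cleanly.
        _current_stream.reset(token)


def get_current_stream():
    return _current_stream.get()


def launch(kernel_name, stream=None):
    """Stand-in for a kernel launch: uses the implicit stream unless one is given."""
    if stream is None:
        stream = get_current_stream()
    return (kernel_name, stream)
```

With something like this in place, the runtime would enter `task_stream(...)` before running each task, and user code inside the task would never have to mention the stream.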

From what I'm seeing on this, they follow the CUDA API quite closely, which afaik doesn't have a way of setting a default (aside from the usual two defaults: per-thread and not).

They just have the pattern of passing the stream into each kernel call. A settable default would be ideal if it existed, though.

Actually that's only semi-related. That's about using the per-thread default streams introduced in CUDA 7 in their CUDA module. It's not suggesting interop between numba kernel calls and cupy stream context management idioms.

numba/numba#4797 is semi-related. I'll try creating an issue upstream to see what they say though.

Submitted upstream as numba/numba#6921. We'll see if they can give us some guidance.

They suggested using the per-thread default stream as a potential workaround. I initially didn't think that'd work, but since we are already synchronizing the stream at the end of each task, and since CUDA's per-thread default stream mode runs those default streams in asynchronous mode, I think that may actually do what we need. The catch is that it requires libraries to be compiled with per-thread default stream enabled. CuPy claims to support this if CUPY_CUDA_PER_THREAD_DEFAULT_STREAM is set to 1 (I'm not sure how they do that with the same binary...). HIP doesn't have an equivalent though. Does anyone else have any thoughts on this?

I worry about requiring specific features in the libraries we use. But without VECs it's kind of hard to avoid, so I think this would be reasonable.

You know, this is what Components were made for. We could have a per-thread default stream component and a newly-created-stream component. Users who care can set up environments for tasks to run in that have either or both, and can then specify which tasks should run in which context with tags. Tags are a feature (that may or may not be fully implemented, I cannot remember) that allows arbitrary hashable objects to be attached to environments and to tasks. A task can only use a given environment if the environment's tags are a superset of the task's tags. So it provides a form of ad-hoc selector.
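The superset rule is easy to sketch with plain sets. The function names and the example tag strings below are made up for illustration and are not Parla's actual API; the only thing taken from the description above is the rule itself.

```python
def environment_accepts(env_tags, task_tags):
    """A task may use an environment only if the environment has every tag the task has."""
    return frozenset(task_tags) <= frozenset(env_tags)


def select_environments(environments, task_tags):
    """Return the names of all environments a task with `task_tags` could run in.

    `environments` maps an environment name to an iterable of hashable tags.
    """
    return [
        name
        for name, env_tags in environments.items()
        if environment_accepts(env_tags, task_tags)
    ]


# Hypothetical setup: one environment per stream strategy, plus one with both.
environments = {
    "ptds_only": {"per_thread_default_stream"},
    "new_stream_only": {"new_stream"},
    "both": {"per_thread_default_stream", "new_stream"},
}
```

A task tagged `{"per_thread_default_stream"}` would match `ptds_only` and `both`, while an untagged task would match every environment, since the empty set is a subset of anything.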

Right, and I'd be somewhat surprised if HIP didn't support this eventually. It's actually a very natural idiom. OTOH, who knows. Sometimes upstream projects still do surprising things.

Honestly, they may already support it and just never have mentioned it. To me, per-thread default streams are the obviously correct option, and a shared stream really feels incorrect. So it may be that HIP built per-thread streams into their framework from the beginning. It seems to have tried to solve some of the problems that CUDA has. CUDA is not a very good API, since it grew rather carelessly and was never designed as far as I can tell.

Yah, I don't know to what extent HIP is mimicking CUDA vs fixing its failings.

https://github.com/cupy/cupy/blob/596d1af53b5793d3d52994c8b493ff42be453a8d/cupy_backends/cuda/stream.pyx#L10 seems to imply that this is something we'd have to set before importing cupy. Other than that, it seems like a reasonable short-term fix until we can get numba/numba#6921 fixed. I suspect that that may take some cross-library coordination though (like Graham mentions there).

Just having to import parla.gpu before cupy to fix this isn't perfect, but it's way better than having to shuttle around stream objects manually.
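A rough sketch of what that import-time hook could do, assuming (as the linked CuPy source seems to imply) that `CUPY_CUDA_PER_THREAD_DEFAULT_STREAM` is only read when CuPy is first imported. The function name `enable_cupy_ptds` is hypothetical; only the environment variable name comes from the discussion above.

```python
import os
import sys
import warnings


def enable_cupy_ptds():
    """Request CuPy's per-thread default stream mode via its environment variable.

    Returns True if CuPy had not been imported yet (so the setting should take
    effect), and False if it was already imported (so it is probably too late).
    """
    cupy_not_loaded = "cupy" not in sys.modules
    if not cupy_not_loaded:
        # Assumption: CuPy reads the variable at import time, so warn the user.
        warnings.warn(
            "cupy was imported before per-thread default streams were enabled; "
            "the setting may be ignored."
        )
    os.environ["CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"
    return cupy_not_loaded
```

Calling this at the top of a module such as `parla.gpu` is what would make the import order matter.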

Actually it looks like they still don't support per-thread default streams. For some reason I misread their issue as a pull request. It looks like PTDS is the easier fix for them and is just more likely to happen in the short term.

Relevant discussion in numba/numba#5137. PTDS may be available upstream very soon!

So you see. I read this "PTSD" and I thought "I've been on open source projects like that". Why can't we have nice things. ;-)

I'm worried about this convergence on per-thread default streams, since as far as I know there is no way to set the default stream on a thread to a specific stream. I think this is a bad choice. But whatever, not my problem. In the end this is just a result of the fact that the languages we are working in lack proper abstractions for context. I wish more languages had implicit parameters of some kind.

I agree that the per-thread default stream idea isn't ideal in general without a way to set it. On the other hand, it'll be good enough for our use-case regardless of whether or not we can configure the default, so I'm happy.