NVIDIA/jitify

Question: Are cuModules shared between kernels from the same program?

mondus opened this issue · 7 comments

I.e. if I create multiple jitify kernels from the same program that share a device symbol, does get_global_ptr return the same address for each?

Would be good to know before I do some refactoring of some code.

No, currently they are not shared, each kernel instantiation has its own cuModule, so the addresses will be different (I confirmed with a test).
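To make this concrete, here is a minimal sketch using the jitify v1 header-only API (the program source, kernel names, and symbol name are illustrative, and running it requires a CUDA-capable device):

```cpp
#include <jitify.hpp>

// First line of the source string is the program name (jitify convention).
static const char* const program_source =
    "my_program\n"
    "__device__ int my_symbol;\n"
    "template <typename T>\n"
    "__global__ void kernel_a(T* out) { *out = (T)my_symbol; }\n"
    "template <typename T>\n"
    "__global__ void kernel_b(T* out) { *out = (T)my_symbol; }\n";

int main() {
  jitify::JitCache kernel_cache;
  jitify::Program program = kernel_cache.program(program_source);

  // Each instantiate() call compiles and loads its own cuModule...
  auto ka = program.kernel("kernel_a").instantiate<float>();
  auto kb = program.kernel("kernel_b").instantiate<float>();

  // ...so each module holds a separate copy of my_symbol, and the
  // two addresses returned here are different.
  CUdeviceptr pa = ka.get_global_ptr("my_symbol");
  CUdeviceptr pb = kb.get_global_ptr("my_symbol");
  return (pa != pb) ? 0 : 1;
}
```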

This is arguably a design flaw in the Jitify API, and I'd been wondering if/when it would become a problem. I'd be interested to know how important it is for your application.

A (hypothetical) new Jitify API that better matched the underlying CUDA APIs would allow (/require) you to provide multiple name expressions for a single program (e.g., template instantiations of multiple kernels, globals etc.), then compile it once to a single module and extract all of the kernels and global addresses. This is doable, but would take a bit of refactoring and would be a slightly less intuitive API for common use-cases. Let us know if you think something like this would be of value.

Thanks for the reply @benbarsdell. This is certainly an issue for us, particularly when it comes to constant memory. We have a number of large constant and statically sized device symbols which we can compile within a single compilation unit, but which need to be accessed by separate kernels in that same unit. Your suggestion would be very helpful for our use case, and indeed for any use case with multiple kernels in the same compilation unit. Would it not be possible to simply change the internals so that the cuModule was created by the program and shared with each kernel object?

We can work around the device symbols, but I can't see a clear way to work around our use of constant memory. I am also unclear whether the constant memory limits apply per module, per context, or per device.

For the constants, could this be a good use of jitify's new-found linking ability: declare the __constant__ in offline source code, e.g., in a .cu file, then JIT-compile the kernel and link against that object file?
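A sketch of that suggestion (the file names are illustrative, and the `-l<file>` link option is an assumption about how jitify exposes its linking support):

```cpp
// constants.cu -- compiled offline to relocatable device code, e.g.:
//   nvcc -dc -arch=sm_70 constants.cu -o constants.o
__constant__ float coeffs[1024];

// JIT side: declare (not define) the constant in the JIT source and
// link the compiled kernel against the offline object file.
const char* const program_source =
    "my_program\n"
    "extern __constant__ float coeffs[1024];\n"
    "__global__ void my_kernel(float* out) { out[0] = coeffs[0]; }\n";

jitify::JitCache kernel_cache;
jitify::Program program = kernel_cache.program(
    program_source, /*headers=*/{},
    // -rdc=true enables relocatable device code; the -l option
    // naming the object file to link is an assumption.
    /*options=*/{"-rdc=true", "-lconstants.o"});
```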

@maddyscientist Yes this might work so long as you can link multiple kernels against the same module (containing the constant definition). Presumably this is fine as they are in the same context?

I think linking will have the same issue because there will still be multiple modules, unless I'm misunderstanding.

Would it not be possible to simply change the internals so that the cuModule was created by the program and shared with each kernel object?

The problem is that we currently have:

program.kernel(name).instantiate(template args).launch(...)

but what we would need is (roughly speaking):

program.instantiate(list of name expressions).kernel(name expression).launch(...).

In particular, the call to instantiate() is when the program gets compiled. Changing that means changing the fundamental flow of the API. This is doable, but not a small change.
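To make the contrast concrete, a sketch (the multi-name-expression API in the second half is hypothetical and does not exist in jitify; all names are illustrative):

```cpp
// Current API: compilation happens inside instantiate(), once per
// kernel, so each instantiation gets its own cuModule:
auto ka = program.kernel("kernel_a").instantiate<float>();
auto kb = program.kernel("kernel_b").instantiate<float>();

// Hypothetical API: compile once with all name expressions up front,
// producing a single shared cuModule:
auto instance = program.instantiate(
    {"kernel_a<float>", "kernel_b<float>", "&my_symbol"});
auto k1 = instance.kernel("kernel_a<float>");
auto k2 = instance.kernel("kernel_b<float>");
CUdeviceptr p = instance.get_global_ptr("my_symbol");  // one shared address
```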

@benbarsdell Yes, I imagine you are right: after linking there would still be multiple modules with duplicate definitions of the constant, so setting the constant's value would have to be done for each instantiation. I see now how this would be a significant change (but one which I would very much support!). Could you support both options? E.g.

program.instantiate_program(list of name expressions).instantiated_kernel(name expression).launch(...)

Presumably this would then also support things like:

program.get_global_ptr(...)

Which would solve all of my problems...

What I am currently still unclear on is how constant memory is allocated on the device. The following SO question points to the ISA docs, which state: "There is an additional 640 KB of constant memory, organized as ten independent 64 KB regions. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters." Does this mean I could have a maximum of 10 jitify kernels/modules each using 64 KB of constant space, or could I have any number, with some driver magic taking care of mapping them to regions at kernel launch?

@benbarsdell We have a workaround for this for now, but it would be a nice feature to enable instantiation of multiple kernels from the same module.