Question: Linking RTC code with executing program
Robadob opened this issue · 4 comments
If the executing program has been compiled with `-rdc=true`, is it possible to link RTC compiled code with this so that it has access to shared symbols (`__device__`, `__constant__`)?
A small example of what I mean is here (tested in Visual Studio 2015):
https://gist.github.com/Robadob/b2b7704a36e2f679942e854a0f41082a
I've found that `cuLinkAddFile()` would need to be used. However, it's not clear how that could be used to link with the executing program (or whether also building the executing program to a fatbin/similar for linking would be useful).
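For reference, the driver API flow I've been looking at is roughly the following (a minimal sketch with error checking omitted; the fatbin file name is a placeholder, and I'm not sure what the right input type for the executing program would actually be):

```cpp
#include <cuda.h>

// Minimal sketch of the cuLink* flow: link runtime-compiled PTX with a
// separately built device binary, then load the resulting cubin.
CUmodule linkWithFile(const char* rtc_ptx, size_t rtc_ptx_size) {
    CUlinkState link_state;
    cuLinkCreate(0, nullptr, nullptr, &link_state);

    // Add the runtime-compiled PTX (e.g. produced by NVRTC/jitify).
    cuLinkAddData(link_state, CU_JIT_INPUT_PTX,
                  const_cast<char*>(rtc_ptx), rtc_ptx_size,
                  "rtc_source", 0, nullptr, nullptr);

    // Add a pre-built device file; "host_program.fatbin" is a placeholder.
    cuLinkAddFile(link_state, CU_JIT_INPUT_FATBINARY,
                  "host_program.fatbin", 0, nullptr, nullptr);

    void* cubin = nullptr;
    size_t cubin_size = 0;
    cuLinkComplete(link_state, &cubin, &cubin_size);

    // Load the linked image before destroying the link state (which owns it).
    CUmodule module;
    cuModuleLoadData(&module, cubin);
    cuLinkDestroy(link_state);
    return module;
}
```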
I have just found the `CU_JIT_GLOBAL_SYMBOL` enums, which look like they might be a less dynamic way of achieving this.
I've modified `jitify::experimental::Kernel` so that I can pass the arguments `nopts`, `opts`, and `optsvals` to the constructor of `jitify::detail::CUDAKernel`.
On passing the 3 `CU_JIT_GLOBAL_SYMBOL` options, the call to `cuModuleLoadDataEx()` within the constructor returns `CUDA_ERROR_NOT_SUPPORTED` (I've added checks to ensure it's not coming from a previous call). This is not documented as a possible error from that method, so I'm not sure whether the lack of support is undocumented or I've done something incorrectly.
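Roughly what the modification attempts is the following (a simplified sketch rather than the actual change; the symbol name and host-side address are illustrative):

```cpp
#include <cuda.h>
#include <cstdint>

// Sketch: forward the CU_JIT_GLOBAL_SYMBOL_* options to cuModuleLoadDataEx().
// The symbol name and host address here are illustrative only.
CUresult load_with_global_symbol(CUmodule* module, const void* image,
                                 void* host_address_of_symbol) {
    const char* names[] = { "my_constant" };          // symbol expected in the RTC code
    void* addresses[]   = { host_address_of_symbol }; // address taken from the host program

    CUjit_option opts[] = {
        CU_JIT_GLOBAL_SYMBOL_NAMES,
        CU_JIT_GLOBAL_SYMBOL_ADDRESSES,
        CU_JIT_GLOBAL_SYMBOL_COUNT
    };
    void* optvals[] = {
        names,
        addresses,
        reinterpret_cast<void*>(static_cast<uintptr_t>(1))  // one symbol
    };

    // This is the call that returns CUDA_ERROR_NOT_SUPPORTED for me.
    return cuModuleLoadDataEx(module, image, 3, opts, optvals);
}
```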
I've tested this using Windows, Visual Studio 2015 & 2017, CUDA 10.2, and a Titan X (Pascal), building for `sm_61`.
My changes to the above example, and the modified Jitify header, can be found here; of possible note, I forked the header from the `hotfix/more-msvc-issues` branch.
Unfortunately it is not possible for runtime code to gain access to symbols in the host program.
Jitify does support two related things that may be useful:
- Linking external device libraries using `-Lpath` and `-lname` options. We could potentially extend this to support linking with ptx/cubin/object files as well if that would be useful.
- Accessing `__device__` and `__constant__` variables declared inside the runtime code via `kernel_instantiation.get_constant_ptr()` (we will be extending this to support accessing `__device__` variables as well soon); there's a rough usage sketch below.
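Roughly, using `get_constant_ptr` looks something like this (an illustrative sketch only; the program source, kernel, and symbol names are placeholders, and the exact return type of `get_constant_ptr` is glossed over):

```cpp
#include <cuda.h>
#include "jitify.hpp"

// Illustrative sketch: set a __constant__ declared inside runtime-compiled
// code from the host, then launch a kernel that reads it.
const char* const program_source =
    "my_program\n"
    "__constant__ int my_constant;\n"
    "__global__ void my_kernel(int* out) { *out = my_constant; }\n";

void example(int* device_out) {
    static jitify::JitCache kernel_cache;
    jitify::Program program = kernel_cache.program(program_source);
    auto instantiation = program.kernel("my_kernel").instantiate();

    // Look up the __constant__ symbol in the loaded module and set it
    // from the host before launching.
    auto constant_ptr = instantiation.get_constant_ptr("my_constant");
    int value = 42;
    cuMemcpyHtoD(constant_ptr, &value, sizeof(value));

    instantiation.configure(dim3(1), dim3(1)).launch(device_out);
}
```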
I'm not very familiar with the `CU_JIT_GLOBAL_SYMBOL` flags, but it looks like they're used to make `extern` symbols in the runtime code use host addresses instead of looking for device symbols during linking. Is there a reason this is useful for your application vs., say, using non-extern symbols in your runtime code and accessing them via `get_constant_ptr`?
I'm working with @mondus.
We're developing a new framework for complex system simulations that abstracts away most of the CUDA (FLAMEGPU2, to improve on usability over FLAMEGPU). He's also interested in developing a runtime compiled Python interface to the library, to make it more accessible, which is where this comes in. (I'm not sure if he told you this much the other day.)
> Accessing `__device__` and `__constant__` variables ...
Several commonly reused items/pointers are stored behind `__device__` and `__constant__` variables, hence automatic linking with the executing runtime would be convenient. There are several workarounds, such as reducing the number of symbols (by packing items inside a struct) and then manually setting the symbols inside the RTC units as each is compiled (using `get_constant_ptr` or similar; I hadn't noticed this, but it will require some changes to reduce the number of items we need to set), roughly as sketched below.
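For illustration only (the struct name and members are placeholders, and the jitify types are from memory), the packing workaround would look something like:

```cpp
#include <cuda.h>
#include "jitify.hpp"

// Pack the commonly reused pointers/values behind a single __constant__
// struct, so only one symbol needs setting per RTC unit.
struct RTCEnvironment {
    const unsigned int* hash_table;   // central variable-lookup table
    void*               message_data;
    unsigned int        agent_count;
};

// Each runtime-compiled unit would declare:
//   __constant__ RTCEnvironment rtc_env;

// After compiling each RTC unit, set the packed struct in a single copy.
void set_rtc_environment(jitify::KernelInstantiation& instantiation,
                         const RTCEnvironment& host_env) {
    auto constant_ptr = instantiation.get_constant_ptr("rtc_env");
    cuMemcpyHtoD(constant_ptr, &host_env, sizeof(RTCEnvironment));
}
```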
One potential downside of this would be that duplicating symbols would inflate constant cache usage, as we're using a large portion of constant cache for a read-only hash table central to accessing variables within models. For potentially large models with many RTC device functions, this could lead to capacity issues.
Edit: Given we'd be duplicating part/all of that hash table, this could actually fill up constant cache very fast. (The hash table is currently given half the constant cache, simply because it can't be scaled at runtime. We'd presumably need to do some parsing of the RTC code to adapt the size per kernel. This obviously requires some more thought on our part.)
Edit 2: I suppose we could actually not have the hash table in the main runtime (for a special Python/RTC build), and concatenate all of the RTC code so it builds as a single unit, so we're not duplicating constant memory!
But that's a long way off at this stage, so I don't think it's a killer problem.
> Linking external device libraries using `-Lpath` and `-lname` options. ...
At this point I'm not sure how viable it would be to load the entire library from PTX at runtime, as from my understanding that would require changing large parts of the codebase that interact with CUDA.
I'm assuming that if we built the library a second time to ptx/cubin and linked against that, it wouldn't also link against the executing runtime, so your second suggestion is likely the way to go.
Edit 3: I suppose the more technically adept approach would be to combine your suggestions, and build a user's RTC kernels separately and link them to achieve something similar to Edit 2 above.
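If we went the library-linking route, I'd guess usage would look something like passing the linker options alongside the usual compiler options (a guess only; the paths and library name are placeholders):

```cpp
#include "jitify.hpp"

// Guess at usage (paths/names are placeholders): pass -L/-l options in the
// options list so the RTC unit links against a pre-built device library
// rather than the executing program itself.
void compile_with_device_library(const char* rtc_source) {
    static jitify::JitCache kernel_cache;
    jitify::Program program = kernel_cache.program(
        rtc_source,
        0,  // no extra headers
        {"-std=c++11",
         "-L/path/to/device/libs",   // placeholder library path
         "-lflamegpu2_device"});     // placeholder library name
    // ... instantiate and launch kernels from `program` as usual ...
}
```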
Thanks for the advice.
@benbarsdell
P.S. could you please add `#include <cctype>` to jitify.hpp? It doesn't build without this under Visual Studio 2015, due to `isdigit` and `isalpha` (removing `std::` from them also works, but that's probably a less portable solution). It seems too minor an issue for its own ticket (and it's a deprecated IDE).