nvptx64-nvidia-cuda target does not support self-referential statics

Question

Opened this issue 2 months ago · 10 comments

Compiling this code for the nvptx64-nvidia-cuda target fails with an LLVM error:

#![no_std]
struct Bar(&'static Bar);
static FOO: Bar = Bar(&FOO);

rustc-LLVM ERROR: Circular dependency found in global variable set
Compiler returned: 101

This appears to be a fundamental limitation of PTX (self-referential globals are not allowed). The issue seems to be present in all versions of Rust that support this target (from the newest nightly back to at least 1.45.2).

I am opening this issue mostly to document this limitation, and I have no idea how it could get fixed.

Answer 1 · 2025-09-19T18:16:55.000Z

uh... cc @rust-lang/opsem If you want an excuse to call PTX a nonconforming implementation of Rust, we've got one, I guess?

Answer 2 · 2025-09-19T19:01:35.000Z

I would be very surprised if this is the only non-conformity of that target. ;) This one is fairly harmless since it just reliably errors.

Answer 3 · 2025-09-19T19:49:01.000Z

We need to patch this to obtain a post-monomorphization error for this target so that we error before we reach LLVM, as LLVM may change whether or not this errors for reasons that may not be related to whether or not it is correct. It would be best if we did not produce IR with LLVM or hardware-level UB on a target.

I guess the hardware is virtual, but

Answer 4 · 2025-09-19T20:20:06.000Z

We need to patch this to obtain a post-monomorphization error for this target so that we error before we reach LLVM, as LLVM may change whether or not this errors for reasons that may not be related to whether or not it is correct.

What would be the best way to detect this? And would people mind some kind of fix-up for this instead?

For Rust-CUDA (which I am currently working on), we are fine with generating some kind of lazily-initialized static here (if that is possible and does not violate the semantics of the Rust language).

Of course, I'd prefer to contribute something like that upstream (so the issue is fixed there too). The bug affects both the ordinary cg_llvm and cg_nvvm, so fixing it generally would be ideal.
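On the detection question, one conceivable approach is a plain cycle search over a graph of "global A's initializer refers to global B" edges. The following is only a sketch under that assumption: the `HashMap` graph representation and the names `has_cycle` and `visit` are hypothetical and do not correspond to any existing rustc or cg_nvvm API.

```rust
use std::collections::HashMap;

/// DFS colouring: a node is either on the current DFS path or fully explored.
#[derive(Clone, Copy)]
enum Mark {
    InProgress,
    Done,
}

/// Returns true if a cycle is reachable from `node`.
fn visit<'a>(
    node: &'a str,
    refs: &HashMap<&'a str, Vec<&'a str>>,
    marks: &mut HashMap<&'a str, Mark>,
) -> bool {
    match marks.get(node) {
        // Reaching a node that is still on the current path closes a cycle.
        Some(Mark::InProgress) => return true,
        Some(Mark::Done) => return false,
        None => {}
    }
    marks.insert(node, Mark::InProgress);
    if let Some(targets) = refs.get(node) {
        for &target in targets {
            if visit(target, refs, marks) {
                return true;
            }
        }
    }
    marks.insert(node, Mark::Done);
    false
}

/// Hypothetical check: does any global (transitively) refer back to itself?
/// `refs` maps a global's name to the globals its initializer points at.
fn has_cycle<'a>(refs: &HashMap<&'a str, Vec<&'a str>>) -> bool {
    let mut marks: HashMap<&str, Mark> = HashMap::new();
    refs.keys().copied().any(|name| visit(name, refs, &mut marks))
}
```

A graph with the single edge FOO -> FOO (the example from the issue) would be reported as cyclic, while A -> B with no edge back would not.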

Answer 5 · 2025-09-19T20:22:29.000Z

Moreover, I should note that this issue is caused by any kind of cyclic reference, not just a self-reference.
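As an illustration of that point, here is a minimal sketch in which no static refers to itself, yet the two statics together form a cycle. The names `Node`, `A`, `B`, `walk` and `cycle_closes` are made up for this example; it is accepted on ordinary host targets, and by the description above it would presumably hit the same LLVM error on nvptx64-nvidia-cuda.

```rust
// Two statics that refer to each other: neither is self-referential,
// but the pair still forms a cycle A -> B -> A.
struct Node {
    name: &'static str,
    other: &'static Node,
}

static A: Node = Node { name: "A", other: &B };
static B: Node = Node { name: "B", other: &A };

/// Names seen when starting at A and following `other` twice.
fn walk() -> [&'static str; 3] {
    [A.name, A.other.name, A.other.other.name]
}

/// Following the references from A twice leads back to A itself.
fn cycle_closes() -> bool {
    core::ptr::eq(A.other.other, &A)
}
```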

Answer 6 · 2025-09-19T20:42:30.000Z

For Rust-CUDA (which I am currently working on), we are fine with generating some kind of lazily-initialized static here (if that is possible and does not violate the semantics of the Rust language).

Oh dear. "Lazily-initialized" how, exactly?

Answer 7 · 2025-09-19T21:27:27.000Z

For Rust-CUDA (which I am currently working on), we are fine with generating some kind of lazily-initialized static here (if that is possible and does not violate the semantics of the Rust language).

Oh dear. "Lazily-initialized" how, exactly?

Before starting execution (all CUDA entrypoints are clearly marked), we check an atomic variable. If this is the first time a kernel from the loaded program is executed, we run a fixup function.

Nothing crazy (like recreating LazyLock in PTX); all statics get fixed up at once. Perhaps "lazily-initialized" is not the best term for it. It is just a fixup on the first execution of a loaded program.
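The scheme described above can be sketched roughly as follows. This is host-side Rust that only models the idea, not what Rust-CUDA actually emits: the three-state flag, the `AtomicPtr` field standing in for the patched reference, and the names `FIXUP_STATE`, `apply_fixups`, `ensure_fixups_applied` and `kernel_entry` are all assumptions made for illustration. A real device-side version would also have to account for GPU scheduling when waiting on the flag.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicU8, Ordering};

const UNINIT: u8 = 0;
const RUNNING: u8 = 1;
const DONE: u8 = 2;

/// One flag for the whole loaded program: all statics are fixed up at once.
static FIXUP_STATE: AtomicU8 = AtomicU8::new(UNINIT);

struct Bar {
    // Emitted as null so the initializer contains no cycle;
    // patched to point back at FOO by the fixup function.
    this: AtomicPtr<Bar>,
}

static FOO: Bar = Bar { this: AtomicPtr::new(ptr::null_mut()) };

/// Patches every cyclic reference (here there is only one).
fn apply_fixups() {
    FOO.this.store(&FOO as *const Bar as *mut Bar, Ordering::Relaxed);
}

/// Runs the fixups exactly once; later callers wait until they are done.
fn ensure_fixups_applied() {
    match FIXUP_STATE.compare_exchange(UNINIT, RUNNING, Ordering::Acquire, Ordering::Acquire) {
        Ok(_) => {
            apply_fixups();
            FIXUP_STATE.store(DONE, Ordering::Release);
        }
        Err(_) => {
            // Someone else started (or already finished) the fixup.
            while FIXUP_STATE.load(Ordering::Acquire) != DONE {
                std::hint::spin_loop();
            }
        }
    }
}

/// Stand-in for a marked CUDA entrypoint: check the flag, then run the body.
fn kernel_entry() -> bool {
    ensure_fixups_applied();
    let patched = FOO.this.load(Ordering::Relaxed) as *const Bar;
    let expected = &FOO as *const Bar;
    patched == expected
}
```

Calling `kernel_entry` repeatedly runs `apply_fixups` only on the first call; every later call sees the flag in the `DONE` state and goes straight to the body.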

Ideally, we would not need this whole song and dance, but it seems that some crates (like bumpalo) use self-referential statics. bumpalo is fairly common, so it would be neat if it worked here.

Answer 8 · 2025-09-19T23:51:44.000Z

/cc @kjetilkjeka as you may have opinions here.

Answer 9 · 2025-10-08T13:53:15.000Z

This is a bit of an annoyance. I guess it's one of the things I would hope that LLVM or the CUDA runtime handled, but it's also not much of a surprise when they don't. I'm not an expert on solving these kinds of problems, but here's my view of things.

I guess it's also possible to call a PTX function from a different PTX module (e.g. by using cuLinkAddFile to link them). I think that to produce correct code we would have to emit the "fixup code" in all PTX functions? But we might also want to avoid the code for this impossible branch when the functions are inlined (which is practically always the case in PTX), as the fixup has already run in the "outer function" or kernel that they are inlined into.

This makes me think the init code should probably be added later than (LLVM) codegen, ideally at PTX loading/JIT time. Since JIT compilation is done by the CUDA driver, practically speaking it needs to happen somewhere in the linking/LTO stages. Do we have sufficient control over cg_nvvm and what comes after it to make that feasible?

Is this analogous to what happens in wasm with __wasm_apply_global_relocs? If so, can we learn something from the approach taken there?

Unless this materializes incredibly quickly then emitting the kind of error that @workingjubilee suggests will be an important first step.

Answer 10 · 2025-10-25T20:52:46.000Z

Unless this materializes incredibly quickly then emitting the kind of error that @workingjubilee suggests will be an important first step.

I have a viable solution for this (which I have tested in very small demo programs), with some limitations: it works with GPU object files, but it does not work at the PTX level (if somebody attempts to use that PTX directly, it fails to link with unresolved references).

Architecture-specific compiled GPU code is stored inside ELF files, which do support this kind of self-referential global. My solution is to do some fixup at link-time. The fixup step is needed since the semantics of self-referential globals can't be directly expressed in PTX.

The idea is that if a global FOO wants to refer to itself, it will instead refer to a fake extern called FOO_UNRESOLVED_CYCLICREF (or something like this). This allows us to express self-referential globals at the PTX level.

Then, after the PTX is compiled into GPU-specific ELF object files, the reference to FOO_UNRESOLVED_CYCLICREF can be replaced with a reference to FOO.

If the fixup is not applied, the file will simply fail to link with unresolved externs (which prevents people from loading broken files). The compiled CUDA kernel will only load successfully if the fixup step is applied (this is something we can ensure happens in Rust-CUDA and cg_nvvm, since we control the entire compilation process there).
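To make the renaming step concrete, here is a toy model of it. It does not parse or write real ELF files: the `Reloc` struct, the suffix constant, and the functions `resolve_cyclic_refs` and `undefined_targets` are hypothetical stand-ins for whatever the actual link-time fixup operates on.

```rust
/// Suffix of the fake extern that stands in for a cyclic reference.
const PLACEHOLDER_SUFFIX: &str = "_UNRESOLVED_CYCLICREF";

/// Toy stand-in for a relocation entry in a GPU object file.
#[derive(Debug, Clone, PartialEq)]
struct Reloc {
    /// Global whose initializer contains the reference.
    in_global: String,
    /// Symbol that the reference points at.
    target: String,
}

/// Rewrites every placeholder target `X_UNRESOLVED_CYCLICREF` to `X`
/// and returns how many relocations were patched.
fn resolve_cyclic_refs(relocs: &mut [Reloc]) -> usize {
    let mut patched = 0;
    for reloc in relocs.iter_mut() {
        if let Some(real) = reloc.target.strip_suffix(PLACEHOLDER_SUFFIX) {
            reloc.target = real.to_string();
            patched += 1;
        }
    }
    patched
}

/// Targets with no definition; a non-empty result models a link failure.
fn undefined_targets<'a>(relocs: &'a [Reloc], defined: &[&str]) -> Vec<&'a str> {
    relocs
        .iter()
        .map(|r| r.target.as_str())
        .filter(|t| !defined.iter().any(|d| d == t))
        .collect()
}
```

In this model, a relocation in FOO that targets FOO_UNRESOLVED_CYCLICREF is reported as undefined until `resolve_cyclic_refs` has run, which mirrors the property that an unfixed file fails to link rather than loading in a broken state.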

Supporting this at the PTX level (which is required for upstream LLVM support) would require some action on NVIDIA's side (they would have to extend PTX to be able to handle such a construct). Fingers crossed that this happens sooner rather than later.