rust-lang/rust

Tracking Issue for NVPTX shared memory


Feature gate: #![feature(stdarch_nvptx)] (probably; see #111199)

This is a tracking issue for access to shared memory from NVPTX device code.

There are two flavors of shared memory in NVPTX (see this NVIDIA Technical Blog (2013) for a friendly introduction):

  • static shared memory provides a fixed amount of shared memory, independent of the block size. There can be multiple static shared arrays (declared in CUDA C++ using the __shared__ attribute).
  • dynamic shared memory provides one base pointer, with a length (in bytes) that must be specified in the kernel launch parameters. If multiple arrays are needed, the single buffer must be manually partitioned (in CUDA C++).

In practice, the required amount of shared memory for GPU algorithms will depend on the block size. Since many domains expect to choose the block size at run-time, static shared memory is often considered to be of limited use.
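The manual partitioning that dynamic shared memory requires can be illustrated host-side. This is a hypothetical sketch, not an existing API: given a block size n, it computes aligned byte offsets for carving one dynamic buffer into an f32 array followed by a u64 array, plus the total size to request at launch.

```rust
use core::mem::{align_of, size_of};

// Hypothetical sketch: carve one dynamic shared buffer into an
// f32 array of n elements followed by a u64 array of n elements.
// Returns (f32 offset, u64 offset, total bytes to request at launch).
fn partition_offsets(n: usize) -> (usize, usize, usize) {
    let f32_end = n * size_of::<f32>();
    // Round up to u64 alignment before placing the second array.
    let a = align_of::<u64>();
    let u64_off = (f32_end + a - 1) / a * a;
    let total = u64_off + n * size_of::<u64>();
    (0, u64_off, total)
}
```

The same arithmetic would have to run (or be precomputed) on the device in the kernel itself, since the kernel only receives the single base pointer.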

Public API

Note that other core::arch::nvptx intrinsics are covered by #111199.

use core::arch::nvptx;

#[no_mangle]
pub unsafe extern "ptx-kernel" fn reverse_kernel(n: u32, a: *mut f32) {
    let t = nvptx::_thread_idx_x() as usize;
    let tr = n as usize - t - 1;
    let (s, sn) = nvptx::_dynamic_smem(); // <--- this issue
    let s = s as *mut f32;
    assert_eq!(sn as u32, n * 4); // requirement for the algorithm below to be correct
    *s.add(t) = *a.add(t);
    nvptx::_syncthreads();
    *a.add(t) = *s.add(tr);
}
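On the host side, the launch would have to request a dynamic shared-memory size matching the kernel's assertion above. A minimal sketch of that computation (the helper name is made up, not part of any proposed API):

```rust
use core::mem::size_of;

// Hypothetical host-side helper: the number of bytes of dynamic shared
// memory that reverse_kernel asserts, namely n elements of f32.
fn reverse_kernel_smem_bytes(n: u32) -> u32 {
    n * size_of::<f32>() as u32
}
```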

A possible implementation for dynamic shared memory

Shared memory can be exposed using inline assembly (reference).

core::arch::global_asm!(".extern .shared .align 16 .b8 _shared_data[];");

use core::arch::asm;

#[inline(always)]
pub fn _dynamic_smem() -> (*mut u8, u32) {
    // Dynamic shared memory.
    let size: u32;
    let saddr: u64;
    let ptr: *mut u8;
    unsafe {
        asm!("mov.u32 {}, %dynamic_smem_size;", out(reg32) size);
        asm!("mov.u64 {}, _shared_data;", out(reg64) saddr);
        asm!("cvta.shared.u64 {ptr}, {saddr};", ptr = out(reg64) ptr, saddr = in(reg64) saddr);
    }
    (ptr, size)
}

Steps / History

  • Implementation: #...
  • Final comment period (FCP) [1]
  • Stabilization PR

Unresolved Questions

  • Do we prefer the _dynamic_smem() -> (*mut u8, u32) or should we have two separate intrinsics for accessing the base pointer and the size?
  • Do we want to use usize for the length even though NVPTX does not (yet, and maybe ever) support that? I'm not sure about other vendors, if we're looking for the intrinsics to be as close as possible.
  • Is it desirable to expose static shared memory or shall we focus on dynamic (which is less intrusive to implement, but may require more thinking to launch kernels with valid parameters)?
  • It has also been suggested that this be implemented as an LLVM intrinsic by introducing llvm.nvvm.read.ptx.sreg.dynamic_smem_size. That could in principle allow accessing the shared memory base pointer without introducing a new symbol (_shared_data above). I lean toward using the inline assembly for now and possibly migrating it later, but no other inline assembly is used in core::arch::nvptx and I don't know if it should be kept that way.
  • Do we want to tackle launch bounds in this issue? If so, it should not be under the stdarch_nvptx feature gate since it would need to verify launch parameters on the host (better user experience) or handle failure on the device (i.e., prevent UB if an insufficient amount of shared memory is provided). I prefer to handle only the unsafe primitives here since there are many open questions about such ergonomics.
  • Should this issue consider partitioning dynamic shared memory into parts (e.g., multiple arrays or more general data structures, parametrized by block size)? This interacts with more types and would also make it inappropriate for the stdarch_nvptx feature gate.

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@rustbot label O-nvptx

One of the concerns here is that we likely will want to consider an API that is amenable to being used on both amdgpu and ptx targets. The concept is identical to both, as far as I am aware. In that sense, the main issue with simply using inline assembly or intrinsics is that it might be too architecture-specific for what we want.

In my crate rust-cuda I have wrapper types for both static (https://github.com/juntyr/rust-cuda/blob/main/src/utils/shared/static.rs) and dynamic (https://github.com/juntyr/rust-cuda/blob/main/src/utils/shared/slice.rs) shared memory, perhaps some of its Assembly hacks might be useful :)

@workingjubilee Does that comment apply to the rest of core::arch::nvptx? Do you envision something like a core::arch::gpu that has vendor-agnostic ways to access such things? Note that Intel's GPU execution model differs from NVIDIA's and AMD's more than those two differ from each other.

Hmm. No, I don't think it applies to the rest of the architectural intrinsics, because many of them are about more-specific ideas about how the GPU's execution is organized, as opposed to a more common concept of "there's areas of memory that threads 0..N get access to but N..N*2 do not (because it has its own area)", which also may require more special consideration for how it will be elaborated relative to our opsem.

There are quite a few intrinsics that exist on both nvptx and amdgpu (and surely also intel). After all, they are built to run the same applications (thinking of Vulkan, DirectX, OpenCL, etc.). 🙂

I guess it would be nice to have the “common” intrinsics like _thread_idx_x/y/z, group and launch size, syncthreads/barrier.

Of course, then the question comes up what the naming scheme should be 😉
The schemes that come to my mind:

  • Vendor specific: Cuda/Hip/Sycl (Sycl is technically a Khronos thing, but mostly supported by intel)
  • API specific: DirectX/Vulkan/OpenCL (I think the Khronos ones, Vulkan/OpenCL are internally consistent)
  • Something else, mixing existing schemes

I lean towards open standards, which would be the Khronos nomenclature.
(Should these be the only implementations, “replacing” the arch-specific ones, or should they only re-expose the arch-specific intrinsics?)

Sorry, I misspoke, the reason to not pursue those immediately in this issue is because ones that are not immediately about shared memory would be yet-another topic.

It's nigh impossible to use shared memory (or write any useful kernel) without the concept of thread indices, which are in core::arch::nvptx. I had in mind that there would be vendor-specific intrinsics using that vendor's preferred terminology (thus my initial suggestion that the feature gate would be stdarch_nvptx, same as #111199) and we could later define a common subset using a single consistent nomenclature. I think it would be confusing if users of vendor-specific features had to deal with two nomenclatures, so I'd be in favor of the "re-expose" strategy.

Note that operational semantics (e.g., with respect to forward progress) are different between vendors. Suppose one writes a kernel using "ptx-kernel" and core::arch::nvptx that relies on forward progress. If a generic "gpu-kernel" and core::arch::gpu were simple aliases, one could port the kernel by changing the use statement (substituting nomenclature as needed) and the kernel would continue to run equivalently on NVIDIA hardware, but would deadlock with other vendors. I think that would be bad. So it seems to me that the generic side should specify the weaker model that allows conforming kernels to be portable (and unsafe aspects of that could be checked via statically safe wrappers, tools like MIRI, or other formal methods).

I agree that an ambition for a future user friendly safe way to deal with shared memory between GPU targets (that ideally would also be consistent on targets without thread groups) should not come in the way for exposing nvptx intrinsics related to shared memory.

I think there's also precedent for this already, as in how std::simd will be a killer feature in the future while the arch SIMD intrinsics are still useful and accepted in the core::arch module.

I don't think anyone relying on forward-progress guarantees in a language whose operational semantics provide no forward-progress guarantees is relying on anything to do with Rust's operational semantics.

Regarding implementing shared memory for nvptx, LLVM exposes it as addrspace(3) globals (inline assembly is quite hacky, I think we should use globals instead in the Rust implementation).

LLVM nvptx docs: https://llvm.org/docs/NVPTXUsage.html#address-spaces
LLVM amdgpu docs (also uses addrspace(3)): https://llvm.org/docs/AMDGPUUsage.html#address-spaces

For static shared allocations, we can declare a global in LLVM IR:

%MyStruct = type { i32, i32, float }
@Gbl = internal addrspace(3) global [1024 x %MyStruct] zeroinitializer

Results in e.g. this nvptx (compiled with llc):

.shared .align 8 .b8 Gbl[12288];
ld.shared.f32 	%f2, [Gbl+8];

For dynamic shared allocations, we declare the global as external:

%MyStruct = type { i32, i32, float }
@ExternGbl = external addrspace(3) global [1024 x %MyStruct]

Resulting nvptx:

.extern .shared .align 8 .b8 ExternGbl[12288];

Test in LLVM that uses it: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/NVPTX/globals_lowering.ll
Test that demonstrates that it cannot be initialized: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/NVPTX/gvar-init.ll

If getting the pointer is implemented as a separate intrinsic, I think it could work like this:

// Declare an intrinsic
#[rustc_nounwind]
#[rustc_intrinsic]
#[rustc_intrinsic_must_be_overridden]
pub fn _dynamic_smem() -> *mut u8 {
    unreachable!()
}

And the implementation for LLVM creates @_shared_data = external addrspace(3) global [0 x i8], align 16 and returns an addrspacecast of the addrspace(3) ptr @_shared_data to a generic ptr.

The thing I’m a little unhappy about is hard-coding the alignment. 16 seems ok for most things, but I guess many use-cases would also be fine with align(4).
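The alignment concern can be stated concretely: a caller that casts the returned *mut u8 to *mut T is only sound if the global's declared alignment covers align_of::<T>(). A host-side sketch of that condition (the helper is hypothetical, purely for illustration):

```rust
use core::mem::align_of;

// Hypothetical check: may a base pointer declared with
// `declared_align` soundly serve as storage for values of type T?
fn align_ok<T>(declared_align: usize) -> bool {
    declared_align >= align_of::<T>()
}
```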

A different way that lets the user specify the alignment would be a generic intrinsic:

// Declare an intrinsic
#[rustc_nounwind]
#[rustc_intrinsic]
#[rustc_intrinsic_must_be_overridden]
pub fn _dynamic_smem<T>() -> *mut T {
    unreachable!()
}

And the implementation for LLVM creates a @_shared_data = external addrspace(3) global [0 x i8] aligned to T's alignment.

For reference, because I wondered about this, the LLVM language reference says that external globals can be larger than their specified type, so dereferencing them outside the declared type is valid (as long as the definition/implementation is large enough).

For global variable declarations […] the allocation size and alignment of the definition it resolves to must be greater than or equal to that of the declaration or replaceable definition, otherwise the behavior is undefined.

I wrote some code that implements an intrinsic for dynamic shared memory: Flakebi@026563a
I went with the second approach, using a generic function, so the user can specify the alignment.

pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T {}

A lot of cleanup is missing (docs, tests, …), but from a quick try it seems to work on amdgpu. I expect that it works on nvptx as well, though I don’t have the hardware to try that.

It's great that you've been able to make progress here. I had the chance to test your solution with nvptx.

Unfortunately, the created global identifier contains dots, which are not allowed in PTX: .extern .shared .align 1 .b8 anon.b7795e05be2e8d54ba0fbc19742a6530.0[];
ptxas therefore raises a parsing error. If I understand correctly, LLVM renames symbols with local linkage (symbols flagged as internal or private), so the external-flagged global is not renamed.

If I change the name of the global when invoking declare_global_in_addrspace to something like DYN_SMEM your solution is working for me on nvptx.
For the intrinsic I am wondering if a fixed identifier wouldn't be sufficient anyway. After all, we only want to create one global symbol per .ptx for it.

I took the opportunity to extend your prototype and implement an intrinsic for static shared memory. Since the attributes necessary for intrinsics have also changed in the meantime, I performed a rebase on the current main branch and resolved the conflicts. Additionally, I set $_dynamic_shared_memory_$ as a fixed identifier for the dynamic shared memory global.
I am working in this branch.

For testing with my NVIDIA GPU I am currently using this repository. Two of the kernels for matrix multiplication use static shared memory, the stencil and reduction kernels use dynamic shared memory. So far I have not encountered any problems.

Nice!

For static shared memory, I was thinking it makes more sense to declare them as static variables in Rust:

// Same in all executions (existing)
static GLOBAL_I: i32 = 0;

// Available only in the current thread (existing)
#[thread_local]
static THREAD_I: i32 = 0;

// Same for all threads in the current group (new)
#[group_shared]
static SHARED_I: i32 = 0;

There is a bit more to this, like currently neither nvptx nor amdgpu allow initializing shared variables, so we want to enforce this being MaybeUninit, and we need a good name for the attribute (and at some point before stabilization, we need to think about effects on Rust’s memory model). But I think declaring a static variable makes more sense from a user’s perspective.

$_dynamic_shared_memory_$

I guess a simple dynamic_shared_memory (or _dynamic_shared_memory if we prefer) does as well :)

No matter which name we pick, one can always write a Rust program that creates a symbol collision, though only through the use of unsafe (since the 2024 edition).

#[unsafe(export_name = "$_dynamic_shared_memory_$")]
static CONFLICTING_SYMBOL: i32 = 0;

For static shared memory, I was thinking it makes more sense to declare them as static variables in Rust:

There is a bit more to this, like currently neither nvptx nor amdgpu allow initializing shared variables, so we want to enforce this being MaybeUninit, and we need a good name for the attribute (and at some point before stabilization, we need to think about effects on Rust’s memory model). But I think declaring a static variable makes more sense from a user’s perspective.

That is definitely valid. I was curious if static shared memory could be handled similarly to dynamic shared memory, but I agree that we already have static as a language construct and should not deviate from it if possible.

To expose it to the user and also handle the MaybeUninit necessity, there is already a similar API in std that does type transformation: thread_local!{..}. We could create something similar:

group_shared! {
    static BUFFER : [i32; 256];
}

which expands to

#[group_shared]
static BUFFER : MaybeUninit<[i32; 256]> = MaybeUninit::uninit();

Another way would be to let the attribute do the type transformation. However, I do not know of any other attribute that does something like this, so it doesn't feel right to me.
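A macro along those lines can be sketched today, minus the attribute itself. Since #[group_shared] does not exist yet, the expansion below omits it; the rest is ordinary declarative-macro machinery:

```rust
use core::mem::MaybeUninit;

// Sketch of the proposed type transformation. A real version would also
// emit the (not yet existing) #[group_shared] attribute on the expanded
// static; here only the MaybeUninit wrapping is demonstrated.
macro_rules! group_shared {
    (static $name:ident : $ty:ty;) => {
        static $name: MaybeUninit<$ty> = MaybeUninit::uninit();
    };
}

group_shared! {
    static BUFFER: [i32; 256];
}
```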

$_dynamic_shared_memory_$

I guess a simple dynamic_shared_memory (or _dynamic_shared_memory if we prefer) does as well :)

No matter which name we pick, one can always write a Rust program that creates a symbol collision, though only through the use of unsafe (since the 2024 edition).

Yes, the intention was to prevent naming conflicts, but I didn't think about export_name. We could do some mangling from rustc or just accept that it is possible to create naming conflicts.

I think these would need to be static mut to be useful (or use atomics/interior mutability) and presumably use raw pointers to avoid UB from violating AXM with references.

But also, it's typical to have several kernels in the same module. Would these static declarations only be allowed within an extern "ptx-kernel" or would they be allowed at module scope? We need to ensure that kernels only require the static shared memory that they will actually use.

I had envisioned just having an intrinsic return *mut T where T: Sized and let experiments with ergonomics and safety take place in libraries.

I think these would need to be static mut to be useful (or use atomics/interior mutability) and presumably use raw pointers to avoid UB from violating AXM with references.

static mut is strongly discouraged since it has far too many footguns. So a non-mut static with interior mutability would be the way to go.

So Cell and UnsafeCell are not Sync. (Unscoped Sync isn't correct for what we need either, since shared memory can only be shared among a thread block, and we don't currently have a scoped Sync to express that.) However, the implementation of Cell really is unsound if shared between threads. Is it better for the user to work with MaybeUninit<[UnsafeCell<T>; N]> than (*mut T, usize)?

Atomics are nontrivial overhead and don't apply to general types.
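To make the trade-off concrete, here is a library-level sketch of the kind of wrapper this discussion circles around: interior mutability via UnsafeCell, with the block-scoped sharing claim encoded as an unchecked unsafe impl Sync. Every name here is hypothetical, and the Sync impl is exactly the unverifiable claim under debate:

```rust
use core::cell::UnsafeCell;
use core::mem::MaybeUninit;

// Hypothetical group-shared storage wrapper. Callers must initialize
// the value before reading and must synchronize conflicting accesses
// themselves (e.g. with _syncthreads on nvptx).
#[repr(transparent)]
pub struct GroupShared<T>(UnsafeCell<MaybeUninit<T>>);

// SAFETY claim (unchecked): each instance is only ever visible to the
// threads of one block, which synchronize around conflicting accesses.
unsafe impl<T> Sync for GroupShared<T> {}

impl<T> GroupShared<T> {
    pub const fn uninit() -> Self {
        Self(UnsafeCell::new(MaybeUninit::uninit()))
    }

    // Hand out a raw pointer rather than a reference, to avoid claiming
    // exclusive (&mut) or immutable (&) access that other threads violate.
    pub fn as_mut_ptr(&self) -> *mut T {
        self.0.get().cast::<T>()
    }
}
```

Whether this is genuinely better for users than a bare (*mut T, usize) pair is the open question; the raw-pointer escape hatch is still needed either way.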

But also, it's typical to have several kernels in the same module. Would these static declarations only be allowed within an extern "ptx-kernel" or would they be allowed at module scope? We need to ensure that kernels only require the static shared memory that they will actually use.

A quick test with ptxas shows that this should not be a problem. Running

ptxas -arch sm_61 -v -ias ".version 5.0
.target sm_61
.address_size 64

.shared .align 4 .b8 static_shared_mem[2048];
.visible .entry no_shared_kernel()
{
	ret;
}

.visible .entry shared_kernel_1()
{
	.reg .b64	%rd<1>;

	cvta.shared.u64	%rd0,  static_shared_mem;
	ret;
}

.visible .entry shared_kernel_2()
{
	.reg .b64	%rd<1>;

	cvta.shared.u64	%rd0, static_shared_mem;
	ret;
}"

shows that only shared_kernel_1 and shared_kernel_2 actually use shared memory, while no_shared_kernel does not.

So Cell and UnsafeCell are not Sync. (Unscoped Sync isn't correct for what we need either, since shared memory can only be shared among a thread block, and we don't currently have a scoped Sync to express that.) However, the implementation of Cell really is unsound if shared between threads. Is it better for the user to work with MaybeUninit<[UnsafeCell<T>; N]> than (*mut T, usize)?

Hm, yes the user would probably need to create a pointer anyway, so it would not change much, except being a little more descriptive in the first place.
In the end creating static shared memory is an allocation so I would also be fine going with

fn alloc_static_smem<T: Sized>() -> *mut T;

or something similar. I also think it would be easier to implement. My static shared memory implementation, which is based on Flakebi's dynamic shared memory implementation, is already basically working.

I opened a PR for dynamic shared memory: #146181

@eddyb you likely have opinions here

Nice!

For static shared memory, I was thinking it makes more sense to declare them as static variables in Rust:

// Same in all executions (existing)
static GLOBAL_I: i32 = 0;

// Available only in the current thread (existing)
#[thread_local]
static THREAD_I: i32 = 0;

// Same for all threads in the current group (new)
#[group_shared]
static SHARED_I: i32 = 0;
There is a bit more to this, like currently neither nvptx nor amdgpu allow initializing shared variables, so we want to enforce this being MaybeUninit, and we need a good name for the attribute (and at some point before stabilization, we need to think about effects on Rust’s memory model). But I think declaring a static variable makes more sense from a user’s perspective.

$_dynamic_shared_memory_$

I guess a simple dynamic_shared_memory (or _dynamic_shared_memory if we prefer) does as well :)

No matter which name we pick, one can always write a Rust program that creates a symbol collision, though only through the use of unsafe (since the 2024 edition).

#[unsafe(export_name = "$dynamic_shared_memory$")]
static CONFLICTING_SYMBOL: i32 = 0;

This is roughly how it is handled in Rust-CUDA FWIW: https://github.com/Rust-GPU/Rust-CUDA/blob/ac2674f03208462709f538f07b10e3f8153c93b6/examples/cuda/gemm/kernels/src/gemm_tiled.rs#L42

In Rust CUDA I've been working on modeling this with allocator types (with the default being global). So SharedAllocator, LocalAllocator, etc. The codegen backend can then track what was allocated where (and perhaps enforce it at call sites) and std collections work via new_in. Has that been discussed previously?

Thanks for the Rust-CUDA reference!
Surely we want to be able to implement allocators for shared memory. As it adds performance overhead though, I don’t think this is the right level of abstraction that Rust should expose.
I assume that most kernels have a static layout of how they use shared memory, depending only on the workgroup/block size.

For static shared memory, I was thinking it makes more sense to declare them as static variables in Rust:

// Same in all executions (existing)
static GLOBAL_I: i32 = 0;

// Available only in the current thread (existing)
#[thread_local]
static THREAD_I: i32 = 0;

// Same for all threads in the current group (new)
#[group_shared]
static SHARED_I: i32 = 0;

Does anything prevent using that same pattern for dynamic group_shared allocations and putting them in an extern block because we don't control the actual allocation?

unsafe extern "gpu-kernel" {
  #[group_shared]
  static SHARED_WITH_LAUNCH_CONTROLLED_SIZE: [i32];
}

Like the intrinsic proposed by Flakebi, the type controls the alignment; unlike the intrinsic which has surprising behavior when called multiple times (same pointer returned, largest alignment wins), this could maybe error out if multiple external group_shared statics are defined.

That would also mirror more closely the CUDA C++ syntax:

__shared__ int32_t shared_with_static_size;
// vs
extern __shared__ int32_t shared_with_launch_controlled_size[];

Thanks for the Rust-CUDA reference! Surely we want to be able to implement allocators for shared memory. As it adds performance overhead though, I don’t think this is the right level of abstraction that Rust should expose. I assume that most kernels have a static layout of how they use shared memory, depending only on the workgroup/block size.

I'm in the middle of implementing but can't see why it won't be zero cost? It just uses allocators as a convenient API to rendezvous between user code and the codegen backend. I guess you are talking about known size at compile time vs runtime?

Again, allocators enable re-using GPU-unaware code. With what you are proposing (I think!) only code that is GPU aware can allocate shared memory. To me that is a big limitation... there is a lot of code out there 😄

You can also see our existing shared memory support here:

https://github.com/Rust-GPU/Rust-CUDA/blob/ac2674f03208462709f538f07b10e3f8153c93b6/crates/cuda_std/src/shared.rs#L1

We support both static and dynamic

Thanks for the Rust-CUDA reference! Surely we want to be able to implement allocators for shared memory. As it adds performance overhead though, I don’t think this is the right level of abstraction that Rust should expose. I assume that most kernels have a static layout of how they use shared memory, depending only on the workgroup/block size.

I'm in the middle of implementing but can't see why it won't be zero cost? It just uses allocators as a convenient API to rendezvous between user code and the codegen backend. I guess you are talking about known size at compile time vs runtime?

I assumed you mean implementing the Allocator trait for managing shared memory. If so, then surely this needs some instructions (atomics?) to allocate and free?

From my understanding of GPU kernels, many kernels do not need dynamic memory management for shared memory. The available shared memory is often divided statically based on workgroup/block size, therefore the only runtime overhead would be getting the workgroup/block size and the thread id within a workgroup/block. Based on that, indices/offsets for accessing shared memory can be computed.
The “dynamic” shared memory is used for cases where the workgroup/block size is specified dynamically at kernel launch time. The dynamic shared memory size is adjusted for that and also specified at launch time.

Again, allocators enable re-using GPU-unaware code.

Sure, the amdgpu target can support alloc and I assume it’s similar for nvptx.
And I think one should be able to write an allocator for shared memory as well. Though at the core, Rust should expose something more low-level than that.

Does anything prevent using that same pattern for dynamic group_shared allocations and putting them in an extern block because we don't control the actual allocation?
[…]
this could maybe error out if multiple external group_shared statics are defined.

I think nothing technical prevents us from doing that, but let’s look at how this would work:

unsafe extern "gpu-kernel" {
  #[group_shared]
  static SHARED_MEM: [i32; 0];
}

fn foo(i: u32) {
  unsafe {
    let ptr: *mut u32 = &SHARED_MEM as *const _ as *mut _;
    // Needs 4 byte alignment
    ptr.add(4).write(i);
  }
}

fn bar() {
  unsafe {
    let ptr: *mut u8 = &SHARED_MEM as *const _ as *mut _;
    // Needs 1 byte alignment
    ptr.add(5).write(42u8);
  }
}

pub extern "gpu-kernel" fn k1() {
  foo(42);
}

pub extern "gpu-kernel" fn k2() {
  foo(3);
}

pub extern "gpu-kernel" fn k3() {
  bar();
}

We have three kernels, k1 and k2 need the dynamic shared memory to be aligned to 4 bytes, k3 needs only 1 byte alignment.
Apart from the way to get a pointer being a bit cumbersome, we are not able to express the different alignment requirements for different kernels in the same program.

Getting a pointer doesn't have to be cumbersome, using slice::as[_mut]_ptr. You took the longest way around in your example because the static isn't mut and doesn't use the same type. We may not even need to go through the pointer API if the type is well-behaved (e.g. MaybeUninit).

In this example I'd expect to have 2 definitions for the extern static. The error would be trying to use them both from the same kernel:

unsafe extern "gpu-kernel" {
    #[group_shared]
    static mut SHARED_MEM_FOR_1_AND_2: [u32; 0]; // 4-byte alignment
}

fn foo(i: u32) {
    unsafe {
        let ptr = SHARED_MEM_FOR_1_AND_2.as_mut_ptr();
        ptr.add(4).write(i);
    }
}

pub extern "gpu-kernel" fn k1() {
    foo(42);
}

pub extern "gpu-kernel" fn k2() {
    foo(3);
}

unsafe extern "gpu-kernel" {
    #[group_shared]
    static mut SHARED_MEM_FOR_3: [MaybeUninit<u8>; 0]; // 1-byte alignment
}

fn bar() {
    unsafe {
        SHARED_MEM_FOR_3.get_unchecked_mut(5).write(42);
    }
}

pub extern "gpu-kernel" fn k3() {
    bar();
}

// errors out because it reaches 2 group_shared extern statics
pub extern "gpu-kernel" fn k4() {
    foo(1);
    bar();
}

If we can provide some guarantee that a kernel doesn't have conflicting expectations about the shared memory, that's a huge safety gain.

I believe @FractalFir is poking around in this space as well.