m4rs-mt/ILGPU

[POTENTIAL BUG]: CopyToCpu is using refs in unsafe way but there is no indication of that.

Opened this issue · 3 comments

Describe the bug

This routine calls CopyToCPUUnsafeAsync which does not pin/fix refs and simply calls Unsafe.AsPointer on them. This could potentially lead to situations where pointer could end up pointing at invalid memory region if GC kicks in while copy operation is ongoing.

I guess there is an implicit assumtion that all refs should be pinned before use. But I personally would have expected ref taking api to pin those manually, e.g. fixed p = ref ...

I think there are few ways to deal with that:

  1. Simply tell people that there are NO ref safety guarantees so it becomes a global implicit rule
  2. Explicitly state this in the comment
  3. Find such apis and actually use fixed then route to pointer-taking apis
  4. (I guess not possible) rename api to CopyToCPUUnsafe
public static void CopyToCPU<T, TView>(
      this TView source,
      AcceleratorStream stream,
      ref T cpuData,
      long length)
      where T : unmanaged
      where TView : IContiguousArrayView<T>
    {
      source.CopyToCPUUnsafeAsync<T, TView>(stream, ref cpuData, length);
      stream.Synchronize();
    }

Environment

  • ILGPU version: main

Steps to reproduce

In the description

Expected behavior

In the description

Additional context

No response

Hi @En3Tho, thank you for raising this potential bug report and welcome to the ILGPU community! Indeed, these functions are considering access to either pinned managed/or even unmanaged pointers to be targeted directly from GPU drivers at the moment. I agree that we should improve documentation of these functions to help people understand potential side effects (as in undefined behavior...) and potentially enhance this function with fixed p = ... while providing a native overload using IntPtr or void* to avoid additional pinning attempts. What do you think?

I think it would be good to have different ref T and T* overloads, former being "safer oner". I understand that basically once you go into gpu land there is no such thing as "safety" but still it would be good to eliminate a possible cause of clr crash.

My main concern is that someone might try using this on web server for example where some tasks are a perfect fit for gpu. Active memory hogs like webservers are triggering gc quite often and it could really lead to the whole app crashing. Usually using ref T means that operation is guaranteed to be safe at least memory wise.

So an ideal solution in my opinion would be 2 overloads: ref T with a pinning (which is considered super cheap as cpu operation but it might introduce gc fragmentation) and T* for people who "know what they are doing". There could be as well a convinient Span<T> overload just to utilize the same overload for basically any kind of contigious memory (stack, gcheap, native heap etc` with at least some kind of bound checking etc just like it works in bcl.

Sounds reasonable to me. Please note, that we can add an additional overload without any limitations. However, changing the semantics of an existing function is not possible due to backwards compatibility (since it has been used for several years now). We could deprecate the current function, while keeping its semantics. Additionally, we can introduce an overload featuring T* and add another function CopyToCPUWithPinning (or something like this) accepting ref T while making it clear that this reference will be automatically pinned. Alternatively, we can use your Span<T> proposal while marking the "old" function as obsolete.