data-apis/array-api

Support for DLPack exchange of hot buffers

oleksandr-pavlyk opened this issue · 4 comments

Based on discussion in dmlc/dlpack#57, the array.__dlpack__ method supports a stream keyword, implementing @tqchen's suggestion for synchronization semantics.

The from_dlpack function in the array API does not support it at present, though, requiring users to synchronize explicitly on the exporter's side.

The from_dlpack function has acquired a device keyword (see #626) to permit moving device data to the host, e.g. from_dlpack(gpu_arr, device=(kDLCPU, 0)), and, potentially, to allow exchanges down the road between different APIs targeting the same device (e.g., between kDLOneAPI and kDLCUDA for oneAPI devices targeting NVIDIA GPUs).
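To make the calling convention concrete, here is a minimal sketch; `xp` (an array-API-compliant namespace) and `gpu_arr` (a device-resident array) are illustrative stand-ins rather than names from any particular library:

```python
# A minimal sketch of the device keyword described above. `xp` and
# `gpu_arr` are assumptions for illustration, not a specific library.
kDLCPU = 1  # DLPack device-type enum value for the host CPU


def to_host(xp, gpu_arr):
    # Ask the consumer to place the imported data on the host,
    # regardless of which device the producer keeps it on.
    return xp.from_dlpack(gpu_arr, device=(kDLCPU, 0))
```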

It is tempting to add the stream keyword to from_dlpack as well, but the value passed there must make sense for the importing library, while the stream value passed to __dlpack__ must make sense for the exporting library. So it seems that from_dlpack should only allow a non-default stream value when it can make sense for both, perhaps when the device type of the requested device is the same as the type reported by arr.__dlpack_device__. A sketch of that restriction follows.
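A hedged sketch of that restriction, under the assumption that `device` is given in the `(device_type, device_id)` tuple form; the function name is illustrative, not a spec proposal:

```python
# Permit a non-default stream only when the requested device type
# matches the producer's; all names besides __dlpack_device__ and
# __dlpack__ are illustrative.
def from_dlpack_with_stream(x, *, device=None, stream=None):
    src_dev_type, _ = x.__dlpack_device__()
    if stream is not None and device is not None and device[0] != src_dev_type:
        raise ValueError(
            "a non-default stream only makes sense when the exchange "
            "stays on the producer's device type"
        )
    # Hand the capsule back; a real implementation would wrap it
    # in the consumer's array type.
    return x.__dlpack__(stream=stream)
```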

@kkraus14 @fcharras @leofang @ogrisel

Sorry for the late reply, @oleksandr-pavlyk. Thinking about this more, I am not sure it makes sense to add a stream argument to from_dlpack. Like all other array constructors in the standard, I think we've implicitly assumed (which, I agree, would have been better to call out explicitly) that there's a global, implicit "current stream/queue" to which we submit the workloads. It is this implicit queue that we'd use inside from_dlpack (by passing it as stream to __dlpack__), as sketched below. Do you think we should add a stream argument to all constructors (in addition to the already-existing device)?
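A minimal sketch of what that would look like; `consumer_lib`, `current_stream`, and `wrap_capsule` are hypothetical placeholders, not any real library's API:

```python
# from_dlpack uses the consumer's implicit "current stream/queue"
# rather than a user-supplied one.
def from_dlpack(producer_arr, consumer_lib):
    dev_type, dev_id = producer_arr.__dlpack_device__()
    # The consumer's global, implicit stream/queue for that device:
    stream = consumer_lib.current_stream(dev_id)
    # The producer establishes stream order against it before exporting.
    capsule = producer_arr.__dlpack__(stream=stream)
    return consumer_lib.wrap_capsule(capsule)
```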

No, I am not advocating for adding stream keywords everywhere, not even to from_dlpack.

I am pointing out, though, that the array API standard currently leaves those who need to use the stream keyword argument of __dlpack__ stranded, as there is no function in the standard to consume the DLPack capsule produced by a __dlpack__ call (see the sketch below).
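A short illustration of the gap, with `gpu_arr`, `my_stream`, and `xp` as hypothetical stand-ins:

```python
# The producer side is covered by the standard, but nothing standard
# consumes the resulting capsule.
def stranded_exchange(gpu_arr, my_stream, xp):
    # Producer side: fine, __dlpack__ accepts a stream keyword.
    capsule = gpu_arr.__dlpack__(stream=my_stream)
    # Consumer side: xp.from_dlpack(gpu_arr) calls __dlpack__ itself
    # and exposes no stream keyword, so this capsule cannot be handed
    # over through any standard function.
    return capsule
```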

I was saying that adding a stream keyword to from_dlpack does not really work out, as a non-default value of stream is only pertinent to data interchanges that do not change the device the data resides on.

I think the problem is now clearer to me (thanks @oleksandr-pavlyk and @leofang). The basic problem for the SYCL-based dpctl library is that the return value of __dlpack_device__ is not enough, because SYCL can target multiple device types: data living on a CUDA device managed by SYCL could legitimately be reported as either ONE_API or CUDA.

The device enum was supposed to have mutually exclusive values: data lives either on CPU, or on CUDA, or on ROCm, etc., but never on two device types at once. Hence I think having ONE_API there was a mistake.

A potential way to handle this would be to return two device values ("this is a SYCL oneAPI array backed by a CUDA device") and then let the consumer ask for either oneAPI or CUDA (e.g., CuPy would ask for CUDA, while a user of torch.xpu could ask for oneAPI instead); a hypothetical illustration follows.
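A hypothetical illustration of that idea (this two-value form does not exist in DLPack today; enum values follow dlpack/dlpack.h):

```python
kDLCUDA = 2
kDLOneAPI = 14


class SyclCudaArray:
    """Stand-in for a SYCL oneAPI array backed by a CUDA device."""

    def __dlpack_device__(self):
        # Hypothetical two-value form: the managing runtime first, the
        # backing hardware second; the consumer picks whichever one it
        # understands.
        return ((kDLOneAPI, 0), (kDLCUDA, 0))
```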

I think the stream part of the problem statement in the issue description may not be relevant yet. First, the non-uniqueness of the device enum must be sorted out: a stream value is tied to a device type, and stream only seems confusing here because device uniqueness is the real problem. If we know we're on CUDA, stream should work as designed.

Indeed. Here's the way I understand it, using cupy and dpctl as a concrete example:

  • Case 1: cupy.from_dlpack(dpctl.tensor.usm_ndarray):
    • cupy is consumer, dpctl is producer
    • dpctl.tensor.__dlpack_device__ is queried, returning kDLOneAPI and an (opaque?) device ID
    • cupy does not know how to interpret kDLOneAPI (even if it's backed by CUDA), so it raises a BufferError
      • this is compliant with the spec ✅
      • but not an expected behavior by dpctl users ❌
  • Case 2: dpctl.tensor.from_dlpack(cupy.ndarray):
    • cupy is producer, dpctl is consumer
    • cupy.ndarray.__dlpack_device__ is queried, returning kDLCUDA and a device ID
    • dpctl can query the oneAPI runtime and see whether CUDA is in use under the hood
      • if not, raise BufferError
      • if yes, dpctl passes a meaningful stream integer to cupy.ndarray.__dlpack__ to handshake, establish stream order, and get a capsule ✅ (see the sketch below)
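A hedged sketch of the Case 2 handshake; the helpers on `sycl_queue` are hypothetical and dpctl's actual internals may differ:

```python
# When the SYCL queue is backed by CUDA, extract the native CUDA stream
# and pass it to cupy's __dlpack__ to establish stream order.
kDLCUDA = 2


def dpctl_from_cupy(cp_arr, sycl_queue):
    dev_type, dev_id = cp_arr.__dlpack_device__()
    # Hypothetical attribute: which backend the SYCL queue targets.
    if dev_type != kDLCUDA or sycl_queue.backend != "cuda":
        raise BufferError("producer's device is not reachable from this queue")
    # Hypothetical helper: the native CUDA stream underlying the queue.
    native_stream = sycl_queue.get_native_stream()
    # Handshake: the producer orders its pending work against our stream.
    capsule = cp_arr.__dlpack__(stream=native_stream)
    # A real implementation would wrap the capsule as a usm_ndarray tied
    # to sycl_queue; returning the raw capsule keeps this sketch short.
    return capsule
```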