Support for DLPack exchange of hot buffers
oleksandr-pavlyk opened this issue · 4 comments
Based on the discussion in dmlc/dlpack#57, the `array.__dlpack__` method supports a `stream` keyword, implementing @tqchen's suggestion for synchronization semantics. The `from_dlpack` function in the array API does not support it at present though, requiring users to explicitly synchronize on the exporter's side.

The `from_dlpack` function has acquired a `device` keyword (see #626) to permit porting device data to host, e.g. `from_dlpack(gpu_arr, device=(kDLCPU, 0))`, and, potentially, to allow exchanges down the road between different APIs targeting the same device (e.g., between `kDLOneAPI` and `kDLCUDA` for oneAPI devices targeting NVIDIA GPUs).
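For concreteness, a minimal sketch of the two call sites being discussed; `gpu_arr` and the consumer namespace `xp` are placeholders, and only the `device=` form is part of the standard today:

```python
# Exporter side: the producer's __dlpack__ accepts a `stream` keyword so the
# consumer can establish stream ordering (in DLPack's CUDA convention, the
# integer 1 denotes the legacy default stream).
capsule = gpu_arr.__dlpack__(stream=1)

# Consumer side: the standard's from_dlpack takes no `stream` keyword today.
# With the `device` keyword from #626, a device-to-host transfer looks like
# the following, where kDLCPU == 1 in the DLPack device enum:
host_arr = xp.from_dlpack(gpu_arr, device=(1, 0))
```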
It is tempting to add support for a `stream` keyword to `from_dlpack`, but this keyword value must make sense for the importing library, while the `stream` value passed to `__dlpack__` must make sense for the exporting library. So it seems that `from_dlpack` should only allow specifying a non-default value for the `stream` keyword when it can make sense for both, perhaps when the device type of the requested device is the same as the type indicated by `arr.__dlpack_device__`.
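A rough sketch of the guard suggested above; this is hypothetical, since the standard defines no `stream` keyword on `from_dlpack`, and `_current_device_type` / `_import_capsule` are invented helper names:

```python
def from_dlpack(x, /, *, stream=None):
    # Hypothetical: only forward `stream` when the producer's device type
    # matches the consumer's, so the value is meaningful to both libraries.
    producer_device_type, _ = x.__dlpack_device__()
    if stream is not None and producer_device_type != _current_device_type():
        raise ValueError("stream is only meaningful for same-device-type exchanges")
    capsule = x.__dlpack__(stream=stream)
    return _import_capsule(capsule)
```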
Sorry for the late reply @oleksandr-pavlyk. Thinking about this more, I am not sure it makes sense to add a `stream` argument to `from_dlpack`. Like all other array constructors in the standard, I think we've implicitly assumed (which would have been better to call out explicitly, I agree) that there's a global, implicit "current stream/queue" to which we submit the workloads. It is this implicit queue that we'd use inside `from_dlpack` (by passing it as `stream` to `__dlpack__`). Do you think we should add a `stream` argument to all constructors (in addition to the already-existing `device`)?
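As a sketch, the implicit-queue assumption inside a consumer's `from_dlpack` would look roughly like this; `_get_current_stream_handle` and `_wrap_capsule` are hypothetical internals:

```python
def from_dlpack(x, /):
    # Pass the consumer's implicit "current" stream/queue to the producer,
    # which orders or synchronizes its pending work against it before
    # handing over the capsule.
    stream = _get_current_stream_handle()  # e.g. the current CUDA stream's handle
    capsule = x.__dlpack__(stream=stream)
    return _wrap_capsule(capsule)
```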
No, I am not advocating for adding `stream` keywords everywhere, not even to `from_dlpack`.

I am pointing out, though, that the array API standard currently leaves those who need to use the `stream` keyword argument of `__dlpack__` stranded, as there is no function in the standard to consume the DLPack capsule produced by the `__dlpack__` call.

I was saying that adding a `stream` keyword to `from_dlpack` does not really work out, as a non-default value of `stream` is only pertinent to data interchanges that do not change the device the data resides on.
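In practice that means the exporter-side synchronization has to be done by hand before the standard `from_dlpack` can be used safely; a sketch with dpctl-style spelling (the consumer namespace `consumer_ns` is a placeholder):

```python
import dpctl.tensor as dpt

x = dpt.arange(10**6, dtype="f4")  # kernel may still be in flight on x's queue
x.sycl_queue.wait()                # explicit synchronization on the exporter's side
y = consumer_ns.from_dlpack(x)     # the standard offers no stream/queue to pass here
```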
I think the problem is now clearer to me (thanks @oleksandr-pavlyk and @leofang). The basic problem for the SYCL-based `dpctl` library is that the return value from `__dlpack_device__` is not enough, because SYCL can target multiple device types, and hence the return value can be both `ONE_API` and `CUDA` if the data lives on a CUDA device managed by SYCL.

The device enum was supposed to have mutually exclusive values: data lives either on CPU, or on CUDA, or on ROCm, etc., but never on two device types at once. Hence I think having `ONE_API` there was a mistake.

A potential way to handle this would be to return two device values ("this is a SYCL OneAPI array backed by a CUDA device"), and then let the consumer ask for either OneAPI or CUDA (e.g., CuPy would ask for CUDA, while a user of `torch.xpu` could ask for OneAPI instead).

I think the `stream` part of the problem statement in the issue description may not be relevant. First, the non-uniqueness of the device enum must be sorted out. A stream value is tied to a device type, and `stream` only seems confusing here because the device uniqueness is a problem. If we know we're on CUDA, `stream` should work as designed.
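Illustration only: one possible shape of the "two device values" idea; nothing implements this, and the class, attributes, and tuple-of-tuples return are assumptions:

```python
from enum import IntEnum

class DLDeviceType(IntEnum):  # subset of the DLPack device enum
    kDLCUDA = 2
    kDLOneAPI = 14

class SyclBackedArray:  # hypothetical array type, for illustration
    sycl_device_id = 0
    cuda_device_id = 0

    def __dlpack_device__(self):
        # Advertise both identities: a CUDA-only consumer (e.g. CuPy) could
        # pick kDLCUDA, while a oneAPI consumer could pick kDLOneAPI.
        return ((DLDeviceType.kDLOneAPI, self.sycl_device_id),
                (DLDeviceType.kDLCUDA, self.cuda_device_id))
```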
Indeed. Here's the way I understand it, using `cupy` and `dpctl` as a concrete example (a sketch of Case 2 follows the list):

- Case 1: `cupy.from_dlpack(dpctl.tensor.usm_ndarray)`:
  - `cupy` is the consumer, `dpctl` is the producer
  - `dpctl.tensor.__dlpack_device__` is queried, returning `kDLOneAPI` and an (opaque?) device ID
  - `cupy` does not know how to interpret `kDLOneAPI` (even if it's backed by CUDA), so it raises a `BufferError`
    - this is compliant with the spec ✅
    - but not the behavior `dpctl` users expect ❌
- Case 2: `dpctl.tensor.from_dlpack(cupy.ndarray)`:
  - `cupy` is the producer, `dpctl` is the consumer
  - `cupy.ndarray.__dlpack_device__` is queried, returning `kDLCUDA` and a device ID
  - `dpctl` can query the OneAPI runtime and see if CUDA is in use under the hood
    - if not, raise `BufferError` ✅
    - if yes, `dpctl` passes a meaningful `stream` integer to `cupy.ndarray.__dlpack__` to handshake, establish stream order, and get a capsule ✅
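A consumer-side sketch of Case 2: the runtime query and the stream handle below are stand-ins (`_sycl_backend_is_cuda`, `_native_cuda_stream_handle`, and `_wrap_capsule` are invented names), but the control flow follows the bullets above:

```python
def from_dlpack_cuda_interop(cupy_arr):
    device_type, device_id = cupy_arr.__dlpack_device__()
    if device_type != 2:                      # 2 == kDLCUDA
        raise BufferError("unsupported device type")
    if not _sycl_backend_is_cuda(device_id):  # is SYCL backed by CUDA here?
        raise BufferError("SYCL is not using the CUDA backend on this device")
    # Hand the consumer's native CUDA stream (obtained via SYCL interop) to the
    # producer so it can establish stream order before exporting the capsule.
    capsule = cupy_arr.__dlpack__(stream=_native_cuda_stream_handle())
    return _wrap_capsule(capsule)
```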