[core][experimental] Support broadcast NCCL ops in accelerated DAG
stephanie-wang commented
Description
When the same GPU tensor is sent to multiple readers, we should use ncclBroadcast under the hood instead of issuing a separate point-to-point send per reader, reducing total transfer time.
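A back-of-the-envelope sketch (not Ray or NCCL code) of why this helps: with separate point-to-point sends, the sender's egress traffic scales with the number of readers, whereas a ring-style broadcast lets the sender transmit the tensor roughly once and readers forward it among themselves. The function names and cost model below are illustrative assumptions, not actual APIs.

```python
def p2p_sender_bytes(tensor_bytes: int, num_readers: int) -> int:
    # Separate point-to-point sends: the sender transmits one full
    # copy of the tensor to each reader.
    return tensor_bytes * num_readers

def broadcast_sender_bytes(tensor_bytes: int, num_readers: int) -> int:
    # Ring-style broadcast (one strategy NCCL can use): the sender
    # transmits the tensor once; readers forward it to each other,
    # so sender egress no longer scales with the reader count.
    return tensor_bytes

# A 1 GiB tensor fanned out to 4 readers: 4 GiB of sender egress
# with repeated sends vs. ~1 GiB with a broadcast.
gib = 1 << 30
print(p2p_sender_bytes(gib, 4) // gib, broadcast_sender_bytes(gib, 4) // gib)
```

Actual NCCL broadcast performance depends on topology and algorithm choice (ring vs. tree), but the sender-side savings grow with the number of readers in any case.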