[core][experimental] Support broadcast NCCL ops in accelerated DAG
stephanie-wang commented
Description
When the same GPU tensor is sent to multiple readers, we should use ncclBroadcast under the hood instead of issuing a separate point-to-point send per reader, reducing total transfer time.
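A back-of-the-envelope sketch (not Ray or NCCL code) of why this helps: with separate point-to-point sends, the sender's egress traffic scales with the number of readers, whereas a ring-style broadcast lets the sender transmit the tensor roughly once and readers forward it among themselves. The function names and cost model below are illustrative assumptions, not actual APIs.

```python
def p2p_sender_bytes(tensor_bytes: int, num_readers: int) -> int:
    # Separate point-to-point sends: the sender transmits one full
    # copy of the tensor to each reader.
    return tensor_bytes * num_readers

def broadcast_sender_bytes(tensor_bytes: int, num_readers: int) -> int:
    # Ring-style broadcast (one strategy NCCL can use): the sender
    # transmits the tensor once; readers forward it to each other,
    # so sender egress no longer scales with the reader count.
    return tensor_bytes

# A 1 GiB tensor fanned out to 4 readers: 4 GiB of sender egress
# with repeated sends vs. ~1 GiB with a broadcast.
gib = 1 << 30
print(p2p_sender_bytes(gib, 4) // gib, broadcast_sender_bytes(gib, 4) // gib)
```

Actual NCCL broadcast performance depends on topology and algorithm choice (ring vs. tree), but the sender-side savings grow with the number of readers in any case.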