[core][experimental] Handle NCCL errors in accelerated DAGs
stephanie-wang commented
Description
Handle:
- Application errors (Python exceptions)
- Peer actor failure
- Network errors
Ideally, actors participating in the DAG should still be usable after the error is thrown.
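To make the desired behavior concrete, here is a minimal sketch in plain Python (no Ray or NCCL involved) of a worker loop that survives an application error. All names here (`WrappedError`, `actor_loop`, `get_result`, `reciprocal`) are hypothetical and only for illustration; this is not how accelerated DAGs are implemented. The idea is that the worker forwards an exception downstream as a value instead of dying, and the reader uses a bounded wait so that a dead peer or a stalled channel would surface as a timeout rather than a hang.

```python
import queue
import threading


class WrappedError:
    """Hypothetical container that carries an exception to the consumer."""

    def __init__(self, exc):
        self.exc = exc


def actor_loop(fn, inbox, outbox):
    """Apply fn to each input until a None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            break
        try:
            outbox.put(fn(item))
        except Exception as exc:
            # Forward the failure downstream instead of letting the loop die,
            # so the worker stays usable for the next input.
            outbox.put(WrappedError(exc))


def get_result(outbox, timeout=5.0):
    """Read one result; re-raise forwarded errors on the caller's side.

    The bounded wait raises queue.Empty if nothing arrives in time, which is
    how a dead peer would show up in this sketch.
    """
    result = outbox.get(timeout=timeout)
    if isinstance(result, WrappedError):
        raise result.exc
    return result


def reciprocal(x):
    return 1 / x


inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(
    target=actor_loop, args=(reciprocal, inbox, outbox), daemon=True
)
worker.start()

# First input triggers an application error inside the worker.
inbox.put(0)
try:
    get_result(outbox)
    first = "ok"
except ZeroDivisionError:
    first = "error"

# The same worker still serves the next input.
inbox.put(4)
second = get_result(outbox)

# Shut the worker down cleanly.
inbox.put(None)
worker.join(timeout=5.0)
```

In this sketch the caller sees the `ZeroDivisionError` for the first input and a normal result for the second, which is the "still usable after the error" property asked for above. Handling a real NCCL failure would additionally require tearing down or recovering the communicator, which this sketch does not attempt.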
Use case
No response