ray-project/ray

[core][experimental] Handle NCCL errors in accelerated DAGs


Description

Handle the following error cases in accelerated DAGs that use NCCL:

  • Application errors (Python exceptions raised by user code)
  • Peer actor failure
  • Network errors

Ideally, actors participating in the DAG should still be usable after any of these errors is raised.
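As a rough illustration of the desired behavior, here is a minimal pure-Python sketch. It does not use Ray or NCCL, and it is not how Ray implements this. `FailureKind`, `PeerDiedError`, `TaskError`, `classify`, and `DagWorker` are all hypothetical names invented for this example. The idea is that a failure is captured as a value and tagged with one of the three categories above, so the worker that hit it can keep serving later inputs.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    APPLICATION = "application"    # Python exception raised by user code
    PEER_FAILURE = "peer_failure"  # another actor in the DAG has died
    NETWORK = "network"            # transport-level error


class PeerDiedError(Exception):
    """Hypothetical marker for 'a peer actor in the DAG is gone'."""


@dataclass
class TaskError:
    """An error captured as a value instead of being left to propagate."""
    kind: FailureKind
    cause: BaseException


def classify(exc: BaseException) -> FailureKind:
    # Assumption: peers and the transport surface distinct exception types.
    if isinstance(exc, PeerDiedError):
        return FailureKind.PEER_FAILURE
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return FailureKind.NETWORK
    return FailureKind.APPLICATION


class DagWorker:
    """Stand-in for an actor that runs one step of a DAG per input."""

    def __init__(self, fn):
        self._fn = fn

    def step(self, value):
        try:
            return self._fn(value)
        except Exception as exc:
            # Return the error to the caller; the worker itself stays usable.
            return TaskError(classify(exc), exc)


def square_non_negative(x):
    if x < 0:
        raise ValueError("negative input")
    return x * x


worker = DagWorker(square_non_negative)
failed = worker.step(-1)  # a TaskError tagged FailureKind.APPLICATION
later = worker.step(3)    # the same worker still handles the next input
```

In a real accelerated DAG the hard part is that a failed or aborted NCCL operation can leave peers blocked in a collective or send/recv, so catching the exception on one actor is not enough by itself; the sketch only pins down the caller-visible contract.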
