tensorchord/envd

feat(distributed): Support distributed training debug

Opened this issue · 1 comments

Description

The CUJ looks like:

envd run --image xx --replicas 20

Then there will be one interactive shell, and users can type a command, which will run in all replicas.

Then the STDOUT will be aggregated in the terminal.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

It is blocked by #1283