EleutherAI/gpt-neox

My servers used for multi-node training do not have ssh. How can I launch multi-node training using the torchrun command?


The machines I use for multi-node training do not allow the ssh service.
How can I launch multi-node training using the most basic torchrun command (torch.distributed.launch)?

The servers I use do not have Slurm, and I found that both OpenMPI and pdsh rely on ssh.
So I have run out of all the ways provided in this repo's README to start a multi-node training job.

I also encountered the same problem. Have you found a solution?
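
One possible direction, offered as a sketch rather than a confirmed solution: torchrun itself does not need ssh, because the same command can be started by hand (or by any job runner available) on every node, with only the node rank differing. The helper below just builds that per-node command line. The script name train_script.py, the address 10.0.0.1, and the node and GPU counts are placeholders, and whether this repo's training entry point runs unmodified under torchrun is an assumption that has not been verified here.

```python
import shlex


def build_torchrun_command(node_rank, nnodes, nproc_per_node,
                           master_addr, master_port,
                           script, script_args=()):
    """Return the torchrun argv to run on the node with the given rank.

    All nodes use the same master_addr/master_port (the rank-0 node);
    only --node_rank changes from machine to machine.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,            # placeholder: the training entry point
        *script_args,      # placeholder: arguments for that script
    ]


if __name__ == "__main__":
    # Hypothetical setup: 2 nodes with 8 GPUs each, rank-0 node at 10.0.0.1.
    for rank in range(2):
        cmd = build_torchrun_command(
            node_rank=rank, nnodes=2, nproc_per_node=8,
            master_addr="10.0.0.1", master_port=29500,
            script="train_script.py",
        )
        # Print the command to paste into a shell on node `rank`.
        print(f"node {rank}: {shlex.join(cmd)}")
```

Each printed line would be run on its own machine. The nodes only need to reach the rank-0 address on the chosen port, so no ssh access between them is involved in the launch itself.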