How do you conduct distributed training?
Polymorphy12 opened this issue · 1 comment
Polymorphy12 commented
Hello, I'm using 2 GPUs on a single node to train Atlas.
However, even when I set local_rank to 0, training doesn't start.
It still requires MASTER_ADDR, MASTER_PORT, etc.
Is there anything else I need to configure?
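For reference, a single-node multi-GPU PyTorch job does still need MASTER_ADDR and MASTER_PORT, because the process-group rendezvous goes through them even when every rank is on the same machine; pointing them at localhost is enough. Below is a minimal sketch of that setup, not Atlas's own launcher: the port number and the all-reduce smoke test are illustrative assumptions.

```python
# Minimal single-node, 2-GPU distributed setup sketch (not Atlas-specific;
# the port and the smoke-test tensor are illustrative assumptions).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int):
    # Even on one node, rendezvous needs a master address and port;
    # localhost works when all ranks share the machine.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Smoke test: all-reduce a tensor across both ranks to confirm
    # that the GPUs can communicate.
    t = torch.ones(1, device=rank) * rank
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # two GPUs, one node
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

In practice, launching with torchrun (e.g. `torchrun --standalone --nproc_per_node=2 <script>`) sets these environment variables for each rank automatically, so you don't have to export them yourself.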
mlomeli1 commented
I'm afraid I can't help unless you add more context about how exactly you are trying to do this.