OptimalScale/LMFlow

Distributed training parameter settings

Yummy416 opened this issue · 1 comments

Hello, I want to do multi-node distributed training. What DeepSpeed parameters should I set? I set the IP address of the master node, but during training the IP address changed, causing the connection to fail and interrupting the program.

Thanks for your interest in LMFlow! Multi-node training requires a fixed master IP. Could you check whether the inet IP is fixed and reachable from the other nodes? It can be viewed via `ifconfig`. Thanks very much 😄
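As a rough sketch of what a fixed-master multi-node launch can look like (the hostnames, slot counts, port, and `train.py` script below are placeholders, not LMFlow specifics), DeepSpeed accepts a hostfile plus an explicit master address:

```shell
# hostfile: one line per node, with the number of GPUs (slots) on each.
# node-master must resolve to the fixed inet IP of the master node.
cat > hostfile <<'EOF'
node-master slots=8
node-worker1 slots=8
EOF

# Check the master node's inet IP (should stay fixed across the run):
ifconfig        # or: hostname -I

# Launch from the master node, pinning the rendezvous address and port
# so workers always connect to the same endpoint:
deepspeed --hostfile=hostfile \
  --master_addr=192.168.1.10 \
  --master_port=29500 \
  train.py --deepspeed ds_config.json
```

If the IP assigned by DHCP changes mid-run, the rendezvous endpoint disappears and ranks drop their connections, so giving the master node a static IP (or a stable DNS name) is the usual fix.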