Distributed training parameter settings
Yummy416 opened this issue · 1 comment
Yummy416 commented
Hello, I want to run multi-machine distributed training. Which DeepSpeed parameters should I set? I set the IP address of the master node, but while the job was running the IP address changed, which caused the connection to fail and interrupted the program.
research4pan commented
Thanks for your interest in LMFlow! Multi-node training requires a fixed master IP. Is the inet IP of your master node fixed, and can the other nodes reach it? You can check it with `ifconfig`. Thanks very much 😄
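For anyone hitting the same problem, here is a minimal sketch of how the fixed master IP could be wired into a launch. It is only an illustration, not LMFlow's own launcher: the helper names (`local_ipv4`, `hostfile_lines`, `launch_command`), the addresses, the port, and `train.py` are all made up for the example. It relies on the DeepSpeed launcher's `--hostfile`, `--master_addr`, and `--master_port` options and on the `hostname slots=N` hostfile format.

```python
import ipaddress
import socket
from typing import Dict, List


def local_ipv4(probe_host: str = "10.255.255.255", probe_port: int = 1) -> str:
    """Return the IPv4 address of the interface the OS would route through.

    connect() on a UDP socket sends no packets; it only makes the kernel
    pick a route, so getsockname() reveals the local address on that route.
    Falls back to the loopback address if no route is available.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


def hostfile_lines(hosts: Dict[str, int]) -> List[str]:
    """Format one 'address slots=N' line per node (N = GPUs on that node)."""
    return [f"{host} slots={slots}" for host, slots in hosts.items()]


def launch_command(master_addr: str, master_port: int,
                   hostfile: str, script: str) -> List[str]:
    """Build the argument list for a multi-node deepspeed launch."""
    # Raises ValueError early if master_addr is not a valid IP address.
    ipaddress.ip_address(master_addr)
    return [
        "deepspeed",
        "--hostfile", hostfile,
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        script,
    ]


if __name__ == "__main__":
    # Hypothetical two-node cluster with 8 GPUs each; replace with real IPs.
    nodes = {"192.168.1.10": 8, "192.168.1.11": 8}
    print("\n".join(hostfile_lines(nodes)))
    print(" ".join(launch_command("192.168.1.10", 29500, "hostfile", "train.py")))
    print("This machine's routed IPv4 address:", local_ipv4())
```

If the address printed by `local_ipv4()` on the master node differs between runs, the interface is probably getting its address from DHCP. In that case the address would need to be pinned (a static IP or a DHCP reservation) before passing it as `--master_addr`.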