master node 172.17.0.5 (container master) slave node 172.17.0.3 (container slave) all the port is avaliable bridge node 172.17.0.1 (docker0, special node)
The multi_node bash attempt to build an docker LAN, which contains two node with each of node have 2 gpus.
multi_node bash will be placed under the root folder of this project.
At the first, ssh into the master docker with the forward port 3300
ssh root@localhost -p 3300
with lab candy passwd.
/opt/conda/bin/init ; source ~/.bashrc
to init conda env
cd pytorch_lightning_distributed_training ; ./node1_bash &
to background exec
cat /etc/hosts
to confirm the slave node ip-addr
ssh root@172.17.0.3 -p 22
to ssh in slave node
cd pytorch_lightning_distributed_training ; ./node2_bash &
then the distributed learning is begin !! have fun & good luck
- Update SOON
-
Configureation and Consideration
-
Preparing Dataset for Training
-
Training Aggregate update Gradient
-
Training Loss update
-
Example the Training Loop for Single Machine
-
Configuration and Consideration
-
Preparing Dataset for Training Across multi-Machine
-
Training Aggregate Gradient multi-Machine with Synchronize training
-
Optimization (Communicate + Mixpercision Training)
-
Training loss Update