
PyTorch Lightning for Distributed Training

This repo contains sample code for distributed training for different configurations.

Distributed Data Parallel

  • Data used: MNIST (see the module sketch below)
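The LightningModule and data loading live in main.py. A minimal sketch of what such a module might look like is shown below; it is illustrative only (names like LitMNIST and mnist_loader are placeholders, not the repo's actual code) and assumes PyTorch Lightning 2.x and torchvision.

# Minimal MNIST LightningModule sketch (illustrative; the repo's main.py may differ).
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST


class LitMNIST(pl.LightningModule):
    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.lr = lr
        self.net = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss, sync_dist=True)  # sync the metric across DDP ranks
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def mnist_loader(batch_size: int = 64) -> DataLoader:
    ds = MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    return DataLoader(ds, batch_size=batch_size, num_workers=2)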

Multi-Node, Single GPU per Node

  • Code
  • Setup: one T4 GPU on each of 2 instances.

Command (run on each instance, with NODE_RANK set per node):
MASTER_ADDR={IP of RANK 0} MASTER_PORT=29500 NODE_RANK=0 python main.py
MASTER_ADDR={IP of RANK 0} MASTER_PORT=29500 NODE_RANK=1 python main.py
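MASTER_ADDR, MASTER_PORT and NODE_RANK are picked up from the environment; inside main.py the Trainer only needs the node and device counts. A minimal sketch of a matching Trainer configuration for this layout (2 nodes, 1 GPU each), assuming PyTorch Lightning 2.x; the repo's actual arguments may differ.

# DDP across 2 nodes with 1 GPU each (sketch; assumes PL 2.x).
# MASTER_ADDR / MASTER_PORT / NODE_RANK come from the environment set above.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,       # one GPU per node
    num_nodes=2,     # two nodes -> world size of 2
    strategy="ddp",
    max_epochs=5,
)
trainer.fit(LitMNIST(), train_dataloaders=mnist_loader())  # placeholders from the sketch above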

Single-Node, Multiple GPUs

Code
Command:

pip install torch_tb_profiler   # only if profiling is needed; otherwise disable the profiler
MASTER_ADDR=localhost MASTER_PORT=29500 WORLD_SIZE=2 NODE_RANK=0 python main.py
tensorboard --logdir=./tensorboard/ --host=0.0.0.0 # to view tensorboard
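A matching Trainer sketch for the single-node case, with the optional profiler and a TensorBoard logger writing to ./tensorboard/ so the command above picks it up. This assumes PyTorch Lightning 2.x; the directory and run names are illustrative.

# Single-node DDP on 2 GPUs with optional profiling (sketch; assumes PL 2.x).
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.profilers import PyTorchProfiler

logger = TensorBoardLogger(save_dir="tensorboard", name="ddp_single_node")
profiler = PyTorchProfiler(dirpath="tensorboard", filename="profile")  # needs torch_tb_profiler

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,          # both GPUs on this node
    num_nodes=1,
    strategy="ddp",
    logger=logger,
    profiler=profiler,  # set profiler=None if torch_tb_profiler is not installed
    max_epochs=5,
)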

Distributed Model Parallel

  • Data used: Intel Image Classification dataset

Strategy: FSDP

Multi-Node, Single GPU per Node

Code

  • A ViT model is used.

Caveats:

  • Number of devices = 2 (1 per node, 2 in total), in contrast with the DDP setup above.
  • PTL checkpointing does not work here; the checkpoint callback does not store the weights.
  • Models are therefore saved manually, and the best model is then copied to the root folder (see the sketch after the launch commands below).
export MASTER_PORT=29500
export MASTER_ADDR=172.31.10.239   # IP of the rank-0 node
export WORLD_SIZE=2
export NODE_RANK=0                 # 0 on the first node, 1 on the second

python -m torch.distributed.run \
    --nnodes=$WORLD_SIZE \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --node_rank $NODE_RANK \
    main.py
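A sketch of the corresponding Trainer configuration and the manual weight saving mentioned in the caveats. It assumes PyTorch Lightning 2.x with FSDPStrategy; the callback and file names are illustrative, and the repo's actual saving logic may differ.

# FSDP across 2 nodes with 1 GPU each, plus manual weight saving (sketch; assumes PL 2.x).
import torch
import pytorch_lightning as pl
from pytorch_lightning.strategies import FSDPStrategy


class ManualSaver(pl.Callback):
    """Save weights every epoch, since built-in checkpointing did not store them in this setup."""

    def on_train_epoch_end(self, trainer, pl_module):
        state = pl_module.state_dict()  # gather the state dict on every rank (collective under FSDP)
        if trainer.is_global_zero:
            torch.save(state, f"epoch_{trainer.current_epoch}.pt")


trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,                  # one GPU per node
    num_nodes=2,                # two nodes in total (see caveats)
    strategy=FSDPStrategy(),
    callbacks=[ManualSaver()],
    max_epochs=5,
)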

Single-Node, Multiple GPUs

Code

MASTER_ADDR=localhost MASTER_PORT=29500 WORLD_SIZE=1 NODE_RANK=0 python main.py
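The single-node variant only changes the node and device counts; a brief sketch, again assuming PyTorch Lightning 2.x.

# FSDP sharding across the GPUs of a single node (sketch; assumes PL 2.x).
import pytorch_lightning as pl
from pytorch_lightning.strategies import FSDPStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=-1,                 # all local GPUs
    num_nodes=1,
    strategy=FSDPStrategy(),    # or strategy="fsdp"
    max_epochs=5,
)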