This repository contains examples of distributed training jobs with PyTorch.
The models can be trained on single-GPU instances as well as in a production environment, e.g. on a SLURM cluster.
The model is a simple minGPT model that supports the following features:
- DDP training on a single node or across multiple nodes (a setup sketch follows this list)
- Checkpointing
- Metrics logging with TensorBoard
- Profiling support
- Job configuration via Hydra
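Under torchrun, every worker process is handed its rank and world size through environment variables, and DDP wraps the model so gradients are synchronized across workers. A minimal sketch of that setup, assuming typical names (setup_ddp and the NCCL backend choice are illustrative, not necessarily this repo's actual code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker;
    # init_process_group picks them up via the default env:// method.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Wrap the model so gradients are all-reduced across workers
    # on every backward pass.
    return DDP(model.to(local_rank), device_ids=[local_rank])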
Executing a single process:
pip install -r requirements.txt
python charnn/main.py
Running on multiple GPUs on a single host:
torchrun --nnodes 1 --nproc_per_node 4 \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:29500 \
    charnn/main.py
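Here --rdzv_backend c10d and --rdzv_endpoint configure the rendezvous that the workers use to discover each other, and --nproc_per_node should match the number of GPUs on the host (4 in this example).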
Running with checkpointing and TensorBoard logging:
mkdir -p logs/tb
torchrun --nnodes 1 --nproc_per_node 4 \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:29500 \
    charnn/main.py \
    +trainer.checkpoint_path=./logs/model.pt \
    +trainer.log_dir=./logs/tb
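The "+" prefix is Hydra's override syntax for appending a key that is not present in the base config. Saving and restoring a checkpoint of this kind typically amounts to serializing the model and optimizer state dicts; a minimal sketch, assuming an illustrative layout (the helper names and dictionary keys below are not necessarily the repo's actual format):

import torch

def save_checkpoint(model, optimizer, epoch, path):
    # Unwrap DDP so the checkpoint can also be loaded without DDP.
    module = model.module if hasattr(model, "module") else model
    torch.save(
        {
            "model": module.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        },
        path,
    )

def load_checkpoint(model, optimizer, path):
    snapshot = torch.load(path, map_location="cpu")
    module = model.module if hasattr(model, "module") else model
    module.load_state_dict(snapshot["model"])
    optimizer.load_state_dict(snapshot["optimizer"])
    return snapshot["epoch"]

Once training is running, the logged metrics can be inspected with:

tensorboard --logdir ./logs/tb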
Setting up a SLURM cluster and executing the job on SLURM:
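The exact submission script depends on the cluster, but a job like the ones above is typically launched with sbatch, letting SLURM place one torchrun per node. A sketch, assuming a 2-node allocation with 4 GPUs per node (the job name, node counts, and port are illustrative):

#!/bin/bash
#SBATCH --job-name=charnn
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4

# Use the first node in the allocation as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node 4 \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node:29500 \
    charnn/main.py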