Distributed-training

FSDP BERT

To run the BERT example with FSDP:

  • Download the IMDB dataset
  • Run the script
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
python FSDP_BERT.py
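
For orientation, the extracted archive has the layout aclImdb/<split>/<pos|neg>/*.txt, with one review per file. The sketch below shows one way to read a split into memory. The function name load_imdb_split and the label mapping (neg = 0, pos = 1) are assumptions made for illustration; FSDP_BERT.py may load and encode the data differently.

```python
from pathlib import Path

# Label mapping assumed for illustration; FSDP_BERT.py may encode labels differently.
LABELS = {"neg": 0, "pos": 1}


def load_imdb_split(root, split):
    """Return (texts, labels) for one split ("train" or "test") of the extracted archive."""
    texts, labels = [], []
    for name, label in LABELS.items():
        # Each review is a separate .txt file under <root>/<split>/<pos|neg>/
        for path in sorted((Path(root) / split / name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

After the tar command above, load_imdb_split("aclImdb", "train") would return the training reviews with their labels.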

To run BERT with torchrun (single node, 4 processes):

torchrun --nnodes 1 --nproc_per_node 4 FSDP_BERT_torchrun.py
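
torchrun starts one process per worker and passes each its identity through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. As a rough sketch of how a script can pick these up (the helper name read_torchrun_env is hypothetical and not taken from FSDP_BERT_torchrun.py):

```python
import os


def read_torchrun_env(env=None):
    """Read the process identity that torchrun exports to each worker.

    Falls back to single-process defaults so the script can also be
    started with plain `python` (an assumption made for convenience).
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this node, usually the GPU index
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }
```

With the command above, world_size would be 4 and local_rank would range from 0 to 3. The local rank is typically what a script passes to torch.cuda.set_device before initializing the process group.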

FSDP T5

To run the T5 example for text summarization with FSDP (to use DDP instead, uncomment the DDP wrapping in the script):

python FSDP_T5.py
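
Switching between FSDP and DDP comes down to which wrapper is applied to the model. The sketch below illustrates the idea with a hypothetical helper, wrap_model; it is not the code in FSDP_T5.py, which toggles the wrapping by commenting lines in and out. It assumes PyTorch is installed and that the process group has already been initialized before the wrapper is applied.

```python
def wrap_model(model, strategy="fsdp", device_id=None):
    """Wrap `model` for distributed training with FSDP or DDP.

    Hypothetical helper for illustration. Requires an initialized
    torch.distributed process group when called with a valid strategy.
    """
    if strategy not in ("fsdp", "ddp"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    # torch is imported lazily so the argument check above works without it.
    if strategy == "fsdp":
        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

        # FSDP shards parameters, gradients, and optimizer state across ranks.
        return FSDP(model, device_id=device_id)

    from torch.nn.parallel import DistributedDataParallel as DDP

    # DDP keeps a full model replica on every rank and all-reduces gradients.
    return DDP(model, device_ids=None if device_id is None else [device_id])
```

FSDP lowers per-GPU memory use at the cost of extra communication, while DDP is simpler when the whole model fits on one device.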

To run T5 with torchrun (single node, 4 processes):

torchrun --nnodes 1 --nproc_per_node 4 FSDP-T5-torchrun.py