torch_fsdp_example

A usage example showing the benefit of FSDP (ZeRO-3) over default DDP.


A minimal usage example of torch FSDP (Fully Sharded Data Parallel, https://pytorch.org/docs/stable/fsdp.html), which implements ZeRO-3 (https://arxiv.org/pdf/1910.02054.pdf).
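A minimal sketch of what such an FSDP setup can look like; the model, dimensions, and script name (`fsdp_example.py`) are illustrative placeholders, not the repo's actual code:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model; in practice this would be large enough
    # for sharding to matter.
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.ReLU(),
        nn.Linear(4096, 1024),
    ).cuda()

    # Wrapping in FSDP shards parameters, gradients, and optimizer state
    # across ranks (ZeRO-3), instead of replicating them on every GPU
    # as DDP does.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=4 fsdp_example.py`.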

Requires 4 GPUs, but you can tune that to the number you have. Shows per-GPU memory usage decreasing roughly linearly with the number of GPUs, compared to default torch DDP, which replicates the full model on every GPU. Also shows why you should pass `cache_enabled=False` to `torch.autocast` when there is only one forward pass.
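To illustrate the autocast point: `torch.autocast` caches the half-precision casts of weights so they can be reused across multiple forward passes within the same context; with a single forward pass the cache never gets reused and only holds extra memory. A minimal sketch:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

# With cache_enabled=True (the default), autocast keeps half-precision
# copies of the weights alive until the context exits. That only pays off
# when the same weights are used in several forward passes inside the
# context; for a single pass, the cache is pure memory overhead.
with torch.autocast(device_type="cuda", dtype=torch.float16, cache_enabled=False):
    out = model(x)
```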