FSDP returns different loss values with ZeRO stages 2 and 3

Question

FSDP returns different loss values with ZeRO stages 2 and 3

dongsungkim opened this issue 2 years ago · 1 comment

How to reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nnodes=1 --nproc_per_node=2 ./tests/torch/nn/parallel/data_parallel/test_fsdp.py --zero-stage 2
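
To decide whether two runs really disagree, it can help to compare the recorded losses step by step with a tolerance instead of by eye. The sketch below is a hypothetical helper and not part of the test script; the function name `first_mismatch`, the tolerance defaults, and the sample loss values are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def first_mismatch(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-5,
    abs_tol: float = 1e-8,
) -> Optional[int]:
    """Return the first step index where two loss curves differ, or None."""
    if len(baseline) != len(candidate):
        raise ValueError("loss curves must have the same number of steps")
    for step, (a, b) in enumerate(zip(baseline, candidate)):
        # Differences within the tolerance are treated as floating-point noise.
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Made-up loss values, for illustration only (not measured results).
reference_losses = [2.3101, 2.1054, 1.9872]
stage2_losses = [2.3101, 2.1054, 1.8420]
print(first_mismatch(reference_losses, stage2_losses))
```

Knowing the first step at which the curves diverge may help narrow down whether the forward pass or the parameter update is involved.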

Environment

OS: Ubuntu 18.04
Python version: 3.7
Transformers version: 4.21.2
Whether to use Docker:
Misc.:

Answer 1 · 2022-10-16T19:51:28.000Z

There is no optimizer implementation in oslo/torch/nn/parallel/data_parallel/data_parallel.py yet.
One will be added for ZeRO stages 2 and 3.

In addition, cpu_offload in the FSDP code needs to be checked.
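
For context on the optimizer point: ZeRO-style sharding partitions optimizer state across ranks, each rank updates only its own parameter shard, and the shards are then gathered again. Done correctly, this should give the same parameters as an unsharded step. The following is a minimal single-process sketch in plain Python, not OSLO's implementation; `sgd_momentum_step`, `sharded_step`, and the simulated ranks are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def sgd_momentum_step(
    params: Vector, grads: Vector, momentum: Vector,
    lr: float = 0.1, beta: float = 0.9,
) -> Tuple[Vector, Vector]:
    """Plain SGD with momentum over a flat parameter list."""
    new_momentum = [beta * m + g for m, g in zip(momentum, grads)]
    new_params = [p - lr * m for p, m in zip(params, new_momentum)]
    return new_params, new_momentum


def sharded_step(
    params: Vector, grads: Vector, momentum: Vector, world_size: int,
    lr: float = 0.1, beta: float = 0.9,
) -> Tuple[Vector, Vector]:
    """Simulate ranks that each update one contiguous shard."""
    # Assumption: the parameter count divides evenly by world_size.
    shard = len(params) // world_size
    new_params: Vector = []
    new_momentum: Vector = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        p, m = sgd_momentum_step(
            params[lo:hi], grads[lo:hi], momentum[lo:hi], lr, beta
        )
        # Concatenation stands in for the all-gather across ranks.
        new_params.extend(p)
        new_momentum.extend(m)
    return new_params, new_momentum


# Made-up values, for illustration only.
params = [0.5, -1.0, 2.0, 0.25]
grads = [0.1, -0.2, 0.3, 0.4]
momentum = [0.0, 0.0, 0.0, 0.0]

full = sgd_momentum_step(params, grads, momentum)
sharded = sharded_step(params, grads, momentum, world_size=2)
print(full == sharded)
```

A comparison of this kind, run against the real optimizer once it exists, could serve as a regression check for stages 2 and 3.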