FSDP returns different loss values with ZeRO stage 2 and stage 3
dongsungkim opened this issue · 1 comment
dongsungkim commented
How to reproduce
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nnodes=1 --nproc_per_node=2 ./tests/torch/nn/parallel/data_parallel/test_fsdp.py --zero-stage 2
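One way to pin down where the two stages diverge is to log the per-step loss from each run and compare the two sequences with a tolerance. The following is only a minimal sketch: the helper name `first_loss_mismatch`, the tolerances, and the sample loss values are hypothetical and are not taken from the oslo test script.

```python
import math


def first_loss_mismatch(losses_a, losses_b, rel_tol=1e-5, abs_tol=1e-8):
    """Return the index of the first step whose losses differ beyond tolerance, or None."""
    if len(losses_a) != len(losses_b):
        raise ValueError("the two runs logged a different number of steps")
    for step, (a, b) in enumerate(zip(losses_a, losses_b)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Hypothetical per-step losses, made up for illustration only.
stage2_losses = [2.3101, 2.2874, 2.2510]
stage3_losses = [2.3101, 2.2874, 2.2498]

print(first_loss_mismatch(stage2_losses, stage3_losses))
```

A mismatch at the very first step would point at the forward pass or parameter sharding, while a mismatch that only appears after the first update would be more consistent with a difference in the optimizer step.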
Environment
- OS: Ubuntu 18.04
- Python version: Python 3.7
- Transformers version: 4.21.2
- Whether to use Docker:
- Misc.:
dongsungkim commented
There is currently no optimiser implementation in oslo/torch/nn/parallel/data_parallel/data_parallel.py.
One will be added for ZeRO stages 2 and 3.
In addition, cpu_offload in the FSDP code needs to be checked.