FSDP consumes the same amount of memory as DDP, why?
Alex-Songs opened this issue · 11 comments
Hello.
I am using FSDP to replace DDP, but I find that the memory consumption is the same. Is it because FSDP only supports nn.Sequential models? (My model is not an nn.Sequential model.)
Can you provide more details? There are a number of possible reasons:
- maybe you didn't measure the memory correctly. PyTorch does not release memory back to CUDA in an eager fashion (it caches allocations), so you may not see the memory saving.
- your model may be small, and the main memory consumer might be the activation tensors saved for the backward pass. In that case, you won't see memory savings from sharding.
- you are only sharding across 2 GPUs, which doesn't show significant savings for your model size.
- you didn't use nested wrapping of FSDP, so model memory isn't sharded for the layers that are not actively computing (see the sketch below).
There are other reasons I may not be thinking of. Overall, FSDP will save a significant amount of memory when the model is big and sharding is done over a large number of GPUs.
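For the nested-wrapping and measurement points, here is a minimal sketch (assuming fairscale's FSDP; the model and sizes below are made up for illustration, and torch.distributed is assumed to be initialized already, e.g. via torchrun):

```python
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

class Net(nn.Module):
    def __init__(self, dim=4096, depth=4):
        super().__init__()
        # each block is its own FSDP unit, so its full parameters only need to
        # be materialized on the GPU while that block is actually computing
        self.blocks = nn.ModuleList(
            FSDP(nn.Linear(dim, dim).cuda()) for _ in range(depth)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = torch.relu(blk(x))
        return x

model = FSDP(Net().cuda())  # outer wrapper shards whatever is not wrapped yet

# Measure with the allocator rather than nvidia-smi: PyTorch caches CUDA memory,
# so nvidia-smi will not reflect the savings from sharding.
print(torch.cuda.max_memory_allocated() / 2**20, "MiB allocated")
```

Instead of wrapping layers by hand, you can also use the auto-wrapping helpers with a min_num_params threshold, as discussed below.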
@min-xu-ai
Thanks,
I replaced DDP in the code with FSDP as follows, and I use PyTorch's AMP:
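Roughly like this (a simplified sketch, not my full script; model, criterion, loader, local_rank are placeholders):

```python
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

# before: model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
model = FSDP(model.cuda())

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # PyTorch AMP, same as with DDP

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs.cuda()), targets.cuda())
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```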
Testing on two GPUs, we found that FSDP consumes more memory:
DDP:
FSDP:
Try reshard_after_forward=True. Also, make sure you use more GPUs and that activations are not the main memory consumer.
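Something like this (my_model is a placeholder):

```python
from fairscale.nn import FullyShardedDataParallel as FSDP

# free each unit's full (unsharded) params right after its forward pass,
# instead of keeping them around until the backward pass
model = FSDP(my_model, reshard_after_forward=True)
```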
@min-xu-ai
Thanks.
I suspected that the model was too small, causing FSDP to wrap the entire model as a single unit, so I set min_num_params to 20000. The model then has multiple FSDP units, but training gets stuck at:
That’s interesting. Do you have a small reproducible case?
If I had to guess, it might be related to different ranks computing different code paths. An example is LayerDrop, i.e. dropping layers randomly on the forward pass.
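For example, a generic LayerDrop-style forward looks like the sketch below (not your model's code): each rank draws its own random numbers, so ranks can disagree on which FSDP units run, and the skipped units' collectives never get matched.

```python
import torch

def forward_with_layerdrop(layers, x, layerdrop_prob=0.2, training=True):
    for layer in layers:
        # each rank decides independently whether to skip this layer; if the
        # layer is FSDP-wrapped, a skipping rank never joins its all-gather /
        # reduce-scatter, while the other ranks wait on it -> the job hangs
        if training and torch.rand(1).item() < layerdrop_prob:
            continue
        x = layer(x)
    return x
```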
@min-xu-ai
Thanks, the reason was the LayerDrop op.
I have another question. When I tried reshard_after_forward=True, the following error occurred:
WFPB: incorrect saved_grad_shard device CPU vs CUDA:0
Are you using the no_sync() context and doing gradient accumulation? Or are you doing cpu_offloading? You might need to disable one of them.
@min-xu-ai
I have replaced no_sync() with ExitStack() and turned off cpu_offloading, but I still have this error.
@min-xu-ai
In addition, when I use cpu_offload I get the following error:
If you have a small reproducible test case (a single script that I can run to reproduce the issue), I can try to look into it.
Actually, no_sync with FSDP is a bit different from that of DDP. With no_sync, you actually won't get a memory benefit because FSDP holds on to the full params and full gradients, which makes it the same as DDP. This has the benefit of reduced network communication.
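In other words, the no_sync pattern looks like this (model, inputs, optim, n are placeholders); gradients accumulate locally in full, unsharded form:

```python
with model.no_sync():          # skip gradient reduce-scatter for these steps
    for i in range(n - 1):
        loss = model(inputs[i])
        loss.backward()        # full grads kept on each rank -> no memory win
loss = model(inputs[n - 1])    # last micro-batch outside no_sync()
loss.backward()                # grads are reduced and sharded here
optim.step()
optim.zero_grad()
```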
However, if you want to save memory at the expense of more network communication, you can do gradient accumulation directly with FSDP, without no_sync, i.e.:
for i in range(n):
    loss = model(inputs[i])
    loss.backward()
optim.step()
FSDP will internally accumulate gradients in the loop above, and you can call step() after your accumulation steps are done.
See source here:
Will close this and please reopen if you have additional questions.