Address frozen parameter warning with FSDP on nightly torch
Opened this issue · 2 comments
carmocca commented
PEFT finetuning (LoRA, adapter) raises the following warning for each FSDP-wrapped layer (transformer block in our case):
```
The following parameters have requires_grad=True:
['transformer.h.0.attn.attn.lora_A', 'transformer.h.0.attn.attn.lora_B']
The following parameters have requires_grad=False:
['transformer.h.0.norm_1.weight', 'transformer.h.0.norm_1.bias', 'transformer.h.0.norm_2.weight', 'transformer.h.0.norm_2.bias', 'transformer.h.0.attn.attn.linear.weight', 'transformer.h.0.attn.attn.linear.bias', 'transformer.h.0.attn.proj.linear.weight', 'transformer.h.0.attn.proj.linear.bias', 'transformer.h.0.mlp.fc.linear.weight', 'transformer.h.0.mlp.fc.linear.bias', 'transformer.h.0.mlp.proj.linear.weight', 'transformer.h.0.mlp.proj.linear.bias']
  warnings.warn(msg)
/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py:174: UserWarning: transformer.h.1 has both parameters with requires_grad=True and False. We do not recommend wrapping such modules since the gradient memory usage will be higher than expected (201510912 numel instead of 131072 numel before sharding via reduce-scatter). If possible, wrap the frozen parameters with FSDP separately.
```
This should be looked into, or silenced if we don't want to act on it.
RuABraun commented
Is changing the code so the LoRA parameters live in a separate module an option? I don't see how else you could wrap the LoRA parameters into a separate FSDP unit. I might be able to help.
MaxGonzalezSaez-Diez commented
Still occurring.