LoRA results in 4-6% lower performance compared to full fine-tuning
digvijayingle016 opened this issue · 7 comments
I am working on fine-tuning LLMs (6B to 40B parameters) using the LoRA framework on an instruction-tuning dataset comprising instructions for ~20 tasks (a mix of factual as well as open-ended tasks). The input to the model consists of a conversation snippet between two individuals along with a task-specific prompt. The results I am observing do not align with the performance improvements reported in the paper. Specifically, the paper reports that fine-tuning using LoRA generally results in performance at par with or better than full fine-tuning of the model; however, throughout my experiments I observe performance lower than full fine-tuning by an absolute margin of ~4-6% in terms of RougeL score.
Sharing some of the training details below:
[Framework versions]
Python: 3.8
PyTorch: 1.13.1
Transformers: 4.27.4
PEFT: 0.3.0
[Infrastructure]
8 X A100 40 GB GPUs
[Hyper-parameter Range]
Learning rate: 5e-5 to 3e-3
Learning rate scheduler: [Constant, Linear]
Epochs: [1, 2]
Batch size: [2, 4, 8]
Weight decay: 0.0
Precision: bf16
Specifically, I tried fine-tuning the google/flan-t5-xxl model in the following two scenarios:
- Scenario 1: Full fine-tuning with constant learning rate = 5e-5, batch size = 8, epochs = 1
- Scenario 2: Fine-tuning using LoRA with constant learning rate = 1e-3, batch size = 8, epochs = 1, and LoraConfig as follows:
LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias='none', task_type="SEQ_2_SEQ_LM")
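For reference, the Scenario 2 adapter setup above can be written out as a minimal sketch. One detail worth noting: since no target_modules is specified, PEFT falls back to its per-architecture defaults, which for T5-family models adapt only the q and v attention projections (the application lines are shown as comments because they assume an already-loaded base model):

```python
from peft import LoraConfig, TaskType

# Scenario 2 adapter config as reported above. With no explicit
# target_modules, PEFT's default for T5-family models adapts only
# the q and v attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,          # effective scaling = lora_alpha / r = 2.0
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

# Applied to the loaded base model (flan-t5-xxl in this issue), e.g.:
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
```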
Observation: Scenario 2 resulted in ~4% lower RougeL compared to Scenario 1. I have also tried tuning the hyper-parameters in Scenario 2 over the range specified above; however, the best I could achieve still leaves a gap of ~4% RougeL.
Thank you very much for your time and consideration. Looking forward to any relevant insights here.
Thanks for bringing up this discussion. What I would try first is increasing the r argument to 16 or 32 to add more trainable parameters to the LoRA modules.
Also note that in the QLoRA paper (https://arxiv.org/abs/2305.14314), the authors suggest adapting all linear layers, including the FFN layers, rather than only the attention layers (the default).
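Concretely, adapting all linear layers of flan-t5 could look like the sketch below. The module names are an assumption based on T5's gated-FFN layout (wi_0/wi_1/wo alongside the q/k/v/o attention projections) and should be verified against model.named_modules() for your exact checkpoint:

```python
from peft import LoraConfig

# Sketch only: target every linear projection in flan-t5, not just attention.
# Module names follow T5's gated-FFN layout; verify them against
# model.named_modules() for your checkpoint before relying on this list.
all_linear_modules = ["q", "k", "v", "o", "wi_0", "wi_1", "wo"]

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    target_modules=all_linear_modules,
)
```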
@younesbelkada - Thank you for the suggestion above. I did try experimenting with higher ranks (16, 32); however, the performance does not seem to change drastically. This aligns with the observation in the QLoRA paper that the projection dimension does not impact performance.
However, based on the above suggestion, I also tried adapting all the other linear layers (including the FFN layers). The performance improved by 1%, but there is still a gap of ~3% compared to full fine-tuning.
I see, thanks for the experiments. Can you also double-check that the LoRA weights are set on the encoder as well?
Also, what is the task you are trying to fine-tune on, and do you quantize the base model in 8-bit?
Yes, the LoRA weights are set on both the encoder and the decoder. Here is the list of modules for the first block of the encoder and the decoder:
Encoder
base_model.model.encoder.block.0.layer.0.SelfAttention.q
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_dropout
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_dropout.default
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_A
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_A.default
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_B
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_B.default
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_embedding_A
base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_embedding_B
base_model.model.encoder.block.0.layer.0.SelfAttention.k
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_dropout
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_dropout.default
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_A
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_A.default
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_B
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_B.default
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_embedding_A
base_model.model.encoder.block.0.layer.0.SelfAttention.k.lora_embedding_B
base_model.model.encoder.block.0.layer.0.SelfAttention.v
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_dropout
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_dropout.default
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_A
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_A.default
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_B
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_B.default
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_embedding_A
base_model.model.encoder.block.0.layer.0.SelfAttention.v.lora_embedding_B
base_model.model.encoder.block.0.layer.0.SelfAttention.o
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_dropout
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_dropout.default
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_A
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_A.default
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_B
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_B.default
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_embedding_A
base_model.model.encoder.block.0.layer.0.SelfAttention.o.lora_embedding_B
base_model.model.encoder.block.0.layer.0.SelfAttention.relative_attention_bias
base_model.model.encoder.block.0.layer.0.layer_norm
base_model.model.encoder.block.0.layer.0.dropout
base_model.model.encoder.block.0.layer.1
base_model.model.encoder.block.0.layer.1.DenseReluDense
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_dropout
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_dropout.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_A
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_A.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_B
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_B.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_embedding_A
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_0.lora_embedding_B
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_dropout
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_dropout.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_A
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_A.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_B
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_B.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_embedding_A
base_model.model.encoder.block.0.layer.1.DenseReluDense.wi_1.lora_embedding_B
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_dropout
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_dropout.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_A
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_A.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_B
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_B.default
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_embedding_A
base_model.model.encoder.block.0.layer.1.DenseReluDense.wo.lora_embedding_B
base_model.model.encoder.block.0.layer.1.DenseReluDense.dropout
base_model.model.encoder.block.0.layer.1.DenseReluDense.act
base_model.model.encoder.block.0.layer.1.layer_norm
base_model.model.encoder.block.0.layer.1.dropout
Decoder
base_model.model.decoder.block.0.layer.0.SelfAttention.q
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_dropout
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_dropout.default
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_A
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_A.default
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_B
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_B.default
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_embedding_A
base_model.model.decoder.block.0.layer.0.SelfAttention.q.lora_embedding_B
base_model.model.decoder.block.0.layer.0.SelfAttention.k
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_dropout
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_dropout.default
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_A
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_A.default
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_B
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_B.default
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_embedding_A
base_model.model.decoder.block.0.layer.0.SelfAttention.k.lora_embedding_B
base_model.model.decoder.block.0.layer.0.SelfAttention.v
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_dropout
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_dropout.default
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_A
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_A.default
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_B
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_B.default
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_embedding_A
base_model.model.decoder.block.0.layer.0.SelfAttention.v.lora_embedding_B
base_model.model.decoder.block.0.layer.0.SelfAttention.o
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_dropout
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_dropout.default
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_A
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_A.default
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_B
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_B.default
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_embedding_A
base_model.model.decoder.block.0.layer.0.SelfAttention.o.lora_embedding_B
base_model.model.decoder.block.0.layer.0.SelfAttention.relative_attention_bias
base_model.model.decoder.block.0.layer.0.layer_norm
base_model.model.decoder.block.0.layer.0.dropout
base_model.model.decoder.block.0.layer.1
base_model.model.decoder.block.0.layer.1.EncDecAttention
base_model.model.decoder.block.0.layer.1.EncDecAttention.q
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_dropout
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_dropout.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_A.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_B
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_B.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_embedding_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.q.lora_embedding_B
base_model.model.decoder.block.0.layer.1.EncDecAttention.k
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_dropout
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_dropout.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_A.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_B
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_B.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_embedding_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.k.lora_embedding_B
base_model.model.decoder.block.0.layer.1.EncDecAttention.v
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_dropout
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_dropout.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_A.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_B
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_B.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_embedding_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_embedding_B
base_model.model.decoder.block.0.layer.1.EncDecAttention.o
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_dropout
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_dropout.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_A.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_B
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_B.default
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_embedding_A
base_model.model.decoder.block.0.layer.1.EncDecAttention.o.lora_embedding_B
base_model.model.decoder.block.0.layer.1.layer_norm
base_model.model.decoder.block.0.layer.1.dropout
base_model.model.decoder.block.0.layer.2
base_model.model.decoder.block.0.layer.2.DenseReluDense
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_dropout
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_dropout.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_A
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_A.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_B
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_B.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_embedding_A
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_0.lora_embedding_B
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_dropout
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_dropout.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_A
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_A.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_B
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_B.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_embedding_A
base_model.model.decoder.block.0.layer.2.DenseReluDense.wi_1.lora_embedding_B
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_dropout
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_dropout.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_A
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_A.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_B
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_B.default
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_embedding_A
base_model.model.decoder.block.0.layer.2.DenseReluDense.wo.lora_embedding_B
base_model.model.decoder.block.0.layer.2.DenseReluDense.dropout
base_model.model.decoder.block.0.layer.2.DenseReluDense.act
base_model.model.decoder.block.0.layer.2.layer_norm
base_model.model.decoder.block.0.layer.2.dropout
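A check like the one above can also be done programmatically. The helper below is a hypothetical sketch: it assumes a PEFT-wrapped model whose LoRA-adapted layers expose lora_A submodules (as in the listings above), and counts them per sub-stack so you can confirm both the encoder and the decoder were adapted:

```python
from collections import Counter

def count_lora_modules(model):
    """Count LoRA-adapted modules in the encoder vs. decoder of a
    PEFT-wrapped seq2seq model, by scanning for `lora_A` submodules."""
    counts = Counter()
    for name, _ in model.named_modules():
        if name.endswith("lora_A"):
            if ".encoder." in name:
                counts["encoder"] += 1
            elif ".decoder." in name:
                counts["decoder"] += 1
    return counts

# e.g., for a PEFT-wrapped model:
# counts = count_lora_modules(peft_model)
# assert counts["encoder"] > 0 and counts["decoder"] > 0
```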
Also, I am not quantizing the base model in 8-bit; instead, I am using bf16 precision for fine-tuning.
Additionally, the training dataset consists of instructions for ~20 tasks (a mixture of summarization, classification, paraphrasing, etc.).
I am wondering if the setup is sensitive to hyper-parameters. Specifically:
- Is there a rule of thumb one can use to determine what range of hyper-parameters generally works best for a given class of models?
- Is there any specific stopping criterion one should use while fine-tuning with LoRA? If full fine-tuning reaches performance X in N epochs, is LoRA also expected to reach similar performance in N epochs, or should one train longer when using LoRA?
Hello @digvijayingle016, have you also tried making the biases trainable via the bias config param ('all' or 'lora_only')? Also, try training longer with LoRA. In the LoRA paper, they train for 10-30 epochs on GLUE with RoBERTa-large, while in the RoBERTa paper the maximum number of epochs for GLUE was 10. So, indeed, training longer seems important with PEFT methods.
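A sketch of that bias setting, for concreteness. One caveat worth flagging: T5's linear projections are, to my knowledge, constructed without bias terms, so this switch may have little effect for flan-t5 specifically:

```python
from peft import LoraConfig

# Same adapter setup as before, but with bias parameters trainable too.
# 'lora_only' would restrict this to biases inside the adapted modules.
# Caveat: T5's linear layers are typically bias-free, so this may be
# close to a no-op for flan-t5.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="all",
    task_type="SEQ_2_SEQ_LM",
)
```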
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Closing the issue, feel free to re-open if you have more questions.