OptimalScale/LMFlow

Weird Loss with LISA

Hi,

Some background:
I am fine-tuning Mistral 7B with the HF Trainer on domain-specific data.
The task is causal language modeling, i.e. next-word prediction.
I prepare the data with the data collator for causal LM, using a context size of 1000 tokens per data point. The dataset has 9k data points in total, of which 5-10% is Wiki data mixed in with the domain data to avoid catastrophic forgetting.
The test data is a subset of the training data, so that the model learns that specific data.

I am using the DynamicLayerActivationCallback from LMFlow as a training callback in my trainer.
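
For context, my understanding of what the callback does can be sketched roughly as below. This is a framework-free toy, not LMFlow's actual code: `ToyLayer`, `ToyLayerSwitcher` and `on_step_begin` are hypothetical names, and the `trainable` flag only stands in for a parameter's `requires_grad`. As far as I understand LISA, the embedding and output head stay trainable throughout; the toy leaves them out.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyLayer:
    """Stand-in for one decoder layer; `trainable` mimics requires_grad."""
    trainable: bool = False


@dataclass
class ToyLayerSwitcher:
    """Hypothetical sketch of LISA-style layer switching (not LMFlow's class)."""
    layers: list
    n_active: int          # plays the role of lisa_activated_layers
    interval_steps: int    # plays the role of lisa_interval_steps
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    active: list = field(default_factory=list)

    def on_step_begin(self, global_step: int) -> None:
        # Resample at step 0 and then once every `interval_steps` steps.
        if global_step % self.interval_steps != 0:
            return
        # Freeze everything, then unfreeze a fresh random subset.
        for layer in self.layers:
            layer.trainable = False
        self.active = sorted(
            self.rng.sample(range(len(self.layers)), self.n_active)
        )
        for i in self.active:
            self.layers[i].trainable = True


# Mistral 7B has 32 decoder layers; the settings mirror my runs.
layers = [ToyLayer() for _ in range(32)]
switcher = ToyLayerSwitcher(layers, n_active=2, interval_steps=50)

history = []
for step in range(200):
    switcher.on_step_begin(step)
    history.append(tuple(switcher.active))
```

With these settings the same two layers are trained for 50 consecutive steps before the next pair is drawn, so only four pairs are sampled in the first 200 steps.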

I ran multiple experiments:

  • lisa_activated_layers = 2, lisa_interval_steps = 50, 8 epochs
  • lisa_activated_layers = 2, lisa_interval_steps = 50, 10 epochs

In both runs the loss starts around 8 and drops to around 5-6, but then it plateaus and does not go below 5.

I find this a little strange. Maybe I need further experiments on:

  • changing lisa_activated_layers
  • changing lisa_interval_steps (I think this could be an important factor too)

I would also like to know the ideal or recommended hyperparameters for this type of fine-tuning with around 10K data points.

Thanks in advance.

Thanks for your interest in LMFlow! We have fixed several bugs in the LISA implementation in LMFlow, so it would be good to check whether your implementation matches our latest version.

If the implementation is correct, it is worth trying:

  • A smaller lisa_interval_steps, such as 5 instead of 50, since more frequent sampling allows more layers to be covered.
  • If that doesn't work, you may try a larger lisa_activated_layers. We have observed that in some cases, such as llama-2-70b, a deeper architecture requires a larger lisa_activated_layers. This may also be the case when the data distribution is hard to learn.
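
To make the two suggestions concrete, here is a rough back-of-the-envelope sketch. It assumes layers are sampled uniformly without replacement at every interval, 32 decoder layers (as in Mistral 7B), and a purely hypothetical budget of 1000 optimizer steps; `lisa_layer_budget` is an illustrative helper, not an LMFlow function.

```python
import math


def lisa_layer_budget(total_steps, interval_steps, n_active, n_layers=32):
    """Rough per-layer statistics under uniform sampling (an assumption)."""
    # Number of times a new set of layers is drawn during training.
    rounds = math.ceil(total_steps / interval_steps)
    # Chance that a given layer is in the active set for one round.
    p_pick = n_active / n_layers
    # Expected number of steps during which a given layer is trainable.
    expected_steps_per_layer = total_steps * p_pick
    # Chance that a given layer is never selected in any round.
    prob_never_trained = (1 - p_pick) ** rounds
    return rounds, expected_steps_per_layer, prob_never_trained


TOTAL_STEPS = 1000  # hypothetical; substitute your own step count

for interval, k in [(50, 2), (5, 2), (50, 4)]:
    rounds, steps, p_never = lisa_layer_budget(TOTAL_STEPS, interval, k)
    print(
        f"interval={interval:>2} layers={k}: rounds={rounds:>3}, "
        f"expected steps/layer={steps:.1f}, P(never trained)={p_never:.4f}"
    )
```

Under these assumptions, lowering the interval from 50 to 5 leaves the expected training steps per layer unchanged (62.5) but cuts the chance that a given layer is never touched from roughly 27% to practically zero. Raising lisa_activated_layers from 2 to 4 doubles the per-layer budget instead, at the cost of more memory per step.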

Hope this information is helpful. Please feel free to let us know if you run into further problems 😄