[BUG] LISA: same loss regardless of lisa_activated_layers
geronimi73 opened this issue · 17 comments
Describe the bug
I think there might be something wrong with the current LISA implementation. There is no difference in training loss, no matter how many layers are active.
I'm not using LMFlow but the HF `Trainer` with the `DynamicLayerActivationCallback` from https://github.com/OptimalScale/LMFlow/blob/main/src/lmflow/pipeline/finetuner.py
To Reproduce
```python
import numpy as np
from transformers import TrainerCallback


class DynamicLayerActivationCallback(TrainerCallback):
    def __init__(self, n_layers, interval_steps, model):
        super().__init__()
        self.n_layers = n_layers
        self.interval_steps = interval_steps
        self.model = model

        # Determine the way to access layers based on the model type
        if self.model.__class__.__name__ == 'LlamaForCausalLM':
            self.layers_attribute = 'model.model.layers'  # Layer access path for LlamaForCausalLM
        else:
            self.layers_attribute = 'model.transformer.h'  # General access path
        self.total_layers = len(eval('self.' + self.layers_attribute))  # Dynamically execute to get the number of layers

        # Freeze all layers upon initialization
        self.freeze_all_layers()
        self.active_layers_indices = []

    def freeze_all_layers(self):
        layers = eval('self.' + self.layers_attribute)  # Dynamically execute to get layers
        for layer in layers:
            for param in layer.parameters():
                param.requires_grad = False

    def on_step_begin(self, args, state, control, **kwargs):
        # Check if it's time to switch active layers, including at step 0
        if state.global_step % self.interval_steps == 0 or state.global_step == 1:
            self.switch_active_layers()

    def switch_active_layers(self):
        # First, disable gradients for all layers
        self.freeze_all_layers()

        # Randomly select n_layers to activate
        layers = eval('self.' + self.layers_attribute)  # Re-fetch layer references
        self.active_layers_indices = np.random.choice(range(self.total_layers), self.n_layers, replace=False)
        print(f"Activating layers at indices: {self.active_layers_indices} for the next steps.")

        # Enable gradients only for the selected layers
        for idx in self.active_layers_indices:
            for param in layers[idx].parameters():
                param.requires_grad = True


# Instantiate the callback
dynamic_layer_activation_callback = DynamicLayerActivationCallback(
    n_layers=lisa_activated_layers,      # Number of layers to activate
    interval_steps=lisa_interval_steps,  # Step interval to update active layers
    model=model,
)
trainer.add_callback(dynamic_layer_activation_callback)
```
Model: llama2-7b
Expected behavior
- different loss for different `lisa_activated_layers`
- same loss (and VRAM usage) for `lisa_activated_layers==32` as for a full finetune (without LISA); instead, the loss curves are different and diverge after a few steps
Setup
2x 3090
torch==2.2.1
transformers==4.39.2
Python 3.10.12
Thanks for your interest in LMFlow! You may change the first `self.freeze_all_layers()` in `__init__` to `self.switch_active_layers()` to avoid this problem. It may incur a slight increase in memory cost, for which we are working on a better implementation to reduce it further. Thanks 😄
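For reference, a sketch of that edit applied to the `__init__` of the `DynamicLayerActivationCallback` from the reproduction code above (everything else unchanged):

```python
# Sketch of the suggested change: end __init__ with switch_active_layers() instead of
# freeze_all_layers(), so n_layers random layers already have requires_grad=True when
# the Trainer later constructs its optimizer.
def __init__(self, n_layers, interval_steps, model):
    super().__init__()
    self.n_layers = n_layers
    self.interval_steps = interval_steps
    self.model = model
    if self.model.__class__.__name__ == 'LlamaForCausalLM':
        self.layers_attribute = 'model.model.layers'
    else:
        self.layers_attribute = 'model.transformer.h'
    self.total_layers = len(eval('self.' + self.layers_attribute))
    self.active_layers_indices = []
    self.switch_active_layers()  # was: self.freeze_all_layers()
```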
@research4pan why would that make a huge difference since it's only a difference during the first step?
That's a very good question. We conjectured that it's because `pytorch`, `transformers`, or `accelerate` compile the computation graph in a different way, where dynamically adding more activated layers might lead to strange behaviors. But we are still investigating and would appreciate any helpful feedback or better implementations regarding this issue.
The experiments in the paper were conducted differently, as T/K separate runs, so this problem does not affect the reported results.
@geronimi73, we just did some tests: in your script without `deepspeed`, removing the first `self.freeze_all_layers()` in `__init__` can be helpful. It seems `deepspeed` introduces some extra overhead during initialization, which we are currently investigating.
> You may change the first `self.freeze_all_layers()` in `__init__` to `self.switch_active_layers()` to avoid this problem

I tried this and the loss curve was still exactly the same for me.
> removing the first `self.freeze_all_layers()` in `__init__` can be helpful

This works; I finally have a distinct loss curve after removing any call to freeze from `__init__`.
> But we are still investigating that and would appreciate any helpful feedback or better implementations regarding this issue.
It's the optimizer, I think. When you call `train()`, the optimizer is instantiated (code) with a dict of parameter groups. `optimizer_grouped_parameters` contains only those parameters which have `requires_grad=True` at the time you start training (code). If you now freeze and unfreeze layers (by changing `requires_grad`) during training, it does not change the optimizer's behaviour, because it is still working on the same parameters you passed when training started.

This explains why the suggested change of `self.freeze_all_layers()` in `__init__` to `self.switch_active_layers()` works. With `freeze_all_layers()` in `__init__` you will train the embeddings and `lm_head` only. If you change `freeze` to `switch`, you will have `n_layers` active, but they will be static: no matter what other layers you unfreeze during training, the optimizer will only work on the initial `n_layers` that had `requires_grad=True` when you started training. The loss will still be different for each run, because the initial `n_layers` are chosen at random.
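A minimal sketch (plain `torch`, not the Trainer code) of the effect described above: parameters frozen when the optimizer is created are never picked up later, even if their `requires_grad` is flipped back on.

```python
import torch

# Two "layers"; layer_b is frozen before the optimizer exists, as freeze_all_layers() does.
layer_a = torch.nn.Linear(4, 4)
layer_b = torch.nn.Linear(4, 4)
for p in layer_b.parameters():
    p.requires_grad = False

# The Trainer builds its parameter groups once, from params with requires_grad=True.
trainable = [p for p in list(layer_a.parameters()) + list(layer_b.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# "Switch" layers later, the way the callback does during training.
for p in layer_a.parameters():
    p.requires_grad = False
for p in layer_b.parameters():
    p.requires_grad = True

loss = (layer_a(torch.randn(2, 4)) + layer_b(torch.randn(2, 4))).sum()
loss.backward()
optimizer.step()

# layer_b now receives gradients but is not in optimizer.param_groups, so it is never
# updated; layer_a is in the groups but no longer gets gradients. The set of updated
# parameters is fixed at optimizer-creation time.
print(sum(len(g["params"]) for g in optimizer.param_groups))        # 2 tensors, both from layer_a
print(any(p.grad is not None for p in layer_b.parameters()))        # True: grads exist...
print(all(p not in optimizer.state for p in layer_b.parameters()))  # ...but no optimizer state
```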
Hi @geronimi73 I think you are right.
I also think the main reason is the optimizer. Directly using `requires_grad` to dynamically select layers cannot achieve dynamic training. I'm also curious about the state in the optimizer, since this algorithm continuously changes the activated layers, and about how to update the momentum: just reinitialize the momentum of each layer when it is activated?
> Hi @geronimi73 I think you are right.
> I also think the main reason is the optimizer. Directly using `requires_grad` to dynamically select layers cannot achieve dynamic training. I'm also curious about the state in the optimizer, since this algorithm continuously changes the activated layers, and about how to update the momentum: just reinitialize the momentum of each layer when it is activated?
I am wondering: if we update the `optimizer` in `on_step_begin`, will it work? Like below:

```python
def on_step_begin(self, args, state, control, **kwargs):
    if state.global_step % self.step_interval == 0:
        self.switch_active_layers()
        kwargs['optimizer'] = self.create_optimizer(args)
```

UPDATE: checked, it failed. Hoping to find other solutions, especially: how to pass the trainer to the callback? Like Lightning-AI/pytorch-lightning#3095 (comment)
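One common workaround (a sketch, not from the thread; `training_args` and `train_dataset` are placeholder names) is to attach the trainer to the callback after both objects exist, so callback hooks can reach the trainer directly:

```python
from transformers import Trainer

# Hypothetical wiring: create the callback, create the trainer with it, then give the
# callback a back-reference so e.g. on_step_begin can call methods on the trainer.
callback = DynamicLayerActivationCallback(n_layers=2, interval_steps=20, model=model)
trainer = Trainer(
    model=model,
    args=training_args,           # placeholder TrainingArguments
    train_dataset=train_dataset,  # placeholder dataset
    callbacks=[callback],
)
callback.trainer = trainer  # the callback can now reach trainer.create_optimizer(), etc.
```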
Hello,
Just stumbled into this while trying to figure out how I could change the optimizer / scheduler in the middle of training. I have managed to do that by calling the method `setup_optimizers()` from `trainer.strategy`. Maybe it can work for you? Concretely, what I came up with was:

```python
import torch
from lightning.pytorch.callbacks import Callback


class MyCallback(Callback):
    def on_train_start(self, trainer: "L.Trainer", pl_module: "TabularModule") -> None:  # or on_SOMETHING
        # change something on my pl_module configure_optimizers method, in my case I use the attribute
        # self.optimizer_fn to create the optimizer
        pl_module.optimizer_fn = torch.optim.Adam
        # setup the optimizer, this will call configure_optimizers()
        # and attach the optimizers/schedulers to the trainer
        trainer.strategy.setup_optimizers(trainer)
```

I do not know if this could mess up the trainer in some mysterious ways, but it seems to work for me.
@BrunoBelucci yes, that would be a solution, but I'm not sure if trashing the optimizer state every few steps is a good idea. Probably not.
> Hello, Just stumbled into this while trying to figure out how I could change the optimizer / scheduler in the middle of training. I have managed to do that by calling the method `setup_optimizers()` from `trainer.strategy`. Maybe it can work for you? Concretely, what I came up with was:
>
> ```python
> class MyCallback(Callback):
>     def on_train_start(self, trainer: "L.Trainer", pl_module: "TabularModule") -> None:
>         pl_module.optimizer_fn = torch.optim.Adam
>         trainer.strategy.setup_optimizers(trainer)
> ```
>
> I do not know if this could mess up the trainer in some mysterious ways, but it seems to work for me.

This answer is the same as what I suggested in #726 (comment). Maybe that is a solution, but it needs the pytorch-lightning package installed. Needs further testing.
Thanks for the fruitful discussion! We discovered a different way to let the trainer recreate its optimizer (i.e. `trainer.create_optimizer()` or `trainer.optimizer, _ = deepspeed_init()`) during training. It seems to work without `deepspeed` (so it works on a single GPU), but model parallelism with `deepspeed` is still not working. It looks like `deepspeed` ignores this reinitialization, which we are currently investigating.
> We discovered a different way to let the trainer recreate its optimizer (i.e. `trainer.create_optimizer()`

Are you sure about this? It seems `create_optimizer()` will not do anything on a second invocation?
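For context, a paraphrased sketch (my reading of the `transformers` `Trainer` source, not verbatim) of why a second call is a no-op:

```python
# Paraphrased sketch of transformers.Trainer.create_optimizer (not the verbatim source):
# the body that builds the parameter groups only runs while self.optimizer is still None.
def create_optimizer(self):
    if self.optimizer is None:
        # build optimizer_grouped_parameters from parameters with requires_grad=True,
        # then instantiate the optimizer class selected by the TrainingArguments
        ...
    return self.optimizer  # on a second call, the existing optimizer is returned unchanged
```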
I came up with the code below. The `id` of the `optimizer` changes when `on_train_epoch_start` is called. Downsides: it still needs the `lightning` package installed and can only be done in a separate project, something like https://lightning.ai/lightning-ai/studios/code-lora-from-scratch

```python
def on_train_epoch_start(self, trainer: "L.Trainer", pl_module: "pl.LightningModule"):
    if trainer.current_epoch % self.epoch_interval == 0:
        self.switch_active_layers()
        pl_module.optimizer_fn = torch.optim.Adam
        trainer.strategy.setup_optimizers(trainer)
```
> > We discovered a different way to let the trainer recreate its optimizer (i.e. `trainer.create_optimizer()`
>
> Are you sure about this? It seems `create_optimizer()` will not do anything on a second invocation?

Yes. In our upcoming implementation in the next version, we also override that function by inheriting from `Trainer`, so this reinitialization of the optimizer becomes possible.
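A rough guess (not the actual LMFlow implementation) at what such an override could look like; whether the recreated optimizer is actually picked up by the training loop, especially under `accelerate`/`deepspeed`, is exactly the open question in this thread:

```python
from transformers import Trainer


class LisaTrainer(Trainer):  # hypothetical name
    def create_optimizer(self):
        # Unlike the base class, always rebuild, so that a callback calling
        # trainer.create_optimizer() after switching layers gets a fresh optimizer
        # over the parameters that currently have requires_grad=True.
        self.optimizer = None
        return super().create_optimizer()
```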
> Yes. In our upcoming implementation in the next version, we also override that function by inheriting from `Trainer`, so this reinitialization of the optimizer becomes possible.

I tried it like this:

```python
self.trainer.optimizer = None
self.trainer.create_optimizer()
```

but when I train, the loss never decreases.
Another idea is to keep the same optimizer but reset its internal state. This is similar to the ReLoRA technique: https://github.com/OpenAccess-AI-Collective/axolotl/pull/1414/files
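A minimal sketch of that idea (my own illustration, assuming plain `torch.optim` and a single parameter group; this is not the axolotl/ReLoRA code):

```python
import torch


def swap_optimizer_params(optimizer, retired_params, new_params):
    """Keep the optimizer object, but retarget it at the newly activated parameters
    and drop the momentum/variance state of the parameters being retired."""
    for p in retired_params:
        optimizer.state.pop(p, None)                         # forget stale AdamW state
    optimizer.param_groups[0]["params"] = list(new_params)   # step only the new layers' params
```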