OptimalScale/LMFlow

[BUG] LISA: same loss regardless of lisa_activated_layers

geronimi73 opened this issue · 17 comments

Describe the bug
I think there might be something wrong with the current LISA implementation. There is no difference in training loss, no matter how many layers are active.

Not using LMFlow but HF Trainer with DynamicLayerActivationCallback from https://github.com/OptimalScale/LMFlow/blob/main/src/lmflow/pipeline/finetuner.py

To Reproduce

import numpy as np
from transformers import TrainerCallback

class DynamicLayerActivationCallback(TrainerCallback):
    def __init__(self, n_layers, interval_steps, model):
        super().__init__()
        self.n_layers = n_layers
        self.interval_steps = interval_steps
        self.model = model
        # Determine the way to access layers based on the model type
        if self.model.__class__.__name__ == 'LlamaForCausalLM':
            self.layers_attribute = 'model.model.layers'  # Layer access path for LlamaForCausalLM
        else:
            self.layers_attribute = 'model.transformer.h'  # General access path
        self.total_layers = len(eval('self.' + self.layers_attribute))  # Dynamically execute to get the number of layers

        # Freeze all layers upon initialization
        self.freeze_all_layers()
        self.active_layers_indices = []

    def freeze_all_layers(self):
        layers = eval('self.' + self.layers_attribute)  # Dynamically execute to get layers
        for layer in layers:
            for param in layer.parameters():
                param.requires_grad = False

    def on_step_begin(self, args, state, control, **kwargs):
        # Check if it's time to switch active layers, including at step 0
        if state.global_step % self.interval_steps == 0 or state.global_step == 1:
            self.switch_active_layers()

    def switch_active_layers(self):
        # First, disable gradients for all layers
        self.freeze_all_layers()

        # Randomly select n_layers to activate
        layers = eval('self.' + self.layers_attribute)  # Re-fetch layer references
        self.active_layers_indices = np.random.choice(range(self.total_layers), self.n_layers, replace=False)
        print(f"Activating layers at indices: {self.active_layers_indices} for the next steps.")

        # Enable gradients only for the selected layers
        for idx in self.active_layers_indices:
            for param in layers[idx].parameters():
                param.requires_grad = True

# Instantiate the callback
dynamic_layer_activation_callback = DynamicLayerActivationCallback(
    n_layers = lisa_activated_layers,                     # Number of layers to activate
    interval_steps = lisa_interval_steps,               # Step interval to update active layers
    model = model
)

trainer.add_callback(dynamic_layer_activation_callback)

model llama2-7b

Expected behavior

  • different loss for different lisa_activated_layers
  • same loss (and VRAM usage) for lisa_activated_layers==32 and full finetune (without LISA). Instead, the loss curves are different: they diverge after a few steps
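A quick way to sanity-check the second expectation is to count trainable parameters. The sketch below uses a hypothetical toy model (ToyLM) and hypothetical helpers (count_trainable, activate_layers), not the LMFlow code. It only assumes that, as in the callback above, the embeddings and the head stay trainable while the layer stack is toggled via requires_grad:

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    """Number of parameter elements that currently have requires_grad=True."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def activate_layers(layers: nn.ModuleList, active: set) -> None:
    """Enable gradients only for the layers whose index is in `active`."""
    for idx, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = idx in active

class ToyLM(nn.Module):
    """Hypothetical stand-in for a decoder: embeddings, a layer stack, a head."""
    def __init__(self, vocab=16, dim=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, vocab)

toy = ToyLM()
total = sum(p.numel() for p in toy.parameters())

activate_layers(toy.layers, set())                         # all layers frozen
frozen_count = count_trainable(toy)                        # embeddings + head only

activate_layers(toy.layers, set(range(len(toy.layers))))   # every layer active
full_count = count_trainable(toy)                          # should match full finetune
```

With every layer active, the trainable count equals the total parameter count, which is why the run with lisa_activated_layers==32 would be expected to behave like a full finetune.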

Screenshots
[Screenshot: W&B chart, 31_03_2024, 07_01_16]

[Screenshot: W&B chart, 31_03_2024, 07_02_45]

Setup
2x 3090

torch==2.2.1
transformers==4.39.2
Python 3.10.12

Thanks for your interest in LMFlow! You may change the first self.freeze_all_layers() in __init__ to self.switch_active_layers() to avoid this problem. This may slightly increase memory cost; we are working on a better implementation to reduce it further. Thanks 😄

@research4pan why would that make a huge difference since it's only a difference during the first step?

That's a very good question. Our conjecture is that pytorch, transformers or accelerate compile the computation graph differently in this case, so that dynamically adding more activated layers might lead to strange behaviors. But we are still investigating that and would appreciate any helpful feedback or better implementations regarding this issue.

The experiments in the paper were conducted differently, as T/K separate runs, so this problem does not affect the reported results.

@geronimi73, we just ran some tests: in your script without deepspeed, removing the first self.freeze_all_layers() in __init__ can be helpful. It seems deepspeed introduces some extra overhead during initialization, which we are currently investigating.

tmm1 commented

You may change the first self.freeze_all_layers() in __init__ to self.switch_active_layers() to avoid this problem

I tried this and the loss-curve was still exactly the same for me.

tmm1 commented

removing the first self.freeze_all_layers() in __init__ can be helpful

this works, i finally have a distinct loss curve after removing any call to freeze from init

But we are still investigating that and would appreciate any helpful feedback or better implementations regarding this issue.

It's the optimizer, I think. When you call train(), the optimizer is instantiated with a list of parameter groups. optimizer_grouped_parameters contains only those parameters that have requires_grad set at the time you start training. If you then freeze and unfreeze layers (by changing requires_grad) during training, the optimizer's behaviour does not change, because it is still working on the same parameters you passed when training started.

This explains why the suggested change of self.freeze_all_layers() in __init__ to self.switch_active_layers() works. With freeze_all_layers() in __init__ you train the embeddings and lm_head only. If you change freeze to switch, you will have n_layers active, but they will be static: no matter which other layers you unfreeze during training, the optimizer will only work on the initial n_layers that had requires_grad=True when you started training. The loss will still differ between runs because the initial n_layers are chosen randomly.
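The effect described here can be reproduced in plain PyTorch, without HF Trainer. This is a minimal sketch under assumed names (layer_a, layer_b, a two-layer toy model, SGD instead of the Trainer's default optimizer). It builds the optimizer from the parameters that require gradients at construction time, then switches the active layer the way the callback does:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_a, layer_b = nn.Linear(4, 4), nn.Linear(4, 1)
model = nn.Sequential(layer_a, layer_b)

# layer_b is frozen when the optimizer is built, so only layer_a is registered.
for p in layer_b.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1
)

# "Switch active layers" the way the callback does: flip requires_grad only.
for p in layer_a.parameters():
    p.requires_grad = False
for p in layer_b.parameters():
    p.requires_grad = True

before = [p.detach().clone() for p in model.parameters()]
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Which parameters does the optimizer actually know about?
tracked = {id(p) for group in optimizer.param_groups for p in group["params"]}
b_has_grad = all(p.grad is not None for p in layer_b.parameters())
b_tracked = any(id(p) in tracked for p in layer_b.parameters())
nothing_moved = all(
    torch.equal(old, p.detach()) for old, p in zip(before, model.parameters())
)
```

In this toy setup layer_b receives gradients but is not in optimizer.param_groups, and layer_a is registered but has no gradient, so the step changes nothing at all.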

Hi @geronimi73 I think you are right.

I also think the main cause is the optimizer. Directly toggling requires_grad to select layers dynamically does not achieve dynamic training. I'm also curious about the optimizer state: since this algorithm continuously changes the activated layers, how should the momentum be updated? Just reinitialize the momentum of each layer when the layer is activated?

I am wondering whether it would work if we update the optimizer in on_step_begin, like below:

def on_step_begin(self, args, state, control, **kwargs):
    if state.global_step % self.step_interval == 0:
        self.switch_active_layers()        
        kwargs['optimizer'] = self.create_optimizer(args)

UPDATE: I checked and this failed. Hoping to find other solutions, especially a way to pass the trainer to the callback, like Lightning-AI/pytorch-lightning#3095 (comment)

Hello,
Just stumbled into this while trying to figure out how I could change the optimizer / scheduler in the middle of training. I have managed to do that by calling the method setup_optimizers() from trainer.strategy. Maybe it can work for you? Concretely, what I came up with was:

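import torch
from lightning.pytorch.callbacks import Callback

class MyCallback(Callback):
    # TabularModule is the commenter's own LightningModule subclass; the annotation is quoted so the snippet imports cleanly
    def on_train_start(self, trainer: "L.Trainer", pl_module: "TabularModule") -> None:  # or on_SOMETHING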
        # change something on my pl_module configure_optimizers method, in my case I use the attribute 
        # self.optimizer_fn to create the optimizer
        pl_module.optimizer_fn = torch.optim.Adam
        # setup the optimizer, this will call configure_optimizers()
        # and attach the optimizers/schedulers to the trainer
        trainer.strategy.setup_optimizers(trainer)

I do not know if this could mess up the trainer in some mysterious ways, but it seems to work for me.

@BrunoBelucci yes that would be a solution but i'm not sure if trashing the optimizer state every few steps is a good idea. probably not.

@BrunoBelucci's answer is the same as what I suggested in #726 (comment). It may be a solution, but it requires the pytorch-lightning package to be installed and needs further testing.

Thanks for the fruitful discussion! We discovered a different way to let trainer recreate its optimizer (i.e. trainer.create_optimizer() or trainer.optimizer, _ = deepspeed_init()) during training. It seems to work without deepspeed (so on a single GPU), but model parallelism with deepspeed still does not work. It looks like deepspeed ignores this reinitialization; we are currently investigating that.
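The idea of recreating the optimizer after each switch can be sketched in plain PyTorch. This is not the LMFlow or HF Trainer implementation: rebuild_optimizer, set_active and train_step are hypothetical helpers, and the two-layer model is a toy stand-in.

```python
import torch
import torch.nn as nn

def rebuild_optimizer(model: nn.Module, lr: float = 0.1) -> torch.optim.Optimizer:
    """Hypothetical helper: fresh AdamW over the currently trainable parameters."""
    return torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 4)])

def set_active(active_idx: int) -> None:
    """Enable gradients for one layer and freeze the other."""
    for idx, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = idx == active_idx

def train_step(optimizer: torch.optim.Optimizer) -> None:
    x = torch.randn(8, 4)
    for layer in layers:
        x = layer(x)
    loss = x.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

set_active(0)
optimizer = rebuild_optimizer(layers)   # optimizer sees layer 0 only

set_active(1)
optimizer = rebuild_optimizer(layers)   # rebuilt after the switch: sees layer 1 only

w0_before = layers[0].weight.detach().clone()
w1_before = layers[1].weight.detach().clone()
train_step(optimizer)
layer1_moved = not torch.equal(w1_before, layers[1].weight.detach())
layer0_fixed = torch.equal(w0_before, layers[0].weight.detach())
```

Rebuilding this way discards the AdamW moment estimates at every switch. In addition, a learning-rate scheduler constructed around the old optimizer keeps referencing the old object, so it would need to be re-attached as well.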

tmm1 commented

We discovered a different way to let trainer recreate its optimizer (i.e. trainer.create_optimizer()

are you sure about this? it seems create_optimizer() will not do anything on second invocation?

https://github.com/huggingface/transformers/blob/76fa17c1663a0efeca7208c20579833365584889/src/transformers/trainer.py#L1009

I came up with the code below. The id of the optimizer changes when on_train_epoch_start is called. Downsides: it still needs the lightning package installed and can only be performed in a separate project, something like https://lightning.ai/lightning-ai/studios/code-lora-from-scratch

def on_train_epoch_start(self, trainer: "L.Trainer", pl_module: "pl.LightningModule"):
    if trainer.current_epoch % self.epoch_interval == 0:
        self.switch_active_layers()
        pl_module.optimizer_fn = torch.optim.Adam
        trainer.strategy.setup_optimizers(trainer)

We discovered a different way to let trainer recreate its optimizer (i.e. trainer.create_optimizer()

are you sure about this? it seems create_optimizer() will not do anything on second invocation?

https://github.com/huggingface/transformers/blob/76fa17c1663a0efeca7208c20579833365584889/src/transformers/trainer.py#L1009

Yes. In the implementation coming in our next version, we also override this function by subclassing Trainer, so that this reinitialization of the optimizer becomes possible.

tmm1 commented

Yes. In the implementation coming in our next version, we also override this function by subclassing Trainer, so that this reinitialization of the optimizer becomes possible.

I tried it like this:

self.trainer.optimizer = None
self.trainer.create_optimizer()

but when I tried training, the loss never decreased.

Another idea is to keep the same optimizer, but reset its internal state. This is similar to the ReLoRA technique: https://github.com/OpenAccess-AI-Collective/axolotl/pull/1414/files
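A minimal sketch of the "keep the optimizer, reset its state" idea in plain PyTorch follows. It is not the axolotl/ReLoRA code: reset_optimizer_state is a hypothetical helper, and the sketch assumes the parameters in question are already registered with the optimizer. Dropping a parameter's entry from optimizer.state makes AdamW re-initialize its moments and step counter the next time that parameter is updated:

```python
import torch
import torch.nn as nn

def reset_optimizer_state(optimizer: torch.optim.Optimizer, params) -> None:
    """Hypothetical helper: forget the per-parameter state (e.g. AdamW moments)."""
    for p in params:
        optimizer.state.pop(p, None)

torch.manual_seed(0)
layer = nn.Linear(4, 4)
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-3)

def step() -> None:
    optimizer.zero_grad()
    layer(torch.randn(8, 4)).pow(2).mean().backward()
    optimizer.step()

step()
step()
steps_before = float(optimizer.state[layer.weight]["step"])   # two updates so far

reset_optimizer_state(optimizer, layer.parameters())
cleared = layer.weight not in optimizer.state                  # state entry is gone

step()
steps_after = float(optimizer.state[layer.weight]["step"])    # counter restarted
```

On its own this does not address which parameters the optimizer tracks; it only covers the momentum question raised earlier in the thread.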