model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora
zetavg opened this issue · 15 comments
I recently found that when fine-tuning with alpaca-lora, model.save_pretrained() saves an adapter_model.bin that is only 443 B. This seems to have started after peft@75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495.
Normally adapter_model.bin should be > 16 MB. And when the 443 B adapter_model.bin is loaded, the model behaves as if it had not been fine-tuned at all. In contrast, loading other checkpoints from the same training run works as expected.
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr 9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr 9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu 350 Apr 9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu 443 Apr 9 12:55 adapter_model.bin
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr 9 12:06 checkpoint-400
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr 9 12:06 checkpoint-600
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr 9 12:07 checkpoint-800
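As a rough sanity check of why ~16 MB is the expected size (this assumes alpaca-lora's default LoRA config, r=8 with q_proj and v_proj as target modules, which is my assumption rather than something stated in this issue):

# Back-of-the-envelope estimate of the adapter size for LLaMA-7B
# (32 layers, hidden size 4096), assuming r=8 and 2 target modules per layer.
r, hidden, layers, targets = 8, 4096, 32, 2
lora_params = layers * targets * 2 * r * hidden   # lora_A + lora_B per module
print(lora_params)                                # 4,194,304 parameters
print(lora_params * 4 / 2**20, "MiB")             # ~16 MiB in fp32, vs. the 443 B file above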
I'm not sure whether this is an issue with peft or a duplicate of other issues, but I'm leaving this here for reference.
I've been testing with multiple versions of peft:
072da6d9d62: works
382b178911edff38c1ff619bbac2ba556bd2276b: works
75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495: not working
445940fb7b5d38390ffb6707e2a989e89fff03b5: not working
1a6151b91fcdcc25326b9807d7dbf54e091d506c: not working
1117d4772109a098787ce7fc297cb6cd641de6eb: not working
Steps to reproduce:
conda create python=3.8 -n test
conda activate test
git clone https://github.com/tloen/alpaca-lora.git
cd alpaca-lora
pip install -r requirements.txt
# to work around: AttributeError: bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
cd /home/ubuntu/miniconda3/envs/test/lib/python3.8/site-packages/bitsandbytes/
mv libbitsandbytes_cpu.so libbitsandbytes_cpu.so.bak
cp libbitsandbytes_cuda121.so libbitsandbytes_cpu.so
cd -
conda install cudatoolkit
# alpaca_data_cleaned_first_100.json is alpaca_data_cleaned.json with only the first 100 items (a small script for producing it is shown after these steps); setting --val_set_size 0 because there isn't enough data to build a test set
python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/data/datasets/alpaca_data_cleaned_first_100.json' --output_dir './lora-alpaca' --val_set_size 0
$ ls -alh lora-alpaca
total 16K
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr 9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr 9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu 350 Apr 9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu 443 Apr 9 12:55 adapter_model.bin
(adapter_model.bin should normally be around 16 MB)
Running on Lambda Cloud A10 instance.
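For reference, the truncated dataset file used above can be produced with a few lines (a sketch; it simply takes the first 100 items of the original alpaca_data_cleaned.json):

# Create alpaca_data_cleaned_first_100.json from the full cleaned dataset.
import json

with open("alpaca_data_cleaned.json") as f:
    data = json.load(f)

with open("alpaca_data_cleaned_first_100.json", "w") as f:
    json.dump(data[:100], f, indent=2)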
The issue is with these lines of code. They mess with the model's state_dict, so the second time it's called from the save_pretrained() method it returns None. As I understand it, one no longer has to touch the state_dict outside of the library internals. Try removing them and see if the model is saved normally:
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
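A quick way to check what actually ended up in the saved file (a diagnostic sketch; the path matches the repro above) is to load it and list the tensors:

import torch

# Inspect the saved adapter: a healthy file lists lora_A/lora_B tensors,
# while the corrupted 443 B file contains essentially nothing.
sd = torch.load("lora-alpaca/adapter_model.bin", map_location="cpu")
print(len(sd), "entries")
for name, tensor in sd.items():
    print(name, tuple(tensor.shape))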
I confirmed that removing those lines fixes the issue in alpaca-lora. It is probably safe to close this issue, as the cause seems to be in alpaca-lora, not here?
Thanks @s4rduk4r for suggesting removing the lines related to model.state_dict. I haven't confirmed it myself, but given @richardklafter's confirmation, and since I found that the author of alpaca-lora has also suggested removing those lines of code to fix another issue, I agree that we can close this and move the discussion to why those lines were added in the first place.
Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But in each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.
Same issue as this comment: https://github.com/tloen/alpaca-lora/issues/319#issuecomment-1505313341
Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb
import os

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        # Save only the PEFT adapter weights for this checkpoint.
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        # Remove the full model weights written by the default save logic.
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
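For completeness, an adapter saved by such a callback can later be reattached to the base model (a minimal sketch; the checkpoint path is just an example of the layout the callback produces, not taken from this thread):

from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the frozen base model, then attach the saved LoRA adapter.
base_model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(base_model, "./lora-alpaca/checkpoint-800/adapter_model")
model.eval()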
@pacman100 I used the same way to save PEFT models, which works well for the LLaMA-7B model. However, when it comes to Bloomz-7b-mt, the adapter_model.bin is corrupted (only 18 KB). The only difference between Bloom and LLaMA is that I observed a warning when training Bloom with LoRA, as below:
2023-04-18 18:38:26.377 /usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
I used torch==1.13.1 and DeepSpeed ZeRO-3. Not sure if this is the reason.
BTW, I modified the built-in run_clm.py to support LoRA rather than using alpaca-lora, so it should not be a consequence of the following lines:
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
With r=8 and alpha=16, I can save the LoRA weights of LLaMA-7B successfully. However, when increasing r to 32 and alpha to 64, we obtained an empty adapter_model.bin. This is really weird.
@wxjiao, have you been able to solve it? It looks like a large adapter leads to an empty adapter_model.bin when saved. I ran into this when using LoRA + ZeRO-3 for 30B and 65B; the same code works fine for 7B.
@justinphan3110 No, I just gave up on it; it took too much time to debug. At first I thought there was something wrong with get_peft_model_state_dict, but that cannot explain the success of saving the LLaMA-7B LoRA. I printed the first elements of the state_dict for both the base model and the LoRA in my training script, and found the keys were there but the values were missing (i.e., only []). I guess there is some incompatibility between PEFT and ZeRO-3. I'll just wait.
I have just tried LLaMA-30B + LoRA but still get an empty saved adapter_model. Let me know if it works for you or if you get any new insight from it.
Have you been able to figure it out @wxjiao? I can try opening an issue.
I have the same problem (the values in adapter_model.bin are []) when using PeftModel with ZeRO-3 after deepspeed.initialize.
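One possible angle for the ZeRO-3 cases (an assumption on my side, not something confirmed in this thread): under ZeRO-3 the parameters are partitioned across ranks, so state_dict() on a single rank can contain empty shards unless the weights are gathered before saving. A hedged sketch of gathering them first, where model and output_dir are the PEFT model and output directory from the training script:

import deepspeed
import torch.distributed as dist

# Gather the ZeRO-3-partitioned parameters for read-only access, then save
# the adapter from rank 0 only. This is a sketch of the idea, not a verified fix.
with deepspeed.zero.GatheredParameters(list(model.parameters())):
    if dist.get_rank() == 0:
        model.save_pretrained(output_dir)

Setting stage3_gather_16bit_weights_on_model_save: true in the DeepSpeed config may also be relevant when saving through Trainer, though I have not verified it for this case.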
I don't know if this is the RIGHT way, but this simple modification at L275 gives me a functional adapter_model.bin with the correct size:
- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())
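Presumably this works because save_pretrained() then applies get_peft_model_state_dict() to the full, unpatched state dict exactly once, instead of re-filtering the already-filtered dict returned by the patched model.state_dict; that is my reading of it rather than something confirmed here.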
Thanks man @s4rduk4r !!! You saved my day.