model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora
zetavg opened this issue · 15 comments
I recently found that when fine-tuning with alpaca-lora, model.save_pretrained() saves an adapter_model.bin that is only 443 B. This seems to have started after peft@75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495.
Normally adapter_model.bin should be > 16 MB. And when the 443 B adapter_model.bin is loaded, the model behaves as if it had not been fine-tuned at all. In contrast, loading other checkpoints from the same training run works as expected.
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr 9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr 9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu 350 Apr 9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu 443 Apr 9 12:55 adapter_model.bin
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr 9 12:06 checkpoint-400
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr 9 12:06 checkpoint-600
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr 9 12:07 checkpoint-800
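As a rough sanity check of why ~16 MB is the expected size (this assumes alpaca-lora's default LoRA config, r=8 with q_proj and v_proj as target modules, which is my assumption rather than something stated in this issue):

# Back-of-the-envelope estimate of the adapter size for LLaMA-7B
# (32 layers, hidden size 4096), assuming r=8 and 2 target modules per layer.
r, hidden, layers, targets = 8, 4096, 32, 2
lora_params = layers * targets * 2 * r * hidden   # lora_A + lora_B per module
print(lora_params)                                # 4,194,304 parameters
print(lora_params * 4 / 2**20, "MiB")             # ~16 MiB in fp32, vs. the 443 B file above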
I'm not sure whether this is an issue with peft or a duplicate of other issues, but I'm leaving this here for reference.
I've been testing with multiple versions of peft:
072da6d9d62: works
382b178911edff38c1ff619bbac2ba556bd2276b: works
75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495: not working
445940fb7b5d38390ffb6707e2a989e89fff03b5: not working
1a6151b91fcdcc25326b9807d7dbf54e091d506c: not working
1117d4772109a098787ce7fc297cb6cd641de6eb: not working
Steps to reproduce:
conda create python=3.8 -n test
conda activate test
git clone https://github.com/tloen/alpaca-lora.git
cd alpaca-lora
pip install -r requirements.txt
# to work around: AttributeError: bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
cd /home/ubuntu/miniconda3/envs/test/lib/python3.8/site-packages/bitsandbytes/
mv libbitsandbytes_cpu.so libbitsandbytes_cpu.so.bak
cp libbitsandbytes_cuda121.so libbitsandbytes_cpu.so
cd -
conda install cudatoolkit
# alpaca_data_cleaned_first_100.json is alpaca_data_cleaned.json with only the first 100 items (a small script for producing it is shown after these steps); setting --val_set_size 0 because there isn't enough data to build a test set
python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/data/datasets/alpaca_data_cleaned_first_100.json' --output_dir './lora-alpaca' --val_set_size 0
$ ls -alh lora-alpaca
total 16K
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr 9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr 9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu 350 Apr 9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu 443 Apr 9 12:55 adapter_model.bin
(adapter_model.bin should normally be around 16 MB)
Running on Lambda Cloud A10 instance.
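For reference, the truncated dataset file used above can be produced with a few lines (a sketch; it simply takes the first 100 items of the original alpaca_data_cleaned.json):

# Create alpaca_data_cleaned_first_100.json from the full cleaned dataset.
import json

with open("alpaca_data_cleaned.json") as f:
    data = json.load(f)

with open("alpaca_data_cleaned_first_100.json", "w") as f:
    json.dump(data[:100], f, indent=2)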
The issue is with these lines of code. They mess with the model's state_dict, so the second time it's called from the save_pretrained() method it returns None. As I understand it, one no longer has to touch the state_dict outside of the library internals. Try removing them and see if the model is saved normally:
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
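A quick way to check what actually ended up in the saved file (a diagnostic sketch; the path matches the repro above) is to load it and list the tensors:

import torch

# Inspect the saved adapter: a healthy file lists lora_A/lora_B tensors,
# while the corrupted 443 B file contains essentially nothing.
sd = torch.load("lora-alpaca/adapter_model.bin", map_location="cpu")
print(len(sd), "entries")
for name, tensor in sd.items():
    print(name, tuple(tensor.shape))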
I confirmed that removing those lines fixes the issue in alpaca-lora. It is probably safe to close this issue, as the cause seems to be in alpaca-lora, not here?
Thanks @s4rduk4r for suggesting removing the lines related to model.state_dict. I haven't confirmed it myself, but given @richardklafter's confirmation, and since I found that the author of alpaca-lora has also suggested removing those lines of code to fix another issue, I agree that we can close this and move the discussion to why those lines were added in the first place.
Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But in each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part.
Same issue as this comment: https://github.com/tloen/alpaca-lora/issues/319#issuecomment-1505313341
Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb
import os

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        # Save only the PEFT adapter weights for this checkpoint.
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        # Remove the full model weights written by the default save logic.
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
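For completeness, an adapter saved by such a callback can later be reattached to the base model (a minimal sketch; the checkpoint path is just an example of the layout the callback produces, not taken from this thread):

from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the frozen base model, then attach the saved LoRA adapter.
base_model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(base_model, "./lora-alpaca/checkpoint-800/adapter_model")
model.eval()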
@pacman100 I used the same way to save PEFT models, which works well for the LLaMA-7B model. However, when it comes to Bloomz-7b-mt, the adapter_model.bin is corrupted (only 18 KB). The only difference between Bloom and LLaMA is that I observed a warning when training Bloom with LoRA, as below:
2023-04-18 18:38:26.377 /usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
I used torch==1.13.1 and DeepSpeed ZeRO-3. Not sure if this is the reason.
BTW, I modified the built-in run_clm.py to support LoRA rather than using alpaca-lora, so it should not be a consequence of the following lines:
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
With r=8 and alpha=16, I can save the LoRA weights of LLaMA-7B successfully. However, when increasing r to 32 and alpha to 64, we obtained an empty adapter_model.bin. This is really weird.
@wxjiao, have you been able to solve it? It looks like a large adapter leads to an empty adapter_model.bin when saved. I ran into this when using LoRA + ZeRO-3 for 30B and 65B; the same code works fine for 7B.
@justinphan3110 No, I just gave up on it; it took too much time to debug. At first I thought there was something wrong with get_peft_model_state_dict, but that cannot explain the success of saving the LLaMA-7B LoRA. I printed the first elements of the state_dict for both the base model and the LoRA in my training script, and found the keys were there but the values were missing (i.e., only []). I guess there is some incompatibility between PEFT and ZeRO-3. I'll just wait.
I have just tried LLaMA-30B + LoRA but still get an empty saved adapter_model. Let me know if it works for you or if you get any new insight from it.
Have you been able to figure it out @wxjiao? I can try opening an issue.
I have the same problem (the values in adapter_model.bin are []) when using PeftModel with ZeRO-3 after deepspeed.initialize.
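One possible angle for the ZeRO-3 cases (an assumption on my side, not something confirmed in this thread): under ZeRO-3 the parameters are partitioned across ranks, so state_dict() on a single rank can contain empty shards unless the weights are gathered before saving. A hedged sketch of gathering them first, where model and output_dir are the PEFT model and output directory from the training script:

import deepspeed
import torch.distributed as dist

# Gather the ZeRO-3-partitioned parameters for read-only access, then save
# the adapter from rank 0 only. This is a sketch of the idea, not a verified fix.
with deepspeed.zero.GatheredParameters(list(model.parameters())):
    if dist.get_rank() == 0:
        model.save_pretrained(output_dir)

Setting stage3_gather_16bit_weights_on_model_save: true in the DeepSpeed config may also be relevant when saving through Trainer, though I have not verified it for this case.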
I don't know if this is the RIGHT way, but this simple modification at L275 gives me a functional adapter_model.bin with the correct size:
- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())
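Presumably this works because save_pretrained() then applies get_peft_model_state_dict() to the full, unpatched state dict exactly once, instead of re-filtering the already-filtered dict returned by the patched model.state_dict; that is my reading of it rather than something confirmed here.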
Thanks man @s4rduk4r !!! You saved my day.