horseee/LLM-Pruner

Issue: Missing Generation of `pytorch_model.bin` File During Model Tuning

WilliamYi96 opened this issue · 5 comments

Thank you for sharing your interesting project!

Recently, when I ran bash ./script/llama_prune.sh, the pruning step worked fine. However, during the tuning step, although no error was reported, the generated checkpoint directory contained only the following:

  • checkpoints-200
    • model.safetensors
    • optimizer.pt
    • rng_state.pth
    • scheduler.pt
    • trainer_state.json
    • training_args.bin

I noticed that the pytorch_model.bin file was not saved. I haven't modified the code, and I am using PyTorch version 2.1.2+cu121. Could you suggest what the possible reason for this might be?

Issue resolved. The cause is that newer versions of the transformers library (transformers>=4.33.0) save models in safetensors format by default instead of pytorch_model.bin. This can be addressed either by downgrading with pip install transformers==4.33.0, or by setting safe_serialization=False in model.save_pretrained().

Tracking here: huggingface/transformers#28183

Two updates:

  1. pip install transformers==4.33.0 leads to the following error:
     AttributeError: 'LlamaTokenizer' object has no attribute 'added_tokens_decoder'. Did you mean: '_added_tokens_decoder'?
  2. With the latest transformers and safe_serialization=False set, pytorch_model.bin is still not saved.

This issue still exists.

Issue resolved. The fix is to set save_safetensors=False in the TrainingArguments used to construct the trainer. Otherwise, the safe_serialization=False setting above has no effect on the checkpoints the trainer writes.

https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments.save_safetensors
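
A minimal sketch of that setting, assuming a standard transformers Trainer setup. The names training_kwargs and build_training_args are hypothetical, and the output directory and save interval are placeholder values, not this repository's actual configuration:

```python
# Keyword arguments for transformers.TrainingArguments.
# output_dir and save_steps are placeholder values for illustration only.
training_kwargs = {
    "output_dir": "tune_log/example",
    "save_steps": 200,
    # Ask the Trainer to write checkpoints with torch.save
    # (pytorch_model.bin) rather than model.safetensors.
    "save_safetensors": False,
}


def build_training_args(**overrides):
    """Create TrainingArguments from the settings above plus any overrides."""
    # Imported lazily so the settings can be inspected without loading
    # transformers.
    from transformers import TrainingArguments

    return TrainingArguments(**{**training_kwargs, **overrides})
```

The returned object would then be passed as args=build_training_args() when constructing the Trainer.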

@WilliamYi96 Can we recover pytorch_model.bin from the safetensors file? I have already run the fine-tuning on a larger dataset for some time and want to avoid re-running the training. Or can we resume from the checkpoint and save again after running a few more steps?
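
One possible way to do the conversion asked about here, as a sketch that has not been verified against this repository: load the tensors from model.safetensors with the safetensors library and write them back out with torch.save. The function name safetensors_to_bin is hypothetical, and the file names are taken from the checkpoint listing above. Whether the resulting file is a drop-in replacement depends on how the downstream code loads it.

```python
import os


def safetensors_to_bin(checkpoint_dir):
    """Write pytorch_model.bin next to model.safetensors in checkpoint_dir.

    Returns the path of the file that was written.
    """
    # Imported lazily so that importing this module does not require
    # torch or safetensors to be installed.
    import torch
    from safetensors.torch import load_file

    src = os.path.join(checkpoint_dir, "model.safetensors")
    dst = os.path.join(checkpoint_dir, "pytorch_model.bin")

    # load_file returns a dict mapping parameter names to tensors.
    state_dict = load_file(src)

    # torch.save pickles that dict, which is the layout of a
    # state-dict style .bin file.
    torch.save(state_dict, dst)
    return dst
```

For example, safetensors_to_bin("path/to/checkpoint") would read path/to/checkpoint/model.safetensors and leave the original file untouched, so the conversion can be checked before anything is deleted.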