Issue: Missing Generation of `pytorch_model.bin` File During Model Tuning

Question

WilliamYi96 opened this issue 9 months ago · 5 comments

Thank you for sharing your interesting project!

Recently, when I ran `bash ./script/llama_prune.sh`, the pruning step worked fine. During the tuning step, however, no error was reported, yet the generated checkpoint directory contained only the following:

checkpoints-200
- model.safetensors
- optimizer.pt
- rng_state.pth
- scheduler.pt
- trainer_state.json
- training_args.bin

I noticed that the pytorch_model.bin file was not saved. I haven't modified the code, and I am using PyTorch version 2.1.2+cu121. Could you suggest what the possible reason for this might be?
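A quick way to confirm which weight format a checkpoint holds is to look for the two filenames mentioned above. The following is a minimal sketch using only the Python standard library; the function name `weight_formats` and the labels it returns are illustrative and not part of the project.

```python
from pathlib import Path

# Weight filenames discussed in this issue, keyed by an illustrative label.
WEIGHT_FILES = {
    "safetensors": "model.safetensors",
    "pytorch_bin": "pytorch_model.bin",
}


def weight_formats(checkpoint_dir):
    """Return the sorted labels of the weight files present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return sorted(
        label
        for label, filename in WEIGHT_FILES.items()
        if (root / filename).is_file()
    )
```

For a directory like the one listed above, `weight_formats("checkpoints-200")` would report only `safetensors`.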

Answer 1 · 2023-12-22T13:08:13.000Z

Issue resolved. The cause is that in newer versions of the transformers library, safetensors has replaced pytorch_model.bin as the default save format. This can be addressed either by downgrading with pip install transformers==4.33.0, or by setting safe_serialization=False in model.save_pretrained().

Tracking here: huggingface/transformers#28183
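For an explicit save call, the second option looks roughly like the sketch below. `save_pretrained` and its `safe_serialization` argument are part of transformers; the helper name `save_as_bin` is hypothetical. The sketch covers only a `save_pretrained` call made in your own code.

```python
def save_as_bin(model, output_dir):
    """Save `model` with safetensors serialization turned off.

    `model` is assumed to be a transformers PreTrainedModel, or any object
    exposing a compatible `save_pretrained` method.
    """
    # safe_serialization=False asks for PyTorch .bin weights
    # instead of model.safetensors.
    model.save_pretrained(output_dir, safe_serialization=False)
    return output_dir
```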

Answer 2 · 2023-12-22T18:27:58.000Z

Two updates:

1. pip install transformers==4.33.0 leads to the following error:

   AttributeError: 'LlamaTokenizer' object has no attribute 'added_tokens_decoder'. Did you mean: '_added_tokens_decoder'?

2. With the latest transformers and safe_serialization=False set, pytorch_model.bin is still not saved.

This issue still exists.

Answer 3 · 2023-12-22T18:44:24.000Z

Issue resolved. save_safetensors=False must also be set in the TrainingArguments used to construct the trainer. Otherwise, the safe_serialization=False setting above has no effect on the checkpoints the trainer writes.

https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments.save_safetensors
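A minimal sketch of that setting, assuming the Hugging Face Trainer is used. `output_dir`, `save_steps`, and `save_safetensors` are real TrainingArguments fields; the concrete values and the helper name `build_training_args` are placeholders rather than the project's actual configuration.

```python
# Illustrative values only; the project's script may use different ones.
TRAINING_KWARGS = {
    "output_dir": "tune_log",    # hypothetical output directory
    "save_steps": 200,           # example checkpoint interval
    "save_safetensors": False,   # write pytorch_model.bin, not model.safetensors
}


def build_training_args(**overrides):
    """Create TrainingArguments with safetensors checkpointing disabled by default."""
    # Imported lazily so the settings above can be inspected
    # without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**{**TRAINING_KWARGS, **overrides})
```

The returned object is what gets passed as `args=` when constructing the `Trainer`.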

Answer 4 · 2024-08-01T22:35:32.000Z

@WilliamYi96 Can we recover pytorch_model.bin from the safetensors file? I have already run fine-tuning on a larger dataset for some time and want to avoid re-running the training. Or can we resume from the checkpoint and save again after running a few more steps?
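One possible way to do the conversion asked about here, offered as an untested sketch and not confirmed in this thread: load the tensors with the safetensors package and re-save them with `torch.save`. `safetensors.torch.load_file` and `torch.save` are real functions; the helper names `bin_path_for` and `convert_to_bin` are hypothetical, and whether the resulting file matches what the project's later steps expect is an assumption.

```python
from pathlib import Path


def bin_path_for(safetensors_path):
    """Return the sibling pytorch_model.bin path for a safetensors file."""
    return Path(safetensors_path).with_name("pytorch_model.bin")


def convert_to_bin(safetensors_path):
    """Load tensors from a safetensors file and re-save them with torch.save."""
    # Imported lazily; requires the torch and safetensors packages.
    import torch
    from safetensors.torch import load_file

    # load_file returns a dict mapping tensor names to tensors (on CPU by default).
    state_dict = load_file(str(safetensors_path))
    target = bin_path_for(safetensors_path)
    torch.save(state_dict, str(target))
    return target
```

For example, `convert_to_bin("checkpoints-200/model.safetensors")` would write `checkpoints-200/pytorch_model.bin`. On the second question, `Trainer.train` accepts a `resume_from_checkpoint` argument, so resuming with save_safetensors=False and saving at a later step may also be an option.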