Model saving issue after training
gothaleshubham opened this issue · 1 comment
Please check that this issue hasn't been reported before.
- I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
The expected behaviour is for the model weights to be saved in the output directory.
Current behaviour
Only the tokenizer files are being saved to the output directory; the model weights are not saved even after training completes. I am training on 2 GPUs. While saving, I can see utilization on GPU 2 as shown below, but the weights are never written and the run hangs.
error.log
Steps to reproduce
Data : train_data_1K.json
git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
cd ..
accelerate launch -m axolotl.cli.train model.yaml
Config yaml
base_model: gpt2
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
# huggingface repo
datasets:
- path: train_data_1K.jsonl
ds_type: json
type: chat_template
chat_template: gemma
field_messages: conversations
message_field_role: from
message_field_content: value
roles:
user:
- human
assistant:
- gpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.5
output_dir: output
sequence_len: 2048
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: false
warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
debug:
weight_decay: 0.0
special_tokens:
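For reference, given the chat_template/conversations settings above, one line of train_data_1K.jsonl would look roughly like the sketch below. The field names (conversations, from, value, human, gpt) come from the config; the message content is just a made-up placeholder.
{"conversations": [{"from": "human", "value": "What is the capital of France?"}, {"from": "gpt", "value": "The capital of France is Paris."}]}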
Possible solution
No response
Which Operating Systems are you using?
- Linux
- macOS
- Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
- My issue title is concise, descriptive, and in title casing.
- I have searched the existing issues to make sure this bug has not been reported yet.
- I am using the latest version of axolotl.
- I have provided enough information for the maintainers to reproduce and diagnose the issue.
A couple of things: I would recommend disabling sample packing with the gpt2 model, since it does not really support it (it has no flash attention support). Also, the maximum sequence length for gpt2 is 1024; when I tried it with the 2048 setting you are using, it ran into a CUDA/NCCL issue, which is likely what you were seeing.
I think once you make these changes it should be fine, as it saved properly for me after that.
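For reference, a minimal sketch of the two config changes suggested above (everything else in the yaml stays the same; the 1024 value comes from gpt2's context limit):
# gpt2 has no flash attention support, so sample packing is not handled well
sample_packing: false
# gpt2's maximum context length is 1024 tokens
sequence_len: 1024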