[BUG] 'autotrain llm' does not save checkpoints in the project folder
Prerequisites
- I have read the documentation.
- I have checked other issues for similar problems.
Backend
Local
Interface Used
UI
CLI Command
autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft
UI Screenshots & Parameters
No response
Error Logs
Checkpoints are not saved
Additional Information
I'm fine-tuning "meta-llama/Llama-2-7b-hf" on a remote GPU using autotrain llm, but after training completes, no checkpoints are saved. The project folder contains files like autotrain-data, adapter-config, tokenizer, and training-params, but the checkpoint folders are missing.
Here's the content of the training-params file:
{ "model": "meta-llama/Llama-2-7b-hf", "project_name": "ftLlama", "data_path": "ftLlama/autotrain-data", "train_split": "train", "valid_split": null, "add_eos_token": true, "block_size": 1024, "model_max_length": 2048, "padding": "right", "trainer": "sft", "use_flash_attention_2": false, "log": "none", "disable_gradient_checkpointing": false, "logging_steps": -1, "eval_strategy": "epoch", "save_total_limit": 1, "auto_find_batch_size": false, "mixed_precision": null, "lr": 0.0002, "epochs": 5, "batch_size": 12, "warmup_ratio": 0.1, "gradient_accumulation": 4, "optimizer": "adamw_torch", "scheduler": "linear", "weight_decay": 0.0, "max_grad_norm": 1.0, "seed": 42, "chat_template": null, "quantization": "int4", "target_modules": "all-linear", "merge_adapter": false, "peft": true, "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05, "model_ref": null, "dpo_beta": 0.1, "max_prompt_length": 128, "max_completion_length": null, "prompt_text_column": "autotrain_prompt", "text_column": "autotrain_text", "rejected_text_column": "autotrain_rejected_text", "push_to_hub": false, "unsloth": false, "distributed_backend": null }
Am I missing something or is this a bug?
Which checkpoints do you want to save? It saves only one checkpoint by default; you can change that with the save_total_limit argument. LLM training doesn't use validation data.
Information on all arguments is in the docs, btw: https://hf.co/docs/autotrain
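For context (an assumption about the underlying stack, not stated in this thread): autotrain builds on the transformers Trainer, where save_total_limit only caps how many checkpoint-* folders are kept; whether any get written at all is governed by the save strategy. A minimal sketch of the two knobs:

from transformers import TrainingArguments

# save_strategy controls whether/when checkpoint-* folders are written;
# save_total_limit only controls how many are retained afterwards.
args = TrainingArguments(
    output_dir="ftLlama",
    save_strategy="epoch",  # "no" would disable checkpoints entirely
    save_total_limit=1,     # keep only the most recent checkpoint
)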
Thanks, abhishek, for your response.
It doesn't save any checkpoints, and there is no checkpoint folder in my project.
Is there anything I can provide that would help resolve the issue?
I also tried the code below, but it doesn't save checkpoints either.
from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.cli.utils import llm_munge_data
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="meta-llama/Llama-2-7b-hf",
    project_name="autotrain-finetune-Llama",
    data_path=".",
    trainer="sft",
    push_to_hub=False,
    save_total_limit=1,  # keep one checkpoint, as suggested above
    lr=2e-4,
    epochs=2,
    batch_size=16,
    peft=True,
)
params = llm_munge_data(params, local=True)
project = AutoTrainProject(params=params, backend="local")
project.create()
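To confirm nothing is written, I check the project folder for checkpoint-* directories after project.create() finishes (a quick sketch using the project_name from the script above):

from pathlib import Path

# List any checkpoint-* folders; the path matches project_name above.
checkpoints = sorted(Path("autotrain-finetune-Llama").glob("checkpoint-*"))
print(checkpoints or "no checkpoint-* folders found")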
Sorry, I overlooked that and gave you an incorrect response. In LLM training, validation data isn't used, so checkpoints are not saved. I'll take a look at this behaviour again and come back to you.
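As a standalone sanity check outside autotrain (a minimal sketch with a tiny stand-in model and toy dataset, not autotrain's internal code): the plain transformers Trainer writes checkpoint-* folders whenever save_strategy is not "no", even without any validation data, which suggests the issue lies in the strategy chosen when valid_split is null.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tiny stand-in model and toy corpus -- assumptions for a fast, runnable demo.
model_name = "sshleifer/tiny-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = Dataset.from_dict({"text": ["hello world"] * 32}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=32),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="ckpt-demo",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_strategy="epoch",  # flip to "no" to reproduce the missing checkpoints
    save_total_limit=1,
    report_to="none",
)
Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
# With save_strategy="epoch", ckpt-demo/checkpoint-* appears; with "no", it does not.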
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 20 days since being marked as stale.