huggingface/autotrain-advanced

[BUG] 'autotrain llm' does not save checkpoints in the project folder

Closed this issue · 7 comments

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

UI

CLI Command

autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft

UI Screenshots & Parameters

No response

Error Logs

Checkpoints are not saved

Additional Information

I'm fine-tuning "meta-llama/Llama-2-7b-hf" on a remote GPU using autotrain llm, but after training completes, no checkpoints are saved. The project folder contains files like autotrain-data, adapter-config, tokenizer, training-params, etc., but the checkpoints are missing.
Here's the content of the training-params file:
{ "model": "meta-llama/Llama-2-7b-hf", "project_name": "ftLlama", "data_path": "ftLlama/autotrain-data", "train_split": "train", "valid_split": null, "add_eos_token": true, "block_size": 1024, "model_max_length": 2048, "padding": "right", "trainer": "sft", "use_flash_attention_2": false, "log": "none", "disable_gradient_checkpointing": false, "logging_steps": -1, "eval_strategy": "epoch", "save_total_limit": 1, "auto_find_batch_size": false, "mixed_precision": null, "lr": 0.0002, "epochs": 5, "batch_size": 12, "warmup_ratio": 0.1, "gradient_accumulation": 4, "optimizer": "adamw_torch", "scheduler": "linear", "weight_decay": 0.0, "max_grad_norm": 1.0, "seed": 42, "chat_template": null, "quantization": "int4", "target_modules": "all-linear", "merge_adapter": false, "peft": true, "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05, "model_ref": null, "dpo_beta": 0.1, "max_prompt_length": 128, "max_completion_length": null, "prompt_text_column": "autotrain_prompt", "text_column": "autotrain_text", "rejected_text_column": "autotrain_rejected_text", "push_to_hub": false, "unsloth": false, "distributed_backend": null }
Am I missing something or is this a bug?
This is the command I'm using:
autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft
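
For reference, a minimal sketch for double-checking what ended up in the saved params file (the training_params.json file name and path are assumptions; adjust to whatever file is in your project folder):

import json

# Load the saved training parameters and print the checkpoint-related ones.
# File name/path assumed; change to match your project folder.
with open("ftLlama/training_params.json") as f:
    saved = json.load(f)
print(saved["save_total_limit"], saved["eval_strategy"])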

which checkpoints do you want to save? it saves only one checkpoint by default. you can change it with the save_total_limit argument. llm training doesn't use validation data.

information for all arguments is in the docs btw: https://hf.co/docs/autotrain
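
for illustration, save_total_limit behaves like the argument of the same name in transformers.TrainingArguments, which autotrain builds on (a minimal sketch, not autotrain's actual code; the output_dir is just an example):

from transformers import TrainingArguments

# Keep only the most recent checkpoint; older checkpoint-<step> folders are deleted.
args = TrainingArguments(
    output_dir="ftLlama",
    save_strategy="epoch",   # write a checkpoint at the end of every epoch
    save_total_limit=1,      # retain at most one checkpoint on disk
)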

Thanks abhishek for your response.

It doesn't save any checkpoints, and there is no checkpoint folder in my project.
Is there anything I can provide that would help resolve the issue?

I also tried the code below, but it doesn't save checkpoints either.

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.cli.utils import llm_munge_data
from autotrain.project import AutoTrainProject

# Configure an SFT fine-tuning run for Llama-2-7b with PEFT (LoRA) adapters.
params = LLMTrainingParams(
    model="meta-llama/Llama-2-7b-hf",
    project_name="autotrain-finetune-Llama",
    data_path=".",
    trainer="sft",
    push_to_hub=False,
    save_total_limit=1,
    lr=2e-4,
    epochs=2,
    batch_size=16,
    peft=True,
)

# Prepare the local dataset and launch training on the local backend.
params = llm_munge_data(params, local=True)
project = AutoTrainProject(params=params, backend="local")
project.create()
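
After project.create() finishes, a simple way to check whether any Trainer checkpoints exist on disk (assuming the usual checkpoint-<step> naming from transformers):

import glob

# List checkpoint folders written by the underlying transformers Trainer, if any.
print(glob.glob("autotrain-finetune-Llama/checkpoint-*"))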

sorry, i overlooked and gave you an incorrect response. in llm training, validation data isn't used, so checkpoints are not saved. i'll take a look at this behaviour again and get back to you.
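
a plausible reading of this behaviour (an assumption about the trainer setup, not the actual autotrain source): when there is no validation split, the run ends up configured roughly like the sketch below, so no intermediate checkpoint-<step> folders are ever written and only the final adapter/tokenizer export remains in the project folder.

from transformers import TrainingArguments

# Hypothetical configuration when valid_split is None: evaluation and
# periodic checkpointing are both disabled, so only the final model save appears.
args = TrainingArguments(
    output_dir="ftLlama",
    eval_strategy="no",
    save_strategy="no",
)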

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 20 days since being marked as stale.