huggingface/autotrain-advanced

[BUG] 'autotrain llm' does not save checkpoints in the project folder

Closed this issue · 7 comments

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

UI

CLI Command

autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft

UI Screenshots & Parameters

No response

Error Logs

Checkpoints are not saved

Additional Information

I'm fine-tuning "meta-llama/Llama-2-7b-hf" on a remote GPU using autotrain llm, but after training completes, no checkpoints are saved. The project folder contains files like autotrain-data, adapter-config, tokenizer, training-params, etc., but the checkpoints are missing.
Here's the content of the training-params file:
{ "model": "meta-llama/Llama-2-7b-hf", "project_name": "ftLlama", "data_path": "ftLlama/autotrain-data", "train_split": "train", "valid_split": null, "add_eos_token": true, "block_size": 1024, "model_max_length": 2048, "padding": "right", "trainer": "sft", "use_flash_attention_2": false, "log": "none", "disable_gradient_checkpointing": false, "logging_steps": -1, "eval_strategy": "epoch", "save_total_limit": 1, "auto_find_batch_size": false, "mixed_precision": null, "lr": 0.0002, "epochs": 5, "batch_size": 12, "warmup_ratio": 0.1, "gradient_accumulation": 4, "optimizer": "adamw_torch", "scheduler": "linear", "weight_decay": 0.0, "max_grad_norm": 1.0, "seed": 42, "chat_template": null, "quantization": "int4", "target_modules": "all-linear", "merge_adapter": false, "peft": true, "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05, "model_ref": null, "dpo_beta": 0.1, "max_prompt_length": 128, "max_completion_length": null, "prompt_text_column": "autotrain_prompt", "text_column": "autotrain_text", "rejected_text_column": "autotrain_rejected_text", "push_to_hub": false, "unsloth": false, "distributed_backend": null }
Am I missing something or is this a bug?
This is the command I'm using:
autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft
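
For reference, a minimal sketch for double-checking what ended up in the saved params file (the training_params.json file name and path are assumptions; adjust to whatever file is in your project folder):

import json

# Load the saved training parameters and print the checkpoint-related ones.
# File name/path assumed; change to match your project folder.
with open("ftLlama/training_params.json") as f:
    saved = json.load(f)
print(saved["save_total_limit"], saved["eval_strategy"])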

which checkpoints do you want to save? it saves only one checkpoint by default. you can change it with the save_total_limit argument. llm training doesn't use validation data.

information for all arguments is in the docs btw: https://hf.co/docs/autotrain
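
for illustration, save_total_limit behaves like the argument of the same name in transformers.TrainingArguments, which autotrain builds on (a minimal sketch, not autotrain's actual code; the output_dir is just an example):

from transformers import TrainingArguments

# Keep only the most recent checkpoint; older checkpoint-<step> folders are deleted.
args = TrainingArguments(
    output_dir="ftLlama",
    save_strategy="epoch",   # write a checkpoint at the end of every epoch
    save_total_limit=1,      # retain at most one checkpoint on disk
)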

Thanks abhishek for your response.

It doesn't save any checkpoints, and there is no checkpoint folder in my project.
Is there anything I can provide that would help resolve the issue?

I also tried the code below, but it doesn't save checkpoints either.

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.cli.utils import llm_munge_data
from autotrain.project import AutoTrainProject

# Configure an SFT fine-tuning run for Llama-2-7b with PEFT (LoRA) adapters.
params = LLMTrainingParams(
    model="meta-llama/Llama-2-7b-hf",
    project_name="autotrain-finetune-Llama",
    data_path=".",
    trainer="sft",
    push_to_hub=False,
    save_total_limit=1,
    lr=2e-4,
    epochs=2,
    batch_size=16,
    peft=True,
)

# Prepare the local dataset and launch training on the local backend.
params = llm_munge_data(params, local=True)
project = AutoTrainProject(params=params, backend="local")
project.create()
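
After project.create() finishes, a simple way to check whether any Trainer checkpoints exist on disk (assuming the usual checkpoint-<step> naming from transformers):

import glob

# List checkpoint folders written by the underlying transformers Trainer, if any.
print(glob.glob("autotrain-finetune-Llama/checkpoint-*"))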

sorry, i overlooked and gave you an incorrect response. in llm training, validation data isn't used, so checkpoints are not saved. i'll take a look at this behaviour again and get back to you.
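
a plausible reading of this behaviour (an assumption about the trainer setup, not the actual autotrain source): when there is no validation split, the run ends up configured roughly like the sketch below, so no intermediate checkpoint-<step> folders are ever written and only the final adapter/tokenizer export remains in the project folder.

from transformers import TrainingArguments

# Hypothetical configuration when valid_split is None: evaluation and
# periodic checkpointing are both disabled, so only the final model save appears.
args = TrainingArguments(
    output_dir="ftLlama",
    eval_strategy="no",
    save_strategy="no",
)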

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 20 days since being marked as stale.