ymcui/Chinese-LLaMA-Alpaca-2

loss comes to 0

Abolfazl-kr opened this issue · 5 comments

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution
  • Third-party plugin issues (e.g., llama.cpp, LangChain, text-generation-webui): we recommend checking the corresponding project for solutions

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System

Linux

Describe your issue in detail

I am experiencing an issue with the training loss in my deep learning model and would like to ask for help in resolving it.
I'm training LLaMA-2 on another language and ran into a loss-scale overflow problem. I'm using four distributed 16 GB VRAM T4 GPUs with fp16 (I couldn't use bf16 because the T4 does not support it).

Specifically, when I set loss_scale=0 (as in the following DeepSpeed config), the loss scale repeatedly overflowed in DeepSpeed and eventually produced this error: "FloatingPointError: Minimum loss scale reached" (a sketch of what is happening follows the config below).

    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 1,
        "min_loss_scale": 1e-10
    },
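
For context, with loss_scale set to 0 DeepSpeed uses dynamic loss scaling: it starts at 2**initial_scale_power, skips any step whose gradients contain inf/NaN and halves the scale, and grows the scale back after loss_scale_window consecutive clean steps. If gradients keep overflowing, the scale is halved all the way down to min_loss_scale and training aborts with the error above. The following is a simplified, hypothetical Python sketch of that behaviour (illustration only, not DeepSpeed's actual implementation; it assumes hysteresis=1, i.e. every overflow shrinks the scale):

    class DynamicScalerSketch:
        """Illustrative stand-in for a dynamic fp16 loss scaler (not DeepSpeed code)."""

        def __init__(self, initial_scale_power=16, loss_scale_window=100,
                     min_loss_scale=1e-10):
            self.scale = 2.0 ** initial_scale_power   # 65536 to start
            self.window = loss_scale_window
            self.min_scale = min_loss_scale
            self.good_steps = 0

        def update(self, found_inf_or_nan: bool):
            if found_inf_or_nan:
                # Overflow: this optimizer step is skipped and the scale shrinks.
                self.good_steps = 0
                if self.scale / 2.0 < self.min_scale:
                    raise FloatingPointError("Minimum loss scale reached")
                self.scale /= 2.0
            else:
                # Clean step: after `window` clean steps, cautiously grow back.
                self.good_steps += 1
                if self.good_steps % self.window == 0:
                    self.scale *= 2.0

In other words, the error indicates that gradients kept producing inf/NaN often enough that the scale was driven all the way down to min_loss_scale.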

The training loss came down to around 4 within about 4,500 steps (the max step was 42,127; I was training on 1 GB of text with run_clm_pt_with_peft.py). After roughly step 4,500 it increased very significantly, causing the model to break down.

    {
      "epoch": 0.11,
      "learning_rate": 0.0001945353187718296,
      "loss": 4.0682,
      "step": 4540
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019451583018681562,
      "loss": 4.0441,
      "step": 4550
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019449386523582728,
      "loss": 4.6709,
      "step": 4560
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019446940970948925,
      "loss": 4.9938,
      "step": 4570
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019444490153817877,
      "loss": 5.1778,
      "step": 4580
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019442034073555343,
      "loss": 5.5932,
      "step": 4590
    },
    {
      "epoch": 0.11,
      "learning_rate": 0.00019439819102472819,
      "loss": 6.5669,
      "step": 4600
    },
...

(the loss eventually rose to about 400)

I tried to solve this issue but could not. However, when I set loss_scale to 1 in the DeepSpeed config, the model started training, but the loss went to 0. I would appreciate any guidance on how to resolve this.

    "fp16": {
        "enabled": "auto",
        "loss_scale": 1,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 1,
        "min_loss_scale": 1e-10
    },

First, could you tell me what effect loss_scale has? From what I found, setting it to 0 enables dynamic loss scaling, while setting it to a fixed number disables the dynamic behavior, but I still don't fully understand the difference. Here is the training log with loss_scale=1 (a short underflow illustration follows the log):

    {
      "epoch": 0.07,
      "learning_rate": 0.0001987099956602297,
      "loss": 1.9179,
      "step": 2810
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.0001987099956602297,
      "loss": 1.9276,
      "step": 2820
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 2.8882,
      "step": 2830
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2840
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2850
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2860
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.1851,
      "step": 2870
    },
    {
      "epoch": 0.07,
      "learning_rate": 0.00019870880019146788,
      "loss": 0.0,
      "step": 2880
    },
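
For what it's worth, the zeros in this log are consistent with fp16 underflow: with a fixed loss_scale of 1 no scaling happens at all, so values smaller than fp16's smallest representable magnitude (roughly 6e-8) are silently flushed to zero. A small, self-contained PyTorch illustration of the effect (not taken from the training run; the numbers are chosen only to demonstrate the rounding):

    import torch

    # fp16 cannot represent magnitudes below ~6e-8; they round to zero.
    tiny = torch.tensor(1e-8, dtype=torch.float16)
    print(tiny)                    # tensor(0., dtype=torch.float16) -> value lost

    # Loss scaling multiplies the loss before backward() so gradients stay in
    # fp16's representable range; the optimizer divides the scale back out in fp32.
    scale = 2.0 ** 16
    scaled = torch.tensor(1e-8 * scale, dtype=torch.float16)
    print(scaled)                  # ~6.55e-4, representable, information preserved
    print(scaled.float() / scale)  # ~1e-8 recovered after unscaling

With "loss_scale": 1 that multiplication effectively never happens, which may be why the logged loss collapses to 0 here instead of overflowing.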


### Dependencies (must be provided for code-related issues)

Name: peft
Version: 0.5.0

Name: torch
Version: 2.0.1

CUDA: 11.8

Python: 3.10.13

Name: transformers
Version: 4.35.0.dev0



### Execution logs or screenshots

    torchrun --nnodes 1 --nproc_per_node 4 run_clm_pt_with_peft.py \
        --deepspeed ds_zero2_no_offload.json \
        --model_name_or_path /home/hadoop/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/ \
        --tokenizer_name_or_path /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/scripts/tokenizer/merged_tokenizer_hf \
        --dataset_dir /home/hadoop/abolfazl/parvin2 \
        --data_cache_dir /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/scripts/training/cache \
        --validation_split_percentage 0.001 \
        --per_device_train_batch_size 8 \
        --do_train \
        --seed $RANDOM \
        --fp16 \
        --num_train_epochs 1 \
        --lr_scheduler_type cosine \
        --learning_rate 2e-4 \
        --warmup_ratio 0.001 \
        --weight_decay 0.001 \
        --logging_strategy steps \
        --logging_steps 10 \
        --save_strategy steps \
        --save_total_limit 3 \
        --save_steps 1000 \
        --gradient_accumulation_steps 1 \
        --preprocessing_num_workers 8 \
        --block_size 128 \
        --output_dir /home/hadoop/abolfazl/Chinese-LLaMA-Alpaca-2/out_pt_secondtry \
        --overwrite_output_dir \
        --ddp_timeout 30000 \
        --logging_first_step True \
        --lora_rank 64 \
        --lora_alpha 16 \
        --trainable "q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj" \
        --lora_dropout 0.05 \
        --modules_to_save "embed_tokens,lm_head" \
        --torch_dtype float16 \
        --load_in_kbits 4 \
        --gradient_checkpointing \
        --ddp_find_unused_parameters False

The combination of 4-bit quantization, fp16, and LoRA may be the reason for this phenomenon. Nothing works better than bf16 here. Alternatively, you could try increasing the total_batch_size to alleviate the problem.
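
As a quick aside, whether bf16 is available at all can be checked directly in PyTorch; the T4 (compute capability 7.5) does not support it, while Ampere cards such as the A100 or A5000 do:

    import torch

    # bf16 requires an Ampere-or-newer GPU (compute capability >= 8.0).
    print(torch.cuda.get_device_name(0))        # e.g. "Tesla T4"
    print(torch.cuda.get_device_capability(0))  # (7, 5) on a T4
    print(torch.cuda.is_bf16_supported())       # False on a T4, True on A100/A5000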

Thank you so much @iMountTai!
I will try it right now.

I couldn't increase the batch_size because of OOM.
Now I'm trying mixed precision or fp32; maybe that will work.
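
One hedged note on the total_batch_size suggestion above: since the per-device batch size is memory-bound, raising --gradient_accumulation_steps (currently 1 in the command above) increases the effective batch size without extra activation memory. A back-of-the-envelope calculation, where the accumulation value of 4 is purely illustrative:

    # Effective (total) batch size per optimizer step.
    per_device_train_batch_size = 8    # from the torchrun command above
    num_gpus = 4                       # four T4s
    gradient_accumulation_steps = 4    # hypothetical; the run above uses 1

    total_batch_size = (per_device_train_batch_size
                        * num_gpus
                        * gradient_accumulation_steps)
    print(total_batch_size)            # 128 sequences per update instead of 32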

> I couldn't increase the batch_size because of OOM. Now I'm trying mixed precision or fp32; maybe that will work.

@Abolfazl-kr may I ask how you solved this issue?

@hank0316 unfortunately I ended up changing my GPU cards.
With an A5000 (24 GB), training time decreased and I can keep GPU usage under control.