OptimalScale/LMFlow

[BUG] constant loss during fine-tuning

hellkonig opened this issue · 4 comments

Describe the bug
I used run_finetune_with_lora.sh to fine-tune the model robin-7b-v2-delta, but the loss stays constant at every iteration. I am not sure whether this is a bug or whether I have misconfigured something. I would appreciate any suggestions or insights, and I am happy to provide further information if it helps locate the problem. Thank you very much!

Environment
OS:

NAME="Rocky Linux"
VERSION="8.8 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

Python version: 3.9.18
CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

g++ version: g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
torch version: 2.0.0

To Reproduce
I run the following command on the server's compute node:

./scripts/run_finetune_with_lora.sh \
  --model_name_or_path robin-7b-v2-delta \
  --dataset_path data/alpaca/train \
  --output_lora_path output_models/fintuned-robin

In the .bashrc, I have the following lines:

export CUDA_HOME=/apps/cuda/11.7.0
export PATH=/apps/cuda/11.7.0/bin:${PATH}
export LD_LIBRARY_PATH=/apps/cuda/11.7.0/lib64:${LD_LIBRARY_PATH}
export WANDB_MODE=disabled
export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1
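
Environment variable names are easy to mistype, and a misspelled name is silently ignored by the libraries that read it. The following is a minimal sketch, not part of LMFlow, that flags expected variables which are unset while a similarly spelled one is present. The helper name `find_misspelled` and the similarity cutoff are assumptions chosen for illustration.

```python
import difflib
import os

# Names taken from the .bashrc lines above; adjust to your own setup.
EXPECTED = ["TRANSFORMERS_OFFLINE", "HF_DATASETS_OFFLINE", "WANDB_MODE"]


def find_misspelled(expected, environ, cutoff=0.85):
    """Return {expected_name: similar_name_found} for expected names that are unset.

    `cutoff` is an arbitrary similarity threshold (hypothetical default).
    """
    suspects = {}
    for name in expected:
        if name in environ:
            continue  # correctly set, nothing to report
        close = difflib.get_close_matches(name, list(environ), n=1, cutoff=cutoff)
        if close:
            suspects[name] = close[0]
    return suspects


if __name__ == "__main__":
    for want, got in find_misspelled(EXPECTED, os.environ).items():
        print(f"{want} is not set, but {got} is: possible typo")
```

Running it inside the same shell session that launches the fine-tuning script checks the variables the job will actually see.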

Screenshots
To check the loss at every iteration, I set --logging-steps 1. A screenshot of the fine-tuning output:
[Screenshot: loss]
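
A constant loss can also be confirmed programmatically rather than by eye. Below is a minimal sketch, assuming the per-step loss values have already been collected into a list (for example, copied from the training log). The function name `loss_is_flat`, the tolerance, and the sample values are hypothetical and are not taken from the screenshot.

```python
def loss_is_flat(losses, rel_tol=1e-6):
    """Return True if every logged loss stays within rel_tol of the first one."""
    if len(losses) < 2:
        return False  # not enough points to judge
    first = losses[0]
    scale = max(abs(first), 1e-12)  # avoid division by zero
    return all(abs(x - first) / scale <= rel_tol for x in losses)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(loss_is_flat([2.31, 2.31, 2.31, 2.31]))
    print(loss_is_flat([2.31, 2.10, 1.95, 1.80]))
```

A loss that is flat from the very first step usually points to a setup problem (weights, gradients, or optimizer state) and not to slow convergence.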

@hendrydong Could you take a look at this problem?

Hi, thank you for the issue.
One likely cause is that the delta weights have not been merged into the original model. Because of copyright restrictions, we cannot upload the full model.
Can you try other models, such as openlm-research/open_llama_3b?
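
To illustrate what merging means here, the following is a conceptual sketch. It assumes the delta checkpoint stores, for each parameter, the element-wise difference between the fine-tuned and the original weights, as in common delta-release schemes; whether robin-7b-v2-delta follows exactly this scheme is an assumption. The function `apply_delta` is hypothetical and uses plain Python lists in place of real tensors.

```python
def apply_delta(base_weights, delta_weights):
    """Return merged weights, computed as base + delta for every named parameter.

    Both arguments map parameter names to flat lists of floats. Real
    checkpoints hold tensors, but the arithmetic is the same.
    """
    if base_weights.keys() != delta_weights.keys():
        differing = sorted(set(base_weights) ^ set(delta_weights))
        raise KeyError(f"parameter names differ: {differing}")
    merged = {}
    for name, base in base_weights.items():
        delta = delta_weights[name]
        if len(base) != len(delta):
            raise ValueError(f"size mismatch for {name}")
        merged[name] = [b + d for b, d in zip(base, delta)]
    return merged


if __name__ == "__main__":
    # Toy values for illustration only.
    base = {"layer.weight": [1.0, 2.0]}
    delta = {"layer.weight": [0.5, -1.0]}
    print(apply_delta(base, delta))
```

Loading the delta on its own skips this addition, so the model starts from weights that do not represent a trained network.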

Moreover, please check the ZeRO strategy and the gradient checkpointing setting. As a plain baseline, you could try ZeRO-2 without gradient checkpointing.
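
As a rough sketch of the plain setting, a DeepSpeed configuration selects the ZeRO stage through `zero_optimization.stage`. The snippet below builds such a config as a Python dict and prints it as JSON. The function name `make_zero2_config` is hypothetical, the `"auto"` values assume the Hugging Face Trainer integration fills them in, and how LMFlow's scripts consume the file is not shown here.

```python
import json


def make_zero2_config():
    """Build a minimal DeepSpeed config dict that selects ZeRO stage 2."""
    return {
        "zero_optimization": {"stage": 2},
        # "auto" lets the Hugging Face Trainer integration supply these values.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


if __name__ == "__main__":
    print(json.dumps(make_zero2_config(), indent=2))
```

Gradient checkpointing is controlled separately: in Hugging Face `TrainingArguments` it is the `gradient_checkpointing` option, which is off by default.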

Thank you so much, @hendrydong and @research4pan. I think the problem is with the model I downloaded. When I tried gpt2-large and open_llama_3b, both run_finetune.sh and run_finetune_with_lora.sh showed a decreasing loss during fine-tuning.