[Question] LLaVA DPO training loss increases
Closed this issue · 4 comments
Required prerequisites
- I have read the documentation https://align-anything.readthedocs.io.
- I have searched the Issue Tracker and Discussions to confirm that this hasn't already been reported. (+1 or comment there if it has.)
- Consider asking first in a Discussion.
Questions
Hi everyone,
When I trained LLaVA with the DPO algorithm, I observed an abnormal increase in the loss. Meanwhile, the reward accuracy ended slightly above 0.5 and appeared to keep increasing, and the reward margin trended upward while the better-sample reward and worse-sample reward oscillated with the same period. My biggest confusion is why the loss keeps increasing. Is something wrong? I haven't changed any training code.
Better/Worse sample reward figure
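For context, here is how these quantities are usually defined in the standard DPO objective. This is a generic sketch (the function name dpo_metrics and its arguments are illustrative, not the repo's actual API); it also shows why the mean loss can climb while reward accuracy stays above 0.5: the loss averages -log sigmoid(beta * margin), so a minority of pairs with large negative margins can dominate the mean even when most pairs are ranked correctly.
# Generic sketch of the standard DPO loss and logging metrics (Rafailov et al., 2023),
# not the exact align-anything implementation. Inputs are per-sample sums of response
# token log-probabilities under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_metrics(policy_better_logps, policy_worse_logps,
                ref_better_logps, ref_worse_logps, beta=0.1):
    # Implicit rewards are the beta-scaled policy/reference log-ratios.
    better_reward = beta * (policy_better_logps - ref_better_logps)
    worse_reward = beta * (policy_worse_logps - ref_worse_logps)
    margin = better_reward - worse_reward
    # The loss averages -log sigmoid(margin); a few pairs with large negative
    # margins can push this mean up even while most pairs are ranked correctly.
    loss = -F.logsigmoid(margin).mean()
    # Reward accuracy: fraction of pairs where the better response scores higher.
    accuracy = (margin > 0).float().mean()
    return loss, accuracy, better_reward.mean(), worse_reward.mean(), margin.mean()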
Below is my training script. I am using 8 A100 GPUs for training with a batch size of 4 per GPU (unmodified):
# You can replace it with a local model path
MODEL_NAME_OR_PATH="pretrained_models/llava-hf/llava-1.5-7b-hf"
# You can replace it with a local dataset path
TRAIN_DATASETS="pretrained_models/openbmb/RLAIF-V-Dataset"
# You can replace it with a new path with correct permission
OUTPUT_DIR="./output/dpo"
# For wandb online logging
export WANDB_API_KEY=""
# Source the setup script
source ./scripts/setup.sh
# Execute deepspeed command
deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text.dpo \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--train_template RLAIFV \
--train_split train \
--freeze_mm_proj True \
--freeze_vision_tower False \
--output_dir ${OUTPUT_DIR}
Is there something wrong here, or is it normal for the loss to increase initially? I really appreciate your help.
Hey! I think it might be an issue with the hyperparameter settings. Could you try this set of hyperparameters?
# Freeze the multi modal projection layer
freeze_mm_proj: True
# Freeze the vision tower model
freeze_vision_tower: True
# Freeze the language model
freeze_language_model: False
According to our experiments, the training seems to be running quite well! We will provide a stable version as soon as possible.
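For context, below is a rough sketch of how these flags typically translate into freezing parameters on the Hugging Face LLaVA model. The module attribute names (vision_tower, multi_modal_projector, language_model) come from transformers' LlavaForConditionalGeneration, while the helper function itself is hypothetical rather than the trainer's actual code:
# Hypothetical helper showing one way the freeze flags can be applied; the module
# names follow transformers' LlavaForConditionalGeneration, but this is not the
# align-anything trainer code.
from transformers import LlavaForConditionalGeneration

def apply_freeze_flags(model, freeze_mm_proj=True,
                       freeze_vision_tower=True, freeze_language_model=False):
    model.multi_modal_projector.requires_grad_(not freeze_mm_proj)
    model.vision_tower.requires_grad_(not freeze_vision_tower)
    model.language_model.requires_grad_(not freeze_language_model)
    return model

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = apply_freeze_flags(model)  # with these defaults, only the language model is trained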
Thank you for your response. I successfully achieved nearly 100% accuracy in DPO by freezing the vision model and projector layer. However, I'm still curious whether it is necessary to freeze the visual part in MLLM RLHF.
I attempted to freeze the visual part when training the reward model for PPO, but the training seems to have failed.
deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text.rm \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--eval_datasets ${EVAL_DATASETS} \
--output_dir ${OUTPUT_DIR} \
--freeze_mm_proj True \
--freeze_vision_tower True \
--freeze_language_model False \
--train_split train \
--eval_split train \
--train_template RLAIFV \
--eval_template RLAIFV
Is there anything wrong? Thank you very much for your help!
Sorry for the late response! We have spent considerable effort and discovered a set of hyperparameters that can effectively improve the results:
# Freeze the multi modal projection layer
freeze_mm_proj: True
# Freeze the vision tower model
freeze_vision_tower: True
# Freeze the language model
freeze_language_model: False
In fact, due to the complexity of multimodal information, reward models that take visual inputs are often difficult to train. Their accuracy is often lower than in text-only settings, which is also confirmed in the issue here.
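For reference, pairwise reward models are commonly trained with a Bradley-Terry style ranking loss, and the reported accuracy is the fraction of pairs where the preferred response receives the higher scalar reward, so values hovering around 0.5 indicate chance-level ranking. The sketch below is a generic formulation (function and argument names are illustrative), not necessarily the align-anything rm trainer:
# Generic sketch of a pairwise (Bradley-Terry) reward-model loss and its accuracy
# metric; names are illustrative, not the align-anything implementation. Inputs are
# the scalar rewards assigned to the preferred and dispreferred responses.
import torch.nn.functional as F

def rm_loss_and_accuracy(better_scores, worse_scores):
    margin = better_scores - worse_scores
    loss = -F.logsigmoid(margin).mean()     # push better_scores above worse_scores
    accuracy = (margin > 0).float().mean()  # chance level is about 0.5
    return loss, accuracy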
We will continue to explore better hyperparameter settings, and we welcome any assistance from you and the community!
Due to the lack of response for an extended period, we are temporarily closing this issue. Feel free to reopen it at any time.