PKU-Alignment/align-anything

[Question] LLaVA DPO training loss increases

Closed this issue · 4 comments

Questions

Hi everyone,

When I tried using the DPO algorithm to train LLaVA, I observed an abnormal increase in loss, but the reward accuracy was slightly above 0.5 after training and seemed to be continuously increasing. The reward margin was fluctuating upwards, with the better sample reward and worse sample reward oscillating with the same period. My biggest confusion is why the loss keeps increasing. Is there something wrong? I haven't changed any training code.

Loss figure
image

Reward figure
image

Reward margin figure
image

Better/Worse sample reward figure
image

Below is my training script. I am using 8 A100 GPUs for training with a batch size of 4 per GPU (unmodified):

# You can replace it with a local model path
MODEL_NAME_OR_PATH="pretrained_models/llava-hf/llava-1.5-7b-hf"
# You can replace it with a local dataset path
TRAIN_DATASETS="pretrained_models/openbmb/RLAIF-V-Dataset"
# You can replace it with a new path with correct permission
OUTPUT_DIR="./output/dpo"
# For wandb online logging
export WANDB_API_KEY=""

# Source the setup script
source ./scripts/setup.sh

# Execute deepspeed command
deepspeed \
    --master_port ${MASTER_PORT} \
    --module align_anything.trainers.text_image_to_text.dpo \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_datasets ${TRAIN_DATASETS} \
    --train_template RLAIFV \
    --train_split train \
    --freeze_mm_proj True \
    --freeze_vision_tower False \
    --output_dir ${OUTPUT_DIR}

Is there something wrong here, or is it normal for the loss to increase initially? I really appreciate your help.
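For readers following the plots above, here is a minimal sketch of how the logged quantities are usually derived in DPO. It is not align-anything's code: the function name `dpo_metrics`, its arguments, and the toy numbers are all assumptions made for illustration. It shows one way the mean loss can rise while reward accuracy stays above 0.5 and the mean margin grows: a few pairs with a large negative margin dominate the average. The thread does not establish that this is what happened in the run above.

```python
import math


def dpo_metrics(policy_better, policy_worse, ref_better, ref_worse, beta=0.1):
    """Return (mean loss, reward accuracy, mean reward margin) for one batch.

    Inputs are per-sample sequence log-probabilities (hypothetical values).
    """
    losses, margins, correct = [], [], 0
    for pb, pw, rb, rw in zip(policy_better, policy_worse, ref_better, ref_worse):
        better_reward = beta * (pb - rb)  # implicit reward of the preferred answer
        worse_reward = beta * (pw - rw)   # implicit reward of the rejected answer
        margin = better_reward - worse_reward
        # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
        margins.append(margin)
        correct += margin > 0
    n = len(losses)
    return sum(losses) / n, correct / n, sum(margins) / n


ref = [0.0, 0.0, 0.0, 0.0]

# Step 0: policy equals the reference, so every margin is 0 and the loss is log(2).
loss0, acc0, margin0 = dpo_metrics(ref, ref, ref, ref, beta=1.0)

# Later step: three pairs are ranked correctly, one is badly wrong.
loss1, acc1, margin1 = dpo_metrics(
    policy_better=[5.0, 5.0, 5.0, -10.0],
    policy_worse=[-5.0, -5.0, -5.0, 10.0],
    ref_better=ref,
    ref_worse=ref,
    beta=1.0,
)

print(f"step 0: loss={loss0:.4f} acc={acc0:.2f} margin={margin0:.2f}")
print(f"later : loss={loss1:.4f} acc={acc1:.2f} margin={margin1:.2f}")
```

In this toy batch the accuracy and the mean margin both go up, yet the mean loss ends higher than at step 0, so it can help to look at the distribution of per-sample margins and not only their mean.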

Hey! I think it might be an issue with the hyperparameter settings. Could you try this set of hyperparameters?

 # Freeze the multimodal projection layer
 freeze_mm_proj: True
 # Freeze the vision tower model
 freeze_vision_tower: True
 # Freeze the language model
 freeze_language_model: False

According to our experiments, the training seems to be running quite well! We will provide a stable version as soon as possible.

image
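As a rough illustration of what freeze flags like these typically amount to in PyTorch, the sketch below turns off gradients for the frozen sub-modules and leaves the language model trainable. It is not taken from align-anything: `ToyLlava`, `apply_freeze_flags`, and `trainable_parameter_names` are hypothetical, and the sub-module names (`vision_tower`, `multi_modal_projector`, `language_model`) are an assumption about how the model is laid out.

```python
import torch.nn as nn


class ToyLlava(nn.Module):
    """Tiny stand-in with three sub-modules; names and sizes are illustrative."""

    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)
        self.multi_modal_projector = nn.Linear(8, 4)
        self.language_model = nn.Linear(4, 4)


def apply_freeze_flags(
    model,
    freeze_mm_proj=True,
    freeze_vision_tower=True,
    freeze_language_model=False,
):
    """Disable gradients for every sub-module whose freeze flag is True."""
    flags = {
        "multi_modal_projector": freeze_mm_proj,
        "vision_tower": freeze_vision_tower,
        "language_model": freeze_language_model,
    }
    for name, frozen in flags.items():
        # Module.requires_grad_ sets requires_grad on all parameters of the sub-module.
        getattr(model, name).requires_grad_(not frozen)
    return model


def trainable_parameter_names(model):
    """List the parameters that will still receive gradient updates."""
    return [name for name, param in model.named_parameters() if param.requires_grad]


model = apply_freeze_flags(ToyLlava())
print(trainable_parameter_names(model))
```

Printing the trainable parameter names before launching a run is a quick way to confirm that the flags had the intended effect.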

Thank you for your response. After freezing the vision model and the projector layer, DPO training reached nearly 100% reward accuracy. However, I'm still curious whether it is necessary to freeze the visual part in MLLM RLHF.

I attempted to freeze the visual part when training the PPO reward model, but it seems to have failed.

image

image

deepspeed \
    --master_port ${MASTER_PORT} \
    --module align_anything.trainers.text_image_to_text.rm \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_datasets ${TRAIN_DATASETS} \
    --eval_datasets ${EVAL_DATASETS} \
    --output_dir ${OUTPUT_DIR} \
    --freeze_mm_proj True \
    --freeze_vision_tower True \
    --freeze_language_model False \
    --train_split train \
    --eval_split train \
    --train_template RLAIFV \
    --eval_template RLAIFV

Is there anything wrong? Thank you very much for your help!

Sorry for the late response! We have spent considerable effort and discovered a set of hyperparameters that can effectively improve the results:

 # Freeze the multimodal projection layer
 freeze_mm_proj: True
 # Freeze the vision tower model
 freeze_vision_tower: True
 # Freeze the language model
 freeze_language_model: False

image

In fact, because multimodal information is complex, reward models (RMs) that take visual inputs are often difficult to train. Their accuracy is often lower than that of text-only reward models, which is also confirmed in the issue here.

We will continue to explore better hyperparameter settings, and we welcome any assistance from you and the community!

Due to the lack of response for an extended period, we are temporarily closing this issue. Feel free to reopen it at any time.