inf values occur during the forward pass when fine-tuning the VL branch with LLaVA-150K + MiniGPT4-3.5K + webvid-instruct
xuboshen opened this issue · 1 comments
Great work! I've run into a problem and hope someone has ideas.
When I fine-tune only the VL branch with LLaMA-2 on image/video instruction data, inf values occur, and torch.max(hidden_states) and torch.min(hidden_states) keep growing in magnitude.
Several attempts have been made:
- I have already checked the issue lists.
- I have consulted the Hugging Face forum and searched Google.
Preparations:
My platform: 8*A6000 48G. The environment is set up exactly following the environment.yml in this repository.
The data is prepared following LLaVA (coco), WebVid-10M and MiniGPT-4.
7B LLaMA-2 Pretrained weights are from this repo as well.
The demo runs correctly on the remote platform, and the training process seems correct. I did not modify any code.
Problem
I found that some data samples produce 'inf' values at the last layer of LLaMA-2, that is, at decoder layer index 31 in the loop over decoder layers. The error does not occur immediately. Instead, torch.max(hidden_states) grows larger and larger and torch.min(hidden_states) grows more and more negative until inf appears.
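For anyone who wants to reproduce the measurement, here is a minimal sketch of how the per-layer range of hidden_states can be recorded with forward hooks instead of print statements inside the model code. The helper names `attach_range_hooks` and `first_non_finite` and the toy `ScaleLayer` are hypothetical and only stand in for the real decoder layers, so this is an illustration under those assumptions rather than the code used in this repository.

```python
import torch
import torch.nn as nn


def attach_range_hooks(layers, stats):
    """Register a forward hook on each layer that records its output range."""
    handles = []
    for idx, layer in enumerate(layers):
        def hook(module, inputs, output, idx=idx):
            # Some decoder layer implementations return a tuple whose first
            # element is the hidden states; others return the tensor directly.
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden.detach().float()
            stats.append({
                "layer": idx,
                "max": hidden.max().item(),
                "min": hidden.min().item(),
                "finite": bool(torch.isfinite(hidden).all()),
            })
        handles.append(layer.register_forward_hook(hook))
    return handles


def first_non_finite(stats):
    """Return the index of the first layer whose output has inf/nan, else None."""
    for entry in stats:
        if not entry["finite"]:
            return entry["layer"]
    return None


class ScaleLayer(nn.Module):
    """Hypothetical toy layer that multiplies its input, to force an overflow."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x * self.factor


# Toy stand-in for the stack of decoder layers (assumption, not the real model).
toy_layers = nn.ModuleList([ScaleLayer(1e13) for _ in range(4)])
stats = []
handles = attach_range_hooks(toy_layers, stats)

x = torch.ones(2, 3)  # float32 input
for layer in toy_layers:
    x = layer(x)

for handle in handles:
    handle.remove()  # always detach hooks after the diagnostic run

for entry in stats:
    print(entry)
print("first non-finite layer:", first_non_finite(stats))
```

With a Hugging Face `LlamaForCausalLM`, the decoder layers are typically reachable as `model.model.layers`, but the attribute path inside this repository's wrapper may differ, so treat that as an assumption to verify before attaching the hooks.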
Does anyone have an idea why this problem occurs and how to solve it? Thank you in advance for your time and help.
I also tried setting batchsize=1, and training proceeds as expected, while batchsize=4 produces inf values and training fails.
Could anyone explain this phenomenon?