RupertLuo/Valley

Gradient issue

TonyXuQAQ opened this issue · 9 comments

Hi, after going through the training code, it seems that the gradient is not properly backpropagated: all calls to the projector layer mm_projector appear to be made inside torch.no_grad (i.e., call_1, call_2). If so, the projector layer is not trained at all, right? Is this a typo in the released code or an error?

Can you share the error output and training configuration file?

There is no error. I just used the raw code of this repo. I mean that the projector layer mm_projector does not seem to be trained properly in valley/model/valley.py: every call to mm_projector is wrapped in torch.no_grad, so the projector will not be trained, since gradients are blocked inside torch.no_grad.
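As a minimal sketch of the concern (not the Valley code itself), a module whose forward pass runs inside torch.no_grad never receives gradients, while modules called outside the block do:

```python
import torch
import torch.nn as nn

projector = nn.Linear(8, 8)  # stand-in for mm_projector
head = nn.Linear(8, 1)       # stand-in for the rest of the model
x = torch.randn(4, 8)

# Projector called inside no_grad: no autograd graph is recorded for it.
with torch.no_grad():
    feats = projector(x)
head(feats).sum().backward()
print(projector.weight.grad)              # None -> projector is never updated
print(head.weight.grad is not None)       # True -> only the head gets gradients

# Projector called outside no_grad: gradients flow through it normally.
head.zero_grad(set_to_none=True)
head(projector(x)).sum().backward()
print(projector.weight.grad is not None)  # True
```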

[screenshot of train.py] In the file train.py, you can set whether the projector needs to be updated.

But the projector calls are wrapped inside torch.no_grad, so gradients cannot pass through the projector, i.e., the projector is not trained. And this layer is not used anywhere else. I wonder how you trained this projector.
[screenshot of the torch.no_grad block in valley/model/valley.py]


@TonyXuQAQ I find that the projector is not wrapped inside torch.no_grad in the original code of this repo, as shown in
https://github.com/RupertLuo/Valley/blob/8da73a9551cd9ce520c47f7c3f508fdfc387f4f8/valley/model/valley.py.
I guess the "bug" was introduced when the code was reorganized, and the projector should be outside the torch.no_grad block, since the released models were trained with the projector being tuned.
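For reference, a minimal sketch of the structure described above (illustrative class and attribute names, not the repo's exact code): only the frozen vision tower stays inside torch.no_grad, and the projector is called outside it so that its weights receive gradients.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Illustrative stand-in for the vision tower + mm_projector wiring."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(32, 32)    # frozen feature extractor
        self.vision_tower.requires_grad_(False)
        self.mm_projector = nn.Linear(32, 16)    # trainable projector

    def forward(self, images):
        with torch.no_grad():
            # Frozen vision features: no gradients are needed here.
            feats = self.vision_tower(images)
        # Projector called OUTSIDE no_grad so that its weights get gradients.
        return self.mm_projector(feats)
```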

Thanks for the information.

During finetuning, I also noticed that the current version of the code cannot load VideoChat-instruct-11K properly. LLaVA-instruct-150K's labels are organized as {"human": ..., "gpt": ...}, but VideoChat-instruct-11K's labels are organized as {"q": ..., "a": ...}. The two datasets have different label formats, yet the code does not convert between them. I guess the label pre-processing code is missing.

I don't know why, but when I finetuned Valley on the above two datasets starting from your llama-2-pretrain weights, the results were very bad. I will refer to the early commits of this repo for debugging.

So may I know which commit was used to train the provided valley-2-7b? I just want to reproduce the performance of the provided checkpoints.

LLaVA-instruct-150K should load correctly. For VideoChat-instruct-11K, you need to convert its format to the LLaVA-instruct-150K format.
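A rough sketch of such a conversion, assuming the field names mentioned in this thread ("q"/"a" for VideoChat-instruct-11K and a LLaVA-style "conversations" list); the exact keys and file names should be checked against the actual JSON files:

```python
import json

def videochat_to_llava(sample):
    """Convert one {"q": ..., "a": ...} record into a LLaVA-style record."""
    return {
        "id": sample.get("id"),
        "video": sample.get("video"),
        "conversations": [
            {"from": "human", "value": sample["q"]},
            {"from": "gpt", "value": sample["a"]},
        ],
    }

if __name__ == "__main__":
    with open("videochat_instruct_11k.json") as f:
        data = json.load(f)
    with open("videochat_instruct_11k_llava_format.json", "w") as f:
        json.dump([videochat_to_llava(s) for s in data], f, indent=2)
```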

Thank you for your continued attention to this project. I will sync the repository to the version of the code that trains correctly as soon as possible.