Gradient issue
TonyXuQAQ opened this issue · 9 comments
Hi, after going through the training code, it seems that the gradient is not properly backpropagated. All calls to the projector layer mm_projector appear to be made within torch.no_grad (i.e., call_1, call_2). If so, the projector layer is not trained at all, right? Is this a typo in the released code or an error?
Can you share the error output and training configuration file?
There is no error. I just used the raw code of this repo. I mean, the projector layer mm_projector does not seem to be trained properly in valley/model/valley.py. All calls to mm_projector are wrapped in torch.no_grad, so the projector will not be trained, since the gradient is blocked within torch.no_grad.
But the projector calls are wrapped inside torch.no_grad, so the gradient cannot pass through the projector, i.e., the projector is not trained. And you did not use this layer anywhere else. I wonder how you trained this projector.
@TonyXuQAQ I found that the projector is not wrapped inside torch.no_grad in the original code of this repo: https://github.com/RupertLuo/Valley/blob/8da73a9551cd9ce520c47f7c3f508fdfc387f4f8/valley/model/valley.py.
I guess the "bug" was introduced when the code was reorganized. The projector should be outside torch.no_grad, since the released models were trained with the projector being tuned.
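Below is a minimal sketch of the pattern being described, using stand-in modules rather than the actual Valley classes: the frozen vision encoder stays under torch.no_grad(), while mm_projector is applied outside of it so its parameters still receive gradients.

```python
# Sketch of the intended structure (module names are stand-ins, not the repo's):
# frozen vision tower under no_grad, trainable projector outside it.
import torch
import torch.nn as nn

vision_tower = nn.Linear(32, 16)   # stand-in for the frozen vision encoder
mm_projector = nn.Linear(16, 16)   # stand-in for the trainable projector
for p in vision_tower.parameters():
    p.requires_grad_(False)

frames = torch.randn(4, 32)

with torch.no_grad():
    image_features = vision_tower(frames)      # no graph needed for frozen weights

image_features = mm_projector(image_features)  # outside no_grad: gradients flow

loss = image_features.sum()
loss.backward()
print(mm_projector.weight.grad is not None)    # True: the projector gets updated
```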
Thanks for the information.
During finetuning, I also noticed that the current version of the code cannot load VideoChat-instruct-11K correctly. LLaVA-instruct-150K's labels are organized as {'human': ..., 'gpt': ...}, while VideoChat-instruct-11K's labels are organized as {'q': ..., 'a': ...}. The two datasets have different label formats, but the code does not convert between them. I guess the label pre-processing code is missing.
I'm not sure why, but starting from your llama-2-pretrain weights, I finetuned Valley on the above two datasets and the results are very bad. I will refer to earlier commits of this repo for debugging.
So may I know which commit was used to train the provided valley-2-7b? I just want to reproduce the performance of the provided checkpoints.
LLaVA-instruct-150K should load correctly. For VideoChat-instruct-11K, you need to convert its format to that of LLaVA-instruct-150K.
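As a starting point, a rough conversion sketch is shown below; the exact field names ("QA", "video", "conversations", "from"/"value") are assumptions based on how the two formats are described in this thread, not verified against the repo's loader.

```python
# Rough sketch: convert VideoChat-instruct-11K style {'q': ..., 'a': ...} pairs
# into a LLaVA-instruct-150K style "conversations" list with alternating
# "human"/"gpt" turns. Field names and file paths here are assumptions.
import json

def videochat_to_llava(sample):
    conversations = []
    for turn in sample["QA"]:                      # assumed key holding the QA pairs
        conversations.append({"from": "human", "value": turn["q"]})
        conversations.append({"from": "gpt", "value": turn["a"]})
    return {"video": sample.get("video"), "conversations": conversations}

with open("videochat_instruct_11k.json") as f:     # hypothetical input path
    data = json.load(f)

converted = [videochat_to_llava(s) for s in data]

with open("videochat_instruct_11k_llava_format.json", "w") as f:
    json.dump(converted, f, indent=2, ensure_ascii=False)
```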
Thank you for your continued attention to this project. I will update the repo to a version of the code that trains correctly as soon as possible.