ByungKwanLee/MoAI

The training process detail

lucasjinreal opened this issue · 23 comments

Hi, did you first train the projector and then train the projector + LLM? What are the details of each stage?

Oh! Before using the external CV models, we briefly trained only the projector with about 1 percent of the total batches, and then we froze it. I don't think this choice affects performance much, which is also evaluated in Apple's MM1 paper.

We will add this minor training procedure to the manuscript. Thanks a lot!
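
For reference, here is a minimal PyTorch sketch of the schedule described above (warm up only the projector for ~1% of the batches, then freeze it). The module names `projector` and `moai_mixer` are illustrative, not the actual MoAI attribute names.

```python
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def train(model, dataloader, optimizer, total_steps):
    warmup_steps = max(1, int(0.01 * total_steps))  # ~1% of the total batches
    set_trainable(model.projector, True)            # stage 0: projector only
    set_trainable(model.moai_mixer, False)

    for step, batch in enumerate(dataloader):
        if step == warmup_steps:
            set_trainable(model.projector, False)   # freeze the projector from here on
            set_trainable(model.moai_mixer, True)   # train the MoAI-Mixer modules instead
        loss = model(**batch).loss                  # assumes an HF-style output with .loss
        loss.backward()
        optimizer.step()                            # frozen params receive no gradients
        optimizer.zero_grad()
```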

So you first train the projector only, then freeze the projector and train the ViT and LLM? Why freeze the projector in the second stage?

This is because I did not observe any benefit from training the projector in the second stage, at least on our model. Training the MoAI-Mixer modules seemed more effective.

Actually, there is also no particular reason not to train the vision encoder of MoAI. LLaVA-1.6 (Microsoft) and MM1 (Apple) trained their vision encoders, but MoAI did not adopt vision encoder training or projector training.

Most of the time, unfreezing the vision encoder for training gives worse results.
Do you think you could get better results if you trained them all?

I am not convinced that training the vision encoder always gives worse results. I think it depends on the model or training setup. Recently, I have read more papers showing that training the vision encoder leads to performance gains.

I would recommend reading Figure 10(c) of the MM1 paper [link].

In addition, the LLaVA-1.6 blog [link] describes full-model training in the second stage.

Thanks for the great discussion of the training details!

The MM1 paper claims that an unfrozen ViT only works better than a frozen one when there are many image tokens.

It might depend on the LLM and projector design.

Yes, recent papers have used lots of image tokens through dynamic image resolution. As I said, it depends on the model design. It makes sense that more image features can represent richer information for VL tasks.
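
As a side note, comparing the frozen and unfrozen setups in PyTorch only requires toggling `requires_grad` on the vision encoder. The attribute name `model.vision_encoder` below is a placeholder, and MoAI itself keeps the encoder frozen.

```python
def set_vision_encoder_trainable(model, trainable: bool) -> None:
    # Placeholder attribute name; only for comparing frozen vs. unfrozen setups.
    for param in model.vision_encoder.parameters():
        param.requires_grad = trainable
```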

Sorry to ask a maybe naive question. What does the "projector" you mentioned mean? Is this MLP the projector? "Two linear layers with GELU activation function serve as the bridge connector between vision and language components, denoted by 'MLP'"

The projector means a few MLP layers whose role is to act as a bridge connector from the vision encoder to the backbone multimodal LLM.
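
In other words, something along the lines of the following sketch, matching the paper quote above (two linear layers with a GELU in between); the dimensions are illustrative.

```python
import torch.nn as nn

class Projector(nn.Module):
    """Bridge connector from vision-encoder features to the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # (batch, num_image_tokens, vision_dim) -> (batch, num_image_tokens, llm_dim)
        return self.mlp(vision_features)
```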

Thanks for your reply. I noticed that you used QLoRA in training. Did you use QLoRA when training the projector, or only in the following steps?

There is no reason to quantize the projector. Thus, QLoRA was used for the backbone LLM only.
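
A hedged sketch of that setup with Hugging Face transformers/peft: 4-bit quantization and LoRA adapters only on the backbone LLM, while the projector stays in full precision. The target module names are typical for InternLM2 and may differ from the actual MoAI training code.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
llm = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-7b",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
llm = prepare_model_for_kbit_training(llm)
llm = get_peft_model(
    llm,
    LoraConfig(r=64, lora_alpha=16, target_modules=["wqkv", "wo"], task_type="CAUSAL_LM"),
)

# The projector is a small full-precision module and is not quantized.
projector = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)
```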

Thanks very much. Sorry to bother you again. Another question: did you use InternLM-7B, InternLM2-7B, or InternLM2-base-7B as the base model? And did you base the training code on "InternLM/lmdeploy"?

We used https://huggingface.co/internlm/internlm2-7b, the one with the highest likes.

Thanks very much.

Sorry to bother you again. I noticed that you wrote that all training is based on LLaVA-Instruct-665K filtered by ShareGPT4V. When you train the vision projector, you mentioned 1/10 of the dataset; are those samples randomly selected from LLaVA-Instruct-665K? Why didn't you use the LLaVA pretraining dataset, LCS-558K?

Yes, we used randomly selected data through the PyTorch dataloader. It does not affect performance much. LLaVA trained only the projector in the pretraining stage with the pretraining data.

The MM1 (Apple) paper has also shown that no matter which type of projector we choose, there is no performance difference.

By combining these results, we can conclude that the pretraining dataset is not necessary, since LLaVA only trained the projector with the pretraining dataset.

Technically, I also made the observation that there is no need to pretrain in terms of performance; the reason may be that the only difference between the pretraining and instruction datasets is the answer length. Both consist of instruction samples.

Instead, I have found that the most important factor for improving performance is not the number of data samples alone (assuming we already have enough samples; it is easy to misread this as a small amount of data being sufficient) but the injected external knowledge.
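
For the random subset, a plain PyTorch sketch like the one below is enough; `full_dataset`, `collate_fn`, and the fraction are placeholders for the actual setup.

```python
import torch
from torch.utils.data import DataLoader, Subset

def make_warmup_loader(full_dataset, collate_fn, fraction=0.1, batch_size=8):
    # Draw a random slice of the instruction data for the projector warm-up.
    n = int(fraction * len(full_dataset))
    indices = torch.randperm(len(full_dataset))[:n].tolist()
    return DataLoader(
        Subset(full_dataset, indices),
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_fn,
    )
```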

Sorry to bother you again. I have some other questions about the training. How many GPUs did you use during training, and how long did each of the two stages take?

We used approximately 665K training samples, and each training stage took two or three days on 6 x A6000 GPUs.

Dear author, thanks a lot for your help! I have another question.
In the inference code, you build the prompt with:
"prompt = " [UNUSED_TOKEN_146]user\n" + prompt + "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n""
This seems a little different from LLaVA. In training, was the prompt processed in the same way as in inference?

The input prompt is important for instruction tuning, and it really depends on the language model. However, the prompt at inference time might not be that sensitive for generating the desired answers.
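
Restating the inference prompt construction quoted above as a small helper ([UNUSED_TOKEN_146]/[UNUSED_TOKEN_145] are the InternLM2 turn markers used in the released inference code):

```python
def build_inference_prompt(user_prompt: str) -> str:
    # Same string as in the released inference code, including the leading space.
    return (
        " [UNUSED_TOKEN_146]user\n"
        + user_prompt
        + "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
    )
```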

Thanks a lot for your kind reply! That's very useful.
So, in training, you changed the system prompt to "AI assistant should give helpful and detailed answers to user after fully understanding an image." and kept the rest of the conversation settings the same as in LLaVA?

Yes. In my experience, however, the content of the system prompt did not affect performance much. It is just a format.
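
For illustration, a training conversation might then be serialized roughly as follows, using the system prompt quoted above and the same turn markers; the exact training-time template is an assumption here.

```python
SYSTEM_PROMPT = (
    "AI assistant should give helpful and detailed answers to user "
    "after fully understanding an image."
)

def build_training_text(user_turn: str, assistant_turn: str) -> str:
    # Assumed serialization; the real training code may format turns differently.
    return (
        "[UNUSED_TOKEN_146]system\n" + SYSTEM_PROMPT + "[UNUSED_TOKEN_145]\n"
        "[UNUSED_TOKEN_146]user\n" + user_turn + "[UNUSED_TOKEN_145]\n"
        "[UNUSED_TOKEN_146]assistant\n" + assistant_turn + "[UNUSED_TOKEN_145]\n"
    )
```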

That is quite reasonable. I am having trouble reproducing the training code.
In inference, the CV models seem not so fast; it takes several seconds to run all the CV models for one image. How did you reach a high speed in training?
By the way, is your training code based on LLaVA, InternLM, or InternLM-XComposer?