ByungKwanLee/MoAI

The training process detail

lucasjinreal opened this issue · 23 comments

Hi, did you first train the projector and then train the projector + LLM? What are the details of each stage?

Oh! Before using the external CV models, we briefly trained only the projector with about 1 percent of the total batches, and then we froze it. I don't think this choice affects performance much, which is also evaluated in Apple's MM1 paper.

We will add this minor training procedure to the manuscript. Thanks a lot!
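
For reference, here is a minimal PyTorch sketch of the schedule described above (warm up only the projector for ~1% of the batches, then freeze it). The module names `projector` and `moai_mixer` are illustrative, not the actual MoAI attribute names.

```python
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def train(model, dataloader, optimizer, total_steps):
    warmup_steps = max(1, int(0.01 * total_steps))  # ~1% of the total batches
    set_trainable(model.projector, True)            # stage 0: projector only
    set_trainable(model.moai_mixer, False)

    for step, batch in enumerate(dataloader):
        if step == warmup_steps:
            set_trainable(model.projector, False)   # freeze the projector from here on
            set_trainable(model.moai_mixer, True)   # train the MoAI-Mixer modules instead
        loss = model(**batch).loss                  # assumes an HF-style output with .loss
        loss.backward()
        optimizer.step()                            # frozen params receive no gradients
        optimizer.zero_grad()
```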

So you first train the projector only, then freeze the projector and train the ViT and LLM? Why freeze the projector in the second stage?

This is because I did not observe any benefit from training the projector in the second stage, at least on our model. Training the MoAI-Mixer modules seemed more effective.

Actually, there is also no particular reason not to train the vision encoder of MoAI. LLaVA-1.6 (Microsoft) and MM1 (Apple) trained their vision encoders, but MoAI did not adopt vision encoder training or projector training.

Most of the time, unfreezing the vision encoder for training gives worse results.
Do you think you could get better results if you trained them all?

I am not convinced that training the vision encoder always gives worse results. I think it depends on the model or training setup. Recently, I have read more papers showing that training the vision encoder leads to performance gains.

I would recommend reading Figure 10(c) of the MM1 paper [link].

In addition, the LLaVA-1.6 blog [link] describes full-model training in the second stage.

Thanks for the great discussion of the training details!

The MM1 paper claims that an unfrozen ViT only works better than a frozen one when there are many image tokens.

It might depend on the LLM and projector design.

Yes, recent papers have used lots of image tokens through dynamic image resolution. As I said, it depends on the model design. It makes sense that more image features can represent richer information for VL tasks.
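
As a side note, comparing the frozen and unfrozen setups in PyTorch only requires toggling `requires_grad` on the vision encoder. The attribute name `model.vision_encoder` below is a placeholder, and MoAI itself keeps the encoder frozen.

```python
def set_vision_encoder_trainable(model, trainable: bool) -> None:
    # Placeholder attribute name; only for comparing frozen vs. unfrozen setups.
    for param in model.vision_encoder.parameters():
        param.requires_grad = trainable
```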

Sorry to ask a maybe naive question. What does the "projector" you mentioned mean? Is this MLP the projector? "Two linear layers with GELU activation function serve as the bridge connector between vision and language components, denoted by 'MLP'"

The projector means a few MLP layers whose role is to act as a bridge connector from the vision encoder to the backbone multimodal LLM.
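
In other words, something along the lines of the following sketch, matching the paper quote above (two linear layers with a GELU in between); the dimensions are illustrative.

```python
import torch.nn as nn

class Projector(nn.Module):
    """Bridge connector from vision-encoder features to the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # (batch, num_image_tokens, vision_dim) -> (batch, num_image_tokens, llm_dim)
        return self.mlp(vision_features)
```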

Thanks for your reply. I noticed that you used QLoRA in training. Did you use QLoRA when training the projector, or only in the following steps?

There is no reason to quantize the projector. Thus, QLoRA was used for the backbone LLM only.
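
A hedged sketch of that setup with Hugging Face transformers/peft: 4-bit quantization and LoRA adapters only on the backbone LLM, while the projector stays in full precision. The target module names are typical for InternLM2 and may differ from the actual MoAI training code.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
llm = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-7b",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
llm = prepare_model_for_kbit_training(llm)
llm = get_peft_model(
    llm,
    LoraConfig(r=64, lora_alpha=16, target_modules=["wqkv", "wo"], task_type="CAUSAL_LM"),
)

# The projector is a small full-precision module and is not quantized.
projector = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)
```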

Thanks very much. Sorry to bother you again. Another question: did you use InternLM-7B, InternLM2-7B, or InternLM2-base-7B as the base model? And did you base the training code on "InternLM/lmdeploy"?

We used https://huggingface.co/internlm/internlm2-7b, the one with the highest likes.

Thanks very much.

Sorry to bother you again. I noticed that you wrote that all training is based on LLaVA-Instruct-665K filtered by ShareGPT4V. When you train the vision projector, you mentioned 1/10 of the dataset; are those samples randomly selected from LLaVA-Instruct-665K? Why didn't you use the LLaVA pretraining dataset, LCS-558K?

Yes, we used randomly selected data through the PyTorch dataloader. It does not affect performance much. LLaVA trained only the projector in the pretraining stage with the pretraining data.

The MM1 (Apple) paper has also shown that no matter which type of projector we choose, there is no performance difference.

By combining these results, we can conclude that the pretraining dataset is not necessary, since LLaVA only trained the projector with the pretraining dataset.

Technically, I also made the observation that there is no need to pretrain in terms of performance; the reason may be that the only difference between the pretraining and instruction datasets is the answer length. Both consist of instruction samples.

Instead, I have found that the most important factor for improving performance is not the number of data samples alone (assuming we already have enough samples; it is easy to misread this as a small amount of data being sufficient) but the injected external knowledge.
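
For the random subset, a plain PyTorch sketch like the one below is enough; `full_dataset`, `collate_fn`, and the fraction are placeholders for the actual setup.

```python
import torch
from torch.utils.data import DataLoader, Subset

def make_warmup_loader(full_dataset, collate_fn, fraction=0.1, batch_size=8):
    # Draw a random slice of the instruction data for the projector warm-up.
    n = int(fraction * len(full_dataset))
    indices = torch.randperm(len(full_dataset))[:n].tolist()
    return DataLoader(
        Subset(full_dataset, indices),
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_fn,
    )
```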

Sorry to bother you again. I have some other questions about the training. How many GPUs did you use during training, and how long did each of the two stages take?

We used approximately 665K training samples, and each training stage took two or three days on 6 x A6000 GPUs.

Dear author, thanks a lot for your help! I have another question.
In the inference code, you build the prompt with:
"prompt = " [UNUSED_TOKEN_146]user\n" + prompt + "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n""
This seems a little different from LLaVA. In training, was the prompt processed in the same way as in inference?

The input prompt is important for instruction tuning, and it really depends on the language model. However, the prompt at inference time might not be that sensitive for generating the desired answers.
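
Restating the inference prompt construction quoted above as a small helper ([UNUSED_TOKEN_146]/[UNUSED_TOKEN_145] are the InternLM2 turn markers used in the released inference code):

```python
def build_inference_prompt(user_prompt: str) -> str:
    # Same string as in the released inference code, including the leading space.
    return (
        " [UNUSED_TOKEN_146]user\n"
        + user_prompt
        + "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
    )
```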

Thanks a lot for your kind reply! That's very useful.
So, in training, you changed the system prompt to "AI assistant should give helpful and detailed answers to user after fully understanding an image." and kept the rest of the conversation settings the same as in LLaVA?

Yes. In my experience, however, the content of the system prompt did not affect performance much. It is just a format.
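
For illustration, a training conversation might then be serialized roughly as follows, using the system prompt quoted above and the same turn markers; the exact training-time template is an assumption here.

```python
SYSTEM_PROMPT = (
    "AI assistant should give helpful and detailed answers to user "
    "after fully understanding an image."
)

def build_training_text(user_turn: str, assistant_turn: str) -> str:
    # Assumed serialization; the real training code may format turns differently.
    return (
        "[UNUSED_TOKEN_146]system\n" + SYSTEM_PROMPT + "[UNUSED_TOKEN_145]\n"
        "[UNUSED_TOKEN_146]user\n" + user_turn + "[UNUSED_TOKEN_145]\n"
        "[UNUSED_TOKEN_146]assistant\n" + assistant_turn + "[UNUSED_TOKEN_145]\n"
    )
```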

That is quite reasonable. I am having trouble reproducing the training code.
In inference, the CV models seem not so fast; it takes several seconds to run all the CV models for one image. How did you reach a high speed in training?
By the way, is your training code based on LLaVA, InternLM, or InternLM-XComposer?