kohjingyu/gill

How could this affect the performance?


Hi,
I'm training the model with Llama-2 as the frozen LLM, and I was wondering how exactly this part of the code affects training and performance. What is its purpose? That is, what changes depending on whether execution goes through the if branch or the else branch?

[screenshot: the if/else block that sets the tokenizer pad token]

Would really appreciate the help
Best

This code is there because most of these models don't have pad tokens (and training assumes a pad token exists for batching purposes). If your model already has a padding token, you shouldn't need to deal with this. It shouldn't affect performance.
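
For example, a minimal sketch of the usual workaround, assuming a Hugging Face tokenizer (the model name is just an illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
if tokenizer.pad_token is None:
    # Reuse the EOS token for padding so batched training has a valid pad id.
    tokenizer.pad_token = tokenizer.eos_token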

BTW, you might want to take note of the following detail if you are using LLaMA:

Another detail you might want to check: I think LLaMA-2 does not do weight tying (OPT does), so you will need to unfreeze the lm_head too if you are training with LLaMA. Otherwise I think the new [IMG] embeddings will not be meaningful. This will also cost you more GPU memory, however, since you need to keep optimizer states for more parameters.
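
If you want to double-check whether a given checkpoint ties its embeddings to the lm_head, here is a quick sketch using the Hugging Face tie_word_embeddings config field (model names are just examples):

from transformers import AutoConfig

opt_cfg = AutoConfig.from_pretrained("facebook/opt-6.7b")
llama_cfg = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# OPT should report True (lm_head shares weights with the input embeddings);
# LLaMA-2 checkpoints should report False (lm_head has its own weights).
print(opt_cfg.tie_word_embeddings)
print(llama_cfg.tie_word_embeddings)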

Thanks for the explanation.
Ah, okay, very good point; I didn't know that. So I guess the training configuration also needs to be updated to accommodate the newly unfrozen parameters?

Right, you'd probably have to add some lines here to zero out the non-IMG parts of the lm_head.
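
For what it's worth, a minimal sketch of that masking (img_token_ids is a placeholder for the vocabulary ids of the [IMG] tokens; this is illustrative, not the exact code in the repo):

import torch

def zero_non_img_lm_head_grads(lm_head: torch.nn.Linear, img_token_ids: torch.Tensor) -> None:
    """Zero the lm_head gradient rows of every token except the [IMG] tokens.

    Call this after loss.backward() and before optimizer.step().
    """
    grad = lm_head.weight.grad  # shape: (vocab_size, hidden_dim)
    if grad is None:
        return
    mask = torch.zeros_like(grad)
    mask[img_token_ids] = 1.0   # keep only the [IMG] rows
    grad.mul_(mask)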

Yeah, that makes sense. What would be the best way to unfreeze the lm_head?

Just out of curiosity: I didn't know about the weight tying in Llama-2 before, and I managed to train the model once with 'meta-llama/Llama-2-7b-chat-hf'. It was still generating images properly and seemed to work fine. Is it possible that this doesn't affect training very much?

That's interesting! I guess it could be possible if the [IMG] embedding is being trained on both sides (like the OPT version). This will probably change the behavior of LLaMA's text generation ability, though, and I'm not too sure how that interaction would play out.

"What would be the best way to unfreeze the lm_head?"

Probably something like:

# Make the output projection trainable so the new [IMG] rows of lm_head can be learned.
for param in self.lm.lm_head.parameters():
    param.requires_grad = True

in the model init.

Thanks a lot. And I think zeroing out the non-IMG parts of the lm_head using the same masking approach would do the job?


Also, regarding padding tokens: as long as the tokenizer's padding token and end-of-sequence token are the same (e.g. tokenizer.pad_token and tokenizer.eos_token are both </s>), the problem of the model not having a padding token is taken care of, right?
I think the reason you used another if statement to assign either the token itself or the token ID is due to specifics of the model mentioned in your code ('EleutherAI/gpt-j-6B'), right?
In the Llama-2 training I mentioned before, I only set the token IDs to be the same, not the tokens themselves. Should I retrain the model and make sure both the token ID and the token itself match?
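
For reference, the two variants discussed above look roughly like this (a sketch; the exact behavior depends on the tokenizer class):

# Variant 1: assign the token itself (the corresponding id follows from it).
tokenizer.pad_token = tokenizer.eos_token

# Variant 2: assign only the id, which is what the LLaMA-2 run described above did.
tokenizer.pad_token_id = tokenizer.eos_token_id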

I checked the param_count.txt output from the training runs I did on Llama-2 13B and Llama-2 7B, and it looks like the lm_head weights are being trained:

[screenshot: excerpt of param_count.txt listing the lm_head weights]

But can this confirm the model is being trained properly?
P.S. I'm using the Llama2-chat-hf variant of Llama-2.
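
One runtime sanity check (a generic sketch, not specific to this repo) is to list which parameters actually have requires_grad=True, which is roughly the information param_count.txt summarizes:

import torch

def print_trainable_parameters(model: torch.nn.Module) -> None:
    """Print the parameters that will receive gradient updates."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
            print(name, tuple(param.shape))
    print(f"trainable params: {trainable:,} / {total:,}")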

Yeah, that should work I think! If you're planning to follow the same training procedure as the paper (besides swapping OPT for LLaMA), you'd also want to zero out the [IMG] gradients like we did.

Will do for sure, thanks.
Also, could you tell me the best way to run one validation loop on a different dataset with a trained model? Say I have a model trained with your script and want to validate it on another dataset. Thanks!
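
For what it's worth, a bare-bones evaluation loop over another dataset could look like the sketch below (generic PyTorch; the dataset and the assumption that the model returns an object with a .loss attribute are placeholders, not part of this repo):

import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def run_validation(model, dataset, batch_size: int = 8, device: str = "cuda") -> float:
    """Run one pass over `dataset` and return the average loss."""
    model.eval()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    total_loss, num_batches = 0.0, 0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)  # assumes a Hugging Face-style output with a .loss field
        total_loss += outputs.loss.item()
        num_batches += 1
    return total_loss / max(num_batches, 1)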