The selection of LLM
Closed this issue · 4 comments
Thank you for your nice work! I want to ask why you choose dolly as the LLM, instead of some famous model like LLaMA2. Is there any other consideration?
Hello, @zhi-xuan-chen.
There are two main reasons for using dolly-v2-3b. First, LLaMA2 did not exist when the research began, and it was difficult to obtain a license for LLaMA1. Second, a model of an appropriate size for the time, the 3B variant, was available. Given our limited computing resources, we wanted to use the smallest LLM possible.
Thank you for your response. In addition, I want to ask why you trained the special token embedding from scratch instead of using the embedding from VQGAN for initialization.
There are two reasons for that design choice.
First, the (text) embedding dimension of the LLM is 2560, while the latent dimension of VQ-GAN is 256, so the VQ-GAN latents cannot be used directly for initialization.
Second, even if the dimensions matched, the already-trained text embedding space of the LLM and the latent space of VQ-GAN are not directly compatible, so we did not consider it appropriate to combine the two for initialization.
Of course, there may be a way to use the latent vectors of VQ-GAN for a better initialization of the image-token embedding region of the LLM. Thank you for the sharp observation.
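For anyone curious what such an initialization might look like, here is a minimal NumPy sketch of one possible approach: bridging the 256-d VQ-GAN latent space and the 2560-d LLM embedding space with a random linear projection. The codebook size (1024) and the random codebook itself are placeholders for illustration only; in practice you would load the trained VQ-GAN codebook weights, and whether this beats training the token embeddings from scratch is an open question.

```python
import numpy as np

# Dimensions from the discussion above: VQ-GAN latents are 256-d,
# the LLM (dolly-v2-3b) text embeddings are 2560-d.
LATENT_DIM = 256
EMBED_DIM = 2560
NUM_IMAGE_TOKENS = 1024  # assumed codebook size, for illustration

rng = np.random.default_rng(0)

# Stand-in for the trained VQ-GAN codebook (load real weights in practice).
codebook = rng.standard_normal((NUM_IMAGE_TOKENS, LATENT_DIM)).astype(np.float32)


def init_image_token_embeddings(codebook: np.ndarray, embed_dim: int) -> np.ndarray:
    """Project VQ-GAN codebook latents into the LLM embedding space.

    A fixed random linear map scaled by 1/sqrt(latent_dim) keeps the
    output vectors at roughly the same norm as the inputs, so the new
    image-token rows start at a scale comparable to the codebook's.
    """
    latent_dim = codebook.shape[1]
    proj = rng.standard_normal((latent_dim, embed_dim)).astype(np.float32)
    proj /= np.sqrt(latent_dim)
    return codebook @ proj


# The result could be copied into the embedding rows reserved for the
# new image tokens before fine-tuning.
image_embeds = init_image_token_embeddings(codebook, EMBED_DIM)
print(image_embeds.shape)  # (1024, 2560)
```

A random projection preserves pairwise distances between codebook vectors reasonably well, so nearby VQ-GAN codes start with nearby LLM embeddings, which is the only structure this scheme tries to carry over.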
OK, I get it. Thank you for your reply.