InstructBLIP and SEED Implementation
MichaelMaiii opened this issue · 2 comments
MichaelMaiii commented
Hi, I have checked the CLIP vision embedding (last hidden state) of BLIP-2 and InstructBLIP on Hugging Face (instructblip-vicuna-7b), and its dimension is 257x1408. However, the multi-modal matching space of ViT-Lens uses a 1x768 dimension. I wonder how to use InstructBLIP and SEED for text and image generation directly. Have they been fine-tuned?
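For reference, here is a minimal sketch of how the 257x1408 shape can be inspected with the transformers API; the `Salesforce/instructblip-vicuna-7b` checkpoint is the one mentioned above, and the local `example.jpg` path is just a placeholder:

```python
# Sketch: inspect the vision encoder's last hidden state for InstructBLIP.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16
)

image = Image.open("example.jpg")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch.float16)

with torch.no_grad():
    vision_out = model.vision_model(pixel_values)

# Expected: torch.Size([1, 257, 1408]) -- per-patch features from the ViT-g encoder,
# which differs from ViT-Lens's pooled 1x768 multi-modal embedding.
print(vision_out.last_hidden_state.shape)
```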
StanLei52 commented
MichaelMaiii commented
Thanks a lot. It seems that only 'vitlensL_processors' is available at the moment.
By the way, I noticed that SEED-LLaMA outperforms InstructBLIP on image captioning, so it might be more concise and better-performing to use SEED-LLaMA for both text and image generation.