AlignGPT-VL/AlignGPT

Questions about the Adaptive Alignment-based Instruction-tuning


Hello. Thanks for your excellent work!

I find your work very interesting and the motivation is very sound. The experimental results also prove that the work is effective. But I have some confusion about the model architecture.

I can clearly understand the architecture and strategy of the pre-training phase: based on the similarity between the image and the text, a corresponding alignment embedding is assigned and prepended to the input. Said another way, I can treat each alignment embedding as a task-related soft prompt (similar to P-Tuning), from weak to strong, representing tasks that require local features (e.g. image captioning) and tasks that require global features (e.g. VQA). But I have some confusion about the series of operations in the instruction-tuning phase.

  1. What is the physical significance of $H_I \otimes H_T$ in Equation (2) of the paper? $H_I$ is image_embeds and $H_T$ is the average embedding of the text, so it looks as if $H_I$ is simply scaled according to $H_T$. What is the practical meaning of the result? I cannot understand why this result, after the MLP and softmax, becomes a weight matrix $\alpha$: since $H_I$ consists of N_IMAGE_TOKEN tokens, each token is scaled by the same $H_T$ rather than being scaled differently for particular regions.
  2. I'm also confused by Equation (3). According to the pre-training phase, $H_{align}$ should be equivalent in meaning to an alignment embedding. It would be understandable if it were $H_N$ alone or $\sum_i \alpha_i H_i$ alone, but why is it a sum of the two, and does this differ from the pre-training phase?

I would appreciate it if you could help clear up my confusion.

To supplement: I asked this question after already looking at the code. I think the shapes of the input and output vectors of both formulas are reasonable, but I still don't understand why the vectors, after these operations, carry the physical meaning you claim in the paper.


Sorry for the confusion. Let me first describe the workflow of AlignGPT.

  • In the pre-training phase, instead of treating all image-text pairs equally, we assign different levels of alignment capability to different image-text pairs. This is done with CLIP scores: image-text pairs with lower CLIP scores suggest that the text describes only part of the image's information, whereas pairs with higher CLIP scores indicate that the text provides a more comprehensive description of the image. Specifically, we first compute the CLIP similarities of all training image-text pairs. Then, we rank all image-text pairs by their similarity scores. Finally, we use a bucketing technique to divide them into $N$ discrete alignment levels. In other words, we assign an alignment level to each image-text pair. We initialize each alignment level as an alignment vector and continuously update its representation during the pre-training phase. (A rough sketch of this bucketing step is given after this list.)

  • After the pre-training stage, we obtain $N$ alignment vectors $\{H_1, H_2, ..., H_N\}$ corresponding to the $N$ discrete alignment levels $\{1, 2, ..., N\}$. Among them, $H_N$ represents the highest level of alignment; in other words, $H_N$ indicates that the text provides a very comprehensive description of the image, so we regard it as the global alignment vector. The vectors below $H_N$ ($\{H_1, H_2, ..., H_{N-1}\}$) represent different degrees of alignment between the image and the text, i.e. the text describes only part of the image's information, from weak to strong. Thus, we regard them as local alignment vectors of varying degrees.

  • In the instruction-tuning phase, we not only allocate global alignment capability to the instructions of each task, but also adaptively distribute varying degrees of local alignment capability according to the distinct alignment needs of each instruction. The reason is that global alignment serves as the foundation for cross-modal understanding; only by mastering global alignment can a model truly focus on enhancing its local alignment abilities. Specifically, in addition to the global alignment vector, we assign different weights to the local alignment vectors via a gate network. These weights are computed from the input instruction and the image, since the input instruction greatly influences which visual regions the model should focus on. The gate network is implemented as follows:
    $$\alpha = \mathrm{softmax}(W(H_I \otimes H_T) + b),$$

  • Finally, we aggregate the global alignment vector and the local alignment vectors with varying weights to ensure a more precise fulfillment of alignment requirements for each instruction:
    $$H_{align} = H_N + \sum_{i=1}^{N-1} \alpha_i H_i,$$
    where $H_{align}$ denotes the alignment vector for each instruction during the instruction-tuning stage. (A code sketch of Equations (2) and (3) is given after this list.)
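
For concreteness, here is a minimal sketch of what the bucketing step above could look like. The function name `assign_alignment_levels`, the equal-frequency split by rank, and the use of an `nn.Embedding` for the alignment vectors are illustrative assumptions on my part, not the repository's actual code:

```python
import torch

def assign_alignment_levels(clip_scores: torch.Tensor, n_levels: int) -> torch.Tensor:
    """Map each image-text pair to a discrete alignment level in {0, ..., n_levels-1}.

    Hypothetical sketch: `clip_scores` holds one CLIP similarity per training pair,
    and the pairs are split into equally sized buckets by rank.
    """
    order = clip_scores.argsort()                    # indices from least to most aligned
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(len(clip_scores))    # rank of each pair
    # Bucket the ranks so each alignment level receives roughly the same number of pairs.
    levels = (ranks.float() * n_levels / len(clip_scores)).long().clamp(max=n_levels - 1)
    return levels

# Each discrete level is tied to a learnable alignment vector that is updated
# during pre-training (hidden_size = the LLM embedding width; values assumed).
n_levels, hidden_size = 8, 4096
alignment_vectors = torch.nn.Embedding(n_levels, hidden_size)
```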
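
And here is a rough PyTorch sketch of how Equations (2) and (3) fit together. It assumes $\otimes$ is an element-wise product of the pooled image and text embeddings and that the gate is a single linear layer producing $N-1$ logits; the class name and the pooling choices are mine, not taken from the AlignGPT code:

```python
import torch
import torch.nn as nn

class AlignmentGate(nn.Module):
    """Sketch of Eqs. (2)-(3): weight the N-1 local alignment vectors, then add the global one."""

    def __init__(self, hidden_size: int, n_levels: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, n_levels - 1)        # W and b in Eq. (2)
        # H_1 ... H_N from pre-training; frozen during instruction tuning.
        self.alignment_vectors = nn.Parameter(torch.randn(n_levels, hidden_size))

    def forward(self, image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        h_i = image_embeds.mean(dim=1)                          # pooled image embedding H_I
        h_t = text_embeds.mean(dim=1)                           # average text embedding H_T
        alpha = torch.softmax(self.gate(h_i * h_t), dim=-1)     # Eq. (2): weights over H_1..H_{N-1}
        h_local = alpha @ self.alignment_vectors[:-1]           # weighted sum of local vectors
        h_global = self.alignment_vectors[-1]                   # H_N, the global alignment vector
        return h_global + h_local                               # Eq. (3): H_align
```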

Now, we answer your questions one by one:

  • In Equation (2), we use an MLP because we need to map $H_I \otimes H_T$ to a vector whose dimension equals the number of local alignment vectors of varying degrees ($\{H_1, H_2, ..., H_{N-1}\}$). We then apply a softmax to assign a weight to each local alignment vector. During the instruction-tuning phase, we freeze the alignment vectors while updating the parameters of the gate network (see the sketch at the end of this reply). In this way, each local alignment vector can receive an appropriate weight.

  • In Equation (3), we aggregate the global alignment vector and the weighted local alignment vectors so that the alignment requirements of each instruction are met more precisely. The reasons are twofold: (1) as mentioned in the introduction, the instructions currently used for fine-tuning cover various tasks such as image captioning, visual question answering, and visual grounding. The instructions of these tasks place different requirements on the alignment capabilities. For example, image captioning mainly relies on global alignment between images and text, while VQA and visual grounding require not only global alignment but also local alignment; (2) global alignment serves as the foundation for cross-modal understanding, and only by mastering global alignment can a model truly focus on enhancing its local alignment abilities.
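
To illustrate the first point, here is a minimal sketch of that training setup, reusing the hypothetical `AlignmentGate` module from the earlier sketch (the optimizer choice and learning rate are placeholders, not the repository's settings):

```python
# Freeze the pre-trained alignment vectors; only the gate parameters (W, b) are updated.
gate = AlignmentGate(hidden_size=4096, n_levels=8)
gate.alignment_vectors.requires_grad_(False)

trainable = [p for p in gate.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)   # placeholder hyperparameters
```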

Thanks very much for your answer.
I think I have understood your method.