The paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, adopted by several LLMs since its release, such as Mistral Large, presents a new way of allocating query heads to key and value heads when computing scaled dot-product attention. This formulation covers both standard multi-head attention, in which each query head is paired with its own key/value head, and grouped-query attention, which assigns a subgroup of query heads to a single shared key/value head.
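To make the grouping idea concrete, here is a minimal sketch of grouped-query attention in PyTorch. The module name GroupedQueryAttention, its arguments, and its defaults are assumptions for illustration; this is not the paper's reference implementation nor the exact code of this repository.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GroupedQueryAttention(nn.Module):
    """Illustrative grouped-query attention: each group of query heads
    shares a single key/value head."""

    def __init__(self, embed_dim: int, num_query_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_query_heads % num_kv_heads == 0, "query heads must split evenly into KV groups"
        assert embed_dim % num_query_heads == 0
        self.head_dim = embed_dim // num_query_heads
        self.num_query_heads = num_query_heads
        self.num_kv_heads = num_kv_heads
        self.group_size = num_query_heads // num_kv_heads

        # Queries keep one projection per head; keys/values only get num_kv_heads projections.
        self.q_proj = nn.Linear(embed_dim, num_query_heads * self.head_dim)
        self.k_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape

        # Project and reshape to (batch, heads, seq_len, head_dim).
        q = self.q_proj(x).view(batch, seq_len, self.num_query_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Repeat each key/value head so every query head in a group attends
        # to the same shared key/value head.
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)

        # Scaled dot-product attention over all query heads.
        out = F.scaled_dot_product_attention(q, k, v)

        out = out.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(out)
```

With this parameterization, setting num_kv_heads equal to num_query_heads recovers standard multi-head attention, while num_kv_heads = 1 gives multi-query attention; values in between yield grouped-query attention.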
The implementation of this work is organized as follows:
- attention: contains the two attention mechanisms, explained later.
- patches_embedder: divides the input images into patches with a 2D convolution whose kernel size is the patch size, then embeds each patch (see the sketch after this list).
- ViT: assembles the different components.
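To illustrate the patch-embedding step, below is a minimal sketch of such a module, assuming square, non-overlapping patches. PatchEmbedder and its default arguments are hypothetical and only show the convolution-as-patching idea described above, not the repository's exact code.

```python
import torch
from torch import nn


class PatchEmbedder(nn.Module):
    """Illustrative patch embedding: a Conv2d whose kernel size and stride
    both equal the patch size cuts the image into non-overlapping patches
    and projects each one to the embedding dimension."""

    def __init__(self, in_channels: int = 3, patch_size: int = 16, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width)
        x = self.proj(images)             # (batch, embed_dim, H / patch, W / patch)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x


# Example: a 224x224 RGB image with 16x16 patches yields 196 patch embeddings.
tokens = PatchEmbedder()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Because the stride equals the kernel size, the convolution slides over the image without overlap, so each output position corresponds to exactly one patch.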