The paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, adopted by several LLMs since its release, such as Mistral Large, presents a new way of allocating query heads to key and value heads when computing scaled dot-product attention. This formulation covers both standard multi-head attention, in which each query head is paired with its own key/value head, and grouped-query attention, which assigns a subgroup of query heads to a single shared key/value head.
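To make the grouping idea concrete, here is a minimal sketch of grouped-query attention in PyTorch. The module name GroupedQueryAttention, its arguments, and its defaults are assumptions for illustration; this is not the paper's reference implementation nor the exact code of this repository.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GroupedQueryAttention(nn.Module):
    """Illustrative grouped-query attention: each group of query heads
    shares a single key/value head."""

    def __init__(self, embed_dim: int, num_query_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_query_heads % num_kv_heads == 0, "query heads must split evenly into KV groups"
        assert embed_dim % num_query_heads == 0
        self.head_dim = embed_dim // num_query_heads
        self.num_query_heads = num_query_heads
        self.num_kv_heads = num_kv_heads
        self.group_size = num_query_heads // num_kv_heads

        # Queries keep one projection per head; keys/values only get num_kv_heads projections.
        self.q_proj = nn.Linear(embed_dim, num_query_heads * self.head_dim)
        self.k_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape

        # Project and reshape to (batch, heads, seq_len, head_dim).
        q = self.q_proj(x).view(batch, seq_len, self.num_query_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Repeat each key/value head so every query head in a group attends
        # to the same shared key/value head.
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)

        # Scaled dot-product attention over all query heads.
        out = F.scaled_dot_product_attention(q, k, v)

        out = out.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(out)
```

With this parameterization, setting num_kv_heads equal to num_query_heads recovers standard multi-head attention, while num_kv_heads = 1 gives multi-query attention; values in between yield grouped-query attention.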
The implementation of this work is organized as follows:
- attention: contains the two attention mechanisms, explained later.
- patches_embedder: divides the input images into patches with a 2D convolution whose kernel size is the patch size, then embeds each patch (see the sketch after this list).
- ViT: assembles the different components.
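To illustrate the patch-embedding step, below is a minimal sketch of such a module, assuming square, non-overlapping patches. PatchEmbedder and its default arguments are hypothetical and only show the convolution-as-patching idea described above, not the repository's exact code.

```python
import torch
from torch import nn


class PatchEmbedder(nn.Module):
    """Illustrative patch embedding: a Conv2d whose kernel size and stride
    both equal the patch size cuts the image into non-overlapping patches
    and projects each one to the embedding dimension."""

    def __init__(self, in_channels: int = 3, patch_size: int = 16, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width)
        x = self.proj(images)             # (batch, embed_dim, H / patch, W / patch)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x


# Example: a 224x224 RGB image with 16x16 patches yields 196 patch embeddings.
tokens = PatchEmbedder()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Because the stride equals the kernel size, the convolution slides over the image without overlap, so each output position corresponds to exactly one patch.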