ViT-implementation-with-MHA-and-GQA

Implement ViT (Vision Transformer) from scratch and add a GQA (grouped-query attention) mechanism.

About

The paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, adopted by several LLMs since its release (such as Mistral Large), presents a new way of sharing key and value heads among query heads when computing scaled dot-product attention. This work covers both standard multi-head attention, in which each query head has its own key and value head, and grouped-query attention, in which a subgroup of query heads shares a single key and value head.
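As a minimal sketch of the idea (assuming a PyTorch implementation; the function name, tensor shapes, and head counts below are illustrative and not this repo's actual module), the snippet shows how GQA repeats each key/value head across its group of query heads. Setting the two head counts equal recovers standard MHA, and a single key/value head gives multi-query attention.

```python
# Minimal grouped-query attention sketch (illustrative, not the repo's API).
import torch
import torch.nn.functional as F


def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d).

    n_q_heads must be a multiple of n_kv_heads. With n_q_heads == n_kv_heads
    this reduces to standard MHA; with n_kv_heads == 1 it is multi-query attention.
    """
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads  # query heads per key/value head
    # Repeat each key/value head so it is shared by its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v


# Example: 8 query heads grouped over 2 key/value heads (groups of 4).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # -> (1, 8, 16, 64)
```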

Module

The architecture of this work is as follows: