Unofficial PyTorch/🤗Transformers implementation of "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention", with support for Llama 3 and Gemma models (Llama 2 and Llama 1 are also supported).
- Paper Link: https://arxiv.org/abs/2404.07143
Type I. Infini-attention applied model-wise and trainer-wise
- Overrides the modeling and config Python files.
- Full edit; not compatible with the basic HF Trainer.
- Requires custom training code.
- Memory usage is much lower than with the default SDPA attention.
- Can train Gemma-2B with 32768 seq len (2048*16) on 2x H100 80G (AdamW optimizer, no gradient checkpointing)
- Can train Llama-3-8B with 1M seq len (2048*512) on 2x H100 80G (Adafactor optimizer, no gradient checkpointing)
- Can train 'infinite' context: check train.gemma.infini.noclm.1Mseq.sh, with 1x H100 80G (AdamW optimizer, no gradient checkpointing)
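The sequence lengths listed above are the segment size multiplied by the number of segments (for example 2048*16 = 32768 and 2048*512 = 1,048,576). As a rough illustration of that arithmetic, here is a minimal sketch. The helper names `num_segments` and `split_into_segments` are hypothetical and are not part of this repository.

```python
def num_segments(seq_len: int, segment_len: int) -> int:
    """Number of fixed-size segments needed to cover seq_len tokens."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return -(-seq_len // segment_len)  # ceiling division


def split_into_segments(token_ids, segment_len):
    """Yield consecutive slices of at most segment_len tokens."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    for start in range(0, len(token_ids), segment_len):
        yield token_ids[start:start + segment_len]


if __name__ == "__main__":
    # Segment counts for the configurations mentioned in this README.
    print(num_segments(32768, 2048))       # Gemma-2B, 32K sequence
    print(num_segments(2048 * 512, 2048))  # Llama-3-8B, 1M sequence
```

Only one segment needs to be attended over at a time, which is presumably why the total sequence length can grow far beyond what full attention would allow on the same hardware.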
Type II. Infini-attention in the attention layer only
- Overrides the modeling Python file only, specifically just the attention layer.
- Minimal edit; fully compatible with HF (Trainer, etc.)
- Memory usage is roughly equal to the default SDPA attention.
- Can train Gemma-2B with 8192 seq len (128*64) on 2x H100 80G (Adafactor optimizer + gradient checkpointing)
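For intuition about what an Infini-attention layer adds on top of regular attention, below is a dependency-free toy sketch of the compressive memory described in the paper: an ELU+1 feature map, an additive memory update, a normalized retrieval, and a sigmoid gate that mixes the memory output with the local attention output. It uses plain Python lists for a single head and a single query vector. The names `CompressiveMemory`, `elu_plus_one`, and `gate` are illustrative assumptions, and this is not the code in this repository, which operates on PyTorch tensors.

```python
import math


def elu_plus_one(x: float) -> float:
    """sigma(x) = ELU(x) + 1: a strictly positive feature map."""
    return x + 1.0 if x > 0 else math.exp(x)


class CompressiveMemory:
    """Toy single-head memory whose size does not grow with tokens seen."""

    def __init__(self, d_key: int, d_value: int):
        self.M = [[0.0] * d_value for _ in range(d_key)]  # d_key x d_value
        self.z = [0.0] * d_key                            # normalization term

    def update(self, keys, values):
        """M <- M + sigma(K)^T V  and  z <- z + sum_t sigma(K_t)."""
        for k, v in zip(keys, values):
            sk = [elu_plus_one(x) for x in k]
            for i, ski in enumerate(sk):
                self.z[i] += ski
                for j, vj in enumerate(v):
                    self.M[i][j] += ski * vj

    def retrieve(self, query):
        """A_mem = sigma(q) M / (sigma(q) . z) for one query vector."""
        sq = [elu_plus_one(x) for x in query]
        denom = sum(a * b for a, b in zip(sq, self.z))
        d_value = len(self.M[0])
        if denom == 0.0:  # nothing stored yet
            return [0.0] * d_value
        return [
            sum(sq[i] * self.M[i][j] for i in range(len(sq))) / denom
            for j in range(d_value)
        ]


def gate(beta: float, a_mem, a_local):
    """Mix memory and local outputs with a sigmoid gate on scalar beta."""
    g = 1.0 / (1.0 + math.exp(-beta))
    return [g * m + (1.0 - g) * l for m, l in zip(a_mem, a_local)]


if __name__ == "__main__":
    mem = CompressiveMemory(d_key=2, d_value=2)
    # Per segment: retrieve from the old memory first, then update it.
    mem.update(keys=[[0.0, 0.0]], values=[[1.0, 2.0]])
    print(gate(0.0, mem.retrieve([0.5, -1.0]), [3.0, 4.0]))
```

The memory holds a fixed `d_key x d_value` matrix plus a `d_key` vector per head, no matter how many segments have been folded into it.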
git clone https://github.com/Beomi/InfiniTransformer
We need to install the latest version (b109257f4f) of 🤗Transformers from the source code.
pip install -r requirements.txt
pip install -e git+https://github.com/huggingface/transformers.git@b109257f4f#egg=transformers
# or just pip install transformers
python test_basic.infini.py
Train Llama-3 1M seq len with 2K segment size, with MiniPile Dataset
./train.llama.infini.noclm.1Mseq.sh
or
Train Gemma-2B 32K seq len with 2K segment size, with WikiText2 Dataset
./train.gemma.infini.noclm.sh
or
Train Gemma-2B 1M seq len with 2K segment size, with MiniPile Dataset
./train.gemma.infini.noclm.1Mseq.sh
git clone https://github.com/Beomi/InfiniTransformer
We need to install the latest version (b109257f4f) of 🤗Transformers from the source code.
pip install -r requirements.txt
pip install -e git+https://github.com/huggingface/transformers.git@b109257f4f#egg=transformers
python test_basic.infini.py
python test_basic.py
python test_train.small.gemma.py
The example code uses wikitext-2-raw-v1 from https://huggingface.co/datasets/wikitext
The test wandb log is here: https://api.wandb.ai/links/beomi2/1rsqrkfn
python test_basic.trained.py
Sample generation with a model trained for 1 epoch on WikiText2
Input:
This work introduces an efficient method to scale Transformer-based
Output1:
models for denoising , denoising denoising , and deep denoising of images of the U2 EPK model , using a coefficient that is a function of the depth of the image resolution . The paper experiments with image denoising by Turbo @-@ based filtering , denoising by generative adversarial networks , and video denoising by denoising each of the three elements of the video ( color of the pixels / frames ) . The results are considered fair . The video is not discussed . The paper is not considering an actual application in an industrial context , line is probably a 1 . It is built in the Nohmi…
Output2:
vision models across platforms using a custom architecture optimized for both vision ( 3D / 2D ) and vision and language . In other words , a single model can run on different types of devices , a feature that is critical for the development of general @-@ purpose and large-scale AI ( see also : The One @-@ Model @-@ for @-@ All @-@ Things @-@ AI Problem ) . The model is the first to reach a global scale ( 200 GPU + ) on a single GPU using the Transformer and its variants . The model can run at the end of 1967 . He had his family relocated to a house in a nearby neighborhood , where they lived for five years , before returning to their primary residence in St. Petersburg . Later comments of 1968 made by his fellow musician Bruce Hornsby made it clear that he had gone through a lot , both personally and professionally .