
AI research lab 🔬: implementations of AI papers and theoretical research: InstructGPT, Llama, transformers, diffusion models, RLHF, etc.

Primary language: Jupyter Notebook. License: MIT.

Axon: AI Research Lab 🔬


https://arxiv.org/abs/2307.09288 https://arxiv.org/abs/1706.03762 https://arxiv.org/abs/2305.13245 https://arxiv.org/abs/2104.09864 https://arxiv.org/abs/1910.07467 https://arxiv.org/abs/2104.12470 https://arxiv.org/abs/2203.02155 https://arxiv.org/abs/1707.06347 https://arxiv.org/abs/2305.18290

Welcome to Axon: AI Research Lab! This repository serves as a collaborative platform for implementing cutting-edge AI research papers and conducting novel research in various areas of artificial intelligence. Our mission is to bridge the gap between theoretical research and practical applications by providing high-quality, reproducible implementations of seminal and contemporary AI papers, including InstructGPT, Llama, transformers, diffusion models, and RLHF.


Papers implemented:

  • Attention Is All You Need.
    • The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best-performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

  • InstructGPT.
    • Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

  • Llama.
    • In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

  • Multi-Head attention.
  • Multi-Query attention.
  • Grouped-Query attention.
    • Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.

      ┌───┐┌───┐┌───┐┌───┐     ┌───┐    ┌───┐             ┌───┐
      │ v ││ v ││ v ││ v │     │ v │    │ v │             │ v │
      └───┘└───┘└───┘└───┘     └───┘    └───┘             └───┘
        │    │    │    │         │        │                 │
      ┌───┐┌───┐┌───┐┌───┐     ┌───┐    ┌───┐             ┌───┐
      │ k ││ k ││ k ││ k │     │ k │    │ k │             │ k │
      └───┘└───┘└───┘└───┘     └───┘    └───┘             └───┘
        │    │    │    │      ┌──┴──┐  ┌──┴──┐      ┌────┬──┴─┬────┐
      ┌───┐┌───┐┌───┐┌───┐  ┌───┐┌───┐┌───┐┌───┐  ┌───┐┌───┐┌───┐┌───┐
      │ q ││ q ││ q ││ q │  │ q ││ q ││ q ││ q │  │ q ││ q ││ q ││ q │
      └───┘└───┘└───┘└───┘  └───┘└───┘└───┘└───┘  └───┘└───┘└───┘└───┘
      ◀──────────────────▶  ◀──────────────────▶  ◀──────────────────▶
              MHA                    GQA                   MQA
        n_query_groups=4       n_query_groups=2      n_query_groups=1

      MHA                    GQA                                           MQA
      High quality           A good compromise between quality and speed   Loss in quality
      Computationally slow                                                 Computationally fast
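
The diagram and table above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the code from this repository: the function name `grouped_query_attention`, the tensor layout, and the sizes are hypothetical, and causal masking and the learned projections are left out. Setting `n_query_groups` equal to the number of query heads gives MHA, setting it to 1 gives MQA, and anything in between is GQA.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_heads, seq_len, head_dim); k, v: (n_query_groups, seq_len, head_dim)."""
    n_heads, _, head_dim = q.shape
    n_query_groups = k.shape[0]
    assert n_heads % n_query_groups == 0, "query heads must split evenly into groups"
    repeat = n_heads // n_query_groups
    # Each key/value head is shared by `repeat` consecutive query heads.
    k = np.repeat(k, repeat, axis=0)
    v = np.repeat(v, repeat, axis=0)
    # Scaled dot-product attention per head (no mask in this sketch).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n_heads, seq_len, head_dim = 4, 5, 8  # hypothetical sizes
q = rng.standard_normal((n_heads, seq_len, head_dim))
for n_query_groups in (4, 2, 1):  # MHA, GQA, MQA as in the diagram
    k = rng.standard_normal((n_query_groups, seq_len, head_dim))
    v = rng.standard_normal((n_query_groups, seq_len, head_dim))
    out = grouped_query_attention(q, k, v)
    print(n_query_groups, out.shape)
```

Every call returns an array of shape `(n_heads, seq_len, head_dim)`. What changes is how many key/value heads have to be stored, which is why fewer groups means a smaller key/value cache at inference time.
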
  • Reinforcement learning from human feedback.
    • A promising approach to improve the robustness and exploration in Reinforcement Learning is collecting human feedback and that way incorporating prior knowledge of the target environment. It is, however, often too expensive to obtain enough feedback of good quality. To mitigate the issue, we aim to rely on a group of multiple experts (and non-experts) with different skill levels to generate enough feedback. Such feedback can therefore be inconsistent and infrequent. In this paper, we build upon prior work -- Advise, a Bayesian approach attempting to maximise the information gained from human feedback -- extending the algorithm to accept feedback from this larger group of humans, the trainers, while also estimating each trainer's reliability. We show how aggregating feedback from multiple trainers improves the total feedback's accuracy and make the collection process easier in two ways. Firstly, this approach addresses the case of some of the trainers being adversarial. Secondly, having access to the information about each trainer reliability provides a second layer of robustness and offers valuable information for people managing the whole system to improve the overall trust in the system. It offers an actionable tool for improving the feedback collection process or modifying the reward function design if needed. We empirically show that our approach can accurately learn the reliability of each trainer correctly and use it to maximise the information gained from the multiple trainers' feedback, even if some of the sources are adversarial.
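
The InstructGPT entry above mentions collecting rankings of model outputs and using them for further fine-tuning. In the InstructGPT paper those rankings train a reward model with a pairwise loss, `-log(sigmoid(r_preferred - r_rejected))`, averaged over all pairs from one ranking. The sketch below shows only that loss on plain floats. The function names and the example reward values are hypothetical, and a real reward model would produce the scores from a neural network rather than take them as a list.

```python
import math
from itertools import combinations

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_ranking_loss(rewards_best_to_worst):
    """Mean of -log(sigmoid(r_w - r_l)) over all pairs in one ranking.

    `rewards_best_to_worst` holds the reward model's scalar scores for the
    responses to a single prompt, ordered from most to least preferred.
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    # combinations() keeps input order, so r_w is always ranked above r_l.
    # -log(sigmoid(d)) equals softplus(-d).
    return sum(softplus(-(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# Hypothetical scores: the first list agrees with the human ranking, the second reverses it.
agrees = pairwise_ranking_loss([2.0, 1.0, 0.0])
reversed_order = pairwise_ranking_loss([0.0, 1.0, 2.0])
print(agrees < reversed_order)
```

The loss is small when the reward model scores preferred responses higher, and grows when the scores contradict the human ranking.
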


Axon's Packages:

Packages and the papers each one implements:

┌───────────────┐        ┌───────────────┐        ┌───────────────┐        ┌───────────────┐
│  Transformer  │        │    X-Llama    │        │      Dali     │        │  InstructGPT  │
└───────┬───────┘        └───────┬───────┘        └───────┬───────┘        └───────┬───────┘
        │                        │                        │                        │
        │                        │                        │                        │
        ▼                        ▼                        ▼                        ▼
┌───────────────────┐   ┌───────────────────────────┐   ┌───────────────────┐   ┌────────────────────────┐
│ "Attention is All │   │ "Llama2"                  │   │ "DDPM"            │   │ "RLHF Survey"          │
│ You Need"         │   │ "RoFormer"                │   └───────────────────┘   │ "PPO"                  │
└───────────────────┘   │ "GQA"                     │                           │ "DPO"                  │
                        │ "Attention is All         │                           └────────────────────────┘
                        │ You Need"                 │
                        │ "KV-cache", RMSNorm       │
                        └───────────────────────────┘
  • Transformer model
    • Abstract. The Transformer is a deep learning model introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. It revolutionized the field of natural language processing (NLP) and has since found applications in various other domains. The architecture is based on the concept of attention, enabling it to capture long-range dependencies and achieve state-of-the-art performance on a wide range of tasks. The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data points [Vaswani et al., 2017]. It has driven recent advances in natural language processing [Devlin et al., 2019], computer vision [Dosovitskiy et al., 2021], and spatio-temporal modelling [Bi et al., 2022].
  • X-Llama
    • X-Llama is an advanced language model framework, inspired by the original Llama model but enhanced with additional features such as Grouped Query Attention (GQA), Multi-Head Attention (MHA), and more. This project aims to provide a flexible and extensible platform for experimenting with various attention mechanisms and building state-of-the-art natural language processing models.

Project structure: the model is implemented in roughly 500 lines of code, alongside the model's configuration.

X-Llama/
│
├── images/
│
├── models/
│   ├── attentions/
│   ├── rotary_embeddings/
│   └── transformer/
│
├── model
│
├── config
│
└── inference
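
The `rotary_embeddings/` directory and the RoFormer reference point to rotary position embeddings (RoPE). As a rough illustration of the technique from the RoFormer paper, and not the code in this repository, the sketch below rotates each consecutive pair of feature dimensions by an angle proportional to the token position. The function name `apply_rope`, the interleaved pairing, and the sizes are assumptions; some implementations pair dimensions differently.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even."""
    x = np.asarray(x, dtype=float)
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    # One rotation frequency per pair of dimensions: theta_i = base^(-2i/dim).
    theta = base ** (-np.arange(0, dim, 2) / dim)
    # Rotation angle for position m and pair i is m * theta_i.
    angles = np.arange(seq_len)[:, None] * theta[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # 2D rotation of each (even, odd) pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Hypothetical check: the same query and key vectors placed at every position.
q = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (8, 1))
k = np.tile(np.array([0.5, -1.0, 2.0, 0.0]), (8, 1))
rq, rk = apply_rope(q), apply_rope(k)
# Positions (3, 1) and (6, 4) have the same offset, so the scores match.
print(np.isclose(rq[3] @ rk[1], rq[6] @ rk[4]))
```

Because each pair is only rotated, vector norms are unchanged, and the dot product between a rotated query and a rotated key depends on their relative offset rather than on their absolute positions.
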

  • DDPM
    • Diffusion Models are generative models, meaning that they are used to generate data similar to the data on which they are trained. Fundamentally, Diffusion Models work by destroying training data through the successive addition of Gaussian noise and then learning to recover the data by reversing this noising process. After training, we can use the Diffusion Model to generate data by simply passing randomly sampled noise through the learned denoising process. Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
  • InstructGPT
    • AI alignment: A large language model typically is pre-trained on a massive amount of data, for example, the entire Wikipedia and billions of web pages. This gives the language model a vast “knowledge” of information to complete any prompt in a reasonable way. However, to use an LLM as a chat assistant (for example ChatGPT) we want to force the language model to follow a particular style. For example, we may want the following:

      • Do not use offensive language
      • Do not use racist expressions
      • Answer questions using a particular style

      The goal of AI alignment is to align the model’s behavior with a desired behavior.
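
The DDPM entry above describes destroying data by adding Gaussian noise step by step. In the DDPM formulation the noisy sample at any step can be drawn in one shot as `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise`, where `alpha_bar_t` is the running product of `1 - beta`. The sketch below shows only this forward process. The function names are hypothetical, and the linear schedule with 1000 steps from 1e-4 to 0.02 follows the DDPM paper, which may differ from what the Dali package uses.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule and the cumulative products alpha_bar_t.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0); t is a 0-based index into the schedule."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    # The noise is returned because a denoising network is trained to predict it.
    return x_t, noise

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((2, 3))  # stand-in for a batch of training data
x_t, noise = q_sample(x0, 999, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[-1])
```

Early in the schedule `alpha_bar_t` is close to 1, so `x_t` is almost the original data. By the last step it is close to 0, so `x_t` is almost pure Gaussian noise, which is why sampling can start from random noise.
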

Usage:

First, clone the repository:

git clone https://github.com/Esmail-ibraheem/Axon.git

The repository is laid out as shown below. Check the README file in each package for details:

Axon/
│
├── Transformer model/
│   ├── transformer/
│   ├── translator/
│   ├── assets/
│   └── Readme.md
│
├── X-Llama/
│   ├── models/
│   ├── X-Llama/
│   ├── assets/
│   └── Readme.md
│
├── Dali/
│
├── RLHF (InstructGPT)/
│
├── Readme.md
│
└── NN.jpg

Citation

@misc{Gumaan2024-Axon,
  title   = "Axon",
  author  = "Gumaan, Esmail",
  howpublished = {\url{https://github.com/Esmail-ibraheem/Axon}},
  year    = "2024",
  month   = "May",
  note    = "[Online; accessed 2024-05-24]",
}

Notes and Acknowledgments:

I built this AI research lab, Axon, as an ecosystem for implementing research papers on topics ranging from transformers and X-Llama to diffusion models. The lab also focuses on understanding the theoretical and mathematical aspects of the research, as detailed in the README file for each package. The project contains multiple packages, each implementing one or more papers. If you want to add an implementation of a research paper, please open a pull request. Implementations of more papers are welcome.

papers: