/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing

LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.

Primary LanguageJupyter NotebookMIT LicenseMIT

LLM-PowerHouse: A Curated Guide for Large Language Models with Custom Training and Inferencing

Welcome to LLM-PowerHouse, your ultimate resource for unleashing the full potential of Large Language Models (LLMs) with custom training and inferencing. This GitHub repository is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of LLMs and build intelligent applications that push the boundaries of natural language understanding.

Table of contents

πŸ§‘β€πŸ”¬ LLM Scientist

In this segment of the curriculum, participants delve into mastering the creation of top-notch LLMs through cutting-edge methodologies.

Toggle section
graph LR
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> Architecture["The LLM architecture πŸ—οΈ"]
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> Instruction["Building an instruction dataset πŸ“š"]
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> Pretraining["Pretraining models πŸ› οΈ"]
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> FineTuning["Supervised Fine-Tuning 🎯"]
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> RLHF["RLHF πŸ”"]
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> Evaluation["Evaluation πŸ“Š"]
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> Quantization["Quantization βš–οΈ"]
    Scientist["LLM Scientist πŸ‘©β€πŸ”¬"] --> Trends["New Trends πŸ“ˆ"]
    Architecture["The LLM architecture πŸ—οΈ"] --> HLV["High Level View πŸ”"]
    Architecture["The LLM architecture πŸ—οΈ"] --> Tokenization["Tokenization πŸ” "]
    Architecture["The LLM architecture πŸ—οΈ"] --> Attention["Attention Mechanisms 🧠"]
    Architecture["The LLM architecture πŸ—οΈ"] --> Generation["Text Generation ✍️"]
    Instruction["Building an instruction dataset πŸ“š"] --> Alpaca["Alpaca-like dataset πŸ¦™"]
    Instruction["Building an instruction dataset πŸ“š"] --> Advanced["Advanced Techniques πŸ“ˆ"]
    Instruction["Building an instruction dataset πŸ“š"] --> Filtering["Filtering Data πŸ”"]
    Instruction["Building an instruction dataset πŸ“š"] --> Prompt["Prompt Templates πŸ“"]
    Pretraining["Pretraining models πŸ› οΈ"] --> Pipeline["Data Pipeline πŸš€"]
    Pretraining["Pretraining models πŸ› οΈ"] --> CLM["Casual Language Modeling πŸ“"]
    Pretraining["Pretraining models πŸ› οΈ"] --> Scaling["Scaling Laws πŸ“"]
    Pretraining["Pretraining models πŸ› οΈ"] --> HPC["High-Performance Computing πŸ’»"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> Full["Full fine-tuning πŸ› οΈ"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> Lora["Lora and QLoRA πŸŒ€"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> Axoloti["Axoloti 🦠"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> DeepSpeed["DeepSpeed ⚑"]
    RLHF["RLHF πŸ”"] --> Preference["Preference Datasets πŸ“"]
    RLHF["RLHF πŸ”"] --> Optimization["Proximal Policy Optimization 🎯"]
    RLHF["RLHF πŸ”"] --> DPO["Direct Preference Optimization πŸ“ˆ"]
    Evaluation["Evaluation πŸ“Š"] --> Traditional["Traditional Metrics πŸ“"]
    Evaluation["Evaluation πŸ“Š"] --> General["General Benchmarks πŸ“ˆ"]
    Evaluation["Evaluation πŸ“Š"] --> Task["Task-specific Benchmarks πŸ“‹"]
    Evaluation["Evaluation πŸ“Š"] --> HF["Human Evaluation πŸ‘©β€πŸ”¬"]
    Quantization["Quantization βš–οΈ"] --> Base["Base Techniques πŸ› οΈ"]
    Quantization["Quantization βš–οΈ"] --> GGUF["GGUF and llama.cpp 🐐"]
    Quantization["Quantization βš–οΈ"] --> GPTQ["GPTQ and EXL2 πŸ€–"]
    Quantization["Quantization βš–οΈ"] --> AWQ["AWQ πŸš€"]
    Trends["New Trends πŸ“ˆ"] --> Positional["Positional Embeddings 🎯"]
    Trends["New Trends πŸ“ˆ"] --> Merging["Model Merging πŸ”„"]
    Trends["New Trends πŸ“ˆ"] --> MOE["Mixture of Experts 🎭"]
    Trends["New Trends πŸ“ˆ"] --> Multimodal["Multimodal Models πŸ“·"]
Loading

1. The LLM architecture πŸ—οΈ

An overview of the Transformer architecture, with emphasis on inputs (tokens) and outputs (logits), and the importance of understanding the vanilla attention mechanism and its improved versions.

concept Description
Transformer Architecture (High-Level) Review encoder-decoder Transformers, specifically the decoder-only GPT architecture used in modern LLMs.
Tokenization Understand how raw text is converted into tokens (words or subwords) for the model to process.
Attention Mechanisms Grasp the theory behind attention, including self-attention and scaled dot-product attention, which allows the model to focus on relevant parts of the input during output generation.
Text Generation Learn different methods the model uses to generate output sequences. Common strategies include greedy decoding, beam search, top-k sampling, and nucleus sampling.

Further Exploration

Reference Description Link
The Illustrated Transformer by Jay Alammar A visual and intuitive explanation of the Transformer model πŸ”—
The Illustrated GPT-2 by Jay Alammar Focuses on the GPT architecture, similar to Llama's. πŸ”—
Visual intro to Transformers by 3Blue1Brown Simple visual intro to Transformers πŸ”—
LLM Visualization by Brendan Bycroft 3D visualization of LLM internals πŸ”—
nanoGPT by Andrej Karpathy Reimplementation of GPT from scratch (for programmers) πŸ”—
Decoding Strategies in LLMs Provides code and visuals for decoding strategies πŸ”—

2. Building an instruction dataset πŸ“š

While it's easy to find raw data from Wikipedia and other websites, it's difficult to collect pairs of instructions and answers in the wild. Like in traditional machine learning, the quality of the dataset will directly influence the quality of the model, which is why it might be the most important component in the fine-tuning process.

Concept Description
Alpaca-like dataset This dataset generation method utilizes the OpenAI API (GPT) to synthesize data from scratch, allowing for the specification of seeds and system prompts to foster diversity within the dataset.
Advanced techniques Delve into methods for enhancing existing datasets with Evol-Instruct, and explore approaches for generating top-tier synthetic data akin to those outlined in the Orca and phi-1 research papers.
Filtering data Employ traditional techniques such as regex, near-duplicate removal, and prioritizing answers with substantial token counts to refine datasets.
Prompt templates Recognize the absence of a definitive standard for structuring instructions and responses, underscoring the importance of familiarity with various chat templates like ChatML and Alpaca.

Further Exploration

Reference Description Link
Preparing a Dataset for Instruction tuning by Thomas Capelle Explores the Alpaca and Alpaca-GPT4 datasets and discusses formatting methods. πŸ”—
Generating a Clinical Instruction Dataset by Solano Todeschini Provides a tutorial on creating a synthetic instruction dataset using GPT-4. πŸ”—
GPT 3.5 for news classification by Kshitiz Sahay Demonstrates using GPT 3.5 to create an instruction dataset for fine-tuning Llama 2 in news classification. πŸ”—
Dataset creation for fine-tuning LLM Notebook containing techniques to filter a dataset and upload the result. πŸ”—
Chat Template by Matthew Carrigan Hugging Face's page about prompt templates πŸ”—

3. Pretraining models πŸ› οΈ

Pre-training, being both lengthy and expensive, is not the primary focus of this course. While it's beneficial to grasp the fundamentals of pre-training, practical experience in this area is not mandatory.

Concept Description
Data pipeline Pre-training involves handling vast datasets, such as the 2 trillion tokens used in Llama 2, which necessitates tasks like filtering, tokenization, and vocabulary preparation.
Causal language modeling Understand the distinction between causal and masked language modeling, including insights into the corresponding loss functions. Explore efficient pre-training techniques through resources like Megatron-LM or gpt-neox.
Scaling laws Delve into the scaling laws, which elucidate the anticipated model performance based on factors like model size, dataset size, and computational resources utilized during training.
High-Performance Computing While beyond the scope of this discussion, a deeper understanding of HPC becomes essential for those considering building their own LLMs from scratch, encompassing aspects like hardware selection and distributed workload management.

Further Exploration

Reference Description Link
LLMDataHub by Junhao Zhao Offers a carefully curated collection of datasets tailored for pre-training, fine-tuning, and RLHF. πŸ”—
Training a causal language model from scratch by Hugging Face Guides users through the process of pre-training a GPT-2 model from the ground up using the transformers library. πŸ”—
TinyLlama by Zhang et al. Provides insights into the training process of a Llama model from scratch, offering a comprehensive understanding. πŸ”—
Causal language modeling by Hugging Face Explores the distinctions between causal and masked language modeling, alongside a tutorial on efficiently fine-tuning a DistilGPT-2 model. πŸ”—
Chinchilla's wild implications by nostalgebraist Delves into the scaling laws and their implications for LLMs, offering valuable insights into their broader significance. πŸ”—
BLOOM by BigScience Provides a comprehensive overview of the BLOOM model's construction, offering valuable insights into its engineering aspects and encountered challenges. πŸ”—
OPT-175 Logbook by Meta Offers research logs detailing the successes and failures encountered during the pre-training of a large language model with 175B parameters. πŸ”—
LLM 360 Presents a comprehensive framework for open-source LLMs, encompassing training and data preparation code, datasets, evaluation metrics, and models. πŸ”—

4. Supervised Fine-Tuning 🎯

Pre-trained models are trained to predict the next word, so they're not great as assistants. But with SFT, you can adjust them to follow instructions. Plus, you can fine-tune them on different data, even private stuff GPT-4 hasn't seen, and use them without needing paid APIs like OpenAI's.

Concept Description
Full fine-tuning Full fine-tuning involves training all parameters in the model, though it's not the most efficient approach, it can yield slightly improved results.
LoRA LoRA, a parameter-efficient technique (PEFT) based on low-rank adapters, focuses on training only these adapters rather than all model parameters.
QLoRA QLoRA, another PEFT stemming from LoRA, also quantizes model weights to 4 bits and introduces paged optimizers to manage memory spikes efficiently.
Axolotl Axolotl stands as a user-friendly and potent fine-tuning tool, extensively utilized in numerous state-of-the-art open-source models.
DeepSpeed DeepSpeed facilitates efficient pre-training and fine-tuning of large language models across multi-GPU and multi-node settings, often integrated within Axolotl for enhanced performance.

Futher Exploration

Reference Description Link
The Novice's LLM Training Guide by Alpin Provides an overview of essential concepts and parameters for fine-tuning LLMs. πŸ”—
LoRA insights by Sebastian Raschka Offers practical insights into LoRA and guidance on selecting optimal parameters. πŸ”—
Fine-Tune Your Own Llama 2 Model Presents a hands-on tutorial on fine-tuning a Llama 2 model using Hugging Face libraries. πŸ”—
Padding Large Language Models by Benjamin Marie Outlines best practices for padding training examples in causal LLMs. πŸ”—

In-Depth Articles

NLP

Article Resources
LLMs Overview πŸ”—
NLP Embeddings πŸ”—
Sampling πŸ”—
Tokenization πŸ”—
Transformer πŸ”—
Interview Preparation πŸ”—

Models

Article Resources
Generative Pre-trained Transformer (GPT) πŸ”—

Training

Article Resources
Activation Function πŸ”—
Fine Tuning Models πŸ”—
Enhancing Model Compression: Inference and Training Optimization Strategies πŸ”—
Model Summary πŸ”—
Splitting Datasets πŸ”—
Train Loss > Val Loss πŸ”—
Parameter Efficient Fine-Tuning πŸ”—
Gradient Descent and Backprop πŸ”—
Overfitting And Underfitting πŸ”—
Gradient Accumulation and Checkpointing πŸ”—
Flash Attention πŸ”—

Enhancing Model Compression: Inference and Training Optimization Strategies

Article Resources
Quantization πŸ”—
Knowledge Distillation πŸ”—
Pruning πŸ”—
DeepSpeed πŸ”—
Sharding πŸ”—
Mixed Precision Training πŸ”—
Inference Optimization πŸ”—

Evaluation Metrics

Article Resources
Classification πŸ”—
Regression πŸ”—
Generative Text Models πŸ”—

Open LLMs

Article Resources
Open Source LLM Space for Commercial Use πŸ”—
Open Source LLM Space for Research Use πŸ”—
LLM Training Frameworks πŸ”—
Effective Deployment Strategies for Language Models πŸ”—
Tutorials about LLM πŸ”—
Courses about LLM πŸ”—
Deployment πŸ”—

Resources for cost analysis and network visualization

Article Resources
Lambda Labs vs AWS Cost Analysis πŸ”—
Neural Network Visualization πŸ”—

Codebase Mastery: Building with Perfection

Title Repository
Instruction based data prepare using OpenAI πŸ”—
Optimal Fine-Tuning using the Trainer API: From Training to Model Inference πŸ”—
Efficient Fine-tuning and inference LLMs with PEFT and LoRA πŸ”—
Efficient Fine-tuning and inference LLMs Accelerate πŸ”—
Efficient Fine-tuning with T5 πŸ”—
Train Large Language Models with LoRA and Hugging Face πŸ”—
Fine-Tune Your Own Llama 2 Model in a Colab Notebook πŸ”—
Guanaco Chatbot Demo with LLaMA-7B Model πŸ”—
PEFT Finetune-Bloom-560m-tagger πŸ”—
Finetune_Meta_OPT-6-1b_Model_bnb_peft πŸ”—
Finetune Falcon-7b with BNB Self Supervised Training πŸ”—
FineTune LLaMa2 with QLoRa πŸ”—
Stable_Vicuna13B_8bit_in_Colab πŸ”—
GPT-Neo-X-20B-bnb2bit_training πŸ”—
MPT-Instruct-30B Model Training πŸ”—
RLHF_Training_for_CustomDataset_for_AnyModel πŸ”—
Fine_tuning_Microsoft_Phi_1_5b_on_custom_dataset(dialogstudio) πŸ”—
Finetuning OpenAI GPT3.5 Turbo πŸ”—
Finetuning Mistral-7b FineTuning Model using Autotrain-advanced πŸ”—
RAG LangChain Tutorial πŸ”—
Mistral DPO Trainer πŸ”—
LLM Sharding πŸ”—
Integrating Unstructured and Graph Knowledge with Neo4j and LangChain for Enhanced Question πŸ”—
vLLM Benchmarking πŸ”—
Milvus Vector Database πŸ”—
Decoding Strategies πŸ”—
Peft QLora SageMaker Training πŸ”—
Optimize Single Model SageMaker Endpoint πŸ”—
Multi Adapter Inference πŸ”—
Inf2 LLM SM Deployment πŸ”—
Text Chunk Visualization In Progress πŸ”—
Fine-tune Llama 3 with ORPO πŸ”—
4 bit LLM Quantization with GPTQ πŸ”—

LLM PlayLab

LLM Projects Respository
CSVQConnect πŸ”—
AI_VIRTUAL_ASSISTANT πŸ”—
DocuBotMultiPDFConversationalAssistant πŸ”—
autogpt πŸ”—
meta_llama_2finetuned_text_generation_summarization πŸ”—
text_generation_using_Llama πŸ”—
llm_using_petals πŸ”—
llm_using_petals πŸ”—
Salesforce-xgen πŸ”—
text_summarization_using_open_llama_7b πŸ”—
Text_summarization_using_GPT-J πŸ”—
codllama πŸ”—
Image_to_text_using_LLaVA πŸ”—
Tabular_data_using_llamaindex πŸ”—
nextword_sentence_prediction πŸ”—
Text-Generation-using-DeciLM-7B-instruct πŸ”—
Gemini-blog-creation πŸ”—
Prepare_holiday_cards_with_Gemini_and_Sheets πŸ”—
Code-Generattion_using_phi2_llm πŸ”—
RAG-USING-GEMINI πŸ”—
Resturant-Recommendation-Multi-Modal-RAG-using-Gemini πŸ”—
slim-sentiment-tool πŸ”—
Synthetic-Data-Generation-Using-LLM πŸ”—
Architecture-for-building-a-Chat-Assistant πŸ”—
LLM-CHAT-ASSISTANT-WITH-DYNAMIC-CONTEXT-BASED-ON-QUERY πŸ”—
Text Classifier using LLM πŸ”—
Multiclass sentiment Analysis πŸ”—
Text-Generation-Using-GROQ πŸ”—
DataAgents πŸ”—
PandasQuery_tabular_data πŸ”—
Exploratory_Data_Analysis_using_LLM πŸ”—

LLM Alligmment

Alignment is an emerging field of study where you ensure that an AI system performs exactly what you want it to perform. In the context of LLMs specifically, alignment is a process that trains an LLM to ensure that the generated outputs align with human values and goals.

What are the current methods for LLM alignment?

You will find many alignment methods in research literature, we will only stick to 3 alignment methods for the sake of discussion

πŸ“Œ RLHF:

  • Step 1 & 2: Train an LLM (pre-training for the base model + supervised/instruction fine-tuning for chat model)
  • Step 3: RLHF uses an ancillary language model (it could be much smaller than the main LLM) to learn human preferences. This can be done using a preference dataset - it contains a prompt, and a response/set of responses graded by expert human labelers. This is called a β€œreward model”.
  • Step 4: Use a reinforcement learning algorithm (eg: PPO - proximal policy optimization), where the LLM is the agent, the reward model provides a positive or negative reward to the LLM based on how well it’s responses align with the β€œhuman preferred responses”. In theory, it is as simple as that. However, implementation isn’t that easy - requiring lot of human experts and compute resources. To overcome the β€œexpense” of RLHF, researchers developed DPO.
  • RLHF : RLHF: Reinforcement Learning from Human Feedback

πŸ“Œ DPO:

  • Step 1&2 remain the same
  • Step 4: DPO eliminates the need for the training of a reward model (i.e step 3). How? DPO defines an additional preference loss as a function of it’s policy and uses the language model directly as the reward model. The idea is simple, If you are already training such a powerful LLM, why not train itself to distinguish between good and bad responses, instead of using another model?
  • DPO is shown to be more computationally efficient (in case of RLHF you also need to constantly monitor the behavior of the reward model) and has better performance than RLHF in several settings.
  • Blog on DPO : Aligning LLMs with Direct Preference Optimization (DPO)β€” background, overview, intuition and paper summary

πŸ“Œ ORPO:

  • The newest method out of all 3, ORPO combines Step 2, 3 & 4 into a single step - so the dataset required for this method is a combination of a fine-tuning + preference dataset.
  • The supervised fine-tuning and alignment/preference optimization is performed in a single step. This is because the fine-tuning step, while allowing the model to specialize to tasks and domains, can also increase the probability of undesired responses from the model.
  • ORPO combines the steps using a single objective function by incorporating an odds ratio (OR) term - reward preferred responses & penalizing rejected responses.
  • Blog on ORPO : ORPO Outperforms SFT+DPO | Train Phi-2 with ORPO

What I am learning

After immersing myself in the recent GenAI text-based language model hype for nearly a month, I have made several observations about its performance on my specific tasks.

Please note that these observations are subjective and specific to my own experiences, and your conclusions may differ.

  • We need a minimum of 7B parameter models (<7B) for optimal natural language understanding performance. Models with fewer parameters result in a significant decrease in performance. However, using models with more than 7 billion parameters requires a GPU with greater than 24GB VRAM (>24GB).
  • Benchmarks can be tricky as different LLMs perform better or worse depending on the task. It is crucial to find the model that works best for your specific use case. In my experience, MPT-7B is still the superior choice compared to Falcon-7B.
  • Prompts change with each model iteration. Therefore, multiple reworks are necessary to adapt to these changes. While there are potential solutions, their effectiveness is still being evaluated.
  • For fine-tuning, you need at least one GPU with greater than 24GB VRAM (>24GB). A GPU with 32GB or 40GB VRAM is recommended.
  • Fine-tuning only the last few layers to speed up LLM training/finetuning may not yield satisfactory results. I have tried this approach, but it didn't work well.
  • Loading 8-bit or 4-bit models can save VRAM. For a 7B model, instead of requiring 16GB, it takes approximately 10GB or less than 6GB, respectively. However, this reduction in VRAM usage comes at the cost of significantly decreased inference speed. It may also result in lower performance in text understanding tasks.
  • Those who are exploring LLM applications for their companies should be aware of licensing considerations. Training a model with another model as a reference and requiring original weights is not advisable for commercial settings.
  • There are three major types of LLMs: basic (like GPT-2/3), chat-enabled, and instruction-enabled. Most of the time, basic models are not usable as they are and require fine-tuning. Chat versions tend to be the best, but they are often not open-source.
  • Not every problem needs to be solved with LLMs. Avoid forcing a solution around LLMs. Similar to the situation with deep reinforcement learning in the past, it is important to find the most appropriate approach.
  • I have tried but didn't use langchains and vector-dbs. I never needed them. Simple Python, embeddings, and efficient dot product operations worked well for me.
  • LLMs do not need to have complete world knowledge. Humans also don't possess comprehensive knowledge but can adapt. LLMs only need to know how to utilize the available knowledge. It might be possible to create smaller models by separating the knowledge component.
  • The next wave of innovation might involve simulating "thoughts" before answering, rather than simply predicting one word after another. This approach could lead to significant advancements.
  • The overparameterization of LLMs presents a significant challenge: they tend to memorize extensive amounts of training data. This becomes particularly problematic in RAG scenarios when the context conflicts with this "implicit" knowledge. However, the situation escalates further when the context itself contains contradictory information. A recent survey paper comprehensively analyzes these "knowledge conflicts" in LLMs, categorizing them into three distinct types:
    • Context-Memory Conflicts: Arise when external context contradicts the LLM's internal knowledge.

      • Solution
        • Fine-tune on counterfactual contexts to prioritize external information.
        • Utilize specialized prompts to reinforce adherence to context
        • Apply decoding techniques to amplify context probabilities.
        • Pre-train on diverse contexts across documents.
    • Inter-Context Conflicts: Contradictions between multiple external sources.

      • Solution:
        • Employ specialized models for contradiction detection.
        • Utilize fact-checking frameworks integrated with external tools.
        • Fine-tune discriminators to identify reliable sources.
        • Aggregate high-confidence answers from augmented queries.
    • Intra-Memory Conflicts: The LLM gives inconsistent outputs for similar inputs due to conflicting internal knowledge.

      • Solution:
        • Fine-tune with consistency loss functions.
        • Implement plug-in methods, retraining on word definitions.
        • Ensemble one model's outputs with another's coherence scoring.
        • Apply contrastive decoding, focusing on truthful layers/heads.
  • The difference between PPO and DPOs: in DPO you don’t need to train a reward model anymore. Having good and bad data would be sufficient!
  • ORPO: β€œA straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. β€œ Hong, Lee, Thorne (2024)
  • KTO: β€œKTO does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. This makes it far easier to use in the real world, where preference data is scarce and expensive.” Ethayarajh et al (2024)

Contributing

Contributions are welcome! If you'd like to contribute to this project, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License.

Created with ❀️ by Sunil Ghimire