LLM-PowerHouse: A Curated Guide for Large Language Models with Custom Training and Inferencing

Welcome to LLM-PowerHouse, your ultimate resource for unleashing the full potential of Large Language Models (LLMs) with custom training and inferencing. This GitHub repository is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of LLMs and build intelligent applications that push the boundaries of natural language understanding.

In-Depth Articles
Codebase Mastery: Building with Perfection
What I am learning
Contributing

In-Depth Articles

NLP

Article	Resources
LLMs Overview	🔗
NLP Embeddings	🔗
Sampling	🔗
Tokenization	🔗
Transformer	🔗

Models

Article	Resources
Generative Pre-trained Transformer (GPT)	🔗

Training

Article	Resources
Activation Function	🔗
Fine Tuning Models	🔗
Enhancing Model Compression: Inference and Training Optimization Strategies	🔗
Model Summary	🔗
Splitting Datasets	🔗
Train Loss > Val Loss	🔗
Parameter Efficient Fine-Tuning	🔗
Gradient Descent and Backprop	🔗
Overfitting And Underfitting	🔗
Gradient Accumulation and Checkpointing	🔗
Flash Attention	🔗

Enhancing Model Compression: Inference and Training Optimization Strategies

Article	Resources
Quantization	🔗
Knowledge Distillation	🔗
Pruning	🔗
DeepSpeed	🔗
Sharding	🔗
Mixed Precision Training	🔗
Inference Optimization	🔗

Evaluation Metrics

Article	Resources
Classification	🔗
Regression	🔗
Generative Text Models	🔗

Open LLMs

Article	Resources
Open Source LLM Space for Commercial Use	🔗
Open Source LLM Space for Research Use	🔗
LLM Training Frameworks	🔗
Effective Deployment Strategies for Language Models	🔗
Tutorials about LLM	🔗
Courses about LLM	🔗

Cost Analysis

Article	Resources
Lambda Labs vs AWS Cost Analysis	🔗

Codebase Mastery: Building with Perfection

Title	Repository
Instruction-based data preparation using OpenAI	🔗
Optimal Fine-Tuning using the Trainer API: From Training to Model Inference	🔗
Efficient Fine-tuning and inference of LLMs with PEFT and LoRA	🔗
Efficient Fine-tuning and inference of LLMs with Accelerate	🔗
Efficient Fine-tuning with T5	🔗
Train Large Language Models with LoRA and Hugging Face	🔗
Fine-Tune Your Own Llama 2 Model in a Colab Notebook	🔗
Guanaco Chatbot Demo with LLaMA-7B Model	🔗
PEFT Finetune-Bloom-560m-tagger	🔗
Finetune_Meta_OPT-6-1b_Model_bnb_peft	🔗
Finetune Falcon-7b with BNB Self Supervised Training	🔗
FineTune LLaMa2 with QLoRa	🔗
Stable_Vicuna13B_8bit_in_Colab	🔗
GPT-Neo-X-20B-bnb2bit_training	🔗
MPT-Instruct-30B Model Training	🔗
RLHF_Training_for_CustomDataset_for_AnyModel	🔗
Fine_tuning_Microsoft_Phi_1_5b_on_custom_dataset(dialogstudio)	🔗
Finetuning OpenAI GPT3.5 Turbo	🔗
Finetuning Mistral-7b using Autotrain-advanced	🔗
RAG LangChain Tutorial	🔗
Mistral DPO Trainer	🔗
LLM Sharding	🔗
Integrating Unstructured and Graph Knowledge with Neo4j and LangChain for Enhanced Question	🔗
vLLM Benchmarking	🔗

What I am learning

After immersing myself in the recent GenAI text-based language model hype for nearly a month, I have made several observations about its performance on my specific tasks.

Please note that these observations are subjective and specific to my own experiences, and your conclusions may differ.

Models need at least 7B parameters (≥7B) for optimal natural language understanding performance. Smaller models perform significantly worse. However, models with more than 7B parameters require a GPU with more than 24GB of VRAM.
Benchmarks can be tricky as different LLMs perform better or worse depending on the task. It is crucial to find the model that works best for your specific use case. In my experience, MPT-7B is still the superior choice compared to Falcon-7B.
Prompts change with each model iteration. Therefore, multiple reworks are necessary to adapt to these changes. While there are potential solutions, their effectiveness is still being evaluated.
For fine-tuning, you need at least one GPU with more than 24GB of VRAM. A GPU with 32GB or 40GB of VRAM is recommended.
Fine-tuning only the last few layers to speed up LLM training/finetuning may not yield satisfactory results. I have tried this approach, but it didn't work well.
Loading a model in 8-bit or 4-bit precision saves VRAM. A 7B model that would otherwise require 16GB takes approximately 10GB in 8-bit or less than 6GB in 4-bit. However, this reduction in VRAM usage comes at the cost of significantly slower inference, and it may also lower performance on text understanding tasks.
Those who are exploring LLM applications for their companies should be aware of licensing considerations. A model that was trained using another model as a reference, and that requires the original weights to run, is not advisable in commercial settings.
There are three major types of LLMs: basic (like GPT-2/3), chat-enabled, and instruction-enabled. Most of the time, basic models are not usable as they are and require fine-tuning. Chat versions tend to be the best, but they are often not open-source.
Not every problem needs to be solved with LLMs. Avoid forcing a solution around LLMs. Similar to the situation with deep reinforcement learning in the past, it is important to find the most appropriate approach.
I have tried LangChain and vector databases but did not end up using them, because I never needed them. Simple Python, embeddings, and efficient dot product operations worked well for me.
LLMs do not need to have complete world knowledge. Humans also don't possess comprehensive knowledge but can adapt. LLMs only need to know how to utilize the available knowledge. It might be possible to create smaller models by separating the knowledge component.
The next wave of innovation might involve simulating "thoughts" before answering, rather than simply predicting one word after another. This approach could lead to significant advancements.
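The VRAM figures in the quantization observation above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate of the memory taken by the weights themselves, assuming 1 GB = 1e9 bytes; the helper name `weight_memory_gb` is hypothetical and not part of any library. Real usage is higher because activations, the KV cache, and framework overhead also occupy VRAM, which is consistent with the observed numbers sitting above these lower bounds.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Estimate the memory used by model weights alone, in GB (1 GB = 1e9 bytes).

    This is a lower bound: it ignores activations, KV cache, and runtime overhead.
    """
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


# Weights-only estimate for a 7B-parameter model at three precisions
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB of weights")
```

For a 7B model this gives 14.0 GB at 16-bit, 7.0 GB at 8-bit, and 3.5 GB at 4-bit for the weights alone.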
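The "simple Python, embeddings, and dot products" observation above can be illustrated with a minimal retrieval sketch. This is not the author's code: the function names `normalize` and `top_k` and the toy 3-dimensional vectors are hypothetical, and real embeddings would come from an embedding model. It uses only the standard library to stay self-contained; in practice a vectorized matrix multiply would likely replace the Python loop. It assumes every vector is non-zero.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, doc_vectors, k=2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = normalize(query)
    scores = []
    for i, doc in enumerate(doc_vectors):
        d = normalize(doc)
        # Dot product of unit vectors = cosine similarity
        scores.append((i, sum(a * b for a, b in zip(q, d))))
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy stand-ins for document embeddings (hypothetical values)
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(top_k([1.0, 0.0, 0.0], docs, k=2))
```

With these toy vectors, the query matches document 0 exactly and document 2 closely, so those two are returned ahead of the orthogonal document 1.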

Contributing

Contributions are welcome! If you'd like to contribute to this project, feel free to open an issue or submit a pull request.

ashishlal/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing