LLM-PowerHouse: A Curated Guide for Large Language Models with Custom Training and Inferencing

Welcome to LLM-PowerHouse, your ultimate resource for unleashing the full potential of Large Language Models (LLMs) with custom training and inferencing. This GitHub repository is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of LLMs and build intelligent applications that push the boundaries of natural language understanding.

Prompt Engineering
Fine Tuning
When to perform
Difference between Masked Language Model and Causal Language Model
Open Source LLM Space for Research Use
Open Source LLM Space for Commercial Use
LLM Training Frameworks
Effective Deployment Strategies for Language Models
Courses about LLM
Tutorials about LLM
Codebase Mastery: Building with Perfection with In-Depth Articles
What I am learning

Prompt Engineering

Prompt engineering is a technique that can be used to improve the performance of LLMs on specific tasks. It involves crafting prompts that help the LLM to generate the desired output. This can be done by providing the model with additional information, such as examples of the desired output, or using specific language the model is likely to understand.

Prompt engineering can be a powerful tool, but it is not a silver bullet. LLMs can still generate incorrect or unexpected output even with the best prompts, so it is important to test their output carefully before using them in production.
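
As a minimal sketch of "providing the model with examples of the desired output", the snippet below assembles a few-shot prompt as plain text. The function name `build_few_shot_prompt`, the `Input:`/`Output:` layout, and the sentiment examples are assumptions made for illustration; they are not a format required by any particular model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # End with an open "Output:" so the model continues with the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical sentiment-classification examples, for illustration only.
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "Setup was quick and painless.",
)
print(prompt)
```

The resulting string can be sent to whichever model is in use. Because the output is still not guaranteed to be correct, it should be checked against known answers before the prompt is relied on.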

Fine Tuning

Fine-tuning, on the other hand, adapts a pre-trained LLM to a specific task or domain by training it further on a smaller, task-specific dataset. The model's weights are adjusted to minimize a loss function on that dataset, which improves its performance on the task.

Fine-tuning can be a more effective way than prompt engineering to improve the performance of LLMs on specific tasks. However, it is also more time-consuming and expensive, so it is important to consider the cost and time involved before deciding whether to use it.
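
The idea of continuing training from existing weights can be illustrated with a deliberately tiny stand-in. The sketch below is not an LLM: it uses a one-variable linear model, and the "pre-trained" weights and the task dataset are hypothetical values chosen for illustration. It only shows the mechanism, namely starting from existing weights and running further gradient descent on a small task-specific dataset to reduce the loss.

```python
def mse(w, b, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, b, data, lr=0.05, epochs=200):
    """Continue gradient descent from existing weights on a small dataset."""
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Stand-in for "pre-trained" weights (hypothetical values).
pretrained_w, pretrained_b = 2.0, 0.0

# Small task-specific dataset that follows y = 2x + 1.
task_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

loss_before = mse(pretrained_w, pretrained_b, task_data)
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, task_data)
loss_after = mse(tuned_w, tuned_b, task_data)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

With a real LLM the same loop is run by a training framework over billions of parameters, which is where the time and hardware cost comes from.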

When to perform

Fine-tuning is typically needed when the task is new or challenging or when the desired output is highly specific. In these cases, prompt engineering may not be able to provide the model with enough information to generate the desired output.

Prompt engineering is typically sufficient when the task is well-defined and the desired output is not highly specific. In these cases, a prompt can give the model all the information it needs to generate the desired output.

Difference between Masked Language Model and Causal Language Model

	Masked Language Model (MLM)	Causal Language Model (CLM)
Training Objective	Predict original values of masked tokens within context	Generate next token based on preceding tokens
Input Conditioning	Includes original context with randomly masked tokens	Includes preceding tokens in the sequence
Bidirectionality	Bidirectional, considers entire context to predict masked tokens	Unidirectional, relies on preceding tokens
Use Cases	Text completion, masked word prediction, text understanding	Text generation, story completion, language translation
Example Model	BERT (Bidirectional Encoder Representations from Transformers)	GPT (Generative Pretrained Transformer)
Associated Task	Natural Language Understanding (NLU)	Natural Language Generation (NLG)
Example Sentence	"I want to __ a book."	"The cat sat on the __."
Masked Token	`[MASK]`	None (the next position is predicted)
Objective	Predict missing word	Predict next word in sequence
Possible Predictions	read, buy, borrow, ...	mat, chair, table, ...
Context	"I want to [MASK] a book."	"The cat sat on the"
Output Prediction	read	mat
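
The two training objectives in the table can be sketched as data preparation steps. The code below is a simplified illustration, not the exact procedure used by BERT or GPT: it works on whitespace-split words instead of subword tokens, and the masking step only substitutes `[MASK]` (BERT's recipe also sometimes keeps or randomly replaces the selected token). The function names are hypothetical.

```python
import random

MASK = "[MASK]"


def make_mlm_example(tokens, mask_prob=0.15, rng=None):
    """Mask random tokens; labels hold the original token at masked positions."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # this position is not scored
    return inputs, labels


def make_clm_examples(tokens):
    """Pair every prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "I want to read a book".split()
mlm_inputs, mlm_labels = make_mlm_example(tokens, mask_prob=0.3)
clm_pairs = make_clm_examples(tokens)

print("MLM input :", mlm_inputs)
print("MLM labels:", mlm_labels)
for context, target in clm_pairs:
    print("CLM:", context, "->", target)
```

In the MLM example the model sees tokens on both sides of each mask, while each CLM pair exposes only the preceding tokens, which is the bidirectional versus unidirectional distinction in the table.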

Open Source LLM Space for Research Use

Language Model	Description	Link
Baize	Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.	🔗
Koala	A Dialogue Model for Academic Research	🔗
Dalai	The simplest way to run LLaMA on your local machine	🔗
LLaMA	A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA	🔗
ColossalChat	LLM trained with RLHF powered by Colossal-AI	🔗
Vicuna	An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.	🔗
Dolly	A cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT	🔗
GPT4All	Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMA	🔗
Alpaca	A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA	🔗

Open Source LLM Space for Commercial Use

Language Model	Release Date	Checkpoints	Paper/Blog	Params (B)	Context Length	Licence	Try it
T5	2019/10	T5 & Flan-T5, Flan-T5-xxl (HF)	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	0.06 - 11	512	Apache 2.0	T5-Large
UL2	2022/10	UL2 & Flan-UL2, Flan-UL2 (HF)	UL2 20B: An Open Source Unified Language Learner	20	512, 2048	Apache 2.0
Cerebras-GPT	2023/03	Cerebras-GPT	Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models (Paper)	0.111 - 13	2048	Apache 2.0	Cerebras-GPT-1.3B
Open Assistant (Pythia family)	2023/03	OA-Pythia-12B-SFT-8, OA-Pythia-12B-SFT-4, OA-Pythia-12B-SFT-1	Democratizing Large Language Model Alignment	12	2048	Apache 2.0	Pythia-2.8B
Pythia	2023/04	pythia 70M - 12B	Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling	0.07 - 12	2048	Apache 2.0
Dolly	2023/04	dolly-v2-12b	Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM	3, 7, 12	2048	MIT
DLite	2023/05	dlite-v2-1_5b	Announcing DLite V2: Lightweight, Open LLMs That Can Run Anywhere	0.124 - 1.5	1024	Apache 2.0	DLite-v2-1.5B
RWKV	2021/08	RWKV, ChatRWKV	The RWKV Language Model (and my LM tricks)	0.1 - 14	infinity (RNN)	Apache 2.0
GPT-J-6B	2021/06	GPT-J-6B, GPT4All-J	GPT-J-6B: 6B JAX-Based Transformer	6	2048	Apache 2.0
GPT-NeoX-20B	2022/04	GPT-NEOX-20B	GPT-NeoX-20B: An Open-Source Autoregressive Language Model	20	2048	Apache 2.0
Bloom	2022/11	Bloom	BLOOM: A 176B-Parameter Open-Access Multilingual Language Model	176	2048	OpenRAIL-M v1
StableLM-Alpha	2023/04	StableLM-Alpha	Stability AI Launches the First of its StableLM Suite of Language Models	3 - 65	4096	CC BY-SA-4.0
FastChat-T5	2023/04	fastchat-t5-3b-v1.0	We are excited to release FastChat-T5: our compact and commercial-friendly chatbot!	3	512	Apache 2.0
h2oGPT	2023/05	h2oGPT	Building the World’s Best Open-Source Large Language Model: H2O.ai’s Journey	12 - 20	256 - 2048	Apache 2.0
MPT-7B	2023/05	MPT-7B, MPT-7B-Instruct	Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs	7	84k (ALiBi)	Apache 2.0, CC BY-SA-3.0
RedPajama-INCITE	2023/05	RedPajama-INCITE	Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models	3 - 7	2048	Apache 2.0	RedPajama-INCITE-Instruct-3B-v1
OpenLLaMA	2023/05	open_llama_3b, open_llama_7b, open_llama_13b	OpenLLaMA: An Open Reproduction of LLaMA	3, 7	2048	Apache 2.0	OpenLLaMA-7B-Preview_200bt
Falcon	2023/05	Falcon-40B, Falcon-7B	The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only	7, 40	2048	Apache 2.0
MPT-30B	2023/06	MPT-30B, MPT-30B-instruct	MPT-30B: Raising the bar for open-source foundation models	30	8192	Apache 2.0, CC BY-SA-3.0	MPT 30B inference code using CPU
LLaMA 2	2023	LLaMA 2 Weights	Llama 2: Open Foundation and Fine-Tuned Chat Models	7 - 70	4096	Custom: free if you have under 700M users; LLaMA outputs may not be used to train LLMs other than LLaMA and its derivatives	HuggingChat

LLM Training Frameworks

Framework	Description	Resource
Alpa	Alpa is a system for training and serving large-scale neural networks.	🔗
DeepSpeed	DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.	🔗
Megatron-DeepSpeed	DeepSpeed version of NVIDIA's Megatron-LM that adds additional support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others.	🔗
FairScale	FairScale is a PyTorch extension library for high performance and large scale training	🔗
Megatron-LM	Ongoing research training transformer models at scale.	🔗
Colossal-AI	Making large AI models cheaper, faster, and more accessible.	🔗
BMTrain	Efficient Training for Big Models.	🔗
Mesh TensorFlow	Mesh TensorFlow: Model Parallelism Made Easier.	🔗
maxtext	A simple, performant and scalable Jax LLM!	🔗
gpt-neox	An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.	🔗
Trainer API	Provides an API for feature-complete training in PyTorch for most standard use cases	🔗
Lightning	Deep learning framework to train, deploy, and ship AI products Lightning fast	🔗

Effective Deployment Strategies for Language Models

Deployment Tools	Description	Resource
SkyPilot	Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface.	🔗
vLLM	A high-throughput and memory-efficient inference and serving engine for LLMs	🔗
Text Generation Inference	A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs api-inference widgets.	🔗
Haystack	An open-source NLP framework that allows you to use LLMs and transformer-based models from Hugging Face, OpenAI and Cohere to interact with your own data.	🔗
Sidekick	Data integration platform for LLMs.	🔗
LangChain	Building applications with LLMs through composability	🔗
wechat-chatgpt	Use ChatGPT On Wechat via wechaty	🔗
promptfoo	Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality.	🔗
Agenta	Easily build, version, evaluate and deploy your LLM-powered apps	🔗

Courses about LLM

Courses	Link
Full Stack DeepLearning's "LLM BootCamp"	🔗
Cohere's LLM University - By Luis Serrano, Jay Alammar and Meor Amer	🔗
fast.ai's Part 2 of "Practical Deep Learning for Coders"	🔗
Deep Learning Fundamentals by Lightning AI & Sebastian Raschka, PhD.	🔗
Hugging Face's NLP Course	🔗
`DeepLearning.AI` ChatGPT Prompt Engineering for Developers	🔗
`Princeton` Understanding Large Language Models	🔗
`Stanford` CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF	🔗
`Stanford` CS324-Large Language Models	🔗
`Stanford` CS25-Transformers United V2	🔗
`Stanford` GPT-3 & Beyond	🔗
`MIT` Introduction to Data-Centric AI	🔗
`Cohere` LLM University to learn about LLMs and NLP	🔗
`O'Reilly` Deploying GPT and Large Language Models	🔗
`edX` Professional Certificate in Large Language Models	🔗
`Coursera` Natural Language Processing Specialization	🔗
`Class Central` Large Language Models	🔗
`Rycolab` Large Language Models	🔗

Tutorials about LLM

Tutorials	Author	Link
State of GPT	Andrej Karpathy	🔗
Instruction finetuning and RLHF lecture	Hyung Won Chung	🔗
Scaling, emergence, and reasoning in large language models	Jason Wei	🔗
Open Pretrained Transformers	Susan Zhang	🔗
How Does ChatGPT Work?	Ameet Deshpande	🔗
GPT in 60 Lines of NumPy	Jay Mody	🔗
Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models	ICML 2022	🔗
Foundational Robustness of Foundation Models	NeurIPS 2022	🔗
Let's build GPT: from scratch, in code, spelled out	Andrej Karpathy	🔗 👨‍💻
Prompt Engineering Guide	DAIR.AI	🔗
Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers	Philipp Schmid	🔗
Illustrating Reinforcement Learning from Human Feedback (RLHF)	HuggingFace	🔗
What Makes a Dialog Agent Useful?	HuggingFace	🔗
How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources	Yao Fu	🔗
What Is ChatGPT Doing … and Why Does It Work?	Stephen Wolfram	🔗
Why did all of the public reproduction of GPT-3 fail?	Jingfeng Yang	🔗
Pure Rust implementation of a minimal Generative Pretrained Transformer	Keyvan Kambakhsh	🔗

Codebase Mastery: Building with Perfection with In-Depth Articles

Title	Repository
Instruction-based data preparation using OpenAI	🔗
Optimal Fine-Tuning using the Trainer API: From Training to Model Inference	🔗
Efficient fine-tuning and inference of LLMs with PEFT and LoRA	🔗
Efficient fine-tuning and inference of LLMs with Accelerate	🔗
Curated list of articles	🔗

What I am learning

After immersing myself in the recent GenAI text-based language model hype for nearly a month, I have made several observations about its performance on my specific tasks.

Please note that these observations are subjective and specific to my own experiences, and your conclusions may differ.

We need models with at least 7B parameters (≥7B) for optimal natural language understanding performance. Models with fewer parameters show a significant drop in performance. However, models with more than 7 billion parameters require a GPU with more than 24GB of VRAM.
Benchmarks can be tricky as different LLMs perform better or worse depending on the task. It is crucial to find the model that works best for your specific use case. In my experience, MPT-7B is still the superior choice compared to Falcon-7B.
Prompts change with each model iteration. Therefore, multiple reworks are necessary to adapt to these changes. While there are potential solutions, their effectiveness is still being evaluated.
For fine-tuning, you need at least one GPU with more than 24GB of VRAM. A GPU with 32GB or 40GB of VRAM is recommended.
Fine-tuning only the last few layers to speed up LLM training/finetuning may not yield satisfactory results. I have tried this approach, but it didn't work well.
Loading 8-bit or 4-bit models can save VRAM. For a 7B model, instead of requiring 16GB, it takes approximately 10GB or less than 6GB, respectively. However, this reduction in VRAM usage comes at the cost of significantly decreased inference speed. It may also result in lower performance in text understanding tasks.
Those who are exploring LLM applications for their companies should be aware of licensing considerations. Models that are trained with another model as a reference and that require the original weights are not advisable in commercial settings.
There are three major types of LLMs: basic (like GPT-2/3), chat-enabled, and instruction-enabled. Most of the time, basic models are not usable as they are and require fine-tuning. Chat versions tend to be the best, but they are often not open-source.
Not every problem needs to be solved with LLMs. Avoid forcing a solution around LLMs. Similar to the situation with deep reinforcement learning in the past, it is important to find the most appropriate approach.
I tried LangChain and vector databases but ended up not using them, because I never needed them. Simple Python, embeddings, and efficient dot product operations worked well for me.
LLMs do not need to have complete world knowledge. Humans also don't possess comprehensive knowledge but can adapt. LLMs only need to know how to utilize the available knowledge. It might be possible to create smaller models by separating the knowledge component.
The next wave of innovation might involve simulating "thoughts" before answering, rather than simply predicting one word after another. This approach could lead to significant advancements.
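
The VRAM figures in the observations above can be sanity-checked with simple arithmetic. The sketch below estimates the memory taken by the weights alone, assuming 2 bytes per parameter for 16-bit, 1 byte for 8-bit, and half a byte for 4-bit storage. Activations, the attention cache, and framework overhead come on top of this and are not modeled here, which is one likely reason measured usage is higher than these lower bounds. The function name and the dtype labels are hypothetical.

```python
# Approximate storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params_7b = 7e9
for dtype in ("fp16", "int8", "int4"):
    print(f"7B weights in {dtype}: {weight_memory_gb(params_7b, dtype):.1f} GB")
# prints 14.0 GB, 7.0 GB and 3.5 GB respectively
```

Fine-tuning needs considerably more than inference, since gradients and optimizer state must also be held in memory. That is consistent with the observation that a GPU with more than 24GB of VRAM is needed.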

subhrajit-mohanty/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing