- Gen-AI Lifecycle
- Transformers
- In-depth Understanding of Transformer architecture step by step
- History/Evolution tree of LLM models
- Transformer model code from Scratch
- Prompt engineering and Prompting types
- Generative AI Configurations
- Fine-tuning LLMs with PEFT & LoRA
- Fine-tuning LLMs with RLHF
- LLMs in applications
Here is a diagram of the Gen-AI lifecycle.
-
Scope: Defining the scope of the LLM as accurately and narrowly as possible is very important for the use case, because LLMs are capable of carrying out many different tasks depending on the model's size and architecture. Getting really specific about what you need your model to do can save you time and compute cost.
Examples of specific tasks include a Q&A bot, text summarization, or named-entity recognition.
-
Select: In this stage it's important to decide whether to train our own model from scratch or work with an existing base model.
-
Adapt & Align Model: With our model in hand, the next step is to assess its performance and carry out additional training if needed for our application.
Prompt engineering can sometimes be enough to get our model to perform well, so we'll likely start by trying in-context learning, using examples suited to our task and use case.
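For instance, a one-shot in-context prompt for sentiment classification might look like the hypothetical example below, where the single labeled example teaches the model the task format (the reviews and labels here are made up for illustration):

```python
# Hypothetical one-shot prompt: the labeled example shows the model the task
# format, and the model is expected to complete the final line with "Negative".
prompt = """Classify the sentiment of the review as Positive or Negative.

Review: The course content was clear and well organized.
Sentiment: Positive

Review: The audio kept cutting out and the examples were confusing.
Sentiment:"""

print(prompt)  # send this prompt to any base or instruction-tuned LLM
```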
There are still cases, however, where the model may not perform as well as we need, even with one-shot or few-shot inference. In that case we can try fine-tuning our model, a supervised learning process that continues training the LLM on labeled examples for our task.
As models become more capable, it's becoming increasingly important to ensure that they behave well and in a way that is aligned with human preferences when deployed. An additional fine-tuning technique, reinforcement learning with human feedback (RLHF), can help make sure that your model behaves well.
An important aspect of all of these techniques is evaluation. We will explore some metrics and benchmarks that can be used to determine how well your model is performing and how well it is aligned with our preferences.
Note that this adapt-and-align stage of application development can be a highly iterative process; we repeat it until the model's performance is stable enough for our criteria and needs.
-
Application Integration: At this stage, an important step is to optimize our model for deployment and then create front-end applications that use our customized LLM.
Limitations: There are some fundamental limitations of LLMs that can be difficult to overcome through training alone, such as their tendency to invent information when they don't know an answer, or their limited ability to carry out complex reasoning and mathematics.
LLMs: more parameters -> more memory -> (generally) better models
So it's important to understand the difference between parameters and hyperparameters.
-
Parameters are variables that are learned by the model from the data, such as weights and biases. They allow the model to learn the rules from the data. This is why models with billions of parameters (e.g. GPT-3 with 175B, BLOOM with 176B) generally perform better than models with millions of parameters (e.g. BERT with 110M).
-
Hyperparameters are variables that are set manually before training, such as learning rate, batch size, number of layers, etc. They control how the model is trained and how the parameters are updated.
Some examples of parameters are the attention weights, the feed-forward network weights, and the embeddings.
Some examples of hyperparameters are the number of heads, the hidden size, the dropout rate, and the warmup steps.
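As a quick sanity check of the parameter counts quoted above, we can load a pretrained BERT-base model and count its learned parameters. This is a minimal sketch using the Hugging Face transformers library; the hyperparameters printed at the end are read from the model's config:

```python
# Minimal sketch: count the learned parameters (weights, biases, embeddings)
# of a pretrained BERT-base model using the Hugging Face transformers library.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
print(f"Learned parameters: {model.num_parameters():,}")   # roughly 110 million

# Hyperparameters, by contrast, live in the model config and training setup:
print(model.config.num_attention_heads, model.config.hidden_size)
```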
Some examples of LLMs:
“Attention is All You Need” by Vaswani et al. (2017) was the paper that introduced the transformer architecture, which can:
- Scale efficiently to multi-core GPUs.
- Process input data in parallel, making use of much larger training datasets.
- Learn to pay attention to the meaning of the words it's processing.
The power of LLMs comes from the transformer model architecture used to train them, compared with older architectures like RNNs.
The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence, not just neighboring words.
It applies attention weights to those relationships so that the model learns the relevance of each word to every other word, no matter where they are in the input.
Based on this sentence alone, the model is able to answer questions such as:
- Who has the book?
- Who could have the book?
In a nutshell, it has the ability to understand the context of the document given to it.
In the above diagram we can see that the word "book" is strongly connected with, or paying attention to, the words "teacher" and "student".
This is called self-attention, and the ability to learn attention in this way across the whole input significantly improves the model's ability to encode language.
Self-attention is a key attribute of the transformer architecture. Let's dive into the transformer architecture diagram.
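To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention over a toy input; the token embeddings and projection matrices are random placeholders, not values from any trained model:

```python
# Minimal NumPy sketch of scaled dot-product self-attention over a toy
# 4-token sequence; all values are random and purely illustrative.
import numpy as np

np.random.seed(0)
seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)                  # toy token embeddings

# Learned projections (random here) map embeddings to queries, keys, values
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: how relevant each word is to every other word,
# regardless of position in the input
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax

output = weights @ V                                   # context-aware token representations
print(weights.round(2))                                # each row sums to 1
```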
-
Encoder only(Auto-encoding): Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence.
Objective: The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.
Original text: The teacher teaches the student
MLM (masked language modeling): The teacher <mask> the student
Reconstructed text (denoised): The teacher teaches the student
Context: bidirectional (the model attends to words on both sides of the mask)
Use-cases:
- Sentiment Analysis
- Named Entity Recognition
- Word Classification
Models:
- BERT
- DistilBERT
- RoBERTa
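As an illustration of the masked-language-modeling objective, an encoder-only model can reconstruct the masked word. Below is a minimal sketch using the Hugging Face fill-mask pipeline; the BERT checkpoint is just one example:

```python
# Minimal sketch: a pretrained encoder-only model (BERT) fills in a masked
# word, mirroring the MLM objective described above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The teacher [MASK] the student."):
    print(prediction["token_str"], round(prediction["score"], 3))
```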
-
Encoder-Decoder(Sequence to Sequence): Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
Objective: T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.
Span corruption: The teacher <mask> <mask> student
Sentinel token (replaces the masked span): The teacher <X> student
Reconstructed span: The teacher <teaches> <the> student
Use-cases:
- Translation
- Text Summarization
- Generative question answering
Models:
- BART
- T5
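As an illustration of the sequence-to-sequence setup, a T5 checkpoint can be prompted with a task prefix such as translation. This is a minimal sketch; t5-small is used purely as an example:

```python
# Minimal sketch: an encoder-decoder model (T5) performs translation via the
# text2text-generation pipeline, an example of a sequence-to-sequence task.
from transformers import pipeline

translator = pipeline("text2text-generation", model="t5-small")
print(translator("translate English to German: The teacher teaches the student."))
```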
-
Decoder only(autoregressive models):
Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
Objective: The pretraining of decoder models usually revolves around predicting the next word in the sentence.
Original text: The teacher teaches the student
Causal language modeling: The teacher ? ? ?
Predict the next word(s): The teacher teaches the student
Context: unidirectional (the model only attends to words before the current position)
Use-cases:
- Text generation
Models:
- GPT
- GPT2
- BLOOM
- BARD
- CLAUDE
- PaLM
- LLAMA,LLAMA2
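To see the autoregressive next-word objective in action, a small decoder-only checkpoint can continue a prompt. A minimal sketch follows; gpt2 is used purely as an example:

```python
# Minimal sketch: a decoder-only model (GPT-2) generates text one token at a
# time, each step conditioned only on the words that came before it.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The teacher teaches the", max_new_tokens=15)[0]["generated_text"])
```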
The evolutionary tree of modern LLMs traces the development of language models in recent years and highlights some of the most well-known models. Models on the same branch have closer relationships.
Transformer-based models are shown in non-grey colors: decoder-only models in the blue branch, encoder-only models in the pink branch, and encoder-decoder models in the green branch.
The vertical position of the models on the timeline represents their release dates. Open-source models are represented by solid squares, while closed-source models are represented by hollow ones.
To view an animated version of the evolution tree, click here
The stacked bar plot in the bottom right corner shows the number of models from various companies and institutions.
For more details you can refer to this link
Data-Centric AI concepts behind GPT models
Try out the GPT-2 transformer yourself at this live URL: https://transformer.huggingface.co/doc/gpt2-large
We will look into the actual implementation of transformer model code and its concept in detail for better understanding.
We will cover the following topics in prompt engineering and prompting types:
-
What is prompting, prompt engineering & In-Context learning?
-
Designing Prompts for different tasks
- Prompt Engineering techniques
- Few-shot Prompts
- Chain-of-Thought CoT Prompting
- Tree of thoughts(TOT)
- Self-Consistency
- Generate Knowledge Prompting
- ReAct
- Applications
- Program-Aided Language models (PAL)
- Generating Data
- Generating Code
- prompt functions
- Risks
- prompt Injection
- prompt leaking
- Jail-breaking
-
Temperature: The temperature should be set according to the task and domain expectations. A higher temperature value of 0.7 to 0.9 may be desired, as it can produce more original and diverse texts.
-
Maximum length or tokens: Set the word/token limit; it makes your responses much cleaner.
Bear in mind that you can only return 2048 tokens per response (a token is roughly three-quarters of an English word). Anything longer may result in the response being cut off.
Don't worry, just prompt "continue" and it should keep going (you may need to copy and paste the last sentence or two).
-
Top p: A hyperparameter that controls the cumulative probability of the candidate tokens that the model can choose from.
A lower top p means that only the most probable tokens are considered, while a higher top p means that more tokens are considered.
Difference between Top p and Temperature
Temperature and top_p are two parameters that affect the randomness of the output of a language model, such as GPT-3.
Temperature affects the confidence of the model in its top choices, while top_p affects the number of choices that the model considers.
A low temperature makes the output more deterministic and less diverse, while a high temperature makes the output more stochastic and more diverse.
Top_p sampling keeps only the smallest set of most probable tokens whose cumulative probability mass exceeds the threshold p, and samples from that set.
The table above is from an OpenAI community blog post.
-
Frequency penalty: A hyperparameter that controls the repetition of words or phrases in the generated text.
A higher frequency penalty means less repetition, while a lower frequency penalty means more repetition.
-
Presence penalty: A hyperparameter that controls the novelty of words or phrases in the generated text.
A higher presence penalty means more novelty, while a lower presence penalty means more familiarity.
For example, for creative writing, a higher presence penalty value of 0.6 to 0.8 may be desired, as it can encourage the generation of new and original ideas.
For text summarization, a lower presence penalty value of 0.2 to 0.4 may be preferred, as it can ensure the consistency and relevance of the summaries.
Article to understand difference between Frequency Vs Presence Penalty
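These configuration parameters map directly onto generation API arguments. Below is a minimal sketch using the OpenAI Python SDK; the model name and values are illustrative assumptions, and similar knobs exist under comparable names in most LLM APIs:

```python
# Minimal sketch: setting the generation configurations discussed above via
# the OpenAI Python SDK (model name and values are illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",          # assumed model for illustration
    messages=[{"role": "user", "content": "Write a short poem about transformers."}],
    temperature=0.8,                # higher -> more diverse, creative output
    top_p=0.9,                      # nucleus sampling threshold
    max_tokens=200,                 # cap on the length of the completion
    frequency_penalty=0.5,          # discourage repeating the same words
    presence_penalty=0.6,           # encourage introducing new words/topics
)
print(response.choices[0].message.content)
```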
(AKA parameter fine tuning)
-
Update more layers = better model performance
-
Full fine-tuning typically produces one model per task
- Serve one model per task
- May forget other pre-trained tasks: catastrophic forgetting
-
Full fine-tuning LLMs is expensive. How to avoid it?
- X-shot learning (we have seen this approach in prompt engineering)
- Parameter-efficient fine tuning
- Increasing compute power
- Increasing file size of model
PEFT is a method that employs various techniques, including LoRA, to efficiently fine-tune large language models.
LoRA (Low-Rank Adaptation) focuses on adding a small number of extra trainable weights to the model while freezing most of the pre-trained network's parameters. This approach helps prevent catastrophic forgetting, a situation where a model forgets what it was originally trained on during full fine-tuning.
The research paper on LoRA was published by Microsoft researchers in 2021. A library named loralib was also released on GitHub, and in February 2023 LoRA support was added to Hugging Face's PEFT library.
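Here is a minimal sketch of attaching LoRA adapters to a pretrained model with the PEFT library; the base checkpoint, rank, and target modules are illustrative assumptions, not values from this document:

```python
# Minimal sketch: wrap a pretrained model with LoRA adapters using the
# Hugging Face PEFT library; only the small LoRA matrices are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # assumed base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=32,                # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],    # attention projection layers in GPT-2
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```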
The following are examples of fine-tuning tasks:
1. Example: classify the review
2. Example: summarize or translate the sentence
For this we first need to generate data with task-specific examples.
We can use a prompt instruction template to generate the instruction data.
Let's look at what the instruction data looks like on Hugging Face.
We freeze the existing weights of the pretrained model and perform training on the new instruction data using LoRA.
Blog to understand how to calculate the rank of a matrix
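As a quick illustration of the low-rank idea (this toy example is mine, not from the linked blog): a matrix built from a couple of outer products has low rank and can be stored as two thin factors, which is exactly how LoRA represents its weight updates:

```python
# Toy illustration: a 6x6 matrix built from 2 outer products has rank 2, so it
# can be stored as two thin factors A (6x2) and B (2x6) instead of 36 numbers.
import numpy as np

np.random.seed(0)
A = np.random.randn(6, 2)          # "down" projection factor
B = np.random.randn(2, 6)          # "up" projection factor
delta_W = A @ B                    # full-size update matrix, but only rank 2

print(np.linalg.matrix_rank(delta_W))         # -> 2
print(A.size + B.size, "vs", delta_W.size)    # 24 stored numbers vs 36
```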
We can see that beyond rank 8 there is not much improvement and accuracy remains consistent.
Now we can start with Code-walkthrough in notebook.
As a next step, we can automate the fine-tuning even further using the Hugging Face AutoTrain library.
pip install autotrain-advanced
Once we have pip-installed autotrain-advanced and placed our data in the data folder on the local system, use the following code:
# set the hyperparameters and variables
project_name = 'my_autotrain_llm'
model_name = 'user/llama-2-7b-hf-small-shards'
push_to_hub = False                     # set True to push the trained model to the Hugging Face Hub
hf_token = 'YOUR_HF_TOKEN'              # placeholder: your Hugging Face access token
repo_id = 'username/my_autotrain_llm'   # placeholder: target Hub repository
learning_rate = 2e-4
num_epochs = 1
batch_size = 1
block_size = 1024
trainer = "sft"
warmup_ratio = 0.1
weight_decay = 0.01
gradient_accumulation = 4
use_fp16 = True
use_peft = True
use_int4 = True
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
# store all parameters in environment variable
import os
os.environ["PROJECT_NAME"] = project_name
os.environ["MODEL_NAME"] = model_name
os.environ["PUSH_TO_HUB"] = str(push_to_hub)
os.environ["HF_TOKEN"] = hf_token
os.environ["REPO_ID"] = repo_id
os.environ["LEARNING_RATE"] = str(learning_rate)
os.environ["NUM_EPOCHS"] = str(num_epochs)
os.environ["BATCH_SIZE"] = str(batch_size)
os.environ["BLOCK_SIZE"] = str(block_size)
os.environ["WARMUP_RATIO"] = str(warmup_ratio)
os.environ["WEIGHT_DECAY"] = str(weight_decay)
os.environ["GRADIENT_ACCUMULATION"] = str(gradient_accumulation)
os.environ["USE_FP16"] = str(use_fp16)
os.environ["USE_PEFT"] = str(use_peft)
os.environ["USE_INT4"] = str(use_int4)
os.environ["LORA_R"] = str(lora_r)
os.environ["LORA_ALPHA"] = str(lora_alpha)
os.environ["LORA_DROPOUT"] = str(lora_dropout)
Run the command below in a shell or notebook:
!autotrain llm \
--train \
--model ${MODEL_NAME} \
--project-name ${PROJECT_NAME} \
--data-path data/ \
--text-column text \
--lr ${LEARNING_RATE} \
--batch-size ${BATCH_SIZE} \
--epochs ${NUM_EPOCHS} \
--block-size ${BLOCK_SIZE} \
--warmup-ratio ${WARMUP_RATIO} \
--lora-r ${LORA_R} \
--lora-alpha ${LORA_ALPHA} \
--lora-dropout ${LORA_DROPOUT} \
--weight-decay ${WEIGHT_DECAY} \
--gradient-accumulation ${GRADIENT_ACCUMULATION}
Fine-tuning an LLM further with human feedback helps ensure that the model is more aligned with human values and is not toxic in nature.
In 2020, the finding below was published by OpenAI.
We can observe that fine-tuning with human feedback improved model performance compared with initial (supervised) fine-tuning and with no fine-tuning.
Below is a complete pipeline view of building a fine-tuned LLM with RLHF to create a very specific LLM grounded in human values. There can be many applications of this, such as an individualized learning plan bot or a personal assistant bot.
In RL we have an agent and an environment, where the objective is to maximize the reward received for actions.
Example of RL in tic-tac-toe
Reward Model: This is a custom-trained ML model that gives a reward to the agent based on the response generated by the LLM and how closely it matches the expected response.
Example: A custom toxicity classification model can be trained on text data and then used to evaluate the LLM's responses for toxicity.
Note: Training a reward model requires us to prepare human-feedback data in which the candidate responses for each prompt are ranked; these are often called human-ranked completion pairs.
The RL policy-optimization algorithm used in RLHF is PPO (Proximal Policy Optimization).
Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm used to train agents to perform tasks in an environment. It is an iterative algorithm that improves the policy through trial and error.
For a detailed understanding of PPO, you can refer to this link.
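To make the RLHF update loop concrete, here is a minimal sketch following the quickstart pattern of Hugging Face's TRL library (classic PPOTrainer API); the base model, prompt, and constant reward are illustrative assumptions, and in a real pipeline the reward would come from the trained reward model described above:

```python
# Minimal RLHF sketch with TRL's PPOTrainer: generate a response to a query,
# score it with a reward, and take one PPO step to update the policy model.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, create_reference_model

model_name = "gpt2"  # assumed small base model, purely for illustration
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = create_reference_model(model)       # frozen copy keeps updates "proximal"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

query_tensor = tokenizer.encode("The teacher said:", return_tensors="pt")
response_tensor = ppo_trainer.generate(list(query_tensor), return_prompt=False, max_new_tokens=20)

# Here a constant stands in for the reward model's score (e.g. a toxicity or
# helpfulness classifier would normally rate the generated response).
reward = [torch.tensor(1.0)]

stats = ppo_trainer.step(list(query_tensor), list(response_tensor), reward)  # one PPO update
```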
Finally, the complete flow of the LLM model weight update is as follows:
We can stop the training by setting a maximum number of iterations or a threshold value for the helpfulness of responses.
Usually ~20,000 iterations is good enough.
- Take a set of proprietary documents
- Split them up into smaller chunks
- Create an embedding for each chunk
- Create an embedding for the query
- Find the most similar documents in the embedding space
- Pass those documents, along with the original query, into a language model to generate an answer
Let's look at a live example of a chatbot that uses a PDF document.
Code is in the folder llm-apps/llama2-using-chainlit
Tools we are going to leverage (a minimal sketch using them follows this list):
- LangChain for the QA retriever, PDF loading, and text splitting
- FAISS as the vector storage DB
- HuggingFaceEmbeddings (sentence-transformers) for creating embeddings of the text chunks
- Chainlit for the chatbot interface
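Here is a minimal sketch of the retrieval flow described above using these tools (classic LangChain API); the PDF path, embedding model, and local Llama 2 checkpoint are illustrative assumptions and differ from the exact Chainlit app in the repo:

```python
# Minimal retrieval-augmented QA sketch: load a PDF, split it into chunks,
# embed the chunks into a FAISS index, then answer a query with an LLM.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA

# 1. Load the proprietary document and split it into smaller chunks
docs = PyPDFLoader("data/example.pdf").load()                     # assumed path
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# 2. Create an embedding for each chunk and store them in FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(chunks, embeddings)

# 3. Load a local Llama 2 model (GGML weights via ctransformers) as the answer generator
llm = CTransformers(model="TheBloke/Llama-2-7B-Chat-GGML", model_type="llama",
                    config={"max_new_tokens": 256, "temperature": 0.1})

# 4. Embed the query, retrieve the most similar chunks, and generate an answer
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=db.as_retriever(search_kwargs={"k": 2}))
print(qa.run("What is this document about?"))
```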