Everything-about-LLMs

Language models

Transformers

The figure below, from the original paper¹, shows the architecture of a transformer. There are already many excellent videos and blog posts that explain transformers very well, so I won't add yet another version here. You can google around or read the paper yourself. I will, though, provide an implementation of it here.

We will call the structure above the magenta line in the middle of the figure the encoder, and the structure below it the decoder, for the remainder of the Language models section.

BERT

GPT

This folder contains Karpathy's implementation of a mini version of GPT. You can run it to train a character-level language model on your laptop to generate Shakespearean (well, kind of 🙈) text. He made a very nice tutorial that walks through the code almost line by line. You can watch it here. If you are completely new to language modelling, this video may help you understand more of the basics.

You can find many more details about the code in Karpathy's original repo. The code in this folder has been trimmed down to the minimal running code.

LLaMA

Fine-tuning

LoRA

If you don't know what LoRA is, you can watch this YouTube video here, or read the LoRA paper² first.

Toy problem: I wrote a notebook to show how to fine-tune a reeeeaaaal simple binary classification model with LoRA, see here.
The real deal: of course, some amazing people already implemented LoRA as a library. Here's the notebook on how to fine-tune LLaMA 2 with the LoRA library.

QLoRA

As discussed in the LoRA for LLMs notebook, we only need to train about 12% of the original parameter count by applying this low-rank representation. However, we still have to load the entire model, as the low-rank weight matrix is added to the original weights. For the smallest Llama 2 model with 7 billion parameters, storing the parameters alone in 32-bit precision takes 28 GB of GPU memory (7 billion × 4 bytes), which makes it impossible to train on lower-end GPUs such as a T4 or V100.
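To make the two points above concrete, here is a minimal NumPy sketch. It is not the notebook's code: the layer size, the rank and the helper name `weight_memory_gb` are made up for illustration. It shows a low-rank update being added to a frozen weight, the fraction of parameters that are trainable for that toy layer, and the back-of-the-envelope memory needed for the weights alone (no activations, gradients or optimizer state).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                     # toy layer size and LoRA rank (assumed values)

W = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01      # trainable low-rank factor
B = np.zeros((d, r))                    # trainable, zero-initialised so W is unchanged at the start

W_adapted = W + B @ A                   # the low-rank update is added to the full weight

# Trainable parameters: r * (d + k), versus d * k in the full matrix.
trainable_fraction = (A.size + B.size) / W.size


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


print(f"trainable fraction for this toy layer: {trainable_fraction:.3f}")
for bits in (32, 16, 8, 4):
    print(f"7B parameters at {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

The fraction depends entirely on the chosen rank and layer shapes, so the number printed for this toy layer says nothing about the figure reported in the notebook.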

Therefore, (...drum rolls...) QLoRA³ was proposed. QLoRA loads the 4-bit quantized weights of a pretrained model and then applies LoRA to fine-tune the model. There are more technical details you may be interested in. If so, you can read the paper or watch this video here.

With the LoRA library (check the notebook), it is very easy to adopt QLoRA. All you need to do is specify it in the configuration, as below:

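import torch
from transformers import AutoModelForCausalLM

# Load Llama 2 7B chat with its weights quantized to 4 bits (requires the bitsandbytes package).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             device_map='auto',
                                             torch_dtype=torch.float16,
                                             use_auth_token=True,  # gated model: needs a Hugging Face access token
                                             load_in_4bit=True, # <------ *here*
                                             #  load_in_8bit=True,
                                             )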

Unfortunately, quantization leads to information loss. This is a tradeoff between memory and accuracy. If needed, there's also an 8-bit option.

By choosing to load the entire pre-trained model in 4-bit, we can fine-tune a 7-billion-parameter model on a single T4 GPU. Check out the RAM usage during training:
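To get a feel for the information loss, here is a toy sketch that quantizes some random weights and measures the round-trip error. Note that it uses plain symmetric absmax rounding, which is a stand-in chosen for simplicity and not the 4-bit data type described in the QLoRA paper; the function names are hypothetical.

```python
import numpy as np


def quantize_absmax(w: np.ndarray, bits: int):
    """Symmetric absmax quantization to signed integers with `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax                 # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integers back to approximate floating-point weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)       # stand-in for a weight tensor

err = {}
for bits in (4, 8):
    q, scale = quantize_absmax(w, bits)
    err[bits] = float(np.abs(w - dequantize(q, scale)).mean())
    print(f"{bits}-bit mean absolute round-trip error: {err[bits]:.4f}")
```

The 8-bit error comes out smaller than the 4-bit one, which is the memory versus accuracy tradeoff in miniature.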

RLHF

Multimodal models

CLIP

Conceptually, CLIP is very simple. The figure in the CLIP paper⁴ says it all.

For this visual-language application, step (1) in the figure needs a few components:

data: images with text describing them
a visual encoder to extract image features
a language encoder to extract text features
learn by maximising the similarity between the paired image and text features (indicated by the blue squares in the matrix in the figure) while minimising it for all the other pairs (contrastive learning)
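The contrastive objective in the last bullet can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's code: the function names are made up, and the temperature is fixed at an assumed value here, whereas CLIP learns it during training.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(image_features: np.ndarray, text_features: np.ndarray,
              temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the image-text similarity matrix."""
    # L2-normalise so the dot product is a cosine similarity.
    img = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)

    logits = img @ txt.T / temperature            # [n, n]; row i = image i against every text
    idx = np.arange(len(logits))                  # the matching pairs sit on the diagonal

    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # pick the right text per image
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # pick the right image per text
    return float((loss_i2t + loss_t2i) / 2)


rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))                      # stand-in for visual encoder outputs
texts = images + 0.1 * rng.normal(size=(8, 16))        # paired text features, close to their image
print("aligned pairs :", clip_loss(images, texts))
print("shuffled pairs:", clip_loss(images, texts[::-1]))
```

The loss is low when row i of the image features matches row i of the text features, and high when the pairing is scrambled.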

I wrote a (very) simple example in this notebook which implements and explains the contrastive learning objective, and describes the components in steps (2) and (3). However, I used the same style of text labels for training and testing. So no zero-shot here.

GLIDE

GLIDE⁵ is a text-to-image diffusion model with CLIP as the guidance. If you aren't familiar with diffusion models, you can watch this video for a quick explanation of the concept. If you want more technical details, you can start with these papers: diffusion generative model⁶, DDPM⁷, DDIM⁸, and a variational perspective of diffusion models⁹.

The classifier-guided diffusion model was initially proposed by Dhariwal & Nichol¹⁰. The main idea is this: if we want to generate an image of a dog, each denoising step during sampling is nudged in the direction that makes a good classifier (one trained on noisy images) more confident that the image is a dog, see the illustration below.

It seems to be a great idea. However, this approach requires training a separate classifier. Moreover, it is not easy to train a good classifier that works perfectly for the generated images. For this reason, Ho and Salimans¹¹ proposed classifier-free guidance, see the figure below. $c$ is the condition (for example, text) on which the images are generated, and $\emptyset$ indicates no condition. So the model is encouraged to move in the direction where images are generated with the condition, and away from the unconditional direction.

GLIDE uses a CLIP model to replace the classifier in the classifier-guided diffusion model.
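At sampling time, classifier-free guidance boils down to one line: the model predicts the noise twice, once with the condition $c$ and once with $\emptyset$, and the two predictions are extrapolated. Below is a minimal sketch of that combination step, written in the form used in the GLIDE paper. The function name and the toy arrays are hypothetical stand-ins for real model outputs.

```python
import numpy as np


def classifier_free_guidance(eps_cond: np.ndarray, eps_uncond: np.ndarray,
                             guidance_scale: float) -> np.ndarray:
    """Combine conditional and unconditional noise predictions.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the plain conditional prediction,
    guidance_scale > 1 pushes further in the direction of the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
eps_c = rng.normal(size=(4, 4))       # stand-in for the model output given c
eps_null = rng.normal(size=(4, 4))    # stand-in for the model output given the empty condition
eps_guided = classifier_free_guidance(eps_c, eps_null, guidance_scale=3.0)
print(eps_guided.shape)
```

The guided prediction then replaces the model output in whichever sampler is used.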

DALL·E 2

DALL·E 2 is another conceptually simple model that produces amazing results.

The first half of the model is a pre-trained CLIP (frozen once trained), i.e., the part above the dashed line in the figure in the DALL·E 2 paper¹², see below.

In CLIP, we have trained two encoders to extract features from image and text inputs. DALL·E 2 extracts the text embeddings with a pretrained CLIP model and passes them through a prior model to predict the corresponding image embeddings. This prior can be an autoregressive model or a diffusion model. Finally, a diffusion model produces an image conditioned on the image embeddings from the prior and, optionally, on the CLIP text embeddings.

DALL·E 2 also enables classifier-free CLIP guidance, as used in GLIDE, by randomly setting the CLIP embeddings to zero 10% of the time during training.
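The random dropping of the condition can be sketched as below. This is an illustration under assumptions rather than the paper's code: `drop_condition` is a hypothetical helper that zeroes each embedding in a batch independently with probability 0.1.

```python
import numpy as np


def drop_condition(embeddings: np.ndarray, drop_prob: float = 0.1, rng=None):
    """Zero out each row of `embeddings` independently with probability `drop_prob`."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(len(embeddings)) >= drop_prob     # True where the condition is kept
    return embeddings * keep[:, None], keep


rng = np.random.default_rng(0)
clip_embeddings = rng.normal(size=(1000, 8))            # stand-in for a batch of CLIP embeddings
dropped, keep = drop_condition(clip_embeddings, drop_prob=0.1, rng=rng)
print(f"fraction of embeddings zeroed: {1 - keep.mean():.3f}")
```

Because the model sees both conditioned and zeroed-out inputs during training, the same network can later produce the conditional and unconditional predictions that classifier-free guidance needs.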

Stable Diffusion

Stable diffusion is a latent diffusion model¹³ which operates in the latent space instead of the original data space, see the figure below.

The red-shaded and green-shaded parts illustrate the pixel space and the latent space, respectively. First, stable diffusion uses a pretrained model (VQGAN¹⁴ in this case) to encode images into latents and to decode the learned latents back into images. In practice, this reduces a high-resolution image to a latent representation of shape [64 x 64 x number of channels].

In the latent space, stable diffusion deploys a diffusion model equipped with a denoising U-Net. In each denoising diffusion step, the U-Net is augmented with a cross-attention mechanism, which computes the attention between the noisy latent and whichever conditions are included in the model for conditional generation.
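The cross-attention step can be sketched as follows: the queries come from the (flattened) noisy latent, while the keys and values come from the conditioning, for example text token embeddings. This is a single-head NumPy sketch with made-up shapes and randomly initialised projection matrices, not the Stable Diffusion implementation.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latent, context, W_q, W_k, W_v):
    """Single-head cross-attention: latent positions attend to condition tokens."""
    Q = latent @ W_q                          # [n_latent, d] queries from the noisy latent
    K = context @ W_k                         # [n_ctx, d]    keys from the condition
    V = context @ W_v                         # [n_ctx, d]    values from the condition
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product similarity
    weights = softmax(scores, axis=-1)        # each latent position distributes attention over tokens
    return weights @ V, weights


rng = np.random.default_rng(0)
latent = rng.normal(size=(16, 4))             # toy 4x4 latent with 4 channels, flattened
context = rng.normal(size=(5, 8))             # 5 condition tokens of dimension 8
W_q = rng.normal(size=(4, 6))
W_k = rng.normal(size=(8, 6))
W_v = rng.normal(size=(8, 6))

out, weights = cross_attention(latent, context, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one, so every latent position receives a weighted mix of the condition's value vectors.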

No surprise that classifier-free CLIP guidance is also used in stable diffusion. The "switch" in the figure above controls whether the model produces images with or without conditions.

Image to Image

An implementation of Stable Diffusion Image-to-Image can be found here. Alternatively, you can also play with this notebook.

Engineering magic for training an LLM

Memory Optimization: ZeRO

Model parallelism: MegatronLM

Pipeline Parallelism

Checkpointing and Deterministic Training

FlashAttention

KV caching

Gradient checkpointing

Data efficiency

References:

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.
2. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
3. Dettmers, T., Pagnoni, A., Holtzman, A. and Zettlemoyer, L., 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
4. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. and Krueger, G., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
5. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I. and Chen, M., 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.
6. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. and Ganguli, S., 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning (pp. 2256-2265). PMLR.
7. Ho, J., Jain, A. and Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, pp.6840-6851.
8. Song, J., Meng, C. and Ermon, S., 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
9. Kingma, D., Salimans, T., Poole, B. and Ho, J., 2021. Variational diffusion models. Advances in neural information processing systems, 34, pp.21696-21707.
10. Dhariwal, P. and Nichol, A., 2021. Diffusion models beat GANs on image synthesis. Advances in neural information processing systems, 34, pp.8780-8794.
11. Ho, J. and Salimans, T., 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
12. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. and Chen, M., 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
13. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
14. Esser, P., Rombach, R. and Ommer, B., 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12873-12883).

tianlinxu312/Everything-about-LLMs