This is a Transformer based Large Language Model (LLM) training demo with only ~240 lines of code.
Inspired by nanoGPT, I wrote this demo to show how to train a LLM from scratch using PyTorch. The code is very simple and easy to understand. It's a good start point for beginners to learn how to train a LLM.
The demo is trained on a 450Kb sample textbook dataset, and the model size is about 51M. I trained on a single i7 CPU, and the training time takes about 20 minutes, result in approximately ~1.3M parameters.
- Install dependencies
pip install numpy requests torch tiktoken
- Run model.py
First time when you run it, the program will download the dataset and save to data
folder.
Then the model will start training on the dataset. Training & validation losses
will be printed on the console screen, something like:
Step: 0 Training Loss: 11.68 Validation Loss: 11.681
Step: 20 Training Loss: 10.322 Validation Loss: 10.287
Step: 40 Training Loss: 8.689 Validation Loss: 8.783
Step: 60 Training Loss: 7.198 Validation Loss: 7.617
Step: 80 Training Loss: 6.795 Validation Loss: 7.353
Step: 100 Training Loss: 6.598 Validation Loss: 6.789
...
The training loss will decrease as the training goes on. After 5000 iterations, the training will stop and the losses are down to around 2.807
. The model will be saved under name model-ckpt.pt
.
Then a sample text will be generated and pop to the console screen from the model we just trained, something like:
The salesperson to identify the other cost savings interaction towards a nextProps audience, and interactive relationships with them. Creating a genuine curiosityouraging a persuasive knowledge, focus on the customer's strengths and responding, as a friendly and thoroughly authority.
Encouraging open communication style to customers that their values in the customer's individual finding the conversation.2. Addressing a harmoning ConcernBIG: Giving and demeanor is another vital aspect of practicing a successful sales interaction. By sharing case studies, addressing any this compromising clearly, pis
It looks pretty descent!
Feel free to change some of the hyperparameters on the top of the model.py
file, and see how it affects the training process.
- Step-by-step Jupyter Notebook
I also provide a step-by-step Jupyter Notebook step-by-step.ipynb
to help you understand the architecture logic. To run this, you also need to insall:
pip install matplotlib pandas
This notebook prints out the intermediate results of each step followed by Transformer architecture from original paper, but only the Decoder part (Since GPT only use the decoder). So you can see how the model is trained each single step. For examples:
- what a [4,16] matrix of input embedding looks like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 627 1383 88861 279 1989 315 25607 16940 65931 323 32097 11 584 26458 13520 449
1 15749 311 9615 3619 872 6444 6 3966 11 10742 11 323 32097 13 3296 22815
2 13189 315 1701 5557 304 6763 374 88861 7528 10758 7526 13 4314 7526 2997 2613
3 323 6376 2867 26470 1603 16661 264 49148 627 18 13 81745 48023 75311 7246 66044
- the positional encoding plot of the input sequence:
- the attention matrix of the first Q * K layer:
- after applying Mask attention of the above matrix:
Under /GPT2
directory, I put some sample code to show how to fine-tune a pre-trained GPT2 model, as well as inference from it.
As if you're new to LLM, I recommend you to read my blog post Transformer Architecture: LLM From Zero-to-Hero , which breaks down the concepts of a Transformer architecture.
- nanoGPT Andrej Karpathy's famous video tutorial on how to build a GPT model from scratch.
- Transformers from Scratch A clear and easy implementation of Andrej's video contents by Mat Miller.
- Attention is all you need The original paper of Transformer architecture.