This is a personal project aimed to build an LLM based on Epic of Manas (Sayakbai Karalaev's version). This repository is a modified fork of nano-GPT by @karpathy.
We need to prepare our dataset first. There is an openly available Epic of Manas book (Sayakbai Karalaev's version) in PDF format that I was able to find. Before we begin training, we need our dataset to be in text format:
preprocessing.ipynb
is a simple script that convertsepic-of-manas.pdf
into text usingPyPDF2
.- Manual formatting was done to some extent to remove double line issues.
prepare.py
is a forked text to.bin
format train-test dataset converter taken from Andrej Karpathy's nanoGPT repository. This will be needed for more in-depth training.
If you don't have a powerful GPU and using a laptop/Macbook like I did in the beginning, you can run model_training_v1.py
to train Manas-GPT with limited number of hyperparameters (10.7 M). It took about ~5 hours on Macbook Air M2.
python model_training_v1.py
# outputs 500 tokens to manas_synthetic_text.txt
For this task, I used Google Colab Pro+, as it offers more powerful GPUs with background execution. Since Google Colab VMs flush all of its working directory after session timeouts, colab_runtime
has the files that need to be uploaded to Google Colab file directory for training. Note that you have to connect it to your GDrive for training models to be saved.
The main notebook that needs to be uploaded and ran is colab_runtime/manas_training_v2.ipynb
Model Feature | Value |
---|---|
Hyperparameters | 123.59M |
Tokens per epoch | 330,240 |
Dataset size | 263,517 words |
Pre-trained model is available on archive.org and can be downloaded and placed in model/manas-gpt-v2.pt
. To generate a sample response with prompt in prompt.txt
, run:
python sample.py
Training Setup | Value |
---|---|
Graphics Card | 1x NVIDIA L4 Tensor Core |
GPU RAM | 23,034MiB |
System RAM | 53GB |
Training Epochs | 6,000 |
Training Time | ~22 hours |
Here's a sample text from 263517 word long Epic of Manas:
Манастын туула элегиндеги бабалары
Түп атасы Түгөл кан,
үбүнөн Кыдыр даарыган,
Башкы атасы баары кан,
Башынан Кыдыр даарыган.
Түнөп өткөн жерине
Түптүү мазар орногон
Муну түбүнөн кудай оңдогон.
Басып өткөн жерине
Базарлуу калаа орногон,
Күн экени билбеген,
Кордукту коюп салыптыр.
Күтөрүп абиз ит эле,
Түнөрүп жерге кирген жок,
Букту элдеп билгенин
Өз рт экени билиптир.
Астындагы калың кол
Далай жолдон кармап токтоду,
Казыналык чоң Дөөдүр
Мойнунан чачымынып,
- Python 3.11.7
- Torch 2.4.0.dev20240414 (Nightly, supports M1/M2 Mac GPU processing)
- Conda 24.3.0
- Andrej Karpathy (@karpathy - OpenAI co-founder) for open-sourcing nano-GPT and releasing an amazing tutorial Let's build GPT: from scratch, in code, spelled out on YouTube!
- Murat Jumashev (@jumasheff) for meeting up with me in San Francisco, consulting on the project and connecting me with people in AI/DS area.
- Tilek Mamutov (LinkedIn) for connecting me with Murat Jumashev.