
💬 Language model



This repository contains the code to train and test autoregressive language models like ChatGPT from scratch. I also used it to train the French open-source DimensionGPT models.





🤖 DimensionGPT

Using this repository, I trained DimensionGPT-0.2B, a small 0.2B-parameter language model, on 50B tokens with my personal RTX 3090 GPU in ≈570 hours.


๐Ÿ—๏ธ Architecture

The model is based on the decoder part of the transformer architecture from the paper Attention Is All You Need by Google Brain (2017), with a few improvements.


Here are the main parameters of the architecture:

| Parameter | Value |
| --- | --- |
| Embedding dimension | 1,024 |
| Number of layers | 16 |
| Heads dimension | 64 |
| Feed forward hidden dimension | 2,730 |
| Number of heads | 16 |
| Number of grouped heads | 4 |
| Window size | 256 |
| Context length | 512 |
| Vocab size | 32,000 |

The resulting model has 208,929,792 trainable parameters and fits on a single RTX 3090 GPU for training with a batch size of 16 using mixed precision. For inference only, it should fit on any modern GPU.
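
As a sanity check, the parameter count can be reproduced by hand from the table above. The breakdown below is only one plausible accounting: it assumes tied input/output embeddings, RMSNorm, grouped-query attention, and a gated (SwiGLU-style) feed-forward block, none of which are spelled out explicitly in this README.

```python
# Hypothetical parameter-count breakdown; assumes tied embeddings, RMSNorm,
# grouped-query attention and a gated (SwiGLU-style) feed-forward block.
d_model     = 1024     # embedding dimension
n_layers    = 16
n_heads     = 16
n_kv_heads  = 4        # grouped heads
head_dim    = 64
d_ff        = 2730     # feed forward hidden dimension
vocab_size  = 32_000

embeddings   = vocab_size * d_model                        # shared with the output head
attention    = (d_model * n_heads * head_dim               # query projection
                + 2 * d_model * n_kv_heads * head_dim      # key / value projections
                + n_heads * head_dim * d_model)            # output projection
feed_forward = 3 * d_model * d_ff                          # gate, up and down projections
norms        = 2 * d_model                                 # two norms per layer

total = embeddings + n_layers * (attention + feed_forward + norms) + d_model
print(f"{total:,}")  # 208,929,792 under these assumptions
```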


💾 Data

The dataset used to train this model is exclusively in French and is a mix of multiple sources:

| Source | Documents | Tokens | Multiplier | Ratio |
| --- | --- | --- | --- | --- |
| Common Crawl (FR) | 21,476,796 | 35,821,271,160 | 1.0 | 76.89 % |
| Wikipedia (FR) | 2,700,373 | 1,626,389,831 | 4.0 | 13.96 % |
| French news articles | 20,446,435 | 11,308,851,150 | 0.3 | 7.28 % |
| French books | 29,322 | 2,796,450,308 | 0.2 | 1.20 % |
| French institutions documents | 87,103 | 147,034,958 | 2.0 | 0.63 % |
| Others | 2,761 | 7,287,322 | 2.0 | 0.03 % |
| Total | 44,742,790 | 51,707,284,729 | - | 100.00 % |

For tokenization, I created my own tokenizer that first cleans the text to keep only a predefined set of characters, then uses the Byte Pair Encoding (BPE) algorithm to build the vocabulary. I trained the tokenizer on a 300-million-character subset of the dataset to obtain the 32,000-token vocabulary.
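
The repository ships its own tokenizer implementation, so the snippet below is only a rough equivalent of the two steps described above (character filtering, then BPE training), written with the Hugging Face tokenizers library; the allowed character set shown here is an assumption.

```python
# Rough equivalent of the tokenization pipeline described above, using the
# Hugging Face `tokenizers` library (not the repository's own implementation).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Step 1: keep only a predefined set of characters (this exact set is an assumption).
ALLOWED_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "àâäçéèêëîïôöûùüÿœÀÂÄÇÉÈÊËÎÏÔÖÛÙÜŸŒ0123456789 .,;:!?'\"()-\n"
)

def clean(text: str) -> str:
    return "".join(c for c in text if c in ALLOWED_CHARS)

# Step 2: train a 32,000-token BPE vocabulary on a cleaned subset of the corpus.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<unk>"])

with open("subset.txt", encoding="utf-8") as f:  # ~300 million characters of text
    tokenizer.train_from_iterator((clean(line) for line in f), trainer=trainer)

tokenizer.save("tokenizer.json")
```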


🦾 Training

For training I used the AdamW optimizer with a warmup and cosine decay learning rate schedule. Here are the main hyperparameters (an equivalent setup is sketched after the table):

| Hyperparameter | Value |
| --- | --- |
| Batch size (tokens) | 524,288 |
| Optimizer | AdamW |
| Learning rate | 6.0 × 10⁻⁴ |
| Warmup steps | 2,000 |
| Decay steps | 100,000 |
| β₁ | 0.9 |
| β₂ | 0.95 |
| ε | 10⁻⁵ |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
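
The schedule itself is implemented in the repository's notebooks; the snippet below is only a sketch of an equivalent setup in plain PyTorch using the values from the table (the linear warmup shape and the decay to zero are assumptions).

```python
# Sketch of an equivalent AdamW + warmup/cosine-decay setup in plain PyTorch
# (illustrative only; warmup shape and final learning rate are assumptions).
import math
import torch

WARMUP_STEPS = 2_000
DECAY_STEPS  = 100_000
MAX_LR       = 6e-4

model = torch.nn.Linear(8, 8)  # stand-in for the transformer defined in the repo

optimizer = torch.optim.AdamW(
    model.parameters(), lr=MAX_LR, betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1
)

def lr_factor(step: int) -> float:
    if step < WARMUP_STEPS:                                      # linear warmup
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
```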

I trained the model on my personal RTX 3090 GPU for 1 epoch on the full dataset (about 13 times the Chinchilla-optimal number of tokens for a model of this size), using mixed precision and gradient accumulation to increase speed and reduce memory usage (a generic sketch of such a training step follows the summary table):

| Training summary | |
| --- | --- |
| Tokens | 52,428,800,000 |
| Steps | 100,000 |
| FLOPs | 6.6 × 10¹⁹ |
| Duration | 573 hours |
| Final loss | 2.19 |
| Final accuracy | 54.8 % |
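
For reference, the FLOPs figure is consistent with the usual 6 × N × D estimate for dense transformers: 6 × 208,929,792 parameters × 52,428,800,000 tokens ≈ 6.6 × 10¹⁹.

The mixed-precision and gradient-accumulation logic lives in the training notebook; the snippet below is only a generic PyTorch sketch of such a training step, continuing from the optimizer and scheduler sketch above. `model` and `loader` are placeholders, and the 64 accumulation steps come from 524,288 tokens per batch ÷ (16 × 512) tokens per micro-batch.

```python
# Generic mixed-precision + gradient-accumulation step (illustrative only, not
# the repository's code); `model` and `loader` are assumed to exist.
import torch

ACCUMULATION_STEPS = 64  # 64 micro-batches of 16 x 512 tokens = 524,288 tokens

scaler = torch.cuda.amp.GradScaler()

for step, (inputs, targets) in enumerate(loader):
    with torch.cuda.amp.autocast():
        logits = model(inputs.cuda())
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.cuda().view(-1)
        )
    scaler.scale(loss / ACCUMULATION_STEPS).backward()  # accumulate scaled gradients

    if (step + 1) % ACCUMULATION_STEPS == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```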


🪛 Fine-tuning

I fine-tuned the model on the French instruction dataset I made for this project to create DimensionGPT-0.2B-Chat, a 0.2B language model trained to follow instructions and answer questions in French.


🧪 Tests

Here are some examples of the model outputs:


๐ŸŽ›๏ธ Weights

The trained weights of the different models are available on Google Drive. To use them, you just need to:

  • Download the .pt file of the model you want to use and put it in the models folder
  • Download the vocab.txt file and put it in the data folder
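
To quickly check a downloaded checkpoint, something like the sketch below should work; the file name is hypothetical, and depending on how the checkpoint is structured you may need to index into it first.

```python
# Quick sanity check of a downloaded checkpoint (the file name is hypothetical).
import torch

checkpoint = torch.load("models/dimensiongpt_0.2b.pt", map_location="cpu")
tensors = [t for t in checkpoint.values() if torch.is_tensor(t)]
print(f"{len(tensors)} tensors, {sum(t.numel() for t in tensors):,} values")
```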

📦 Dependencies


Run the following command to install the dependencies:

$ pip install -r requirements.txt

โš ๏ธ You may need to use a specific command for PyTorch if you want to use CUDA

โš ๏ธ You way need to manually install a Flash Attention release for Windows


🦾 Training

  • Run the create_data.ipynb file to create the tokenizer and the dataset (it may take an entire day and consume a few hundred gigabytes of disk space)

  • Run the training.ipynb file (you can stop the training at any time and resume it later thanks to the checkpoints)

  • If you don't have an overpriced 24GB GPU like me, the default settings (those used to train DimensionGPT) may not work for you. You can try to:

    • Reduce the batch size (training is less stable and reaches a worse final loss)
    • Increase the accumulation steps (fixes the previous problems but is slower)
    • Reduce some architecture parameters (worse final loss)

โš—๏ธ Testing

  • Run the testing.ipynb file to use the models you downloaded or trained

๐Ÿ™ Credits