Documentation coming soon... This needs a lot of refactoring. No time for that yet :(
- We have prepared one of the largest pre-training corpora for the Azerbaijani language, with 3 billion tokens of training data.
- We have built a data filtering and curation pipeline with around 13 stages, customized for the Azerbaijani language (a rough sketch is shown below).
- We have trained a custom SentencePiece BPE tokenizer that handles Azerbaijani well (see the training sketch below).
- We have pre-trained a 150M-parameter LLaMA-based generative model (decoder-only transformer) on 3B tokens for 2-3 epochs (an assumed configuration sketch is shown below).
- We have also trained an SFT (supervised fine-tuned) model to give it chatbot-like behavior (see the data-formatting sketch below).
- We have evaluated the models using custom metrics.
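A minimal sketch of what a staged filtering pipeline can look like. The stages, thresholds, and character heuristic below are illustrative placeholders, not the actual 13 stages used in this project.

```python
from typing import Callable, Iterable

def min_length_filter(doc: str) -> bool:
    # Drop very short documents (the 20-word threshold is an assumed value).
    return len(doc.split()) >= 20

def azerbaijani_char_filter(doc: str) -> bool:
    # Keep documents containing Azerbaijani-specific characters (a simple heuristic).
    return any(ch in doc.lower() for ch in "əğıöüçş")

# Each stage is a predicate; a document must pass every stage to be kept.
FILTERS: list[Callable[[str], bool]] = [min_length_filter, azerbaijani_char_filter]

def run_pipeline(docs: Iterable[str]) -> list[str]:
    return [doc for doc in docs if all(f(doc) for f in FILTERS)]
```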
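A minimal sketch of training a SentencePiece BPE tokenizer on an Azerbaijani corpus. The file paths, vocab size, and character coverage are assumed values, not this project's exact settings.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_az.txt",       # hypothetical path to the cleaned corpus
    model_prefix="az_bpe",       # writes az_bpe.model and az_bpe.vocab
    model_type="bpe",
    vocab_size=32000,            # assumed; choose based on corpus statistics
    character_coverage=0.9995,   # keep rare Azerbaijani characters in the vocab
)

# Quick sanity check of the trained tokenizer.
sp = spm.SentencePieceProcessor(model_file="az_bpe.model")
print(sp.encode("Azərbaycan dili gözəldir.", out_type=str))
```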
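A sketch of a roughly 150M-parameter LLaMA-style decoder-only model built with Hugging Face `transformers`. All dimensions below are assumptions chosen to land near 150M parameters; the actual architecture hyperparameters are not documented yet.

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,             # assumed to match the custom BPE tokenizer
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=14,
    num_attention_heads=12,       # head dimension of 64
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(config)

# Prints the parameter count (roughly 150M with these assumed dimensions).
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```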
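A minimal sketch of formatting instruction/response pairs into SFT training text. The chat template below is an assumed format for illustration, not the one used for the actual SFT model.

```python
def format_example(instruction: str, response: str) -> str:
    # Hypothetical chat template with user/assistant role markers.
    return f"<|user|>\n{instruction}\n<|assistant|>\n{response}</s>"

sft_example = format_example(
    "Azərbaycanın paytaxtı haradır?",          # "What is the capital of Azerbaijan?"
    "Azərbaycanın paytaxtı Bakı şəhəridir.",   # "The capital of Azerbaijan is Baku."
)
print(sft_example)
```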