LLM Training Made Quick, Flexible and Easy

Build and run a complete LLM pipeline in <30 lines of code (including training, finetuning, instruction-tuning, evaluation) in any setting (single-GPU, multi-GPU, FSDP, (Q)LoRA, CPU offloading, etc.)

A concise framework to train LLMs quickly. You only need to focus on the most important parts (data, model, and maybe training strategies), without writing or scrolling through unnecessary code.

What's special about this repo?

This repo is a very clean, Pythonic implementation of LLMs (unlike most repositories, which are filled with thousands of lines of incomprehensible code).

  • Clean means that different components are clearly separated, and initialized using intuitive function arguments.
  • Clean also means that training scripts are minimal, thanks to high-level abstractions, so you don't have to scroll through unnecessary code.

The details are omitted to keep this README concise; see here for more details and design-choice justifications.

Key Features of this repo

  • Download any HF dataset quickly.
  • Define custom architectures, or download any HF model (for continual pretraining), while using a common API template.
  • Automatic setup of Data, Model and Tokenizer using high-level APIs.
  • Easy setup of Distributed Training (single-GPU, DDP, FSDP, CPU Offloading etc.).
  • Finetune pretrained Hugging Face models in the (Q)LoRA setting.
  • Inbuilt support for Mixed Precision Training.
  • Automatic checkpointing of Model based on best validation loss.
  • Evaluate models on benchmarks easily.

This Repo's components in a nutshell

  • Data source: Hugging Face Datasets
  • Tokenizer: Tiktoken
  • Model: standard GPT architecture with Flash Attention
  • Trainer function: Learner class provided by fast.ai
  • Loss function: CrossEntropyLossFlat (fast.ai's flattening wrapper around PyTorch's CrossEntropyLoss)
  • Distributed Training: Hugging Face Accelerate
  • Distributed Backend: NCCL
  • Progress Logging: Weights and Biases
  • Precision: bf16
  • Evaluation: Eleuther AI's lm-evaluation-harness backend

Get Started.



  • NVIDIA GeForce RTX 3090 or similar GPU hardware (extensively tested on this, but not strictly required). The more GPUs, the better.
  • CUDA: 12.4 (or higher)
  • Driver Version: 550.X (or higher)


  • Python 3.10 (preferably in a Conda environment)
  • Set up the repo requirements using sh setup.sh.
    • You can also manually install the necessary libraries: pip install -r requirements.txt.
    • However, setup.sh also installs some other dependencies, such as lm-evaluation-harness for benchmark evaluation, so we recommend using setup.sh.

Run LLM Training

Want to quickly run the training, on a single GPU, with no adjustments?

  • python train.py

  • This trains a ~125M-parameter standard GPT architecture model on Simple Wiki (~51M tokens) using a standard Adam optimizer. The model is evaluated on the validation set every 1000 iterations, and saved if the validation loss is the best one encountered so far.
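
The save-on-best-validation-loss behaviour can be pictured with a small sketch. This is not the repo's actual implementation (which relies on fast.ai's Learner); `BestLossTracker`, `steps_to_save`, `eval_every`, and the loss values are hypothetical names and placeholder numbers used only for illustration.

```python
import math


class BestLossTracker:
    """Remember the lowest validation loss seen so far (illustrative only)."""

    def __init__(self):
        self.best = math.inf   # no validation loss observed yet
        self.best_step = None  # iteration at which the best loss occurred

    def update(self, step, val_loss):
        """Return True if this validation loss is a new best (i.e. save now)."""
        if val_loss < self.best:
            self.best, self.best_step = val_loss, step
            return True
        return False


def steps_to_save(val_losses, eval_every=1000):
    """Given one validation loss per evaluation, list the steps that trigger a save."""
    tracker = BestLossTracker()
    saved = []
    for i, loss in enumerate(val_losses, start=1):
        step = i * eval_every  # validation runs every `eval_every` iterations
        if tracker.update(step, loss):
            saved.append(step)  # a real run would write the checkpoint here
    return saved


# Placeholder losses, not real results: the third evaluation is worse, so no save.
print(steps_to_save([3.2, 2.9, 3.0, 2.7]))  # prints [1000, 2000, 4000]
```

Only strict improvements trigger a save in this sketch, so a plateau in validation loss does not rewrite the checkpoint.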

But if you have multiple GPUs... (Distributed Training)

  • We use Hugging Face Accelerate to carry out distributed training (which itself is built on top of PyTorch DDP).
  • Before you can launch distributed training using accelerate, you need to create a configuration (using accelerate config) that tells Accelerate the nature of the distributed training. For example,
    • whether the training is distributed across multiple GPUs in a single node, or multiple nodes are involved,
    • which machine is the main machine,
    • whether you want DDP or FSDP based distributed training, etc.

Navigate to this document for details.

  • Next, simply run

    accelerate launch train.py

    LLM training runs can be very long, so you may want to just launch this process in the background. nohup is a good solution for that.

  • We have provided some configs for both DDP and FSDP settings in the directory configs. To run using a specific config (without running accelerate config all over again), use the --config_file arg.

    • For example, to run Phi-3 with QLoRA using DDP, run accelerate launch --config_file configs/singlemachine_DDP.yml train_phi3.py
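
The launch commands above can also be assembled programmatically, which helps when scripting several runs. The following is a minimal sketch, not part of this repo: `build_launch_command`, `as_background_shell_line`, and the log file name `train.log` are hypothetical, while `accelerate launch` and `--config_file` are the command and flag already used above.

```python
import shlex


def build_launch_command(script="train.py", config_file=None):
    """Assemble an `accelerate launch` command as a list of arguments."""
    cmd = ["accelerate", "launch"]
    if config_file is not None:
        # Reuse a saved config instead of re-running `accelerate config`.
        cmd += ["--config_file", config_file]
    cmd.append(script)
    return cmd


def as_background_shell_line(cmd, log_file="train.log"):
    """Render the command as a nohup one-liner that keeps running after logout."""
    return f"nohup {shlex.join(cmd)} > {shlex.quote(log_file)} 2>&1 &"


cmd = build_launch_command("train_phi3.py", "configs/singlemachine_DDP.yml")
# Print the shell line rather than executing it, so nothing is launched here.
print(as_background_shell_line(cmd))
```

The printed line redirects both stdout and stderr to the log file, so the output of a long run can be inspected later with a tool such as tail.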

Optional: Log progress to W&B

  • W&B logging

    • There is a very short process you first need to carry out to set up W&B on your system. See this document for details.

    • Edit train.py: set log_wandb to True. Make sure that project is set to the same project you used in the wandb init configuration.

Customize your training run

Customize Model

See model/README.md for guidelines on building your own PyTorch model or importing an existing Hugging Face model.

Customize Data

See data/README.md for guidelines on downloading custom datasets from Hugging Face Datasets.

Customize Training process

  • If you wish to change the training process itself (optimization strategy, gradient accumulation, etc.), you need to do that through the Learner class. See the fast.ai documentation to explore this. If you are unfamiliar with fast.ai, just open a GitHub issue; I'll try to look into it and incorporate it as a training option.

Evaluate Your Models.

  • To evaluate your models on benchmarks, use the evaluate function in our model class: provide the names of the benchmarks as a list of strings (e.g. ['mmlu', 'boolq']) and get the results as a table in the CLI, and optionally as a saved JSON dictionary (for future visualization). See the docstring in gpt.py inside the model directory.
  • Before running evaluation, make sure you have installed lm-evaluation-harness in dev mode. If you set up this repo using setup.sh, this is done automatically for you.
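
To give a rough idea of the two output forms mentioned above (a CLI table and a saved JSON dictionary), here is a minimal standalone sketch. It does not call the repo's evaluate function, whose exact signature is documented in gpt.py; `format_results_table`, `save_results`, the file name `eval_results.json`, and the scores are all hypothetical placeholders.

```python
import json
import os
import tempfile


def format_results_table(results):
    """Render a {benchmark: score} dict as a plain-text table for the CLI."""
    width = max([len("benchmark")] + [len(name) for name in results])
    lines = [f"{'benchmark':<{width}}  score", "-" * (width + 7)]
    for name, score in results.items():
        lines.append(f"{name:<{width}}  {score:.3f}")
    return "\n".join(lines)


def save_results(results, path):
    """Write the results dict to disk as JSON for later visualization."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)


# Placeholder scores, not real benchmark results.
results = {"mmlu": 0.25, "boolq": 0.5}
print(format_results_table(results))

out_path = os.path.join(tempfile.gettempdir(), "eval_results.json")
save_results(results, out_path)
```

Keeping the JSON as a flat benchmark-to-score mapping makes it easy to load several runs later and compare them side by side.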


See this document

Known Issues

See this document.


If you find this repository useful in your research or work, please consider citing it:

  title={aLLMond: LLM Training Made Quick and Easy},
  author={Palaash Agrawal},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PalaashAgrawal/allmond}},

