
🐌 NuMPItron

Simplistic small language model 3D-parallelism training using NumPy and MPI. Inspired by Megatron-LM and Nanotron and based only on NumPy and MPI for Python, NuMPItron offers a variety of ways to train your Transformer at a snail's pace.

This library is meant as a learning experience for implementing distributed training strategies. Ideally the library will be capable of both 3D parallelism (TP + PP + DP) and ZeRO. If you want to follow along, make sure to check out my blog.

Feature Roadmap

Core functionality will be 3D parallelism and ZeRO stage 1, since these can generally be combined (a sketch of the tensor-parallel idea follows the list):

  • Single Core
  • Tensor Parallel
  • Distributed Data Parallel
  • Pipeline Parallel
  • Distributed sampling strategies
  • ZeRO
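As an illustration of the tensor-parallel item above, here is a minimal sketch of a column-parallel linear layer using only NumPy and mpi4py: each rank holds one column shard of the weight matrix, computes its partial output, and the shards are all-gathered into the full activation. This is just the core idea with a hypothetical file name, not numpitron's actual implementation.

# tp_linear_sketch.py -- hypothetical example, not part of numpitron
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

d_in, d_out = 8, 16
assert d_out % world_size == 0, "output dim must divide evenly over ranks"
shard = d_out // world_size

# Each rank owns a different column shard of the full (d_in, d_out) weight matrix.
w_shard = np.random.default_rng(seed=rank).normal(size=(d_in, shard)).astype(np.float32)

x = np.ones((4, d_in), dtype=np.float32)  # identical input on every rank
y_shard = x @ w_shard                     # local partial output, shape (4, shard)

# All-gather the shards so every rank ends up with the full (4, d_out) output.
y_all = np.empty((world_size, 4, shard), dtype=np.float32)
comm.Allgather(y_shard, y_all)
y_full = np.concatenate(y_all, axis=-1)
print(f"rank {rank}: output shape {y_full.shape}")

Run it with e.g. mpirun -n 2 python tp_linear_sketch.py. A row-parallel layer would instead all-reduce partial sums, and data parallelism would all-reduce gradients across its group.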

When/if this is done, we will look at expert parallel strategies.

Installation

First, ensure mpi4py is installed by following the instructions on the MPI for Python page.
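To verify that MPI and mpi4py work together, a quick generic check (not specific to this repo) is to print every rank:

mpirun -n 2 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(f'rank {c.Get_rank()} of {c.Get_size()}')"

You should see one line per process.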

Then, install the library using:

git clone https://github.com/lweitkamp/numpitron
cd numpitron
pip install -e .  # -e .[dev] for unit tests

Examples

You will need to download the Shakespeare dataset (shakespeare_char_{train|val}.bin) from Google Drive and place it in the data folder.
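These binaries appear to follow the nanoGPT character-level convention, i.e. a flat array of uint16 token ids; that is an assumption, so verify it against your copy. If it holds, a quick inspection looks like:

# Assumes nanoGPT-style flat uint16 token ids, readable via memmap.
import numpy as np
train = np.memmap("data/shakespeare_char_train.bin", dtype=np.uint16, mode="r")
val = np.memmap("data/shakespeare_char_val.bin", dtype=np.uint16, mode="r")
print(f"train tokens: {len(train):,}  val tokens: {len(val):,}  max token id: {int(train.max())}")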

Training with tensor/data parallelism can be done using the train_shakespeare.py script:

mpirun -n {1, 2, ...} python train_shakespeare.py \
    --tensor-parallel-size {1, 2, ...} \
    --data-parallel-size {1, 2, ...}

Make sure that the product of --tensor-parallel-size and --data-parallel-size equals -n. Parameters and optimizer state will be stored at data/model.npy to be used for sampling. Training takes about 12 hours with --tensor-parallel-size 2 and about 32 hours without tensor parallelism, reaching a loss of about 1.80¹ after a couple of hours, depending on your hardware (I'm using a 2015 MacBook Pro).

Note that the training loss graph only implies that on CPU you are better off performing smaller matmuls (i.e., data/tensor parallel combinations), which makes sense given how quickly you become compute bound on a CPU.
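As for the product constraint above: the ranks are arranged as a 2D grid, one axis for tensor parallelism and one for data parallelism. A generic mpi4py sketch of carving such a grid out of COMM_WORLD (hypothetical, not necessarily how numpitron builds its groups):

# rank_grid_sketch.py -- hypothetical tensor/data rank grid
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

tp_size = 2                       # e.g. mpirun -n 4 with --tensor-parallel-size 2
dp_size = world_size // tp_size
assert tp_size * dp_size == world_size, "product of group sizes must equal -n"

tp_rank, dp_rank = rank % tp_size, rank // tp_size

# Ranks sharing dp_rank form a tensor-parallel group (shard the matmuls);
# ranks sharing tp_rank form a data-parallel group (all-reduce gradients).
tp_comm = comm.Split(color=dp_rank, key=tp_rank)
dp_comm = comm.Split(color=tp_rank, key=dp_rank)
print(f"rank {rank}: tp {tp_comm.Get_rank()}/{tp_size}  dp {dp_comm.Get_rank()}/{dp_size}")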

Run a sample generation using the following:

mpirun -n {1, 2, ...} python sample.py \
    --tensor-parallel-size {1, 2, ...}
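Under the hood, generation from a character-level model boils down to repeatedly sampling the next token from a temperature-scaled softmax. A small framework-agnostic sketch (hypothetical names, not sample.py's actual interface):

# Hypothetical sampling loop; next_token_logits stands in for the model's forward pass.
import numpy as np

def generate(next_token_logits, prompt_tokens, steps=250, temperature=0.8, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = next_token_logits(np.asarray(tokens)) / temperature  # (vocab_size,)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens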

With the pretrained model loaded you would expect to see something like the sample below. Not bad, not great.

Seecon:
Commendom:
Who tear pout mine so I profit in.

BRUTUS:
Why, bear are dreadful he gnot letted and Chrown.

AUFIDIUS:
The may my heart, John my moone, with have glo:
But the bluike to ther opeesusate! Camille,
A marin curstifies will to a lise

Footnotes

  1. This matches Karpathy's log loss at the same model size in his NanoGPT repo.