karpathy/nanoGPT

Training on M1 "MPS"

okpatil4u opened this issue · 45 comments

Most people do not have access to 8x A100 40GB systems, but a single M1 Max laptop with 64 GB of memory could host the training. How difficult would it be to port this code to "MPS"?

I take it back. Seems like these are 8 x 40 GB systems.

There is a good paper on cramming [Cramming: Training a Language Model on a Single GPU in One Day]
https://arxiv.org/abs/2212.14034

I thought some work along these lines was done here as well.

Actually I think this issue is great to keep open, in case anyone investigates nanoGPT in mps context. I haven't tried yet.
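
For anyone who wants to poke at it: PyTorch exposes Apple's Metal backend as the "mps" device, so the first thing to check is whether your build supports it. A minimal device-selection sketch (not tied to nanoGPT's code, just plain PyTorch):

import torch

# Pick the best available device: Apple's Metal backend (mps), then CUDA, then CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

x = torch.randn(8, 64, device=device)  # tensors can be created directly on the chosen device
print(device, x.device)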

What is the actual memory requirement? Would a Mac Studio with 128 GB RAM be sufficient for training?

Refining the above comment slightly: do you currently have any (rough is fine) estimates of the relative sizes of the memory footprint for just the model parameters, for the params plus the forward activations as a function of batch size, and for the backward graph as a function of batch size, on the 8x A100 40GB configuration? Where does it peak across the server during training?

That might start to inform some people on how to go about laying this out on the resources they have.

Also relevant for inference.
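
As a very rough starting point, here is a back-of-the-envelope sketch for the static part (weights, gradients, AdamW state) of a GPT-2 124M-class model in fp32. These are illustrative numbers, not measurements; activations scale with batch size and block size and are the harder part to estimate, and mixed precision changes everything:

# Back-of-the-envelope static memory for a GPT-2 124M-class model trained
# in fp32 with AdamW (weights + gradients + two optimizer moment buffers).
n_params = 124e6
bytes_per_value = 4  # fp32

weights = n_params * bytes_per_value
grads = n_params * bytes_per_value
adam_state = 2 * n_params * bytes_per_value  # exp_avg and exp_avg_sq

static_gb = (weights + grads + adam_state) / 1e9
print(f"params + grads + AdamW state ~= {static_gb:.1f} GB")  # roughly 2 GB before activations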

I haven't had a chance to do any benchmarking yet but training starts just fine on M1 Ultra with --device=mps.

I tried out the "i only have a MacBook" setup from the README but with --device=mps and it seems to run faster: with CPU one iteration takes roughly 100ms, whereas with mps it is about 40ms. My machine is a baseline Mac Studio.

That's for training a very small transformer. My machine is an M1 Max with 64 GB RAM. For a BERT-medium-like architecture, this is how it goes.

Overriding: dataset = shakespeare
Overriding: n_layer = 8
Overriding: n_head = 512
Overriding: n_embd = 512
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 128

Initializing a new model from scratch
number of parameters: 50.98M
step 0: train loss 10.9816, val loss 10.9783
iter 0: loss 10.9711, time 4613.50ms
iter 1: loss 10.9673, time 5791.48ms
iter 2: loss 10.9647, time 7842.40ms
iter 3: loss 10.9646, time 10196.35ms
iter 4: loss 10.9604, time 11602.34ms
iter 5: loss 10.9495, time 9393.25ms
iter 6: loss 10.9615, time 10373.34ms

@itakafu thank you for reporting, i'll add mentions of mps to the readme&code.

test on MacBook Air M2, without charger:

with mps: roughly 150~200ms for one iteration
without mps: roughly 450 ~ 500ms for one iteration

just for one reference


tomeck commented

Confirmed, it works great with device='mps'. But make sure to install this version of PyTorch:

$ pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
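
To double-check that the install actually has the Metal backend, a quick sanity check in Python:

import torch

print(torch.__version__)
print(torch.backends.mps.is_built())       # was this build compiled with MPS support?
print(torch.backends.mps.is_available())   # can the Metal backend actually be used here?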

I'm getting <40ms

Thank you SO MUCH for this

@tomeck Weird, I'm getting 300ms on M2 (Macbook Air 16GB):

python3 train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device='mps' --compile=False --eval_iters=1 --block_size=64 --batch_size=8
Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
vocab_size not found in data/shakespeare/meta.pkl, using GPT-2 default of 50257
Initializing a new model from scratch
number of parameters: 3.42M
step 0: train loss 10.8177, val loss 10.8162
iter 0: loss 10.8288, time 438.06ms
iter 1: loss 10.8117, time 303.12ms
iter 2: loss 10.8236, time 301.04ms
iter 3: loss 10.8265, time 299.64ms
iter 4: loss 10.8128, time 299.96ms
iter 5: loss 10.8173, time 299.72ms
iter 6: loss 10.8066, time 300.76ms
iter 7: loss 10.8084, time 299.86ms
iter 8: loss 10.8244, time 299.47ms
coltac commented

Just out of curiosity, I'm getting 17ms with a Ryzen 7 5700X, a 3060 Ti, and 64 GB RAM. What kind of iteration time does an A100 do? Are they horribly faster? I have a friend with 2x 3080s and I'm considering doing the big one...

Yep, the README documentation doesn't make sense in terms of ms calculations on the A100. It states:
"Training on an 8 x A100 40GB node for ~500,000 iters (~1 day) atm gets down to ~3.1"

That would mean 500,000 iters / 86,400 s, which is about 5.79 iters per second, i.e. ~172.8 ms per iter for the whole 8-GPU node.
And multiplying that by 8 to estimate a single A100... it doesn't seem to add up.
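
Spelling out the arithmetic (assuming the ~1 day figure is wall-clock time for the whole 8-GPU node, and ignoring inter-GPU communication):

iters = 500_000
seconds_per_day = 86_400

iters_per_sec = iters / seconds_per_day        # ~5.79 it/s for the whole 8-GPU node
ms_per_iter_node = 1000 / iters_per_sec        # ~172.8 ms per logged iter across 8 GPUs
ms_per_iter_single = ms_per_iter_node * 8      # very rough single-A100 estimate
print(ms_per_iter_node, ms_per_iter_single)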

coltac commented

Oh, I'm being stupid: I'm getting 17ms on Shakespeare. I bet it'd be way higher on OpenWebText.

simonw commented

Thanks to this thread I got it working on my M2 MacBook Pro - I wrote up some detailed notes here: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2

simonw commented

I also built a little tool you can copy and paste the log output from training into to get a chart:

https://observablehq.com/@simonw/plot-loss-from-nanogpt

Example output:

[chart of train loss over iterations]
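
If you'd rather plot locally instead of pasting into the notebook, a small matplotlib sketch does the same thing (train.log here is a hypothetical file holding the copied training output):

import re
import matplotlib.pyplot as plt

# Parse "iter N: loss X, time Yms, ..." lines copied from the training output.
pattern = re.compile(r"iter (\d+): loss ([\d.]+)")
iters, losses = [], []
with open("train.log") as f:   # hypothetical file containing the pasted log
    for line in f:
        m = pattern.search(line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel("iteration")
plt.ylabel("train loss")
plt.show()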

I think the mps section of the readme may be inaccurate: my understanding is that mps just utilizes the on-chip GPU. To use the Neural Engine you'd have to port it to CoreML, which may or may not speed up training but should do wonders for inference. See PyTorch announcement here.

For training, you have to use MPS. For inference you can use ANE.

Hey @simonw , thanks for sharing tutorial on your website!

I tried it on my MacBook Air M2 and I'm getting much worse performance:

time python3 train.py \
  --dataset=shakespeare \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=64 \
  --compile=False \
  --eval_iters=1 \
  --block_size=64 \
  --batch_size=8 \
  --device=mps
Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
Overriding: device = mps
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 3.42M
using fused AdamW: False
step 0: train loss 10.8153, val loss 10.8133
iter 0: loss 10.8181, time 5264.63ms, mfu -100.00%
iter 1: loss 10.8291, time 1650.46ms, mfu -100.00%
iter 2: loss 10.8164, time 1651.38ms, mfu -100.00%
iter 3: loss 10.7927, time 1639.94ms, mfu -100.00%
iter 4: loss 10.8212, time 1644.10ms, mfu -100.00%
iter 5: loss 10.8067, time 1639.57ms, mfu 0.08%
iter 6: loss 10.8307, time 1635.84ms, mfu 0.08%
iter 7: loss 10.8345, time 1635.17ms, mfu 0.08%
iter 8: loss 10.8262, time 1637.88ms, mfu 0.08%
iter 9: loss 10.8275, time 1643.70ms, mfu 0.08%
iter 10: loss 10.8100, time 1643.38ms, mfu 0.08%
iter 11: loss 10.8100, time 1641.18ms, mfu 0.08%
iter 12: loss 10.8258, time 1647.17ms, mfu 0.08%
iter 13: loss 10.8169, time 1643.93ms, mfu 0.08%
iter 14: loss 10.8139, time 1645.54ms, mfu 0.08%
iter 15: loss 10.8107, time 1642.27ms, mfu 0.08%
iter 16: loss 10.8114, time 1642.16ms, mfu 0.08%
iter 17: loss 10.7969, time 1641.59ms, mfu 0.08%
iter 18: loss 10.8150, time 1643.31ms, mfu 0.08%

Currently on Python 3.11. I spent a couple of hours trying to reinstall everything but it didn't help. Does anyone have ideas about what could be wrong here?

MacBook M1 Max results on train_shakespeare_char

python train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
batch_size = 16
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 4
n_head = 4
n_embd = 256
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-6 # learning_rate / 10 usually
beta2 = 0.999 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
device = 'mps'  # run on the Apple GPU via MPS
compile = False # do not torch compile the model

found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)

step 0: train loss 4.2326, val loss 4.2303
iter 0: loss 4.2329, time 9686.70ms, mfu -100.00%
step 5000: train loss 0.7204, val loss 1.5878
iter 5000: loss 0.9658, time 10224.29ms, mfu 0.48%

python sample.py --out_dir=out-shakespeare-char
Overriding: out_dir = out-shakespeare-char
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
number of parameters: 3.16M
Loading meta from data/shakespeare_char/meta.pkl...

The like order precious soner stout the morning's strength;
The month of his son bounded bones and rough
Since the common people'd courtesy 'gainst their times,
Your brats bear betwixt them away, and nothing
Against the gracious patern of their heads,
For their father is not their silly mouths,
Even in their voices and their loves.

MENENIUS:
You are received;
For they wear them, no more good to bed,
Your people have are endured with them not:
You'll have done as good to them be to brief

It appears that after 086ebe1 was merged, training performance on M1/M2 became significantly slower.
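
For context, train.py accumulates gradients over gradient_accumulation_steps micro-batches inside every logged iteration, so per-iteration wall time scales roughly linearly with that value (the commit bumped the default from 1 to 40). A toy, self-contained illustration of the pattern, not nanoGPT's exact code:

import torch

# Toy illustration: each logged "iter" runs gradient_accumulation_steps
# forward/backward passes before a single optimizer step, so wall time per
# iter grows roughly linearly with this value.
model = torch.nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
gradient_accumulation_steps = 40

for it in range(3):
    for micro_step in range(gradient_accumulation_steps):
        x = torch.randn(8, 64)
        loss = model(x).pow(2).mean()
        (loss / gradient_accumulation_steps).backward()  # average the gradient over micro-steps
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)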

Thanks @deepaktalwardt!

I am using the command suggested by @simonw:

time python3 train.py \
  --dataset=shakespeare \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=64 \
  --compile=False \
  --eval_iters=1 \
  --block_size=64 \
  --batch_size=8 \
  --device=mps

After reverting that commit this is literally flying on my MacBook Pro M2 Max! So just make sure gradient_accumulation_steps is always equal to 1. Without reverting 086ebe1 it is 800ms per iter.

Stopped training after 10k iters which took 4min18s.

iter 10139: loss 3.9768, time 25.31ms, mfu 0.13%

KeyboardInterrupt

python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64       232.40s user 72.33s system 117% cpu 4:18.81 total

Has someone tried 'mps' together with 'compile=True' and succeeded?

+1 to reverting 086ebe1; I went from 1500ms to 70ms per iteration.

rozek commented

indeed, I also made my own fork and reverted 086ebe1, resulting in a dramatic speedup on my Mac mini M1!

rozek commented

Thanks to this thread I got it working on my M2 MacBook Pro - I wrote up some detailed notes here: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2

Simon, thank you very much for your walk-through of a nanoGPT installation on Apple silicon. By the way, I just tried to run python sample.py after changing the device to mps and it seems to work now: the script spits out a few warnings but then generates output without any problems. Note that it has to be run under macOS 13.x Ventura.

Has someone tried 'mps' together with 'compile=True' and succeeded?

Yep, as follows:

Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 128
Overriding: compile = True
Overriding: eval_iters = 20
Overriding: block_size = 64
Overriding: batch_size = 12
Overriding: device = mps
Overriding: log_interval = 1
Overriding: max_iters = 2000
Overriding: lr_decay_iters = 2000
Overriding: dropout = 0.0
Overriding: gradient_accumulation_steps = 1
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 7.23M
using fused AdamW: False
compiling the model... (takes a ~minute)
step 0: train loss 10.8272, val loss 10.8203
iter 0: loss 10.8421, time 2852.64ms, mfu -100.00%
iter 1: loss 10.8099, time 522.30ms, mfu -100.00%
...
iter 2000: loss 2.6286, time 1241.70ms, mfu 0.16%
python train.py config/train_shakespeare_char.py --dataset=shakespeare 420.38s user 105.07s system 49% cpu 17:34.84 total

~/nanoGPT master ± pip list | grep torch
torch 2.1.0.dev20230401
torchaudio 2.1.0.dev20230401
torchvision 0.16.0.dev20230401

~/nanoGPT master ± python --version
Python 3.9.6

0dB commented

Reverting commit 086ebe1 or overriding gradient_accumulation_steps to 1 is not needed anymore. This seems to have been fixed via the file config/train_shakespeare_char.py with commit 21f9bff. I can confirm 30ms or 775ms iteration times on an M1 Pro with mps, depending on whether I use the "I only have a MacBook" settings or plain python train.py config/train_shakespeare_char.py --device=mps --compile=False.

BTW I also did not need the nightly PyTorch build for this. The version available on MacPorts did fine. I did have to comment out code in train.py regarding init_process_group, destroy_process_group and ddp (parallel processing on multiple GPUs).
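
If anyone wants to avoid commenting things out, one option is to only touch torch.distributed when the DDP environment variables are actually set. A sketch of that idea, not the upstream code:

import os

# Only import/initialize torch.distributed when actually launched under
# torchrun (RANK is set); plain single-device / MPS runs skip it entirely.
ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
    from torch.distributed import init_process_group, destroy_process_group
    init_process_group(backend='nccl')  # usual multi-GPU backend; never reached on a single Mac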

rozek commented

Unfortunately, I cannot confirm the above statement: using a fresh installation of this repo, training "Shakespeare" took approx. 2.2s per iteration on a Mac mini M1 with 16GB RAM. After reverting 086ebe1 again, every iteration took only 0.067s or even less (what a dramatic change!)

0dB commented

Unfortunately, I cannot confirm the above statement

That's strange. When I revert, which effectively sets gradient_accumulation_steps to 1, I get no change, so to me it seems commit 21f9bff resolves things. Ideas anyone?

rozek commented

well, if you look into commit 21f9bff and compare it with the command you used for testing (python train.py config/train_shakespeare_char.py --device=mps --compile=False) you will see that the commit adds one line to config/train_shakespeare_char.py, namely gradient_accumulation_steps = 1, which is exactly what reverting commit 086ebe1 did generically.

Did you also test train.py with other configurations that do not include gradient_accumulation_steps = 1?

0dB commented

if you look into commit 21f9bff and compare that with the statement you used for testing (python train.py config/train_shakespeare_char.py --device=mps --compile=False) you will see, that the commit adds one line to config/train_shakespeare_char.py, namely gradient_accumulation_steps = 1 - that's what reverting commit 086ebe1 did generically.

I know; from what I can tell, this overrides the setting in train.py. Does this not work for you?

Did you also test train.py with other configurations that do not include gradient_accumulation_steps = 1?

No, I only have one GPU, so from my understanding of this issue I want this value to be 1. The only thing I tried just now is reverting the commit you mention, which sets gradient_accumulation_steps = 1 in train.py again (instead of 40), but as expected that has the same effect as just using the current code, which now sets the override in config/train_shakespeare_char.py.

It is not clear to me why the current code is dramatically slower for you than the code after reverting the commit. Are you seeing different values for tokens per iteration (printed by train.py) when you revert and don't revert the commit? For me it is just batch_size times block_size, so the override is working and gradient_accumulation_steps is indeed being set to 1.
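
For reference, that "tokens per iteration" number is just the product of those knobs (times the DDP world size, which is 1 on a single Mac). A quick check with the "I only have a MacBook"-style settings:

# nanoGPT prints "tokens per iteration will be: ..." which is just this product.
gradient_accumulation_steps = 1
ddp_world_size = 1            # single Mac, no DDP
batch_size = 12
block_size = 64

tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(tokens_per_iter)        # 768 with these settings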

rozek commented

Well, I think the reason why setting gradient_accumulation_steps = 1 has such a dramatic effect is still not really clear, at least not to me.

I tested nanoGPT with the Shakespeare dataset, not with shakespeare_char, which is why I ran into the same problem as a few weeks ago.

And since setting gradient_accumulation_steps = 1 in every configuration file is too tedious, I still recommend doing so in train.py itself, i.e. reverting 086ebe1.

Maybe simply try:

python train.py config/train_gpt2.py \
  --compile=True \
  --eval_iters=20 \
  --block_size=64 \
  --device=mps \
  --max_iters=6000 \
  --lr_decay_iters=6000 \
  --gradient_accumulation_steps=1

M1 Pro machine running the Spotify Million Song dataset:

nanoGPT % python3.10 train.py config/train_meet_summ.py --device=mps --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
Overriding config with config/train_meet_summ.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-meet_summ'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'meet_summ'
wandb_run_name = 'mini-gpt'

dataset = 'meet_summ'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu'  # run on cpu only
# compile = False # do not torch compile the model

Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 20
Overriding: log_interval = 1
Overriding: block_size = 64
Overriding: batch_size = 12
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 128
Overriding: max_iters = 2000
Overriding: lr_decay_iters = 2000
Overriding: dropout = 0.0
tokens per iteration will be: 768
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 7.23M
/opt/homebrew/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
num decayed parameter tensors: 18, with 7,233,536 parameters
num non-decayed parameter tensors: 9, with 1,152 parameters
using fused AdamW: False
step 0: train loss 10.8466, val loss 10.8504
iter 0: loss 10.8558, time 1160.57ms, mfu -100.00%
iter 1: loss 10.8502, time 65.83ms, mfu -100.00%
iter 2: loss 10.8342, time 94.48ms, mfu -100.00%
iter 3: loss 10.8267, time 61.11ms, mfu -100.00%
iter 4: loss 10.8191, time 61.72ms, mfu -100.00%
iter 5: loss 10.8195, time 61.16ms, mfu 0.18%
iter 6: loss 10.7832, time 61.36ms, mfu 0.18%
iter 7: loss 10.7710, time 60.81ms, mfu 0.18%
iter 8: loss 10.7230, time 60.78ms, mfu 0.18%
iter 9: loss 10.7206, time 60.25ms, mfu 0.18%

Hi all, I have an mps error, but only when doing architecture sweeps, can someone comment on this issue?
#343

Do you folks not run into this issue with a buggy torch.multinomial on mps?
pytorch/pytorch#92752
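
I haven't verified whether it still reproduces on current builds, but the usual workaround people mention is to draw the sample on the CPU and move the index back to the device. A sketch (probs here just stands in for whatever generate() has in scope):

import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

# Hypothetical workaround: sample on the CPU, then move the chosen index back.
probs = torch.softmax(torch.randn(1, 50304, device=device), dim=-1)
idx_next = torch.multinomial(probs.to("cpu"), num_samples=1).to(device)
print(idx_next)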

Hey everyone, I ran nanoGPT training using "python train.py config/train_shakespeare_char.py --device=mps --compile=False" on a Mac M1 Pro.

[screenshots of the training output]

Hi there!
I have been playing with nanoGPT for a while on my Mac (M1 Pro) and I have noticed that inference is very slow when the length in tokens of the generated output is smaller than the context length. Can anyone confirm this?
I get generation speeds as low as 1 token/s when using a context length of 256 tokens and generating tokens 200 to 255, but once the context window is full, generation is much faster.
This does not happen on CUDA.

my m1 pro gets slower when I use mps but faster with cpu??

yeah, having the same problem, but only with inference

yeah, having the same problem, but only with inference

It's been a couple of weeks since I last checked on this, but I had the same issue.
I suspect there is some slowdown when truncating the tril triangular matrix that is used as a mask in the attention mechanism (when the number of generated tokens is lower than the context length block_size).
See this.
This is not done in training, because training examples are constructed so that they fill the whole context window.

What's even more strange is that, at least for me, this slowdown only happened when generating the first sample (in sample.py).
The only explanation I could give to myself was that Torch performs some sort of caching on that matrix, but I am not sure about the underlying library implementation.

This issue only happens with MPS.
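
For reference, the slow (non-flash) attention path masks the scores with a slice of a persistent lower-triangular buffer, and my suspicion is that the [:, :, :T, :T] slice is what MPS handles badly while T < block_size. A trimmed, self-contained sketch of that pattern (not the exact module):

import math
import torch

# Trimmed version of the slow attention masking path: a persistent tril buffer
# is sliced down to the current sequence length T on every forward call.
block_size, T, n_head, head_dim = 256, 200, 4, 64
bias = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)

q = torch.randn(1, n_head, T, head_dim)
k = torch.randn(1, n_head, T, head_dim)

att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(head_dim))
att = att.masked_fill(bias[:, :, :T, :T] == 0, float('-inf'))  # the slice I suspect is slow on MPS
att = torch.softmax(att, dim=-1)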

at this point, i might have to subscribe to google colab just to run the code...the downfall of poverty