karpathy/llama2.c

Training Tiny Stories: 'CUDA' -vs- 'MPS'

dbl001 opened this issue · 2 comments

dbl001 commented

I am running Tiny Stories on: i) Google Colab Pro - V100, and ii) iMac 27" with AMD Radeon Pro 5700 XT.

It appears that 'MPS' has issues.

'MPS' after ~10,000 iterations:
wandb: Run summary:
wandb: iter 10000
wandb: loss/train 1.82883
wandb: loss/val 1.82366
wandb: lr 0.0
wandb: mfu 1.73689
wandb: tokens 327680000

 ./run out/model.bin -z data/tok4096.bin -i "Once upon a time, there was a little girl named Lily."
Once upon a time, there was a little girl named Lily. fine pa pa giraffeaky un lots biggest tightonsable atipped bored owner pretty inc ask hug Mommy their grandpa tightly becomeNull feather super about match dreaming yetairsful sweetie color dirt outside withust closed roll team songs playground were load What success superheroCan town everyone the the roof cl pass to it here the a f pass mixThe juice cold dirty ready bad taughtiv sy C flowers twig r their two two ink everyone cut doctor fielded named floor here here ponybeNow fit Mr loo something hat coins friendly riced Lila stack those driver smiled early right woman calm last lights lights lights everyone daddyake my the the many tBen end t handleonononononon wet necklace mild dragon excited what lots tall brave coins games loud waitedons the the his his and again again again, remind pointed Sally pir Lucyx pu tie disappeared su sh Many kind nervousard unt clever clever all grown them scale C green sand waved ponyar present roof dolph chubby D zoom grow pole butterfliesasses coinsMaybe unlock ride chick voiceriesmb C harder c triangle gr dre inc near aunty spaghetti Lily people imag le island ourves prince bloom entama bal card necklace sitting different hanging cars cheer prin grown bow bro Canudd blow walking
achieved tok/s: 3.648486

I've resumed training but the results aren't much better.

[Screenshots attached]

Colab after 5,300 iterations:
wandb: Run summary:
wandb: iter 5300
wandb: loss/train 1.34563
wandb: loss/val 1.34056
wandb: lr 4e-05
wandb: mfu 3.3931
wandb: tokens 173670400

Once upon a time, there was a little girl named Lily. She loved to eat steak. One day, she went to the park to play with her friends. While she was playing, she saw a big boy coming towards her. 
Lily was scared and didn't know what to do. But then she remembered what her mom had told her. She remembered her mom telling her that it's important to be brave and face your fears. So, Lily went up to the boy and asked him if he wanted to play with her.
The boy smiled and said yes. They played together for a while, and Lily had so much fun. When it was time to go home, she gave the boy some steak, and he felt proud of himself for being brave. Lily went back home feeling happy and excited for the rest of the day.
achieved tok/s: 23.835767

[Screenshots attached]

Here's my Colab notebook (gzipped):
llama2_c_tinystories.ipynb.gz

I ran the same training parameters on both systems:

# data
batch_size = 8 # if gradient_accumulation_steps > 1, this is the micro-batch size
max_seq_len = 1024
vocab_source = "custom" # llama2|custom; use Llama 2 vocab from Meta, or custom trained
vocab_size = 32000 # the Llama 2 tokenizer has 32K tokens
# model
dim = 768
n_layers = 12
n_heads = 12
n_kv_heads = 12
multiple_of = 32
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 4  # used to simulate larger batch sizes
learning_rate = 5e-5  # max learning rate
max_iters = 10000  # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 500  # how many steps to warm up for
# system
device = "cuda"  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = "float32"  # float32|bfloat16|float16
compile = False  # use PyTorch 2.0 to compile the model to be faster
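
As a sanity check that both runs saw the same data volume, the effective tokens per optimizer step implied by this config line up with the wandb 'tokens' counters above (a quick back-of-the-envelope in Python):

batch_size = 8
gradient_accumulation_steps = 4
max_seq_len = 1024

tokens_per_iter = batch_size * gradient_accumulation_steps * max_seq_len
print(tokens_per_iter)            # 32768 tokens per iteration
print(tokens_per_iter * 10_000)   # 327,680,000 -- the MPS run's 'tokens' at iter 10000
print(tokens_per_iter * 5_300)    # 173,670,400 -- the Colab run's 'tokens' at iter 5300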

On my iMac the environment is:

 % python collect_env.py 
Collecting environment information...
PyTorch version: 2.3.0a0+git937d616
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.1.2 (x86_64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.10.13 (main, Sep 11 2023, 08:21:04) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz

Versions of relevant libraries:
[pip3] audiolm-pytorch==0.0.1
[pip3] configmypy==0.1.0
[pip3] mypy==1.4.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] pytorch-transformers==1.1.0
[pip3] tensorly-torch==0.4.0
[pip3] torch==2.1.0a0+git6bc0f4a
[pip3] torch-cluster==1.6.1
[pip3] torch-scatter==2.1.1
[pip3] torch-sparse==0.6.17
[pip3] torch-spline-conv==1.2.2
[pip3] torch-struct==0.5
[pip3] torch-summary==1.4.5
[pip3] torch-utils==0.1.2
[pip3] torchaudio==2.2.0.dev20231111
[pip3] torchtraining-nightly==1604016577
[pip3] torchvision==0.17.0.dev20231111
[pip3] triton==2.1.0
[pip3] vector-quantize-pytorch==0.9.2
[conda] nomkl                     3.0                           0  
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] numpy-base                1.26.2          py310hd8f4981_0  
[conda] pytorch-transformers      1.1.0                    pypi_0    pypi
[conda] tensorly-torch            0.4.0                    pypi_0    pypi
[conda] torch                     2.2.0a0+git7f1cbc8          pypi_0    pypi
[conda] torch-cluster             1.6.1                    pypi_0    pypi
[conda] torch-scatter             2.1.1                    pypi_0    pypi
[conda] torch-sparse              0.6.17                   pypi_0    pypi
[conda] torch-spline-conv         1.2.2                    pypi_0    pypi
[conda] torch-struct              0.5                      pypi_0    pypi
[conda] torch-summary             1.4.5                    pypi_0    pypi
[conda] torch-utils               0.1.2                    pypi_0    pypi
[conda] torchaudio                2.2.0.dev20231111          pypi_0    pypi
[conda] torchtraining-nightly     1604016577               pypi_0    pypi
[conda] torchvision               0.17.0.dev20231111          pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
[conda] vector-quantize-pytorch   0.9.2                    pypi_0    pypi

% pip show sentencepiece
Name: sentencepiece
Version: 0.1.97
Summary: SentencePiece python wrapper
Home-page: https://github.com/google/sentencepiece
Author: Taku Kudo
Author-email: taku@google.com
License: Apache
Location: /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages
Requires: 
Required-by: audiocraft, benepar, fschat, pytorch-transformers, sentence-transformers

If I copy the model.bin generated on Colab after ~7,500 iterations and run it on my iMac with the local data/tok4096.bin tokenizer, it generates sensible output:

wandb: Run summary:
wandb:       iter 7600
wandb: loss/train 1.2166
wandb:   loss/val 1.24846
wandb:         lr 1e-05
wandb:        mfu 3.39241
wandb:     tokens 249036800
wandb: 
 ./run ~/Downloads/model.bin -z data/tok4096.bin -i "Once upon a time, there was a little girl named Lily." 
Once upon a time, there was a little girl named Lily. She loved to play outside in the sun and jump in the puddles. One day, Lily's mom told her it was time to bathe and get ready for bed. Lily was getting very excited because she was getting ready for the race.
Lily's mom started to cooking some yummy vegetables in the kitchen. She made a big pot of soup and added some vegetables. Lily loved the taste of the vegetables. As she was pouring the soup, she felt a tickle in her nose. She tried to hold it in, but she couldn't feel it.
Suddenly, something unexpected happened. Her little brother came running into the room and accidentally knocked over the pot of vegetables, breaking it into pieces. Lily was upset, but her mom told her not to worry. They cleaned up the mess and made up. When the dinner was ready, Lily's mom had made a delicious dinner. It was all finally ready, and they both enjoyed a delicious meal together.
achieved tok/s: 4.045815

So, the local tokenization is fine. Does that mean the 'MPS' issues are in training the model itself?
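
One way I can think of to narrow this down (a sketch, not something I've run yet): load the same checkpoint on 'cpu' and on 'mps' and compare the logits for an identical batch. The checkpoint keys and the Transformer/ModelArgs usage below are my reading of the stock llama2.c model.py/train.py, so treat them as assumptions:

import torch
from model import ModelArgs, Transformer

ckpt = torch.load("out/ckpt.pt", map_location="cpu")
model = Transformer(ModelArgs(**ckpt["model_args"]))
# strip the torch.compile prefix if the checkpoint was saved from a compiled model
state_dict = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model"].items()}
model.load_state_dict(state_dict)
model.eval()

x = torch.randint(0, ckpt["model_args"]["vocab_size"], (1, 256))

with torch.no_grad():
    logits_cpu = model(x).float()                            # reference pass on CPU
    logits_mps = model.to("mps")(x.to("mps")).float().cpu()  # same weights, same batch, on MPS

print("max abs diff:", (logits_cpu - logits_mps).abs().max().item())
print("non-finite on mps:", (~torch.isfinite(logits_mps)).any().item())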

dbl001 commented

I've been trying to discover what the 'MPS' issues are with various models (e.g., llama2.c, nanoGPT). My most recent tests have been with a COVID-19 research dataset. Here are the parameters:

vocab_source = "custom" # llama2|custom; use Llama 2 vocab from Meta, or custom trained
vocab_size = 32000 # the Llama 2 tokenizer has 32K tokens
# model
dim = 768
n_layers = 16
n_heads = 12
n_kv_heads = 12
multiple_of = 32
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 1  # used to simulate larger batch sizes
learning_rate = 5e-5  # max learning rate
max_iters = 5000  # total number of training iterations
#weight_decay = 1e-1
weight_decay = 1e-4
beta1 = 0.9
#beta2 = 0.95
beta2 = 0.9999
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 500  # how many steps to warm up for
# system
device = "mps"  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = "bfloat16"  # float32|bfloat16|float16
compile = False

The model will train for a while and then produce either -Infs or NaNs.
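
One control I still want to run is the same config with dtype forced back to float32, since the Tiny Stories run above that at least trained to completion on MPS used float32. A minimal sketch of the autocast context I have in mind; the device_type line is my reading of the nanoGPT-style setup in train.py, so treat it as an assumption:

import torch
from contextlib import nullcontext

device = "mps"
dtype = "float32"   # control run: take bfloat16 out of the picture on MPS

ptdtype = {"float32": torch.float32, "bfloat16": torch.bfloat16, "float16": torch.float16}[dtype]
device_type = "cuda" if "cuda" in device else "cpu"   # assumption: 'mps' falls into the cpu branch
ctx = (
    nullcontext()
    if device_type == "cpu"
    else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
)
# the forward/backward then runs under `with ctx:` as usual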

% python train.py --vocab_source=custom --vocab_size=32000
Overriding: vocab_source = custom
Overriding: vocab_size = 32000
tokens per iteration will be: 6,144
breaks down as: 1 grad accum steps * 1 processes * 6 batch size * 1024 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 113, with 137,822,208 parameters
num non-decayed parameter tensors: 33, with 25,344 parameters
using fused AdamW: False
wandb: Currently logged in as: dbl001. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /Users/davidlaxer/llama2.c/wandb/run-20231222_120425-67iqiru1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run run2023_12_22_12_04_21
wandb: ⭐️ View project at https://wandb.ai/dbl001/llamac
wandb: 🚀 View run at https://wandb.ai/dbl001/llamac/runs/67iqiru1
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 0: train loss 10.5229, val loss 10.5152
0 | loss 10.4903 | lr 0.000000e+00 | 1051591.07ms | mfu -100.00%
10 | loss 10.4250 | lr 1.000000e-06 | 7986.53ms | mfu 0.24%
20 | loss 10.2410 | lr 2.000000e-06 | 7962.80ms | mfu 0.24%
30 | loss 9.9217 | lr 3.000000e-06 | 8302.36ms | mfu 0.24%
40 | loss 9.4151 | lr 4.000000e-06 | 8051.44ms | mfu 0.24%
50 | loss 9.3292 | lr 5.000000e-06 | 7915.61ms | mfu 0.24%
60 | loss 9.2145 | lr 6.000000e-06 | 7915.60ms | mfu 0.24%
70 | loss 9.0480 | lr 7.000000e-06 | 8737.03ms | mfu 0.24%
80 | loss 8.9096 | lr 8.000000e-06 | 7929.12ms | mfu 0.24%
90 | loss 8.7169 | lr 9.000000e-06 | 7555.13ms | mfu 0.24%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 100: train loss 8.5301, val loss 8.5473
saving checkpoint to out
wrote out/model.bin
100 | loss 8.4351 | lr 1.000000e-05 | 848533.82ms | mfu 0.22%
110 | loss 8.3677 | lr 1.100000e-05 | 7808.20ms | mfu 0.22%
120 | loss 8.4150 | lr 1.200000e-05 | 7858.75ms | mfu 0.22%
130 | loss 8.0342 | lr 1.300000e-05 | 8167.59ms | mfu 0.22%
140 | loss 7.8545 | lr 1.400000e-05 | 7522.86ms | mfu 0.23%
150 | loss 7.7217 | lr 1.500000e-05 | 8107.00ms | mfu 0.23%
160 | loss 7.7358 | lr 1.600000e-05 | 10143.37ms | mfu 0.22%
170 | loss 7.8723 | lr 1.700000e-05 | 9415.52ms | mfu 0.22%
180 | loss 7.6995 | lr 1.800000e-05 | 8895.65ms | mfu 0.22%
190 | loss 7.7024 | lr 1.900000e-05 | 9207.56ms | mfu 0.22%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 200: train loss 7.3201, val loss 7.3648
saving checkpoint to out
wrote out/model.bin
200 | loss 7.5329 | lr 2.000000e-05 | 844618.46ms | mfu 0.20%
210 | loss 7.4289 | lr 2.100000e-05 | 6785.28ms | mfu 0.21%
220 | loss 7.2327 | lr 2.200000e-05 | 8362.16ms | mfu 0.21%
230 | loss 7.2898 | lr 2.300000e-05 | 8685.21ms | mfu 0.21%
240 | loss 6.5856 | lr 2.400000e-05 | 8429.21ms | mfu 0.21%
250 | loss 7.1450 | lr 2.500000e-05 | 8581.13ms | mfu 0.21%
260 | loss 6.8924 | lr 2.600000e-05 | 9256.28ms | mfu 0.21%
270 | loss 6.6424 | lr 2.700000e-05 | 8008.70ms | mfu 0.22%
280 | loss 6.6204 | lr 2.800000e-05 | 8175.07ms | mfu 0.22%
290 | loss 6.7038 | lr 2.900000e-05 | 9028.75ms | mfu 0.22%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 300: train loss 6.4878, val loss 6.5648
saving checkpoint to out
wrote out/model.bin
300 | loss 6.2764 | lr 3.000000e-05 | 859275.24ms | mfu 0.20%
310 | loss 6.0329 | lr 3.100000e-05 | 9423.75ms | mfu 0.20%
320 | loss 6.5250 | lr 3.200000e-05 | 9462.67ms | mfu 0.20%
330 | loss 6.0331 | lr 3.300000e-05 | 8219.33ms | mfu 0.20%
340 | loss 6.2929 | lr 3.400000e-05 | 8210.02ms | mfu 0.20%
350 | loss 6.3858 | lr 3.500000e-05 | 8483.06ms | mfu 0.21%
360 | loss 6.0504 | lr 3.600000e-05 | 8135.38ms | mfu 0.21%
370 | loss 6.5675 | lr 3.700000e-05 | 8457.20ms | mfu 0.21%
380 | loss 5.9549 | lr 3.800000e-05 | 8806.67ms | mfu 0.21%
390 | loss 6.2111 | lr 3.900000e-05 | 9974.87ms | mfu 0.21%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 400: train loss 6.0350, val loss 6.1920
saving checkpoint to out
wrote out/model.bin
400 | loss 6.0679 | lr 4.000000e-05 | 1091826.50ms | mfu 0.19%
410 | loss 6.1512 | lr 4.100000e-05 | 8072.29ms | mfu 0.19%
420 | loss 6.0620 | lr 4.200000e-05 | 8236.17ms | mfu 0.20%
430 | loss 6.0499 | lr 4.300000e-05 | 9770.63ms | mfu 0.20%
440 | loss 5.9541 | lr 4.400000e-05 | 9047.66ms | mfu 0.20%
450 | loss 5.7269 | lr 4.500000e-05 | 8423.80ms | mfu 0.20%
460 | loss 6.0458 | lr 4.600000e-05 | 9245.23ms | mfu 0.20%
470 | loss 5.0973 | lr 4.700000e-05 | 8859.56ms | mfu 0.20%
480 | loss 5.9435 | lr 4.800000e-05 | 8309.14ms | mfu 0.21%
490 | loss 5.7543 | lr 4.900000e-05 | 8060.78ms | mfu 0.21%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 500: train loss 5.7373, val loss 5.9544
saving checkpoint to out
wrote out/model.bin
500 | loss 6.0091 | lr 5.000000e-05 | 870067.53ms | mfu 0.19%
510 | loss 5.2391 | lr 4.999939e-05 | 8175.64ms | mfu 0.19%
520 | loss 5.6863 | lr 4.999756e-05 | 7966.06ms | mfu 0.20%
530 | loss 5.7223 | lr 4.999452e-05 | 8039.18ms | mfu 0.20%
540 | loss 5.3572 | lr 4.999025e-05 | 8103.30ms | mfu 0.21%
550 | loss 5.3626 | lr 4.998477e-05 | 8823.12ms | mfu 0.21%
560 | loss 5.5258 | lr 4.997807e-05 | 8428.06ms | mfu 0.21%
570 | loss 5.8287 | lr 4.997015e-05 | 9279.43ms | mfu 0.21%
580 | loss 4.9667 | lr 4.996102e-05 | 8629.52ms | mfu 0.21%
590 | loss 6.1569 | lr 4.995067e-05 | 7949.49ms | mfu 0.21%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 600: train loss 5.5033, val loss 5.8008
saving checkpoint to out
wrote out/model.bin
600 | loss 5.5272 | lr 4.993910e-05 | 867500.58ms | mfu 0.19%
610 | loss 5.5458 | lr 4.992632e-05 | 9717.10ms | mfu 0.19%
620 | loss 6.1198 | lr 4.991232e-05 | 9832.19ms | mfu 0.19%
630 | loss 5.0135 | lr 4.989711e-05 | 9362.08ms | mfu 0.19%
640 | loss 5.4544 | lr 4.988068e-05 | 9934.33ms | mfu 0.19%
650 | loss 5.3185 | lr 4.986305e-05 | 9852.73ms | mfu 0.19%
660 | loss 5.6993 | lr 4.984420e-05 | 10946.96ms | mfu 0.19%
670 | loss 5.3741 | lr 4.982414e-05 | 10456.20ms | mfu 0.19%
680 | loss 5.5355 | lr 4.980287e-05 | 11031.83ms | mfu 0.19%
690 | loss 5.4252 | lr 4.978039e-05 | 9952.31ms | mfu 0.19%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 700: train loss 5.3107, val loss 5.6719
saving checkpoint to out
wrote out/model.bin
700 | loss 5.0865 | lr 4.975670e-05 | 852449.20ms | mfu 0.17%
710 | loss 5.5724 | lr 4.973181e-05 | 9948.06ms | mfu 0.17%
720 | loss 5.5926 | lr 4.970571e-05 | 10089.81ms | mfu 0.18%
730 | loss 5.8298 | lr 4.967841e-05 | 9891.43ms | mfu 0.18%
740 | loss 5.3898 | lr 4.964990e-05 | 10796.33ms | mfu 0.18%
750 | loss 5.7127 | lr 4.962019e-05 | 9811.68ms | mfu 0.18%
760 | loss 5.1750 | lr 4.958929e-05 | 8947.14ms | mfu 0.18%
770 | loss 5.2793 | lr 4.955718e-05 | 9161.19ms | mfu 0.19%
780 | loss 5.5190 | lr 4.952388e-05 | 9747.03ms | mfu 0.19%
790 | loss 4.9607 | lr 4.948938e-05 | 9209.76ms | mfu 0.19%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 800: train loss 5.1545, val loss 5.5718
saving checkpoint to out
wrote out/model.bin
800 | loss 5.4576 | lr 4.945369e-05 | 855916.31ms | mfu 0.17%
810 | loss 4.6901 | lr 4.941681e-05 | 9035.18ms | mfu 0.17%
820 | loss 4.8903 | lr 4.937873e-05 | 8440.16ms | mfu 0.18%
830 | loss 4.4181 | lr 4.933947e-05 | 9305.47ms | mfu 0.18%
840 | loss 5.0928 | lr 4.929903e-05 | 8274.67ms | mfu 0.19%
850 | loss 5.4939 | lr 4.925739e-05 | 9255.03ms | mfu 0.19%
860 | loss 5.4620 | lr 4.921458e-05 | 9231.66ms | mfu 0.19%
870 | loss 5.6146 | lr 4.917058e-05 | 8022.33ms | mfu 0.20%
880 | loss 5.0798 | lr 4.912541e-05 | 8283.84ms | mfu 0.20%
890 | loss 5.1852 | lr 4.907906e-05 | 8486.83ms | mfu 0.20%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 900: train loss 5.0172, val loss 5.4819
saving checkpoint to out
wrote out/model.bin
900 | loss 4.8217 | lr 4.903154e-05 | 855435.06ms | mfu 0.18%
910 | loss 5.1092 | lr 4.898285e-05 | 7973.71ms | mfu 0.19%
920 | loss 4.8212 | lr 4.893299e-05 | 8459.96ms | mfu 0.19%
930 | loss 5.4917 | lr 4.888196e-05 | 8422.17ms | mfu 0.20%
940 | loss 4.2738 | lr 4.882977e-05 | 9007.81ms | mfu 0.20%
950 | loss 5.1529 | lr 4.877641e-05 | 8612.23ms | mfu 0.20%
960 | loss 4.2595 | lr 4.872190e-05 | 8490.12ms | mfu 0.20%
970 | loss 4.7789 | lr 4.866623e-05 | 8267.10ms | mfu 0.21%
980 | loss 4.3645 | lr 4.860941e-05 | 8674.61ms | mfu 0.21%
990 | loss 5.3505 | lr 4.855144e-05 | 8357.45ms | mfu 0.21%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1000: train loss 4.8927, val loss 5.4120
saving checkpoint to out
wrote out/model.bin
1000 | loss 4.6940 | lr 4.849232e-05 | 984054.12ms | mfu 0.19%
1010 | loss 4.7819 | lr 4.843205e-05 | 8521.71ms | mfu 0.19%
1020 | loss 4.5943 | lr 4.837064e-05 | 8861.96ms | mfu 0.20%
1030 | loss 5.0305 | lr 4.830810e-05 | 8279.48ms | mfu 0.20%
1040 | loss 4.6856 | lr 4.824441e-05 | 8650.21ms | mfu 0.20%
1050 | loss 5.0216 | lr 4.817960e-05 | 9415.09ms | mfu 0.20%
1060 | loss 4.3772 | lr 4.811365e-05 | 9863.83ms | mfu 0.20%
1070 | loss 4.7097 | lr 4.804658e-05 | 9237.76ms | mfu 0.20%
1080 | loss 5.2862 | lr 4.797838e-05 | 9578.55ms | mfu 0.20%
1090 | loss 4.9530 | lr 4.790907e-05 | 9027.24ms | mfu 0.20%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1100: train loss 4.7717, val loss 5.3499
saving checkpoint to out
wrote out/model.bin
1100 | loss 4.9974 | lr 4.783864e-05 | 850586.51ms | mfu 0.18%
1110 | loss 5.0609 | lr 4.776709e-05 | 8192.68ms | mfu 0.19%
1120 | loss 4.8763 | lr 4.769444e-05 | 8331.00ms | mfu 0.19%
1130 | loss 4.9531 | lr 4.762068e-05 | 6907.91ms | mfu 0.20%
1140 | loss 4.6328 | lr 4.754581e-05 | 6738.71ms | mfu 0.21%
1150 | loss 5.2984 | lr 4.746985e-05 | 6728.57ms | mfu 0.22%
1160 | loss 4.5999 | lr 4.739279e-05 | 6890.84ms | mfu 0.22%
1170 | loss 4.9067 | lr 4.731465e-05 | 7181.48ms | mfu 0.23%
1180 | loss 5.0734 | lr 4.723541e-05 | 7239.20ms | mfu 0.23%
1190 | loss 4.0202 | lr 4.715509e-05 | 7235.71ms | mfu 0.24%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1200: train loss 4.8594, val loss 5.7248
saving checkpoint to out
wrote out/model.bin
1200 | loss 4.5090 | lr 4.707369e-05 | 13796570.84ms | mfu 0.21%
1210 | loss 4.5758 | lr 4.699121e-05 | 8248.14ms | mfu 0.21%
1220 | loss 4.7654 | lr 4.690767e-05 | 8475.03ms | mfu 0.22%
1230 | loss 5.4346 | lr 4.682305e-05 | 8801.79ms | mfu 0.22%
1240 | loss 5.1975 | lr 4.673737e-05 | 9796.92ms | mfu 0.21%
1250 | loss 4.8145 | lr 4.665064e-05 | 9751.77ms | mfu 0.21%
1260 | loss 5.1132 | lr 4.656284e-05 | 9079.31ms | mfu 0.21%
1270 | loss 5.3448 | lr 4.647400e-05 | 8898.44ms | mfu 0.21%
1280 | loss 5.0621 | lr 4.638411e-05 | 10211.08ms | mfu 0.21%
1290 | loss 5.2875 | lr 4.629317e-05 | 8637.70ms | mfu 0.21%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1300: train loss 4.7212, val loss 5.1972
saving checkpoint to out
wrote out/model.bin
1300 | loss 5.3149 | lr 4.620120e-05 | 859385.16ms | mfu 0.19%
1310 | loss 5.1731 | lr 4.610820e-05 | 8296.90ms | mfu 0.19%
1320 | loss 4.9670 | lr 4.601417e-05 | 8559.66ms | mfu 0.20%
1330 | loss 5.4190 | lr 4.591911e-05 | 8259.73ms | mfu 0.20%
1340 | loss 4.9462 | lr 4.582303e-05 | 9781.23ms | mfu 0.20%
1350 | loss 5.0476 | lr 4.572594e-05 | 9851.32ms | mfu 0.20%
1360 | loss 5.3999 | lr 4.562784e-05 | 8541.99ms | mfu 0.20%
1370 | loss 5.0651 | lr 4.552873e-05 | 8902.38ms | mfu 0.20%
1380 | loss 4.9729 | lr 4.542862e-05 | 8993.98ms | mfu 0.21%
1390 | loss 4.6303 | lr 4.532752e-05 | 7355.79ms | mfu 0.21%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1400: train loss 4.7160, val loss 5.4762
saving checkpoint to out
wrote out/model.bin
1400 | loss 4.8419 | lr 4.522542e-05 | 3529084.17ms | mfu 0.19%
Traceback (most recent call last):
  File "/Users/davidlaxer/llama2.c/train.py", line 310, in <module>
    logits = model(X, Y)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/davidlaxer/llama2.c/model.py", line 263, in forward
    h = layer(h, freqs_cos, freqs_sin)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/davidlaxer/llama2.c/model.py", line 207, in forward
    h = x + self.attention.forward(self.attention_norm(x), freqs_cos, freqs_sin)
  File "/Users/davidlaxer/llama2.c/model.py", line 161, in forward
    assert not torch.logical_or(torch.isinf(output), torch.isnan(output)).any(), err_msg
AssertionError: The output  after transpose is not stable: tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='mps:0',
       grad_fn=<ViewBackward0>)
wandb: 
wandb: Run history:
wandb:       iter ▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
wandb: loss/train █▆▄▃▃▂▂▂▂▁▁▁▁▁▁
wandb:   loss/val █▅▄▃▂▂▂▂▁▁▁▁▂▁▁
wandb:         lr ▁▂▄▅▇████████▇▇
wandb:        mfu ▁██████████████
wandb:     tokens ▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
wandb: 
wandb: Run summary:
wandb:       iter 1400
wandb: loss/train 4.71601
wandb:   loss/val 5.47621
wandb:         lr 5e-05
wandb:        mfu 0.21082
wandb:     tokens 8601600
wandb: 

So I added assertions to try to pinpoint which operation(s) might be dividing by zero, overflowing or underflowing, producing exploding gradients, etc.

# manual implementation
scores = torch.matmul(xq, xk.transpose(2, 3)) / math.sqrt(self.head_dim)
err_msg = f"The output after scores is not stable: {scores}"
assert not torch.logical_or(torch.isinf(scores), torch.isnan(scores)).any(), err_msg

When I resumed training from the checkpoint just before the NaNs, with numerous assertions checking tensors for -Inf and NaN,
two things happened:
i. the time per iteration slowed by a factor of 2-3, and
ii. I haven't gotten any new NaNs.
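
A less invasive variant of the same idea (a sketch, assuming nothing beyond stock PyTorch) would be to register forward hooks instead of editing model.py, so the first module whose output goes non-finite gets flagged without scattering assertions through the attention code:

import torch

def add_nan_hooks(model):
    """Raise on the first module whose forward output contains NaN/Inf (sketch)."""
    handles = []
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    raise RuntimeError(f"non-finite output in {name} ({type(module).__name__})")
        return hook
    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles   # call h.remove() on each handle to detach the checks

# handles = add_nan_hooks(model)   # before the training loop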

% python train.py --vocab_source=custom --vocab_size=32000
Overriding: vocab_source = custom
Overriding: vocab_size = 32000
tokens per iteration will be: 6,144
breaks down as: 1 grad accum steps * 1 processes * 6 batch size * 1024 max seq len
Resuming training from out
num decayed parameter tensors: 113, with 137,822,208 parameters
num non-decayed parameter tensors: 33, with 25,344 parameters
using fused AdamW: False
wandb: Currently logged in as: dbl001. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /Users/davidlaxer/llama2.c/wandb/run-20231223_061052-88rpvndz
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run run2023_12_23_06_10_46
wandb: ⭐️ View project at https://wandb.ai/dbl001/llamac
wandb: 🚀 View run at https://wandb.ai/dbl001/llamac/runs/88rpvndz
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1400: train loss 4.7160, val loss 5.1368
saving checkpoint to out
wrote out/model.bin
1400 | loss 4.4253 | lr 4.522542e-05 | 3744730.81ms | mfu -100.00%
1410 | loss 5.1437 | lr 4.512234e-05 | 23375.43ms | mfu 0.08%
1420 | loss 4.4087 | lr 4.501828e-05 | 23881.84ms | mfu 0.08%
1430 | loss 4.7489 | lr 4.491325e-05 | 23615.35ms | mfu 0.08%
1440 | loss 4.0930 | lr 4.480724e-05 | 23686.11ms | mfu 0.08%
1450 | loss 4.8052 | lr 4.470027e-05 | 23652.07ms | mfu 0.08%
1460 | loss 4.8364 | lr 4.459234e-05 | 22506.38ms | mfu 0.08%
1470 | loss 4.3753 | lr 4.448345e-05 | 23318.68ms | mfu 0.08%
1480 | loss 4.7004 | lr 4.437361e-05 | 22537.14ms | mfu 0.08%
1490 | loss 4.8544 | lr 4.426283e-05 | 24084.47ms | mfu 0.08%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1500: train loss 4.1818, val loss 5.1098
saving checkpoint to out
wrote out/model.bin
1500 | loss 4.6525 | lr 4.415111e-05 | 3887341.48ms | mfu 0.07%
1510 | loss 4.1455 | lr 4.403846e-05 | 22862.19ms | mfu 0.08%
1520 | loss 4.7745 | lr 4.392488e-05 | 21689.91ms | mfu 0.08%
1530 | loss 4.4104 | lr 4.381037e-05 | 21086.61ms | mfu 0.08%
1540 | loss 3.5080 | lr 4.369495e-05 | 21571.38ms | mfu 0.08%
1550 | loss 4.5036 | lr 4.357862e-05 | 21261.27ms | mfu 0.08%
1560 | loss 4.3820 | lr 4.346138e-05 | 21061.43ms | mfu 0.08%
1570 | loss 4.6843 | lr 4.334325e-05 | 36302.10ms | mfu 0.08%
1580 | loss 4.8977 | lr 4.322422e-05 | 24714.99ms | mfu 0.08%
1590 | loss 5.0075 | lr 4.310430e-05 | 23758.57ms | mfu 0.08%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1600: train loss 4.1711, val loss 5.0889
saving checkpoint to out
wrote out/model.bin
1600 | loss 4.5460 | lr 4.298350e-05 | 3871690.12ms | mfu 0.07%
1610 | loss 4.7614 | lr 4.286182e-05 | 23098.85ms | mfu 0.07%
1620 | loss 4.9370 | lr 4.273927e-05 | 22423.83ms | mfu 0.07%
1630 | loss 4.6698 | lr 4.261586e-05 | 24143.78ms | mfu 0.07%
1640 | loss 3.9477 | lr 4.249158e-05 | 22315.55ms | mfu 0.08%
1650 | loss 4.4027 | lr 4.236646e-05 | 23349.27ms | mfu 0.08%
1660 | loss 4.7358 | lr 4.224049e-05 | 23731.63ms | mfu 0.08%
1670 | loss 4.4030 | lr 4.211368e-05 | 24967.93ms | mfu 0.08%
1680 | loss 4.5963 | lr 4.198603e-05 | 23714.32ms | mfu 0.08%
1690 | loss 4.5313 | lr 4.185756e-05 | 22577.01ms | mfu 0.08%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1700: train loss 4.1253, val loss 5.0436
saving checkpoint to out
wrote out/model.bin
1700 | loss 4.2296 | lr 4.172827e-05 | 3769860.63ms | mfu 0.07%
1710 | loss 3.4392 | lr 4.159816e-05 | 21626.95ms | mfu 0.07%
1720 | loss 4.5379 | lr 4.146724e-05 | 21638.41ms | mfu 0.07%
1730 | loss 4.3117 | lr 4.133552e-05 | 21650.88ms | mfu 0.08%
1740 | loss 4.3748 | lr 4.120300e-05 | 21839.64ms | mfu 0.08%
1750 | loss 4.4910 | lr 4.106969e-05 | 21552.58ms | mfu 0.08%
1760 | loss 3.8457 | lr 4.093560e-05 | 21436.50ms | mfu 0.08%
1770 | loss 4.4880 | lr 4.080073e-05 | 21557.70ms | mfu 0.08%
1780 | loss 3.6890 | lr 4.066510e-05 | 21757.75ms | mfu 0.08%
1790 | loss 4.5338 | lr 4.052869e-05 | 21053.19ms | mfu 0.08%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1800: train loss 4.0669, val loss 5.0291
saving checkpoint to out
wrote out/model.bin
1800 | loss 4.0547 | lr 4.039154e-05 | 4993445.06ms | mfu 0.07%
1810 | loss 4.1183 | lr 4.025363e-05 | 21422.80ms | mfu 0.08%
1820 | loss 4.6204 | lr 4.011498e-05 | 21034.09ms | mfu 0.08%
1830 | loss 4.1170 | lr 3.997559e-05 | 21737.89ms | mfu 0.08%
1840 | loss 4.1425 | lr 3.983547e-05 | 21759.81ms | mfu 0.08%
1850 | loss 4.0380 | lr 3.969463e-05 | 21762.27ms | mfu 0.08%
1860 | loss 4.4524 | lr 3.955307e-05 | 22061.17ms | mfu 0.08%
1870 | loss 3.0768 | lr 3.941081e-05 | 21781.28ms | mfu 0.08%
1880 | loss 4.0528 | lr 3.926784e-05 | 21513.16ms | mfu 0.08%
1890 | loss 4.3504 | lr 3.912418e-05 | 21955.43ms | mfu 0.08%
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 1900: train loss 4.0252, val loss 5.0059
saving checkpoint to out
wrote out/model.bin
1900 | loss 4.3999 | lr 3.897982e-05 | 3733163.88ms | mfu 0.07%
1910 | loss 3.5555 | lr 3.883479e-05 | 22730.30ms | mfu 0.08%
1920 | loss 4.1726 | lr 3.868908e-05 | 22654.03ms | mfu 0.08%
1930 | loss 4.2485 | lr 3.854271e-05 | 21537.36ms | mfu 0.08%
1940 | loss 3.4835 | lr 3.839567e-05 | 23232.64ms | mfu 0.08%
1950 | loss 3.9259 | lr 3.824798e-05 | 23190.81ms | mfu 0.08%

Could the assertions checking the tensors have changed the computation graph, or forced lazy values to be materialized, changing either the timing of operations on the GPU or the path of the computations ... leading to 'stability' in training?
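
One way to test that hypothesis without any assertions (a sketch, with my own helper name): leave model.py untouched and just force the MPS command queue to drain once per step, then see whether training still stays finite. torch.mps.synchronize() should be available in these nightly builds; torch.autograd.set_detect_anomaly(True) is a heavier alternative that also pinpoints the op producing the first NaN in the backward pass.

import torch

def maybe_sync(device: str) -> None:
    # mimic the side effect the assertions likely had: eager evaluation / a sync point
    if device == "mps":
        torch.mps.synchronize()

# inside the training loop, e.g. right after loss.backward():
#     loss.backward()
#     maybe_sync(device)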

% ./run out/model.bin -z data/tok32000.bin -i "Breakthrough infection in a vaccinated individual was defined as an RT-qPCR-positive test 14 or more days after the individual received the second vaccine dose, conditional on this RT-qPCR-positive test being the first ever positive for this individual."
Breakthrough infection in a vaccinated individual was defined as an RT-qPCR-positive test 14 or more days after the individual received the second vaccine dose, conditional on this RT-qPCR-positive test being the first ever positive for this individual. 5 Notes 59 C for our work were seen in the establishment of carbon intensity. At the 59 years after scored link to population samples from 50% for measurement, 59ad whom, and 50 countries were related to trans Am directly. If the narrative were less than 20 min, the expiratory therapeutic policy was introduced to be patients who led to a more nearly an overall severity time, but combined with active animal cataba and peptide-ELOovannungs tend to be found. D II times is highly reliability in financial localization from the same facilities, along with a proper spatialX relationship towards an analytical method of CHD in reducing contaminated dysfunction. 21 were able to produce a negative control assessment strategy for different sciences in lethal and secondary effect of regulation under the promising mechanisms underlying the interaction between signal closing and location of virulence. In a report of 22 subjects, Cfiner et al. a small analysis of 47 patients have generated a inner
achieved tok/s: 3.419881

I was able to train a llama2.c model with 12 heads and 12 layers on data from COVID-19 research papers (https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge/data) using device='mps' for 25,000 iterations without getting NaNs or -Infs. This was using torch version 2.4.0a0+gite3d5afc.
The inference output from 'run' still doesn't make much sense; however, this is due to the memory limitations of an AMD Radeon Pro 5700 XT (16 GB) with 12 layers and 12 heads (~100 million parameters).

[Screenshot attached]
%  ./run out/model.bin -z data/tok32000.bin
1688806,6a27155a25e27697eb3b07f88cf9340164a2e848,Sustischen occurring due to external blockade and side effects in turhexate ribavirin A mRNA vaccine is the diet-related traumatic rejecting enzyme-2 inhibitors of severe acute respiratory syndrome-2 coronavirus 2 infection in the treatment of asthma in asthma [32] . Nevertheless, we propose a known non-structural protein-activated protein kinase 2 mRNA in human mice with a Bcl-2 inhibitor-free regimen, which emerges as a key target to modulate estrogen responses in the treatment of asthma and of asthma. We provide an alternative approach in combination therapy with a fast in vitro administration in human Ab-derived Nitigating Toxicity, cardiovascular effects, and pulmonary vaccine invasion of such newborn mice and mouse lungs. The present field of this study was approved in England for IFN-beta-induced ROSTPTT-derived inhibitor BMA-25168. Prospective prediction of the efficacy of lopinavir-TIAS treatment in non-and nasal pulmonary infections suggests that it effectively exhibit inhibitory activity against virions in vitro. Once these are anti
achieved tok/s: 4.756398