Modified nanoGPT for a character-level transformer on enwik8.
Implements local-global hybrid self-attention.
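The README does not spell out the hybrid scheme, but the general idea can be sketched as follows: some attention layers use a sliding-window (local) causal mask while the others remain fully causal (global). The window size and layer assignment below are illustrative assumptions, not the repo's actual values:

```python
import torch

def causal_mask(T):
    # full (global) causal mask: position t may attend to every position <= t
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def local_causal_mask(T, window):
    # sliding-window (local) causal mask: position t may attend only to the
    # previous `window` positions, itself included
    i = torch.arange(T).unsqueeze(1)   # query positions
    j = torch.arange(T).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

def hybrid_layer_masks(n_layer, T, window=128, global_every=4):
    # illustrative assignment: every `global_every`-th layer is global,
    # the remaining layers use the local sliding-window mask
    return [causal_mask(T) if (l + 1) % global_every == 0 else local_causal_mask(T, window)
            for l in range(n_layer)]

# inside each attention layer the mask would be applied before the softmax, e.g.
#   att = att.masked_fill(~mask, float('-inf'))
```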
Train a character-level GPT on the first 100M characters of an English Wikipedia dump (enwik8). First, we download it as a single compressed file and turn it from raw text into one large stream of integers:
$ python data/enwik8/prepare.py
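The preparation script is not reproduced here; a minimal sketch in the spirit of nanoGPT's `shakespeare_char` prepare script might look like the following (the file paths, byte-level vocabulary handling, and `meta.pkl` contents are assumptions and may differ from the real script):

```python
import os
import pickle
import numpy as np

# assumed location of the raw 100M-byte enwik8 file after unzipping the download
input_path = os.path.join('data', 'enwik8', 'enwik8')
with open(input_path, 'rb') as f:
    data = f.read()

# build a character (byte)-level vocabulary over the symbols that actually occur
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# encode the whole text as one large stream of integer ids
ids = np.array([stoi[b] for b in data], dtype=np.uint16)

# 90M / 5M / 5M split
n = len(ids)
train_ids = ids[:int(0.9 * n)]
val_ids = ids[int(0.9 * n):int(0.95 * n)]
test_ids = ids[int(0.95 * n):]

train_ids.tofile(os.path.join('data', 'enwik8', 'train.bin'))
val_ids.tofile(os.path.join('data', 'enwik8', 'val.bin'))
test_ids.tofile(os.path.join('data', 'enwik8', 'test.bin'))

# save the vocabulary so sample.py / test.py can decode generated ids back to text
with open(os.path.join('data', 'enwik8', 'meta.pkl'), 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'stoi': stoi, 'itos': itos}, f)
```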
The script creates `train.bin`, `val.bin`, and `test.bin` in that data directory with a 90M/5M/5M split. Now we train a baby GPT with the settings in the `config/train_enwik8_small.py` config file for the baseline model, or `config/train_enwik8_hybrid.py` for the hybrid self-attention model:
$ python train.py config/train_enwik8_hybrid.py
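The config files follow nanoGPT's convention of plain Python variable overrides. The values below are an illustrative guess at what `config/train_enwik8_hybrid.py` might contain, not its actual contents (in particular, `attention` and `local_window` are hypothetical option names); check the file in the repo for the real settings:

```python
# config/train_enwik8_hybrid.py -- illustrative sketch only
out_dir = 'out-enwik8-char-hybrid'
eval_interval = 1000
eval_iters = 200

dataset = 'enwik8'
batch_size = 32
block_size = 512        # context length in characters

# small "baby GPT" sizes (assumptions, not the values that give 12.95M params)
n_layer = 8
n_head = 8
n_embd = 384
dropout = 0.1

# hybrid local-global attention (hypothetical option names)
attention = 'hybrid'
local_window = 128

max_iters = 100000
learning_rate = 1e-3
lr_decay_iters = 100000
min_lr = 1e-4
```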
We can sample from the final model checkpoint; the generated samples are saved as `sample.txt` files in the `out_dir`:
$ python sample.py --out_dir=out-enwik8-char-hybrid
This generates a few samples, for example:
""" A band named "Esperanto masculini" (see also [[Masculine wars]]) shares the [[extreme perfection of language]], i.e. the [[synonym]] of "Esperanto masculini" (see ''[[Esperanto masculini]]''.) Both sides are used in different ways such as [[Elementary masculini]] (see [[Esperanto masculini]]) and [[Esperanto masculini]] (see [[List of esperanto masculini]]). In the United States, Esperanto masculini is also used in masculini large parts of [[Africa]]. """
Finally, we evaluate both models on the held-out test split:
$ python test.py --out_dir=out-enwik8-char-small
$ python test.py --out_dir=out-enwik8-char-hybrid
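`test.py` is not shown here; conceptually it loads the checkpoint from `out_dir` and reports the mean cross-entropy over `test.bin`. A rough sketch, assuming a nanoGPT-style model whose forward pass returns `(logits, loss)` and estimating the loss over random batches rather than the full split:

```python
import numpy as np
import torch

@torch.no_grad()
def test_loss(model, test_bin='data/enwik8/test.bin', block_size=512,
              batch_size=32, n_batches=200, device='cuda'):
    # memory-map the test split produced by prepare.py
    data = np.memmap(test_bin, dtype=np.uint16, mode='r')
    model.eval()
    losses = []
    for _ in range(n_batches):
        ix = torch.randint(len(data) - block_size - 1, (batch_size,))
        x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
        y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
        x, y = x.to(device), y.to(device)
        _, loss = model(x, y)   # nanoGPT's GPT returns (logits, loss) when targets are given
        losses.append(loss.item())
    return float(np.mean(losses))
```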
After 100,000 training steps, we observe the following losses on the test split:
| model | params | test loss |
| --- | --- | --- |
| small (baseline) | 12.95M | 1.3056 |
| hybrid | 12.95M | 1.2893 |
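For comparison with published enwik8 results, the test loss can be converted to bits per character. Assuming the reported loss is mean cross-entropy in nats (nanoGPT's default), the conversion is simply loss / ln 2:

```python
import math

def nats_to_bpc(loss_nats):
    # bits per character = cross-entropy in nats divided by ln(2)
    return loss_nats / math.log(2)

print(nats_to_bpc(1.3056))  # baseline: ~1.88 bpc
print(nats_to_bpc(1.2893))  # hybrid:   ~1.86 bpc
```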