jax-llm

JAX implementation of Large Language Models. You can train GPT-2-like models with 青空文庫 (aozora bunko-clean dataset) or any other text dataset. Model implementation is based on NanoLM.

How to use

Prepare the aozora bunko-clean dataset.

python3 src/jax_llm/prepare_aozora.py --book_num 10246

This command generates a single text file. We use 10246 books (80M Tokens).

Note

You can use any dataset for training by simply preparing a suitable txt file, without executing this command. For example, Wikitext-JA is a good choice.

Train the BPE (Byte Pair Encoding) tokenizer.

Specify the path to the text file created in the previous step.

python3 src/jax_llm/train_tokenizer.py --data_name "aozora_10246"

Train the NanoLM model with aozora bunko-clean dataset.

python3 src/jax_llm/train.py --data_name "aozora_10246" --batch_size 256 --n_iterations 20000 --n_freq_eval 100 --dropout_rate 0.0 --learning_rate 0.001 --num_layers 12 --embed_size 512  --head_size 64 --num_heads 8 --block_size 128 --wandb_log True

This command takes about one hour with a single GPU (A100). If you don't have a GPU, you can scale down the model size by reducing the dataset size and the model size.

Generate text with the trained model.

$ python3 src/jax_llm/generate.py --prompt "国境の長いトンネルを抜けると雪国であった。" --data_name "aozora_10246" --max_new_tokens 200

Results

The model size is 80M parameters.

aozora_10246 (80M Tokens).

Loss Dynamics

Prompt: "国境の長いトンネルを抜けると雪国であった。"

Output: 国境の長いトンネルを抜けると雪国であった。このあたりは、一冬に満たねばならなかった。そして、その冬枯の季節、冬の日は、その季節、冬の間、寒の季節に、大気の冷える夜だった。「あ、そうだ。たしかに」と、半七は笑っていた。「だが、そんな訳でもない。あの時に、私は、あの、お邸のお女中さん ―― と、お名は、若い妓と、妓の吉次とが、その枕元から取り交している。しかし、彼女は、その母に向ってそれを否定しようと努めなかった。そうして、その反感を、どう解釈してよいか、また、自分の力でそれに従うべきかを、自分ではっきり知っていながらでも、やはり自分を信じなければならない。それは、自分の知らないうちに、自分の過去の記憶で、この小説の中に残っている「昔の人」には一種のロマンが含まれている。だが、そこには何かこう神秘的な力が潜んでいる。それは、私がこの眼で見た、最も純粋な、また同時に美しい ―― それは、その創造した芸術の美しさである。そして、そこに一つの現実への飛躍がある

Data Parallelism

You can train the model with multiple GPUs using data parallelism.

python3 src/jax_llm/train_data_parallel.py --data_name "aozora_10246" --batch_size 128 --n_iterations 5000 --n_freq_eval 100 --dropout_rate 0.1 --learning_rate 0.001 --num_layers 12 --embed_size 512  --head_size 64 --num_heads 8 --block_size 256 --n_devices 2

Using 2 GPUs with the same settings as in the previous example reduces the training time by about 1.5 times compared to using a single GPU. You can get a similar loss dynamics as the single GPU case.

Projects Using jax-llm

input-method: First-two-char input method using transformer-based language model and n-gram model.

References

Special thanks to the following repositories, papers, and datasets.

Dataset

akeyhero, aozora bunko-clean 青空文庫, https://www.aozora.gr.jp/
Mori et al., Wikitext-JA
Tsukagoshi et al., WikiSplit++

Appendix

Training with WikiSplit++ dataset.

Prepare the WikiSplit++ dataset.

We construct input.txt by concatenating the simple_reversed (or any other) fields from the train dataset with <|endoftext|> as a separator.

python3 src/jax_llm/prepare_wikisplit-pp.py --field "simple_reversed"

Substitute "aozora_10246" with "wikisplit-pp" in the commands in the How to use section.

Results

wikisplit-pp-simple_reversed (20M Tokens)

Loss Dynamics

Prompt: "The train came out of the long tunnel into the snow country."

Output: The train came out of the long tunnel into the snow country . In the 1970s a pedestrian was also used . However , it was widely viewed as an alternative to the American film industry . The original " Superman III " was based on the first game in series . However , he is a good friend for the character , and is reluctant to let anyone out of school . A year later the station began operation in the late 1980s , but in late 2002 to be converted to DVD . As a result , " The Voice " had some of the first international albums . He is a long - time contributor to the " New York Times ". The Daily Telegraph is an imprint of the American National Media Enterprise Association . She studied literature and art history . She was an undergraduate in the classical languages of Germany and the Austrian - Polish language . Born in the southern German region of the Czech state of the Kingdom of Hungary . The most extensive was the United States ' s entry in the 1960s , where the U . S . Supreme Court is being investigated . The

mahi97/jax-llm

jax-llm

How to use

Prepare the aozora bunko-clean dataset.

Train the BPE (Byte Pair Encoding) tokenizer.

Train the NanoLM model with aozora bunko-clean dataset.

Generate text with the trained model.

Results

Data Parallelism

Projects Using jax-llm

References

Dataset

Appendix

Training with WikiSplit++ dataset.

Prepare the WikiSplit++ dataset.

Results