This repository is intended as a simple, hackable, and fast implementation for training, finetuning, and running inference with LLaMA-based models (arXiv).
Dependencies:
- Python 3
- pytorch
- numpy
- tiktoken
- huggingface transformers
- wandb
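Assuming standard PyPI package names (check the repository for a requirements file first, if one is provided), the dependencies can be installed with:
$ pip install torch numpy tiktoken transformers wandb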
Use the script train.py to start your training. It reads train.bin and val.bin files from the dataset directory. You can create both files with a single script like this:
import numpy as np
import tiktoken

def encode_file(input_file_path, output_file_path, tokenizer_name):
    # Tokenize a raw text file and dump the token ids as a flat binary file
    tokenizer = tiktoken.get_encoding(tokenizer_name)
    with open(input_file_path, 'r') as f:
        data = f.read()
    enc_data = tokenizer.encode(data)
    enc_data = np.array(enc_data, dtype=np.uint32)
    enc_data.tofile(output_file_path)

encode_file('raw/dataset1/train.txt', 'data/dataset1/train.bin', 'cl100k_base')
encode_file('raw/dataset1/val.txt', 'data/dataset1/val.bin', 'cl100k_base')
The training script can be run both on a single node with one or more GPUs and on multiple nodes with Distributed Data Parallel (DDP).
To run on a single node with 1 GPU, example:
$ python train.py \
--config="../config/train_allamo_cl100k_base.json" \
--wandb_log=True
To run on a single node with 8 GPUs with DDP, example:
$ torchrun --standalone --nproc_per_node=8 train.py \
--config="../config/train_allamo_cl100k_base.json" \
--wandb_log=True
To run on 2+ nodes with DDP, example:
- Run on the first (master) node with example IP 123.456.123.456:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py \
--config="../config/train_allamo_cl100k_base.json" \
--wandb_log=True
- Run on the worker node(s):
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py \
--config="../config/train_allamo_cl100k_base.json" \
--wandb_log=True
Note: if your cluster does not have an InfiniBand interconnect, prepend NCCL_IB_DISABLE=1 to the command.
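For example, the worker-node launch above would become (same arguments as before, only the environment variable is new):
$ NCCL_IB_DISABLE=1 torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py \
--config="../config/train_allamo_cl100k_base.json" \
--wandb_log=True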
The process of finetuning is similar to regular training, but we initialize from a pretrained model and use a smaller learning rate during training. In addition, it is essential to ensure that the model parameters used for finetuning are consistent with those used during pre-training.
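As a rough sketch only (the config file name below is hypothetical, not part of this repository), a finetuning run reuses the same entry point with a config copied from pre-training, in which the model parameters stay unchanged, the learning rate is lowered, and the run is initialized from the pretrained checkpoint:
$ python train.py \
--config="../config/finetune_allamo_cl100k_base.json" \
--wandb_log=True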
Use the script sample.py to sample from a model you trained. For example:
$ python sample.py \
--config="../config/train_allamo_cl100k_base.json" \
--max_new_tokens=100 \
--temperature=0.7 \
--top_k=200 \
--num_samples=5 \
--prompt="Long long time ago"
You can also prompt the model with text from a file by prefixing its path with FILE:, for example:
$ python sample.py \
--config="../config/train_allamo_cl100k_base.json" \
--max_new_tokens=100 \
--temperature=0.7 \
--top_k=200 \
--num_samples=5 \
--prompt="FILE:prompt.txt"
Specify the tokenizer using --tiktoken_tokenizer_name for Tiktoken (e.g. cl100k_base), or, thanks to Hugging Face Transformers, use your own pretrained tokenizer by providing its JSON config file with --custom_tokenizer_path.
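For example, to sample with your own Hugging Face tokenizer (the tokenizer path below is illustrative):
$ python sample.py \
--config="../config/train_allamo_cl100k_base.json" \
--custom_tokenizer_path="../tokenizers/my_tokenizer/tokenizer.json" \
--max_new_tokens=100 \
--prompt="Long long time ago"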
Use the script sample_api.py to expose three API endpoints that let you query a pretrained model for text embeddings, completions, and prompt tokenization.
To run the API with a pretrained model, example:
$ python sample_api.py \
--config="../config/train_allamo_cl100k_base.json" \
--max_new_tokens=10 \
--temperature=0.7 \
--top_k=200 \
--num_samples=5
- Query for text embeddings, example:
$ curl -X POST -H "Content-Type: application/json" http://localhost:5000/embeddings -d '{"prompt": "Long long time ago"}'
- Query for text completions, example:
$ curl -X POST -H "Content-Type: application/json" http://localhost:5000/completions -d '{"prompt": "Long long time ago", "num_samples": 3}'
- Query for tokens to see how your prompt is tokenized, example:
$ curl -X POST -H "Content-Type: application/json" http://localhost:5000/tokens -d '{"prompt": "Long long time ago"}'
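If you prefer calling the endpoints from Python, here is a minimal sketch using the requests library; the host, port, and request bodies mirror the curl examples above, while the response is assumed to be JSON whose exact structure depends on sample_api.py:

import requests

API_URL = "http://localhost:5000"  # same host/port as in the curl examples above

# See how the prompt is tokenized (response structure is an assumption - inspect sample_api.py).
tokens = requests.post(f"{API_URL}/tokens", json={"prompt": "Long long time ago"}).json()
print(tokens)

# Request 3 text completions, mirroring the curl example above.
completions = requests.post(f"{API_URL}/completions", json={"prompt": "Long long time ago", "num_samples": 3}).json()
print(completions)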
To run the UI on top of the API, example:
$ python sample_ui.py