InstructLLaMa.cpp

Fast inference of an Instruct-tuned LLaMA model on your personal devices.

Discord: https://discord.gg/peBU7yWa

Inference of the LLaMA model with instruction fine-tuning via LoRA fine-tunable adapter layers.

Dev notes: We are switching away from our C++ implementation of LLaMA to the more recent llama.cpp by @ggerganov, which now offers nearly the same performance (and output quality) on MacBooks, as well as support for Linux and Windows.

Supported platforms: macOS, Linux, Windows (via CMake)
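On macOS and Linux a plain make is enough; on Windows, a typical out-of-source CMake build would look like the following (this assumes the project ships a CMakeLists.txt, as llama.cpp does):

mkdir build
cd build
cmake ..
cmake --build . --config Release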

License: MIT

If you use the LLaMA weights, they should be used only for non-commercial research purposes.

Description & Usage

Here is a typical run using the adapter weights uploaded by tloen/alpaca-lora-7b under the MIT license:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin --instruction "Write an email to your friend about your plans for the weekend." -t 8 -n 128
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin --instruction "Calculate the area of a circle given its radius." --input "radius = 3" -t 8 -n 128

These follow Stanford's Alpaca format for instruction prompts (https://github.com/tatsu-lab/stanford_alpaca#data-release).
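For reference, the with-input variant of that template reads as follows (the no-input variant simply drops the ### Input: block):

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response: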

Setup

Here are the steps for the LLaMA-7B model (same as llama.cpp); the setup defaults to the adapter weights uploaded by tloen/alpaca-lora-7b under the MIT license:

# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece transformers

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits
./quantize.sh 7B

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 --instruction <instruction> --input <input_to_instruction>
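To give a feel for what the quantization step does, here is a minimal Python sketch of symmetric 4-bit block quantization in the spirit of ggml's q4_0 format. The block size of 32 matches ggml, but the exact rounding scheme here is an assumption; the real quantizer is implemented in C++ and additionally packs two 4-bit values per byte:

import numpy as np

QK = 32  # ggml's block size: each block of 32 weights shares one scale

def quantize_block(x):
    # Pick a scale so the largest magnitude maps to +/-7, then round
    # every weight to a 4-bit integer in [-8, 7].
    amax = np.abs(x).max()
    d = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / d), -8, 7).astype(np.int8)
    return d, q

def dequantize_block(d, q):
    # Recover approximate float weights from the scale and 4-bit ints.
    return d * q.astype(np.float32)

block = np.random.randn(QK).astype(np.float32)
d, q = quantize_block(block)
print("max abs error:", np.abs(block - dequantize_block(d, q)).max())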

How this differs from the original llama.cpp:

  • convert-pth-to-ggml.py has been updated to download and handle the LoRA weights (see the sketch after this list).
  • utils.h and utils.cpp have been modified to support input prompts in the style of Alpaca.
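The first change above boils down to folding the low-rank LoRA update into the base weights before conversion. Below is a minimal PyTorch sketch of that merge using the standard LoRA formulation W' = W + (alpha / r) * B A; the function and variable names are hypothetical, not taken from the repository's code:

import torch

def merge_lora(w_base, lora_a, lora_b, alpha, r):
    # w_base: (out_features, in_features) base weight matrix
    # lora_a: (r, in_features) adapter "A" projection
    # lora_b: (out_features, r) adapter "B" projection
    scaling = alpha / r
    return w_base + scaling * (lora_b @ lora_a)

# Example: a rank-8 adapter on a 4096x4096 attention projection.
w = torch.randn(4096, 4096)
a = torch.randn(8, 4096)
b = torch.zeros(4096, 8)  # B starts at zero in LoRA training
merged = merge_lora(w, a, b, alpha=16, r=8)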