Discord: https://discord.gg/peBU7yWa
Inference of the LLaMA model with instruction finetuning, using fine-tunable LoRA adapter layers.
Dev notes: We are switching away from our own C++ implementation of LLaMA to the more recent llama.cpp by @ggerganov, which now offers nearly the same performance (and output quality) on MacBook as well as support for Linux and Windows.
Supported platforms: Mac OS, Linux, Windows (via CMake)
License: MIT
If you use the LLaMA weights, they should only be used for non-commercial research purposes.
Here is a typical run using the adapter weights uploaded by tloen/alpaca-lora-7b under the MIT license:
```bash
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin --instruction "Write an email to your friend about your plans for the weekend." -t 8 -n 128

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin --instruction "Calculate the area of a circle given its radius." --input "radius = 3" -t 8 -n 128
```
These follow Stanford Alpaca's format for instruction prompts (https://github.com/tatsu-lab/stanford_alpaca#data-release).
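For reference, here is a minimal sketch of how such a prompt is assembled, following the template published in the Stanford Alpaca repository; the `build_prompt` helper below is only illustrative and is not part of this repo:

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble an Alpaca-style prompt (template from the Stanford Alpaca data release)."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input that provides "
            "further context. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

print(build_prompt("Calculate the area of a circle given its radius.", "radius = 3"))
```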
Here are the steps for the LLaMA-7B model (same as in llama.cpp); they default to the adapter weights uploaded by tloen/alpaca-lora-7b under the MIT license:
```bash
# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece transformers

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits (a sketch of the idea follows these steps)
./quantize.sh 7B

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 --instruction <instruction> --input <input_to_instruction>
```
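To give an idea of what the quantization step does, here is a rough numpy sketch of symmetric 4-bit block quantization. It only illustrates the principle (one scale per block plus a small integer per weight); the actual ggml q4_0 format packs two 4-bit values per byte and differs in layout details.

```python
import numpy as np

def quantize_4bit_blocks(weights: np.ndarray, block_size: int = 32):
    """Illustrative symmetric 4-bit block quantization (not the exact ggml q4_0 layout)."""
    x = weights.reshape(-1, block_size).astype(np.float32)
    amax = np.abs(x).max(axis=1, keepdims=True)        # per-block max magnitude
    scale = np.where(amax == 0.0, 1.0, amax / 7.0)     # map [-amax, amax] onto the integer range [-7, 7]
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_4bit_blocks(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())
```

Each block of 32 weights is then stored as one scale plus 32 small integers, which is roughly how the model drops from 16 bits to about 4 bits per weight.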
convert-pth-to-ggml.py has been updated to download and handle LoRA weights. utils.h and utils.cpp have been modified to support input prompts in the style of Alpaca.
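For intuition, here is a minimal numpy sketch of what folding LoRA weights into a base weight matrix amounts to, assuming the standard LoRA formulation W' = W + (alpha / r) * B @ A. The function name and shapes are purely illustrative and are not the actual code in convert-pth-to-ggml.py.

```python
import numpy as np

def merge_lora(base_weight: np.ndarray,
               lora_a: np.ndarray,
               lora_b: np.ndarray,
               alpha: float,
               rank: int) -> np.ndarray:
    """Merge a LoRA adapter into a base weight matrix: W' = W + (alpha / r) * B @ A.

    base_weight: (out_features, in_features)
    lora_a:      (rank, in_features)
    lora_b:      (out_features, rank)
    """
    return base_weight + (alpha / rank) * (lora_b @ lora_a)

# Hypothetical small shapes, just to show the bookkeeping.
out_f, in_f, r, alpha = 128, 128, 8, 16.0
w = np.zeros((out_f, in_f), dtype=np.float32)
a = 0.01 * np.random.randn(r, in_f).astype(np.float32)
b = 0.01 * np.random.randn(out_f, r).astype(np.float32)
print(merge_lora(w, a, b, alpha, r).shape)  # (128, 128)
```

Once merged this way, the result is an ordinary weight matrix, so the rest of the conversion and quantization pipeline does not need to know LoRA was involved.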