Inference of Hugging Face's BLOOM-like models in pure C/C++.
This repo was built on top of the amazing llama.cpp repo by @ggerganov and the bloomz.cpp repo by @NouamaneTazi to support BLOOM models in the GGUF format. It supports all models that can be loaded with `BloomForCausalLM.from_pretrained()`.
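For example, you can quickly check whether a particular checkpoint is compatible by loading it with `transformers` (a minimal sketch; `bigscience/bloomz-560m` is just an example model id):

```python
# Quick compatibility check: if this loads, the checkpoint can be converted.
from transformers import BloomForCausalLM

model = BloomForCausalLM.from_pretrained("bigscience/bloomz-560m")  # example model id
print(model.config.model_type)  # prints "bloom" for BLOOM checkpoints
```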
First, you need to clone the repo and build it:

```bash
git clone https://github.com/AndrewNgo-ini/bloom.cpp
cd bloom.cpp
make
```
I have already converted these models:

- https://huggingface.co/hiieu/bloomz-560m-GGUF
- https://huggingface.co/hiieu/bloomz-7b1-GGUF
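If you just want one of these pre-converted files, you can fetch it directly with `huggingface_hub` (a sketch; the exact GGUF filename inside each repo is an assumption, so check the repo's file listing):

```python
# Fetch a pre-converted GGUF file from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="hiieu/bloomz-560m-GGUF",
    filename="bloomz-560m.gguf",  # assumed filename; verify against the repo's file list
    local_dir="models/bloomz-560m",
)
print(path)
```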
Otherwise, you can convert the model weights to the GGUF format yourself; any BLOOM model can be converted.
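The convert script reads the original Hugging Face checkpoint from a local directory (`models/bloomz-560m` in the example below). One way to fetch it is with `huggingface_hub` (a minimal sketch; using `snapshot_download` with the `bigscience/bloomz-560m` model id is just one way to obtain the weights, and a plain `git clone` of the model repo works too):

```python
# Download the original HF checkpoint into the directory the convert script reads.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bigscience/bloomz-560m",  # example model id; swap in the BLOOM model you want
    local_dir="models/bloomz-560m",
)
```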
Then install the required Python libraries and run the convert script:

```bash
# install required libraries
python3 -m pip install torch numpy transformers accelerate

# convert the bloomz-560m model to GGUF FP16 format
python convert-bloom-hf-to-gguf.py models/bloomz-560m
# note: add --use-f32 to convert to FP32 instead of FP16
```
Optionally, you can quantize the model to 4 bits:

```bash
./quantize ./models/bloomz-560m/ggml-model-f16.gguf ./models/bloomz-560m/ggml-model-q4_0.gguf q4_0
```
Finally, you can run inference with the `main` example:

```bash
./main -m models/bloomz-560m/bloomz-560m.gguf -n 128 -p "Translate to English: Je t’aime."
```

Or start an HTTP server:

```bash
./server -m models/bloomz-560m/ggml-model-f16.gguf -c 2048
```
Example `curl` request:

```bash
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```