ggerganov/llama.cpp for the Internet Computer.
Currently, the canister can only be built on a Mac!
- Install icpp-pro, the C++ Canister Development Kit (CDK) for the Internet Computer.

- Clone the repo and its children:

  ```bash
  # Clone this repo
  git clone git@github.com:onicai/llama_cpp_canister.git

  # Clone llama_cpp_onicai_fork, our forked version of llama.cpp,
  # into the ./src folder
  cd src
  git clone git@github.com:onicai/llama_cpp_onicai_fork.git

  # Initialize the submodules of the llama_cpp_onicai_fork repo
  cd llama_cpp_onicai_fork
  git submodule init
  git submodule update
  ```
- Create the file `src/llama_cpp_onicai_fork/common/build-info.cpp`:

  ```bash
  make build-info-cpp-wasm
  ```

  TODO: recipe for Windows.
- Create a Python environment with the dependencies installed:

  ```bash
  # We use MiniConda
  conda create --name llama_cpp_canister python=3.11
  conda activate llama_cpp_canister

  # Install the python dependencies
  # From the root of the llama_cpp_canister repo:
  pip install -r requirements.txt
  ```
- Install dfx:

  ```bash
  sh -ci "$(curl -fsSL https://internetcomputer.org/install.sh)"

  # Configure your shell
  source "$HOME/.local/share/dfx/env"
  ```

  (Note 1: On Windows, just install dfx in WSL, and icpp-pro in PowerShell will know where to find it.)
  (Note 2: It does not yet work on Windows... Stay tuned...)
- Build & deploy a pre-trained model to the `llama_cpp` canister:

  - Compile & link to WebAssembly (wasm):

    ```bash
    icpp build-wasm
    ```

    Note: the first time you run this command, the tool-chain will be installed in `~/.icpp`. This can take a few minutes, depending on your internet speed and computer.
  - Start the local network:

    ```bash
    dfx start --clean
    ```
  - Deploy the wasm to a canister on the local network:

    ```bash
    dfx deploy
    ```
  - Check the health endpoint of the `llama_cpp` canister:

    ```bash
    $ dfx canister call llama_cpp health
    (variant { Ok = record { status_code = 200 : nat16 } })
    ```
  - Upload the 260K parameter model (we included this fine-tuned model in the repo):

    ```bash
    python -m scripts.upload --network local --canister llama_cpp models/stories260Ktok512.gguf
    ```
  - Test it with dfx. Generate 20 tokens, using the `run_query` or `run_update` call:

    ```bash
    $ dfx canister call llama_cpp run_query '(record { args = vec {"--model"; "models/stories260Ktok512.gguf"; "--prompt"; "Patrick loves ice-cream. On a hot day "; "--n-predict"; "20"; "--ctx-size"; "128"; "--verbose-prompt"} })'

    $ dfx canister call llama_cpp run_update '(record { args = vec {"--model"; "models/stories260Ktok512.gguf"; "--prompt"; "Patrick loves ice-cream. On a hot day "; "--n-predict"; "20"; "--ctx-size"; "128"; "--verbose-prompt"} })'
    ```

    -> See the token generation in the dfx log window.
You can find many models in the llama.cpp `*.gguf` format on HuggingFace. Don't try them out yet, though: they will not run, but hit the instruction limit. See the next steps on how we will fix this.

We will start by expanding our tests to the TinyStories models:

TinyStories - onicai/llama_cpp_canister_models
| model | notes |
|---|---|
| stories260Ktok512.gguf | Works! Use this for development & debugging |
| stories15Mtok4096.gguf | todo |
| stories42Mtok4096.gguf | todo |
| stories42Mtok32000.gguf | todo |
| stories110Mtok32000.gguf | todo |
We focus on two types of models:
- The TinyStories models. These small models are fantastic for fleshing out the implementation:
  - 260K, 15M, 42M, 110M
  - non-quantized
  - quantized with 4-bits

- Larger models fine-tuned for chat. This will allow you to have a conversation, as you do in ChatGPT:
  - microsoft/phi-3, quantized with 4-bits
In order to run the larger models:

Today - in progress, stay tuned:
- Reduce number of instructions & increase speed:
  - Read model only once and store in OP
  - SIMD
  - New wasm float handling
  - Quantized models

Coming - once released as part of the IC roadmap:
- wasm64
- GPU
Because a single update call is never enough due to the instruction limit, a sequence of update calls is required. This is non-trivial, because the state of the LLM at the end of each update call must be saved. The llama.cpp code has a caching mechanism that can likely be used for this purpose. A rough sketch of such a call sequence is shown below.
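A minimal sketch of what such a call sequence could look like from the outside, assuming the canister simply forwards llama.cpp's native `--prompt-cache` option (an assumption; whether that flag is wired through, and where the cache file would live inside the canister, is exactly the open design question):

```bash
# Hedged sketch, not the implemented mechanism: chain several short update calls,
# each generating a few tokens, and rely on llama.cpp's prompt cache to carry the
# KV state between calls. --prompt-cache is llama.cpp's native CLI option; whether
# the canister forwards it, and how the cache file is persisted, is not yet decided.
PROMPT="Patrick loves ice-cream. On a hot day "
for i in 1 2 3; do
  dfx canister call llama_cpp run_update "(record { args = vec {
    \"--model\"; \"models/stories260Ktok512.gguf\";
    \"--prompt\"; \"$PROMPT\";
    \"--prompt-cache\"; \"prompt.cache\";
    \"--n-predict\"; \"20\"; \"--ctx-size\"; \"128\"
  } })"
  # A real driver would append the newly generated tokens to $PROMPT here,
  # so the next call continues where the previous one stopped.
done
```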
Implement an endpoint that is very similar to the industry-standard completions API. This will ensure that the LLM canister can be easily integrated into both Web3 and Web2 applications. For reference, the shape of that API is shown below.
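This is what a request against the existing industry-standard (OpenAI-style) completions API looks like; the canister endpoint itself does not exist yet, and its exact name, transport, and authentication are still open:

```bash
# OpenAI-style completions request, shown only as the reference shape the future
# canister endpoint aims to resemble. The URL, model name and auth header belong
# to the existing Web2 API, not to llama_cpp_canister.
curl https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
        "model": "gpt-3.5-turbo-instruct",
        "prompt": "Patrick loves ice-cream. On a hot day ",
        "max_tokens": 20
      }'
```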
Once finished, we will integrate it all into ICGPT.
Currently, the build process only works on a Mac. We will expand it to also work on Ubuntu & Windows.
This project is sponsored by a DFINITY DeAI grant.
The status of milestones 1 & 2 is summarized below.
The following tasks were completed:
- Implement stable memory in icpp-pro
  - Accomplished by integrating wasi2ic (a sketch of this step follows after this list)
  - Released in icpp-pro 4.1.0
- Port the ggerganov/llama.cpp C/C++ code to the IC
  - Create an icpp-pro project that encapsulates the llama.cpp code
  - Deploy it to the local network
  - Write Python scripts to upload the model weights and tokenizer
  - Test it locally
    - We were able to run the stories260Ktok512.gguf model
    - Token generation works, until it hits the instruction limit
- Implement a CI/CD pipeline using GitHub Actions
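For context, this is roughly what the wasi2ic step looks like. icpp-pro integrates it into its build, so you normally never run it by hand; the file names below are placeholders:

```bash
# wasi2ic rewrites the WASI system-call imports of a wasm module into
# Internet-Computer-compatible code. Usage per the wasi2ic project:
#   wasi2ic <input-wasm> <output-wasm>
# Placeholder file names; icpp-pro performs this step for you during the build.
wasi2ic my_canister_wasi.wasm my_canister.wasm
```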
As part of an initial 30-day sprint, the following tasks were completed:
The procedure to compile, run & debug a CLANG++-compiled native version is described below.
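Purely as orientation (the repo's own recipe is the authoritative one), a native clang build of the forked llama.cpp typically looks something like this; the paths, make variables, and binary name are assumptions that depend on the llama.cpp version:

```bash
# Orientation only -- follow the repo's native build & debug recipe for real work.
# Build the forked llama.cpp natively with clang and run the tiny model:
cd src/llama_cpp_onicai_fork
make CC=clang CXX=clang++
# Depending on the llama.cpp version, the binary is ./main or ./llama-cli:
./main -m ../../models/stories260Ktok512.gguf \
       -p "Patrick loves ice-cream. On a hot day " -n 20
```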
In the native version of llama.cpp, we tested the following models:
- All TinyStories models stored on huggingface/onicai/llama_cpp_canister_models
- The 4-bit quantized model on huggingface/microsoft/Phi-3-mini-4k-instruct-gguf
We trained several TinyStories models and converted them into the *.gguf format required by llama.cpp. All models are stored on huggingface/onicai/llama_cpp_canister_models.
We dug deep into the code and studied it by stepping through it in the debugger with VS Code. We have gained sufficient understanding to create a solid implementation plan.