A self-hosted, offline, ChatGPT-like chatbot, powered by Llama 2. 100% private, with no data leaving your device.
New: Support for Code Llama models and Nvidia GPUs.
umbrel.com (we're hiring) »
- Demo
- Supported Models
- How to install
- OpenAI-compatible API
- Benchmarks
- Roadmap and contributing
- Acknowledgements
LlamaGPT.mp4
Currently, LlamaGPT supports the following models. Support for running custom models is on the roadmap.
Model name | Model size | Model download size | Memory required |
---|---|---|---|
Nous Hermes Llama 2 7B Chat (GGML q4_0) | 7B | 3.79GB | 6.29GB |
Nous Hermes Llama 2 13B Chat (GGML q4_0) | 13B | 7.32GB | 9.82GB |
Nous Hermes Llama 2 70B Chat (GGML q4_0) | 70B | 38.87GB | 41.37GB |
Code Llama 7B Chat (GGUF Q4_K_M) | 7B | 4.24GB | 6.74GB |
Code Llama 13B Chat (GGUF Q4_K_M) | 13B | 8.06GB | 10.56GB |
Phind Code Llama 34B Chat (GGUF Q4_K_M) | 34B | 20.22GB | 22.72GB |
Running LlamaGPT on an umbrelOS home server is one click. Simply install it from the Umbrel App Store.
Make sure your have Docker and Xcode installed.
Then, clone this repo and cd
into it:
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
Run LlamaGPT with the following command:
./run-mac.sh --model 7b
You can access LlamaGPT at http://localhost:3000.
To run 13B or 70B chat models, replace
7b
with13b
or70b
respectively. To run 7B, 13B or 34B Code Llama models, replace7b
withcode-7b
,code-13b
orcode-34b
respectively.
To stop LlamaGPT, do Ctrl + C
in Terminal.
You can run LlamaGPT on any x86 or arm64 system. Make sure you have Docker installed.
Then, clone this repo and cd
into it:
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
Run LlamaGPT with the following command:
./run.sh --model 7b
Or if you have an Nvidia GPU, you can run LlamaGPT with CUDA support using the --with-cuda
flag, like:
./run.sh --model 7b --with-cuda
You can access LlamaGPT at http://localhost:3000
.
To run 13B or 70B chat models, replace
7b
with13b
or70b
respectively. To run Code Llama 7B, 13B or 34B models, replace7b
withcode-7b
,code-13b
orcode-34b
respectively.
To stop LlamaGPT, do Ctrl + C
in Terminal.
Note: On the first run, it may take a while for the model to be downloaded to the
/models
directory. You may also see lots of output like this for a few minutes, which is normal:llama-gpt-llama-gpt-ui-1 | [INFO wait] Host [llama-gpt-api-13b:8000] not yet available...
After the model has been automatically downloaded and loaded, and the API server is running, you'll see an output like:
llama-gpt-ui_1 | ready - started server on 0.0.0.0:3000, url: http://localhost:3000
You can then access LlamaGPT at http://localhost:3000.
First, make sure you have a running Kubernetes cluster and kubectl
is configured to interact with it.
Then, clone this repo and cd
into it.
To deploy to Kubernetes first create a namespace:
kubectl create ns llama
Then apply the manifests under the /deploy/kubernetes
directory with
kubectl apply -k deploy/kubernetes/. -n llama
Expose your service however you would normally do that.
Thanks to llama-cpp-python, a drop-in replacement for OpenAI API is available at http://localhost:3001
. Open http://localhost:3001/docs to see the API documentation.
We've tested LlamaGPT models on the following hardware with the default system prompt, and user prompt: "How does the universe expand?" at temperature 0 to guarantee deterministic results. Generation speed is averaged over the first 10 generations.
Feel free to add your own benchmarks to this table by opening a pull request.
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 54 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 16.7 tokens/sec |
GCP c2-standard-4 vCPU (16 GB RAM) | 4.3 tokens/sec |
Umbrel Home (16GB RAM) | 2.7 tokens/sec |
Raspberry Pi 4 (8GB RAM) | 0.9 tokens/sec |
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 20 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 8.6 tokens/sec |
GCP c2-standard-4 vCPU (16 GB RAM) | 2.2 tokens/sec |
Umbrel Home (16GB RAM) | 1.5 tokens/sec |
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 4.8 tokens/sec |
GCP e2-standard-16 vCPU (64 GB RAM) | 1.75 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 1.62 tokens/sec |
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 41 tokens/sec |
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 25 tokens/sec |
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 10.26 tokens/sec |
We're looking to add more features to LlamaGPT. You can see the roadmap here. The highest priorities are:
- Moving the model out of the Docker image and into a separate volume.
- Add Metal support for M1/M2 Macs.
- Add support for Code Llama models.
- Add CUDA support for NVIDIA GPUs.
- Add ability to load custom models.
- Allow users to switch between models.
If you're a developer who'd like to help with any of these, please open an issue to discuss the best way to tackle the challenge. If you're looking to help but not sure where to begin, check out these issues that have specifically been marked as being friendly to new contributors.
A massive thank you to the following developers and teams for making LlamaGPT possible:
- Mckay Wrigley for building Chatbot UI.
- Georgi Gerganov for implementing llama.cpp.
- Andrei for building the Python bindings for llama.cpp.
- NousResearch for fine-tuning the Llama 2 7B and 13B models.
- Phind for fine-tuning the Code Llama 34B model.
- Tom Jobbins for quantizing the Llama 2 models.
- Meta for releasing Llama 2 and Code Llama under a permissive license.