Run sick LLM apps hyper fast on your local machine for funzies.
- Git clone llama.cpp:

  ```sh
  git clone https://github.com/ggerganov/llama.cpp
  ```

- Run the make commands:
  - Mac:

    ```sh
    cd llama.cpp && make
    ```

  - Windows (steps from the llama.cpp README):
    - Download the latest Fortran version of w64devkit.
    - Extract `w64devkit` on your PC.
    - Run `w64devkit.exe`.
    - Use the `cd` command to reach the `llama.cpp` folder.
    - From here you can run:

      ```sh
      make
      ```
- Install the Python dependencies:

  ```sh
  pip install openai 'llama-cpp-python[server]' pydantic instructor streamlit
  ```
- Start the server:
  - Single Model Chat

    ```sh
    python -m llama_cpp.server --model models/mistral-7b-instruct-v0.1.Q4_0.gguf
    ```

  - Single Model Chat with GPU Offload

    ```sh
    python -m llama_cpp.server --model models/mistral-7b-instruct-v0.1.Q4_0.gguf --n_gpu_layers -1
    ```

  - Single Model Function Calling with GPU Offload

    ```sh
    python -m llama_cpp.server --model models/mistral-7b-instruct-v0.1.Q4_0.gguf --n_gpu_layers -1 --chat_format functionary
    ```

  - Multiple Model Load with Config (see the sample config.json after this list)

    ```sh
    python -m llama_cpp.server --config_file config.json
    ```

  - Multi Modal Models

    ```sh
    python -m llama_cpp.server --model models/llava-v1.5-7b-Q4_K.gguf --clip_model_path models/llava-v1.5-7b-mmproj-Q4_0.gguf --n_gpu_layers -1 --chat_format llava-1-5
    ```
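For the multi-model option, `--config_file` points at a JSON file holding the server settings plus one entry per model. A minimal sketch of what that could look like (the aliases and chat formats below are illustrative assumptions; check the llama-cpp-python docs for the full schema):

```json
{
  "host": "0.0.0.0",
  "port": 8000,
  "models": [
    {
      "model": "models/mistral-7b-instruct-v0.1.Q4_0.gguf",
      "model_alias": "mistral-instruct",
      "chat_format": "mistral-instruct",
      "n_gpu_layers": -1
    },
    {
      "model": "models/llava-v1.5-7b-Q4_K.gguf",
      "model_alias": "llava",
      "chat_format": "llava-1-5",
      "clip_model_path": "models/llava-v1.5-7b-mmproj-Q4_0.gguf",
      "n_gpu_layers": -1
    }
  ]
}
```

Clients can then pick a model by passing its `model_alias` as the `model` parameter in their requests.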
- Download GGUF models into the `models` folder:
  - Mistral: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF
  - Mixtral: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
  - LLaVA: https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
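Once a server is running, it exposes an OpenAI-compatible API (on http://localhost:8000 by default), so you can talk to it with the `openai` client installed earlier. A minimal sketch, where the model name and prompt are placeholders:

```python
from openai import OpenAI

# Point the client at the local llama-cpp-python server instead of OpenAI.
# The api_key just needs to be a non-empty string for local use.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="mistral-instruct",  # placeholder; single-model servers typically ignore this field
    messages=[{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
)
print(response.choices[0].message.content)
```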
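The `instructor` and `pydantic` packages from the install step pair naturally with the function-calling server: instructor patches the client so chat completions can return validated pydantic objects. A rough sketch assuming the functionary server above is running (the `UserInfo` schema and model alias are hypothetical, and the patch API varies a little between instructor versions):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Hypothetical schema describing the structured output we want back.
class UserInfo(BaseModel):
    name: str
    age: int

# Patch the client so it accepts a response_model and uses the
# model's function-calling support to fill it in.
client = instructor.patch(
    OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
)

user = client.chat.completions.create(
    model="functionary",  # placeholder alias for the function-calling server started above
    response_model=UserInfo,
    messages=[{"role": "user", "content": "Extract: John Doe is 30 years old."}],
)
print(user.name, user.age)
```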
👨🏾‍💻 Author: Nick Renotte
📅 Version: 1.x
📜 License: This project is licensed under the MIT License