This project uses llama.cpp to load a model from a local file, delivering fast and memory-efficient inference. It is currently designed for Google Gemma; support for more models is planned.
- Download the Gemma model from the Google repository (https://huggingface.co/google/gemma-2b-it); an example download command is shown after this list.
- Quantize the Gemma model (highly recommended if the target machine has limited memory); a quantization sketch is shown after this list.
- Edit the model-path entry in config.yaml so that it points to the actual model file; see the example config excerpt after this list.
- Start the web UI with the following command:
screen -S "webui" bash ./start-ui.sh
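
The commands below are illustrative sketches rather than part of this repository; adjust paths and tool versions to your setup. To download the model, one option is the Hugging Face CLI (this assumes the huggingface_hub package is installed and that you have accepted the Gemma license on the model page):

```bash
# Authenticate once, then download the instruction-tuned Gemma 2B checkpoint
# into a local folder (example path; choose any location you like).
huggingface-cli login
huggingface-cli download google/gemma-2b-it --local-dir ./models/gemma-2b-it
```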
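
llama.cpp consumes GGUF files, so the downloaded checkpoint typically has to be converted and then quantized. Script and binary names vary between llama.cpp releases; the following sketch assumes a recent llama.cpp checkout built in ./llama.cpp:

```bash
# Convert the Hugging Face checkpoint to a full-precision GGUF file,
# then quantize it to 4-bit (Q4_K_M) to cut memory use.
python ./llama.cpp/convert_hf_to_gguf.py ./models/gemma-2b-it --outfile ./models/gemma-2b-it-f16.gguf
./llama.cpp/llama-quantize ./models/gemma-2b-it-f16.gguf ./models/gemma-2b-it-q4_k_m.gguf Q4_K_M
```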
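
Finally, config.yaml should point at whichever GGUF file you want to serve. The exact set of keys depends on this project; assuming the model-path key mentioned above, the relevant entry would look something like:

```yaml
# config.yaml (excerpt) - example path, replace with the real location of your model file.
model-path: ./models/gemma-2b-it-q4_k_m.gguf
```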