
Deploy and Query a Local LLM


Local_LLM


This is a tutorial to deploy and query a Local LLM on a Mac M1 or Linux system.


Table of Contents

Installing requirements
Downloading and activating the LLAMA-2 model
Querying the model
Quickstart
Building your own StreamLit Deployable Model

Installing requirements

Most of the requirements for running this on an M1 are listed in the repository's requirements.txt file, which can be used to build an Anaconda environment (the remaining packages are installed with pip in the next step). If you do not have Anaconda, you can find it here. Build the chatbot-llm environment with the following commands:

conda create -n chatbot-llm --file requirements.txt python=3.10 
conda activate chatbot-llm

Next, we need to install some other packages using pip that are not available via conda. In addition, for the LLM to work on a Mac or Linux system, we must set the CMake arguments shown below so that llama-cpp-python is built with OpenBLAS support.

# Linux and Mac
export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
pip install sse_starlette
pip install starlette_context
pip install pydantic_settings
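
Before moving on, it can be worth confirming that llama-cpp-python installed into the active environment. A quick check (this snippet is just a convenience, not part of the repository) might look like this:

# Quick import check: confirms llama-cpp-python is importable in the chatbot-llm environment.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)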

Downloading and activating the LLAMA-2 model

Now it is time to download the model. For this example, we are using a relatively small LLM ("only" about 4.78 GB). You can download the quantized model from Hugging Face.

mkdir -p models/7B
wget -O models/7B/llama-2-7b-chat.Q5_K_M.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"

Once the model has been downloaded and the packages have been installed, we are ready to run the LLM locally. We begin by starting llama_cpp.server with the downloaded LLAMA-2 model. This combination plays the same roles as ChatGPT (the server) and GPT-4 (the model), respectively.

python3 -m llama_cpp.server --model models/7B/llama-2-7b-chat.Q5_K_M.gguf
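
Once the server starts, it exposes an OpenAI-compatible API, by default at http://localhost:8000 (the host, port, and endpoint paths below are assumptions based on llama-cpp-python's built-in server; check the startup logs if your setup differs). A quick sanity check from Python might look like this:

# sanity_check.py -- hypothetical helper, not part of the repository.
# Assumes llama_cpp.server is running on its default host/port (localhost:8000).
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()

# The OpenAI-compatible endpoint lists the loaded model(s) as JSON.
for model in resp.json().get("data", []):
    print("Loaded model:", model.get("id"))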

Querying the model

The server and model are now ready for user input. We query them using query.py with a question of our choice. To begin querying, open a new terminal tab and activate the conda environment again.

conda activate chatbot-llm
export MODEL="models/7B/llama-2-7b-chat.Q5_K_M.gguf"
python query.py
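
The repository's query.py handles the request for you. If you want a rough idea of what such a script does, here is a minimal sketch (not the repository's actual query.py) that posts a chat request to the local server's OpenAI-compatible endpoint; the URL, prompt, and max_tokens value are illustrative assumptions:

# minimal_query.py -- illustrative sketch only; the repository ships its own query.py.
import os
import requests

# The server started above serves an OpenAI-compatible API (assumed default: localhost:8000).
URL = "http://localhost:8000/v1/chat/completions"
MODEL = os.environ.get("MODEL", "models/7B/llama-2-7b-chat.Q5_K_M.gguf")

payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a GGUF model file is in two sentences."},
    ],
    "max_tokens": 128,  # adjust this to control the size of the response
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])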

Recap and acknowledgments

In this demonstration, we installed an LLM server (llama_cpp.server) and model (LLAMA-2) locally on a Mac and deployed our very own local LLM. We then queried the server/model and adjusted the size of the response. Congratulations, you have built your very own LLM! The inspiration for this work and some of the code building blocks are derived from Youness Mansar. Feel free to use or share the code.