GGUF Model Deployment Guide

A sample configuration for deploying GGUF models with llama-cpp-python and Flask.

Prerequisites

  • Python 3.8 or higher

Installation

  1. Install the required Python packages.

Ubuntu CPU

pip install llama-cpp-python
pip install flask

Ubuntu with CUDA

CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
pip install flask

Windows

Download and install the Anaconda Python distribution from here.

conda create -n deeplearning python=3.8
conda activate deeplearning
conda config --add channels conda-forge
conda config --set channel_priority strict
conda install llama-cpp-python
pip install flask
  2. Download the required model file from here.

Deployment

CPU

  1. Run the app_cpu.py script to start the Flask server.
python app_cpu.py

CUDA GPU

  1. Run the app_cuda_gpu.py script to start the Flask server.
python app_cuda_gpu.py
  2. In a new terminal, run the post_request.py script to send a POST request to the server.
python post_request.py

Code Explanation

app_cpu.py and app_cuda_gpu.py are the server files that use Flask to create a web API. Each uses the Llama class from the llama-cpp-python library to generate a response from the input message.
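The server scripts themselves are not reproduced in this guide. A minimal sketch of what such a Flask server might look like is below, with the Llama call replaced by a hypothetical `generate_reply` stub so the routing is clear on its own; the `/api/deployment` route and JSON body match the sample request later in this guide, while the model path and parameters in the comment are assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real app_cpu.py / app_cuda_gpu.py the model would be loaded once
# at startup, for example (path and parameters are assumptions):
#   from llama_cpp import Llama
#   llm = Llama(model_path="model.gguf", n_ctx=2048)

def generate_reply(message: str) -> str:
    """Placeholder for the Llama call, e.g. llm(prompt)["choices"][0]["text"]."""
    return f"echo: {message}"

@app.route("/api/deployment", methods=["POST"])
def deployment():
    # Parse the JSON body {"message": "..."} and return the generated reply.
    data = request.get_json(force=True)
    reply = generate_reply(data["message"])
    return jsonify({"response": reply})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Loading the model once at startup (rather than per request) matters here, since constructing a Llama instance reads the whole GGUF file into memory.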

post_request.py is a script that sends a POST request to the server with a message. The server then uses the Llama library to generate a response and sends it back to the client.
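post_request.py is likewise not reproduced here. A minimal sketch using only the standard library is shown below; the URL and JSON body match the sample request in the next section, though the actual script may use the requests package instead of urllib.

```python
import json
from urllib import request

URL = "http://localhost:5000/api/deployment"

def build_request(message: str, url: str = URL) -> request.Request:
    """Build a POST request carrying the JSON body the server expects."""
    body = json.dumps({"message": message}).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

def send(message: str) -> dict:
    """Send the request and decode the server's JSON response."""
    with request.urlopen(build_request(message)) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(send("what are the best pesticides for crops in Kerala?"))
```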

Sample Request

You can send a POST request to http://localhost:5000/api/deployment with the following JSON body:

{
    "message": "what are the best pesticides for crops in Kerala?"
}

The server will respond with the AI-generated response.

Deploying Flask in production

Click here to learn more about how to deploy a Flask application in production.