LLM App Dev Workshop

Introduction

2024-07-03: Streamlit app changes: The chatbot app code now uses Ollama embeddings and has a configurable system prompt.

This repository demonstrates how to build a simple LLM-based chatbot that can answer questions based on your documents (retrieval augmented generation - RAG) and how to deploy it using Podman or on the OpenShift Container Platform (k8s).
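To make the RAG idea concrete, here is a toy sketch of the two steps involved: retrieve the most relevant document for a question, then put it into the prompt as context. This is not the code used in this repository, which relies on LlamaIndex and embeddings. The function names and the bag-of-words similarity are assumptions chosen only to keep the example dependency-free.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Prepend the retrieved context to the question before it goes to the LLM."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

A real setup replaces the word counts with embeddings from a model, but the flow (rank documents, then build the prompt) stays the same.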

The corresponding workshop, first run at Red Hat Developers Hands-On Day 2023 in Darmstadt, Germany, teaches participants the basic concepts of LLMs and RAG, and how to adapt this example implementation into their own specific-purpose GPT.

The software stack uses only open source tools: Streamlit, LlamaIndex, and local open LLMs via Ollama. Real open AI for the GPU poor.

Everyone is invited to fork this repository, create their own specific-purpose chatbot based on their own documents, improve the setup, or even hold their own workshop.

Setup

For the local setup, a Mac M1 with at least 16 GB of unified memory is recommended. First download Ollama from ollama.ai and install it.

On Linux you can disable the Ollama service for better debugging:

sudo systemctl disable ollama
sudo systemctl stop ollama

and then manually run ollama serve.

For the local example have a look at the folder streamlit and install the requirements.

Create a virtual environment first:

python -m venv venv
source venv/bin/activate

Install the requirements:

pip install -r requirements.txt

Then start streamlit with:

streamlit run app.py

Modify the system prompt and copy different data sources to docs/ to create your own version of the chatbot. You can set the Ollama host via the environment variable OLLAMA_HOST.

You can download models locally with ollama pull zephyr or via the API:

curl -X POST http://ollama:11434/api/pull -d '{"name": "zephyr"}'

First start the ollama service as described and download the Zephyr model. To test the ollama server you can call the generate API:

curl -X POST http://ollama:11434/api/generate -d '{"model": "zephyr", "prompt": "Why is the sky blue?"}'
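The same call can be made from Python with only the standard library. The sketch below is an illustration, not code from this repository: the helper names build_ollama_request and generate are hypothetical, and it assumes the server accepts "stream": false and returns the generated text in a "response" field.

```python
import json
import urllib.request


def build_ollama_request(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an Ollama API endpoint such as /api/generate."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str) -> str:
    """Send a prompt to /api/generate and return the generated text."""
    # Assumption: "stream": False asks the server for a single JSON object
    # instead of a stream of partial responses.
    request = build_ollama_request(
        base_url, "/api/generate", {"model": model, "prompt": prompt, "stream": False}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the Ollama server running, generate("http://ollama:11434", "zephyr", "Why is the sky blue?") should return the answer as a string. The same request builder works for /api/pull with the payload {"name": "zephyr"}.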

All of these commands are also documented in our cheat sheet.

Deployment

Podman

Build the container based on UBI9 Python 3.11:

podman build -t linuxbot-app .

If you build on an arm64 Mac and deploy on amd64, remember to add --platform (in this case our base image is amd64 anyway):

podman build --platform="linux/amd64" -t linuxbot-app .

We will create a network for our linuxbot and ollama:

podman network create linuxbot

Check if DNS is enabled (it's not on the default net):

podman network inspect linuxbot
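If you would rather check this in a script than read the JSON by eye, a small parser is enough. This sketch assumes the output format of recent Podman versions (a JSON array of network objects with "name" and "dns_enabled" fields). Older CNI-based setups may report DNS differently. The function name dns_enabled is hypothetical.

```python
import json


def dns_enabled(inspect_output: str, network: str = "linuxbot") -> bool:
    """Return True if the named network reports dns_enabled in the inspect JSON."""
    # Assumption: the output is a JSON array with one object per network.
    for entry in json.loads(inspect_output):
        if entry.get("name") == network:
            return bool(entry.get("dns_enabled", False))
    raise ValueError(f"network {network!r} not found in inspect output")
```

You could feed it the captured output of the command above, for example via subprocess.run(["podman", "network", "inspect", "linuxbot"], capture_output=True, text=True).stdout.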

Now you can either start Ollama locally with ollama serve or start an Ollama container with:

podman run --net linuxbot --name ollama -p 11434:11434 --rm docker.io/ollama/ollama:latest

Note: We just forward the port so we can curl it more easily locally as well.

GPU support

This Ollama service won't have GPU support enabled and will be much slower than running it locally, for example on a Mac M1. To run this container with NVIDIA GPU support, we recommend using the NVIDIA Container Toolkit with Container Device Interface (CDI). Follow the instructions from NVIDIA, then run podman with:

podman run --rm --net linuxbot --name ollama --device nvidia.com/gpu=all --security-opt=label=disable ollama

To check whether your graphics card is recognized, use a base image that contains nvidia-smi, e.g.:

podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L

For AMD graphics cards you need to forward the Kernel Fusion Driver (KFD) and Direct Rendering Infrastructure (DRI) devices to the container:

podman run -it --device=/dev/kfd --device=/dev/dri --security-opt=label=disable docker.io/ollama/ollama

Since we create the embeddings locally in the Streamlit app, we need to increase the shared memory for PyTorch to get it running:

podman run --net linuxbot --name linuxbot-app -p 8080:8080 --shm-size=2gb -e OLLAMA_HOST=ollama -it --rm localhost/linuxbot-app

You can set the Ollama server via the environment variable OLLAMA_HOST; the default is localhost.
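As an illustration of how such a setting is typically resolved, here is a minimal sketch. It is not the app's actual code. It assumes OLLAMA_HOST holds a bare hostname (as in -e OLLAMA_HOST=ollama above) and that the server listens on port 11434. The function name ollama_base_url is hypothetical.

```python
import os


def ollama_base_url(default_host: str = "localhost", port: int = 11434) -> str:
    """Build the Ollama base URL from OLLAMA_HOST, falling back to the default host."""
    # An unset or empty variable falls back to the default host.
    host = os.environ.get("OLLAMA_HOST", "").strip() or default_host
    return f"http://{host}:{port}"
```

Inside the linuxbot network this would resolve to http://ollama:11434, and to http://localhost:11434 when the variable is not set.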

NOTE: It would be much better to generate the embeddings with the Ollama service, but this is not yet supported in LlamaIndex.

OpenShift

Create a new project (namespace) for your workshop and deploy the ollama service in it:

oc new-project my-workshop
oc apply -f deployments/ollama.yaml

If you want to enable GPU support, you have to install and instantiate the NVIDIA GPU Operator and the Node Feature Discovery (NFD) Operator as described on the AI on OpenShift page, then deploy ollama-gpu.yaml instead:

oc apply -f deployments/ollama-gpu.yaml

The Streamlit application (linuxbot) can be deployed with:

oc apply -f deployments/linuxbot.yaml

We have published a preconfigured container image on quay.io/sroecker that is used in this deployment.

In order to debug your application and ollama service you can deploy a curl image like this:

oc run mycurl --image=curlimages/curl -it -- sh
oc attach mycurl -c mycurl -i -t
oc delete pod mycurl