Warning: This is a prototype for development only. No security considerations have been made, and all services run as root!
To build and run the container locally with hot reload of Python files, run:
```bash
DOCKER_BUILDKIT=1 docker build . -t gbnc

docker run \
  --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  --volume "$(pwd)/gswikichat":/workspace/gswikichat \
  --volume gbnc_cache:/root/.cache \
  --publish 8000:8000 \
  --rm \
  --interactive \
  --tty \
  --name gbnc \
  gbnc
```
Point your browser to http://localhost:8000/ and use the frontend.
The container works on runpod.io GPU instances. A template is available here.
To work on the Python backend outside the container, create a virtual environment and install the dependencies:

```bash
python -m venv .venv
. ./.venv/bin/activate
pip install -r requirements.txt
```

To run the frontend development server:

```bash
cd frontend
yarn dev
```
A single container runs all the components, with no separation between services in order to keep things simple. It is based on the Nvidia CUDA container images to support GPU acceleration, but small models also work on laptop CPUs (tested on an i7-1260P).
The container runs Ollama for LLM inference. This will probably not scale when run as a service for multiple users, but it is enough for testing.
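For development it can be useful to query the Ollama server directly, bypassing the frontend and the FastAPI layer. The sketch below makes some assumptions: it presumes Ollama is reachable on its default port 11434 (e.g. from inside the container) and that the default model is available under the Ollama tag `phi`; adjust both to your setup.

```python
# Query the local Ollama HTTP API directly (sketch; host and model tag are assumptions).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi",  # assumed Ollama tag for the default Phi-2 model
        "prompt": "In one sentence, what is retrieval augmented generation?",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```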
The Microsoft Phi-2 2.7B model is used by default and runs locally through Ollama. It can be switched to another model with the MODEL docker build arg (e.g. `--build-arg MODEL=<model name>`).
The Haystack framework is used to implement Retrieval-Augmented Generation (RAG) on a minimal test dataset.
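To illustrate the approach (this is not the actual gswikichat code), the sketch below builds a tiny Haystack pipeline that retrieves documents from an in-memory store and renders a prompt for the LLM. The documents and template are made up for the example, and the import paths assume a recent Haystack 2.x release.

```python
# Minimal Haystack 2.x retrieval + prompt-building sketch (illustrative only).
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Tiny in-memory test dataset.
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="The container bundles Ollama, Haystack and FastAPI."),
    Document(content="The frontend talks to the FastAPI backend on port 8000."),
])

template = """Answer the question using only the context below.
Context:
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.connect("retriever.documents", "prompt_builder.documents")

question = "Which components run in the container?"
result = pipeline.run({
    "retriever": {"query": question},
    "prompt_builder": {"question": question},
})

# The rendered prompt would then be sent to the local Ollama model for generation.
print(result["prompt_builder"]["prompt"])
```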
A FastAPI server runs in the container. It exposes an API that receives a question from the frontend, runs the Haystack RAG pipeline, and returns the response.
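The sketch below shows the general shape of such an endpoint; the actual route, request schema, and RAG wiring in gswikichat may differ, and `answer_question` is a hypothetical stand-in for the Haystack pipeline call.

```python
# Minimal question-answering endpoint sketch (route and schema are assumptions).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    question: str

def answer_question(question: str) -> str:
    # Hypothetical placeholder: the real backend runs the Haystack RAG pipeline
    # and queries the local Ollama model here.
    return f"You asked: {question}"

@app.post("/api/query")
def query(payload: Question) -> dict:
    return {"answer": answer_question(payload.question)}
```

Served with uvicorn, such an app answers POST requests on port 8000, the port the container publishes.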
A minimal frontend lets the user input a question and renders the response from the system.