This project is a small demo of a knowledge question answering system that uses Marqo with LLaMA. It can run locally on an M1 or M2 Mac, or on Linux or Windows with a CUDA-capable GPU. If you want to run this on an M1 or M2 Mac, please be sure to have the ARM64 version of Python installed; this ensures llama.cpp builds for ARM64 and uses Metal for inference rather than building for an x86 CPU and being emulated with Rosetta.
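A quick way to check which build of Python you have is to print the machine architecture; an ARM64 build reports arm64, while an x86 build running under Rosetta reports x86_64:

python3 -c "import platform; print(platform.machine())"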
The frontend folder contains the code for the frontend of the application, which is written with NextJS and TypeScript.
The backend folder contains the backend code, which is written as a web server using Flask.
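As a rough illustration of that architecture (this is only a sketch, not the actual backend code; the index name knowledge-base, the /search route, and the request shape are invented for the example), a Flask endpoint that retrieves context from Marqo might look like:

import marqo
from flask import Flask, request, jsonify

app = Flask(__name__)
mq = marqo.Client(url="http://localhost:8882")  # Marqo container started via Docker (see below)

@app.route("/search", methods=["POST"])
def search():
    # look up the most relevant documents in Marqo to use as context for the LLM
    query = request.json["q"]
    results = mq.index("knowledge-base").search(q=query, limit=5)
    return jsonify(results)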
Install the dependencies and start the frontend development server:

cd frontend
npm i
npm run dev
You will need to download the models to run this locally. If you have 16GB of RAM, I recommend starting with 7B parameter LLaMA GGUF models; 13B parameter models do work, but you must limit the memory usage of Marqo with Docker and remove the ViT-L/14 model from the pre-loading. 32GB of RAM will give you enough headroom for 13B models or potentially larger ones.
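If you do need to limit Marqo's memory for the larger models, one possible approach is to start the Marqo container (the basic docker run command is shown further down) with a memory cap and a reduced set of preloaded models via Marqo's MARQO_MODELS_TO_PRELOAD environment variable; the 6g limit and the text embedding model below are only examples:

docker run --name marqo -it -p 8882:8882 --memory=6g \
  -e MARQO_MODELS_TO_PRELOAD='["hf/all_datasets_v4_MiniLM-L6"]' \
  marqoai/marqo:2.12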
There are a number of models that are commented out in the code; you can find them on Hugging Face by searching for the name. I recommend starting with llama-2-7b-chat.Q4_K_M.gguf.
Download the model and place it in a new directory: backend/models/7B/.
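One way to fetch it, assuming you have the huggingface_hub CLI installed and are pulling the file from TheBloke's Llama-2-7B-Chat-GGUF repository (adjust the repository if you use another source), is:

pip install -U huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir backend/models/7B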
Set up a Python virtual environment and install the backend dependencies:

cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Then open a Python shell and download the NLTK data:

python3
import nltk
nltk.download("all")
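If you prefer not to open an interactive interpreter, the same download works as a one-liner:

python3 -c 'import nltk; nltk.download("all")'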
Exit the Python shell, then run the Flask server:

python3 -m flask run --debug -p 5001
Run Marqo with Docker:

docker run --name marqo -it -p 8882:8882 marqoai/marqo:2.12
If you have a GPU then you should add the --gpus all flag to the docker run command:
docker run --name marqo -it -p 8882:8882 --gpus all marqoai/marqo:2.12
To format the frontend code:

cd frontend
npm run format

To format the backend code:

cd backend
black .