/prefect_docs_rag_app

Use Prefect to create and deploy a simple RAG-trained chatbot

Primary LanguagePythonApache License 2.0Apache-2.0

prefect_docs_rag_app

Uses the Prefect workflow orchestration framework to generate and deploy a very rudimentary chatbot trained with RAG on the Prefect documentation. This a project to experiment with Prefect.

  • Based on the Google BERT model, which the script will download
  • Leverages the ONNX Runtime for platform/architecture portability
  • Clones or pulls the Prefect repository then extracts and converts .mdx files to create embeddings, poorly
  • Uses FastAPI to serve an API to the model
  • Uses Flask to create a simple chatbot interface

Screenshots

image

image

Installation

git clone https://www.github.com/sirredbeard/prefect_docs_rag_app
cd prefect_docs_rag_app
pip install -r requirements.txt
python3 main.py

Use

Browse to 127.0.0.1:5000 after deploying.

TODO / Known Issues

  • Improve handling of the global embeddings variable, currently hacky
  • The CUDA provider works fine with the ONNX runtime for creating embeddings but not for inferencing, it reports as unavailable and then reverts to CPU, has the runtime not freed up the CUDA provider?
  • Instead of using process from multiprocessing use Prefect-native task runners to serve the API and web frontend simulteanously
  • Switch to a better model than BERT, preference would be another SLM, like Phi-3, but because ONNX has built-in tokenizer support for BERT it was easier to start with BERT, did some testing with Phi-3 but without CUDA on inferencing it was very slow to respond, so reverted to BERT to complete the initial goals
  • Improve conversion of .mdx files to plain text, currently using html2txt, tried BeautifulSoup but wasn't much better, still a fair bit of HTML and CSS gets through to responses or implement markdown/html rendering in the responses to parse HTML/CSS/.mdx
  • Tune the embeddings logic to provide better, more accurate responses, currently using simple cosine similarity, sometimes answers are completely random or non-responsive