This project showcases how easy it is to build a live chatbot application using your internal datasets on Spark. It features two main components:
- batch ingestion - a set of Spark pipelines that ingest unstructured data from your applications, pre-process and vectorize it, and store it within your vector database of choice (sketched below)
- live inference - a Spark streaming pipeline that reads messages from Slack (soon also Teams) and answers them live using the gathered knowledge.
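The snippet below is not the template's pipeline code, only a minimal sketch of the batch-ingestion idea under stated assumptions: it assumes the pre-1.0 `openai` client and the 2.x `pinecone-client` package are installed on the cluster, and the input path, Pinecone environment, and index name are hypothetical placeholders.

```python
import os

import openai
import pinecone
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Ingest: read raw, unstructured text (one document per file) into a DataFrame.
docs = (
    spark.read.text("dbfs:/tmp/raw_docs/*.txt")  # hypothetical input location
    .withColumnRenamed("value", "text")
    .withColumn("doc_id", F.monotonically_increasing_id())
)

def vectorize_and_store(rows):
    """Embed each document with OpenAI and upsert the vector into Pinecone."""
    openai.api_key = os.environ["OPEN_AI_API_KEY"]
    pinecone.init(api_key=os.environ["PINECONE_TOKEN"], environment="us-west1-gcp")
    index = pinecone.Index("chatbot-knowledge")  # hypothetical index name
    for row in rows:
        response = openai.Embedding.create(input=[row.text], model="text-embedding-ada-002")
        vector = response["data"][0]["embedding"]
        index.upsert(vectors=[(str(row.doc_id), vector, {"text": row.text})])

# 2. Vectorize and store. The template splits pre-processing, embedding, and
#    storage into separate pipelines; this sketch collapses them for brevity.
docs.foreachPartition(vectorize_and_store)
```

The live-inference side follows the same pattern in reverse: it streams Slack messages, looks up the closest vectors as context, and asks the LLM to formulate an answer.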
This template works best with Databricks and Prophecy; however, you can run it on any Spark platform. Spark lets us scale and operationalize the chatbot to large datasets and user bases, while Prophecy makes the pipelines easy to develop and debug.
Optional, but recommended for best results:
- Pinecone - allows for efficient storage and retrieval of vectors. Alternatively, Spark-ML cosine similarity can be used (see the sketch after this list); however, since it doesn't provide KNN indexes for more efficient lookup, it's only recommended for small datasets.
- OpenAI - for creating text embeddings and formulating answers. Alternatively, one can use Spark's word2vec for word embeddings and another LLM (e.g., Dolly) for answer formulation based on context.
- Slack or Teams (support coming soon) - for the chatbot interface. An example batch pipeline is present for fast debugging when unavailable.
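For the Spark-only fallback mentioned in the Pinecone and OpenAI notes above, here is a minimal sketch: Word2Vec embeddings plus a brute-force cosine-similarity scan. Every query scans all vectors (no KNN index), so this only makes sense for small datasets. The sample data, column names, and question are illustrative, not part of the template.

```python
import numpy as np
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
docs = spark.createDataFrame(
    [("Spark runs pipelines at scale",), ("Prophecy builds pipelines visually",)],
    ["text"],
)

# Tokenize and train Word2Vec; each document vector is the average of its word vectors.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
model = Word2Vec(vectorSize=32, minCount=1, inputCol="words", outputCol="vector") \
    .fit(tokenizer.transform(docs))
doc_vectors = model.transform(tokenizer.transform(docs))

# Embed the question the same way, then rank documents by cosine similarity.
question = spark.createDataFrame([("what builds pipelines",)], ["text"])
q_vec = model.transform(tokenizer.transform(question)).first()["vector"].toArray()

@F.udf("double")
def cosine_sim(v):
    a = v.toArray()
    denom = float(np.linalg.norm(a) * np.linalg.norm(q_vec)) or 1.0
    return float(np.dot(a, q_vec) / denom)

doc_vectors.withColumn("score", cosine_sim("vector")).orderBy(F.desc("score")).show()
```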
Required:
- Spark-AI - Toolbox for building Generative AI applications on top of Apache Spark.
Below is the recommended platform setup. The template is entirely code-based and also runs on open-source Spark.
- Prophecy Low-Code (version 3.1 and above) - for building the data pipelines. A free account is available.
- Databricks (DBR 12.2 ML and above) - for running the data pipelines. A free community edition is available, or you can start a Databricks free trial through Prophecy.
Ensure that the above dependencies are satisfied and create appropriate accounts on the services you want to use. After that, save the generated tokens within the `.env` file (you can base it on the `sample.env` file).
- Slack - quick video here
  - Set up the Slack application using the manifest file in `apps/slack/manifest.yml`.
  - Generate an App-Level Token with the `connections:write` permission. This token is going to be used for receiving messages from Slack. Save it as `SLACK_APP_TOKEN`.
  - Find the Bot User OAuth Token. This token is going to be used for sending messages to Slack. Save it as `SLACK_TOKEN`.
- OpenAI - Create an account here and generate an API key. Save it as `OPEN_AI_API_KEY`.
- Pinecone - Create an account here and generate an API key. Save it as `PINECONE_TOKEN`.
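Taken together, the resulting `.env` file might look roughly like this (placeholder values only; treat `sample.env` as the authoritative reference):

```
# App-Level Token (receiving messages)
SLACK_APP_TOKEN=xapp-...
# Bot User OAuth Token (sending messages)
SLACK_TOKEN=xoxb-...
OPEN_AI_API_KEY=sk-...
PINECONE_TOKEN=your-pinecone-api-key
```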
Ensure that your `.env` file contains all secrets and run `setup_databricks.sh` to create the required secrets and schemas.
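If you want to sanity-check the Slack credentials before running anything, here is a small sketch using the official `slack_sdk` package (not part of the template; it assumes the tokens are exported in your environment and that `slack_sdk` with its websocket dependency is installed):

```python
import os

from slack_sdk import WebClient
from slack_sdk.socket_mode import SocketModeClient

# Bot User OAuth Token (SLACK_TOKEN) - used for sending messages.
web = WebClient(token=os.environ["SLACK_TOKEN"])
print(web.auth_test()["user"])  # prints the bot's name if the token is valid

# App-Level Token (SLACK_APP_TOKEN) - used for receiving messages via Socket Mode.
socket = SocketModeClient(app_token=os.environ["SLACK_APP_TOKEN"], web_client=web)
socket.connect()  # fails if the token is wrong or lacks the connections:write scope
socket.close()
```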
Fork this repository to your personal GitHub account. Afterward, create a new project in Prophecy pointing to the forked repository.
This project runs on Databricks' Unity Catalog by default. However, you can also reconfigure Source & Target gems to use alternative sources.
For Databricks Unity Catalog, create the catalog `gen_ai` with the databases `web_bronze` and `web_silver`. The tables are going to be created automatically on the first boot-up.
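If you prefer to create these objects explicitly, a minimal sketch from a Databricks notebook (where `spark` is already defined) would look like this; note that `setup_databricks.sh` may already create the schemas for you:

```python
# Create the Unity Catalog catalog and schemas used by the pipelines.
spark.sql("CREATE CATALOG IF NOT EXISTS gen_ai")
spark.sql("CREATE SCHEMA IF NOT EXISTS gen_ai.web_bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS gen_ai.web_silver")
```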
You can check out the end result on the video here.