This project showcases how easy it is to build a live chatbot application using your internal datasets on Spark. It features two main components:
- batch ingestion - a set of Spark pipelines that ingest unstructured data from your applications, pre-process and vectorize it, and store it within your vector database of choice (sketched below)
- live inference - a Spark streaming pipeline that reads messages from Slack (soon also Teams) and answers them live using the gathered knowledge.
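The snippet below is not the template's pipeline code, only a minimal sketch of the batch-ingestion idea under stated assumptions: it assumes the pre-1.0 `openai` client and the 2.x `pinecone-client` package are installed on the cluster, and the input path, Pinecone environment, and index name are hypothetical placeholders.

```python
import os

import openai
import pinecone
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Ingest: read raw, unstructured text (one document per file) into a DataFrame.
docs = (
    spark.read.text("dbfs:/tmp/raw_docs/*.txt")  # hypothetical input location
    .withColumnRenamed("value", "text")
    .withColumn("doc_id", F.monotonically_increasing_id())
)

def vectorize_and_store(rows):
    """Embed each document with OpenAI and upsert the vector into Pinecone."""
    openai.api_key = os.environ["OPEN_AI_API_KEY"]
    pinecone.init(api_key=os.environ["PINECONE_TOKEN"], environment="us-west1-gcp")
    index = pinecone.Index("chatbot-knowledge")  # hypothetical index name
    for row in rows:
        response = openai.Embedding.create(input=[row.text], model="text-embedding-ada-002")
        vector = response["data"][0]["embedding"]
        index.upsert(vectors=[(str(row.doc_id), vector, {"text": row.text})])

# 2. Vectorize and store. The template splits pre-processing, embedding, and
#    storage into separate pipelines; this sketch collapses them for brevity.
docs.foreachPartition(vectorize_and_store)
```

The live-inference side follows the same pattern in reverse: it streams Slack messages, looks up the closest vectors as context, and asks the LLM to formulate an answer.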
This template works best with Databricks and Prophecy; however, you can run it on any Spark platform. Spark lets us scale and operationalize the chatbot to large datasets and user bases, while Prophecy makes the pipelines easy to develop and debug.
Optional, but recommended for best results:
- Pinecone - allows for efficient storage and retrieval of vectors. Alternatively, Spark-ML cosine similarity can be used (see the sketch after this list); however, since it doesn't provide KNN indexes for more efficient lookup, it's only recommended for small datasets.
- OpenAI - for creating text embeddings and formulating answers. Alternatively, one can use Spark's word2vec for word embeddings and another LLM (e.g., Dolly) for answer formulation based on context.
- Slack or Teams (support coming soon) - for the chatbot interface. An example batch pipeline is present for fast debugging when unavailable.
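For the Spark-only fallback mentioned in the Pinecone and OpenAI notes above, here is a minimal sketch: Word2Vec embeddings plus a brute-force cosine-similarity scan. Every query scans all vectors (no KNN index), so this only makes sense for small datasets. The sample data, column names, and question are illustrative, not part of the template.

```python
import numpy as np
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
docs = spark.createDataFrame(
    [("Spark runs pipelines at scale",), ("Prophecy builds pipelines visually",)],
    ["text"],
)

# Tokenize and train Word2Vec; each document vector is the average of its word vectors.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
model = Word2Vec(vectorSize=32, minCount=1, inputCol="words", outputCol="vector") \
    .fit(tokenizer.transform(docs))
doc_vectors = model.transform(tokenizer.transform(docs))

# Embed the question the same way, then rank documents by cosine similarity.
question = spark.createDataFrame([("what builds pipelines",)], ["text"])
q_vec = model.transform(tokenizer.transform(question)).first()["vector"].toArray()

@F.udf("double")
def cosine_sim(v):
    a = v.toArray()
    denom = float(np.linalg.norm(a) * np.linalg.norm(q_vec)) or 1.0
    return float(np.dot(a, q_vec) / denom)

doc_vectors.withColumn("score", cosine_sim("vector")).orderBy(F.desc("score")).show()
```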
Required:
- Spark-AI - Toolbox for building Generative AI applications on top of Apache Spark.
Below is the recommended platform setup. The template is entirely code-based and also runs on open-source Spark.
- Prophecy Low-Code (version 3.1 and above) - for building the data pipelines. A free account is available.
- Databricks (DBR 12.2 ML and above) - for running the data pipelines. A free community edition is available, or you can start a Databricks free trial through Prophecy.
Ensure that the above dependencies are satisfied and create appropriate accounts on the services you want to use. After that, save the generated tokens within the `.env` file (you can base it on the `sample.env` file).
- Slack - quick video here
  - Set up the Slack application using the manifest file in `apps/slack/manifest.yml`.
  - Generate an App-Level Token with the `connections:write` permission. This token is going to be used for receiving messages from Slack. Save it as `SLACK_APP_TOKEN`.
  - Find the Bot User OAuth Token. This token is going to be used for sending messages to Slack. Save it as `SLACK_TOKEN`.
- OpenAI - Create an account here and generate an API key. Save it as `OPEN_AI_API_KEY`.
- Pinecone - Create an account here and generate an API key. Save it as `PINECONE_TOKEN`.
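Taken together, the resulting `.env` file might look roughly like this (placeholder values only; treat `sample.env` as the authoritative reference):

```
# App-Level Token (receiving messages)
SLACK_APP_TOKEN=xapp-...
# Bot User OAuth Token (sending messages)
SLACK_TOKEN=xoxb-...
OPEN_AI_API_KEY=sk-...
PINECONE_TOKEN=your-pinecone-api-key
```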
Ensure that your `.env` file contains all secrets and run `setup_databricks.sh` to create the required secrets and schemas.
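If you want to sanity-check the Slack credentials before running anything, here is a small sketch using the official `slack_sdk` package (not part of the template; it assumes the tokens are exported in your environment and that `slack_sdk` with its websocket dependency is installed):

```python
import os

from slack_sdk import WebClient
from slack_sdk.socket_mode import SocketModeClient

# Bot User OAuth Token (SLACK_TOKEN) - used for sending messages.
web = WebClient(token=os.environ["SLACK_TOKEN"])
print(web.auth_test()["user"])  # prints the bot's name if the token is valid

# App-Level Token (SLACK_APP_TOKEN) - used for receiving messages via Socket Mode.
socket = SocketModeClient(app_token=os.environ["SLACK_APP_TOKEN"], web_client=web)
socket.connect()  # fails if the token is wrong or lacks the connections:write scope
socket.close()
```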
Fork this repository to your personal GitHub account. Afterward, create a new project in Prophecy pointing to the forked repository.
This project runs on Databricks' Unity Catalog by default. However, you can also reconfigure Source & Target gems to use alternative sources.
For Databricks Unity Catalog, create the catalog `gen_ai` with the databases `web_bronze` and `web_silver`. The tables are going to be created automatically on the first boot-up.
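If you prefer to create these objects explicitly, a minimal sketch from a Databricks notebook (where `spark` is already defined) would look like this; note that `setup_databricks.sh` may already create the schemas for you:

```python
# Create the Unity Catalog catalog and schemas used by the pipelines.
spark.sql("CREATE CATALOG IF NOT EXISTS gen_ai")
spark.sql("CREATE SCHEMA IF NOT EXISTS gen_ai.web_bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS gen_ai.web_silver")
```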
You can check out the end result on the video here.