Vector Databases - Query Chroma DB Collection

This custom step queries a Chroma vector database collection and writes results to a SAS Cloud Analytics Services (CAS) table.

Vector databases facilitate Generative AI and other applications, notably providing context to a Large Language Model (LLM). Examples of other applications include recommendation engines, similarity search and time series forecasting.

Chroma is an open-source vector database used in Generative AI pipelines. It shares similar constructs and concepts with other vector store offerings.

User Interface

Here's a quick idea:

[Image: Vector Databases - Query Chroma DB Collection]


Table of Contents

  1. Assumptions
  2. Requirements
  3. Parameters
    1. Input Parameters
    2. Configuration
    3. Output Specifications
  4. Run-time Control
  5. Documentation
  6. SAS Program
  7. Installation and Usage
  8. Created/Contact
  9. Change Log

Assumptions

  1. Chroma DB follows a client/server architecture. This step implicitly assumes that the client and server are on the same machine (see comments in code). Users are free to modify the step for persistent, remote/external, or alternatively orchestrated (e.g. Docker container) servers based on their requirements. The Chroma DB documentation provides some examples (refer to Documentation).

  2. Embeddings are assumed to be created with SAS Visual Text Analytics (VTA) for this version of the step. This step requires a SAS Visual Text Analytics (VTA) license.

  3. This custom step runs on data loaded to a SAS Cloud Analytics Services (CAS) library (known as a caslib). Ensure you are connected to CAS before running this step. Also, ensure that your output caslib destination is writeable.

  4. Proc Python is required. The required Python packages are listed in the Requirements section. Also, consider building and installing Python and the required packages through the SAS Configurator for Open Source.

  5. This custom step provides embeddings to Chroma at the time of query and does not use Chroma's embedding function. Embedding function support will be considered in the future.
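
As a rough illustration of assumption 5, the sketch below passes precomputed embedding vectors to a collection through `query_embeddings` rather than `query_texts`, so the collection's own embedding function is not invoked. The helper name `query_with_embeddings` is hypothetical, and the `collection` argument is assumed to be a Chroma collection obtained elsewhere (for example through `chromadb.PersistentClient(path=...).get_collection(name=...)`); the step's actual code may differ.

```python
from typing import Any, Sequence


def query_with_embeddings(collection: Any,
                          embeddings: Sequence[Sequence[float]],
                          n_results: int = 5) -> dict:
    """Query a Chroma-style collection with precomputed vectors.

    `collection` is assumed to expose Chroma's `query` method.
    """
    if n_results < 1:
        raise ValueError("n_results must be at least 1")
    # Convert to plain lists of floats; one inner list per query row.
    vectors = [[float(v) for v in row] for row in embeddings]
    if not vectors:
        raise ValueError("at least one query embedding is required")
    # query_embeddings (not query_texts) bypasses the collection's
    # embedding function, since the vectors are supplied directly.
    return collection.query(query_embeddings=vectors, n_results=n_results)
```

Because the helper only relies on the `query` method, it can be exercised with a stand-in object during testing, before a real Chroma collection is available.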


Requirements

  1. A SAS Viya 4 environment version 2023.12 or later.

  2. Python packages to be installed:

    1. chromadb
    2. pysqlite3-binary
    3. pandas
    4. swat
  3. The suggested Python version is 3.10.x due to a dependency on sqlite version >= 3.35.0 (refer to Documentation). However, the code applies a workaround suggested by Chroma for environments with an older sqlite.

  4. Optionally, based on site-specific architecture, a separate Chroma DB server can be used for persistence and scale. Refer to the Chroma documentation for details.
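
The sqlite requirement in item 3 can be checked from Python before chromadb is imported. The sketch below reports whether the interpreter's bundled sqlite meets the 3.35.0 minimum and, if it does not, attempts the module swap described in Chroma's troubleshooting notes, which relies on the `pysqlite3` module being installed. The function names are hypothetical, and the step's own code may apply the workaround differently.

```python
import sqlite3
import sys

MIN_SQLITE = (3, 35, 0)


def sqlite_is_recent_enough(version=None, minimum=MIN_SQLITE):
    """Return True if the sqlite library version meets the minimum."""
    if version is None:
        version = sqlite3.sqlite_version_info
    return tuple(version) >= tuple(minimum)


def ensure_usable_sqlite():
    """Swap in pysqlite3 when the bundled sqlite is too old.

    Returns the name of the module that ends up registered as sqlite3.
    """
    if sqlite_is_recent_enough():
        return "sqlite3"
    try:
        import pysqlite3  # noqa: F401  (assumed to be installed separately)
    except ImportError as exc:
        raise RuntimeError(
            "sqlite is older than 3.35.0 and pysqlite3 is not installed"
        ) from exc
    # Must run before chromadb is imported so it picks up the replacement.
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
    return "pysqlite3"
```

Calling `ensure_usable_sqlite()` at the top of the Python block, ahead of `import chromadb`, is one way to make the check explicit.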


Parameters


Input Parameters

  1. Name of Chroma DB collection (text field, required): provide the name of the Chroma DB collection you populated earlier. A collection can only be queried by its name.

  2. Query source (drop-down list, frozen): currently set at "Input table" for this version. Other options will be examined in future releases.

  3. Input table containing a text column (input port, required): attach a CAS table to this port.

  4. Query column (column selector, required, maximum 1): select a text column which contains the query you wish to pass to the database.

Configuration

  1. Embedding model caslib (text field, required): provide the caslib containing a VTA embedding model which will be applied to the query in order to generate embeddings.

  2. Embedding model astore name (text field, required): provide the name of a VTA astore model to generate embeddings on the query text.


Follow this process to obtain the above values:

  1. In Model Studio, right-click the Topics node you used to create an embeddings model and select Results.
  2. In the score code portion of the results, locate and copy the values of the following macro variables: input_astore_caslib_name and input_astore_name. Use these values in the two fields above.
  3. Some users may choose to develop embedding models programmatically. They would have specified an astore name and caslib while doing so, which can be used for the fields above.

  3. Embedding pattern (text field, required, default of _Col): document embeddings tend to span hundreds or sometimes thousands of columns. Provide a text pattern which applies to all embedding column names. For example, _Col represents _Col_1, _Col_2, ..., _Col_n. A default of _Col is provided since this is the default value for Visual Text Analytics-generated embeddings.

  4. Location for Chroma database (folder selector, required): select a location where the Chroma database is persisted. Note that this needs to be on the filesystem (SAS Server) and not in SAS Content.

  5. CAS server (text field, default entered): change this only if you need a CAS server name different from that of a typical Viya 4 installation.

  6. CAS port (numeric field, default entered): change this only if you know that the CAS server runs on a different port than the default.
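
To illustrate how an embedding pattern such as _Col might pick out the embedding columns, here is a minimal sketch that filters a list of column names by prefix and orders them by their numeric suffix. The function name and the sample column names are hypothetical, and the step may match the pattern differently (for instance, as a substring rather than a prefix).

```python
import re


def select_embedding_columns(columns, pattern="_Col"):
    """Return the columns named <pattern>_<n>, ordered by n."""
    matcher = re.compile(re.escape(pattern) + r"_(\d+)$")
    matched = []
    for name in columns:
        m = matcher.match(name)
        if m:
            # Keep the numeric suffix so _Col_10 sorts after _Col_2.
            matched.append((int(m.group(1)), name))
    return [name for _, name in sorted(matched)]


# Hypothetical column names, for illustration only.
cols = ["doc_id", "_Col_10", "_Col_2", "text", "_Col_1"]
print(select_embedding_columns(cols))  # ['_Col_1', '_Col_2', '_Col_10']
```

Sorting on the numeric suffix rather than the raw name keeps the vector components in their original order, which matters when the values are assembled into a single embedding.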


Output Specifications

  1. Number of results (numeric stepper): provide the number of results you wish to have returned for each observation of the query column.

  2. Output table (output port, required): attach a CAS table to the output port of this node to hold results.

  3. Promote (check box): check this box if you wish to have the output table promoted to global scope (and be available beyond the SAS Studio session).

Upon successful completion, the output table will contain the query, the IDs of the result documents, the distance measure, and the document content.
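
The output layout described above can be sketched as a flattening of a Chroma query result, which holds one inner list of ids, distances and documents per query. The function name, the output field names and the sample data below are all hypothetical; the actual column names in the step's output table may differ.

```python
def flatten_query_result(queries, result):
    """Turn a Chroma-style query result into one row per (query, hit)."""
    rows = []
    for i, query in enumerate(queries):
        ids = result["ids"][i]
        distances = result["distances"][i]
        documents = result["documents"][i]
        for doc_id, distance, document in zip(ids, distances, documents):
            rows.append({"query": query, "id": doc_id,
                         "distance": distance, "document": document})
    return rows


# Illustrative data only, shaped like the dict returned by collection.query().
sample = {
    "ids": [["d1", "d7"], ["d3", "d1"]],
    "distances": [[0.12, 0.48], [0.05, 0.33]],
    "documents": [["first doc", "seventh doc"], ["third doc", "first doc"]],
}
rows = flatten_query_result(["what is CAS?", "what is Chroma?"], sample)
for row in rows:
    print(row)
```

With two queries and two results each, this yields four rows. A list of dictionaries like this can be handed to `pandas.DataFrame(rows)` before being uploaded to CAS.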


Run-time Control

Note: Run-time control is optional. You may choose whether to execute the main code of this step or not, based on upstream conditions set by earlier SAS programs. This includes nodes run prior to this custom step earlier in a SAS Studio Flow, or a previous program in the same session.

Refer to this blog (https://communities.sas.com/t5/SAS-Communities-Library/Switch-on-switch-off-run-time-control-of-SAS-Studio-Custom-Steps/ta-p/885526) for more details on the concept.

The following macro variable,

_qcd_run_trigger

will initialize with a value of 1 by default, indicating an "enabled" status and allowing the custom step to run.

If you wish to control execution of this custom step, include code in an upstream SAS program to set this variable to 0. This "disables" execution of the custom step.

To "disable" this step, run the following code upstream:

%global _qcd_run_trigger;
%let _qcd_run_trigger =0;

To "enable" this step again, run the following (it's assumed that this has already been set as a global variable):

%let _qcd_run_trigger =1;

IMPORTANT: Be aware that disabling this step means that none of its main execution code will run, and any downstream code which was dependent on this code may fail. Change this setting only if it aligns with the objective of your SAS Studio program.


Documentation

  1. Documentation for the chromadb Python package and Chroma DB

  2. An important note regarding sqlite

  3. SAS Communities article on configuring Viya for Python integration

  4. The SAS Viya Platform Deployment Guide (refer to SAS Configurator for Open Source within)

  5. Options for persistent clients and client connections in Chroma

  6. Documentation for the Analytic Store Scoring action set

  7. Details on the optional run-time trigger control

  8. SAS Communities article on connecting to CAS using the SWAT package in SAS Studio


SAS Program

Refer here for the SAS program used by the step. This is useful when you wish to execute the step, with minor modifications, through interfaces other than SAS Studio Custom Steps, such as the SAS Extension for Visual Studio Code.


Installation and Usage


Created/contact:


Change Log

  • Version 1.0 (30JAN2024)
    • Initial version