data_helper

Code to help generate SQL for stakeholders. Code at https://www.startdataengineering.com/post/data-democratize-llm/

Primary language: Python

Code for the blog post: Democratize Data Access with RAGs

Set up

We will use LlamaIndex to build our RAG pipeline. The concepts shown here apply to RAG pipelines in general.

GitHub Repo: Data Helper

Prerequisites

  1. Python 3.10+
  2. git
  3. OpenAI API key
  4. Poetry
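Before cloning, you can confirm the prerequisites above are on your PATH (the version flags are the standard ones for each tool):

```shell
python3 --version   # needs 3.10+
git --version
poetry --version    # see https://python-poetry.org/docs/ if not installed
```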

Demo

We will clone the repo, install dependencies, and activate the Poetry shell as shown below:

git clone https://github.com/josephmachado/data_helper.git
cd data_helper
poetry install
poetry shell # activate the virtual env

# To run the code, please set your OpenAI API key as shown below
export OPENAI_API_KEY=your-key-here
python run_code.py INDEX # Create an index with data from ./data folder
python run_code.py QUERY --query "show me for each buyer the date they made their first purchase"
# The above command uses the existing index to make a request to the LLM API and get results
# The code will return a SQL query in DuckDB dialect

python run_code.py QUERY --query "for every seller, show me a monthly report of the number of unique products that they sold, avg cost per product, max/min value of product purchased that month"
# The code will return a SQL query in DuckDB dialect
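At a high level, run_code.py dispatches between an indexing step (build an index from ./data) and a query step (use the index plus the LLM to produce DuckDB SQL). A minimal sketch of that CLI shape, where build_index and run_query are hypothetical placeholders and not the repo's actual functions:

```python
import argparse


def build_index(data_dir: str) -> None:
    # Hypothetical placeholder: in the real pipeline this would read the
    # schema/documentation files in data_dir and persist a LlamaIndex index.
    print(f"Indexing documents in {data_dir}")


def run_query(question: str) -> str:
    # Hypothetical placeholder: the real pipeline retrieves relevant context
    # from the index and asks the LLM for a DuckDB-dialect SQL answer.
    return f"-- SQL for: {question}\nSELECT 1;"


def main(argv=None):
    parser = argparse.ArgumentParser(description="data_helper-style CLI sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("INDEX")
    query = sub.add_parser("QUERY")
    query.add_argument("--query", required=True)

    args = parser.parse_args(argv)
    if args.command == "INDEX":
        build_index("./data")
    else:
        print(run_query(args.query))


if __name__ == "__main__":
    main()
```

This mirrors the two commands shown above (`INDEX` and `QUERY --query "..."`) without assuming anything about the repo's internal structure.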

Next Steps

  1. Evaluate results and tune the pipeline
  2. Add an observability system
  3. Monitor API costs
  4. Add additional documentation as input
  5. Explore other use cases, such as RAGs for onboarding, a DE training tool, etc.
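For the first next step (evaluating results), a cheap starting point is a set of sanity checks on the generated SQL before investing in full evaluation. A sketch, assuming nothing about the repo's internals; a real evaluation would also execute the query against DuckDB and compare results:

```python
def basic_sql_checks(sql: str) -> list[str]:
    """Return a list of problems found in LLM-generated SQL (empty = passed)."""
    problems = []
    stripped = sql.strip().rstrip(";").upper()
    # Generated answers should be read-only queries, not DDL/DML.
    if not (stripped.startswith("SELECT") or stripped.startswith("WITH")):
        problems.append("does not start with SELECT or WITH")
    # Catch truncated or malformed output from the LLM.
    if stripped.count("(") != stripped.count(")"):
        problems.append("unbalanced parentheses")
    return problems
```

Checks like these can gate which generated queries are shown to stakeholders and feed failure cases back into prompt or retrieval tuning.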

Further reading

  1. Production RAG tips
  2. Advanced RAG tuning
  3. What is a data warehouse
  4. Conceptual data model

References

  1. LlamaIndex docs