
This is a public repository to go over all the LLM-driven data engineering concepts.

Primary Language: Python

LLM-driven Data Engineering

Accounts to Follow

People

Libraries

Getting Started

Make an OpenAI account here and then generate an API key. For Day 4, you'll also need a Pinecone account and API key.

  • Day 1 (LLM-driven data engineering)
    • Lecture Video is here
    • Lab Video is here
  • Day 2 (LLM dev with LangChain)
    • Lecture Video is here
    • Lab Video is here
  • Day 3 (Using LLM to provide business value)
    • Auto Feedback Repo here
    • Lecture Video is here
    • Lab Video is here
  • Day 4 (Creating ZachGPT with RAG)
    • Vector Database Repo here
    • Lecture Video is here
    • Lab Video is here

Setup

Store the API key as an environment variable:

export OPENAI_API_KEY=<your_api_key>

Or set it in Windows.
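A minimal sketch of how a lab script could confirm the key is visible to Python before any API call is made. The helper name `require_env` is hypothetical and not part of this repo; it only reads the environment variable described above.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the labs.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of a script fails fast with a readable message instead of a later authentication error.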

The easiest way to install the dependencies is with uv. Install it first.

Run uv sync to create the Python environment and install all of the libraries under the .venv folder.

Configure your IDE to use the interpreter under the .venv folder, or activate the environment from your terminal:

source .venv/bin/activate

PS: If you don't want to use uv, run:

pip install .

Day 1 Lab

We'll be using the schemas from Dimensional Data Modeling Week 1 and generating the queries from the homework and labs, except this time we'll do it via LLMs.
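As a rough illustration of the idea, the sketch below combines table DDL and a plain-English task into a single prompt string that could be sent to an LLM. The function name `build_sql_prompt`, the prompt wording, and the `example_events` table are all hypothetical; the lab itself uses the Week 1 schemas, and the PostgreSQL dialect is an assumption based on the Day 2 setup.

```python
def build_sql_prompt(schema_ddl: str, task: str) -> str:
    """Combine table DDL and a plain-English task into one prompt string."""
    return (
        "You are a data engineer. Using only the tables defined below, "
        "write a single PostgreSQL query.\n\n"
        f"Schema:\n{schema_ddl.strip()}\n\n"
        f"Task: {task.strip()}\n"
        "Return only the SQL, with no explanation."
    )


# Hypothetical DDL for illustration only; the labs use the Week 1 schemas.
EXAMPLE_DDL = """
CREATE TABLE example_events (
    user_id BIGINT,
    event_date DATE
);
"""

prompt = build_sql_prompt(EXAMPLE_DDL, "Count events per user per day.")
print(prompt)
```

Putting the schema in the prompt gives the model the exact table and column names to work with, so it has less room to guess.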

Day 2 Lab

We'll be using LangChain to auto-generate SQL queries based on our tables and to write LinkedIn posts in Zach Wilson's voice.

Setup

If you are watching live, you will be given a cloud database URL to use:

export LANGCHAIN_DATABASE_URL=<value Zach gives in Zoom>

If you aren't watching live, you'll need to use the halo_data_dump.dump file located in the data folder.
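Before handing the URL to LangChain, it can help to confirm that it parses as expected. The sketch below uses only the standard library; the helper name `describe_database_url` is hypothetical, the example URL is a placeholder, and the check assumes the URL points at a Postgres database.

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a database URL into its parts, leaving the password out."""
    parts = urlsplit(url)
    # Assumption: the lab database is Postgres, so the scheme starts with "postgres".
    if not parts.scheme.startswith("postgres"):
        raise ValueError(f"expected a Postgres URL, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("database URL has no host")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "user": parts.username,
    }


# Placeholder URL for illustration; in the lab, read os.environ["LANGCHAIN_DATABASE_URL"].
info = describe_database_url("postgresql://student:secret@db.example.com:5432/postgres")
print(info)
```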

Running pg_restore with your local database should get you up and running pretty quickly.

  • Example command, assuming you have Postgres up and running via either Homebrew or Docker:
  • pg_restore -h localhost -p 5432 -d postgres -U <your laptop username> halo_data_dump.dump
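If you would rather drive the restore from Python, the sketch below assembles the same pg_restore arguments as a list. The function `build_pg_restore_command` is a hypothetical helper, the defaults mirror the example command above, and the `data/` path assumes you run it from the repo root.

```python
import getpass
import shlex


def build_pg_restore_command(dump_path, user=None, host="localhost",
                             port=5432, database="postgres"):
    """Return the pg_restore argument list for the given dump file."""
    # Fall back to the current OS user, matching "<your laptop username>".
    user = user or getpass.getuser()
    return [
        "pg_restore",
        "-h", host,
        "-p", str(port),
        "-d", database,
        "-U", user,
        dump_path,
    ]


# "your_username" is a placeholder; omit `user` to use your OS login name.
cmd = build_pg_restore_command("data/halo_data_dump.dump", user="your_username")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the restore and raise an error if pg_restore exits with a non-zero status.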

Day 3 Lab

This lab leverages the Auto Feedback Repo linked under Day 3 in Getting Started.

Day 4 Lab

This lab leverages the Vector Database Repo linked under Day 4 in Getting Started.

Add your Pinecone API key to the environment:

export PINECONE_API_KEY=<your pinecone API key>