This repository contains three Python scripts that together create a pipeline for:
- Fetching markdown files from a specified GitHub repository and splitting them into sections at each heading.
- Retrieving embeddings for each section from the OpenAI API.
- Storing the page URLs and their corresponding sections into tables on Supabase.
Each script is designed to be run independently and to pass data to the next script via JSON files.
- 01-github-scraper.py: Fetches markdown files from a GitHub repository, splits them into sections at each heading, and stores the results in a JSON file.
- 02-get-embeddings.py: Retrieves embeddings for each section of the markdown files from the OpenAI API and stores them in a JSON file.
- 03-store-data.py: Stores page URLs and their corresponding sections in tables on Supabase. Each page section has the page URL as a foreign key.
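The heading-based split performed by the first script could look roughly like the sketch below. This is an illustration, not the code in 01-github-scraper.py: the function name split_into_sections and the "heading" and "content" keys are hypothetical, and it assumes ATX-style headings (lines starting with one to six "#" characters).

```python
import re

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
# Fence marker built programmatically so this example can itself sit in a fence.
FENCE = "`" * 3


def split_into_sections(markdown_text):
    """Split markdown into sections, starting a new one at each heading.

    Returns a list of dicts with the (hypothetical) keys "heading" and
    "content". Text before the first heading is kept under heading None.
    """
    sections = []
    current = {"heading": None, "lines": []}
    in_fence = False
    for line in markdown_text.splitlines():
        # Toggle fence state so "#" lines inside code blocks are not headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            # Close the previous section if it holds anything.
            if current["heading"] is not None or current["lines"]:
                sections.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["heading"] is not None or current["lines"]:
        sections.append(current)
    return [
        {"heading": s["heading"], "content": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Tracking fenced code blocks matters for documentation repositories, where shell comments beginning with "#" would otherwise be mistaken for headings.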
Python 3.6+ is required. All required Python packages can be installed via pip:
pip install -r requirements.txt
These scripts use environment variables to store sensitive information. Create a .env file at the root of the repository and add the following variables:
GITHUB_TOKEN=<your_github_token>
OPENAI_API_KEY=<your_openai_api_key>
SUPABASE_URL=<your_supabase_url>
SUPABASE_KEY=<your_supabase_key>
Replace <your_github_token>, <your_openai_api_key>, <your_supabase_url>, and <your_supabase_key> with your actual GitHub token, OpenAI API key, Supabase URL, and Supabase key, respectively.
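Before running the pipeline, it can help to confirm that all four variables are actually set. The following is a minimal sketch using only the standard library; the helper names missing_env_vars and require_env are hypothetical and not part of the scripts. It checks the process environment and does not itself read the .env file, so it assumes the variables have already been loaded or exported.

```python
import os

# The four variable names listed above.
REQUIRED_VARS = ("GITHUB_TOKEN", "OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_KEY")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def require_env(env=None):
    """Raise a RuntimeError naming every missing variable."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling require_env() at the top of a script fails fast with a message listing everything that is missing, rather than failing later on the first API call.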
After setting up the environment variables, you can run the scripts in the following order:
- Run 01-github-scraper.py to fetch markdown files from the GitHub repository and split them into sections. The results are stored in output.json.
python 01-github-scraper.py
- Run 02-get-embeddings.py to retrieve embeddings for each section of the markdown files from the OpenAI API. The results are stored in embeddings.json.
python 02-get-embeddings.py
- Run 03-store-data.py to store the page URLs and their corresponding sections in tables on Supabase.
python 03-store-data.py
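The hand-off between the steps above relies on JSON files written by one script and read by the next. A minimal sketch of that pattern follows. The helper names save_json and load_json and the record layout (a "url" with a list of "sections") are assumptions for illustration; the actual structure of output.json and embeddings.json is defined by the scripts themselves.

```python
import json
import os
import tempfile


def save_json(data, path):
    """Write data to path as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read UTF-8 JSON from path."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Hypothetical record shape: one entry per page, with its sections.
    pages = [
        {
            "url": "https://example.com/docs/intro.md",
            "sections": [{"heading": "Intro", "content": "Hello"}],
        }
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "output.json")
        save_json(pages, path)          # what an earlier step would write
        assert load_json(path) == pages  # what the next step would read
```

Keeping each intermediate result on disk means a failed step can be rerun without repeating the earlier ones, for example retrying the embedding step without fetching from GitHub again.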
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
License: MIT
Feel free to reach out for any issues or concerns. Happy coding!