SEED: Domain-Specific Data Curation With Large Language Models

This is the code repo for the paper "SEED: Domain-Specific Data Curation With Large Language Models".

SEED is an approach that leverages Large Language Models (LLMs) to automatically generate domain-specific data curation solutions. Given a description of a task, its input data, and the expected output, the SEED compiler produces an executable pipeline consisting of LLM-generated code, small models, and data access modules.

SEED: Legacy Version

Link to the Legacy Version

SEED: Dev Version

Info

Current Version: v0.2.0

Compatibility: Tested on MacBook M1 Pro, macOS 12.6.2, Python 3.9, PyTorch 1.13.1

Hardware Requirements: None

Installation

First install the prerequisites and the SEED package.

git clone git@github.com:Magolor/SEED.git
cd ./SEED/SeeD/
pip install -r requirements.txt
pip install -e .
cd ..

The most basic configuration is the authentication key for the OpenAI API. You can create an openai_api.json file to store it:

{
    "model": "gpt-4-turbo-preview",
    "api_key": "<YOUR_API_KEY>",
    "organization": "<YOUR_ORGANIZATION>"
}

Otherwise, you will be prompted to enter these values in the terminal during SEED setup.
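
If you prefer not to paste the key by hand, the config file can also be generated by a script. The following is a minimal sketch, not part of SEED itself: the helper names build_openai_config and write_openai_config and the environment variable names OPENAI_API_KEY and OPENAI_ORGANIZATION are assumptions made for illustration, and the output location (the current directory) is also an assumption, since the expected location of openai_api.json is not stated here.

```python
import json
import os
from pathlib import Path


def build_openai_config(api_key, organization, model="gpt-4-turbo-preview"):
    """Return a dict with the three fields shown in the example config."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {"model": model, "api_key": api_key, "organization": organization}


def write_openai_config(path, config):
    """Write the config as indented JSON and return the path written."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Assumption: credentials are exported under these variable names.
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        config = build_openai_config(key, os.environ.get("OPENAI_ORGANIZATION", ""))
        # Assumption: the file belongs in the current working directory.
        print("Wrote", write_openai_config("openai_api.json", config))
    else:
        print("OPENAI_API_KEY is not set; nothing written.")
```

Reading the key from the environment keeps it out of shell history and source control. If the variable is unset, the script writes nothing, and the interactive prompt during setup remains available.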

Then run the installer to initialize configurations:

python post_install.py

Tutorials

Recommended: A full tutorial for understanding how SEED works in general: amazon_google_full_tutorial.
A short version of the same Amazon-Google project: amazon_google_tutorial.
A code generation agent tutorial: restaurant_tutorial.
Others:
- pubmed_tutorial
- ...

TODO

SEED is currently under development. Many features and optimizations are coming!