
Lamini: Create Your Own ChatGPT For SQL

License · Python 3.7+ · Code style: black

This is the repo for the Lamini project, which aims to build and share an instruction-following model under a CC-BY license that allows commercial use. The repo contains the data generation script (sql-autocomplete.py), the seed dataset, and a released dataset of generated question-query pairs.

See our blog post for a layperson's explanation of what's going on.

Lamini Process Step by Step

🦙🐪🦙🐫🦙🐪🦙🐫🦙🐪🦙🐫🦙🐪🦙🐫🦙🐪🦙🐫🦙🐪🦙🐫🦙🐪🦙🐫

Authentication to Lamini

Ready to configure your API key? It's easy-peasy! 🔑

First, navigate to your Lamini account page to retrieve your unique API key. Remember to keep this key a secret, and don't expose it in any client-side code or share it with others.

Next, create a config file, like so:

mkdir ~/.powerml
touch ~/.powerml/configure_llama.yaml # backend system names

Finally, open the file with a text editor and place your key in it:

production:
    key: "<YOUR-KEY-HERE>"

The best part? The Lamini python package will automatically load your key from this config file for you, so you don't have to worry about it 🙌
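If you ever want to sanity-check the key yourself, a minimal sketch like the one below should work (it assumes PyYAML is installed; the Lamini package does this for you, so it's purely optional):

import os
import yaml  # pip install pyyaml

# Optional sanity check: read the key from the config file shown above.
config_path = os.path.expanduser("~/.powerml/configure_llama.yaml")
with open(config_path) as f:
    config = yaml.safe_load(f)
print("key loaded:", bool(config["production"]["key"]))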

If you're running Lamini in a docker container, make sure to copy/mount this file inside the container 🐳

See our API docs for more details.

Run

Clone the repository:

git clone git@github.com:lamini-ai/lamini-sql.git

Using Python 🐍

In the repository, install python dependencies:

pip install -r requirements.txt

Run the program to start generating data 📊📊📊

python3 sql-autocomplete.py

Using Docker 🐳

Make sure you have docker installed.

Then, run this command:

./run-sql-autocomplete.sh

Expected Outputs & Autosaved Data 🦙

When you run the program, you should start seeing output of a Seed Question, drawn from the original small dataset in train_spider.json, and a Novel Question, which is a new question generated from that Seed Question [1].

====== Seed Question =====
 question='Show all movie titles, years, and directors, ordered by budget.'
===== Novel Question =====
 question='What are the names, birthdays and addresses of the 10 customers with the most orders?'

These generated questions are saved to data/questions.jsonl. This JSONL file contains one dictionary per line, each with a question field.

Next, you'll see a Response generated for each Novel Question.

====== Question =====
 question='How many heads of the departments are older than 56 ?'
===== Query =====
 response='SELECT count(*) FROM head WHERE age  >  56'

These pairs are saved to data/dataset.jsonl. This JSONL file contains one dictionary per line, each with question and query fields.
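If you want to inspect the saved pairs, a minimal sketch like this should work (data/questions.jsonl can be read the same way, with only a question field per line):

import json

# Load the generated question-query pairs and print a few of them.
with open("data/dataset.jsonl") as f:
    pairs = [json.loads(line) for line in f if line.strip()]

print(len(pairs), "pairs so far")
for pair in pairs[:3]:
    print(pair["question"], "->", pair["query"])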

It's poggers 💥

Modify

I want to use my own seed data

We suggest creating your own dataset and changing the path to train_spider.json in sql-autocomplete.py, or replacing train_spider.json with your own data in the same format. You can of course also modify how the data is loaded, or write your own script with the llama-llm library (pssst, API docs).
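Not sure what "the same format" means in practice? One way to check is to inspect the bundled seed file and mirror its fields in your own data; the field names in the comment below are only an example:

import json

# Inspect the seed dataset the script ships with (path from the footnote below)
# and mirror whatever fields you find there in your own seed file.
with open("data/spider/train_spider.json") as f:
    seed = json.load(f)

print(len(seed), "seed examples")
print(sorted(seed[0].keys()))  # e.g. ['db_id', 'query', 'question', ...]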

I only want to generate questions (to start)

In sql-autocomplete.py, you can just run generate_questions. A common workflow is to pause here for human review, keeping only the good questions before the next step of generating a response for each one.

I have my own instructions, and just want to generate responses

In sql-autocomplete.py, you can just use the make_pairs function to create the question-response pairs. It's common to run this stage separately, e.g. after human review of the generated questions, or if there was an error at this step last time.

I want to generate more than 100 instructions

Change the count flag -c to set the total number of question-response pairs to generate. The default is 100.
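For example, assuming the flag is passed on the command line when you invoke the script:

python3 sql-autocomplete.py -c 500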

Data Release

We've run this script a few times and saved the results for you to freely use, at data/lamini_dataset.jsonl 💸

This file contains 52K instruction-following examples for commercial use (i.e. feel free to use it for your business! 💰📈). It has the same format as the output above, one dictionary per line, each of which contains the following fields:

  • question: str, describes the task the model should perform. Each of the 52K instructions is unique, as generated by lamini/open.
  • query: str, the answer to the instruction as generated by lamini/open.

About Lamini

Lamini is the world's most powerful LLM engine, unlocking the power of generative AI for every company by putting their data to work. It is named after the Lamini tribe of camelids, which includes llamas (LLMs!), alpacas, and more.

Footnotes

  1. The Seed Questions in the Lamini seed dataset are instructions (a combination of questions and commands) based on the Spider dataset. The generated questions are similar in nature and therefore don't have to be literal questions either. You can find the seed dataset at data/spider/train_spider.json.