We release CodeS, a series of Code LLMs specifically optimized for SQL generation. CodeS is incrementally pre-trained from StarCoder on a large SQL-related corpus.
The CodeS series encompasses four scales: CodeS-1B, CodeS-3B, CodeS-7B, and CodeS-15B. CodeS-1B, 3B, and 7B are based on StarCoderBase-1B, 3B, and 7B and support a maximum context length of 8,192 tokens. CodeS-15B, derived from StarCoder-15B, accommodates sequences of up to 6,144 tokens due to computational constraints. The pre-training corpus is collected from several sources, including starcoderdata, CodeAlpaca-20k, and StaQC.
We originally developed CodeS to validate whether a base model's abilities in a single domain (such as SQL generation) can be strengthened using a minimal amount of domain-specific corpus, which is comparatively small relative to the data required for pre-training from scratch. This is particularly friendly for private deployments, since sensitive data never has to be fed into closed-source LLMs.
In addition, we introduce a new few-shot in-context learning Text-to-SQL framework. Within this framework, we employ a demonstration retriever that is sensitive to question patterns, enabling it to dynamically retrieve helpful samples from the training set. We also incorporate a schema item classifier that retains only the schema items (i.e., tables and columns) relevant to the question, which keeps input sequences from becoming unwieldy.
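For intuition, here is a minimal, self-contained sketch of how such a pipeline can fit together. It is illustrative only: TF-IDF similarity stands in for the SimCSE-based question-pattern retriever used in the released scripts, and `classifier` is a hypothetical callable standing in for the trained schema item classifier.

```python
# Illustrative sketch of the few-shot framework; not the released implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_demonstrations(question, demo_pool, k=4):
    """Rank training samples by question similarity. TF-IDF stands in for
    the question-pattern-aware SimCSE embeddings of the real retriever."""
    questions = [d["question"] for d in demo_pool]
    vectorizer = TfidfVectorizer().fit([question] + questions)
    scores = cosine_similarity(vectorizer.transform([question]),
                               vectorizer.transform(questions))[0]
    return [demo_pool[i] for i in scores.argsort()[::-1][:k]]

def filter_schema(question, schema, classifier, threshold=0.5):
    """Keep only the tables/columns that the (hypothetical) classifier scores
    as relevant, so the serialized schema stays within the context window."""
    kept = {}
    for table, columns in schema.items():
        relevant = [c for c in columns if classifier(question, table, c) >= threshold]
        if relevant:
            kept[table] = relevant
    return kept

def build_prompt(question, schema, demos):
    """Serialize the filtered schema, the retrieved demonstrations, and the
    target question into a single prompt for the LLM."""
    ddl = "\n".join(f"table {t} ({', '.join(cols)})" for t, cols in schema.items())
    shots = "\n".join(f"-- {d['question']}\n{d['sql']}" for d in demos)
    return f"{ddl}\n{shots}\n-- {question}\n"
```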
A research paper with more detailed information about CodeS is upcoming! ✨
We evaluate CodeS on two challenging Text-to-SQL benchmarks: Spider and Bird. For Spider, we adopt execution accuracy (EX) and test-suite accuracy (TS) as the evaluation metrics. For Bird, we adopt EX along with the valid efficiency score (VES).
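For reference, EX counts a prediction as correct when it returns the same result set as the gold SQL on the target database; TS repeats this check over many database variants, and VES additionally weighs execution efficiency. Below is a rough sketch of the core EX check, assuming SQLite databases as used by Spider and Bird; the official evaluation scripts handle many more corner cases.

```python
# Rough sketch of execution accuracy (EX); not the official evaluation code.
import sqlite3
from collections import Counter

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(pred_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # unexecutable predictions never match
    finally:
        conn.close()
    # Compare as multisets: row order is unspecified without ORDER BY.
    return Counter(pred) == Counter(gold)
```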
Question 1: Does incremental pre-training significantly enhance StarCoder's SQL generation capability?
To answer this question, we evaluate StarCoder and CodeS in the 4-shot in-context learning setting. Here are the results:
On Spider's development set:
LLM | Spider Dev (EX) | Spider Dev (TS) |
---|---|---|
StarCoderBase-1B | 57.5% | 49.3% |
CodeS-1B | 66.3% | 57.4% |
StarCoderBase-3B | 69.1% | 61.3% |
CodeS-3B | 75.7% | 68.5% |
StarCoderBase-7B | 73.0% | 65.3% |
CodeS-7B | 78.8% | 72.1% |
StarCoderBase-15B | 77.1% | 69.5% |
StarCoder-15B | 77.7% | 68.6% |
CodeS-15B | 80.8% | 72.4% |
On Bird's development set:
LLM | Oracle Know. | Bird Dev (EX) | Bird Dev (VES) |
---|---|---|---|
StarCoderBase-1B | | 21.32% | 22.73%
CodeS-1B | | 27.12% | 31.09%
StarCoderBase-3B | | 26.86% | 29.82%
CodeS-3B | | 31.10% | 35.86%
StarCoderBase-7B | | 32.01% | 34.91%
CodeS-7B | | 34.81% | 36.75%
StarCoderBase-15B | | 35.33% | 39.58%
StarCoder-15B | | 35.27% | 40.25%
CodeS-15B | | 38.14% | 40.80%
StarCoderBase-1B | ✔ | 23.92% | 28.15% |
CodeS-1B | ✔ | 30.57% | 37.18% |
StarCoderBase-3B | ✔ | 32.92% | 36.35% |
CodeS-3B | ✔ | 40.48% | 43.37% |
StarCoderBase-7B | ✔ | 40.09% | 42.68% |
CodeS-7B | ✔ | 43.29% | 47.27% |
StarCoderBase-15B | ✔ | 42.89% | 47.28% |
StarCoder-15B | ✔ | 42.83% | 47.44% |
CodeS-15B | ✔ | 45.57% | 49.03% |
The few-shot results show that CodeS outperforms StarCoder on both the Spider and Bird benchmarks, demonstrating the effectiveness of the incremental pre-training.
Question 2: Is CodeS comparable with previous state-of-the-art Text-to-SQL methods?
To answer this question, we fully fine-tune the parameters of CodeS on the respective training sets and compare the fine-tuned CodeS with existing state-of-the-art Text-to-SQL methods. The results are shown below:
On Spider's development set:
Method | Type | Spider Dev (EX) | Spider Dev (TS) |
---|---|---|---|
T5-3B + PICARD [1] | Fine-tuning | 79.3% | 69.4% |
C3 + ChatGPT [2] | Prompting | 81.8% | 71.4% |
RESDSQL-3B + NatSQL [3] | Fine-tuning | 84.1% | 73.5% |
DIN-SQL + GPT-4 [4] | Prompting | 82.8% | 74.2% |
Graphix-T5-3B + PICARD [5] | Fine-tuning | 81.0% | 75.0% |
Few-shot SQL-PaLM [6] | Prompting | 82.7% | 77.3% |
Fine-tuned SQL-PaLM [6] | Fine-tuning | 82.8% | 78.2% |
Fine-tuned CodeS-1B | Fine-tuning | 79.7% | 73.4% |
Fine-tuned CodeS-3B | Fine-tuning | 82.6% | 77.4% |
Fine-tuned CodeS-7B | Fine-tuning | 85.3% | 79.8% |
Fine-tuned CodeS-15B | Fine-tuning | 84.5% | 79.5% |
On Bird's development set:
Method | Oracle Know. | Type | Bird Dev (EX) | Bird Dev (VES) |
---|---|---|---|---|
Fine-tuned T5-3B [7] | | Fine-tuning | 10.37% | 13.62%
ChatGPT [7] | | Prompting | 24.05% | 27.97%
ChatGPT + CoT [7] | | Prompting | 25.88% | 32.33%
Fine-tuned CodeS-1B | | Fine-tuning | 37.09% | 41.56%
Fine-tuned CodeS-3B | | Fine-tuning | 40.87% | 46.01%
Fine-tuned CodeS-7B | | Fine-tuning | 41.85% | 48.73%
Fine-tuned CodeS-15B | | Fine-tuning | 44.13% | 50.31%
Fine-tuned T5-3B [7] | ✔ | Fine-tuning | 23.34% | 25.57% |
ChatGPT [7] | ✔ | Prompting | 37.22% | 43.81% |
ChatGPT + CoT [7] | ✔ | Prompting | 36.64% | 42.30% |
GPT-4 [7] | ✔ | Prompting | 46.35% | 49.77% |
DIN-SQL + GPT-4 [4] | ✔ | Prompting | 50.72% | 58.79% |
Fine-tuned CodeS-1B | ✔ | Fine-tuning | 49.93% | 53.93% |
Fine-tuned CodeS-3B | ✔ | Fine-tuning | 52.67% | 56.03% |
Fine-tuned CodeS-7B | ✔ | Fine-tuning | 53.78% | 56.99% |
Fine-tuned CodeS-15B | ✔ | Fine-tuning | 54.69% | 58.30% |
Evidently, the fine-tuned CodeS achieves state-of-the-art performance on both the challenging Spider and Bird benchmarks.
(The performance of the baselines is taken from their papers or the official leaderboards.)
Now, you can effortlessly replicate the results by utilizing the checkpoints and scripts we've released.
First, you should create the Anaconda environment and install the required modules:
```bash
conda create -n codes python=3.8.5
conda activate codes
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```
Then, install SimCSE, a tool used to retrieve relevant demonstrations in the few-shot scenario:
```bash
git clone https://github.com/lihaoyang-ruc/SimCSE.git
cd SimCSE
python setup.py install
cd ..
```
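As a quick sanity check of the install, you can exercise SimCSE's retrieval API directly (the checkpoint name below is an assumption; the released scripts may load a different one):

```python
# Smoke test for the SimCSE installation.
from simcse import SimCSE

retriever = SimCSE("princeton-nlp/sup-simcse-roberta-base")
retriever.build_index(["How many singers do we have?",
                       "List the names of all concerts."])
print(retriever.search("count the number of singers"))
```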
Lastly, download the necessary datasets (including Spider and Bird) in data.zip, the schema item classifier checkpoints in sic_ckpts.zip, and Spider's evaluation scripts in test_suite_sql_eval.zip. Once downloaded, simply unzip these files in the root folder:
```bash
unzip data.zip
unzip sic_ckpts.zip
unzip test_suite_sql_eval.zip
```
Few-shot CodeS
For your convenience, we offer the script `text2sql_few_shot.py` to reproduce the few-shot results of CodeS:
```
python -u text2sql_few_shot.py --model_path [model path] --sic_path [sic path] --dataset_path [dataset path] --demonstration_set_path [demon set path] --num_of_demonstrations [num of demon] [--load_in_4bit] [--load_in_8bit]

arguments:
  [model path]        path (or Hugging Face name) of the LLM
  [sic path]          path of the schema item classifier
  [dataset path]      path of the evaluation set
  [demon set path]    path of the demonstration pool (i.e., the training set)
  [num of demon]      number of demonstrations in the context
  --load_in_4bit      (optional) load the LLM with 4-bit quantization (less GPU memory, but slower inference)
  --load_in_8bit      (optional) load the LLM with 8-bit quantization (less GPU memory, but slower inference)
```
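Under the hood, these flags most likely map to quantized loading via transformers and bitsandbytes, roughly as sketched below (the exact arguments used in `text2sql_few_shot.py` may differ):

```python
# Assumed quantized-loading path; treat as a sketch, not the script's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "seeklhy/codes-7b"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 to save memory
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs
)
```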
Here are the commands that will enable you to obtain the results for CodeS in the 4-shot setting:
```bash
# Spider
CUDA_VISIBLE_DEVICES=0 python -u text2sql_few_shot.py --model_path {model_path} --sic_path ./sic_ckpts/sic_spider --dataset_path ./data/sft_eval_spider_text2sql.json --demonstration_set_path ./data/sft_train_spider_text2sql.json --num_of_demonstrations 4

# Bird
CUDA_VISIBLE_DEVICES=0 python -u text2sql_few_shot.py --model_path {model_path} --sic_path ./sic_ckpts/sic_bird --dataset_path ./data/sft_eval_bird_text2sql.json --demonstration_set_path ./data/sft_train_bird_text2sql.json --num_of_demonstrations 4

# Bird (using oracle knowledge)
CUDA_VISIBLE_DEVICES=0 python -u text2sql_few_shot.py --model_path {model_path} --sic_path ./sic_ckpts/sic_bird_with_evidence --dataset_path ./data/sft_eval_bird_with_evidence_text2sql.json --demonstration_set_path ./data/sft_train_bird_with_evidence_text2sql.json --num_of_demonstrations 4
```
You can select the `--model_path` argument from the following options:
CodeS-1B | CodeS-3B | CodeS-7B | CodeS-15B |
---|---|---|---|
seeklhy/codes-1b | seeklhy/codes-3b | seeklhy/codes-7b | seeklhy/codes-15b |
For example, if you want to reproduce the results of 4-shot CodeS-15B on Spider, you can run:
```bash
CUDA_VISIBLE_DEVICES=0 python -u text2sql_few_shot.py --model_path seeklhy/codes-15b --sic_path ./sic_ckpts/sic_spider --dataset_path ./data/sft_eval_spider_text2sql.json --demonstration_set_path ./data/sft_train_spider_text2sql.json --num_of_demonstrations 4
```
Fine-tuned CodeS
Additionally, we provide the script `text2sql_zero_shot.py`, which produces the results of the fine-tuned CodeS:
```
python -u text2sql_zero_shot.py --model_path [model path] --dataset_path [dataset path] --sic_path [sic path] [--load_in_4bit] [--load_in_8bit]

arguments:
  [model path]     path (or Hugging Face name) of the LLM
  [dataset path]   path of the evaluation set
  [sic path]       path of the schema item classifier
  --load_in_4bit   (optional) load the LLM with 4-bit quantization (less GPU memory, but slower inference)
  --load_in_8bit   (optional) load the LLM with 8-bit quantization (less GPU memory, but slower inference)
```
To obtain the results of the fine-tuned CodeS, simply execute the following commands:
```bash
# Spider
CUDA_VISIBLE_DEVICES=0 python -u text2sql_zero_shot.py --model_path {spider_model_path} --dataset_path ./data/sft_eval_spider_text2sql.json --sic_path ./sic_ckpts/sic_spider

# Bird
CUDA_VISIBLE_DEVICES=0 python -u text2sql_zero_shot.py --model_path {bird_model_path} --dataset_path ./data/sft_eval_bird_text2sql.json --sic_path ./sic_ckpts/sic_bird

# Bird (using oracle knowledge)
CUDA_VISIBLE_DEVICES=0 python -u text2sql_zero_shot.py --model_path {bird_with_evidence_model_path} --dataset_path ./data/sft_eval_bird_with_evidence_text2sql.json --sic_path ./sic_ckpts/sic_bird_with_evidence
```
You can choose the `--model_path` argument from the available options:
{spider_model_path} | {bird_model_path} | {bird_with_evidence_model_path} |
---|---|---|
seeklhy/codes-1b-spider | seeklhy/codes-1b-bird | seeklhy/codes-1b-bird-with-evidence |
seeklhy/codes-3b-spider | seeklhy/codes-3b-bird | seeklhy/codes-3b-bird-with-evidence |
seeklhy/codes-7b-spider | seeklhy/codes-7b-bird | seeklhy/codes-7b-bird-with-evidence |
seeklhy/codes-15b-spider | seeklhy/codes-15b-bird | seeklhy/codes-15b-bird-with-evidence |
For instance, to obtain the results of fine-tuned CodeS-7B on the Bird benchmark (using oracle knowledge), you can run:
```bash
CUDA_VISIBLE_DEVICES=0 python -u text2sql_zero_shot.py --model_path seeklhy/codes-7b-bird-with-evidence --dataset_path ./data/sft_eval_bird_with_evidence_text2sql.json --sic_path ./sic_ckpts/sic_bird_with_evidence
```
If you intend to employ CodeS with your private data, the choice of strategy should align with the volume of training data at your disposal:

- Few-shot in-context learning: if you have only a limited number of training samples, the few-shot in-context learning approach is advisable.
- Full fine-tuning: if your training data is substantial, fully fine-tuning CodeS could yield better performance, since it adapts CodeS's parameters more comprehensively to your data (a minimal sketch follows this list).
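If you take the fine-tuning route, a minimal causal-LM recipe with the Hugging Face Trainer might look like the sketch below. It is assumption-laden (toy data, a hypothetical prompt serialization, untuned hyperparameters) and is not the official training code, which is slated for release (see the plan below).

```python
# Minimal full fine-tuning sketch; assumes prompt = serialized schema +
# question and applies the causal-LM loss to the whole sequence.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_path = "seeklhy/codes-1b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Toy training set; replace with your own {"prompt", "sql"} records.
train_set = Dataset.from_list([
    {"prompt": "table singer (singer_id, name, age)\n-- How many singers are there?\n",
     "sql": "SELECT count(*) FROM singer;"},
])

def to_features(example):
    # Labels mirror input_ids so the model learns to continue prompt -> SQL.
    ids = tokenizer(example["prompt"] + example["sql"] + tokenizer.eos_token,
                    truncation=True, max_length=4096)["input_ids"]
    return {"input_ids": ids, "labels": ids.copy()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codes-ft",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=4,
                           learning_rate=5e-6,
                           bf16=True),
    train_dataset=train_set.map(to_features, remove_columns=train_set.column_names),
)
trainer.train()
```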
Important Note: We should emphasize that CodeS is a pre-trained LLM that has not been aligned with human preferences via supervised fine-tuning and reinforcement learning from human feedback. Consequently, CodeS is not a "Chat" LLM.
- Submit the fine-tuned CodeS to the leaderboards for Spider and Bird benchmarks.⚪
- Evaluate the few-shot performance of closed-source LLMs like Text-davinci-003 and ChatGPT.⚪
- Make available the training code, data pre-processing code, and the data employed for the incremental pre-training process.⚪
The code and weights in this repository are open-sourced under the Apache-2.0 license.
CodeS is a collaborative effort between Renmin University of China and ai-finance (金科览智). If you have any questions, we encourage you to open GitHub issues or contact Haoyang Li directly at lihaoyang.cs@ruc.edu.cn.
[1] Scholak, T., Schucher, N., & Bahdanau, D. (2021). PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093.
[2] Dong, X., Zhang, C., Ge, Y., Mao, Y., Gao, Y., Lin, J., & Lou, D. (2023). C3: Zero-shot Text-to-SQL with ChatGPT. arXiv preprint arXiv:2307.07306.
[3] Li, H., Zhang, J., Li, C., & Chen, H. (2023, June). Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 11, pp. 13067-13075).
[4] Pourreza, M., & Rafiei, D. (2023). Din-sql: Decomposed in-context learning of text-to-sql with self-correction. arXiv preprint arXiv:2304.11015.
[5] Li, J., Hui, B., Cheng, R., Qin, B., Ma, C., Huo, N., ... & Li, Y. (2023). Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing. arXiv preprint arXiv:2301.07507.
[6] Sun, R., Arik, S. O., Nakhost, H., Dai, H., Sinha, R., Yin, P., & Pfister, T. (2023). SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL. arXiv preprint arXiv:2306.00739.
[7] Li, J., Hui, B., Qu, G., Yang, J., Li, B., Li, B., ... & Li, Y. (2023). Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. arXiv preprint arXiv:2305.03111.