This is the official implementation of our work "DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning" (ICML 2024). [arXiv Version] [Download Benchmark(Google Drive)]
We select 30 representative data science tasks covering three data modalities and two fundamental ML task types. Please download the datasets and corresponding configuration files via [Google Drive] and unzip them into the "development/benchmarks" directory. We also collect human insight cases from Kaggle in development/data.zip; please unzip that archive as well.
Warning
Non-Infringement: The pre-processed data we provide is intended exclusively for educational and research purposes. We do not claim ownership of the original data, and any use of this data must respect the rights of the original creators. Users are responsible for ensuring that their use of the data does not infringe on any copyrights or other intellectual property rights.
This project is built on top of the MLAgentBench framework. First, install the MLAgentBench package with:
cd development
pip install -e .
Then, install the necessary libraries listed in requirements.txt:
# If you see "ERROR: Failed building wheel for cchardet", Cython is probably not installed; run: pip install cython
pip install -r requirements.txt
pip install tiktoken
Since DS-Agent mainly uses GPT-3.5 and GPT-4 for all the experiments, please fill in your OpenAI API key in development/MLAgentBench/LLM.py and deployment/generate.py.
Run DS-Agent for development tasks with the following command:
cd development/MLAgentBench
python runner.py --task feedback --llm-name gpt-3.5-turbo-16k --edit-script-llm-name gpt-3.5-turbo-16k
During execution, logs and intermediate solution files will be saved in logs/ and workspace/.
Run DS-Agent for deployment tasks with the following commands:
cd deployment
bash code_generation.sh
bash code_evaluation.sh
For the open-source LLM used in this paper, mixtral-8x7b-Instruct-v0.1, we use the vLLM framework. First, start the LLM server with:
cd deployment
bash start_api.sh
Then, run the shell script with the --llm configuration set to mixtral.
A1. Assume there are two agents, A and B. Given a data science task, each agent performs 5 random trials to build models. We then use the predefined evaluation metric to evaluate each built model on the test set, and rank the ten resulting models by their evaluation results.
Suppose the models built by Agent A attain ranks [1,3,5,7,9], and the models built by Agent B attain ranks [2,4,6,8,10].
As such, MeanRank(A)=mean([1,3,5,7,9])=5, BestRank(A)=min([1,3,5,7,9])=1. Similarly, MeanRank(B)=6, BestRank(B)=2.
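The calculation above can be reproduced with a minimal Python sketch. This is not code from the repository: the helper names (`rank_trials`, `mean_rank`, `best_rank`) and the test scores are hypothetical, chosen only so that the resulting ranks match the example. The sketch assumes a higher score is better and does not handle tied scores.

```python
from statistics import mean


def rank_trials(trials, higher_is_better=True):
    """Rank all trials jointly (rank 1 = best) and group the ranks by agent.

    `trials` is a list of (agent_name, test_score) pairs. Ties are not
    handled specially; tied trials keep their input order.
    """
    ordered = sorted(trials, key=lambda t: t[1], reverse=higher_is_better)
    ranks = {}
    for position, (agent, _score) in enumerate(ordered, start=1):
        ranks.setdefault(agent, []).append(position)
    return ranks


def mean_rank(agent_ranks):
    """Average rank over an agent's trials."""
    return mean(agent_ranks)


def best_rank(agent_ranks):
    """Best (lowest) rank over an agent's trials."""
    return min(agent_ranks)


# Hypothetical test scores for 5 trials per agent (illustrative only).
trials = [
    ("A", 0.95), ("A", 0.85), ("A", 0.75), ("A", 0.65), ("A", 0.55),
    ("B", 0.90), ("B", 0.80), ("B", 0.70), ("B", 0.60), ("B", 0.50),
]

ranks = rank_trials(trials)
for agent in ("A", "B"):
    print(agent, ranks[agent], mean_rank(ranks[agent]), best_rank(ranks[agent]))
```

For a metric where lower is better (for example, an error measure), pass `higher_is_better=False` so that the smallest score receives rank 1.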
Please consider citing our paper if you find this work useful:
@InProceedings{DS-Agent,
title = {{DS}-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning},
author = {Guo, Siyuan and Deng, Cheng and Wen, Ying and Chen, Hechang and Chang, Yi and Wang, Jun},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {16813--16848},
year = {2024},
volume = {235},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR}
}