This is the official repository for the paper "AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions".
AutoKaggle is a powerful framework that assists data scientists in completing data science pipelines through a collaborative multi-agent system. The framework combines iterative development, comprehensive testing, and a machine learning tools library to automate Kaggle competitions while maintaining high customizability. The key features of AutoKaggle include:
- Multi-agent Collaboration: Five specialized agents (`Reader`, `Planner`, `Developer`, `Reviewer`, and `Summarizer`) work together through six key competition phases.
- Iterative Development and Unit Testing: Robust code verification through debugging and comprehensive unit testing.
- ML Tools Library: Validated functions for data cleaning, feature engineering, and modeling.
- Comprehensive Reporting: Detailed documentation of workflow and decision-making processes.
- Clone the repository

  ```bash
  git clone https://github.com/multimodal-art-projection/AutoKaggle.git
  ```

- Create and activate the conda environment

  ```bash
  conda create -n AutoKaggle python=3.11
  conda activate AutoKaggle
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure the OpenAI API

  Create `api_key.txt` with:

  ```
  sk-xxx                      # Your API key
  https://api.openai.com/v1   # Base URL
  ```
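A quick way to create this file from the shell, mirroring the two-line format above (replace `sk-xxx` with your actual key; the inline comments are omitted here in case your setup expects bare values):

```bash
# Create api_key.txt with the API key on the first line and the base URL on the second
cat > api_key.txt << 'EOF'
sk-xxx
https://api.openai.com/v1
EOF
```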
We support evaluation of tabular datasets from Kaggle. Please place the competition data in `./multi_agents/competition/` with the following structure:
```
competition/
├── train.csv
├── test.csv
├── sample_submission.csv
└── overview.txt          # Competition overview and data description
```
`overview.txt`: Copy and paste the Overview and Data sections from the Kaggle competition homepage into this file. The `Reader` agent reads it to summarize the relevant information.
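For example, after downloading a tabular competition from Kaggle, you could stage the files as follows (the download path is illustrative):

```bash
# Copy the downloaded competition files into the layout expected above
mkdir -p ./multi_agents/competition
cp /path/to/download/train.csv             ./multi_agents/competition/
cp /path/to/download/test.csv              ./multi_agents/competition/
cp /path/to/download/sample_submission.csv ./multi_agents/competition/
# Create overview.txt, then paste the Overview and Data sections into it
touch ./multi_agents/competition/overview.txt
```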
To run AutoKaggle experiments, use the following command:

```bash
bash run_multi_agent.sh
```
- Competition Selection
  - `competitions`: Define target competitions in the script
- Experiment Control
  - `start_run`, `end_run`: Define experiment iterations (default: 1-5)
  - `dest_dir_param`: Output directory specification (default: `"all_tools"`)
- Model Configuration
  - Default: `gpt-4o` for `Planner` and `Developer`, `gpt-4o-mini` for the other agents
  - `model` determines the base model of `Planner`
  - Modify `_create_agent` in `multi_agents/sop.py` to change the base model of the other agents
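For reference, a hypothetical sketch of how the parameters above might be set inside `run_multi_agent.sh`; the parameter names (`competitions`, `start_run`, `end_run`, `dest_dir_param`, `model`) are documented above, but the competition names and exact syntax are illustrative only and may differ from the actual script:

```bash
# Illustrative values only; edit run_multi_agent.sh to match your setup
competitions=("titanic" "house-prices")  # target competitions (example names)
start_run=1                              # first experiment iteration
end_run=5                                # last experiment iteration
dest_dir_param="all_tools"               # output directory label
model="gpt-4o"                           # base model for the Planner
```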
By default, results are written to:

```
multi_agents/experiments_history/
└── <competition>/
    └── <model>/
        └── <dest_dir_param>/
            └── <run_number>/
```
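After a run finishes, you can inspect its artifacts under the corresponding directory; for example (the competition name here is hypothetical, `gpt-4o` is the default Planner model):

```bash
ls multi_agents/experiments_history/titanic/gpt-4o/all_tools/1/
```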
We evaluated AutoKaggle across 8 diverse Kaggle competitions, achieving:
- an 85% validation submission rate
- a comprehensive score of 0.82
```bibtex
@misc{li2024autokagglemultiagentframeworkautonomous,
      title={AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions},
      author={Ziming Li and Qianbo Zang and David Ma and Jiawei Guo and Tianyu Zheng and Minghao Liu and Xinyao Niu and Yue Wang and Jian Yang and Jiaheng Liu and Wanjun Zhong and Wangchunshu Zhou and Wenhao Huang and Ge Zhang},
      year={2024},
      eprint={2410.20424},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.20424},
}
```
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.