/AutoKaggle

Primary LanguagePythonApache License 2.0Apache-2.0

AutoKaggle

This is the formal repo for paper: "AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions"

kaggle_main

Introduction

AutoKaggle is a powerful framework that assists data scientists in completing data science pipelines through a collaborative multi-agent system. The framework combines iterative development, comprehensive testing, and a machine learning tools library to automate Kaggle competitions while maintaining high customizability. The key features of AutoKaggle include:

  • Multi-agent Collaboration: Five specialized agents (Reader, Planner, Developer, Reviewer, and Summarizer) work together through six key competition phases.
  • Iterative Development and Unit Testing: Robust code verification through debugging and comprehensive unit testing.
  • ML Tools Library: Validated functions for data cleaning, feature engineering, and modeling.
  • Comprehensive Reporting: Detailed documentation of workflow and decision-making processes.

unit_test

Quick Start with AutoKaggle

Set Environment

  1. Clone the repository
git clone https://github.com/multimodal-art-projection/AutoKaggle.git
  1. Create and activate conda environment
conda create -n AutoKaggle python=3.11
conda activate AutoKaggle
  1. Install dependencies
pip install -r requirements.txt
  1. Configure OpenAI API Create api_key.txt with:
sk-xxx                           # Your API key
https://api.openai.com/v1       # Base URL

Data Preparation

We support evaluation of Tabular-type datasets from Kaggle. Please Place competition data in ./multi_agents/competition/ with the following structure:

competition/
├── train.csv
├── test.csv
├── sample_submission.csv
└── overview.txt                 # Competition overview and data description

overview.txt: Copy and paste the Overview and Data sections from the Kaggle competition homepage into this file. The Reader will read this file to summarize relevant information.

Running AutoKaggle

To run AutoKaggle experiments, use the following command:

bash run_multi_agent.sh

Configuration Parameters

  • Competition Selection

    • competitions: Define target competitions in the script
  • Experiment Control

    • start_run, end_run: Define experiment iterations (default: 1-5)
    • dest_dir_param: Output directory specification (default: "all_tools")
  • Model Configuration

    • Default: gpt-4o for Planner and Developer, gpt-4o-mini for other agents
    • model determines the base model of Planner
    • Modify _create_agent in multi_agents/sop.py to change the base model of other agents

Output Structure

multi_agents/experiments_history/
└── <competition>/
    └── <model>/
        └── <dest_dir_param>/
            └── <run_number>/

Result

We evaluated AutoKaggle across 8 diverse Kaggle competitions, achieving:

  • 85% validation submission rate
  • 0.82 comprehensive score

Citation

@misc{li2024autokagglemultiagentframeworkautonomous,
      title={AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions}, 
      author={Ziming Li and Qianbo Zang and David Ma and Jiawei Guo and Tianyu Zheng and Minghao liu and Xinyao Niu and Yue Wang and Jian Yang and Jiaheng Liu and Wanjun Zhong and Wangchunshu Zhou and Wenhao Huang and Ge Zhang},
      year={2024},
      eprint={2410.20424},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.20424}, 
}

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.