Data Interpreter (DI)

What is Data Interpreter

Data Interpreter is an agent that solves data-related problems through code. It understands user requirements, makes plans, writes code for execution, and uses tools when necessary.

Experiments in the Paper

Installation

Ensure that Python 3.9+ is installed on your system. You can check this with: python --version.
We recommend using a conda environment: conda create -n di python=3.9 && conda activate di
We use metagpt, a third-party package, as a dependency to develop Data Interpreter. Install metagpt and configure your OpenAI API key in config/config2.yaml.

pip install metagpt==0.8.1
# pip install metagpt[rag]==0.8.1  # if you want to use experience
export PYTHONPATH="/absolute/path/to/this/repo:$PYTHONPATH"
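A minimal config/config2.yaml sketch, assuming metagpt's standard config schema; the model name and key below are placeholders to replace with your own values:

llm:
  api_type: "openai"
  model: "gpt-4-turbo"  # placeholder, pick the model you want DI to use
  base_url: "https://api.openai.com/v1"
  api_key: "sk-..."  # placeholder, use your real key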

Data Interpreter Dataset Structure

di_dataset

  • ml_benchmark
    • 04_titanic
    • 05_house-prices-advanced-regression-techniques
    • 06_santander-customer-transaction-prediction
    • 07_icr-identify-age-related-conditions
    • 08_santander-value-prediction-challenge
  • open_ended_tasks
    • 01_ocr
    • 02_ocr
    • 03_ocr
    • 14_image_background_removal
    • 16_image_2_code_generation
    • 17_image_2_code_generation
  • MATH
    • kept in its original downloaded structure

ML-Benchmark Dataset and Requirements

Before running the experiments, download the datasets (links in the table below) and place them in the specified path (di_dataset). The 04_titanic dataset has already been downloaded into di_dataset for demonstration.

Then run split.py to split each dataset into training and evaluation sets:

cd di_dataset/ml_benchmark
python split.py
cd ../..
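Conceptually, split.py does something like the following sketch; the real script defines the actual ratio, seed, and per-dataset handling, so the 80/20 split and seed below are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical illustration for one dataset (04_titanic); split.py handles all of them.
df = pd.read_csv("04_titanic/train.csv")
train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)  # ratio/seed are assumptions
train_df.to_csv("04_titanic/split_train.csv", index=False)
eval_df.to_csv("04_titanic/split_eval.csv", index=False)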

ML-Benchmark contains 8 typical machine learning datasets.

ID | Task Name | Dataset Name | Link | User Requirement
01 | 01_iris | Iris | Built-in sklearn dataset; no need to download | Run data analysis on sklearn Iris dataset, include a plot
02 | 02_wines_recognition | Wine recognition | Built-in sklearn dataset; no need to download | Run data analysis on sklearn Wine recognition dataset, include a plot, and train a model to predict wine class with 20% as test set, and show prediction accuracy
03 | 03_breast_cancer | Breast Cancer | Built-in sklearn dataset; no need to download | Run data analysis on sklearn Wisconsin Breast Cancer dataset, include a plot, train a model to predict targets (20% as validation), and show validation accuracy
04 | 04_titanic | Titanic | link | This is a titanic passenger survival dataset, your goal is to predict passenger survival outcome. The target column is Survived. Perform data analysis, data preprocessing, feature engineering, and modeling to predict the target. Report accuracy on the eval data. Train data path: '{data_dir}/ml_benchmark/04_titanic/split_train.csv', eval data path: '{data_dir}/ml_benchmark/04_titanic/split_eval.csv'.
05 | 05_house_prices | House Prices | link | This is a house price dataset, your goal is to predict the sale price of a property based on its features. The target column is SalePrice. Perform data analysis, data preprocessing, feature engineering, and modeling to predict the target. Report RMSE between the logarithm of the predicted value and the logarithm of the observed sales price on the eval data. Train data path: '{data_dir}/ml_benchmark/05_house-prices-advanced-regression-techniques/split_train.csv', eval data path: '{data_dir}/ml_benchmark/05_house-prices-advanced-regression-techniques/split_eval.csv'.
06 | 06_santander_customer | Santander Customer | link | This is a customer financial dataset. Your goal is to predict which customers will make a specific transaction in the future. The target column is target. Perform data analysis, data preprocessing, feature engineering, and modeling to predict the target. Report AUC on the eval data. Train data path: '{data_dir}/ml_benchmark/06_santander-customer-transaction-prediction/split_train.csv', eval data path: '{data_dir}/ml_benchmark/06_santander-customer-transaction-prediction/split_eval.csv'.
07 | 07_icr_identify | ICR - Identifying | link | This is a medical dataset with over fifty anonymized health characteristics linked to three age-related conditions. Your goal is to predict whether a subject has or has not been diagnosed with one of these conditions. The target column is Class. Perform data analysis, data preprocessing, feature engineering, and modeling to predict the target. Report F1 Score on the eval data. Train data path: '{data_dir}/ml_benchmark/07_icr-identify-age-related-conditions/split_train.csv', eval data path: '{data_dir}/ml_benchmark/07_icr-identify-age-related-conditions/split_eval.csv'.
08 | 08_santander_value | Santander Value | link | This is a customer financial dataset. Your goal is to predict the value of transactions for each potential customer. The target column is target. Perform data analysis, data preprocessing, feature engineering, and modeling to predict the target. Report RMSLE on the eval data. Train data path: '{data_dir}/ml_benchmark/08_santander-value-prediction-challenge/split_train.csv', eval data path: '{data_dir}/ml_benchmark/08_santander-value-prediction-challenge/split_eval.csv'.

Note:

  1. data_dir is the directory where the di_dataset is stored.

To reproduce the results in the paper, run the following command:

python examples/run_ml_benchmark.py --task_name 04_titanic

Some key arguments:

  • --task_name: required, specifies the task to run, e.g., 04_titanic or 14_image_background_removal. Refer to the task tables in this document for available task names.
  • --data_dir: optional, the directory that stores di_dataset (defaults to ., the current working directory).
  • --use_reflection: optional, whether to use reflection (default: True).
  • --use_experience: optional, whether to use experience (default: False).
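For reference, the runner scripts drive metagpt's DataInterpreter role under the hood. Below is a minimal standalone sketch, assuming the DataInterpreter import path of metagpt 0.8.x; the requirement string is just an example:

import asyncio

from metagpt.roles.di.data_interpreter import DataInterpreter

async def main():
    # Hand the agent a natural-language requirement; DI plans,
    # writes, and executes code to satisfy it.
    di = DataInterpreter()
    await di.run("Run data analysis on sklearn Iris dataset, include a plot")

asyncio.run(main())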

Open-Ended Tasks Dataset and Requirements

For Open-Ended Tasks, we collected and designed 20 moderately challenging tasks that require Data Interpreter to understand user requirements, plan and decompose tasks, and generate and execute code.

ID | Task Name | Scenario | Scenario Description | User Requirement
01 | 01_ocr | OCR | Scan all the necessary fields and amounts from the given file and then create an Excel sheet with the extracted data. | This is an English invoice image. Your goal is to perform OCR on the image, extract the total amount from the OCR result and save it as a table, using PaddleOCR. The PaddleOCR environment has been fully installed, try to use PaddleOCR as much as possible. Image path: '{data_dir}/open_ended_tasks/01_ocr.png'
02 | 02_ocr | OCR | Scan all the necessary fields and amounts from the given file and then create an Excel sheet with the extracted data. | This is a Chinese invoice image. Your goal is to perform OCR on the image and only output the recognized text word results, nothing else is needed, then extract the total amount and the receipt ID starting with 'No' from the OCR text word results and save them as a table, using PaddleOCR. The PaddleOCR environment has been fully installed, try to use PaddleOCR as much as possible. Image path: '{data_dir}/open_ended_tasks/02_ocr.jpg'
03 | 03_ocr | OCR | Scan all the necessary fields and amounts from the given file and then create an Excel sheet with the extracted data. | This is an invoice image for OCR. Your goal is to perform OCR on the image, extract the total amount and save it into an Excel table format, using PaddleOCR with lang='en'. The PaddleOCR environment has been fully installed, try to use PaddleOCR as much as possible. Image path: '{data_dir}/open_ended_tasks/03_ocr.jpg'
04 | 04_web_search_and_crawling | Web search and crawling | Crawling and organizing web form information | Get data from the paperlist table in https://papercopic.com/statistics/iclr-statistics/iclr-2024-statistics/ and save it to a csv file. Paper titles must include multiagent or large language model. Notice: print key variables
05 | 05_web_search_and_crawling | Web search and crawling | Crawling and organizing web form information | Obtain the CPI data from https://www.stats.gov.cn/sj/sjjd/202307/t20230718_1941322.html, please follow this plan step by step: 1. Detect the encoding type and HTML structure of the target webpage. 2. Crawl the webpage, de-duplicate the body content, convert it to a clear paragraph suitable for reading as plain text, and save it to target.txt. 3. Design multiple regular expressions to match key sentences in target.txt, use try-except statements to combine the various regular expression matches, and note that the webpage text is in Chinese. 4. Finally, use a Chinese summary to summarize the key sentences to answer the user's request. Note: If it is a code block, print out the key variable results of the code block; if it is webpage text, print the first 200 characters.
06 | 06_web_search_and_crawling | Web search and crawling | Crawling and organizing web form information | Get products data from the website https://scrapeme.live/shop/ and save it as a csv file. Notice: Firstly parse the web page encoding and the text HTML structure; the first page's product name, price, product URL, and image URL must be saved in the csv.
07 | 07_web_search_and_crawling | Web search and crawling | Crawling and organizing web form information | Crawl the information on all startup financing from the 36kr venture platform https://pitchhub.36kr.com/financing-flash. Note: this is a Chinese website. Below is a rough workflow; you will adjust the tasks in the current plan according to the result of each step: 1. Crawl and save the HTML structure locally; 2. Directly print the 2000 characters of HTML content after the 7th news-flash keyword, as an example of the news-flash HTML content; 3. Reflect on the patterns in this HTML example and design regular expressions to extract each news flash's title, link, and time; 4. Filter the startup financing news flashes from the last 3 days and print the first 5 as a list[dict]; 5. Save all results in a local csv.
08 | 08_email_reply | Email reply | Filter through my emails and respond to them as necessary | You are an agent that automatically reads and replies to emails. I will give you your Outlook email account and password. You need to check the content of the latest email and return it to me. If the email address suffix of this email is @xxx.xxx, please automatically reply with "I've received your email and will reply as soon as possible. Thank you!" Email account: xxx@xxx.xxx Email Password: xxxx
09 | 09_web_page_imitation | Web page imitation | Using Selenium and WebDriver to access a webpage and convert it to an image, with the assistance of GPT-4V to mimic the creation of a one-page website. | This is a URL of a webpage: https://medium.com/ . Firstly, utilize Selenium and WebDriver for rendering. Secondly, convert the image to a webpage including HTML, CSS and JS in one go. Finally, save the webpage in a text file. All required dependencies and environments have been fully installed and configured.
10 | 10_web_page_imitation | Web page imitation | Using Selenium and WebDriver to access a webpage and convert it to an image, with the assistance of GPT-4V to mimic the creation of a one-page website. | This is a URL of a webpage: https://pytorch.org/ . Firstly, utilize Selenium and WebDriver for rendering. Secondly, convert the image to a webpage including HTML, CSS and JS in one go. Finally, save the webpage in a file. NOTE: All required dependencies and environments have been fully installed and configured.
11 | 11_web_page_imitation | Web page imitation | Using Selenium and WebDriver to access a webpage and convert it to an image, with the assistance of GPT-4V to mimic the creation of a one-page website. | This is a URL of a webpage: https://www.kaggle.com/ . Firstly, utilize Selenium and WebDriver to render the webpage, ensuring the browser window is maximized for an optimal viewing experience. Secondly, convert the image to a webpage including HTML, CSS and JS in one go. Finally, save the webpage in a file. NOTE: All required dependencies and environments have been fully installed and configured.
12 | 12_web_page_imitation | Web page imitation | Using Selenium and WebDriver to access a webpage and convert it to an image, with the assistance of GPT-4V to mimic the creation of a one-page website. | This is a URL of a webpage: https://chat.openai.com/auth/login . Firstly, utilize Selenium and WebDriver to render the webpage, ensuring the browser window is maximized for an optimal viewing experience. Secondly, convert the image to a webpage including HTML, CSS and JS in one go. Finally, save the webpage in a file. NOTE: All required dependencies and environments have been fully installed and configured.
13 | 13_web_page_imitation | Web page imitation | Using Selenium and WebDriver to access a webpage and convert it to an image, with the assistance of GPT-4V to mimic the creation of a one-page website. | This is a URL of a webpage: https://deepmind.google/technologies/gemini/#introduction . Firstly, utilize Selenium and WebDriver to render the webpage, ensuring the browser window is maximized for an optimal viewing experience. Secondly, convert the image to a webpage including HTML, CSS and JS in one go. Finally, save the webpage in a file. NOTE: All required dependencies and environments have been fully installed and configured.
14 | 14_image_background_removal | Image Background Removal | Remove the background of a given image | This is an image, you need to use the python toolkit rembg to remove the background of the image. Image path: '{data_dir}/open_ended_tasks/14_image_background_removal.jpg'; save path: '{data_dir}/open_ended_tasks/14_image_background_removal.jpg'
15 | 15_text2img | Text2Img | Use SD tools to generate images | I want to generate an image of a beautiful girl using the stable diffusion text2image tool, sd_url = "http://your.sd.service.ip:port"
16 | 16_image_2_code_generation | Image2Code Generation | Web code generation | This is an image. First, convert the image to webpage code including HTML, CSS and JS in one go, and finally save the webpage code in a file. The image path: '{data_dir}/open_ended_tasks/16_image_2_code_generation.png'. NOTE: All required dependencies and environments have been fully installed and configured.
17 | 17_image_2_code_generation | Image2Code Generation | Web code generation | This is an image. First, convert the image to webpage code including HTML, CSS and JS in one go, and finally save the webpage code in a file. The image path: '{data_dir}/open_ended_tasks/17_image_2_code_generation.png'. NOTE: All required dependencies and environments have been fully installed and configured.
18 | 18_generate_games | Generate games using existing repo | Game tool usage (pyxel) | Create a Snake game. Players need to control the movement of the snake to eat food and grow its body, while avoiding the snake's head touching its own body or the game boundaries. The game needs to have basic game logic and a user interface. During the production process, please consider factors such as playability, a beautiful interface, and convenient operation. Note: pyxel environment already satisfied
19 | 19_generate_games | Generate games using existing repo | Game tool usage (pyxel) | You are a professional game developer, please use pyxel software to create a simple jumping game. The game needs to include a character that can move left and right on the screen. When the player presses the spacebar, the character should jump. Please ensure that the game is easy to operate, with clear graphics, and complies with the functional limitations of pyxel software. Note: pyxel environment already satisfied
20 | 20_generate_games | Generate games using existing repo | Game tool usage (pyxel) | Make a mouse-click game in which the player clicks a button as many times as possible in 30 seconds, using pyxel. Note: pyxel environment already satisfied

Note:

  1. data_dir is the directory where the di_dataset is stored.
  2. The specific email account and password need to be replaced with an actual email account and password in requirements_prompt.py.
  3. The specific sd_url needs to be replaced with an actual sd_url in requirements_prompt.py.
  4. Code related to "Generate games using existing repo" and the MATH benchmark is being integrated. Stay tuned.

To reproduce the results in the paper, run the following command:

python examples/run_open_ended_tasks.py --task_name 14_image_background_removal
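For intuition, the code Data Interpreter is expected to generate for task 14 might resemble the following sketch; this is an illustration only, not DI's actual output, and the output filename is a hypothetical placeholder:

from rembg import remove
from PIL import Image

# Load the input image, strip its background, and save the result.
# PNG is used for the output so the transparent background is preserved.
input_image = Image.open("di_dataset/open_ended_tasks/14_image_background_removal.jpg")
output_image = remove(input_image)
output_image.save("di_dataset/open_ended_tasks/14_image_background_removal_result.png")  # hypothetical path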

Math Dataset and Requirements

  • Download the MATH dataset here

  • Extract the tar file to di_dataset/MATH

  • Use --categories to select the category to run. The problems are randomly selected from level-5 difficulty. The category names and IDs are listed below; for example, to test on level-5 problems from Number Theory, pass --categories 4:

ID | Category Name
0 | Algebra
1 | Counting & Probability
2 | Geometry
3 | Intermediate Algebra
4 | Number Theory
5 | Prealgebra
6 | Precalculus

To reproduce the results in the paper, run the following command:

python examples/run_math_benchmark.py --categories 4 --level 5 --vote_num 3 --folder ./math_experiment --dataset_path ./di_dataset/MATH

You can find the experiment records in the folder ./math_experiment.