
Scripts to create the MLB dataset introduced in the paper Data-to-text Generation with Entity Modeling



This repo contains scripts to create the MLB dataset introduced in the paper Data-to-text Generation with Entity Modeling (Puduppully, R., Dong, L., & Lapata, M.; ACL 2019).


Requirements

pip install git+https://github.com/ratishsp/mlbgame-api.git

Steps to create the dataset

Run the following scripts in sequence

  • boxscore_data.py. It requires the argument '-year', an integer from 0 to 10 that counts seasons back from 2018: 0 collects the records for 2018, 1 for 2017, and so on.
python boxscore_data.py -year 1 -output ~/mlb-data/api-output/  # get the data for year 2017
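
The mapping from the '-year' value to a season can be sketched as below. This is a minimal illustration, not code from the repo: the helper names `year_for_offset` and `boxscore_commands` are hypothetical, and the assumption that the value 10 corresponds to 2008 is extrapolated from "0 is 2018, 1 is 2017 and so on". The sketch only builds the command lines; it does not run them.

```python
LATEST_YEAR = 2018  # per the README, -year 0 collects the 2018 season
MAX_OFFSET = 10     # per the README, accepted values are 0 through 10


def year_for_offset(offset: int) -> int:
    """Return the season that a given -year value is assumed to select."""
    if not 0 <= offset <= MAX_OFFSET:
        raise ValueError(f"-year must be between 0 and {MAX_OFFSET}, got {offset}")
    return LATEST_YEAR - offset


def boxscore_commands(output_dir: str = "~/mlb-data/api-output/") -> list:
    """Build one boxscore_data.py command line per accepted -year value."""
    return [
        ["python", "boxscore_data.py", "-year", str(offset), "-output", output_dir]
        for offset in range(MAX_OFFSET + 1)
    ]


if __name__ == "__main__":
    # Print each command next to the season it is assumed to cover.
    for offset, cmd in enumerate(boxscore_commands()):
        print(year_for_offset(offset), " ".join(cmd))
```

Printing the commands first makes it easy to check which seasons would be fetched before starting a long download.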

Alternatively, you can download the dataset containing box/line/play-by-play scores from https://drive.google.com/drive/folders/1jLU5wYjic2BR21iOLn9Tkv415AWkFqfj?usp=sharing

python extract_summaries_from_recap_html.py -recaps ~/mlb-data/recap_file_names.txt -output_folder ~/mlb-data/html-output/
python clean_summaries.py -input_folder ~/mlb-data/html-output/ -output_folder ~/mlb-data/html-output-cleaned/
python create_combined_dataset.py -input_folder ~/mlb-data/api-output/ -input_summaries ~/mlb-data/html-output-cleaned/ -output_folder ~/mlb-data/combined/
python preproc.py -input ~/mlb-data/combined/ -mlb_split_keys ~/mlb-data/mlb_split_keys.txt -output ~/mlb-data/splits/

Alternatively, you can download the JSON files from https://drive.google.com/drive/folders/1G4iIE-02icAU2-5skvLlTEPWDQQj1ss4?usp=sharing
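
If you build the dataset locally, the four commands that follow boxscore_data.py can be chained from a small driver script. The sketch below is one possible way to do that, not part of the repo: the function names `pipeline_commands` and `run_pipeline` are hypothetical, the folder layout mirrors the paths used in the commands above, and it assumes the first script is named extract_summaries_from_recap_html.py like the other scripts. By default it only prints the commands (a dry run).

```python
import os
import subprocess


def pipeline_commands(root: str) -> list:
    """Build the four post-processing commands, in the order given in the README."""

    def folder(name: str) -> str:
        # Joining with "" keeps the trailing separator used in the README paths.
        return os.path.join(root, name, "")

    def file(name: str) -> str:
        return os.path.join(root, name)

    return [
        ["python", "extract_summaries_from_recap_html.py",
         "-recaps", file("recap_file_names.txt"),
         "-output_folder", folder("html-output")],
        ["python", "clean_summaries.py",
         "-input_folder", folder("html-output"),
         "-output_folder", folder("html-output-cleaned")],
        ["python", "create_combined_dataset.py",
         "-input_folder", folder("api-output"),
         "-input_summaries", folder("html-output-cleaned"),
         "-output_folder", folder("combined")],
        ["python", "preproc.py",
         "-input", folder("combined"),
         "-mlb_split_keys", file("mlb_split_keys.txt"),
         "-output", folder("splits")],
    ]


def run_pipeline(root: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands(root):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(os.path.expanduser("~/mlb-data"), dry_run=True)
```

Stopping at the first failure matters here because each step reads the folder written by the previous one.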