This repo contains scripts to create the MLB dataset introduced in the paper Data-to-text Generation with Entity Modeling (Puduppully, R., Dong, L., & Lapata, M.; ACL 2019).
- Install the mlbgame-api
pip install git+https://github.com/ratishsp/mlbgame-api.git
Run the following scripts in sequence:
- boxscore_data.py. It requires the argument '-year', which takes a value from 0 to 10. The value 0 collects the records for the year 2018, 1 collects the year 2017, and so on.
python boxscore_data.py -year 1 -output ~/mlb-data/api-output/ # get the data for year 2017
Alternatively, you can download the dataset containing box/line/play-by-play scores from https://drive.google.com/drive/folders/1jLU5wYjic2BR21iOLn9Tkv415AWkFqfj?usp=sharing
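The '-year' index described above can be sketched as a small mapping. This is only an illustration, assuming the pattern "0 is 2018, 1 is 2017, and so on" continues down to index 10; the helper names `year_for_index` and `boxscore_command` are hypothetical and not part of this repo.

```python
BASE_YEAR = 2018  # per the README: -year 0 corresponds to 2018
MAX_INDEX = 10    # per the README: accepted values are 0, 1, 2..10


def year_for_index(index: int) -> int:
    """Return the season that a given -year index is assumed to select."""
    if not 0 <= index <= MAX_INDEX:
        raise ValueError(f"-year must be between 0 and {MAX_INDEX}, got {index}")
    return BASE_YEAR - index


def boxscore_command(index: int, output_dir: str = "~/mlb-data/api-output/") -> list:
    """Build the boxscore_data.py command line shown in the README (hypothetical helper)."""
    year_for_index(index)  # validate the index before building the command
    return ["python", "boxscore_data.py", "-year", str(index), "-output", output_dir]


if __name__ == "__main__":
    # Print one command per season so they can be reviewed before being run in a shell.
    for index in range(MAX_INDEX + 1):
        print(" ".join(boxscore_command(index)), "# year", year_for_index(index))
```

Printing the commands instead of executing them keeps the sketch free of side effects; paste the lines into a shell to collect every season.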
- extract_summaries_from_recap_html.py extracts the recaps from the html. The names of the html files to be downloaded are listed in the file recap_file_names.txt.
python extract_summaries_from_recap_html.py -recaps ~/mlb-data/recap_file_names.txt -output_folder ~/mlb-data/html-output/
- clean_summaries.py cleans the extracted recaps, removing quotations and text incidental to the game.
python clean_summaries.py -input_folder ~/mlb-data/html-output/ -output_folder ~/mlb-data/html-output-cleaned/
- create_combined_dataset.py combines the boxscores and the cleaned summaries into a single dataset.
python create_combined_dataset.py -input_folder ~/mlb-data/api-output/ -input_summaries ~/mlb-data/html-output-cleaned/ -output_folder ~/mlb-data/combined/
- preproc.py preprocesses the dataset into train, validation and test splits. The splits are defined in the file mlb_split_keys.txt.
python preproc.py -input ~/mlb-data/combined/ -mlb_split_keys ~/mlb-data/mlb_split_keys.txt -output ~/mlb-data/splits/
Alternatively, you can download the json files from https://drive.google.com/drive/folders/1G4iIE-02icAU2-5skvLlTEPWDQQj1ss4?usp=sharing
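The sequence of steps above could be driven from a single script. The following is a minimal sketch, not a script shipped with this repo: `pipeline_commands` and `run_pipeline` are hypothetical helpers, the script names and flags are copied from the commands in this README, and it assumes everything lives under one data folder (~/mlb-data by default) and that the scripts are in the current directory.

```python
import os
import subprocess


def pipeline_commands(data_root="~/mlb-data", year_indices=(1,)):
    """Return the README's commands, in order, as argument lists (hypothetical helper)."""
    root = os.path.expanduser(data_root)

    def folder(name):
        # Keep the trailing separator, as in the README's example paths.
        return os.path.join(root, name, "")

    commands = [
        ["python", "boxscore_data.py", "-year", str(i), "-output", folder("api-output")]
        for i in year_indices
    ]
    commands += [
        ["python", "extract_summaries_from_recap_html.py",
         "-recaps", os.path.join(root, "recap_file_names.txt"),
         "-output_folder", folder("html-output")],
        ["python", "clean_summaries.py",
         "-input_folder", folder("html-output"),
         "-output_folder", folder("html-output-cleaned")],
        ["python", "create_combined_dataset.py",
         "-input_folder", folder("api-output"),
         "-input_summaries", folder("html-output-cleaned"),
         "-output_folder", folder("combined")],
        ["python", "preproc.py",
         "-input", folder("combined"),
         "-mlb_split_keys", os.path.join(root, "mlb_split_keys.txt"),
         "-output", folder("splits")],
    ]
    return commands


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(pipeline_commands(), dry_run=True)
```

By default the sketch only prints the commands. Pass `year_indices=range(11)` to include every season, and `dry_run=False` to actually execute the steps.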