README.md

Datasets

The Spotify Podcasts Dataset can be obtained by applying at this link, as access to the dataset is controlled by Spotify.

spotify contains the transcripts and their transformations.

All paths in the script are relative to this location in the larger Spotify-Podcasts dataset: Spotify-Podcasts/EN/podcasts-no-audio-13GB

Copy podcasts-no-audio-13GB into a folder named podcasts-no-audio-13GB locally.

Go to utils_general.py and edit the PATHS at the top of the file.
Create the Spotify dataset in this project by running its notebook (Spotify.ipynb).
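
Before running the notebook, it can help to confirm that the copied dataset folder is where the code expects it. The snippet below is a minimal sketch, not part of this project: DATASET_DIR and missing_dirs are hypothetical names, and the folder location is assumed to be relative to the working directory. Adjust it to match whatever you set in PATHS.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical location of the local copy (an assumption, not read from
# utils_general.py); change it to match the PATHS you configured.
DATASET_DIR = Path("podcasts-no-audio-13GB")


def missing_dirs(paths: Iterable) -> List[Path]:
    """Return the given paths that do not exist as directories."""
    return [Path(p) for p in paths if not Path(p).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs([DATASET_DIR])
    if missing:
        # Report each folder that still needs to be copied or re-pointed.
        for path in missing:
            print(f"Missing dataset folder: {path.resolve()}")
    else:
        print("All dataset folders found.")
```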

We apply two types of transformations to this data:

Synthetic Disfluency Augmentation (0 to 10 transformations): We run our repeats, interjections, and false starts transformations on the podcast transcripts.
Disfluency Annotation Model (-1 transformation): We run Jamshid Lou and Johnson's (2020) model on the 0-transformation (unaugmented) podcast transcripts and remove the words it marks as disfluent.
- The repository for Jamshid Lou and Johnson (2020) is here. On that page, they note here that this repository should be used for "us[ing] the trained models to disfluency label your own data."
- Steps to run it:
  1. Use these instructions to (1) clone the repo, (2) download the model and necessary resources for the model, and (3) unzip the model.
  2. Create a virtual environment which meets these requirements to run the script.
  3. Run the script using these instructions on these files:
    - mkdir spotify_test/annotated
    - cd english-fisher-annotations
    - python main.py --input-path ../spotify_test/0 --output-path ../spotify_test/annotated --model ./model/swbd_fisher_bert_Edev.0.9078.pt
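
The two transformation types can be illustrated with a toy example. This is a minimal sketch, not the project's implementation: add_repeats and remove_disfluent are hypothetical helpers, only the repeats transformation is shown, and the disfluency labels are assumed to be available as one boolean flag per word (the annotation model's actual output format is not described here).

```python
import random
from typing import List, Sequence, Tuple


def add_repeats(
    words: Sequence[str], prob: float, rng: random.Random
) -> Tuple[List[str], List[bool]]:
    """Duplicate each word with probability `prob`.

    Returns the augmented words and a parallel list of flags that is True
    for every inserted (disfluent) copy.
    """
    out_words: List[str] = []
    is_disfluent: List[bool] = []
    for word in words:
        if rng.random() < prob:
            # The inserted copy comes first and is flagged as disfluent.
            out_words.append(word)
            is_disfluent.append(True)
        out_words.append(word)
        is_disfluent.append(False)
    return out_words, is_disfluent


def remove_disfluent(words: Sequence[str], is_disfluent: Sequence[bool]) -> List[str]:
    """Keep only the words whose flag is False (assumed label format)."""
    if len(words) != len(is_disfluent):
        raise ValueError("words and is_disfluent must have the same length")
    return [w for w, flagged in zip(words, is_disfluent) if not flagged]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the augmentation is reproducible
    transcript = "i want to go home".split()
    augmented, flags = add_repeats(transcript, prob=0.3, rng=rng)
    print(" ".join(augmented))
    print(" ".join(remove_disfluent(augmented, flags)))
```

Removing the flagged words from the augmented transcript recovers the original word sequence, which is a convenient sanity check for any synthetic transformation.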

To get T5 working, you currently need to pip install sentencepiece google protobuf; otherwise you get errors.
To get calculate_rouge_scores.py working, conda install evaluate and pip install rouge_score.