The Spotify Podcasts Dataset can be obtained by applying at this link, as access to the dataset is controlled by Spotify.
spotify contains the transcripts and their transformations.
All paths in the script are relative to this location in the larger Spotify-Podcasts dataset:
Spotify-Podcasts/EN/podcasts-no-audio-13GB
Copy podcasts-no-audio-13GB into a folder named podcasts-no-audio-13GB locally.
- Go to utils_general.py and edit the PATHS at the top of the file.
- Create the Spotify dataset in this project by running its notebook (Spotify.ipynb).
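Before editing the paths and running the notebook, it can help to confirm that the local copy is where you expect it. The following is a minimal sketch, not part of this project: the function name `count_transcripts` and the `DATASET_ROOT` value are hypothetical, and it assumes the transcripts are stored as `.json` files somewhere under the copied folder.

```python
from pathlib import Path

# Hypothetical local location; replace with wherever you copied the folder.
DATASET_ROOT = Path("podcasts-no-audio-13GB")


def count_transcripts(root: Path) -> int:
    """Return the number of .json files found anywhere under root."""
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {root}")
    # Assumption: transcripts are stored as .json files in nested folders.
    return sum(1 for _ in root.rglob("*.json"))
```

Calling `count_transcripts(DATASET_ROOT)` raises `FileNotFoundError` if the folder is missing, and a count of zero suggests the path points at the wrong level of the dataset.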
We apply two types of transformations to this data:
- Synthetic Disfluency Augmentation (0 to 10 transformations): We run our repeats, interjections, and false starts transformations on the podcast transcripts.
- Disfluency Annotation Model (-1 transformation): We run Jamshid Lou and Johnson's (2020) model on the 0 podcast transcripts and remove words marked as disfluent.
  - The repository for Jamshid Lou and Johnson (2020) is here. On this page, they note here to use this repository for "us[ing] the trained models to disfluency label your own data."
  - Steps to run it:
    - Use these instructions to (1) clone the repo, (2) download the model and the resources it needs, and (3) unzip the model.
    - Create a virtual environment that meets these requirements to run the script.
    - Run the script using these instructions on these files:
mkdir spotify_test/annotated
cd english-fisher-annotations
python main.py --input-path ../spotify_test/0 --output-path ../spotify_test/annotated --model ./model/swbd_fisher_bert_Edev.0.9078.pt
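Once the annotated files exist, the words marked as disfluent still have to be removed to produce the -1 transformation. The sketch below illustrates that step only and is not this project's code. It assumes each annotated line alternates between a word and its label, with `E` marking a disfluent word and `_` a fluent one; check the actual output files before relying on that format. The function name `remove_disfluent` is hypothetical.

```python
def remove_disfluent(labeled_line: str, disfluent_tag: str = "E") -> str:
    """Drop every word whose label equals disfluent_tag.

    Assumed input format (verify against your annotated files):
    "word1 label1 word2 label2 ...", separated by whitespace.
    """
    tokens = labeled_line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("Expected alternating word/label pairs")
    words = tokens[0::2]   # even positions hold the words
    labels = tokens[1::2]  # odd positions hold the labels
    kept = [w for w, tag in zip(words, labels) if tag != disfluent_tag]
    return " ".join(kept)


cleaned = remove_disfluent("i E i _ want _ uh E coffee _")
print(cleaned)  # i want coffee
```

Splitting on whitespace keeps the sketch short; if the real output uses a different separator or label set, only the parsing lines need to change.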
Run pip install sentencepiece google protobuf to get T5 working right now; otherwise you get errors.
Run conda install evaluate and pip install rouge_score to get calculate_rouge_scores.py working.
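To see which of these packages are still missing before running anything, a small standard-library check can help. This is a sketch under stated assumptions, not part of the project: the function name `missing_packages` is hypothetical, and the mapping from install names to import names (for example `protobuf` importing as `google.protobuf`) is an assumption about how these packages are normally imported.

```python
import importlib.util

# Assumed mapping: install name -> module name used at import time.
REQUIRED = {
    "sentencepiece": "sentencepiece",
    "protobuf": "google.protobuf",
    "evaluate": "evaluate",
    "rouge_score": "rouge_score",
}


def missing_packages(required=REQUIRED):
    """Return the install names whose modules cannot be found."""
    missing = []
    for install_name, module_name in required.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent
            # or a module has no usable spec.
            found = False
        if not found:
            missing.append(install_name)
    return missing
```

An empty list from `missing_packages()` means all four modules were found in the active environment; it does not verify versions.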