Credits

Code adapted from SCHOLAR and ProdLDA.

Input

The script expects a CSV input file of song data with the following header:

head -n1 data/lyrics.csv
index,song,year,artist,genre,lyrics
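
As a minimal sketch (not the project's actual loader), the header can be checked before any processing starts. The function name `read_songs` and the constant `EXPECTED_COLUMNS` are hypothetical; only the column names come from the header shown above.

```python
import csv

# Column names taken from the header of data/lyrics.csv shown above.
EXPECTED_COLUMNS = ["index", "song", "year", "artist", "genre", "lyrics"]


def read_songs(lines):
    """Parse CSV lines into dicts, failing early on an unexpected header.

    `lines` is any iterable of strings, e.g. a file opened with newline="".
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)
```

With a real file this would be called as `read_songs(open("data/lyrics.csv", newline=""))`.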

To prepare the song lyrics for use in modeling, we:

remove empty songs and unused genres
lowercase the lyrics
remove stop words
stem the remaining words
up- and down-sample songs (with replacement) to obtain the same number of songs per genre
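
The steps above can be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions, not the project's code: `STOP_WORDS` is a tiny hypothetical list, `naive_stem` is a crude suffix stripper standing in for whatever stemmer the pipeline actually uses, and all function names are made up for this example.

```python
import random
from collections import defaultdict

# Hypothetical stop word list; a real pipeline would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "you", "it"}


def naive_stem(word):
    """Crude suffix stripper, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def clean_lyrics(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def resample_per_genre(songs, keep_genres, songs_per_genre, seed=0):
    """Return exactly songs_per_genre songs for each kept genre.

    Sampling is with replacement, so small genres are up-sampled
    (songs repeat) and large genres are down-sampled.
    """
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for song in songs:
        # Remove empty songs and unused genres.
        if song["genre"] in keep_genres and song["lyrics"].strip():
            by_genre[song["genre"]].append(song)
    balanced = []
    for genre in sorted(by_genre):
        balanced.extend(rng.choices(by_genre[genre], k=songs_per_genre))
    return balanced
```

Because the sampling is with replacement, the same song can appear more than once in the balanced set, even in genres that are being down-sampled.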

Preprocessing produces a number of files.

From the entire corpus:

full-bag-of-words.pickle
- Each row represents one song.
- The "index" column maps the song back to its labels.
- Every other column holds the number of occurrences of the word named in its header.
full-labels.pickle
- This associates each song's unique "index" value with its labels.
- The header is: ["index", "genre"]
genre-number-mapping.pickle
- This maps each numeric genre to its original genre string.
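
To make the layout of these three files concrete, here is a small sketch that builds equivalent structures from tokenized songs and round-trips one of them through `pickle`. It rests on assumptions: the real files may well hold a different object type (for example a pandas DataFrame), the numeric genre assignment here is simply alphabetical, and `build_outputs` is a hypothetical name.

```python
import pickle
from collections import Counter


def build_outputs(songs):
    """Build bag-of-words rows, label rows, and a genre mapping.

    `songs` is a list of (index, genre, tokens) tuples. Plain lists and
    dicts are used here; the real pickles may store another type.
    """
    vocab = sorted({tok for _, _, tokens in songs for tok in tokens})
    genre_names = sorted({genre for _, genre, _ in songs})
    genre_to_number = {name: i for i, name in enumerate(genre_names)}

    bag_of_words, labels = [], []
    for index, genre, tokens in songs:
        counts = Counter(tokens)
        # One row per song: the "index" column plus one count per word.
        # (A word literally spelled "index" would clash with that column;
        # this sketch does not handle that case.)
        row = {"index": index}
        row.update({word: counts[word] for word in vocab})
        bag_of_words.append(row)
        labels.append({"index": index, "genre": genre_to_number[genre]})

    number_to_genre = {i: name for name, i in genre_to_number.items()}
    return bag_of_words, labels, number_to_genre


bow, labels, mapping = build_outputs([
    (7, "rock", ["love", "love", "road"]),
    (9, "pop", ["love"]),
])
# Round trip through pickle, as the preprocessing output would be read back.
restored = pickle.loads(pickle.dumps(mapping))
```

Joining a bag-of-words row to its genre string then takes two lookups: "index" into the labels, and the numeric genre into the mapping.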

From a train/test split of the data:

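To build the Docker image and run the pipeline, here with SONGS_PER_GENRE=30 and results written to the mounted output/ directory:

cd musicProject/
docker build -t music-pipeline .

time docker run -e SONGS_PER_GENRE=30 -v "$(pwd)/output/":/music/output/ music-pipeline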
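
How the pipeline reads SONGS_PER_GENRE inside the container is not shown here. One plausible way, sketched under that assumption, is to parse it from the environment and reject missing or non-positive values. The function name `songs_per_genre` is hypothetical.

```python
import os


def songs_per_genre(env=None):
    """Read SONGS_PER_GENRE from the environment as a positive integer."""
    env = os.environ if env is None else env
    raw = env.get("SONGS_PER_GENRE")
    if raw is None:
        raise KeyError("SONGS_PER_GENRE is not set")
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("SONGS_PER_GENRE must be positive")
    return value
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.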