RapLyrics-Scraper

Data sourcing and pre-processing for raplyrics.eu - A rap music lyrics generation project

Primary language: Python. License: MIT.

PRs are welcome if you want to contribute to this project.

Context

This project aims to provide high-quality text datasets of rap music lyrics. These datasets are then fed to a neural network to build a lyrics-generation model. The resulting word-to-word lyrics-generation model is served on raplyrics.eu.

Feel free to tweak this scraper to fit your needs. Kudos to open source.

Setup

This project is built on Python 3. I recommend using a virtual environment.

`which python3` -m venv RapLyrics-Scraper
source RapLyrics-Scraper/bin/activate
pip install -r requirements.txt

Run the lyrics scraper

  • Update the list of artists whose lyrics you want to scrape. To do so, directly edit the artists list defined at lyrics_scraper.py:39. The number of songs to get per artist is set on the command line (see below).

  • To run the script, be sure to set the lyrics_dir and songs_per_artists arguments.

    • Specify the directory in which the scraped lyrics should be saved with lyrics_dir.
    • Specify the number of songs to scrape per artist with songs_per_artists. Run python lyrics_scraper.py --help for more information on the available arguments.
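For readers adapting the scraper, here is a minimal sketch of how a command line with these arguments could be declared using argparse. It is not taken from lyrics_scraper.py: the function name build_parser, the help strings, and the choice to mark both arguments as required are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in this README."""
    parser = argparse.ArgumentParser(description="Scrape rap lyrics per artist.")
    # Assumption: both arguments are required, since the README asks you to set them.
    parser.add_argument("--lyrics_dir", required=True,
                        help="directory in which the scraped lyrics are saved")
    parser.add_argument("--songs_per_artists", type=int, required=True,
                        help="number of songs to scrape per artist")
    parser.add_argument("--verbose", action="store_true",
                        help="print progress information while scraping")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--verbose", "--lyrics_dir=my_lyrics_folder", "--songs_per_artists=2"]
)
print(args.lyrics_dir, args.songs_per_artists, args.verbose)
```

Declaring songs_per_artists with type=int lets argparse reject non-numeric values before any scraping starts.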

Let's say you want to scrape 2 songs per artist and save them in the folder my_lyrics_folder, with verbose output. Run:

python lyrics_scraper.py --verbose --lyrics_dir='my_lyrics_folder' --songs_per_artists=2

  • Once the scraping is done, one lyrics file is generated per scraped artist. Merge the files with:

cat *_lyrics.txt > merged_lyrics.txt
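
If you prefer to merge from Python, for instance on a system without cat, the following is a rough equivalent. It assumes the per-artist files sit directly in one directory and end in _lyrics.txt, as the command above implies. The helper name merge_lyrics is hypothetical. Note that merged_lyrics.txt itself matches the *_lyrics.txt pattern, so the sketch skips it to stay safe when run more than once.

```python
from pathlib import Path


def merge_lyrics(lyrics_dir, output_name="merged_lyrics.txt"):
    """Concatenate every *_lyrics.txt file in lyrics_dir into a single file."""
    directory = Path(lyrics_dir)
    output_path = directory / output_name
    # List the sources before opening the output, and skip the output file
    # itself because its name also matches the *_lyrics.txt pattern.
    sources = sorted(
        path for path in directory.glob("*_lyrics.txt") if path.name != output_name
    )
    with output_path.open("w", encoding="utf-8") as merged:
        for source in sources:
            merged.write(source.read_text(encoding="utf-8"))
    return output_path
```

Calling merge_lyrics("my_lyrics_folder") would write my_lyrics_folder/merged_lyrics.txt and return its path.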

Utils

A toolbox is also provided to analyze some of the dataset's properties. To run a quick analysis of any .txt file, set the file to analyze in pre_processing/analysis.py, then run:

python pre_processing/analysis.py
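
To give an idea of the kind of properties such an analysis might report, here is a small standalone sketch. It does not reproduce pre_processing/analysis.py. The function quick_stats and the statistics it returns (non-empty lines, words, vocabulary size, most frequent words) are assumptions chosen for illustration.

```python
from collections import Counter


def quick_stats(text, top_n=3):
    """Return a few basic corpus statistics for a block of lyrics."""
    # Count only non-empty lines; blank lines usually separate verses.
    lines = [line for line in text.splitlines() if line.strip()]
    # Naive whitespace tokenization, lowercased; punctuation is kept as-is.
    words = text.lower().split()
    counts = Counter(words)
    return {
        "lines": len(lines),
        "words": len(words),
        "unique_words": len(counts),
        "most_common": counts.most_common(top_n),
    }


# Placeholder text, not real lyrics.
sample = "la la land\n\nla di da\n"
print(quick_stats(sample, top_n=1))
```

To run it on a real corpus, read the file first, for example with open("merged_lyrics.txt", encoding="utf-8").read(), and pass the result to quick_stats.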

Notes

Currently, songs are fetched in decreasing order of popularity.
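
In practice, this means that asking for N songs per artist yields that artist's N most popular songs. The selection can be pictured as below. The song records, the popularity field, and the top_songs helper are all hypothetical, since this README does not describe how popularity is measured or stored.

```python
def top_songs(songs, songs_per_artist):
    """Return the songs_per_artist most popular songs, most popular first."""
    ranked = sorted(songs, key=lambda song: song["popularity"], reverse=True)
    return ranked[:songs_per_artist]


# Placeholder records with a made-up popularity score.
songs = [
    {"title": "song_a", "popularity": 10},
    {"title": "song_b", "popularity": 42},
    {"title": "song_c", "popularity": 7},
]
print([song["title"] for song in top_songs(songs, 2)])
```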

Related work

This project was used extensively to generate high-quality text datasets that were consumed by: