One of the biggest criticisms of reggaeton is the amount of sexual content in its lyrics and videos. To explore this, I collected over 8,500 user-uploaded reggaeton lyrics from letras.com. These lyrics were then used to predict sexual content in the songs, using weak supervision to label the data. Finally, I evaluated the different NLP models on a hand-labeled dataset of 600 songs.
- Scraping the data (a minimal sketch follows this list)
  - requests
  - BeautifulSoup
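As a rough illustration, a fetch-and-parse step along these lines can pull one lyric page. The URL and the CSS class are hypothetical placeholders; the real letras.com markup should be inspected first.

```python
import requests
from bs4 import BeautifulSoup

def fetch_lyrics(url: str) -> str:
    """Download one lyrics page and return its plain text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # "cnt-letra" is a hypothetical class name; check the actual page source.
    block = soup.find("div", class_="cnt-letra")
    return block.get_text(separator="\n") if block else ""

# Example call (hypothetical URL):
# lyrics = fetch_lyrics("https://www.letras.com/daddy-yankee/gasolina/")
```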
- Preprocessing

Preprocessing was a big part of the project because many of the songs were uploaded in chat-speak, which means lots of grammar and spelling errors. A cleaning sketch follows the list below.

  - Text cleaning: removes numbers, special characters, and repeated paragraphs and lines
  - Spelling correction
  - Eliminate non-Spanish lyrics
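As a rough sketch, assuming each lyric is a plain string, the cleaning and language-filtering steps could look like this. The langdetect call is one common way to drop non-Spanish songs; any language-ID library would do.

```python
import re
from langdetect import detect, LangDetectException

def clean_lyrics(text: str) -> str:
    """Strip numbers and special characters, and drop repeated lines."""
    text = re.sub(r"\d+", " ", text)                       # remove numbers
    text = re.sub(r"[^a-záéíóúüñ\s]", " ", text.lower())   # keep Spanish letters only
    seen, lines = set(), []
    for line in text.splitlines():
        line = " ".join(line.split())                      # collapse whitespace
        if line and line not in seen:                      # drop repeated lines
            seen.add(line)
            lines.append(line)
    return "\n".join(lines)

def is_spanish(text: str) -> bool:
    """Keep only lyrics that langdetect identifies as Spanish."""
    try:
        return detect(text) == "es"
    except LangDetectException:
        return False
```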
- Labeling the songs
  - Hand-labeling
  - Weak supervision
I used Snorkel for the weak-supervision part; a minimal sketch follows.
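This sketch shows how Snorkel labeling functions can be combined into noisy training labels. The keyword sets and the `df_train` DataFrame (with a `lyrics` column) are hypothetical, not the exact functions used in the project.

```python
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SEXUAL, NOT_SEXUAL, ABSTAIN = 1, 0, -1
EXPLICIT_TERMS = {"sexo", "desnuda", "cama"}  # hypothetical keyword list

@labeling_function()
def lf_explicit_terms(row):
    # Vote SEXUAL if any flagged term appears; otherwise abstain.
    return SEXUAL if set(row.lyrics.split()) & EXPLICIT_TERMS else ABSTAIN

@labeling_function()
def lf_romantic_only(row):
    # Hypothetical heuristic: purely romantic vocabulary with no explicit
    # terms leans non-sexual; real labeling functions would be more careful.
    romantic = {"corazon", "amor", "besos"}
    words = set(row.lyrics.split())
    if words & romantic and not words & EXPLICIT_TERMS:
        return NOT_SEXUAL
    return ABSTAIN

# df_train is assumed to be a pandas DataFrame with a "lyrics" column.
applier = PandasLFApplier(lfs=[lf_explicit_terms, lf_romantic_only])
L_train = applier.apply(df=df_train)

# The label model denoises the labeling functions' overlapping votes.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=42)
df_train["weak_label"] = label_model.predict(L=L_train)
```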
- Training

A baseline pipeline sketch follows this list.

  - Bag of words: trained Naive Bayes and logistic regression classifiers
  - Bag of n-grams: trained Naive Bayes, logistic regression, SVM, and gradient boosting classifiers
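For illustration, a bag-of-n-grams baseline in scikit-learn might look like the following. The vectorizer settings and hyperparameters here are assumptions, not the project's tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Bag of n-grams over word unigrams and bigrams, weighted with TF-IDF.
bow_lr = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])

# Same features with a multinomial Naive Bayes classifier.
bow_nb = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("clf", MultinomialNB()),
])

# texts: list of cleaned lyrics; labels: weak or hand labels (0/1).
# bow_lr.fit(texts, labels)
# print(bow_lr.score(test_texts, test_labels))
```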
- Embeddings

I tried pre-trained embeddings; however, they were not meaningful for this task, likely because the lyrics' informal, slang-heavy Spanish is poorly covered by standard corpora. Sketches of the embedding training and the deep-learning models follow this list.

  - Embeddings on the lyrics: trained with gensim's word2vec model using both CBOW and skip-gram
  - ML models on the lyrics embeddings
  - DL models using the lyrics embeddings: trained CNN, LSTM, and GRU models
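A sketch of training the lyrics embeddings with gensim and averaging word vectors into a song vector for the downstream ML models. The vector size and window are assumed values, and `cleaned_lyrics` stands in for the preprocessed corpus.

```python
import numpy as np
from gensim.models import Word2Vec

# cleaned_lyrics: list of preprocessed lyric strings (assumed available).
tokenized_lyrics = [lyric.split() for lyric in cleaned_lyrics]

# sg=0 -> CBOW, sg=1 -> skip-gram (both techniques were tried).
cbow = Word2Vec(sentences=tokenized_lyrics, vector_size=100, window=5,
                min_count=5, sg=0, workers=4)

def song_vector(tokens, model):
    """Average the word vectors of a song's in-vocabulary tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# Feature matrix for the ML models (logistic regression, gradient boosting, ...).
X = np.vstack([song_vector(tokens, cbow) for tokens in tokenized_lyrics])
```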
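And a rough Keras sketch of the GRU variant, seeding the embedding layer with the word2vec weights trained above. The layer sizes mirror the results table below, but the exact wiring is an assumption.

```python
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, GRU, Dense, Dropout

# Vocabulary and vectors come from the word2vec model trained above.
embedding_matrix = cbow.wv.vectors
vocab_size, embed_dim = embedding_matrix.shape

model = Sequential([
    Embedding(vocab_size, embed_dim,
              embeddings_initializer=initializers.Constant(embedding_matrix),
              trainable=False),
    GRU(300, return_sequences=True),  # two GRU layers of 300 units,
    GRU(300),                         # matching the results table
    Dense(1024, activation="relu"),
    Dropout(0.8),
    Dense(1024, activation="relu"),
    Dropout(0.8),
    Dense(1, activation="sigmoid"),   # binary output: sexual vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# X_seq: songs as padded sequences of word2vec vocabulary indices; y: labels.
# model.fit(X_seq, y, batch_size=128, epochs=10, validation_split=0.1)
```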
| model | parameters | accuracy | recall |
| --- | --- | --- | --- |
| Naive Bayes - BoW | | 0.75 | 0.74 |
| Logistic regression - bag of n-grams - TF-IDF | {'C': 1.0} | 0.74 | 0.74 |
| Naive Bayes - bag of n-grams | | 0.77 | 0.74 |
| Naive Bayes - bag of n-grams - TF-IDF | | 0.77 | 0.74 |
| SVM | {'C': 1.0} | 0.73 | 0.75 |
| Gradient boosting on 120 scaled SVD components | {'n_estimators': 10, 'max_features': 'log2', 'max_depth': 15, 'learning_rate': 0.15, 'criterion': 'mae'} | 0.72 | 0.73 |
| Gradient boosting on TF-IDF | {'criterion': 'friedman_mse', 'learning_rate': 0.025, 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 10} | 0.76 | 0.76 |
| Gradient boosting on CBOW | {'criterion': 'friedman_mse', 'learning_rate': 0.025, 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 10} | 0.76 | 0.76 |
| Logistic regression on CBOW | {'C': 100.0, 'penalty': 'l2'} | 0.77 | 0.76 |
| CNN on lyrics embeddings (word2vec, CBOW) | {'optimizer': 'adam', 'Conv + pooling': 3, 'filters-size': (128, 5), 'batch_size': 256} | 0.74 | 0.74 |
| LSTM on lyrics embeddings (word2vec, CBOW) | {'optimizer': 'adam', 'LSTM layers - units': (1, 100), 'Dense layers - units': (2, 1024), 'Dropout rate after dense layers': 0.8, 'batch_size': 32} | 0.74 | 0.74 |
| GRU on lyrics embeddings (word2vec, CBOW) | {'optimizer': 'adam', 'GRU layers - units': (2, 300), 'Dense layers - units': (2, 1024), 'Dropout rate after dense layers': 0.8, 'batch_size': 128} | 0.74 | 0.74 |
As we can see, there is room for improvement. Some next steps worth trying:

- Improve the spelling correction
- Improve the language detection; English words can still be found in the corpus
- Add Snorkel labeling functions that identify non-sexual content
- Review misclassified songs and correct labels where necessary
- Use active learning