rayraycano/WhatsTheWord

Given a dataset, replace a given word with its closest neighbor w.r.t. the dataset using a w2v approach. Lyrics pulled from genius.com

Python

What's The Word

What's The Word is a project that attempts to:

Build a language model
Leverage word prediction on the underlying language model.

Dataset

What's The Word is trained on a lyrics dataset generated and scraped from genius.com (scrape package).

Algorithms

After initial data preprocessing, we aim to build:

A language model using word embeddings
A predictive network that can take in the context of a missing word and predict the missing word. (At the moment a CNN is used)
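
As a rough illustration of how word embeddings support the "closest neighbor" replacement, here is a minimal sketch that picks the nearest word by cosine similarity. The function names and the toy vectors are made up for this example and are not taken from the project; real embeddings would come from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def closest_neighbor(word, embeddings):
    """Return the vocabulary word (other than `word`) with the highest cosine similarity."""
    target = embeddings[word]
    best_word, best_score = None, -2.0  # cosine is always >= -1
    for other, vector in embeddings.items():
        if other == word:
            continue
        score = cosine_similarity(target, vector)
        if score > best_score:
            best_word, best_score = other, score
    return best_word


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "love": [0.9, 0.1, 0.0],
    "heart": [0.8, 0.2, 0.1],
    "money": [0.0, 0.9, 0.3],
    "cash": [0.1, 0.8, 0.4],
}

print(closest_neighbor("love", toy_embeddings))
```

A linear scan like this is fine for a small vocabulary; a larger one would likely want a vectorized or approximate nearest-neighbor search.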

TODO

Methods

Learn the Embedding first
Learn the embedding with the prediction loss, but penalize the prediction loss less at the beginning
Use embeddings for labels rather than one hot (try both, weighting one hot more towards the beginning)
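
One possible way to "penalize the prediction loss less at the beginning" is a linear warm-up on the prediction loss weight. The sketch below is only an assumption about how this could be done: the function names, the 0.1 starting weight, and the linear shape are all hypothetical choices, not something the project has settled on.

```python
def prediction_weight(step, warmup_steps, start=0.1, end=1.0):
    """Weight for the prediction loss, ramping linearly from `start` to `end`."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return end
    fraction = max(step, 0) / warmup_steps
    return start + (end - start) * fraction


def combined_loss(embedding_loss, prediction_loss, step, warmup_steps):
    """Embedding loss plus the prediction loss, down-weighted early in training."""
    weight = prediction_weight(step, warmup_steps)
    return embedding_loss + weight * prediction_loss


# Early steps are dominated by the embedding loss; later steps use both fully.
for step in (0, 50, 100):
    print(step, prediction_weight(step, warmup_steps=100))
```

The same kind of schedule could be reused for the third idea, blending one-hot and embedding targets with a weight that shifts over training.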

Parameters to tweak

Embedding Size (try smaller, then bigger)
Model Context (8 to each side?)
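
To make the "Model Context" parameter concrete, here is a small sketch of how a fixed-size context could be cut out around a missing word, padding at the edges of a line. The `context_window` helper and the `<pad>` token are hypothetical names for this example; the project's actual preprocessing may differ.

```python
PAD = "<pad>"  # assumed padding token, not taken from the project


def context_window(tokens, index, size=8):
    """Return (context, target): `size` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    # Pad so the context always has exactly 2 * size tokens.
    left = [PAD] * (size - len(left)) + left
    right = right + [PAD] * (size - len(right))
    return left + right, tokens[index]


tokens = "the quick brown fox jumps over the lazy dog".split()
context, target = context_window(tokens, index=4, size=2)
print(context, target)
```

With `size=8`, each training example would carry 16 context tokens, so a larger context increases the input width of the predictive network accordingly.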

Debugging

If we zoom towards sparsity on the larger dataset, try again with a dataset exclusively from one artist