What's The Word
What's The Word is a project that attempts to:
- Build a language model
- Leverage the language model for missing-word prediction.
Dataset
What's The Word is trained on a lyrics dataset scraped from genius.com (see the scrape package).
Algorithms
After initial data preprocessing, we aim to build:
- A language model using word embeddings
- A predictive network that takes in the context of a missing word and predicts the missing word (at the moment a CNN is used).
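The predictive network above can be sketched as follows. This is a minimal NumPy forward pass, not the project's actual model: all sizes (`VOCAB_SIZE`, `EMBED_DIM`, `CONTEXT`, `KERNEL`, `FILTERS`) and the architecture details (one valid 1-D convolution, ReLU, global max pooling, a linear layer, softmax) are illustrative assumptions.

```python
import numpy as np

# Hypothetical sizes -- the real project's vocabulary and dimensions may differ.
VOCAB_SIZE = 1000      # distinct words in the lyrics corpus
EMBED_DIM = 32         # word-embedding dimensionality
CONTEXT = 16           # 8 words on each side of the blank
KERNEL = 3             # 1-D convolution width (over word positions)
FILTERS = 64           # number of convolution filters

rng = np.random.default_rng(0)
embeddings = rng.normal(0, 0.1, (VOCAB_SIZE, EMBED_DIM))   # learned embedding table
conv_w = rng.normal(0, 0.1, (FILTERS, KERNEL, EMBED_DIM))  # conv filters
out_w = rng.normal(0, 0.1, (FILTERS, VOCAB_SIZE))          # output projection

def predict_missing_word(context_ids):
    """Forward pass: embed the context, convolve, pool, and score the vocabulary."""
    x = embeddings[context_ids]                    # (CONTEXT, EMBED_DIM)
    n_pos = len(context_ids) - KERNEL + 1          # valid convolution positions
    feats = np.empty((n_pos, FILTERS))
    for i in range(n_pos):
        window = x[i:i + KERNEL]                   # (KERNEL, EMBED_DIM)
        feats[i] = np.tensordot(conv_w, window, axes=([1, 2], [0, 1]))
    pooled = np.maximum(feats, 0).max(axis=0)      # ReLU + global max pool
    logits = pooled @ out_w                        # (VOCAB_SIZE,)
    probs = np.exp(logits - logits.max())          # stable softmax
    return probs / probs.sum()

probs = predict_missing_word(rng.integers(0, VOCAB_SIZE, CONTEXT))
```

The predicted word would be `probs.argmax()`; a trained version would fit `embeddings`, `conv_w`, and `out_w` with cross-entropy over the true missing word.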
TODO
Methods
- Learn the Embedding first
- Learn the embedding jointly with the prediction loss, but down-weight the prediction loss at the beginning
- Use embeddings for labels rather than one-hot vectors (try both, weighting the one-hot loss more heavily at the beginning)
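The second method above (down-weighting the prediction loss early in training) could be implemented with a simple warmup schedule. This is a sketch under assumed names; `warmup_steps` and the linear ramp are illustrative choices, not the project's settings.

```python
def prediction_loss_weight(step, warmup_steps=1000):
    """Linearly ramp the prediction-loss weight from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def combined_loss(embed_loss, pred_loss, step, warmup_steps=1000):
    """Total loss: embedding loss plus a warmup-weighted prediction loss."""
    return embed_loss + prediction_loss_weight(step, warmup_steps) * pred_loss

# Early in training the prediction term barely contributes; later it is full strength.
early = combined_loss(1.0, 2.0, step=0)      # -> 1.0
late = combined_loss(1.0, 2.0, step=2000)    # -> 3.0
```

The same ramp could blend the two label schemes from the third bullet, mixing the one-hot cross-entropy and the embedding-target loss with complementary weights.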

Parameters to tweak
- Embedding Size (try smaller, then bigger)
- Model context window (8 words to each side?)
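A context window of 8 words to each side could be extracted like this. The `PAD` token and the padding behavior at song boundaries are assumptions for illustration.

```python
PAD = "<pad>"  # hypothetical padding token for positions past the song boundary

def context_window(tokens, i, k=8):
    """Return the k tokens on each side of position i, excluding tokens[i] itself.

    Pads with PAD so the result always has length 2 * k.
    """
    left = tokens[max(0, i - k):i]
    right = tokens[i + 1:i + 1 + k]
    left = [PAD] * (k - len(left)) + left
    right = right + [PAD] * (k - len(right))
    return left + right

ctx = context_window(["never", "gonna", "give", "you", "up"], 2, k=2)
# ctx == ["never", "gonna", "you", "up"]  -- "give" is the word to predict
```

Smaller or larger `k` changes the receptive field the CNN sees, which is exactly the parameter flagged above.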
Debugging
- If the model degenerates toward sparse predictions on the larger dataset, try again with a dataset drawn exclusively from one artist