gdomnijl/Neologisms

INFO 256 Applied NLP class final project

Jupyter Notebook

Neologisms


TODO:

Cleaning

Filter only-in-text neologisms
spaCy: punctuation removal, lemmatization (spaCy provides no stemmer), tokenization
Masking
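
A minimal sketch of the tokenize-and-mask part of the cleaning steps, using only the standard library. It is a stand-in and not the project's code: a regular expression replaces spaCy's tokenizer, the stemming/lemmatization step is left out, and the `<NEO>` mask token and both function names are assumptions made for illustration.

```python
import re

MASK = "<NEO>"  # hypothetical mask token, not taken from the project
TOKEN_RE = re.compile(r"[a-z0-9']+")  # runs of letters, digits, apostrophes


def tokenize(tweet):
    """Lowercase the tweet and split it into tokens, dropping punctuation."""
    return TOKEN_RE.findall(tweet.lower())


def clean_and_mask(tweet, neologism):
    """Tokenize the tweet and replace every occurrence of the neologism."""
    target = neologism.lower()
    return [MASK if tok == target else tok for tok in tokenize(tweet)]


def contains_neologism(tweet, neologism):
    """True if the neologism occurs as a token of the tweet."""
    return neologism.lower() in tokenize(tweet)
```

Masking the neologism presumably keeps the model from keying on the word itself, so that it has to rely on the surrounding context.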

Calculate entropy for each neologism

Take the 500 most frequent tokens in the whole corpus
p = frequency of token X in tweets of neologism A / total number of tokens in tweets of neologism A
entropy = -sum(p * log(p)) over all 500 tokens
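
The entropy steps above can be sketched as follows. This is an illustration under stated assumptions: tweets are already tokenized into lists of strings, the logarithm is natural, and tokens with p = 0 contribute nothing (the usual 0 * log 0 = 0 convention). The function names are hypothetical.

```python
import math
from collections import Counter


def top_vocab(all_tweets, k=500):
    """Return the k most frequent tokens across the whole corpus."""
    counts = Counter(tok for tweet in all_tweets for tok in tweet)
    return [tok for tok, _ in counts.most_common(k)]


def neologism_entropy(tweets, vocab):
    """Entropy of the vocab tokens within the tweets of one neologism."""
    counts = Counter(tok for tweet in tweets for tok in tweet)
    total = sum(counts.values())  # all tokens, not only those in vocab
    if total == 0:
        return 0.0
    entropy = 0.0
    for tok in vocab:
        p = counts[tok] / total
        if p > 0:  # skip absent tokens: 0 * log(0) is treated as 0
            entropy -= p * math.log(p)
    return entropy
```

Because the denominator counts every token in the neologism's tweets, the p values over the 500 tokens generally sum to less than 1. Dividing by the count of in-vocabulary tokens instead would give a proper distribution, but that is a different definition from the one written above.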

Labels

Cluster neologisms by the time pattern of their tweets
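
One possible way to turn "time pattern" into labels, sketched with the standard library only. Everything here is an assumption, not the project's method: each neologism's tweet times (as epoch seconds) are binned over its own lifespan into a normalized histogram, and the histograms are grouped with a small k-means that uses the first `k` vectors as initial centroids.

```python
def time_profile(times, n_bins=10):
    """Normalized histogram of tweet times over the neologism's lifespan."""
    if not times:
        return [0.0] * n_bins
    start, end = min(times), max(times)
    span = (end - start) or 1.0  # avoid division by zero for a single instant
    counts = [0] * n_bins
    for t in times:
        idx = min(int((t - start) / span * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(times) for c in counts]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Tiny k-means; assumes len(vectors) >= k. Returns one label per vector."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic initialization
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Normalizing each histogram makes the clustering compare the shape of usage over time (early burst versus late growth, for example) and not the raw tweet volume.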

Model

BiLSTM + attention
Concatenate the entropy with the attention scalar for the final prediction
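
A framework-free sketch of one reading of the final step. It assumes the BiLSTM has already produced one hidden-state vector per token, and that "attention scalar" means the attention-pooled context projected down to a single number. The parameters `query`, `w_att`, `w_out`, and `b_out` are hypothetical learned weights; in practice this would live inside a deep learning framework and be trained end to end.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(hidden_states, query):
    """Weight each hidden state by softmax(h . query) and sum them."""
    weights = softmax([dot(h, query) for h in hidden_states])
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return context, weights


def predict(hidden_states, entropy, query, w_att, w_out, b_out):
    """Class probabilities from [attention scalar, entropy]."""
    context, _ = attention_pool(hidden_states, query)
    att_scalar = dot(context, w_att)   # project the context to one number
    features = [att_scalar, entropy]   # concatenate with the entropy feature
    logits = [dot(features, row) + b for row, b in zip(w_out, b_out)]
    return softmax(logits)             # one probability per cluster label
```

With an all-zero `query`, every token receives equal weight and the context reduces to the mean hidden state, which is a quick sanity check for the pooling.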