rcortx/kaggle-fake-news

Baseline TFIDF solution to https://www.kaggle.com/c/fake-news/data

Jupyter Notebook

The classifier uses a basic text-processing pipeline over only the text column to predict fake news:

Text cleaning: accent removal, lowercasing
Tokenization
Stopword removal
Lemmatization/Stemming
TFIDF vectorization
Experiments with tree-based classifiers such as Decision Trees and Gradient Boosted Trees
Achieved F1 of 87-91.5% on a 20% validation set (best F1 with Gradient Boosted Trees). NOTE: these scores are not k-fold cross-validated.
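
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn, uses a hypothetical crude_stem helper in place of a real stemmer or lemmatizer, and runs on made-up toy headlines rather than the Kaggle data, so the score it returns says nothing about the numbers reported above.

```python
import re
import unicodedata

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def strip_accents(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def crude_stem(token):
    # Hypothetical stand-in for a real stemmer/lemmatizer:
    # strips one common suffix if at least 3 characters remain.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Clean -> tokenize -> remove stopwords -> stem.
    cleaned = strip_accents(text).lower()
    tokens = re.findall(r"[a-z]+", cleaned)
    return [crude_stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


def build_model():
    # Passing a callable analyzer makes TF-IDF use our token lists directly.
    vectorizer = TfidfVectorizer(analyzer=preprocess)
    classifier = GradientBoostingClassifier(random_state=0)
    return make_pipeline(vectorizer, classifier)


def evaluate_on_toy_data():
    # Synthetic examples (assumption): label 1 = fake, label 0 = real.
    fake = [f"shocking secret cure doctors hate number {i}" for i in range(10)]
    real = [f"city council approves annual budget report {i}" for i in range(10)]
    texts = fake + real
    labels = [1] * len(fake) + [0] * len(real)

    # Hold out 20% of the rows for validation, keeping the class balance.
    x_train, x_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )
    model = build_model()
    model.fit(x_train, y_train)
    return f1_score(y_val, model.predict(x_val))
```

Swapping GradientBoostingClassifier for sklearn.tree.DecisionTreeClassifier in build_model gives the decision-tree variant of the same experiment.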

Next Steps:

Use other columns, such as author and title, to improve classifier performance
Use BERT-based vectorization instead of TFIDF
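
One possible way to approach the first item is to fit a separate TFIDF vectorizer per column and stack the resulting feature matrices side by side. The sketch below assumes scikit-learn and SciPy, assumes the columns are named title, author and text, and uses invented rows; the helper names are hypothetical and none of this is taken from the notebook.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed column names; adjust to match the actual CSV header.
COLUMNS = ("title", "author", "text")


def column_values(rows, column):
    # Treat missing or None values as empty strings.
    return [row.get(column) or "" for row in rows]


def fit_column_vectorizers(rows, columns=COLUMNS):
    # One independent vocabulary per column, so a word in the title
    # and the same word in the body become different features.
    vectorizers = {}
    for column in columns:
        vectorizer = TfidfVectorizer(lowercase=True, strip_accents="unicode")
        vectorizer.fit(column_values(rows, column))
        vectorizers[column] = vectorizer
    return vectorizers


def transform_rows(rows, vectorizers, columns=COLUMNS):
    # Concatenate the per-column TF-IDF blocks into one sparse matrix.
    blocks = [vectorizers[c].transform(column_values(rows, c)) for c in columns]
    return hstack(blocks).tocsr()


# Invented example rows (not from the dataset).
toy_rows = [
    {"title": "Miracle cure found", "author": "anon blogger",
     "text": "doctors hate this secret trick"},
    {"title": "Council passes budget", "author": "jane doe",
     "text": "the city council approved the budget"},
    {"title": "Aliens endorse candidate", "author": None,
     "text": "sources say aliens landed"},
]

toy_vectorizers = fit_column_vectorizers(toy_rows)
toy_features = transform_rows(toy_rows, toy_vectorizers)
```

The combined matrix can be passed to the same tree classifiers as the text-only features. Whether the extra columns actually help would need to be checked on the validation set.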