Reddit Flair Predictor
Flask app that predicts the flair of posts in the r/India subreddit. Click here to launch the app.
Codebase
- data: contains the Python file for collecting data and the data.json file
- model: contains model notebook for training models and saving best model
- static: contains css stylesheets
- templates: contains html files for webapp
- Procfile: for Heroku
- analysis.py: data analysis of collected data
- app.py: main file for webapp
- final_model1.sav: saved model
- prediction.py: file for predicting flair on given post url
- requirements.txt: contains required dependencies
Installation
You can either run the app online here
OR
Install it on your machine:
- Clone the repository:
git clone https://github.com/nitishp25/reddit-flair-prediction.git
- Create virtual env:
python3 -m venv new-env
- Activate this env:
source new-env/bin/activate
- Move into the project directory:
cd reddit-flair-prediction
- Install dependencies:
pip install -r requirements.txt
- Run the app:
python3 app.py
- Open http://0.0.0.0:5000/ in a browser (or http://localhost:5000/ if that address does not load).
Dependencies
- scikit-learn (sklearn)
- nltk
- PyMongo
- beautifulsoup4 (bs4)
- flask
- pandas
- numpy
- praw
- lxml
- scipy
- gunicorn
Approach
First, all of the data was collected using PRAW, a Python wrapper for the Reddit API. The goal was to collect 200 posts for each flair; however, the Reddiquette flair had only around 150 posts.
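The collection step could look roughly like the sketch below. It is not the code from the data folder: the function names, the `flair:"..."` search query, the credential placeholders and the stored field names are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def make_client(client_id: str, client_secret: str) -> Any:
    """Build a PRAW client (credentials come from your own Reddit app)."""
    import praw  # third-party: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent="flair-predictor-sketch",  # hypothetical user agent
    )


def collect_posts(reddit: Any, flair: str, limit: int = 200) -> List[Dict[str, Any]]:
    """Fetch up to `limit` r/India posts that carry the given flair."""
    subreddit = reddit.subreddit("india")
    posts = []
    # Assumption: posts are found through Reddit search with a flair filter.
    for submission in subreddit.search(f'flair:"{flair}"', limit=limit):
        # Drop "load more comments" placeholders so only real comments remain.
        submission.comments.replace_more(limit=0)
        posts.append({
            "flair": flair,
            "title": submission.title,
            "body": submission.selftext,
            "url": submission.url,
            "created_utc": submission.created_utc,
            "comments": [c.body for c in submission.comments.list()],
        })
    return posts
```

Calling `collect_posts(make_client(...), "Politics")` would return a list of plain dictionaries. A flair with fewer matching posts than `limit` simply yields a shorter list.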
Second, the data was stored in MongoDB using PyMongo.
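A minimal sketch of the storage step, assuming a local MongoDB instance and a hypothetical `reddit.posts` collection (the real database and collection names are not stated here):

```python
from typing import Any, Dict, List


def get_collection(uri: str = "mongodb://localhost:27017/") -> Any:
    """Return a handle to the (hypothetical) `reddit.posts` collection."""
    from pymongo import MongoClient  # third-party: pip install pymongo

    client = MongoClient(uri)
    return client["reddit"]["posts"]


def save_posts(collection: Any, posts: List[Dict[str, Any]]) -> int:
    """Insert the scraped posts and return how many were written."""
    if not posts:
        # insert_many rejects an empty list, so skip the call entirely.
        return 0
    result = collection.insert_many(posts)
    return len(result.inserted_ids)


def load_posts(collection: Any, flair: str) -> List[Dict[str, Any]]:
    """Read back every stored post that has the given flair."""
    return list(collection.find({"flair": flair}))
```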
Third, the data was read back from MongoDB and cleaned using nltk and bs4 to remove symbols and bad words. A timestamp was created, and the body, title and comments were cleaned. The comments were not in order, so the top comments were taken and combined.
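The cleaning step can be illustrated as follows. To keep the sketch dependency-free it swaps in the standard library (`html`, `re`) and a tiny hard-coded word list in place of bs4 and nltk. It also assumes that "bad words" means stop words and that "top comments" means the first `k` comments in the list. Neither assumption is confirmed by the project code.

```python
import html
import re
from typing import List

# Stand-in for nltk's stop-word list (assumption: "bad words" = stop words).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}

_TAG_RE = re.compile(r"<[^>]+>")           # stand-in for bs4 tag stripping
_SYMBOL_RE = re.compile(r"[^a-z0-9\s]")    # anything that is not a letter, digit or space


def clean_text(text: str) -> str:
    """Lower-case the text and strip markup, symbols and stop words."""
    text = html.unescape(text)       # "&amp;" -> "&"
    text = _TAG_RE.sub(" ", text)    # drop HTML tags
    text = text.lower()
    text = _SYMBOL_RE.sub(" ", text) # drop punctuation and other symbols
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)


def combine_top_comments(comments: List[str], k: int = 10) -> str:
    """Clean the first `k` comments and join them into one string."""
    return " ".join(clean_text(c) for c in comments[:k])


print(clean_text("<p>The Budget &amp; Taxes!</p>"))  # budget taxes
```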
The cleaned data was converted to a DataFrame using pandas, and machine learning models from sklearn were trained on features such as title, body, comments and URL to predict the flair of a post.
The data was split into training (80%) and testing (20%) sets.
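An 80/20 split like this is commonly done with sklearn's `train_test_split`. The toy data and the `random_state` value below are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned post texts and their flairs.
texts = [f"post {i}" for i in range(10)]
flairs = ["Politics", "Sports"] * 5

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, flairs, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 8 2
```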
The following models were considered for classification:
- Naive Bayes Classifier
- Stochastic Gradient Descent/LinearSVM
- Logistic Regression
- MLP Classifier
The following ensembles were also considered:
- AdaBoost
- Random Forest
These models and ensembles were considered for their robustness, flexibility and typically high accuracy.
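A comparison of these classifiers could be organised as in the sketch below. The TF-IDF vectoriser, the hyperparameters and the toy data are assumptions. The model notebook may vectorise the text and tune the models differently.

```python
from sklearn.base import clone
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

CANDIDATES = {
    "Naive Bayes": MultinomialNB(),
    # SGD with hinge loss trains a linear SVM.
    "SGD/LinearSVM": SGDClassifier(loss="hinge", random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Fit every candidate on TF-IDF features and return its test accuracy."""
    scores = {}
    for name, clf in CANDIDATES.items():
        # clone() gives each run a fresh, unfitted copy of the estimator.
        model = make_pipeline(TfidfVectorizer(), clone(clf))
        model.fit(train_texts, train_labels)
        scores[name] = model.score(test_texts, test_labels)
    return scores


# Toy data, only to show the call; real inputs are the cleaned posts.
train_texts = [
    "election results announced by the government",
    "parliament passes new bill",
    "minister gives speech on policy",
    "cricket team wins the final match",
    "football league starts this weekend",
    "player scores a century in test match",
]
train_labels = ["Politics", "Politics", "Politics", "Sports", "Sports", "Sports"]
test_texts = ["government announces policy", "team wins match"]
test_labels = ["Politics", "Sports"]

scores = compare_models(train_texts, train_labels, test_texts, test_labels)
for name, accuracy in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {accuracy:.4f}")
```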
These models were trained on the following features:
- Title
- Body
- Comments
- URL
These features were used because they contain a significant amount of natural language content. Title, Comments and Body were also combined into a single input so that the classifier could use all three at once, which increased accuracy.
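One simple way to combine the three text features is to concatenate them into a single column before vectorising. The column names and the `add_combined_feature` helper below are hypothetical, and the notebook may combine the features differently.

```python
import pandas as pd


def add_combined_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with title, comments and body joined in one column."""
    # Missing values (for example link posts with no body) become empty strings.
    parts = [df[col].fillna("") for col in ("title", "comments", "body")]
    out = df.copy()
    out["combined"] = (parts[0] + " " + parts[1] + " " + parts[2]).str.strip()
    return out


posts = pd.DataFrame({
    "title": ["budget announced", "match tonight"],
    "comments": ["good move", "who is playing"],
    "body": [None, "starts at eight"],
})
print(add_combined_feature(posts)["combined"].tolist())
```

The `combined` column can then be fed to a text vectoriser and classifier in the same way as any single feature.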
The highest accuracy obtained for each feature was:
Feature | Model | Accuracy |
---|---|---|
Title | Logistic Regression | 0.6963 |
Body | SGD/LinearSVM | 0.3666 |
URL | SGD/LinearSVM | 0.2972 |
Comments | SGD/LinearSVM | 0.5879 |
Title, Comments & Body | Random Forest | 0.8113 |
Resources
- https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568
- http://www.storybench.org/how-to-scrape-reddit-with-python/
- https://scikit-learn.org/
- https://api.mongodb.com/python/current/tutorial.html
- https://www.tutorialspoint.com/flask/index.htm
- https://stackoverflow.com/