Contributors:
- Adam Reevesman
- Gokul Krishna Guruswamy
- Hai Le
- Maximillian Alfaro
- Prakhar Agrawal
The code for this project is divided into several notebooks, each covering one part of the workflow.
The finalized notebooks that make up the complete pipeline include:
- Step 1: Extracting the data and scraping additional data (Reddit API keys needed)
- Step 2: Data processing and feature engineering. At the end of this step, we arrive at the list of final features used to fit our models (a hypothetical sketch of this kind of step follows this list).
- Step 3: Model fitting and comparison. In this notebook, we fit different linear and non-linear models to the dataset and evaluate and compare their performance. The models were originally fit on data that differed slightly from the output of the included data processing notebook, so an updated notebook is included to show new results for some of the models on the updated dataset.
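To make Step 2 concrete, here is a minimal pandas sketch of the kind of feature engineering involved. The column names (`body`, `created_utc`, `subreddit`) and the derived features are illustrative assumptions, not the project's actual final feature list, which lives in the Step 2 notebook.

```python
# Hypothetical feature-engineering step in pandas. Input columns and
# derived features are examples only; see the Step 2 notebook for the
# real final feature list.
import pandas as pd

def engineer_features(comments: pd.DataFrame) -> pd.DataFrame:
    df = comments.copy()
    # Simple text features derived from the comment body.
    df["comment_length"] = df["body"].str.len()
    df["word_count"] = df["body"].str.split().str.len()
    # Posting time matters because comment visibility varies over the day.
    ts = pd.to_datetime(df["created_utc"], unit="s")
    df["hour_of_day"] = ts.dt.hour
    # One-hot encode the subreddit so linear models can use it.
    return pd.get_dummies(df, columns=["subreddit"])
```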
Original: Predict how many upvotes a comment will receive, given the comment text, user history, subreddit, and thread details.
Next Step: Improve the current model's performance.
We use two sources of data:
- Comments Dataset, available here
- Threads Dataset, scraped using the Reddit API with this code (sketched below)
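As a rough illustration of the thread-scraping step, here is a minimal sketch using PRAW (the Python Reddit API Wrapper). The credentials, submission ids, and field choices are placeholders, not the project's actual scraping code, which is linked above.

```python
# Minimal PRAW sketch: pull thread-level details for the submissions the
# comments dataset references. Credentials and fields are illustrative.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",        # obtained from reddit.com/prefs/apps
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="thread-scraper/0.1",
)

def fetch_thread_details(submission_id):
    """Return a dict of thread features for one submission."""
    s = reddit.submission(id=submission_id)
    return {
        "id": s.id,
        "subreddit": s.subreddit.display_name,
        "title": s.title,
        "score": s.score,
        "num_comments": s.num_comments,
        "created_utc": s.created_utc,
    }

# Example: fetch details for a hypothetical list of submission ids.
threads = [fetch_thread_details(sid) for sid in ["abc123", "def456"]]
```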
We fit linear and nonlinear regression models and compared their Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² values. The Random Forest had the lowest MAE, around 8, which suggests that on average the model is off by about 8 upvotes.
RMSE penalizes large errors more heavily. The magnitude of the RMSE across our models suggests that they make many large errors.
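For reference, here is a minimal sketch of this kind of comparison loop in scikit-learn. The specific models, split, and hyperparameters are assumptions for illustration, not the exact setup in the Step 3 notebook.

```python
# Sketch of a model-comparison loop: fit a linear and a nonlinear
# regressor on the same split and report MAE, RMSE, and R^2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def compare_models(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, pred)           # average absolute miss
        rmse = np.sqrt(mean_squared_error(y_test, pred))  # penalizes large misses
        r2 = r2_score(y_test, pred)
        print(f"{name}: MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
```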
- Predicting Comment Karma on Internet Forums
- Predicting Comment Karma by Subreddit
- GitHub repository accompanying this paper