Contributors:
- Adam Reevesman
- Gokul Krishna Guruswamy
- Hai Le
- Maximillian Alfaro
- Prakhar Agrawal
The code for this project is divided into several notebooks, each covering one part of the workflow.
The finalized notebooks that make up the complete pipeline include:
- Step 1: Extracting the data and scraping additional data (Reddit API keys needed)
- Step 2: Data processing and feature engineering. At the end of this step, we arrive at the list of final features used to fit our models (a hypothetical sketch of this kind of step follows this list).
- Step 3: Model fitting and comparison. In this notebook, we fit different linear and non-linear models to the dataset and evaluate and compare their performance. The models were originally fit on data that differed slightly from the output of the included data processing notebook, so an updated notebook is included to show new results for some of the models on the updated dataset.
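To make Step 2 concrete, here is a minimal pandas sketch of the kind of feature engineering involved. The column names (`body`, `created_utc`, `subreddit`) and the derived features are illustrative assumptions, not the project's actual final feature list, which lives in the Step 2 notebook.

```python
# Hypothetical feature-engineering step in pandas. Input columns and
# derived features are examples only; see the Step 2 notebook for the
# real final feature list.
import pandas as pd

def engineer_features(comments: pd.DataFrame) -> pd.DataFrame:
    df = comments.copy()
    # Simple text features derived from the comment body.
    df["comment_length"] = df["body"].str.len()
    df["word_count"] = df["body"].str.split().str.len()
    # Posting time matters because comment visibility varies over the day.
    ts = pd.to_datetime(df["created_utc"], unit="s")
    df["hour_of_day"] = ts.dt.hour
    # One-hot encode the subreddit so linear models can use it.
    return pd.get_dummies(df, columns=["subreddit"])
```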
Original: Predict how many upvotes a comment will receive, given the comment text, user history, subreddit, and thread details.
Next Step: Improve the current model's performance.
We use two sources of data:
- Comments Dataset, available here
- Threads Dataset, scraped using the Reddit API with this code (sketched below)
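As a rough illustration of the thread-scraping step, here is a minimal sketch using PRAW (the Python Reddit API Wrapper). The credentials, submission ids, and field choices are placeholders, not the project's actual scraping code, which is linked above.

```python
# Minimal PRAW sketch: pull thread-level details for the submissions the
# comments dataset references. Credentials and fields are illustrative.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",        # obtained from reddit.com/prefs/apps
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="thread-scraper/0.1",
)

def fetch_thread_details(submission_id):
    """Return a dict of thread features for one submission."""
    s = reddit.submission(id=submission_id)
    return {
        "id": s.id,
        "subreddit": s.subreddit.display_name,
        "title": s.title,
        "score": s.score,
        "num_comments": s.num_comments,
        "created_utc": s.created_utc,
    }

# Example: fetch details for a hypothetical list of submission ids.
threads = [fetch_thread_details(sid) for sid in ["abc123", "def456"]]
```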
We fit linear and nonlinear regression models and compared their Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² values. The Random Forest had the lowest MAE, around 8, which suggests that on average the model is off by about 8 upvotes.
RMSE penalizes large errors more heavily. The magnitude of the RMSE across our models suggests that they make many large errors.
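For reference, here is a minimal sketch of this kind of comparison loop in scikit-learn. The specific models, split, and hyperparameters are assumptions for illustration, not the exact setup in the Step 3 notebook.

```python
# Sketch of a model-comparison loop: fit a linear and a nonlinear
# regressor on the same split and report MAE, RMSE, and R^2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def compare_models(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, pred)           # average absolute miss
        rmse = np.sqrt(mean_squared_error(y_test, pred))  # penalizes large misses
        r2 = r2_score(y_test, pred)
        print(f"{name}: MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
```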
- Predicting Comment Karma on Internet Forums
- Predicting Comment Karma by Subreddit
- GitHub repository accompanying this paper