Reddit: Analysis of Language Patterns through Regularization

Introduction

Welcome to my project! The primary goal of this project is to analyze Reddit posts and their corresponding upvotes using various regression models. My aim is to identify the relationship between the textual features of posts, represented by their term frequency-inverse document frequency (TF-IDF) vectors, and their popularity, measured by the number of upvotes.

I have structured the project to facilitate the entire process, from data acquisition to model evaluation. I provide a streamlined pipeline that enables users to easily download data from specified subreddits, preprocess the data to extract relevant features, train a variety of regression models, and evaluate their performance using several metrics.

In the subsequent sections, I will guide you through the setup and execution of the project, including a detailed explanation of the project structure, how to run the code, and the metrics used for evaluating model performance. Additionally, I provide a list of references to further explore the underlying concepts and algorithms employed in this project.

Project Structure

├── datasets             <- Folder for storing datasets
|   ├── raw              <- Unprocessed data from subreddits: posts and upvotes
|   └── clean            <- TF-IDF of posts packed with upvote numbers
|
├── evaluation           <- Folder for metric tables and other visualization artifacts
├── refs                 <- Paper references for regularization algorithms
├── src                  <- Source code of the project (all scripts)
├── weights              <- Folder with serialized models
|
├── .gitignore           <- List of ignored files and directories
└── requirements.txt     <- List of dependencies of the project

How to run?

Install the project dependencies:

pip install -r requirements.txt

The project uses the Reddit API and expects a .env file with the following variables:

Variable	Description
`REDDIT_APP_ID`	Application ID on the Reddit dev portal
`REDDIT_APP_SECRET`	Secret token for interacting with the dev application
`REDDIT_APP_NAME`	Alias of the Reddit dev application
`REDDIT_USERNAME`	Your Reddit username
`REDDIT_PASSWORD`	Your Reddit password
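
A quick way to check the .env file before downloading anything is to parse it and list any variables that are still missing. The snippet below is only a sketch using the standard library: `parse_env_text` and `missing_vars` are hypothetical helpers, not part of `src/`, and the values are placeholders. The project itself may load the file differently.

```python
# Variable names taken from the table above.
REQUIRED_VARS = [
    "REDDIT_APP_ID",
    "REDDIT_APP_SECRET",
    "REDDIT_APP_NAME",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
]


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


# Placeholder contents of a .env file; REDDIT_PASSWORD is left out on purpose.
example = """
# Reddit credentials (placeholders)
REDDIT_APP_ID=my-app-id
REDDIT_APP_SECRET=my-app-secret
REDDIT_APP_NAME=my-app-name
REDDIT_USERNAME=my-username
"""

print(missing_vars(parse_env_text(example)))  # prints ['REDDIT_PASSWORD']
```

To check a real file, pass its contents instead, for example `parse_env_text(open(".env").read())`.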

Download datasets from subreddits. You can specify any subreddits you like, as long as they exist:

python src/parse.py --sub-reddits [SUB_REDDIT_NAMES] \
                    --output-dir <folder_for_data> \
                    --env <path_to_env>

Once you have downloaded posts from your favourite subreddits, preprocess them to extract TF-IDF features:

python src/preprocess.py --input-dir <folder_with_raw_datasets> \
                         --output-dir <folder_to_output>
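
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the textbook definition (term frequency times the log of the inverse document frequency). It is an illustration only: the function `tf_idf` and the sample posts are made up for this example, and `src/preprocess.py` may well use a library implementation with different tokenization or smoothing (scikit-learn, for instance, uses a smoothed IDF by default).

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, rows), where rows[i][j] is the TF-IDF of term j in document i."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows


# Made-up post titles, not real data from any subreddit.
posts = ["new chess opening", "chess puzzle", "new python release"]
vocabulary, rows = tf_idf(posts)
print(vocabulary)
for row in rows:
    print([round(value, 3) for value in row])
```

Terms that appear in few posts ("opening") get a higher weight than terms shared by many posts ("chess"), which is what makes the features useful for regression.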

Once the data is preprocessed, you can train several simple regression models with a single command:

python src/train.py --datasets-dir <processed_datasets_dir> \
                    --weights-dir <folder_to_save_weights>
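
For intuition about what the three kinds of models do differently, the sketch below fits a linear model by gradient descent with no penalty, with a ridge (squared L2) penalty, and with a lasso (L1) penalty applied through soft-thresholding. This is a toy illustration on made-up numbers, not the code in `src/train.py`; the function names, `alpha`, `lr`, and `steps` are all assumptions chosen for the example.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_linear(X, y, penalty="none", alpha=0.1, lr=0.1, steps=2000):
    """Minimize (1 / 2n) * ||Xw - y||^2 plus an optional penalty on w.

    penalty="ridge" adds (alpha / 2) * ||w||_2^2,
    penalty="lasso" adds alpha * ||w||_1 (handled by a proximal step).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) - target
            for row, target in zip(X, y)
        ]
        # Gradient of the squared-error term.
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(d)]
        if penalty == "ridge":
            grad = [g + alpha * wj for g, wj in zip(grad, w)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
        if penalty == "lasso":
            w = [soft_threshold(wj, lr * alpha) for wj in w]
    return w


# Toy data: the first feature matters a lot, the second only slightly.
X = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y = [2.0, -2.0, 0.1, -0.1]

w_plain = fit_linear(X, y, penalty="none")
w_ridge = fit_linear(X, y, penalty="ridge")
w_lasso = fit_linear(X, y, penalty="lasso")
print(w_plain, w_ridge, w_lasso)
```

On this toy data the unpenalized fit recovers weights of roughly 2 and 0.1, ridge shrinks both weights, and lasso shrinks the first weight while setting the second exactly to zero. That sparsity is why lasso is a natural choice when the features are thousands of TF-IDF terms.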

Once the models are trained, you can run inference with them, compute the table of metrics, and draw word clouds that visualize which words matter most for getting upvotes:

python src/evaluate.py --datasets-dir <processed_datasets_dir> \
                       --weights-dir <weights_folder> \
                       --output-dir <folder_to_save_artifacts>

Visualization

Examples of words that positively correlate with number of upvotes in /r/MachineLearning:

Examples of words that negatively correlate with number of upvotes in /r/chess:

Example of a feature importance plot for /r/chess:

Metrics

Loss values for various regularization techniques and subreddits

Subreddit	Method	MSE	L1	L2
MachineLearning	MSE	10.7	318570.8	6494.9
MachineLearning	Lasso	687.5	278854.1	6204.1
MachineLearning	Ridge	2261.8	135578.7	3752.9
cscareerquestions	MSE	47.7	455806.7	8408.4
cscareerquestions	Lasso	95.9	348375.5	7510.0
cscareerquestions	Ridge	4007.1	152264.1	3830.3
compsci	MSE	13.4	168560.1	4583.4
compsci	Lasso	396.9	138771.6	4263.0
compsci	Ridge	2819.2	80455.7	2736.6
chess	MSE	3373.2	1200434.6	25490.1
chess	Lasso	3869.9	1056797.5	24319.7
chess	Ridge	39884.5	248024.9	8363.5
python	MSE	192.5	229601.1	3882.6
python	Lasso	208.8	175321.1	3509.7
python	Ridge	1253.9	77761.7	1793.7
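
The README does not spell out how the three columns are computed. One plausible reading, and it is only an assumption, is that MSE is the mean squared error of the predictions while L1 and L2 are the norms of the learned weight vector, which would match the pattern of Ridge having the smallest L1 and L2 values in every group. Under that assumption, the quantities could be computed as in this sketch (the helper names and numbers are illustrative, not taken from `src/evaluate.py`):

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def l1_norm(weights):
    """Sum of absolute weight values."""
    return sum(abs(w) for w in weights)


def l2_norm(weights):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(w * w for w in weights))


# Made-up upvote counts, predictions, and weights for illustration.
print(mean_squared_error([3, 5], [2, 7]))  # prints 2.5
print(l1_norm([3, -4]))                    # prints 7
print(l2_norm([3, -4]))                    # prints 5.0
```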

Dmmc123/sparse-reg