/sparse-reg

Regularization algorithms for sparse data


Reddit: Analysis of Language Patterns through Regularization

Introduction

Welcome to my project! Its primary goal is to analyze Reddit posts and their upvotes using various regression models. My aim is to identify the relationship between the textual features of posts, represented as term frequency-inverse document frequency (TF-IDF) vectors, and their popularity as measured by the number of upvotes.

I have structured the project to facilitate the entire process, from data acquisition to model evaluation. I provide a streamlined pipeline that enables users to easily download data from specified subreddits, preprocess the data to extract relevant features, train a variety of regression models, and evaluate their performance using several metrics.

In the subsequent sections, I will guide you through the setup and execution of the project, including a detailed explanation of the project structure, how to run the code, and the metrics used for evaluating model performance. Additionally, I provide a list of references to further explore the underlying concepts and algorithms employed in this project.

Project Structure

├── datasets             <- Folder for storing datasets
|   ├── raw              <- Unprocessed data from subreddits: posts and upvotes
|   └── clean            <- TF-IDF of posts packed with upvote numbers
|
├── evaluation           <- Folder for metric tables and other visualization artifacts
├── refs                 <- Paper references for regularization algorithms
├── src                  <- Source code of the project (all scripts)
├── weights              <- Folder with serialized models
|
├── .gitignore           <- List of ignored files and directories
└── requirements.txt     <- List of dependencies of the project

How to run?

  1. Install the project dependencies:

pip install -r requirements.txt

  2. The project uses the Reddit API and expects a .env file with the following variables:

Variable           Description
REDDIT_APP_ID      Application ID on the Reddit dev portal
REDDIT_APP_SECRET  Secret token for interacting with the dev application
REDDIT_APP_NAME    Alias of the Reddit dev application
REDDIT_USERNAME    Your username
REDDIT_PASSWORD    Your password

  3. Download datasets from subreddits. You can specify any subreddits you like, as long as they exist:

python src/parse.py --sub-reddits [SUB_REDDIT_NAMES] \
                    --output-dir <folder_for_data> \
                    --env <path_to_env>

  4. Once you have downloaded posts from your favourite subreddits, preprocess them to extract TF-IDF features:

python src/preprocess.py --input-dir <folder_with_raw_datasets> \
                         --output-dir <folder_to_output>

  5. Once the data is preprocessed, you can train several simple regression models with a single command:

python src/train.py --datasets-dir <processed_datasets_dir> \
                    --weights-dir <folder_to_save_weights>

  6. Once the models are trained, you can run inference with them, compute the table of metrics, and draw word clouds that visualize which words are associated with more upvotes:

python src/evaluate.py --datasets-dir <processed_datasets_dir> \
                       --weights-dir <weights_folder> \
                       --output-dir <folder_to_save_artifacts>
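
The preprocessing step above turns raw post text into TF-IDF features. As a rough illustration of what that means, here is a minimal sketch using only the standard library. It is not the code in src/preprocess.py: the helper names `tokenize` and `tfidf`, the regex tokenizer, the toy posts, and the exact weighting formula (tf = count / length, idf = ln(N / df)) are all assumptions chosen for clarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumed textbook variant: tf = count / document length and
    idf = ln(N / df). Libraries often add smoothing and normalization,
    so their numbers will differ from these.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty posts
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy posts, made up for illustration only.
posts = [
    "New chess engine beats old chess engine",
    "Python tips for chess programmers",
    "Python regression tutorial",
]
for post, vector in zip(posts, tfidf(posts)):
    top_term = max(vector, key=vector.get)
    print(f"{post!r}: highest-weighted term = {top_term!r}")
```

With this variant, a term that occurs in every document gets weight zero, which is why TF-IDF vectors for short posts tend to be sparse and dominated by a few distinctive words.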

Visualization

Example of words that correlate positively with the number of upvotes in /r/MachineLearning:

[Image: Positively correlated words]

Example of words that correlate negatively with the number of upvotes in /r/chess:

[Image: Negatively correlated words]

Example of a feature importance plot for /r/chess:

[Image: Word importance]
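
One plausible way to obtain such word lists from a fitted linear model is to sort the vocabulary by coefficient. The sketch below assumes this approach; the function name `rank_words`, the `vocabulary` and `coefficients` inputs, and the toy values are hypothetical and are not taken from src/evaluate.py or from the project's results.

```python
def rank_words(vocabulary, coefficients, k=3):
    """Return the k most positive and k most negative words by coefficient.

    Both results are lists of (word, coefficient) pairs, strongest first.
    Words with a zero coefficient are left out of both lists.
    """
    if len(vocabulary) != len(coefficients):
        raise ValueError("vocabulary and coefficients must have equal length")

    # Sort ascending, so negative weights come first and positive ones last.
    pairs = sorted(zip(vocabulary, coefficients), key=lambda pair: pair[1])
    k = max(0, min(k, len(pairs)))

    negative = [(w, c) for w, c in pairs[:k] if c < 0]
    positive = [(w, c) for w, c in reversed(pairs[len(pairs) - k:]) if c > 0]
    return positive, negative


# Toy values, made up for illustration only.
vocabulary = ["alpha", "beta", "gamma", "delta"]
coefficients = [0.8, -1.2, 0.0, 0.3]

positive, negative = rank_words(vocabulary, coefficients, k=2)
print(positive)  # [('alpha', 0.8), ('delta', 0.3)]
print(negative)  # [('beta', -1.2)]
```

Sorting by signed coefficient keeps the two directions separate, which matches the split between positively and negatively correlated words shown above.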

Metrics

Loss values for various regularization techniques and subreddits
Subreddit          Method      MSE         L1       L2
MachineLearning    MSE        10.7   318570.8   6494.9
                   Lasso     687.5   278854.1   6204.1
                   Ridge    2261.8   135578.7   3752.9
cscareerquestions  MSE        47.7   455806.7   8408.4
                   Lasso      95.9   348375.5   7510.0
                   Ridge    4007.1   152264.1   3830.3
compsci            MSE        13.4   168560.1   4583.4
                   Lasso     396.9   138771.6   4263.0
                   Ridge    2819.2    80455.7   2736.6
chess              MSE      3373.2  1200434.6  25490.1
                   Lasso    3869.9  1056797.5  24319.7
                   Ridge   39884.5   248024.9   8363.5
python             MSE       192.5   229601.1   3882.6
                   Lasso     208.8   175321.1   3509.7
                   Ridge    1253.9    77761.7   1793.7
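
To give some intuition for why the three methods can behave differently, here is a toy sketch of how L1 (Lasso) and L2 (Ridge) penalties shrink weights in the special case of an orthonormal design matrix, where both solutions have a closed form. It does not reproduce the numbers in the table, and this README does not state whether the L1 and L2 columns are the weight norms computed here. The starting weights and the penalty strength `lam` are made up, and the helper names are hypothetical.

```python
import math


def soft_threshold(w, lam):
    """Lasso solution for one weight when X^T X = I.

    Minimizes 0.5 * ||y - Xw||^2 + lam * ||w||_1, given the
    unregularized least-squares weight w.
    """
    if abs(w) <= lam:
        return 0.0  # small weights are set exactly to zero
    return w - lam if w > 0 else w + lam


def ridge_shrink(w, lam):
    """Ridge solution for one weight when X^T X = I.

    Minimizes 0.5 * ||y - Xw||^2 + 0.5 * lam * ||w||^2.
    """
    return w / (1.0 + lam)


def l1_norm(weights):
    return sum(abs(w) for w in weights)


def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights))


# Toy least-squares weights and penalty strength, for illustration only.
ols = [3.0, -0.4, 0.2, -2.5]
lam = 0.5

lasso = [soft_threshold(w, lam) for w in ols]
ridge = [ridge_shrink(w, lam) for w in ols]

print(lasso)  # [2.5, 0.0, 0.0, -2.0]
for name, weights in [("OLS", ols), ("Lasso", lasso), ("Ridge", ridge)]:
    print(f"{name:<6} L1={l1_norm(weights):.3f} L2={l2_norm(weights):.3f}")
```

In this setting Lasso zeroes out the small weights while Ridge scales every weight by the same factor, so only Lasso produces a sparse weight vector.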

References