Author: Sudeep Choudhary
This project delves into classifying Reddit posts from the "dating_advice" and "relationship_advice" subreddits using Natural Language Processing (NLP) techniques, aiming to identify the most suitable advertisements for each page. It leverages web scraping and machine learning algorithms to achieve this goal.
As a data scientist at Reddit, the objective is to categorize posts effectively to serve targeted advertising on relevant subreddits. This project focuses on the "dating_advice" and "relationship_advice" communities, aiming to:
- Identify key terms and phrases that hold predictive power in distinguishing between the two categories.
- Develop a classification model (Logistic Regression and Naive Bayes models are explored) to achieve accurate post classification.
- Data source: Reddit subreddits: https://www.reddit.com/r/relationship_advice/ and https://www.reddit.com/r/dating_advice/
- Scraping method: The `requests` library was employed to scrape post content. Around 2,000 unique posts were collected (approximately 1,000 from each subreddit) using the "Hot" and "New" filters to diversify the data.
- Preprocessing: Duplicate rows were removed using the `drop_duplicates` function on the `id` column. A 1-second delay was implemented between requests to avoid overloading Reddit's servers.
- Saved data: The scraped content is stored as CSV files in the `dataset` folder of this repository.
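The scraping loop might look like the sketch below. It uses Reddit's public JSON listing endpoint; the function names and the exact fields kept are illustrative assumptions, not the project's actual code.

```python
import time

import requests

# Hypothetical sketch of the scraping loop. The URL follows Reddit's public
# JSON listing format; the real script may have differed in details.
HEADERS = {"User-Agent": "subreddit-classifier-sketch/0.1"}


def parse_posts(payload):
    """Extract the fields used downstream from one JSON listing page."""
    children = payload["data"]["children"]
    return [
        {
            "id": post["data"]["id"],
            "title": post["data"]["title"],
            "selftext": post["data"].get("selftext", ""),
            "subreddit": post["data"]["subreddit"],
        }
        for post in children
    ]


def scrape_subreddit(name, pages=10, listing="new"):
    """Fetch several pages of a subreddit listing, pausing 1 s between requests."""
    posts, after = [], None
    for _ in range(pages):
        url = f"https://www.reddit.com/r/{name}/{listing}.json"
        resp = requests.get(url, headers=HEADERS, params={"after": after})
        payload = resp.json()
        posts.extend(parse_posts(payload))
        after = payload["data"]["after"]  # pagination cursor for the next page
        if after is None:
            break
        time.sleep(1)  # be polite to Reddit's servers
    return posts
```

Passing the `after` cursor from each response into the next request is how Reddit's listing API pages through "Hot" or "New" posts.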
- Null entries: Rows with missing values in the `selftext` column were removed, as they lacked valuable information.
- Feature creation: The `title` and `selftext` columns were combined into a single `all_text` column for analysis.
- Dummy variables: The `subreddit` column was converted into a dummy variable, where `dating_advice` is represented by 0 and `relationship_advice` by 1.
- Distribution analysis: The frequency distribution of word counts in titles and full text was examined for both subreddits to identify potential differences.
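The cleaning and feature-creation steps above can be sketched with pandas; the column names match the scraped Reddit fields, but the exact code may have differed.

```python
import pandas as pd

# Sketch of the cleaning pipeline described above.
def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="id")   # remove duplicate posts by id
    df = df.dropna(subset=["selftext"])    # drop rows with no body text
    return df.assign(
        # combine title and body into a single text column
        all_text=df["title"] + " " + df["selftext"],
        # dummy variable: dating_advice -> 0, relationship_advice -> 1
        target=(df["subreddit"] == "relationship_advice").astype(int),
    )
```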
- Baseline accuracy: The baseline accuracy, achieved by always predicting the majority class ("relationship_advice"), was found to be 63.3%.
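The baseline is simply the share of the majority class. The 63.3% figure comes from the real scraped data; the toy series below only illustrates the computation.

```python
import pandas as pd

# Baseline accuracy = always predict the majority class.
# Toy data: 19 of 30 posts from relationship_advice (~63.3%).
subreddit = pd.Series(["relationship_advice"] * 19 + ["dating_advice"] * 11)
baseline = subreddit.value_counts(normalize=True).max()
print(round(baseline, 3))  # prints 0.633
```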
- Modeling approach: Logistic Regression and Naive Bayes models were considered as potential classification algorithms.
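Both candidates can be set up as scikit-learn pipelines over bag-of-words features; the step names below ("cvec", "model") are illustrative, and the toy texts stand in for the real corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Sketch of the two candidate classifiers over bag-of-words features.
def make_text_pipeline(model):
    return Pipeline([
        ("cvec", CountVectorizer()),  # raw term counts
        ("model", model),
    ])

logreg = make_text_pipeline(LogisticRegression(max_iter=1000))
nb = make_text_pipeline(MultinomialNB())

# Toy stand-in data: class 0 ~ dating_advice, class 1 ~ relationship_advice.
texts = [
    "first date ideas", "first date nerves",
    "my partner and i argue", "partner argue a lot",
]
labels = [0, 0, 1, 1]
logreg.fit(texts, labels)
```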
- Initial attempts:
  - Logistic regression with default `CountVectorizer` parameters yielded poor results.
  - Using `CountVectorizer` with stemming as a preprocessing step yielded low accuracy and high variance.
  - Employing a lemmatizer instead slightly improved the test score (74%).
- Optimized model:
  - Simply increasing the dataset size and stratifying the train/test split significantly improved accuracy (`CountVectorizer` test score: 81%, cross-validation score: 80%).
  - Additional improvements were achieved by fine-tuning `CountVectorizer` parameters:
    - `min_df`: 3 (minimum document frequency)
    - `ngram_range`: (1, 2) (considering single words and bigrams)
  - Grid Search with a Pipeline was used to identify optimal hyperparameters for Logistic Regression and `CountVectorizer`. The best parameters were:
    - `cvec__max_df`: 0.95 (maximum document frequency)
    - `cvec__max_features`: 4000 (maximum number of features)
    - `cvec__min_df`: 3 (minimum document frequency)
    - `cvec__ngram_range`: (1, 2) (considering single words and bigrams)
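A grid search of this shape might look like the sketch below. The grid mirrors the reported best values; the real search presumably covered a wider range around each one, and the toy corpus here only stands in for the scraped posts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Sketch of Grid Search over a Pipeline; the "cvec__" prefix routes each
# parameter to the CountVectorizer step.
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "cvec__max_df": [0.9, 0.95],
    "cvec__max_features": [3000, 4000],
    "cvec__min_df": [2, 3],
    "cvec__ngram_range": [(1, 1), (1, 2)],
}
search = GridSearchCV(pipe, param_grid, cv=3)

# Toy stand-in corpus: class 0 ~ dating_advice, class 1 ~ relationship_advice.
texts = [
    "first date tips", "date tips please", "first date advice",
    "date advice tips", "nervous first date", "date advice please",
    "partner argue money", "argue about money", "partner argue chores",
    "money argue partner", "argue chores partner", "partner money chores",
]
labels = [0] * 6 + [1] * 6
search.fit(texts, labels)
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is the refitted pipeline.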
- Results:
  - Cross-validation score: 76%
  - Test score: 78%
- TF-IDF experimentation: `TfidfVectorizer` with custom stop words, tailored to exclude irrelevant or overused words, further reduced overfitting.
- Final parameters:
  - Stop words: customized list excluding common words like "relationship", "girlfriend", etc.
  - `ngram_range`: (1, 2) (considering single words and bigrams)
  - `max_df`: 0.9 (maximum document frequency)
  - `min_df`: 2 (minimum document frequency)
  - `max_features`: 5000 (maximum number of features)