Title: NLP, Web APIs & Classification - Reddit Dating/Relationship Advice Post Classification

Author: Sudeep Choudhary

Introduction

This project classifies Reddit posts from the "dating_advice" and "relationship_advice" subreddits using Natural Language Processing (NLP) techniques, with the goal of identifying the most suitable advertisements for each subreddit. Posts are collected by web scraping and then classified with machine learning models.

Problem Statement

From the perspective of a data scientist at Reddit, the objective is to categorize posts accurately so that targeted advertising can be served on the relevant subreddits. This project focuses on the "dating_advice" and "relationship_advice" communities, aiming to:

  • Identify key terms and phrases that hold predictive power in distinguishing between the two categories.
  • Develop a classification model (Logistic Regression and Bayes models are explored) to achieve accurate post classification.

Data Collection

  • Data source: Reddit subreddits: https://www.reddit.com/r/relationship_advice/ and https://www.reddit.com/r/dating_advice/
  • Scraping method: The requests library was used to scrape post content. Around 2000 unique posts were collected (approximately 1000 from each subreddit) using the "Hot" and "New" filters to diversify the data. A 1-second delay was added between requests to avoid overloading Reddit's servers.
  • Preprocessing: Duplicate rows were removed using the drop_duplicates function on the id column.
  • Saved data: The scraped content is stored as CSV files in the dataset folder of this repository.
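
The collection steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the JSON listing URL pattern, the User-Agent string, the column selection, and the function names fetch_posts and to_unique_frame are all assumptions introduced here.

```python
import time

import pandas as pd
import requests

# Assumed endpoint pattern: Reddit's public JSON listing for a subreddit.
LISTING_URL = "https://www.reddit.com/r/{subreddit}/{listing}.json"
# Hypothetical User-Agent string, chosen only for this sketch.
HEADERS = {"User-Agent": "subreddit-classifier-sketch/0.1"}


def fetch_posts(subreddit, listing="new", pages=2, delay=1.0):
    """Page through a listing ("hot" or "new") and return one dict per post."""
    posts, after = [], None
    for _ in range(pages):
        response = requests.get(
            LISTING_URL.format(subreddit=subreddit, listing=listing),
            headers=HEADERS,
            params={"limit": 100, "after": after},
            timeout=10,
        )
        response.raise_for_status()
        data = response.json()["data"]
        posts.extend(child["data"] for child in data["children"])
        after = data["after"]
        if after is None:  # no further pages available
            break
        time.sleep(delay)  # 1-second pause between requests
    return posts


def to_unique_frame(posts):
    """Keep the columns used later and drop posts with a repeated id."""
    frame = pd.DataFrame(posts, columns=["id", "subreddit", "title", "selftext"])
    return frame.drop_duplicates(subset="id").reset_index(drop=True)
```

Under these assumptions, the posts returned by fetch_posts("dating_advice", listing="hot") and fetch_posts("dating_advice", listing="new") could be concatenated and passed to to_unique_frame, so that a post appearing under both filters is counted once.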

Data Cleaning and Exploratory Data Analysis (EDA)

  • Null entries: Rows with missing values in the selftext column were removed as they lacked valuable information.
  • Feature creation: The title and selftext columns were combined into a single all_text column for analysis.
  • Dummy variables: The subreddit column was encoded as a binary variable, with dating_advice represented by 0 and relationship_advice by 1.
  • Distribution analysis: The frequency distribution of word counts in titles and full text was examined for both subreddits to identify potential differences.
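
A rough pandas sketch of these cleaning steps is shown below. The column names id, title, selftext, subreddit and all_text come from the description above; the function name clean_posts and the two word-count column names are hypothetical, and the toy rows exist only to make the example runnable.

```python
import pandas as pd


def clean_posts(frame):
    """Apply the cleaning steps described above to a frame of scraped posts."""
    # Drop rows whose selftext is missing.
    cleaned = frame.dropna(subset=["selftext"]).copy()
    # Combine title and body into a single text column.
    cleaned["all_text"] = cleaned["title"] + " " + cleaned["selftext"]
    # Encode the target: dating_advice -> 0, relationship_advice -> 1.
    cleaned["subreddit"] = cleaned["subreddit"].map(
        {"dating_advice": 0, "relationship_advice": 1}
    )
    # Word counts used for the distribution analysis (hypothetical column names).
    cleaned["title_word_count"] = cleaned["title"].str.split().str.len()
    cleaned["all_text_word_count"] = cleaned["all_text"].str.split().str.len()
    return cleaned.reset_index(drop=True)


# Toy rows for illustration only.
toy = pd.DataFrame(
    [
        {"subreddit": "dating_advice", "title": "First date tips", "selftext": "Where should we go?"},
        {"subreddit": "relationship_advice", "title": "Moving in together", "selftext": None},
        {"subreddit": "relationship_advice", "title": "Trust issues", "selftext": "We argue a lot lately"},
    ]
)
cleaned = clean_posts(toy)
```

From here, a call such as cleaned.groupby("subreddit")["all_text_word_count"].describe() would give a per-subreddit summary of the word-count distribution.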

Preprocessing and Modeling

  • Baseline accuracy: The baseline accuracy, achieved by always predicting the majority class ("relationship_advice"), was found to be 63.3%.
  • Modeling approach: Logistic Regression and Bayes models were considered as potential classification algorithms.
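
For reference, the baseline is simply the share of the most common class. The sketch below shows the calculation on toy labels that are not the project's data; the function name baseline_accuracy is hypothetical.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Toy labels for illustration only (1 = relationship_advice, 0 = dating_advice).
toy_labels = [1, 1, 1, 0, 0]
label, accuracy = baseline_accuracy(toy_labels)
print(label, accuracy)  # 1 0.6
```

Any model worth keeping should beat this figure on held-out data.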

Hyperparameter Tuning

  • Initial attempts:
    • Logistic regression with default CountVectorizer parameters yielded poor results.
    • Using CountVectorizer with stemming as a preprocessing step yielded low accuracy and high variance.
    • Employing a lemmatizer instead slightly improved the test score (to 74%).
  • Optimized model:
    • Increasing the dataset size and stratifying the train/test split significantly improved accuracy (CountVectorizer test score: 81%, cross-validation score: 80%).
    • Additional improvements were achieved by fine-tuning CountVectorizer parameters:
      • min_df: 3 (minimum document frequency)
      • ngram_range: (1, 2) (considering single words and bigrams)
    • Grid Search with Pipeline was used to identify optimal hyperparameters for Logistic Regression and CountVectorizer. The best parameters were:
      • cvec__max_df: 0.95 (maximum document frequency)
      • cvec__max_features: 4000 (maximum number of features)
      • cvec__min_df: 3 (minimum document frequency)
      • cvec__ngram_range: (1, 2) (considering single words and bigrams)
    • Results:
      • Cross-validation score: 76%
      • Test score: 78%
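
The tuning setup described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the project's code: the toy corpus, the candidate values in param_grid, the 3-fold setting, the split ratio and max_iter=1000 are all chosen here for the example. Only the step name cvec and the reported best values come from the write-up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only; the project used the scraped all_text column.
dating = [
    "how do i ask her out on a first date",
    "nervous about texting my match after the first date",
]
relationship = [
    "my boyfriend and i keep arguing about money",
    "my girlfriend of three years wants to move in together",
]
texts = dating * 15 + relationship * 15
labels = [0] * 30 + [1] * 30

# Stratify so both subreddits keep the same proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

pipe = Pipeline(
    [
        ("cvec", CountVectorizer()),
        ("lr", LogisticRegression(max_iter=1000)),
    ]
)

# Hypothetical candidate values; the reported best values are among them.
param_grid = {
    "cvec__max_df": [0.9, 0.95],
    "cvec__max_features": [2000, 4000],
    "cvec__min_df": [2, 3],
    "cvec__ngram_range": [(1, 1), (1, 2)],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("cross-validation score:", grid.best_score_)
print("test score:", grid.score(X_test, y_test))
```

Wrapping the vectorizer and the classifier in one Pipeline means the vocabulary is refit inside each cross-validation fold, so the cross-validation score is not inflated by terms seen only in the held-out fold.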

TF-IDF Vectorization

  • TF-IDF vectorization was also tested:
    • TfidfVectorizer with a custom stop-word list, tailored to filter out irrelevant or overused words, further reduced overfitting.
    • Final parameters:
      • Stop words: Custom list that filters out common words such as "relationship", "girlfriend", etc.
      • ngram_range: (1, 2) (considering single words and bigrams)
      • max_df: 0.9 (maximum document frequency)
      • min_df: 2 (minimum document frequency)
      • max_features: 5000 (maximum number of features)