NLP-with-Reddit-Comment

This project uses big data platforms, specifically Spark (PySpark, SparkML, Spark NLP), to analyze Reddit comments. Using the text of each comment, we classify its sentiment and predict its score. Our objective is to understand the dynamics of the Reddit online community and how the way people communicate online leads to different reactions from the community.

Primary language: Jupyter Notebook

NLP-with-Reddit-Comment

Problem Statement

We aim to help Reddit understand its topics, categorize the attitude expressed in comments, and predict each comment’s like and dislike score. To do so, we conduct text analysis across subreddits to better understand the current Reddit community.

  • Target popular topics using word clouds.
  • Categorize the attitude expressed in each comment.
  • Predict scores for each comment accordingly.

Big Data Platforms

  • Google Cloud Platform (BigQuery, Dataproc, Cloud SQL)
  • UChicago Research Computing Center

Data and Analysis

All analysis and data collection code is in the jupyter_notebooks folder.

Modelling

Sentiment Analysis

  • Understand people's opinions from a post
  • Potentially help Reddit gain an overview of the wider public opinion behind certain topics
  • Transformer pipeline:
    • Regular Expression Tokenizer
    • StopWordsRemover
    • CountVectorizer
    • StringIndexer
    • HashingTF
    • IDF
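
To make the stages above concrete, here is a minimal plain-Python sketch of what they compute: regex tokenization, stop-word removal, term counting, label indexing, and inverse-document-frequency weighting. It is an illustration only, not the project's Spark pipeline. The stop-word list, the sample comments, the sentiment labels, and the `\W+` split pattern are all made up for the example. The IDF uses the smoothed form log((N + 1) / (df + 1)), and labels are indexed most frequent first, which is how Spark's IDF and StringIndexer behave by default. The hashing step (HashingTF) is not shown; it would replace the term-to-count dictionary with counts keyed by hashed term indices.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and sample comments, for illustration only.
STOP_WORDS = {"the", "a", "is", "this", "and", "i", "it"}
COMMENTS = [
    ("I love this subreddit", "positive"),
    ("This is the worst post", "negative"),
    ("Love it and love the mods", "positive"),
]


def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def inverse_document_frequency(docs):
    """Smoothed IDF per term: log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((n_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
    }


def index_labels(labels):
    """Map each label to an integer index, most frequent label first."""
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: idx for idx, label in enumerate(ordered)}


# Tokenize and clean every comment.
docs = [remove_stop_words(tokenize(text)) for text, _ in COMMENTS]

# Weight raw term counts by IDF to get one sparse TF-IDF vector per comment.
idf = inverse_document_frequency(docs)
tfidf = [
    {term: count * idf[term] for term, count in Counter(doc).items()}
    for doc in docs
]

# Turn the string sentiment labels into numeric class indices.
label_index = index_labels([label for _, label in COMMENTS])
```

In this toy data "love" appears in two of the three comments, so it gets a lower IDF weight than words that appear in only one. The resulting TF-IDF vectors and label indices are the kind of input a classifier would be trained on.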

Regression Analysis

  • Based on the body of the post, predict a post’s success before it’s submitted
  • Potentially help Redditors gain upvotes, and predict which posts will get popular enough to hit the front page
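
As a rough illustration of predicting a score from the body text, the sketch below fits a one-feature least-squares line in plain Python. The source does not say which features or regressor the project uses, so everything here is an assumption: the single feature (word count), the function names `fit_line` and `predict_score`, and the made-up training pairs. A real model would use richer text features such as the TF-IDF vectors from the sentiment pipeline.

```python
# Made-up (comment body, score) pairs, for illustration only.
TRAIN = [
    ("nice", 3),
    ("nice post thanks", 7),
    ("really nice post thanks everyone", 11),
]


def word_count(body):
    """Hypothetical single feature: number of whitespace-separated words."""
    return len(body.split())


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def predict_score(body, slope, intercept):
    """Predict a score for an unseen comment body before it is submitted."""
    return slope * word_count(body) + intercept


slope, intercept = fit_line(
    [word_count(body) for body, _ in TRAIN],
    [score for _, score in TRAIN],
)
predicted = predict_score("this is a comment", slope, intercept)
```

Because the toy scores lie exactly on a line, the fit recovers it exactly. Real comment scores are noisy, so a fit on real data should be judged on held-out comments.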