Blog Sentiment Analysis

Created by Eva Bacas.

About

This is Eva's term project for LING 1340: Data Science for Linguists. This project is an analysis of the sentiments and word frequencies in blogs, including variance across different author demographic groups.

Data

This project uses the Blog Authorship Corpus.

J. Schler, M. Koppel, S. Argamon, and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs. URL: http://www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf

Directory

About the project:

Jupyter notebook files:

  • progress_report1.ipynb - Click here to view on nbviewer
    • In my first progress report, I read the CSV file into a data frame and explored the data. I calculated basic stats about blogger demographics, but I forgot to consider that there are multiple blogs per blogger.
  • progress_report_part2.ipynb - Click here to view on nbviewer
    • In my second progress report, I continued exploring the data frame and corrected my issues from my first progress report. I began my analysis by looking at word frequencies and topic modeling.
  • progress_report_part3.ipynb - Click here to view on nbviewer
    • In my third progress report, I explored sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner). I categorized blogs as positive, negative, or neutral, and then investigated variation across blogger groups and most frequent words per sentiment category.
  • progress_report_part3b.ipynb - Click here to view on nbviewer
    • This is a continuation of the third progress report. I switched to a new Jupyter notebook so I could use R. I attempted to fit a mixed-effects regression model with the demographic info as predictors and polarity score as the outcome. The model did not work as intended, and none of the predictors were significant.
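
Two ideas from the reports above can be illustrated with a small, self-contained sketch: counting each blogger once when summarizing demographics, and turning a VADER-style compound score into a positive, negative, or neutral label. This is not the code from the notebooks. The rows, the column names (`id`, `gender`, `compound`), and the helper function names are all hypothetical, and the ±0.05 cutoff is the conventional threshold suggested in VADER's documentation, which the notebooks may or may not have used. In practice the compound score would come from something like `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`; here it is hard-coded so the example runs without extra packages.

```python
from collections import Counter

# Hypothetical rows standing in for the corpus CSV: one row per blog entry.
# Column names and values are made up for illustration only.
posts = [
    {"id": "a1", "gender": "female", "compound": 0.62},
    {"id": "a1", "gender": "female", "compound": -0.30},
    {"id": "b2", "gender": "male", "compound": 0.01},
    {"id": "b2", "gender": "male", "compound": 0.40},
    {"id": "b2", "gender": "male", "compound": -0.75},
]


def bloggers_by_group(rows, group_key):
    """Count each blogger once per group, however many entries they wrote."""
    seen = {}
    for row in rows:
        # Keep the first group value seen for each blogger id.
        seen.setdefault(row["id"], row[group_key])
    return Counter(seen.values())


def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment category."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_by_group(rows, group_key):
    """Count entries for each (group, sentiment label) pair."""
    counts = Counter()
    for row in rows:
        counts[(row[group_key], label_sentiment(row["compound"]))] += 1
    return counts


# Counting rows directly counts entries, not bloggers.
per_entry = Counter(row["gender"] for row in posts)
per_blogger = bloggers_by_group(posts, "gender")

print("entries per group: ", dict(per_entry))
print("bloggers per group:", dict(per_blogger))
print("labels per group:  ", dict(sentiment_by_group(posts, "gender")))
```

The gap between the per-entry and per-blogger counts shows why demographic stats computed straight from the rows can be skewed toward prolific bloggers.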

Folders:

  • /data_samples
    • Contains a 100-blog sample of the dataset
  • /images
    • PNG versions of all graphs and additional images in my project presentation and final report

Other files:

Code

This code is licensed under the GNU General Public License v3.0.

Guestbook

Please leave comments, suggestions, and questions in my guestbook.