/Tweet-Classifier

Classifying where a tweet comes from

Tweet Classification

Collaborators: Marco Santos & Harris Nathel

Goal

Given a specific subject and two different cities, can NLP and classification models determine which of the two cities a tweet came from, based on the language the tweet uses about that subject?

Data Gathering

  • Used the Twint module to scrape tweets from Twitter
  • Tweets were scraped based on a user-defined subject
  • Tweets were filtered by two user-defined cities
  • Any subject and any two cities can be entered into the program to test the classifiers
  • 10,000 tweets were collected from each of the two cities, for a total of 20,000 tweets

For the sake of consistency, this iteration focused on the subject of Trump in Seattle and Jacksonville.

Data Cleaning

  • Lowercased all words for every tweet
  • URLs and special characters such as emojis and punctuation were removed
  • Lemmatization with the nltk library was used for the remaining words
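The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_tweet`, the regular expressions, and the optional `lemmatize` callable are all assumptions made for the example. Note that this version also drops digits and non-ASCII letters, which the original cleaning may or may not have done.

```python
import re

# Assumed patterns for this sketch; the notebook may have used different ones
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def clean_tweet(text, lemmatize=None):
    """Lowercase a tweet, strip URLs and special characters, optionally lemmatize."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    # Removes emojis, punctuation, digits, and symbols such as '#' and '@'
    # (the word that follows the symbol is kept)
    text = NON_LETTER_PATTERN.sub(" ", text)
    tokens = text.split()
    if lemmatize is not None:
        # lemmatize is any callable mapping a word to its base form
        tokens = [lemmatize(token) for token in tokens]
    return " ".join(tokens)


cleaned = clean_tweet("Check THIS out!!! https://t.co/abc #MAGA")
```

To lemmatize with nltk, the `lemmatize` method of `nltk.stem.WordNetLemmatizer` can be passed in as the callable once the WordNet corpus has been downloaded.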

Data Exploration

After gathering and cleaning all of the tweets, we wanted to look at which words were more common in each of the cities.

WordCloud
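A word cloud is driven by word frequencies, so the comparison can be approximated by counting words per city. The sketch below assumes the tweets have already been cleaned into space-separated strings; `top_words` and the sample tweets are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_words(tweets, n=10, stopwords=frozenset()):
    """Return the n most frequent words across a list of cleaned tweets."""
    counts = Counter(
        word
        for tweet in tweets
        for word in tweet.split()
        if word not in stopwords
    )
    return counts.most_common(n)


# Made-up tweets, purely for illustration
sample_tweets = ["trump protest downtown", "protest again today"]
most_common = top_words(sample_tweets, n=1, stopwords={"trump"})
```

Running the same function on each city's tweets and comparing the two lists gives a rough numeric counterpart to the word clouds.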

Because our subject was "Trump", we wanted to see how similar specific words were to "Trump" in each city and compare the two cities. Word2Vec was used to get the similarity scores.

SimilarityToTrump
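Word2Vec similarity scores are cosine similarities between word vectors. The sketch below shows that calculation on hand-written toy vectors; in the project, the vectors would presumably come from a Word2Vec model trained on each city's tweets. The function names and the vectors themselves are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, vectors):
    """Rank every other word by its similarity to the target word."""
    scores = [
        (word, cosine_similarity(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors (hypothetical, not learned from any data)
toy_vectors = {
    "trump": [1.0, 0.0],
    "rally": [0.9, 0.1],
    "coffee": [0.0, 1.0],
}
ranking = rank_by_similarity("trump", toy_vectors)
```

Computing such a ranking separately for each city makes it possible to compare how close the same word sits to "trump" in Seattle versus Jacksonville.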

It was interesting to see that words associated with negative things in the news were more common in Seattle (the more liberal city) than in Jacksonville. Words that we thought would be associated with pro-Trump tweets were more common in Jacksonville.

Classification Modeling

Vectorizing

Both CountVectorizer and TfidfVectorizer were used in order to compare the performance of the models with each. In the end, performance was similar for both, but TfidfVectorizer had slightly better overall results. As a result, TfidfVectorizer was chosen as the default vectorizer.
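To make the difference between the two vectorizers concrete, here is a small sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer` with their default settings. The three example documents are made up, and the settings used in the notebook are not known, so defaults are an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up cleaned tweets, purely for illustration
docs = [
    "trump rally downtown tonight",
    "protest against trump downtown",
    "trump rally trump",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

# Both produce a sparse (documents x vocabulary) matrix
X_count = count_vec.fit_transform(docs)  # raw word counts
X_tfidf = tfidf_vec.fit_transform(docs)  # counts reweighted by inverse document frequency

trump_col = count_vec.vocabulary_["trump"]
tonight_col = tfidf_vec.vocabulary_["tonight"]
```

With raw counts, "trump" simply scores 2 in the third document. With TF-IDF, a word that appears in every document (here "trump") is down-weighted relative to a rarer word such as "tonight", which may help when the subject word itself appears in nearly every tweet.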

Dummy Classifier

Results:

Training Score - 50%

Testing Score - 49%
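A baseline like this can be reproduced with scikit-learn's `DummyClassifier`, which ignores the features entirely. The sketch below assumes the `"uniform"` strategy (random guessing between the two cities); the strategy used in the notebook is not stated, and `baseline_accuracy` is a hypothetical helper.

```python
import numpy as np
from sklearn.dummy import DummyClassifier


def baseline_accuracy(labels, random_state=0):
    """Accuracy of uniform random guessing on the given labels."""
    labels = np.asarray(labels)
    # DummyClassifier ignores feature values, so a placeholder column is enough
    features = np.zeros((len(labels), 1))
    baseline = DummyClassifier(strategy="uniform", random_state=random_state)
    baseline.fit(features, labels)
    return baseline.score(features, labels)


# Balanced two-city labels, mirroring the equal split of tweets per city
balanced_labels = ["seattle"] * 1000 + ["jacksonville"] * 1000
accuracy = baseline_accuracy(balanced_labels)
```

With two equally sized classes, random guessing lands near 50%, which is the bar the other models need to clear.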

Random Forest

Results:

Training Score - 96%

Testing Score - 60%

Naive Bayes

Results:

Training Score - 79%

Testing Score - 62%

Logistic Regression

Results:

Training Score - 82%

Testing Score - 61%

Support Vector Machine

Results:

Training Score - 87%

Testing Score - 59%
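The four classifiers above can be compared with one loop that fits each model on the same train/test split and records both scores. This is a sketch under several assumptions: default hyperparameters, a linear-kernel `SVC`, a 75/25 split, and toy texts generated in place of the real tweets. `compare_models` is a hypothetical helper, not the notebook's code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def compare_models(texts, labels, random_state=0):
    """Return {model name: (training accuracy, testing accuracy)}."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=random_state
    )
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=random_state),
        "Naive Bayes": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine": SVC(kernel="linear"),
    }
    scores = {}
    for name, model in models.items():
        # The vectorizer is fit on the training split only, inside the pipeline
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(X_train, y_train)
        scores[name] = (pipeline.score(X_train, y_train), pipeline.score(X_test, y_test))
    return scores


# Toy stand-in data: made-up city-flavored words, not real tweets
seattle_words = ["rain", "coffee", "ferry", "protest"]
jacksonville_words = ["beach", "rally", "sunshine", "jaguars"]
texts = [f"trump {a} {b}" for a in seattle_words for b in seattle_words]
texts += [f"trump {a} {b}" for a in jacksonville_words for b in jacksonville_words]
labels = ["seattle"] * 16 + ["jacksonville"] * 16

scores = compare_models(texts, labels)
```

Reporting the training and testing scores side by side, as done in the results above, makes it easy to see how much of each model's training accuracy carries over to unseen tweets.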

Deep Learning with Keras

A Sequential model with only 3 layers was used as the neural network. After training for 300 epochs with a batch size of 256, the results were similar to those of the other classification models. The neural network did not improve the results enough to justify using it for these tweets.
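A three-layer Sequential model of this kind might look like the sketch below. Only the layer count, epoch count, and batch size come from the description above; the layer widths, activations, optimizer, loss, and the `build_model` helper are assumptions chosen for a typical binary text classifier.

```python
from tensorflow import keras


def build_model(input_dim):
    """Three Dense layers ending in a sigmoid for two-city classification."""
    model = keras.Sequential(
        [
            keras.Input(shape=(input_dim,)),
            keras.layers.Dense(64, activation="relu"),   # assumed width
            keras.layers.Dense(32, activation="relu"),   # assumed width
            keras.layers.Dense(1, activation="sigmoid"),  # probability of one city
        ]
    )
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# input_dim would be the size of the vectorizer's vocabulary; 1000 is a placeholder
model = build_model(input_dim=1000)

# Training with the reported settings; X_train and y_train are hypothetical arrays
# model.fit(X_train, y_train, epochs=300, batch_size=256)
```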

Potential Improvements

  • More models, such as XGBoost or KNN, could be applied for more comparisons and potentially improved results.
  • Feature engineering, such as n-grams, for possibly better results.
  • More cleaning with other techniques or different modules such as spaCy
  • More experimentation with the neural network
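As an illustration of the n-gram idea from the list above, the sketch below builds unigram and bigram features from a cleaned tweet. The helper names are hypothetical and this was not part of the project.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(text):
    """Single words plus adjacent word pairs for one cleaned tweet."""
    tokens = text.split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


features = unigrams_and_bigrams("make america great")
```

In practice the same effect is available without custom code: scikit-learn's `CountVectorizer` and `TfidfVectorizer` both accept an `ngram_range` parameter, for example `ngram_range=(1, 2)`.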

Closing

Due to the nature of the question, it is inherently difficult to classify which location a tweet comes from. However, these classification models did perform better than random guessing (the Dummy Classifier). The models listed above each scored at least 10 percentage points higher on the test set than the Dummy Classifier (49%), and the best-performing model was Naive Bayes (62%). Even though the models performed this way on this dataset, a new subject and different cities could significantly alter the overall results.
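The comparison against the baseline can be checked directly from the testing scores reported above. The short calculation below uses only those numbers; the variable names are just for illustration.

```python
# Testing scores (in percent) as reported in the results above
test_scores = {
    "Random Forest": 60,
    "Naive Bayes": 62,
    "Logistic Regression": 61,
    "Support Vector Machine": 59,
}
dummy_test_score = 49

# Margin over the Dummy Classifier, in percentage points
margins = {name: score - dummy_test_score for name, score in test_scores.items()}
best_model = max(test_scores, key=test_scores.get)
```

The margins range from 10 points (Support Vector Machine) to 13 points (Naive Bayes).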