This is a course project that analyzes the sentiment of tweets posted on the 2016 U.S. Election Day.
We try to figure out whether social media can help predict the election result.
Due to Twitter's ToS, the published data contains only tweet IDs, so we need to hydrate it (i.e., fetch the full tweet information).
Install requirements:
pip install -r requirements.txt
To hydrate, you first need a CSV file with a single tweet ID in each row. Then edit tweets_fetch.py
to fill in the required information, and run it.
Usage: tweets_fetch.py -i input_file -o output_file -p proxy_address
Options:
-h, --help show this help message and exit
-p str, --proxy=str Proxy address
-i FILE, --in=FILE Input CSV file
-o FILE, --out=FILE Output CSV file
For example, if you have a CSV file called "tweet_id_1.csv" and want the output written to "full_tweets_1.csv", run:
python tweets_fetch.py -i tweet_id_1.csv -o full_tweets_1.csv
The script also supports a proxy. Pass the proxy address with the -p option.
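As a rough illustration of the input side of hydration, the sketch below reads a one-ID-per-row CSV and groups the IDs into batches before they would be sent for lookup. The helper names (read_tweet_ids, batched), the batch size of 100, and the sample IDs are assumptions made up for this example; none of them are taken from tweets_fetch.py.

```python
import csv
import io
from typing import Iterable, Iterator, List


def read_tweet_ids(rows: Iterable[List[str]]) -> List[str]:
    """Collect the first column of every non-empty row as a tweet ID string."""
    ids = []
    for row in rows:
        if row and row[0].strip():
            ids.append(row[0].strip())
    return ids


def batched(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# In-memory stand-in for open("tweet_id_1.csv", newline=""); the IDs are made up.
sample = io.StringIO(
    "796000000000000001\n796000000000000002\n\n796000000000000003\n"
)
tweet_ids = read_tweet_ids(csv.reader(sample))
batches = list(batched(tweet_ids, size=2))
print(batches)
```

IDs are kept as strings so that long numeric IDs are never altered by number parsing, and blank rows are skipped rather than treated as errors.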
In this project, we use https://github.com/aalind0/NLP-Sentiment-Analysis-Twitter, which relies on nltk
and scikit-learn (sklearn)
to train optimized sentiment-analysis models. To run the analysis, do the following:
- Install required packages and data
  - Install sklearn with pip install scikit-learn
  - Install nltk with pip install nltk
  - Open a fresh Python interpreter and run:
    >>> import nltk
    >>> nltk.download('stopwords')
    >>> nltk.download('movie_reviews')
    >>> nltk.download('averaged_perceptron_tagger')
    >>> nltk.download('punkt')
- Run train_classifiers.py to train the models, or use the pretrained models in this repo.
- Run sentiment_calculation_multithread.py (it uses 1/4 of your CPU cores) or sentiment_calculation.py (it uses a single core with one thread) to calculate the sentiment. Use the syntax python xxx.py <index>, replacing <index> with the number of the CSV file. The filename is hardcoded, so you may need to change it yourself.
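The following is a minimal sketch of how a worker count of one quarter of the CPU cores and an index-based filename could be derived. It assumes a thread pool and a filename pattern modeled on the "full_tweets_1.csv" example above; the function names and the placeholder scorer are hypothetical, and the actual scripts may be implemented differently.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def worker_count(total_cores: Optional[int] = None) -> int:
    """Return a quarter of the available cores, but never fewer than one."""
    cores = total_cores if total_cores is not None else (os.cpu_count() or 1)
    return max(1, cores // 4)


def input_filename(index: str) -> str:
    """Build the CSV name from <index> (assumed naming pattern)."""
    return f"full_tweets_{index}.csv"


def score_all(
    texts: List[str], score: Callable[[str], float], workers: int
) -> List[float]:
    """Apply `score` to every text with a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, texts))


def placeholder_score(text: str) -> float:
    """Placeholder only: NOT a sentiment model, it just returns the text length."""
    return float(len(text))


print(input_filename("1"), worker_count())
print(score_all(["good day", "bad"], placeholder_score, worker_count()))
```

Flooring the division and clamping to one keeps the script usable on machines with fewer than four cores.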
The accuracy varies because the training sets are chosen randomly, but it should be stable at around the following values (in percent):
- Original Naive Bayes: 72.9607250755287
- Sklearn Multinomial Naive Bayes: 70.2416918429003
- Sklearn Bernoulli Naive Bayes: 72.35649546827794
- Sklearn Logistic Regression: 70.69486404833837
- Sklearn Linear SVC: 67.97583081570997
- Sklearn SGD classifier: 67.06948640483384
- Voted Classifier: 71.75226586102718
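To illustrate what a voted classifier can look like, here is a small sketch that takes the labels predicted by several classifiers and returns the most common one together with the share of votes it received. This assumes plain majority voting; the function name and the sample votes are made up, and the upstream project's voting rule may differ.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(labels: List[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of votes it received.

    Ties are resolved in favor of the label that appears first in `labels`.
    """
    if not labels:
        raise ValueError("at least one vote is required")
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


# Made-up predictions from five hypothetical classifiers for one tweet.
sample_votes = ["pos", "neg", "pos", "pos", "neg"]
label, agreement = majority_vote(sample_votes)
print(label, agreement)
```

The agreement fraction can serve as a rough confidence signal: a result backed by every classifier is more trustworthy than one decided by a single vote.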
The tweet IDs come from https://github.com/chrisalbon/election_day_2016_twitter