/candidates

Polling the 2016 Presidential Race with Twitter and Natural Language Processing

Primary language: JavaScript. License: GNU General Public License v3.0 (GPL-3.0)

Polling the 2016 Presidential Race with Twitter and Natural Language Processing: candidates.drewatkinson.me

Questions and Answers

Q. How is each graph or map calculated?

A.

  • The map of states takes all positive tweets that were geocoded and uses each candidate's average share of those tweets to determine a rough "vote share" for every state.
  • The popularity graph charts the popularity of each candidate, counted by number of tweets per hour in 5-minute increments.
  • The sentiment graph charts the percentage of tweets about each candidate that are positive, calculated every 5 minutes by a Natural Language Classifier (Naive Bayes).
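The three calculations above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tweet shape (`candidate`, `positive`, `time`, `state`) and the function names `summarizeByWindow` and `stateVoteShare` are assumptions made up for this example, and "vote share" is read here as each candidate's fraction of the positive, geocoded tweets in a state.

```javascript
// Hypothetical tweet shape (an assumption, not taken from the project):
// { candidate: 'A', positive: true, time: <ms since epoch>, state: 'PA' or null }
const WINDOW_MS = 5 * 60 * 1000; // 5-minute windows

// Tweet count and percentage of positive tweets, per candidate and window.
function summarizeByWindow(tweets) {
  const summary = {};
  for (const t of tweets) {
    // Round the timestamp down to the start of its 5-minute window.
    const windowStart = Math.floor(t.time / WINDOW_MS) * WINDOW_MS;
    const key = `${t.candidate}|${windowStart}`;
    if (!summary[key]) {
      summary[key] = { candidate: t.candidate, windowStart, count: 0, positive: 0 };
    }
    summary[key].count += 1;
    if (t.positive) summary[key].positive += 1;
  }
  return Object.values(summary).map((s) => ({
    candidate: s.candidate,
    windowStart: s.windowStart,
    count: s.count,
    positivePct: (100 * s.positive) / s.count,
  }));
}

// Rough "vote share": each candidate's fraction of the positive,
// geocoded tweets in a state.
function stateVoteShare(tweets) {
  const counts = {}; // state -> candidate -> number of positive tweets
  for (const t of tweets) {
    if (!t.positive || !t.state) continue; // skip negative or non-geocoded tweets
    counts[t.state] = counts[t.state] || {};
    counts[t.state][t.candidate] = (counts[t.state][t.candidate] || 0) + 1;
  }
  const shares = {};
  for (const [state, byCandidate] of Object.entries(counts)) {
    const total = Object.values(byCandidate).reduce((a, b) => a + b, 0);
    shares[state] = {};
    for (const [candidate, n] of Object.entries(byCandidate)) {
      shares[state][candidate] = n / total;
    }
  }
  return shares;
}
```

In this sketch, `count` would feed the popularity graph and `positivePct` the sentiment graph, while `stateVoteShare` would feed the map.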

Q. Why did you make this project?

A. I wanted to use my knowledge of Node.js and Express, but also learn to incorporate a SQL database. I have also been getting more interested in Natural Language Processing and Data Science. The natural intersection of all of these things was a project involving the presidential election. I chose Twitter as the source of the data because I think that traditional phone polls will eventually become less reliable as fewer people have landline phones. I think polling general public opinion online is an important step in the political process that we need to figure out, and work has already started with projects like BeHeardPhilly.

Q. How accurate is this data?

A. Probably not at all. Here's what needs to be improved:

  • The Naive Bayes model that classifies tweets as positive or negative is trained on a very small sample taken from a single day.
  • The data should be averaged or fit to a regression to give a better overview of these metrics over longer periods of time.
  • A very small fraction of tweets are encoded with geographic coordinates (less than 1%), so there is a very small sample size to work with.
  • In addition to the above, the state that a tweet belongs to is determined by the shortest distance to the average coordinates of each state. This could be improved easily with a geocoding service or by taking the state's entire area into account.
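The nearest-average-coordinates approach from the last point can be sketched as below. This is only an illustration under stated assumptions: the names `STATE_CENTROIDS`, `haversineKm`, and `nearestState` are hypothetical, the centroid values are rough approximations for three states, and the great-circle (haversine) distance is a choice made for this example since the distance measure the project uses is not stated here.

```javascript
// Hypothetical, approximate state centers, for illustration only.
const STATE_CENTROIDS = {
  PA: { lat: 40.9, lon: -77.8 },
  NY: { lat: 42.9, lon: -75.5 },
  NJ: { lat: 40.2, lon: -74.7 },
};

const EARTH_RADIUS_KM = 6371;
const toRadians = (deg) => (deg * Math.PI) / 180;

// Great-circle distance between two { lat, lon } points, in kilometers.
function haversineKm(a, b) {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Assign a point to the state whose center is closest.
function nearestState(point, centroids = STATE_CENTROIDS) {
  let best = null;
  let bestDistance = Infinity;
  for (const [state, centroid] of Object.entries(centroids)) {
    const distance = haversineKm(point, centroid);
    if (distance < bestDistance) {
      best = state;
      bestDistance = distance;
    }
  }
  return best;
}

// Philadelphia is in Pennsylvania, but its coordinates are closer to the
// New Jersey center used above, so this sketch assigns it to NJ.
console.log(nearestState({ lat: 39.95, lon: -75.17 }));
```

The Philadelphia case shows why this method misassigns tweets near state borders, which is the weakness a geocoding service or a check against each state's actual boundary would address.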

Credits

Front End

Back End

Data Sources

Special Thanks

vitaly-t (Author of pg-promise) for the pull requests and helping me understand PostgreSQL