To answer the question "Which category of emotion is most frequent for Barak Obama?" I've done sentiment Analysis on Barak Obama's Tweet. I've categorised Tweets into 3 catergory as Positive, Negative and Neutral. To categorized the tweets I followed below steps.<BR>
- Tweet is a Positive Tweet if number of Positive words in a Tweet is greater than number of Negative words.<BR>
- Tweet is a Negative tweet if Negative words are greater than Positive words. <BR>
- If number of Positive and negative words are equal in a tweet then its a Neutral Tweet.<BR>
To do this I build a vocabulary of Positive and negative word list from Datasets provided below.
- Positive Words Dataset: https://gist.github.com/mkulakowski2/4289437
- Negative Words Dataset: https://gist.github.com/mkulakowski2/4289441count
- Twitter Sentiment Analysis Dataset: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
Results: Positive Tweets are highest with count of 24331.
To identify the Hypothesis: "Most of Barak Obama's tweets will be regarding healthcare".
I scraped the health related words from "http://www.english-for-students.com/Health-Vocabulary.html" and created healthWordList.
And considered a tweet as a healtcare related tweet if atleast one word in a tweet is also is healthWordList.
Results: The Hypothesis is false as only 6768 Tweets out of 27346 overall tweets are related to Health care.
After completing this I checked the accuracy of categorised tweets using Naive Bayes Algorithm. I trained the model with Sentiment Analysis Dataset with different tweets and checked the accuracy of my categorization of tweets.