Recreation of the paper *Thumbs up? Sentiment Classification using Machine Learning Techniques* by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (EMNLP 2002).
---
Features are constructed as described in the paper:
- unigrams (occurring at least 4 times) [frequencies as values]
- unigrams (occurring at least 4 times) [presence as values]
- bigrams (top ~15,000 by frequency)
- unigrams + bigrams [presence as values]
- POS-tagged unigrams (occurring at least 4 times)
  - the NLTK POS tagger is used for tagging words
- adjectives (all of them)
- top unigrams (the same count as the adjectives, around 3,000)
- unigrams along with their positions
The data-processing code is in
/setup.ipynb
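As a rough illustration (not the notebook's actual code), the frequency, presence, and bigram variants can be sketched with scikit-learn's `CountVectorizer`. Note that `min_df` counts documents rather than total occurrences, so it only approximates the paper's "occurring at least 4 times" cutoff; the documents below are toy stand-ins for the review corpus.

```python
# Hedged sketch of the feature variants using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "a great and moving film",
    "a dull film , dull and boring",
    "great acting , great script",
]

# Frequency-valued unigrams: cell values are raw token counts.
freq_vec = CountVectorizer(ngram_range=(1, 1))
X_freq = freq_vec.fit_transform(docs)

# Presence-valued unigrams: binary=True clips counts to 0/1.
pres_vec = CountVectorizer(binary=True)
X_pres = pres_vec.fit_transform(docs)

# Bigram features, capped at the most frequent 15,000.
bi_vec = CountVectorizer(ngram_range=(2, 2), binary=True, max_features=15000)
X_bi = bi_vec.fit_transform(docs)
```

On the real corpus, the occurrence threshold would be applied before vectorizing (or approximated with `min_df`), and the POS-tagged variant would prepend each token's `nltk.pos_tag` tag before counting.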
---
Three models:
- Multinomial Naive Bayes
- Support Vector Classifier
- Logistic Regression
are trained with 3-fold cross-validation, and mean accuracies are reported.
Code is in
/Sentiment_Analysis.ipynb
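A minimal sketch of the evaluation loop, assuming scikit-learn's `cross_val_score` with its default stratified 3-fold split; the documents and labels below are illustrative stand-ins for the review corpus, and `LinearSVC` stands in for the support vector classifier:

```python
# Hedged sketch: 3-fold cross-validation over the three classifiers.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "great great film", "wonderful acting", "great script",
    "moving and wonderful", "great fun", "wonderful film",
    "dull boring film", "boring script", "dull acting",
    "dull and boring", "boring mess", "dull film",
]
y = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative

# Presence-valued unigram features.
X = CountVectorizer(binary=True).fit_transform(docs)

for name, clf in [
    ("Naive Bayes", MultinomialNB()),
    ("SVC", LinearSVC()),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
]:
    scores = cross_val_score(clf, X, y, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```

Each classifier is fit on two folds and scored on the held-out third; the printed mean over the three folds corresponds to the accuracies in the table below.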
Features | freq. / pres. | # of features | Naive Bayes | SVC | Logistic Regression |
---|---|---|---|---|---|
unigram | freq. | 15521 | 79.86 | 70.29 | 82.07 |
unigram | pres. | 15521 | 82.14 | 83.21 | 84.79 |
uni + bigrams | pres. | 31042 | 83.14 | 81.71 | 85.14 |
bigrams | pres. | 15521 | 81.07 | 75.50 | 78.14 |
POS Tags | pres. | 17380 | 81.50 | 82.36 | 85.00 |
adjectives | pres. | 3065 | 78.29 | 76.71 | 77.50 |
top unigrams | pres. | 3065 | 82.36 | 83.64 | 83.00 |
uni positions | pres. | 21744 | 81.71 | 76.93 | 79.50 |
*Average three-fold cross-validation accuracies, in percent.*