Recreation of the paper *Thumbs up? Sentiment Classification using Machine Learning Techniques* by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (EMNLP 2002).
---
Features are constructed as described in the paper:
- unigrams (occurring at least 4 times) [frequencies as values]
- unigrams (occurring at least 4 times) [presence as values]
- bigrams (top ~15,000 by frequency)
- unigrams + bigrams [presence as values]
- POS-tagged unigrams (occurring at least 4 times)
  - the NLTK POS tagger is used for tagging words
- adjectives (all of them)
- top unigrams (the same count as the adjectives, around 3,000)
- unigrams along with their positions
The data-processing code is in
/setup.ipynb
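As a rough illustration (not the notebook's actual code), the frequency, presence, and bigram variants can be sketched with scikit-learn's `CountVectorizer`. Note that `min_df` counts documents rather than total occurrences, so it only approximates the paper's "occurring at least 4 times" cutoff; the documents below are toy stand-ins for the review corpus.

```python
# Hedged sketch of the feature variants using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "a great and moving film",
    "a dull film , dull and boring",
    "great acting , great script",
]

# Frequency-valued unigrams: cell values are raw token counts.
freq_vec = CountVectorizer(ngram_range=(1, 1))
X_freq = freq_vec.fit_transform(docs)

# Presence-valued unigrams: binary=True clips counts to 0/1.
pres_vec = CountVectorizer(binary=True)
X_pres = pres_vec.fit_transform(docs)

# Bigram features, capped at the most frequent 15,000.
bi_vec = CountVectorizer(ngram_range=(2, 2), binary=True, max_features=15000)
X_bi = bi_vec.fit_transform(docs)
```

On the real corpus, the occurrence threshold would be applied before vectorizing (or approximated with `min_df`), and the POS-tagged variant would prepend each token's `nltk.pos_tag` tag before counting.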
---
Three models:
- Multinomial Naive Bayes
- Support Vector Classifier
- Logistic Regression
are trained with 3-fold cross-validation, and mean accuracies are reported.
Code is in
/Sentiment_Analysis.ipynb
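A minimal sketch of the evaluation loop, assuming scikit-learn's `cross_val_score` with its default stratified 3-fold split; the documents and labels below are illustrative stand-ins for the review corpus, and `LinearSVC` stands in for the support vector classifier:

```python
# Hedged sketch: 3-fold cross-validation over the three classifiers.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "great great film", "wonderful acting", "great script",
    "moving and wonderful", "great fun", "wonderful film",
    "dull boring film", "boring script", "dull acting",
    "dull and boring", "boring mess", "dull film",
]
y = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative

# Presence-valued unigram features.
X = CountVectorizer(binary=True).fit_transform(docs)

for name, clf in [
    ("Naive Bayes", MultinomialNB()),
    ("SVC", LinearSVC()),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
]:
    scores = cross_val_score(clf, X, y, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```

Each classifier is fit on two folds and scored on the held-out third; the printed mean over the three folds corresponds to the accuracies in the table below.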
Features | freq. / pres. | # of features | Naive Bayes | SVC | Logistic Regression |
---|---|---|---|---|---|
unigram | freq. | 15521 | 79.86 | 70.29 | 82.07 |
unigram | pres. | 15521 | 82.14 | 83.21 | 84.79 |
uni + bigrams | pres. | 31042 | 83.14 | 81.71 | 85.14 |
bigrams | pres. | 15521 | 81.07 | 75.50 | 78.14 |
POS Tags | pres. | 17380 | 81.50 | 82.36 | 85.00 |
adjectives | pres. | 3065 | 78.29 | 76.71 | 77.50 |
top unigrams | pres. | 3065 | 82.36 | 83.64 | 83.00 |
uni positions | pres. | 21744 | 81.71 | 76.93 | 79.50 |
*Average three-fold cross-validation accuracies, in percent.*