The relation between the buy/don't-buy suggestions and the review scores was studied; the conclusion was to use the suggestions as sentiment labels and to drop the scores column.
Corpus source: https://github.com/minasmz/Sentiment-Analysis-with-LSTM-in-Persian
Per-suggestion statistics of the review score (0-100):

Suggestion | Sentiment Label | Count | Min Score | Max Score | Mean Score |
---|---|---|---|---|---|
1 | POSITIVE | 2382 | 0 | 100 | 83 |
2 | NEGATIVE | 419 | 0 | 100 | 63 |
3 | NEUTRAL | 460 | 0 | 100 | 45 |
- review_sentiment.csv was made from totalReviewWithSuggestion.csv
- the corpus contained 12289 tokens before any normalization
- a Naive Bayes classifier was trained with cross-validation (n_splits=7, test_size=0.25): mean score 0.7389705882352942 (see the sketch after this list)
- the stopwords.csv file was made from STOPWORDS
- naive_bayes_model.pkl was saved from the trained classifier
- the corpus vocabulary was written to vocab.csv
- normal_review_sentiment.csv was made from review_sentiment.csv after normalizing and removing some stop words; it keeps only the 1500 most frequent words in the corpus
- the classifier was retrained on the normalized data with the same cross-validation (n_splits=7, test_size=0.25): mean score 0.7389705882352942, i.e. unchanged
- naive_bayes_model.pkl was rebuilt from this classifier
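A minimal sketch of this training step, assuming scikit-learn and pandas. The column names (`review`, `sentiment`), the `MultinomialNB` variant, and the use of `ShuffleSplit` (the scikit-learn splitter that takes both `n_splits` and `test_size`) are assumptions, not necessarily what the repository does:

```python
# Hedged sketch of the cross-validated Naive Bayes training step.
# Assumptions: column names "review"/"sentiment", MultinomialNB as the
# Naive Bayes variant, ShuffleSplit as the splitter (it takes both
# n_splits and test_size, matching the parameters logged above).
import pickle

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("review_sentiment.csv")
X = CountVectorizer().fit_transform(df["review"])  # bag-of-words features
y = df["sentiment"]                                # POSITIVE / NEGATIVE / NEUTRAL

cv = ShuffleSplit(n_splits=7, test_size=0.25, random_state=0)
scores = cross_val_score(MultinomialNB(), X, y, cv=cv)
print(scores)         # one score per round, as in the rounds table below
print(scores.mean())  # the log above reports ~0.7390

# fit on all data and persist as naive_bayes_model.pkl
model = MultinomialNB().fit(X, y)
with open("naive_bayes_model.pkl", "wb") as f:
    pickle.dump(model, f)
```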
Evaluation of the Naive Bayes classifier:

- Accuracy: 0.7389705882352942

Metric | Micro | Macro |
---|---|---|
Precision | 0.7389705882352942 | 0.24632352941176472 |
Recall | 0.7389705882352942 | 0.3333333333333333 |
F1 | 0.7389705882352943 | 0.2832694901118499 |
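These micro/macro figures can be reproduced with scikit-learn's metric functions. Note that a macro recall of exactly 1/3 over three classes, with micro precision equal to accuracy, suggests the classifier is predicting only the majority POSITIVE class here. The labels below are toy stand-ins, not the project's data:

```python
# Computing micro- and macro-averaged precision/recall/F1 with sklearn.
# y_true / y_pred are toy stand-ins for the held-out labels and predictions.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEUTRAL"]
y_pred = ["POSITIVE"] * 4   # a majority-class predictor, as suspected above

print("accuracy:", accuracy_score(y_true, y_pred))
for avg in ("micro", "macro"):
    print(avg,
          precision_score(y_true, y_pred, average=avg, zero_division=0),
          recall_score(y_true, y_pred, average=avg, zero_division=0),
          f1_score(y_true, y_pred, average=avg, zero_division=0))
```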
Cross-validation scores per round:

Round | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Score | 0.75367647 | 0.70710784 | 0.74632353 | 0.7377451 | 0.73897059 | 0.75490196 | 0.73406863 |
Changing the amount of data changed the scores, but normalization and stop-word removal left them unchanged.
About 25 wrong predictions were examined; some of their labels turned out to be incorrect and were corrected.
- a logistic regression classifier was trained: mean score 0.7762605042016805 on the data and 0.7764355742296919 on the normalized data
- after the label correction, the Naive Bayes classifier improved: mean score 0.7400210084033614 on both the data and the normalized data, with 875 false predictions
Data \ Classifier | Logistic Regression | Naive Bayes |
---|---|---|
Data | 0.7762605042016805 | 0.7400210084033614 |
Normalized Data | 0.7764355742296919 | 0.7400210084033614 |
- the logistic regression classifier was run on the data to collect false predictions (see the sketch after the tables below)
- 161 false predictions were examined and their labels reviewed; in the majority of cases the labeling was wrong, so it was corrected
- the classifiers were retrained on the data and the normalized data with the corrected labels
Data \ Classifier | Logistic Regression | Naive Bayes |
---|---|---|
Data | 0.8040966386554621 | 0.769782913165266 |
Normalized Data | 0.8063725490196079 | 0.769782913165266 |
False-prediction counts when testing each trained classifier on each data set:

Classifier \ Tested on | Data | Normalized Data |
---|---|---|
Logistic Regression trained with data | 109 | 576 |
Logistic Regression trained with normalized data | 560 | 213 |
Naive Bayes trained with data | 776 | 776 |
Naive Bayes trained with normalized data | 776 | 776 |
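A sketch of how the false predictions could be collected for manual relabeling, assuming scikit-learn; the file and column names are carried over from the earlier sketch, and `false_predictions.csv` is a hypothetical output name:

```python
# Collect misclassified reviews so their labels can be reviewed by hand.
# Assumptions: same file/column names as the earlier sketch; the output
# file name false_predictions.csv is hypothetical.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("review_sentiment.csv")
X = CountVectorizer().fit_transform(df["review"])
y = df["sentiment"].to_numpy()

clf = LogisticRegression(max_iter=1000).fit(X, y)
wrong = np.flatnonzero(clf.predict(X) != y)   # indices of false predictions
print(len(wrong), "false predictions")

# dump the misclassified rows for manual label review
df.iloc[wrong].to_csv("false_predictions.csv", index=False)
```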
Train and test set scores for four classifier runs (the Gaussian NB and Complement NB rows match the summary table below; the other two runs are unlabeled in the log):

Classifier | Train Set Score | Test Set Score |
---|---|---|
Gaussian NB | 0.8248125426039536 | 0.6330275229357798 |
(unlabeled run) | 0.5170415814587593 | 0.4036697247706422 |
Complement NB | 0.8057259713701431 | 0.7767584097859327 |
(unlabeled run) | 0.8265167007498296 | 0.7033639143730887 |
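A sketch of the Gaussian vs. Complement Naive Bayes comparison, assuming scikit-learn; `GaussianNB` does not accept sparse matrices, hence the `.toarray()` calls (same assumed file/column names as before):

```python
# Train/test comparison of Naive Bayes variants. GaussianNB needs dense
# input, so the sparse bag-of-words matrix is densified with .toarray().
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, GaussianNB

df = pd.read_csv("review_sentiment.csv")
X = CountVectorizer().fit_transform(df["review"])
y = df["sentiment"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

cnb = ComplementNB().fit(X_tr, y_tr)            # works directly on sparse input
print("Complement NB:", cnb.score(X_tr, y_tr), cnb.score(X_te, y_te))

gnb = GaussianNB().fit(X_tr.toarray(), y_tr)    # dense input required
print("Gaussian NB:", gnb.score(X_tr.toarray(), y_tr), gnb.score(X_te.toarray(), y_te))
```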
- an SVM with a linear kernel was trained: time 0:02:45.553099, mean score 0.8221288515406163 (0.8091736694677872 on the normalized data)
- an SVM with an RBF kernel was trained: time 0:03:44.719173, mean score 0.8296568627450981
- an SVM with a polynomial kernel was trained: time 0:04:07.382019, mean score 0.7949929971988795
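A sketch of the three SVM runs, assuming scikit-learn's `SVC` and the same cross-validation protocol as before (file/column names as in the earlier sketches):

```python
# Timing and scoring SVC with the three kernels from the log above.
from datetime import datetime

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

df = pd.read_csv("review_sentiment.csv")
X = CountVectorizer().fit_transform(df["review"])
y = df["sentiment"]

cv = ShuffleSplit(n_splits=7, test_size=0.25, random_state=0)
for kernel in ("linear", "rbf", "poly"):
    start = datetime.now()
    mean = cross_val_score(SVC(kernel=kernel), X, y, cv=cv).mean()
    print(kernel, datetime.now() - start, mean)   # kernel, duration, mean score
```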
- setting max_df=0.5 on the vectorizer used for Naive Bayes and Logistic Regression kept the Naive Bayes score unchanged and decreased the Logistic Regression score by 0.00017507002801120386
- In Naive Bayes V2, a word-level 1-gram TF-IDF vectorizer was used.
- In Logistic Regression V2, a word-level 1-gram TF-IDF vectorizer was used.
- In Logistic Regression V3, a word-level (1-5)-gram TF-IDF vectorizer was used.
- In Logistic Regression V4, a word-level (3-5)-gram TF-IDF vectorizer was used.
- In Logistic Regression V5, a character-level (3-5)-gram TF-IDF vectorizer was used (see the sketch after this list).
- In Logistic Regression V6, a character-level (3-15)-gram TF-IDF vectorizer was used.
- In Logistic Regression V7, a character-level (3-10)-gram TF-IDF vectorizer was used.
- In Logistic Regression V8, a character-level (3-7)-gram TF-IDF vectorizer was used.
- In Logistic Regression V9, a character-level (3-6)-gram TF-IDF vectorizer was used.
- In Logistic Regression V10, a character-level (2-5)-gram TF-IDF vectorizer was used.
- In Logistic Regression V11, a character-level (1-5)-gram TF-IDF vectorizer was used.
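A sketch of the best-scoring V5 configuration (character (3-5)-gram TF-IDF feeding Logistic Regression), assuming scikit-learn; the other versions differ only in `analyzer` and `ngram_range`, and `max_df=0.5` from the earlier experiment could be passed to the vectorizer as well:

```python
# Character (3-5)-gram TF-IDF + Logistic Regression (the "V5" setup).
# Other versions only change analyzer ("word"/"char") and ngram_range.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("review_sentiment.csv")   # assumed file/column names

pipe = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),  # max_df=0.5 optional
    LogisticRegression(max_iter=1000),
)
cv = ShuffleSplit(n_splits=7, test_size=0.25, random_state=0)
print(cross_val_score(pipe, df["review"], df["sentiment"], cv=cv).mean())
```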
Classifier | Time | Mean Score |
---|---|---|
Naive Bayes | 0:00:03.901275 | 0.769782913165266 |
Naive Bayes V2 | 0:00:00.165345 | 0.769782913165266 |
Logistic Regression | 0:00:21.461250 | 0.8039215686274509 |
Logistic Regression V2 | 0:00:00.862117 | 0.7729341736694677 |
Logistic Regression V2 (on Normalized data) | 0:00:00.278024 | 0.7823879551820728 |
Logistic Regression V3 | 0:00:15.747416 | 0.769782913165266 |
Logistic Regression V4 | 0:00:13.402116 | 0.7710084033613445 |
Logistic Regression V5 | 0:00:15.329675 | 0.8249299719887955 * |
Logistic Regression V6 | 0:04:38.927389 | 0.8044467787114845 |
Logistic Regression V7 | 0:02:03.405992 | 0.810049019607843 |
Logistic Regression V8 | 0:00:45.646399 | 0.8182773109243698 |
Logistic Regression V9 | 0:00:24.831395 | 0.8219537815126049 |
Logistic Regression V10 | 0:00:16.128916 | 0.8247549019607842 |
Logistic Regression V11 | 0:00:17.159857 | 0.8224789915966386 |
SVM Linear kernel | 0:02:45.553099 | 0.8221288515406163 |
SVM RBF kernel | 0:03:44.719173 | 0.8296568627450981 |
SVM Polynomial kernel | 0:04:07.382019 | 0.7949929971988795 |
Classifier | Time | Train Set Score | Test Set Score |
---|---|---|---|
Gaussian NB | 0:00:02.659921 | 0.8248125426039536 | 0.6330275229357798 |
Complement NB | 0:00:00.469990 | 0.8057259713701431 | 0.7767584097859327 |
For this classification task, with the same data sets and our vectorization methods, we conclude:

- Normalizing and removing stop words did not change the scores meaningfully.
- SVM training takes significantly longer than Naive Bayes and Logistic Regression.
- Using n-grams for vectorization is beneficial for this task.
- For Naive Bayes, the amount of data has the greatest impact on the scores.