Assignment 6 training results discrepancy
In Assignment 6 (SVM), my results differ from the MATLAB code from class in three places: in section 2.3, the accuracies are not exactly the same; in section 2.4, the top predictors I got are not exactly the same; and in section 2.5, my classifier marks 'emailSample1' as spam, but the MATLAB code does not.
Suspected causes:
- Small differences in the email pre-processing step?
- The kernel used for SVM learning?
Root cause found: the value of the regularization parameter C was not consistent with the MATLAB code, which caused the discrepancies in training and validation accuracy.
Fix: set C=0.1, the same as in the MATLAB code (previously C=1). The training results are now almost identical: 14 of the 15 top words are the same. I also noticed that changing C from 1 to 0.1 decreases the training set accuracy a bit, while the cross-validation set accuracy increases. This makes sense because a smaller C adds more regularization to the training, which helps prevent overfitting.
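To see why a smaller C means more regularization, it may help to write out the SVM objective in the form the course uses. This is a sketch for reference; the symbols (θ, m, n, cost_0, cost_1) follow the course lecture notation and do not appear in this issue, so treat them as assumptions:

```latex
\min_{\theta} \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
```

C multiplies the data-fit term, so it plays roughly the role of 1/λ in regularized logistic regression. Lowering C from 1 to 0.1 gives the penalty on θ relatively more weight, which is consistent with the slightly lower training accuracy and higher cross-validation accuracy reported above.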
Also worth mentioning:
I tweaked the processEmail() function, especially the Porter stemmer, to make it behave almost exactly the same as in the MATLAB code:
- change the stemmer mode to 'ORIGINAL_ALGORITHM' (the default is 'NLTK_EXTENSIONS')
- ignore words of length 1 or 2, to be consistent with the MATLAB code
However, this does not affect the training results above, because the training dataset is already preprocessed. There is no need to run the function I wrote on the dataset.
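For reference, the short-word handling can be sketched roughly as below. This is not the actual processEmail() code. It assumes "ignore" means that words of length 1 or 2 bypass the stemmer and pass through unchanged (if the intent is to drop them, the else branch would change). `stem_tokens` and `toy_stem` are hypothetical names used only for illustration.

```python
def stem_tokens(tokens, stem, min_len=3):
    """Apply `stem` only to tokens with at least `min_len` characters.

    Shorter tokens are returned unchanged (assumed reading of "ignore").
    """
    return [stem(t) if len(t) >= min_len else t for t in tokens]


def toy_stem(word):
    # Hypothetical stand-in for a real Porter stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


# "as", "is" and "a" are too short, so they are left untouched.
print(stem_tokens(["as", "is", "emails", "a", "discounts"], toy_stem))
# → ['as', 'is', 'email', 'a', 'discount']
```

With NLTK installed, `PorterStemmer(mode=PorterStemmer.ORIGINAL_ALGORITHM).stem` could be passed as `stem` in place of the toy function.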