Assignment 6 training results discrepancy

Question

Closed this issue 6 years ago · 1 comments

In assignment 6 (SVM), my results differ from the MATLAB code from class in three places: in section 2.3, the accuracies are not exactly the same; in section 2.4, the top predictors I get are not exactly the same; and in section 2.5, my classifier marks 'emailSample1' as spam, whereas MATLAB does not.

Suspected causes:

Small differences in the email pre-processing step?
The kernel used for SVM training?

Answer 1 · 2018-09-08T22:02:24.000Z

Root cause found: the value of the regularization parameter C was not consistent with the MATLAB code, which caused the discrepancies in training and validation accuracy.
Fix: set C=0.1, the same as in the MATLAB code (previously C=1). The training results now match almost completely: 14 of the 15 top words are the same. I also noticed that when C changes from 1 to 0.1, the training set accuracy decreases slightly while the cross-validation set accuracy increases. This makes sense, because a smaller C applies stronger regularization to the training, which helps prevent overfitting.
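For anyone who wants to check the effect of C themselves, here is a minimal sketch. It assumes scikit-learn's SVC with a linear kernel (the thread does not say which library or kernel was actually used) and runs on synthetic data rather than the assignment's spam dataset. The helper names fit_and_score and top_feature_indices are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the spam features (NOT the assignment data):
# 200 samples, 50 binary "word present" indicators, noisy linear labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)
w_true = rng.normal(size=50)
scores = X @ w_true
y = (scores + rng.normal(scale=2.0, size=200) > np.median(scores)).astype(int)

X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]


def fit_and_score(C):
    """Train a linear SVM with the given C; return (model, train acc, val acc)."""
    clf = SVC(C=C, kernel="linear")  # linear kernel is an assumption
    clf.fit(X_train, y_train)
    return clf, clf.score(X_train, y_train), clf.score(X_val, y_val)


def top_feature_indices(clf, k=15):
    """Indices of the k features with the largest positive weights."""
    weights = clf.coef_.ravel()  # coef_ is only available for a linear kernel
    return np.argsort(weights)[::-1][:k]


for C in (1.0, 0.1):
    clf, train_acc, val_acc = fit_and_score(C)
    print(f"C={C}: train={train_acc:.3f}, val={val_acc:.3f}")
    print("  top features:", top_feature_indices(clf, k=5))
```

On the real dataset, comparing the two printed rows and the two top-feature lists is a quick way to see whether a mismatch in C alone explains the difference from the MATLAB output.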
Also worth mentioning:
I also tweaked the processEmail() function, especially the Porter stemmer, to make it behave almost exactly the same as in the MATLAB code:

Change the stemmer mode to 'ORIGINAL_ALGORITHM' (the default is 'NLTK_EXTENSIONS').
Ignore words of length 1 or 2, to be consistent with the MATLAB code.
However, this does not affect the training results, because the training dataset is already preprocessed, so there is no need to run my function on it.
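The two stemmer tweaks can be sketched roughly as follows, assuming NLTK's PorterStemmer (which the mode names suggest). The helper name stem_token is hypothetical, and the sketch reads "ignore" as "leave words of length 1 or 2 unstemmed"; if the MATLAB code drops such words instead, the early return would need to change.

```python
from nltk.stem.porter import PorterStemmer

# Use the algorithm as originally published, without NLTK's extensions.
stemmer = PorterStemmer(mode=PorterStemmer.ORIGINAL_ALGORITHM)


def stem_token(word):
    """Stem one token, passing very short words through unchanged."""
    # Assumption: "ignore" means words of length 1 or 2 are not stemmed.
    if len(word) <= 2:
        return word
    return stemmer.stem(word)


for token in ["is", "running", "emails"]:
    print(token, "->", stem_token(token))
```

The length check matters in this mode because the original algorithm would otherwise strip the trailing "s" from a short word such as "is".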