oeyh/NN

Assignment 6 training results discrepancy

Closed this issue · 1 comment

oeyh commented

In assignment 6 (SVM), my results differ from the MATLAB code from class in three places: in section 2.3, the accuracies are not exactly the same; in section 2.4, the top predictors I got are not exactly the same; and in section 2.5, my classifier marks 'emailSample1' as spam, but the MATLAB version does not.

Suspected causes:

  • Small differences in the email pre-processing step?
  • The kernel used for SVM training?

oeyh commented

  • Root cause found: the value of the regularization parameter C was not consistent with the MATLAB code, which caused the discrepancies in training and validation accuracy.

  • Fix: set C=0.1, the same as in the MATLAB code (previously C=1). The training results are now almost identical, and 14 of the 15 top words match. I also noticed that changing C from 1 to 0.1 decreases the training set accuracy slightly, while the cross validation set accuracy increases. This makes sense: a smaller C applies more regularization during training, which helps prevent overfitting.

  • Also worth mentioning:
    I tweaked the processEmail() function, especially the Porter stemmer, to make it behave almost exactly like the MATLAB code:

  1. Changed the stemmer mode to 'ORIGINAL_ALGORITHM' (the default is 'NLTK_EXTENSIONS').
  2. Ignored words of length 1 or 2, to be consistent with the MATLAB code.
    However, this does not affect the later training results, because the training dataset is already preprocessed, so there is no need to run my function on it.
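
For anyone who wants to reproduce the effect of C described in the fix, here is a minimal sketch. It is not the assignment code: it assumes scikit-learn, a linear kernel, and synthetic stand-in data from `make_classification`, and the helper name `compare_c` is made up for illustration. The direction and size of the accuracy change depend on the data, so no particular numbers should be expected.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def compare_c(c_values=(1.0, 0.1), seed=0):
    """Return {C: (train_accuracy, validation_accuracy)} for a linear SVM."""
    # Assumption: synthetic data stands in for the preprocessed spam features.
    X, y = make_classification(
        n_samples=600, n_features=50, n_informative=10,
        flip_y=0.05, random_state=seed,
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    results = {}
    for c in c_values:
        # Smaller C means a stronger regularization penalty on the weights.
        clf = SVC(kernel="linear", C=c).fit(X_train, y_train)
        results[c] = (clf.score(X_train, y_train), clf.score(X_val, y_val))
    return results


if __name__ == "__main__":
    for c, (train_acc, val_acc) in compare_c().items():
        print(f"C={c}: train={train_acc:.3f}  validation={val_acc:.3f}")
```

Swapping in the real training and cross validation sets in place of the synthetic data should make the comparison between C=1 and C=0.1 directly checkable against the MATLAB output.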
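
The two stemmer tweaks above can be sketched roughly as follows. This is not the actual processEmail() code: `make_original_porter` and `stem_tokens` are hypothetical helper names, the tokenization is simplified, and "ignore" is read here as "leave words of length 1 or 2 unstemmed" (dropping them instead would be a one-line change).

```python
import re


def make_original_porter():
    """Return a stem function using NLTK's Porter stemmer in ORIGINAL_ALGORITHM mode."""
    # Imported lazily so the filtering logic below can be used without NLTK.
    from nltk.stem.porter import PorterStemmer

    # NLTK's default mode is NLTK_EXTENSIONS; ORIGINAL_ALGORITHM is the plain Porter algorithm.
    return PorterStemmer(mode=PorterStemmer.ORIGINAL_ALGORITHM).stem


def stem_tokens(text, stem, min_len=3):
    """Lowercase, split on non-alphanumeric characters, and stem the longer words.

    Assumption: words shorter than min_len (length 1 or 2 by default)
    are passed through unchanged rather than stemmed.
    """
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    return [stem(t) if len(t) >= min_len else t for t in tokens]
```

Typical use would be `stem_tokens(email_text, make_original_porter())`, where `email_text` is a placeholder for the raw email body after the other pre-processing steps.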