Sotera/webpageclassifier

Test whether max(scores) would outperform sequential rules.

ctwardy opened this issue · 3 comments

Also check whether the two "short-circuit" rules are doing well.

NPR article would improve: [fo: 0.46, ne: 0.66, cl: 0.00, sh: 0.33]
Amazon would improve: [fo: 0.18, ne: 0.51, cl: 0.54, sh: 0.73] ---> news <---

Forums are having trouble now; might need to revert some changes:

http://grahamcluley.com
	[fo: 0.32, ne: 0.44, cl: 0.20, sh: 0.00]
	---> news <--- 

http://erratasec.com
	[fo: 0.00, ne: 0.00, cl: 0.00, sh: 0.00]
	---> undecided <--- 

http://krebsonsecurity.com
	[fo: 0.21, ne: 0.00, cl: 0.53, sh: 0.62]
	---> shopping <--- 

http://joelonsoftware.com
	[fo: 0.31, ne: 0.27, cl: 0.14, sh: 0.33]
	---> undecided <--- 

http://schneier.com
	[fo: 0.26, ne: 0.27, cl: 0.44, sh: 0.33]
	---> classified <--- 

http://troyhunt.com
	[fo: 0.26, ne: 0.00, cl: 0.19, sh: 0.00]
	---> undecided <--- 

Try creating a branch that uses max() rather than sequence, and just compare.
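A minimal sketch of what the max() variant could look like, run on a few of the score vectors quoted above. The function name `classify_by_max`, the `threshold` parameter, and the mapping from the two-letter keys to labels are assumptions made for illustration; the existing sequential rules are not reproduced here.

```python
# Hypothetical mapping from the abbreviated score keys used in this issue
# to the printed labels (assumed, not taken from the classifier source).
LABELS = {"fo": "forum", "ne": "news", "cl": "classified", "sh": "shopping"}


def classify_by_max(scores, threshold=0.0):
    """Return the label with the highest score.

    Falls back to 'undecided' when the best score does not exceed
    `threshold` (an assumed cutoff; 0.0 only catches all-zero pages).
    """
    best_key = max(scores, key=scores.get)
    if scores[best_key] <= threshold:
        return "undecided"
    return LABELS[best_key]


# Score vectors copied from the examples in this issue.
examples = {
    "NPR article": {"fo": 0.46, "ne": 0.66, "cl": 0.00, "sh": 0.33},
    "Amazon": {"fo": 0.18, "ne": 0.51, "cl": 0.54, "sh": 0.73},
    "http://erratasec.com": {"fo": 0.00, "ne": 0.00, "cl": 0.00, "sh": 0.00},
    "http://krebsonsecurity.com": {"fo": 0.21, "ne": 0.00, "cl": 0.53, "sh": 0.62},
}

for name, scores in examples.items():
    print(f"{name}: {classify_by_max(scores)}")
```

On these inputs max() picks news for NPR and shopping for Amazon, but krebsonsecurity.com still comes out as shopping, so max() alone would not fix that page.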

Also, maybe remove MEMEX "backpage" escort ads from test: they just display an error message now. (Actually might be worth manually viewing all the pages in the test set.)

Test code should also generate per-class stats and a confusion matrix.
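One way the test code might produce both outputs is with scikit-learn's `classification_report` and `confusion_matrix`. This is only a sketch: the helper name `report` and the toy labels and predictions below are made up for illustration and are not from the real test set.

```python
from sklearn.metrics import classification_report, confusion_matrix


def report(y_true, y_pred, labels):
    """Return (per-class stats as text, confusion matrix as an array).

    Passing `labels` explicitly fixes the row/column order and keeps
    classes that never occur as a true label (e.g. UNDEFINED) in the output.
    """
    stats = classification_report(y_true, y_pred, labels=labels, zero_division=0)
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    return stats, matrix


# Toy data, invented for this example.
labels = ["UNDEFINED", "blog", "forum", "news"]
y_true = ["blog", "blog", "forum", "news", "news"]
y_pred = ["blog", "UNDEFINED", "forum", "news", "blog"]

stats, matrix = report(y_true, y_pred, labels)
print(stats)
print("Confusion Matrix:")
for label, row in zip(labels, matrix):
    print(f"{label:>12}: {row}")
```

In the matrix, rows are true classes and columns are predicted classes, both in the order given by `labels`.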

  • Wrapped for scikit and tested on ~500 pages.
  • blog has excellent precision and OK recall; f1 is 78%.
  • forum is 73%, also OK.
  • classified is only 10% though; it usually comes out UNDEFINED.
  • search_engine is 0% because the classifier was not trained on that class.

So closing this; can open a new test/improvement ticket if desired.

               precision    recall  f1-score   support

    UNDEFINED       0.00      0.00      0.00         0
         blog       0.98      0.65      0.78        62
   classified       0.26      0.06      0.10        82
        forum       0.97      0.58      0.73        53
         news       0.84      0.64      0.72        88
search_engine       0.00      0.00      0.00        66
     shopping       0.35      0.70      0.47        53
         wiki       0.96      0.73      0.83        64

  avg / total       0.61      0.46      0.51       468

Confusion Matrix (rows: true class; columns: predicted class, in the same order as the rows):
           UNDEFINED: [0 0 0 0 0 0 0 0]
                blog: [11 40  1  0  3  0  7  0]
          classified: [48  0  5  0  0  0 28  1]
               forum: [14  0  1 31  2  0  5  0]
                news: [12  1  2  1 56  0 16  0]
       search_engine: [45  0  5  0  6  0  9  1]
            shopping: [12  0  4  0  0  0 37  0]
                wiki: [12  0  1  0  0  0  4 47]
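As a cross-check, the per-class numbers in the report can be recomputed from the matrix itself, reading each row as a true class and each column as a predicted class. The sketch below uses only the values printed above; the helper name `per_class_stats` is hypothetical.

```python
LABELS = ["UNDEFINED", "blog", "classified", "forum",
          "news", "search_engine", "shopping", "wiki"]

# Rows = true class, columns = predicted class, copied from the matrix above.
MATRIX = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [11, 40, 1, 0, 3, 0, 7, 0],
    [48, 0, 5, 0, 0, 0, 28, 1],
    [14, 0, 1, 31, 2, 0, 5, 0],
    [12, 1, 2, 1, 56, 0, 16, 0],
    [45, 0, 5, 0, 6, 0, 9, 1],
    [12, 0, 4, 0, 0, 0, 37, 0],
    [12, 0, 1, 0, 0, 0, 4, 47],
]


def per_class_stats(matrix, labels):
    """Map each label to (precision, recall, f1, support)."""
    stats = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                    # pages whose true class is `label`
        predicted = sum(row[i] for row in matrix)   # pages predicted as `label`
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats[label] = (precision, recall, f1, support)
    return stats


stats = per_class_stats(MATRIX, LABELS)
for label, (p, r, f1, n) in stats.items():
    print(f"{label:>14} {p:9.2f} {r:9.2f} {f1:9.2f} {n:9d}")

# Column sum for UNDEFINED: how many pages the classifier left unlabeled.
undefined_total = sum(row[0] for row in MATRIX)
print("predicted UNDEFINED:", undefined_total)
```

The first column sums to 154, so about a third of the 468 pages end up UNDEFINED; 48 of those are classified pages and 45 are search_engine pages.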