Make it possible to directly compare two training runs
biancadanforth opened this issue · 3 comments
fathom-train outputs a table of each sample's confidence score after a training run[1]. If I want to determine whether a change has produced a net improvement in my ruleset's accuracy, I currently have to manually compare the outputs of two separate training runs. It would be very helpful if fathom-train, or a new tool, performed these comparisons for me and told me the deltas for all the confidences between the two runs.
Edit: A prerequisite for this may be https://github.com/mozilla/fathom-fox/issues/64.
[1]: Example fathom-train output currently:
(venv) bdanforth ~/Projects/fathom-smoot/articles (master) $ fathom-train -l 0.01 -i 5000 -c V24_paragraph_with_validation -a vectors/vectors_validation_paragraph.json -s vectors/vectors_training_paragraph.json
[#########---------------------------] 25% 00:00:03
Stopping early at iteration 1273, just before validation error rose.
{"coeffs": [
["pElementHasListItemAncestor", -3.24098539352417],
["hasLongTextContent", 5.606229305267334],
["containsElipsisAtEndOfText", -0.09499679505825043],
["classNameIncludesCaption", -2.4309158325195312]
],
"bias": -4.5716118812561035}
Training accuracy per tag: 0.96018 95% CI: (0.95336, 0.96699)
FP: 0.02750 95% CI: (0.00975, 0.04525)
FN: 0.01233 95% CI: (0.00827, 0.01639)
Precision: 0.76738 Recall: 0.88037
Validation accuracy per tag: 0.94430 95% CI: (0.93409, 0.95451)
FP: 0.03559 95% CI: (0.00574, 0.06543)
FN: 0.02011 95% CI: (0.01361, 0.02662)
Precision: 0.61236 Recall: 0.73649
Training accuracy per page: 0.60000 95% CI: (0.29636, 0.90364)
Validation accuracy per page: 0.60000 95% CI: (0.29636, 0.90364)
Training per-page results:
success on 45.html. Confidence: 0.36664742 No target nodes. Assumed negative sample.
success on 13.html. Confidence: 0.73781013
success on 54.html. Confidence: 0.73781013
failure on 33.html. Confidence: 0.50709236 There were no right choices, but highest-scorer had high confidence anyway.
success on 47.html. Confidence: 0.04786294 No target nodes. Assumed negative sample.
failure on 19.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 5: 0.73781013
failure on 49.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 3: 0.73781013
failure on 36.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 1: 0.73781013
success on 35.html. Confidence: 0.49511391 No target nodes. Assumed negative sample.
success on 26.html. Confidence: 0.73781013
Validation per-page results:
success on 37.html. Confidence: 0.22838420 No target nodes. Assumed negative sample.
success on 45.html. Confidence: 0.73781013
failure on 16.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 1: 0.40638763
success on 27.html. Confidence: 0.73781013
failure on 58.html. Confidence: 0.73781013 There were no right choices, but highest-scorer had high confidence anyway.
success on 78.html. Confidence: 0.73781013
success on 89.html. Confidence: 0.73781013
failure on 86.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 1: 0.73781013
success on 20.html. Confidence: 0.73781013
failure on 19.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 2: 0.73781013
I really like this. If the user saves the output of fathom-train to a file, we could add a --compare-results-to option (or something like that) that takes the output file from a previous run and compares the results. Overall accuracy metrics would be simple enough to compare. It'd be a little trickier with sample-specific comparisons, since we shouldn't assume the samples are consistent between runs.
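To make the idea concrete, here's a rough sketch in Python of what comparing the overall metrics could look like, assuming both runs were redirected to plain-text files in the format shown above. The parsing regex, the script name, and the compare_overall helper are purely illustrative, and --compare-results-to is just the proposed flag name, not an existing option:

```python
import re
import sys

# Pulls the headline numbers out of summary lines like
#   "Training accuracy per tag: 0.96018 95% CI: (0.95336, 0.96699)"
#   "Precision: 0.76738 Recall: 0.88037"
METRIC_RE = re.compile(
    r'(Training accuracy per tag|Validation accuracy per tag|'
    r'Training accuracy per page|Validation accuracy per page|'
    r'FP|FN|Precision|Recall): ([0-9.]+)'
)

def overall_metrics(path):
    """Return {metric name: value} for the summary lines of one saved run."""
    metrics = {}
    with open(path) as f:
        for line in f:
            for name, value in METRIC_RE.findall(line):
                # FP, FN, Precision, and Recall appear under both the training
                # and validation sections; this toy version keeps only the first
                # (training) occurrence. A real tool would track the section.
                metrics.setdefault(name, float(value))
    return metrics

def compare_overall(old_path, new_path):
    """Print the delta for every metric present in both saved runs."""
    old, new = overall_metrics(old_path), overall_metrics(new_path)
    for name in sorted(old.keys() & new.keys()):
        print(f'{name}: {old[name]:.5f} -> {new[name]:.5f} '
              f'({new[name] - old[name]:+.5f})')

if __name__ == '__main__':
    compare_overall(sys.argv[1], sys.argv[2])
```

Usage would be something like: python compare_runs.py run_before.txt run_after.txt, after redirecting each fathom-train invocation to a file.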
Yes, we can't assume the samples are the same from run to run, but if we had fuller paths for the sample names (https://github.com/mozilla/fathom-fox/issues/64), then whenever a sample in one run has the exact same path as a sample in the other run, and we've opted to compare the runs, the tool could compare them sample-to-sample.
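Once the sample names are full paths, the per-sample half could just be a join keyed on that path. Another rough sketch against the same assumed plain-text output format (again, the parsing and the function names here are made up for illustration, not anything that exists in fathom-train today):

```python
import re
import sys

# Matches per-page lines like
#   "success on 13.html. Confidence: 0.73781013"
#   "failure on 19.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice."
PER_PAGE_RE = re.compile(r'(success|failure) on (\S+)\. Confidence: ([0-9.]+)')

def per_sample_results(path):
    """Return {(section, sample path): (outcome, confidence)} for one saved run."""
    results = {}
    section = None
    with open(path) as f:
        for line in f:
            # "Training per-page results:" / "Validation per-page results:" headers
            if 'per-page results' in line:
                section = line.strip().rstrip(':')
            match = PER_PAGE_RE.search(line)
            if match and section:
                outcome, sample, confidence = match.groups()
                results[(section, sample)] = (outcome, float(confidence))
    return results

def compare_samples(old_path, new_path):
    """Print confidence deltas for samples whose paths appear in both runs."""
    old, new = per_sample_results(old_path), per_sample_results(new_path)
    for key in sorted(old.keys() & new.keys()):
        section, sample = key
        (old_out, old_conf), (new_out, new_conf) = old[key], new[key]
        print(f'{section} {sample}: {old_conf:.8f} -> {new_conf:.8f} '
              f'({new_conf - old_conf:+.8f}), {old_out} -> {new_out}')
    for key in sorted(old.keys() ^ new.keys()):
        print(f'{key[0]} {key[1]}: present in only one run; skipped')

if __name__ == '__main__':
    compare_samples(sys.argv[1], sys.argv[2])
```

Samples that appear in only one run would just be reported and skipped rather than guessed at.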
Sample-to-sample comparison would be the most valuable output of this feature for me, as the majority of my ruleset development time is spent eking out the last few percentage points of overall accuracy. That means caring about the sample-to-sample deltas in cases where the overall accuracy doesn't move much or at all, especially when a particular change to the ruleset was made targeting a specific sample.
I completely agree and definitely see the value. I spent a lot of time flipping between training outputs looking at the confidences for specific files. I'm just highlighting an input case that will need to be handled.