Make it possible to directly compare two training runs
biancadanforth opened this issue · 3 comments
fathom-train outputs a table of each sample's confidence score after a training run[1]. If I want to determine whether a change has produced a net improvement in my ruleset's accuracy, I currently have to manually compare the outputs of two separate training runs. It would be very helpful if fathom-train, or a new tool, performed these comparisons for me and told me the deltas for all the confidences between the two runs.
Edit: A prerequisite for this may be https://github.com/mozilla/fathom-fox/issues/64.
[1]: Example fathom-train output currently:
(venv) bdanforth ~/Projects/fathom-smoot/articles (master) $ fathom-train -l 0.01 -i 5000 -c V24_paragraph_with_validation -a vectors/vectors_validation_paragraph.json -s vectors/vectors_training_paragraph.json
[#########---------------------------] 25% 00:00:03
Stopping early at iteration 1273, just before validation error rose.
{"coeffs": [
["pElementHasListItemAncestor", -3.24098539352417],
["hasLongTextContent", 5.606229305267334],
["containsElipsisAtEndOfText", -0.09499679505825043],
["classNameIncludesCaption", -2.4309158325195312]
],
"bias": -4.5716118812561035}
Training accuracy per tag: 0.96018 95% CI: (0.95336, 0.96699)
FP: 0.02750 95% CI: (0.00975, 0.04525)
FN: 0.01233 95% CI: (0.00827, 0.01639)
Precision: 0.76738 Recall: 0.88037
Validation accuracy per tag: 0.94430 95% CI: (0.93409, 0.95451)
FP: 0.03559 95% CI: (0.00574, 0.06543)
FN: 0.02011 95% CI: (0.01361, 0.02662)
Precision: 0.61236 Recall: 0.73649
Training accuracy per page: 0.60000 95% CI: (0.29636, 0.90364)
Validation accuracy per page: 0.60000 95% CI: (0.29636, 0.90364)
Training per-page results:
success on 45.html. Confidence: 0.36664742 No target nodes. Assumed negative sample.
success on 13.html. Confidence: 0.73781013
success on 54.html. Confidence: 0.73781013
failure on 33.html. Confidence: 0.50709236 There were no right choices, but highest-scorer had high confidence anyway.
success on 47.html. Confidence: 0.04786294 No target nodes. Assumed negative sample.
failure on 19.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 5: 0.73781013
failure on 49.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 3: 0.73781013
failure on 36.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 1: 0.73781013
success on 35.html. Confidence: 0.49511391 No target nodes. Assumed negative sample.
success on 26.html. Confidence: 0.73781013
Validation per-page results:
success on 37.html. Confidence: 0.22838420 No target nodes. Assumed negative sample.
success on 45.html. Confidence: 0.73781013
failure on 16.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 1: 0.40638763
success on 27.html. Confidence: 0.73781013
failure on 58.html. Confidence: 0.73781013 There were no right choices, but highest-scorer had high confidence anyway.
success on 78.html. Confidence: 0.73781013
success on 89.html. Confidence: 0.73781013
failure on 86.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 1: 0.73781013
success on 20.html. Confidence: 0.73781013
failure on 19.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice.
First target at index 2: 0.73781013
I really like this. If the user saves the output of fathom-train to a file, we could add a --compare-results-to option (or something like that) that takes the output file from a previous run and compares the results. Overall accuracy metrics would be simple enough to compare. It'd be a little trickier with sample-specific comparisons, since we shouldn't assume the samples are consistent between runs.
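To make the idea concrete, here's a rough sketch in Python of what comparing the overall metrics could look like, assuming both runs were redirected to plain-text files in the format shown above. The parsing regex, the script name, and the compare_overall helper are purely illustrative, and --compare-results-to is just the proposed flag name, not an existing option:

```python
import re
import sys

# Pulls the headline numbers out of summary lines like
#   "Training accuracy per tag: 0.96018 95% CI: (0.95336, 0.96699)"
#   "Precision: 0.76738 Recall: 0.88037"
METRIC_RE = re.compile(
    r'(Training accuracy per tag|Validation accuracy per tag|'
    r'Training accuracy per page|Validation accuracy per page|'
    r'FP|FN|Precision|Recall): ([0-9.]+)'
)

def overall_metrics(path):
    """Return {metric name: value} for the summary lines of one saved run."""
    metrics = {}
    with open(path) as f:
        for line in f:
            for name, value in METRIC_RE.findall(line):
                # FP, FN, Precision, and Recall appear under both the training
                # and validation sections; this toy version keeps only the first
                # (training) occurrence. A real tool would track the section.
                metrics.setdefault(name, float(value))
    return metrics

def compare_overall(old_path, new_path):
    """Print the delta for every metric present in both saved runs."""
    old, new = overall_metrics(old_path), overall_metrics(new_path)
    for name in sorted(old.keys() & new.keys()):
        print(f'{name}: {old[name]:.5f} -> {new[name]:.5f} '
              f'({new[name] - old[name]:+.5f})')

if __name__ == '__main__':
    compare_overall(sys.argv[1], sys.argv[2])
```

Usage would be something like: python compare_runs.py run_before.txt run_after.txt, after redirecting each fathom-train invocation to a file.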
Yes, we can't assume the samples are the same from run to run, but if we had fuller paths for the sample names (https://github.com/mozilla/fathom-fox/issues/64), then whenever a sample in one run has the exact same path as a sample in the other run, and we've opted to compare the runs, the tool could compare them sample-to-sample.
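Once the sample names are full paths, the per-sample half could just be a join keyed on that path. Another rough sketch against the same assumed plain-text output format (again, the parsing and the function names here are made up for illustration, not anything that exists in fathom-train today):

```python
import re
import sys

# Matches per-page lines like
#   "success on 13.html. Confidence: 0.73781013"
#   "failure on 19.html. Confidence: 0.73781013 Highest-scoring element was a wrong choice."
PER_PAGE_RE = re.compile(r'(success|failure) on (\S+)\. Confidence: ([0-9.]+)')

def per_sample_results(path):
    """Return {(section, sample path): (outcome, confidence)} for one saved run."""
    results = {}
    section = None
    with open(path) as f:
        for line in f:
            # "Training per-page results:" / "Validation per-page results:" headers
            if 'per-page results' in line:
                section = line.strip().rstrip(':')
            match = PER_PAGE_RE.search(line)
            if match and section:
                outcome, sample, confidence = match.groups()
                results[(section, sample)] = (outcome, float(confidence))
    return results

def compare_samples(old_path, new_path):
    """Print confidence deltas for samples whose paths appear in both runs."""
    old, new = per_sample_results(old_path), per_sample_results(new_path)
    for key in sorted(old.keys() & new.keys()):
        section, sample = key
        (old_out, old_conf), (new_out, new_conf) = old[key], new[key]
        print(f'{section} {sample}: {old_conf:.8f} -> {new_conf:.8f} '
              f'({new_conf - old_conf:+.8f}), {old_out} -> {new_out}')
    for key in sorted(old.keys() ^ new.keys()):
        print(f'{key[0]} {key[1]}: present in only one run; skipped')

if __name__ == '__main__':
    compare_samples(sys.argv[1], sys.argv[2])
```

Samples that appear in only one run would just be reported and skipped rather than guessed at.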
Sample-to-sample comparison would be the most valuable output of this feature for me, as the majority of my ruleset development time is spent eking out the last few percentage points of overall accuracy. That means caring about the sample-to-sample deltas in cases where the overall accuracy doesn't move much or at all, especially when a particular change to the ruleset was made targeting a specific sample.
I completely agree and definitely see the value. I spent a lot of time flipping between training outputs looking at the confidences for specific files. I'm just highlighting an input case that will need to be handled.