On validation metrics and thresholds

Question

Opened this issue a year ago · 2 comments

First of all, nice job!

I noticed in the validation arena you're using my suggested thresholds for my models, and a "default" one for yours.
That's doing your work a disservice.
I think a fairer way to compare the models would be to try and find some fixed performance point, and see how the other metrics fare.

For my models, for example, I used to choose (by bisection) a threshold where micro-averaged recall and precision matched: if both were higher than for the previous model, I had a better model.
You could do the same, or bisect towards a threshold that gives a desired precision and then evaluate recall, for example.
This also has the side effect of being fairer to augmentations like mixup, which skew prediction confidence towards lower values.
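A minimal sketch of the bisection idea described above, in plain Python so it stays dependency-free. The function names, the nested-list data layout (one row of per-tag scores and 0/1 labels per sample), and the convention that precision counts as 1.0 when nothing is predicted are all assumptions made for illustration, not code taken from either model's repository.

```python
def micro_precision_recall(scores, labels, threshold):
    """Pool TP/FP/FN over every (sample, tag) pair, then compute precision and recall."""
    tp = fp = fn = 0
    for score_row, label_row in zip(scores, labels):
        for score, label in zip(score_row, label_row):
            predicted = score >= threshold
            if predicted and label:
                tp += 1
            elif predicted:
                fp += 1
            elif label:
                fn += 1
    # Assumed convention: with no predictions at all, precision is treated as 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def find_balanced_threshold(scores, labels, lo=0.0, hi=1.0, iterations=30):
    """Bisect for the lowest threshold at which micro precision >= micro recall."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        precision, recall = micro_precision_recall(scores, labels, mid)
        if precision < recall:
            lo = mid  # too many tags predicted: raise the threshold
        else:
            hi = mid  # precision has caught up with recall: try lower
    return hi


# Hypothetical toy data: 3 samples, 3 tags each.
scores = [[0.9, 0.2, 0.6], [0.4, 0.8, 0.1], [0.7, 0.3, 0.5]]
labels = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]

threshold = find_balanced_threshold(scores, labels)
precision, recall = micro_precision_recall(scores, labels, threshold)
print(f"threshold={threshold:.4f} precision={precision:.3f} recall={recall:.3f}")
```

Micro precision equals micro recall exactly when the number of predicted tags equals the number of ground-truth tags, and the predicted count can only shrink as the threshold rises, which is why a plain bisection on the sign of precision minus recall is well behaved here. Running the same search for each model puts them at a comparable operating point before the other metrics are compared.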

If I may go on a slight tangent about the discrepancy between my stated scores and the ones in the Arena: I used to use micro averaging, while you're calculating macro averages. Definitely keep using macro averaging for the metrics; I started using it too in my newer codebase over at https://github.com/SmilingWolf/JAX-CV (posting the repo in case you want to use it if you decide to apply to TRC).
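To illustrate how the two averaging schemes can give different scores for the same predictions, here is a small sketch in plain Python. The helper names, the toy data, and the choice to score an undefined ratio as 0.0 are assumptions for illustration only; neither codebase is being quoted.

```python
def per_tag_counts(scores, labels, threshold):
    """Return one [tp, fp, fn] triple per tag."""
    counts = [[0, 0, 0] for _ in labels[0]]
    for score_row, label_row in zip(scores, labels):
        for tag, (score, label) in enumerate(zip(score_row, label_row)):
            predicted = score >= threshold
            if predicted and label:
                counts[tag][0] += 1
            elif predicted:
                counts[tag][1] += 1
            elif label:
                counts[tag][2] += 1
    return counts


def safe_div(numerator, denominator):
    # Assumed convention: an undefined ratio is scored as 0.0.
    return numerator / denominator if denominator else 0.0


def micro_average(counts):
    """Sum the counts over all tags first, then compute precision and recall once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return safe_div(tp, tp + fp), safe_div(tp, tp + fn)


def macro_average(counts):
    """Compute precision and recall per tag, then take the unweighted mean."""
    precisions = [safe_div(tp, tp + fp) for tp, fp, fn in counts]
    recalls = [safe_div(tp, tp + fn) for tp, fp, fn in counts]
    return sum(precisions) / len(counts), sum(recalls) / len(counts)


# Hypothetical toy data: tag 0 is frequent and easy, tag 1 is rare and missed.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.6], [0.6, 0.3]]
labels = [[1, 0], [1, 0], [1, 0], [1, 1]]

counts = per_tag_counts(scores, labels, threshold=0.5)
print("micro (precision, recall):", micro_average(counts))
print("macro (precision, recall):", macro_average(counts))
```

On this toy data the micro average is dominated by the frequent tag, while the macro average gives the rare tag equal weight, so the macro numbers come out lower. A gap of this kind is what one would expect when a model's stated micro-averaged scores are compared against macro-averaged ones.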

Answer 1 · 2024-01-02T20:27:55.000Z

@SmilingWolf slightly OT but might you please advise how to convert to ONNX so it can be used in A1111 and ComfyUI extensions?

Answer 2 · 2024-01-12T21:36:02.000Z

Thank you for the kind comments and help, @SmilingWolf!

> I noticed in the validation arena you're using my suggested thresholds for my models, and a "default" one for yours.

Actually, 0.4 is the optimized threshold for my model as well, where recall = precision, or roughly thereabouts.

> If I may go on a slight tangent about the discrepancy between my stated scores and the ones in the Arena

Well, both models perform worse in the Validation Arena, so I'm not worried about discrepancies there. I think it's likely due to a slight domain shift in Danbooru. At a cursory glance, the usage of a good number of tags has changed, and some tags are now used more and some less.

The Validation Arena was really just a way for me to compare models as apples to apples as I can.