NoelShin/PixelPick

Metrics on validation set better than on training set in later iterations

Closed this issue · 2 comments

Hey, first off thanks for your great work.
I ran PixelPick on the CamVid dataset and was able to reproduce your results.
However, I noticed that in the later iterations (e.g. the 8th and 9th query) the metrics on the training set get worse compared to earlier queries. Also, in those later iterations the metrics on the validation set are better than the ones on the training set.

This was somewhat confusing to me, as I would expect the train metrics to always be better than, or at least close to, the validation metrics. Have you also experienced this behaviour? Could you explain why it happens?

Here are some example outputs of the log files (log_train and log_val) to clarify what I mean:

1_query: (train mIoU is better than val mIoU)
log_train

epoch,mIoU,pixel_acc,loss
1,0.1647861916095027,0.452034670192449,1.6727225210497287
2,0.22060275543385285,0.5612143566463328,1.3061175287746993
...
49,0.541220343050904,0.7966300366300366,0.6058177067770985
50,0.5305067612137754,0.8005161682101108,0.5683277458603916

log_val

epoch,mIoU,pixel_acc
1,0.25997502362990466,0.7165194302324359
2,0.3002034544983505,0.7339188130731727
...
49,0.5081947105419546,0.8559528512823557
50,0.5070833871048923,0.8561115071697197

9_query: (train metrics are worse than val metrics)
log_train

epoch,mIoU,pixel_acc,loss
1,0.14555334407154222,0.3848359444280225,1.7204595146283426
2,0.190205061287549,0.46715755745886994,1.43651252691863
...
49,0.464088471069229,0.6884144658139321,0.7766113927781256
50,0.458601080099059,0.6860887275303286,0.7839057154017068

log_val

epoch,mIoU,pixel_acc
1,0.3229489249178415,0.7533149149938388
2,0.32805722647346275,0.7657497348137738
...
49,0.5605398407886102,0.8780724211051123
50,0.5584485101927593,0.8780123472659442

Best regards,
Marcel

Hi Marcel,

Thanks for your enquiry. To answer your questions:

(1) Why do the training metrics show lower values than the validation metrics?
This is because, when we compute the metrics during training, we use the masked ground truths. That is, we only consider the queried pixels when computing mIoU, pixel accuracy, and the cross-entropy loss. Given that those queried pixels are non-trivial ones (i.e. uncertain from the model's perspective), it is understandable that the training metrics are lower than their validation counterparts, which are calculated over all the pixels in an image rather than over a few uncertain (and thus difficult) ones.
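To make this concrete, here is a minimal sketch (not the repository's exact code) of what "metrics over queried pixels only" looks like; the function name `masked_metrics` and the argument `query_mask` are illustrative assumptions:

```python
import numpy as np

def masked_metrics(pred, gt, query_mask, num_classes, ignore_index=255):
    """Pixel accuracy and mIoU computed over the queried pixels only.

    pred, gt: (H, W) integer label maps; query_mask: (H, W) bool array
    marking the pixels that have been labelled so far.
    """
    # Keep only pixels that were queried and are not the ignore label.
    valid = query_mask & (gt != ignore_index)
    p, g = pred[valid], gt[valid]

    pixel_acc = float((p == g).mean()) if p.size else 0.0

    # Per-class IoU restricted to the queried pixels.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(p == c, g == c).sum()
        union = np.logical_or(p == c, g == c).sum()
        if union > 0:
            ious.append(inter / union)
    miou = float(np.mean(ious)) if ious else 0.0
    return miou, pixel_acc

# Validation metrics correspond to passing a mask that covers every pixel,
# e.g. masked_metrics(pred, gt, np.ones_like(gt, dtype=bool), num_classes=11)
```

Since the validation call effectively uses an all-true mask, it averages over many easy pixels as well, which is why it can come out higher than the training numbers computed on a handful of hard queried pixels.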

(2) Why do later query stages show lower training metrics than earlier ones?
This is also related to the difficulty of the queried pixels. Since later query stages contain more difficult labelled pixels than earlier ones, it is harder to obtain a good score. Intuitively speaking, as the active learning proceeds, the model gradually queries harder pixels, e.g. ones near object boundaries, which increases the overall difficulty of its labelled data pool.
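As a rough illustration of why later rounds pull in harder pixels, here is a hypothetical entropy-based acquisition sketch (not necessarily the exact strategy used in this repository; `query_uncertain_pixels`, `probs`, and `labelled_mask` are made-up names):

```python
import numpy as np

def query_uncertain_pixels(probs, labelled_mask, k):
    """Pick the k most uncertain unlabelled pixels of one image.

    probs: (C, H, W) softmax output; labelled_mask: (H, W) bool array of
    already-queried pixels. Returns the updated labelled mask.
    """
    # Predictive entropy per pixel: high entropy = uncertain (often near boundaries).
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=0)  # (H, W)

    # Exclude already-labelled pixels from the candidate pool.
    entropy[labelled_mask] = -np.inf

    # Select the k highest-entropy remaining pixels.
    flat_idx = np.argpartition(entropy.ravel(), -k)[-k:]
    new_mask = np.zeros(entropy.shape, dtype=bool)
    new_mask.ravel()[flat_idx] = True
    return labelled_mask | new_mask
```

Because each round adds the pixels the current model finds hardest, the average difficulty of the labelled pool keeps increasing, which is consistent with the training mIoU dropping between the 1st and 9th query.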

I hope these answers make sense to you. :)
Please let us know if the answers are not clear enough or if you have any other questions or issues.

Kind regards,
Gyungin

Thank you for your prompt reply.
Yes, that makes perfect sense.

Best Regards,
Marcel