rkosti/emotic

AAE calculation during model validation and testing


Hello Ronak,
I have a question about the continuous annotations and the way you use them during the validation phase of each epoch and during testing. The continuous annotations that you provide range from 0 to 10, but in Figure 15 of your PAMI '19 paper you present the ground-truth and predicted labels on a [0, 1] scale. So I assume the AAE values you report for the test set in Table 4 are also on a [0, 1] scale.

Another question arises from this: Figure 15 shows ground-truth annotations with a precision of two decimal digits, while the publicly available continuous annotations are integers. Also, by AAE, do you mean that you simply compute the average of |groundtruth - predicted| over all validation or test samples? Please correct me if I'm wrong.

Thanks in advance.
Regards,
John.

Hi John,
Glad that you are interested in the work.
The continuous labels are learned as regression values in the range [0, 1], so the AAE values reported in Table 4 are on the [0, 1] scale of the continuous annotations. We mapped the ground-truth values from the range [1, 10] to [0, 1].
That is also why you see the AAE values with a precision of three decimal digits in Table 4, and why the maximum error in Figure 15 is below 1.
AAE means exactly what you mention.
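For concreteness, it amounts to something like this (just a sketch in NumPy; the function and argument names are illustrative, not the repo's actual code):

```python
import numpy as np

def average_absolute_error(gt, pred, lo=1.0, hi=10.0):
    """AAE per continuous dimension, averaged over all samples.

    gt   : raw ground-truth annotations in [lo, hi], shape (N, D)
    pred : model predictions already in [0, 1],      shape (N, D)

    The raw annotation range is assumed to be [1, 10] here; change
    lo/hi if the annotations you load are in [0, 10] instead.
    """
    gt = (np.asarray(gt, dtype=float) - lo) / (hi - lo)      # rescale ground truth to [0, 1]
    err = np.abs(gt - np.asarray(pred, dtype=float))          # per-sample absolute error
    return err.mean(axis=0)                                   # average over all samples
```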

Thanks, Ronak.

Also, one last thing. In the '17 paper, for the continuous dimensions you report a "mean error rate" (on the [1, 10] scale), while in the '19 paper you report the "average absolute error" (on the [0, 1] scale), which appears to be much lower even accounting for the difference in scale. You also mention that the network architecture is the same for the '17 and '19 versions. How come there is such a big difference in performance on the continuous dimensions, and what does "mean error rate" actually mean? Is it the same as AAE?

Thanks in advance.
Regards,
John

Thanks, John, for highlighting that aspect.
For the '17 paper, we used the mean of squared errors (like an L2 loss), which we called the "mean error rate" (not a good name, I guess).
For the '19 paper, we used the mean (or average) of absolute errors. The big difference in performance is, we believe based on our experiments, due to the change in dataset: for '19 we have a much bigger and more diverse dataset. When we experimented with only the continuous labels, we could get quite low errors.
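In code, the two metrics differ only in how each per-sample error is penalized (again, an illustrative sketch assuming ground truth and predictions are already on the same scale, not the released evaluation code):

```python
import numpy as np

def mean_squared_error(gt, pred):
    # '17 "mean error rate": average of squared differences (L2-style penalty)
    gt, pred = np.asarray(gt, dtype=float), np.asarray(pred, dtype=float)
    return np.mean((gt - pred) ** 2, axis=0)

def mean_absolute_error(gt, pred):
    # '19 AAE: average of absolute differences (L1-style penalty)
    gt, pred = np.asarray(gt, dtype=float), np.asarray(pred, dtype=float)
    return np.mean(np.abs(gt - pred), axis=0)
```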