PiotrSobczak/speech-emotion-recognition

Results may be overestimated


Hi!

I was able to reproduce the ensemble results. Thanks a lot for your scripts.

I have a question/suggestion about how the test accuracies are computed.

From what I see in the code, test results are computed at the end of each hyperparameter configuration run, using the best epoch (selected by its validation scores), and the output filenames contain the test results.

There are two places where I think the reported results may be overestimated:

  1. Choosing the best acoustic model and the best linguistic model based on their test results already overestimates the reported numbers (see the toy sketch after this list).

  2. If, when building the ensemble, you pick the linguistic and acoustic sub-models that achieved the best test results, those sub-models will look better than ones chosen by their validation scores. In that case, a fair evaluation of the ensemble model would require a new test set, unseen in the previous runs.
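
To make point 1 concrete, here is a toy simulation that has nothing to do with the repo's code: several hyperparameter runs with the same true accuracy, each with a noisy test measurement. Picking the run with the best test score systematically reports a number above the true accuracy.

```python
# Toy illustration of test-set selection bias (not based on this repo's code).
import random

random.seed(0)

TRUE_ACC = 0.60    # assume every hyperparameter run has the same true accuracy
NOISE = 0.02       # run-to-run noise in the measured test accuracy
N_RUNS = 20        # number of hyperparameter configurations tried
N_TRIALS = 1000    # repeat the experiment to estimate the bias

selected = []
for _ in range(N_TRIALS):
    test_scores = [random.gauss(TRUE_ACC, NOISE) for _ in range(N_RUNS)]
    selected.append(max(test_scores))  # "best model" chosen by its test score

print(f"true accuracy:           {TRUE_ACC:.3f}")
print(f"mean of selected maxima: {sum(selected) / N_TRIALS:.3f}")
# The second number is consistently higher than the first, purely because
# the test set was used for model selection.
```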

I suggest instead using validation scores for all intermediate decisions (both for choosing which models to report and for picking the sub-models of the ensemble), and then, finally and only once, running the best linguistic model, the best acoustic model, and the best ensemble on the test data and reporting those numbers. That way you know you are not overestimating the results. A minimal sketch of this protocol is below.
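
Here is a minimal sketch of that protocol. `train_with_config` and `evaluate` are hypothetical helpers, not functions from this repo; the point is only that the test set is touched a single time, after all selection is done.

```python
# Hypothetical sketch of "select on validation, evaluate once on test".
def select_and_report(configs, train_with_config, evaluate, test_set):
    # 1) Train every hyperparameter configuration; keep the model from the
    #    best epoch together with its *validation* score. The test set is
    #    never used at this stage.
    candidates = [train_with_config(cfg) for cfg in configs]  # -> (model, val_score)

    # 2) Select the candidate with the best validation score.
    best_model, best_val_score = max(candidates, key=lambda c: c[1])

    # 3) Only now, and only once, run the selected model on the test set
    #    and report that number.
    test_score = evaluate(best_model, test_set)
    return best_model, best_val_score, test_score
```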

I apologize in advance if you did it in a different way and I got it wrong :)

Thanks again! Great repo!