y-axis in Fig 7 (left)
bkj opened this issue · 4 comments
The left plot in Fig 7 in the paper shows test regret -- can you explain how that's computed exactly?
I know it's log10(y - y_best), but what is y_best exactly? Is it the best validation/test accuracy from a single model run, or averaged across the 3 model runs?
I think the four possibilities would be:
- test acc, mean across 3 runs: 0.943175752957662
- test acc, maximum across 3 runs: 0.9466145634651184
- validation acc, mean across 3 runs: 0.9505542318026224
- validation acc, maximum across 3 runs: 0.9518229365348816
Thanks!
It's the mean test accuracy, i.e. 0.943175752957662.
Yes, exactly: it's log10(best_mean_test_acc - arch_mean_test_acc).
I attached Figure 7 just with the validation regret on the y-axis.
comparison_time_all_mean.pdf
comparison_time_all_mean_valid.pdf
Note that we found some slightly better hyperparameters for SMAC and BOHB, which is why they improved. For comparison I also added the original Fig 7 with the updated test regret.
I put code that attempts to reproduce the random search results here:
https://gist.github.com/bkj/8ae8da3c84bbb0fa06d144a6e7da8570
The results don't look exactly the same as in the paper: the best regret is around 5.5e-3, vs. what looks like about 4.1e-3 in the paper. Any thoughts on where the differences might be coming from?
Roughly, the procedure is:
- sample a sequence of N random architectures
- sample a validation accuracy per architecture
- plot log10(best_mean_test_acc - arch_mean_test_acc) for the architecture with the best validation accuracy seen so far
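The steps above could be sketched roughly like this (assuming hypothetical lookup arrays `val_accs` of per-run validation accuracies and `mean_test_accs` of mean test accuracies; these names are illustrative, not from the actual gist):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search_regret(val_accs, mean_test_accs, best_mean_test_acc, n_samples):
    """Simulate random search; return the log10 test regret of the
    incumbent (selected on validation accuracy) at each step.

    val_accs: (n_archs, n_runs) per-run validation accuracies
    mean_test_accs: (n_archs,) mean test accuracy per architecture
    """
    regrets = []
    best_val, best_test = -np.inf, -np.inf
    for _ in range(n_samples):
        i = rng.integers(len(mean_test_accs))   # sample a random architecture
        run = rng.integers(val_accs.shape[1])   # sample one of its training runs
        if val_accs[i, run] > best_val:         # select on (sampled) validation acc
            best_val = val_accs[i, run]
            best_test = mean_test_accs[i]       # but report mean test acc
        # Note: this is -inf if the incumbent is the global best.
        regrets.append(np.log10(best_mean_test_acc - best_test))
    return regrets
```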
Edit: Perhaps the issue is line 73: do you use the mean validation accuracy across the 3 runs for model selection, as opposed to sampling a single run? I updated the plot above to show the difference.
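To illustrate the difference between the two selection rules with hypothetical numbers (all names and values here are made up for illustration, not taken from the actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-run validation accuracies: 5 architectures x 3 runs.
val_accs = rng.uniform(0.90, 0.95, size=(5, 3))

# Rule A: select on the mean validation accuracy across the 3 runs.
pick_mean = np.argmax(val_accs.mean(axis=1))

# Rule B: select on a single sampled run -- noisier, and can pick a
# different architecture than rule A.
pick_single = np.argmax(val_accs[:, rng.integers(3)])

print(pick_mean, pick_single)
```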