google-research/nasbench

y-axis in Fig 7(left)

bkj opened this issue · 4 comments

bkj commented

The left plot in Fig 7 in the paper shows test regret -- can you explain how that's computed exactly?

I know it's log10(y_best - y) -- but what is y_best exactly? Is it the best validation/test accuracy from a single model run, or averaged across the 3 model runs?

I think the four possibilities would be:

test acc           mean across 3 runs           0.943175752957662
test acc           maximum across 3 runs        0.9466145634651184
validation acc     mean across 3 runs           0.9505542318026224
validation acc     maximum across 3 runs        0.9518229365348816

Thanks!

It's the mean test accuracy, i.e. 0.943175752957662.

bkj commented

Yes, exactly: it's log10(best_mean_test_acc - arch_mean_test_acc).
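
In case it helps, here is a minimal sketch of that computation. It assumes the nasbench API's hash_iterator() / get_metrics_from_hash() and the 'final_test_accuracy' field of the 3 runs at the 108-epoch budget; the file path and field names are my assumptions from the repo, not something stated in this thread.

```python
# Minimal sketch: compute the test regret used on the y-axis of Fig 7 (left).
# Assumes the nasbench API's hash_iterator()/get_metrics_from_hash() and the
# 'final_test_accuracy' field of each of the 3 runs at the 108-epoch budget.
import numpy as np
from nasbench import api

nasbench = api.NASBench('nasbench_only108.tfrecord')  # path is an assumption

# Mean test accuracy (over the 3 training runs) for every architecture.
mean_test_acc = {}
for h in nasbench.hash_iterator():
    _, computed = nasbench.get_metrics_from_hash(h)
    runs = computed[108]  # the 3 repeated runs at the 108-epoch budget
    mean_test_acc[h] = np.mean([r['final_test_accuracy'] for r in runs])

best_mean_test_acc = max(mean_test_acc.values())  # ~0.943176 per this thread

def test_regret(arch_hash):
    """log10(best_mean_test_acc - arch_mean_test_acc); -inf for the best arch."""
    return np.log10(best_mean_test_acc - mean_test_acc[arch_hash])
```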

I attached Figure 7, but with the validation regret on the y-axis:
comparison_time_all_mean.pdf
comparison_time_all_mean_valid.pdf

Note that we found slightly better hyperparameters for SMAC and BOHB, which is why they improved. For comparison, I also added the original Fig 7 with the updated test regret.

bkj commented

I put code that attempts to reproduce the results of the random search here:
https://gist.github.com/bkj/8ae8da3c84bbb0fa06d144a6e7da8570

The results don't look exactly the same as in the paper -- the best regret is around 5.5 * 1e-3 vs what looks like about 4.1 * 1e-3 in the paper. Any thoughts on where the differences might be coming from?

Roughly, the procedure is (a code sketch follows the list):

  1. sample a sequence of N random architectures
  2. sample a validation accuracy per architecture
  3. plot log10(best_mean_test_acc - arch_mean_test_acc) for the architecture w/ the best validation accuracy seen so far
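
A rough sketch of those three steps, assuming the per-architecture stats have already been pulled out of the nasbench tfrecord into plain arrays: val_accs with shape [num_archs, 3] holding each run's validation accuracy, and mean_test_accs with shape [num_archs] holding the mean test accuracy over the 3 runs. Both names are illustrative, not taken from the gist.

```python
# Rough sketch of the random-search loop above. val_accs ([num_archs, 3],
# per-run validation accuracies) and mean_test_accs ([num_archs], mean test
# accuracy over 3 runs) are assumed to be precomputed from the nasbench data.
import numpy as np

def random_search_regret(val_accs, mean_test_accs, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    best_mean_test = mean_test_accs.max()

    best_val_so_far = -np.inf
    incumbent_test = None
    regrets = []
    for _ in range(n_samples):
        # 1. sample a random architecture
        idx = rng.integers(len(mean_test_accs))
        # 2. sample the validation accuracy of one of its 3 runs
        val = val_accs[idx, rng.integers(3)]
        # 3. keep the architecture with the best validation accuracy so far
        if val > best_val_so_far:
            best_val_so_far = val
            incumbent_test = mean_test_accs[idx]
        # test regret of the current incumbent (log10(0) = -inf for the best arch)
        regrets.append(np.log10(best_mean_test - incumbent_test))
    return regrets
```

Plotting regrets against the number of samples (or against accumulated training time from the tabular data) gives the kind of curve shown below.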

Plot of results:
[screenshot of the random-search regret plot]

Edit: Perhaps the issue is at line 73 of the gist: do you use the mean validation accuracy across the 3 runs for model selection, as opposed to a sample from a single run? I updated the plot above to show the difference.
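
To make the two selection rules from the edit concrete, here is a hypothetical pair of helpers (names are illustrative; val_accs is the same [num_archs, 3] array as in the sketch above):

```python
import numpy as np

# Hypothetical helpers contrasting the two model-selection rules discussed
# in the edit, for one sampled architecture index idx.

def sampled_val(val_accs, idx, rng):
    # Select on the validation accuracy of a single randomly sampled run.
    return val_accs[idx, rng.integers(3)]

def mean_val(val_accs, idx):
    # Select on the mean validation accuracy across the 3 runs.
    return val_accs[idx].mean()
```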