google-research/nasbench

y-axis in Fig 7(left)

bkj opened this issue · 4 comments

bkj commented

The left plot in Fig 7 in the paper shows test regret -- can you explain how that's computed exactly?

I know it's log10(y_best - y) -- but what is y_best exactly? Is it the best validation/test accuracy from a single model run, or averaged across the 3 model runs?

I think the four possibilities would be:

test acc           mean across 3 runs           0.943175752957662
test acc           maximum across 3 runs        0.9466145634651184
validation acc     mean across 3 runs           0.9505542318026224
validation acc     maximum across 3 runs        0.9518229365348816

Thanks!

It's the mean test accuracy, i.e. 0.943175752957662.

bkj commented

Yes, exactly: it's log10(best_mean_test_acc - arch_mean_test_acc).
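
In case it helps, here is a minimal sketch of that computation. It assumes the nasbench API's hash_iterator() / get_metrics_from_hash() and the 'final_test_accuracy' field of the 3 runs at the 108-epoch budget; the file path and field names are my assumptions from the repo, not something stated in this thread.

```python
# Minimal sketch: compute the test regret used on the y-axis of Fig 7 (left).
# Assumes the nasbench API's hash_iterator()/get_metrics_from_hash() and the
# 'final_test_accuracy' field of each of the 3 runs at the 108-epoch budget.
import numpy as np
from nasbench import api

nasbench = api.NASBench('nasbench_only108.tfrecord')  # path is an assumption

# Mean test accuracy (over the 3 training runs) for every architecture.
mean_test_acc = {}
for h in nasbench.hash_iterator():
    _, computed = nasbench.get_metrics_from_hash(h)
    runs = computed[108]  # the 3 repeated runs at the 108-epoch budget
    mean_test_acc[h] = np.mean([r['final_test_accuracy'] for r in runs])

best_mean_test_acc = max(mean_test_acc.values())  # ~0.943176 per this thread

def test_regret(arch_hash):
    """log10(best_mean_test_acc - arch_mean_test_acc); -inf for the best arch."""
    return np.log10(best_mean_test_acc - mean_test_acc[arch_hash])
```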

I attached Figure 7, but with the validation regret on the y-axis:
comparison_time_all_mean.pdf
comparison_time_all_mean_valid.pdf

Note that we found slightly better hyperparameters for SMAC and BOHB, which is why they improved. For comparison, I also added the original Fig 7 with the updated test regret.

bkj commented

I put code that attempts to reproduce the results of the random search here:
https://gist.github.com/bkj/8ae8da3c84bbb0fa06d144a6e7da8570

The results don't look exactly the same as in the paper -- the best regret is around 5.5 * 1e-3 vs what looks like about 4.1 * 1e-3 in the paper. Any thoughts on where the differences might be coming from?

Roughly, the procedure is (a code sketch follows the list):

  1. sample a sequence of N random architectures
  2. sample a validation accuracy per architecture
  3. plot log10(best_mean_test_acc - arch_mean_test_acc) for the architecture w/ the best validation accuracy seen so far
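
A rough sketch of those three steps, assuming the per-architecture stats have already been pulled out of the nasbench tfrecord into plain arrays: val_accs with shape [num_archs, 3] holding each run's validation accuracy, and mean_test_accs with shape [num_archs] holding the mean test accuracy over the 3 runs. Both names are illustrative, not taken from the gist.

```python
# Rough sketch of the random-search loop above. val_accs ([num_archs, 3],
# per-run validation accuracies) and mean_test_accs ([num_archs], mean test
# accuracy over 3 runs) are assumed to be precomputed from the nasbench data.
import numpy as np

def random_search_regret(val_accs, mean_test_accs, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    best_mean_test = mean_test_accs.max()

    best_val_so_far = -np.inf
    incumbent_test = None
    regrets = []
    for _ in range(n_samples):
        # 1. sample a random architecture
        idx = rng.integers(len(mean_test_accs))
        # 2. sample the validation accuracy of one of its 3 runs
        val = val_accs[idx, rng.integers(3)]
        # 3. keep the architecture with the best validation accuracy so far
        if val > best_val_so_far:
            best_val_so_far = val
            incumbent_test = mean_test_accs[idx]
        # test regret of the current incumbent (log10(0) = -inf for the best arch)
        regrets.append(np.log10(best_mean_test - incumbent_test))
    return regrets
```

Plotting regrets against the number of samples (or against accumulated training time from the tabular data) gives the kind of curve shown below.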

Plot of results:
[screenshot of the random-search regret plot]

Edit: Perhaps the issue is at line 73 of the gist: do you use the mean validation accuracy across the 3 runs for model selection, as opposed to a sample from a single run? I updated the plot above to show the difference.
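
To make the two selection rules from the edit concrete, here is a hypothetical pair of helpers (names are illustrative; val_accs is the same [num_archs, 3] array as in the sketch above):

```python
import numpy as np

# Hypothetical helpers contrasting the two model-selection rules discussed
# in the edit, for one sampled architecture index idx.

def sampled_val(val_accs, idx, rng):
    # Select on the validation accuracy of a single randomly sampled run.
    return val_accs[idx, rng.integers(3)]

def mean_val(val_accs, idx):
    # Select on the mean validation accuracy across the 3 runs.
    return val_accs[idx].mean()
```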