stanford-futuredata/dawn-bench-entries

Blacklisted or non-blacklisted validation set:

bignamehyp opened this issue · 20 comments

The ImageNet validation set consists of 50,000 images. In the 2014 devkit, there is a list of 1,762 "blacklisted" files. When we report top-5 accuracy, should we use the blacklisted or non-blacklisted version? In Google's submission, results are obtained using the full 50,000 images, including the blacklisted ones. But some submissions used the blacklisted version. I just want to make sure we're comparing the same thing.

jph00 commented

Oh that's awkward, since we did blacklist those files - that's what has been documented since 2014 as the recommended approach in the ImageNet devkit (although in a somewhat confusing manner, it must be said!)

Perhaps each entry on the leaderboard should be labeled with a column saying whether it was evaluated with or without the blacklisted images? It seems too late to create a new rule at this stage saying which set is required, since which validation set is used affects what hyperparams are required to hit 93% accuracy. (Our group has used up our AWS credits so we can't run any more models - and I can't imagine other teams would be thrilled at the idea of having to train new models with a new validation set...)

Adding an extra field to ImageNet submissions indicating whether or not the blacklisted images were included would be useful. @bignamehyp, @jph00, @congxu1987, @daisyden, @ppwwyyxx can each of you update your respective submissions with an additional field usedBlacklist and submit a PR? Use true for submissions that used the blacklisted files and false for those that didn't. From this information, we can update the leaderboards with a separate column, annotation, or filter.
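For concreteness, a rough sketch of what that edit might look like, assuming the submission is a JSON file (the path and surrounding keys below are placeholders, not the actual DAWNBench schema):

import json

submission_path = "ImageNet/train/example_submission.json"  # hypothetical path

with open(submission_path) as f:
    submission = json.load(f)

# true  -> evaluated on all 50,000 validation images
# false -> evaluated with the 1,762 blacklisted images excluded
submission["usedBlacklist"] = True

with open(submission_path, "w") as f:
    json.dump(submission, f, indent=2)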

@jph00 how much would it cost to rerun your experiments with the same hyperparameters you used for your submission? Assuming the performance is similar and everyone else used the blacklisted images, this might be a simple and cheap solution.

Also, just so everyone in this discussion is on the same page, the devkit instructions are in this readme.txt.

The relevant section to this discussion says:

-----------------------------
3.2.2 CLS-LOC validation data
-----------------------------

There are a total of 50,000 validation images. They are named as

      ILSVRC2012_val_00000001.JPEG
      ILSVRC2012_val_00000002.JPEG
      ...
      ILSVRC2012_val_00049999.JPEG
      ILSVRC2012_val_00050000.JPEG

There are 50 validation images for each synset.

The classification ground truth of the validation images is in 
    data/ILSVRC2014_clsloc_validation_ground_truth.txt,
where each line contains one ILSVRC2014_ID for one image, in the
ascending alphabetical order of the image file names.

The localization ground truth for the validation images can be downloaded 
in xml format.

Notes: 
(1) data/ILSVRC2014_clsloc_validation_ground_truth.txt is unchanged
since ILSVRC2012.
(2) As in ILSVRC2012 and 2013, 1762 images (3.5%) in the validation
set are discarded due to unsatisfactory quality of bounding boxes
annotations. The indices to these images are listed in
data/ILSVRC2014_clsloc_validation_blacklist.txt. The evaluation script
automatically excludes these images. A similar percentage of images
are discarded for the test set.
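For reference, the way the devkit's blacklist is usually applied looks roughly like the Python sketch below. The file locations are taken from the readme above; everything else (function names, argument shapes) is just an illustrative assumption, and the blacklist file is assumed to hold one 1-based index per line into the alphabetically sorted validation file names.

def load_blacklist(path="data/ILSVRC2014_clsloc_validation_blacklist.txt"):
    # One 1-based index per line, pointing into the sorted validation file names.
    with open(path) as f:
        return {int(line) for line in f if line.strip()}

def filter_validation(image_names, labels, blacklist):
    # image_names and labels are in ascending alphabetical order, matching the
    # ground-truth file; keep only the entries not listed in the blacklist.
    kept = [(name, label)
            for i, (name, label) in enumerate(zip(image_names, labels), start=1)
            if i not in blacklist]
    return kept  # 50,000 - 1,762 = 48,238 entries remain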

"used the blacklisted files" sounds a little bit ambiguous to me.
Do we use "true" for validation on 50000 images?

Yes, "used the blacklisted files" is equivalent to using all 50,000 images in the validation set.

Also, to be clear, we have linked to ILSVRC2012 as the dataset in the task descriptions for ImageNet Training and Inference, which uses all 50,000 images. I don't consider this a new rule.

Hi, let me double-check with you: since we used the whole ImageNet training set (1,281,167 images) and validation set (50,000 images) for both training and inference, what value should we set for the "usedBlacklist" field? Thanks!

Thanks @daisyden. You should say "usedBlacklist": true

@codyaustun, "used the blacklisted files" sounds like using the blacklisted file to exclude images. Agreed with @ppwwyyxx and @daisyden, it's a bit ambiguous. Maybe change the filename to excludeBlacklistedImages? And it's hard for the readers to understand the difference blacklisted validation set and un-blacklisted version.

@bignamehyp I agree. Once everyone in this thread has confirmed whether or not they used all 50,000, we can easily update the field name to make it clearer.

@codyaustun thank you very much for your effort. The AmoebaNet submissions used all 50,000 images. I will create a PR adding the new field shortly.

We used the entire 50,000 images in our single Cloud TPU tests for both TF 1.7 and TF 1.8 (cc @sb2nov).

As an experiment, I ran validation with the blacklisted images excluded on some of our training-run checkpoints, and on average it improved top-1 accuracy by 0.25-0.35% and top-5 accuracy by 0.08-0.12%.
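A rough sketch of that comparison, for anyone who wants to check their own checkpoints: score the same per-image predictions once on all 50,000 images and once with the blacklisted indices removed. The arrays below are placeholders, not the actual evaluation code behind these numbers.

import numpy as np

def top_k_accuracy(predictions, labels, k=5):
    # predictions: (N, num_classes) scores; labels: (N,) true class ids.
    top_k = np.argsort(predictions, axis=1)[:, -k:]
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

def compare(predictions, labels, blacklist):
    # blacklist holds 1-based indices, as in the devkit file.
    keep = np.array([i + 1 not in blacklist for i in range(len(labels))])
    full = top_k_accuracy(predictions, labels)
    excluded = top_k_accuracy(predictions[keep], labels[keep])
    return full, excluded  # excluding the images typically nudges accuracy up slightly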

Thanks @frankchn! That is useful to know. Based on those numbers, it looks like fast.ai's original ResNet50 submission would likely be unchanged, but the current one might not make the threshold.

jph00 commented

@codyaustun that's not how training to hit a threshold is done - at least not by us. We find the parameters necessary to hit the threshold in the minimal number of epochs, but no more. If we had to hit the equivalent of 94.1% accuracy on the current (2014 onwards) ImageNet validation set, we would use slightly different hyperparams. It wouldn't change the time much: with suitable hyperparams we can get 94.1 with one extra epoch.

@jph00 I understand. My observation was more that the original submission hit 93.132% when it first crossed the 93% threshold. If @frankchn's results generalize to your code, you would still be above the 93% threshold at the same epoch even after including the blacklisted images. The result of that submission wouldn't change in terms of either time or cost.

However, the current submission would change because you only reach 93.003%, so it seems the same hyperparameters won't work, and you can't simply rerun your submission or revalidate from checkpoints. Is that correct?

My goal is to find a solution to this problem that makes for a fair comparison. We are willing to let you update your submission to correct the validation set, and I want to get a sense of whether or not that is feasible. Do you know how much it would cost to tune the parameters to hit the threshold on the full validation set? If cost is the only obstacle, and it isn't unreasonable, we could simply rerun your experiments or give you credits to resolve this issue without everyone else spending time or money to update their submissions.

jph00 commented

Thanks! We can also help reproduce the experiments if that would be easier or faster.

Looks like this is resolved with #42. Thanks @bignamehyp for raising the issue, @jph00 for the timely update, and everyone else for your quick responses.