modestyachts/ImageNetV2

Mapping between old and new filenames

junsukchoe opened this issue · 13 comments

Hello,

It seems that the file names of ImageNetV2 have been changed.
Could you provide the mappings between old and new filenames?

Thanks!

Hello! What are the file name changes you are seeing? I had changed the filenames in the public release temporarily and rolled it back. Could you check now to see if the file names are still different?

Hi!
It looks like not only the file names have changed, but also the number of files.

I downloaded the dataset from here on December 17th 2020. After unpacking, it consisted of 3 directories (imagenet-matched-frequency-format-val, imagenet-threshold-0.7-format-val, imagenet-top-images-format-val), each containing 1000 directories with names like n03461385. The directories in imagenet-matched-frequency-format-val, for example, each contain the images 0.jpeg ... 19.jpeg (which amounts to 20,000 images instead of the mentioned 10,000).

I downloaded the dataset again 1 or 2 days ago and all of a sudden the directory imagenet-matched-frequency-format-val contains directories 0 ... 999 and each of them contains 10 images each with names like 7e4a8987a9a330189cc38c4098b1c57ac301713f.jpeg.

At first I thought I had mixed something up, but I had documented everything in December when I first downloaded it, and even my browser remembered that I had downloaded it from exactly the same URL.

So, what's going on? Could you clarify what the correct version is (I assume the latter with 10k images)? But where do the additional images in the directory I downloaded in December come from?

Best regards
Verena

Hi @expectopatronum & @junsukchoe

The current dataset release (the one you can download right now, with 10k images) is the correct one. We had a mixup with our S3 bucket in October 2020 and all our files got deleted, so we re-uploaded the dataset to the same locations.

The long names like "7e4a8987a9a330189cc38c4098b1c57ac301713f" are our internal candidate ids. They were added to the release so that you can merge the images with the data structures/labels found in this repository and in our other project: https://github.com/modestyachts/evaluating_machine_accuracy_on_imagenet. You were right @junsukchoe, this is indeed a change in our current release compared to our old release from before October 2020.

The extra 10k images are duplicates so you can ignore them!

I can dig up the exact mapping between the filenames of the old release (from before October 2020) and the new release if you need it!

Thanks,
Vaishaal Shankar

Hi! The new directory names 0, 1, ..., 999 cause trouble with torchvision.datasets.ImageFolder, which sorts the names as strings into 0, 1, 10, 100, ..., 999, which differs from the original numeric order. To get around this, I zero-padded all directory names to 4 digits and it worked. In Python:

import os, glob

for path in glob.glob('../dataset/imagenetv2*'):
    if os.path.isdir(path):
        for subpath in glob.glob(f'{path}/*'):
            dirname = subpath.split('/')[-1]
            os.rename(subpath, '/'.join(subpath.split('/')[:-1]) + '/' + dirname.zfill(4))
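If renaming directories on disk is not an option, the same ordering problem can be handled in code instead. The following is only a minimal sketch, assuming the class directories are named 0 ... 999 as in the current release; the names index_to_label and to_imagenet_label are made up for illustration and are not part of any library.

```python
# Sketch: recover numeric class ids without renaming any directories.
# Assumption: class directories are named "0" ... "999".
class_dirs = [str(i) for i in range(1000)]

# ImageFolder assigns class indices after sorting directory names as
# strings, so "10" and "100" come before "2".
sorted_dirs = sorted(class_dirs)

# index_to_label[i] is the numeric class id behind ImageFolder index i.
index_to_label = [int(name) for name in sorted_dirs]


def to_imagenet_label(folder_index: int) -> int:
    """Translate an ImageFolder target index into the numeric class id."""
    return index_to_label[folder_index]


# Zero-padding, as in the snippet above, makes string order match numeric order.
padded = sorted(name.zfill(4) for name in class_dirs)
```

A function like to_imagenet_label could be passed as the target_transform argument of torchvision.datasets.ImageFolder, so that targets come out as the directory numbers rather than the string-sorted positions.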

hi @Vaishaal

I just downloaded the dataset imagenetv2-threshold0.7 using this link from here.
When untarred, there is only one folder, imagenetv2-threshold0.7-format-val.
And as others mentioned, the files have names like faa7b8da1c2a3f0fee1814d01d1afffb4b5952f7.jpeg.

"I can dig up the exact mapping between the filenames of the old release (from before October 2020) and the new release if you need it!"

any news on the mapping?

@expectopatronum did you find a way around this?

i really appreciate your help
thanks

The tar.gz should have 1000 sub-folders, one for each of the 1000 ImageNet classes (https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a)

Is this not what you see?

If you are using PyTorch you can use https://github.com/modestyachts/ImageNetV2_pytorch to load the dataset.

Hi,
the issue is not with the dataset itself.
The issue is that the file names have changed, which broke another repo that relies on the old names.
A file used to be named 0.jpeg, for example, and now it is faa7b8da1c2a3f0fee1814d01d1afffb4b5952f7.jpeg.
Someone made additional annotations based on the old naming.
Since you said above that you changed the naming scheme, I was wondering if you have the mapping from the old names to the new names.

thanks

Oh, I did not realize there was a dependency on the filenames! We actually lost the old version of the dataset because the newer version with the candidate ids allows us to associate each image in the release with the rest of the metadata we've released in https://github.com/modestyachts/ImageNetV2.

If you have a copy of the old dataset lying around I can probably generate the mapping quite easily but right now I don't have access to the old dataset.
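For anyone who does still have a copy of the old release, one possible way to rebuild such a mapping is to match files by content hash. This is only a sketch and not the maintainers' method: it assumes the images are byte-identical between the two releases, which has not been confirmed in this thread, and the function names file_digest and build_filename_mapping are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash the raw bytes of a file (SHA-1 is enough for matching copies)."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def build_filename_mapping(old_root, new_root):
    """Map old relative paths to new relative paths by file content.

    Assumption: an image has exactly the same bytes in both releases.
    Old files with no byte-identical counterpart are left out.
    """
    old_root, new_root = Path(old_root), Path(new_root)

    # Index the new release by content hash, keeping the first path seen.
    new_by_digest = {}
    for path in sorted(new_root.rglob("*.jpeg")):
        rel = path.relative_to(new_root).as_posix()
        new_by_digest.setdefault(file_digest(path), rel)

    # Look up every old file; duplicated old files map to the same new file.
    mapping = {}
    for path in sorted(old_root.rglob("*.jpeg")):
        match = new_by_digest.get(file_digest(path))
        if match is not None:
            mapping[path.relative_to(old_root).as_posix()] = match
    return mapping
```

Matching on content rather than on directory names sidesteps the change from names like n03461385 to 0 ... 999. If the re-uploaded images were re-encoded, exact hashes would not match and a different comparison would be needed.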

I don't have the old dataset, but the author of the additional annotation, @junsukchoe, probably does.

thanks

Thank you for your snippet! It solves the problem. I made the following minor adjustments to make it more robust w.r.t. OS. (Windows 10 has a different path separator from Linux.)

import glob
import os
for path in glob.glob('../dataset/imagenetv2*'):
    if os.path.isdir(path):
        for subpath in glob.glob(f'{path}/*'):
            dirname = os.path.basename(subpath)
            os.rename(subpath, os.path.sep.join([os.path.dirname(subpath), dirname.zfill(4)]))

So what's the mapping between old and new filenames? Why not just keep them consistent with the original validation set?

Ah, sorry, we lost the old filenames. You can use the ImageNetV2 PyTorch dataloader, https://github.com/modestyachts/ImageNetV2_pytorch, if you'd like code that loads the dataset correctly (so that it is compatible with ImageNet-Val).