Only 20% of images with image-level labels downloadable through toolkit
KieranLitschel opened this issue · 3 comments
I noticed that the downloader for images with image-level labels couldn't find many of the images it was looking for. On further inspection, I realized that it downloads them from the same bucket as the bounding-box images, meaning that only around 2 million of the 10 million images are available. I appreciate that there is no bucket for the image-level labeled images, so I'm assuming this was intentional, but I think this limitation should be emphasized in the documentation for transparency.
I ran into the same issue while downloading a dataset with this toolkit: even if you specify a limit for the number of images you want, it finds fewer images online than the requested amount.
It's worth noting that the labels follow a Zipfian distribution (as you can see below), so if a label is uncommon, the dataset may simply not have as many examples as you want.
You should be able to get around 5 times more images for each label from the image-level dataset, though (given that it is 5 times bigger). That might still be fewer than you want, but it is a significant increase.
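As a rough illustration of the two points above, the sketch below estimates how many images each label rank would get under a Zipfian distribution, once for the roughly 2 million images in the bounding-box bucket and once for the roughly 10 million image-level images. Everything except those two figures is an assumption made for illustration: the exponent of 1, the label count of 1000, and the simplification that every image carries exactly one label (real Open Images images usually carry several).

```python
def zipf_expected_counts(num_images, num_labels, exponent=1.0):
    """Expected images per label rank if label frequency is proportional
    to 1 / rank**exponent (a Zipfian distribution).

    Simplifying assumption: each image carries exactly one label.
    """
    weights = [rank ** -exponent for rank in range(1, num_labels + 1)]
    total = sum(weights)
    return [num_images * w / total for w in weights]


BOXED_IMAGES = 2_000_000         # approximate figure quoted in this thread
IMAGE_LEVEL_IMAGES = 10_000_000  # approximate figure quoted in this thread
NUM_LABELS = 1000                # hypothetical, chosen only for illustration

boxed = zipf_expected_counts(BOXED_IMAGES, NUM_LABELS)
full = zipf_expected_counts(IMAGE_LEVEL_IMAGES, NUM_LABELS)

# Compare a common label, two mid-range labels, and the rarest label.
for rank in (1, 10, 100, 1000):
    print(
        f"rank {rank:>4}: ~{boxed[rank - 1]:,.0f} images (boxed) "
        f"vs ~{full[rank - 1]:,.0f} images (image-level)"
    )
```

Under these assumptions, every label gains the same factor of 5 from the larger set, but the counts for rare labels stay small in absolute terms, which is why a requested limit can still go unmet.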
I ended up writing my own implementation for downloading the image-level V5 dataset, which I think is what you want. You can find it here: https://github.com/KieranLitschel/OpenImagesV5Tools . It does some things you don't need, but calling Construct.classes_subset (to select the classes you want) followed by Construct.images_sample (to specify how many images you want) should do what you want; see the README.md for details. Feel free to create an issue on the repo if anything isn't clear or doesn't work, and I'll try my best to help.