Downsampled Open Images Dataset V4
The Open Images V4 dataset contains 15.4M bounding-boxes for 600 categories on 1.9M images and 30.1M human-verified image-level labels for 19794 categories.
The dataset is available at this link. This total size of the full dataset is 18TB
. There's also a smaller version which contains rescaled images to have at most 1024 pixels on the longest side. However, the total size of the rescaled dataset is still large (513GB
for training, 12GB
for validation and 36GB
for testing).
I provide a much smaller version of the Open Images Dataset V4, as inspired by Downsampled ImageNet datasets
@PatrykChrabaszcz. These downsampled dataset are much smaller in size so everyone can download it with ease (59GB
for training with 512px version and 16GB
for training with 256px version). Experiments on these downsampled dataset are also much faster than the original.
Dataset | Train Size | Validation Size | Test Size | Test Challenge Size | Google Drive | AcademicTorrents |
---|---|---|---|---|---|---|
Original | 513 GB | 12 GB | 36 GB | 9.7 GB | ||
512px | 52.8 GB | 1.23 GB | 3.72 GB | 3.08 GB | Link | Link |
256px | 16 GB | 0.4 GB | 1.14 GB | 0.95 GB | Link | Link |
- pillow
- loguru
image_resizer.py [-h] -i IN_DIR -o OUT_DIR [-s SIZE] [-ext EXTENSION]
[-a ALGORITHM] [-j PROCESSES] [-l LOG]
optional arguments:
-h, --help show this help message and exit
-i IN_DIR, --in_dir IN_DIR
Input directory with source images
-o OUT_DIR, --out_dir OUT_DIR
Output directory for resized images
-s SIZE, --size SIZE Size of an output image (e.g. 512 results in (512x512)
image)
-ext EXTENSION, --extension EXTENSION
Extension of the output image (jpg, png). Default
empty means the same as source
-a ALGORITHM, --algorithm ALGORITHM
Algorithm used for resampling: lanczos, nearest,
bilinear, bicubic, box, hamming
-j PROCESSES, --processes PROCESSES
Number of sub-processes that run different folders in
the same time
-l LOG, --log LOG Path of the output log
-v VERBOSE, --verbose VERBOSE
Number of images between each verbose in each thread
For example, the 512px dataset is created with the following command:
python image_resizer.py -i data\train -o output\train -j 4 -ext jpg -a lanczos -s 512
Parts of the code are inspired by the Downsampled ImageNet datasets
@PatrykChrabaszcz.