Python library that hosts/formats datasets.

MIT License

MLDatasets

This is going to be an ML dataset hosting/formatting library intended to make it easy to download, preprocess, and format your datasets.

First Steps

The first area to be targeted is image-based datasets for classification, image segmentation, image translation, and object detection (bounding boxes). The first step is to create an image-based dataset index. If you know of or use any image-based datasets that are not in the index yet, please consider adding them. The format of the index is very simple (see Image Based Dataset Index).

Image Based Dataset Index

The image-based dataset index file can be found at /datasets/image_based.idx

Format:
[Dataset Name];[Dataset Type];[Additional information];[Dataset Download Urls]\n

Dataset Name = the name of the dataset
Dataset Type = the type of the dataset; the set of types is not settled yet, so make up your own (e.g. bounding_boxes, classification, etc.)
Additional Information = any information you have (such as a description)
Dataset Download Urls = the direct URL(s) for downloading the dataset (if multiple files have to be downloaded, provide all URLs separated with ';')
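
To make the format concrete, here is a minimal sketch of how one line of the index could be parsed. It is not part of the library: the names `IndexEntry` and `parse_index_line` are hypothetical, and the sketch assumes that the first three fields never contain ';', so that everything after the third ';' can be treated as a download URL.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    """One line of the index (hypothetical helper type, not a library API)."""
    name: str
    dataset_type: str
    info: str
    urls: List[str]


def parse_index_line(line: str) -> IndexEntry:
    """Parse '[name];[type];[info];[url];[url]...' into an IndexEntry.

    Assumption: name, type and info contain no ';', so every field
    after the third one is a download URL.
    """
    parts = [part.strip() for part in line.strip().split(";")]
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 ';'-separated fields, got {len(parts)}")
    name, dataset_type, info = parts[:3]
    urls = [url for url in parts[3:] if url]  # drop empty fields, e.g. from a trailing ';'
    if not urls:
        raise ValueError("entry has no download URL")
    return IndexEntry(name, dataset_type, info, urls)


if __name__ == "__main__":
    entry = parse_index_line(
        "horse2zebra;unsupervised_image_translation;"
        "dataset used in the cyclegan paper for unsupervised horse to zebra image translation;"
        "https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/horse2zebra.zip\n"
    )
    print(entry.name, entry.dataset_type, len(entry.urls))
```

Because the field separator and the URL separator are both ';', a description that itself contains ';' would be split incorrectly by this sketch, so it is best to avoid that character in the first three fields.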

Format-Example:
[...]
horse2zebra;unsupervised_image_translation;dataset used in the cyclegan paper for unsupervised horse to zebra image translation;https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/horse2zebra.zip
[...]

MNIST-Example (multiple URLs separated with ';'):
[...]
mnist;classification;digit classification 0-9 (info link: http://yann.lecun.com/exdb/mnist/) ;http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz;http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz;http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz;http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
[...]
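
If you want to add an entry, a small helper like the following could be used to build a correctly formatted line. This is only a sketch: `format_index_entry` is a hypothetical name, not a function of the library, and the rule that no field may contain ';' or a line break is an assumption derived from the format above.

```python
from typing import Sequence


def format_index_entry(name: str, dataset_type: str, info: str, urls: Sequence[str]) -> str:
    """Build one index line: '[name];[type];[info];[url];[url]...' plus a newline.

    Hypothetical helper. Assumption: no field may contain ';' or a line
    break, because ';' separates fields and each entry occupies one line.
    """
    if not urls:
        raise ValueError("at least one download URL is required")
    fields = [name, dataset_type, info, *urls]
    for field in fields:
        if ";" in field or "\n" in field:
            raise ValueError(f"field must not contain ';' or a line break: {field!r}")
    return ";".join(field.strip() for field in fields) + "\n"


if __name__ == "__main__":
    line = format_index_entry(
        "horse2zebra",
        "unsupervised_image_translation",
        "dataset used in the cyclegan paper for unsupervised horse to zebra image translation",
        ["https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/horse2zebra.zip"],
    )
    print(line, end="")
```

The resulting line can be appended to /datasets/image_based.idx as is.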


You can either edit the list on GitHub or send me your additions via email (zacharias.boehler@gmail.com). Any support is much appreciated.