Apache License 2.0

Datasets

A personal collection of datasets converted to uniform formats. Most DMLC projects can use them directly. The copyright of each dataset belongs to its original authors.

Text classification

All are converted into the LIBSVM format.

| name | class | +1/-1 | training | testing | feature | feature group |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| CriteoKaggle | 2 | 3.9:1 | 4.584 × 10^7 | 6.042 × 10^6 | 3.429 × 10^7K | 39 |
| CriteoTera | 2 | ? | 2 × 10^9 | - | 8 × 10^8 | 39 |
| CTRa | 2 | 1:1 | 2.238 × 10^5 | 6.355 × 10^4 | 1.314 × 10^7 | ~200 |
| CTRb | 2 | 8.6:1 | 1.645 × 10^5 | 4.772 × 10^4 | 1.742 × 10^7 | ~100 |
| Avito | | | | | | |
| Avazu | | | | | | |
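
For readers unfamiliar with LIBSVM, each line of such a file holds a label followed by sparse `index:value` pairs. The sketch below shows one way to parse such lines and count positive versus negative examples, which is how a ratio like the +1/-1 column could be computed. The function names `parse_libsvm_line` and `class_ratio` are hypothetical, the sample lines are made up for illustration, and treating any label greater than zero as positive is an assumption (some files may encode labels as 0/1 rather than +1/-1).

```python
from collections import Counter


def parse_libsvm_line(line):
    """Parse one LIBSVM-style line: '<label> <index>:<value> ...'.

    Returns the label as a float and the features as a sparse dict.
    """
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def class_ratio(lines):
    """Count positive and negative examples (assumption: label > 0 is positive)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, _ = parse_libsvm_line(line)
        counts["pos" if label > 0 else "neg"] += 1
    return counts["pos"], counts["neg"]


# Made-up sample lines, not taken from any dataset listed here.
sample = ["+1 3:1 10:0.5", "-1 1:1 7:2", "+1 2:1"]
print(parse_libsvm_line(sample[0]))
print(class_ratio(sample))
```

For the full-size files, passing an open file handle to `class_ratio` would keep memory use flat, since the lines are consumed one at a time.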

Image classification

All are converted into the RecordIO format.

| name | class | image size | training | testing |
| --- | ---: | ---: | ---: | ---: |
| CIFAR10 | 10 | 28 × 28 × 3 | 60,000 | 10,000 |
| ILSVRC12 | 1,000 | 227 × 227 × 3 | 1,281,167 | 50,000 |