# Machine learning dataset loaders

Loaders for various machine learning datasets for testing and example scripts. Previously in `thinc.extra.datasets`.
## Setup and installation

The package can be installed via pip:

```bash
pip install ml-datasets
```
## Loaders

Loaders can be imported directly or used via their string name (which is useful if they're set via command-line arguments). Some loaders may take arguments – see the source for details.
```python
# Import directly
from ml_datasets import imdb

train_data, dev_data = imdb()
```

```python
# Load via registry
from ml_datasets import loaders

imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
```
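The point of string-name lookup is that the loader can be chosen at runtime, e.g. from a command-line argument, without a hard-coded import. A minimal sketch of that pattern, using a plain dict and hypothetical stand-in loaders (the real package resolves names through `ml_datasets.loaders`):

```python
# Hypothetical stand-in loaders; the real functions download and return
# (train_data, dev_data) tuples.
def load_imdb():
    return ["train example"], ["dev example"]

LOADERS = {"imdb": load_imdb}

def load_by_name(name):
    # Resolve a loader from its string name, e.g. a CLI argument.
    if name not in LOADERS:
        raise KeyError(f"Unknown loader: {name}")
    return LOADERS[name]()

train_data, dev_data = load_by_name("imdb")
```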
### Available loaders

| ID / Function | Description | From URL |
| --- | --- | --- |
| `imdb` | IMDB sentiment dataset. | ✓ |
| `mnist` | MNIST data. | ✓ |
| `quora_questions` | Quora question answer dataset. | ✓ |
| `reuters` | Reuters dataset. | ✓ |
| `snli` | Stanford Natural Language Inference corpus. | ✓ |
| `stack_exchange` | Stack Exchange dataset. | |
| `ud_ancora_pos_tags` | Universal Dependencies Spanish AnCora corpus (POS tagging). | ✓ |
| `ud_ewtb_pos_tags` | Universal Dependencies English EWT corpus (POS tagging). | ✓ |
| `wikiner` | WikiNER data. | |
| `dbpedia` | DBPedia ontology dataset via fast.ai. | ✓ |
### Registering loaders

Loaders can be registered externally using the `loaders` registry as a decorator. For example:

```python
@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders
```
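To illustrate how a decorator-based registry like this can work internally, here is a simplified, self-contained sketch (the real package builds on an external registry implementation; class and method names below are illustrative, not the package's actual internals):

```python
class Registry:
    """Minimal name-to-function registry supporting decorator registration."""

    def __init__(self):
        self._funcs = {}

    def __call__(self, name):
        # Calling the registry with a name returns a decorator that
        # registers the decorated function under that name.
        def register(func):
            self._funcs[name] = func
            return func
        return register

    def __contains__(self, name):
        return name in self._funcs

    def get(self, name):
        return self._funcs[name]


loaders = Registry()

@loaders("my_custom_loader")
def my_custom_loader():
    return [("some text", "some label")]

assert "my_custom_loader" in loaders
```

Returning `func` unchanged from the decorator means the function stays usable under its original name as well as via the registry.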