UCI datasets

Regression datasets from the UCI Machine Learning Repository, prepared with predefined test-train splits for benchmarking studies.

Installation

Install using pip (the download size is about 312 MB):

python -m pip install git+https://github.com/treforevans/uci_datasets.git

Usage

The following code gets the first test-train split (i.e., split=0) of the challenger dataset:

from uci_datasets import Dataset
data = Dataset("challenger")
x_train, y_train, x_test, y_test = data.get_split(split=0)

Each dataset has 10 test-train splits (as in 10-fold cross-validation). In each split, 90% of the observations are training points and 10% are test points. The split parameter of the Dataset.get_split method accepts integers from 0 to 9 (inclusive).
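As a minimal sketch of iterating over all 10 splits, the helper below calls get_split for each index and records the training and test set sizes. The names split_sizes and _ToyDataset are hypothetical and not part of uci_datasets. The toy class only mimics the get_split signature shown above so that the example runs without the package installed, and its interleaved partition is an assumption for illustration, not how the splits in this repository were constructed.

```python
def split_sizes(data, n_splits=10):
    """Return (n_train, n_test) for each split index 0..n_splits-1."""
    sizes = []
    for split in range(n_splits):
        x_train, y_train, x_test, y_test = data.get_split(split=split)
        # Inputs and targets should have one row per observation.
        assert len(x_train) == len(y_train)
        assert len(x_test) == len(y_test)
        sizes.append((len(x_train), len(x_test)))
    return sizes


class _ToyDataset:
    """Hypothetical stand-in that mimics Dataset.get_split (illustration only)."""

    def __init__(self, n=50, n_splits=10):
        self.x = [[float(i)] for i in range(n)]
        self.y = [[2.0 * i] for i in range(n)]
        self.n_splits = n_splits

    def get_split(self, split=0):
        # Assumed scheme: every n_splits-th row goes to the test set.
        indices = range(len(self.x))
        test = [i for i in indices if i % self.n_splits == split]
        train = [i for i in indices if i % self.n_splits != split]
        return (
            [self.x[i] for i in train],
            [self.y[i] for i in train],
            [self.x[i] for i in test],
            [self.y[i] for i in test],
        )


sizes = split_sizes(_ToyDataset())
for split, (n_train, n_test) in enumerate(sizes):
    print(f"split {split}: {n_train} train, {n_test} test")
```

With the package installed, the same helper should accept a real dataset object, for example split_sizes(Dataset("challenger")), since it relies only on the get_split method.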

Datasets

The table below lists the number of observations and the number of input dimensions for each dataset. All datasets have a single output dimension.

Dataset name	Number of observations	Input dimension
`3droad`	434874	3
`autompg`	392	7
`bike`	17379	17
`challenger`	23	4
`concreteslump`	103	7
`energy`	768	8
`forest`	517	12
`houseelectric`	2049280	11
`keggdirected`	48827	20
`kin40k`	40000	8
`parkinsons`	5875	20
`pol`	15000	26
`pumadyn32nm`	8192	32
`slice`	53500	385
`solar`	1066	10
`stock`	536	11
`yacht`	308	6
`airfoil`	1503	5
`autos`	159	25
`breastcancer`	194	33
`buzz`	583250	77
`concrete`	1030	8
`elevators`	16599	18
`fertility`	100	9
`gas`	2565	128
`housing`	506	13
`keggundirected`	63608	27
`machine`	209	7
`pendulum`	630	9
`protein`	45730	9
`servo`	167	4
`skillcraft`	3338	19
`sml`	4137	26
`song`	515345	90
`tamielectric`	45781	3
`wine`	1599	11

Dataset information can be obtained from the all_datasets dictionary. For example, to obtain a list of all datasets with fewer than 1000 observations, execute the following:

from uci_datasets import all_datasets
[name for name, (n_observations, n_dimensions) in all_datasets.items() if n_observations < 1000]
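The same filtering pattern extends to the input dimension. The sketch below uses a hypothetical helper, select_datasets, that works on any dictionary mapping a dataset name to a (n_observations, n_dimensions) tuple, which is the layout the comprehension above assumes for all_datasets. To stay runnable without the package, it operates on a small dictionary copied by hand from the table in the Datasets section.

```python
def select_datasets(datasets, max_observations=None, max_dimensions=None):
    """Return sorted names of datasets within the given (inclusive) size limits."""
    selected = []
    for name, (n_observations, n_dimensions) in datasets.items():
        if max_observations is not None and n_observations > max_observations:
            continue
        if max_dimensions is not None and n_dimensions > max_dimensions:
            continue
        selected.append(name)
    return sorted(selected)


# Hand-copied excerpt of the table; stands in for uci_datasets.all_datasets.
example_datasets = {
    "challenger": (23, 4),
    "yacht": (308, 6),
    "breastcancer": (194, 33),
    "gas": (2565, 128),
    "song": (515345, 90),
}

small = select_datasets(example_datasets, max_observations=999)
small_and_low_dim = select_datasets(
    example_datasets, max_observations=999, max_dimensions=10
)
print(small)
print(small_and_low_dim)
```

Passing all_datasets in place of example_datasets should give the corresponding selection over the full collection, assuming the dictionary has the layout described above.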

Papers using these datasets

The following papers use the same datasets and test-train splits present in this repository.