Tabular data for various problems, especially for high-stakes rule-based modeling with the imodels package.
See also https://huggingface.co/imodels
Includes the following datasets and more (see notebooks for more details on the datasets).
To download, use the "Name" field as the key, e.g. imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels').
Name | Samples | Features | Class 0 | Class 1 | Majority class % |
---|---|---|---|---|---|
heart | 270 | 15 | 150 | 120 | 55.6 |
breast_cancer | 277 | 17 | 196 | 81 | 70.8 |
haberman | 306 | 3 | 81 | 225 | 73.5 |
credit_g | 1000 | 60 | 300 | 700 | 70 |
csi_pecarn_prop | 3313 | 97 | 2773 | 540 | 83.7 |
csi_pecarn_pred | 3313 | 39 | 2773 | 540 | 83.7 |
juvenile_clean | 3640 | 286 | 3153 | 487 | 86.6 |
compas_two_year_clean | 6172 | 20 | 3182 | 2990 | 51.6 |
enhancer | 7809 | 80 | 7115 | 694 | 91.1 |
fico | 10459 | 23 | 5000 | 5459 | 52.2 |
iai_pecarn_prop | 12044 | 73 | 11841 | 203 | 98.3 |
iai_pecarn_pred | 12044 | 58 | 11841 | 203 | 98.3 |
credit_card_clean | 30000 | 33 | 23364 | 6636 | 77.9 |
tbi_pecarn_prop | 42428 | 223 | 42052 | 376 | 99.1 |
tbi_pecarn_pred | 42428 | 121 | 42052 | 376 | 99.1 |
readmission_clean | 101763 | 150 | 54861 | 46902 | 53.9 |
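The "Majority class %" column is the share of the more frequent label. As a rough sketch (the helper name majority_class_pct is made up for illustration and is not part of imodels), it can be recomputed from any label vector:

```python
from collections import Counter


def majority_class_pct(y):
    """Return the share (in percent) of the most frequent label in y."""
    # Accept both numpy arrays and plain Python sequences.
    labels = y.tolist() if hasattr(y, "tolist") else list(y)
    counts = Counter(labels)
    return 100.0 * max(counts.values()) / len(labels)


# heart: 150 samples in class 0 and 120 in class 1 (counts taken from the table)
heart_y = [0] * 150 + [1] * 120
heart_pct = round(majority_class_pct(heart_y), 1)
print(heart_pct)  # 55.6
```

A value close to 100 (e.g. the PECARN datasets) means plain accuracy is a weak metric, since always predicting the majority class already scores that high.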
First, install the imodels package: pip install imodels. Then, use the imodels.get_clean_dataset function.
imodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') -> Tuple[numpy.ndarray, numpy.ndarray, list]
"""
Fetch clean data (as numpy arrays) from various sources including imodels, pmlb, openml, and sklearn. If data is not downloaded, will download and cache. Otherwise will load locally
Parameters
----------
dataset_name: str
unique dataset identifier
data_source: str
options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'
data_path: str
path to load/save data (default: 'data')
Returns
-------
X: np.ndarray
features
y: np.ndarray
outcome
feature_names: list
"""
import imodels

# download compas dataset from imodels
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')
# download ionosphere dataset from pmlb
X, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')
# download liver dataset from openml
X, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')
# download California housing dataset from sklearn
X, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')
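After loading, it can help to confirm that the three return values line up. The sketch below is not part of imodels: check_dataset is a hypothetical helper, and it assumes that feature_names has one entry per column of X. It runs on synthetic arrays so it works without downloading anything.

```python
import numpy as np


def check_dataset(X, y, feature_names):
    """Raise ValueError if X, y, and feature_names do not line up."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D feature matrix, got {X.ndim} dimension(s)")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{X.shape[0]} rows in X but {y.shape[0]} labels in y")
    if X.shape[1] != len(feature_names):
        raise ValueError(
            f"{X.shape[1]} columns in X but {len(feature_names)} feature names"
        )
    return X.shape


# Synthetic stand-in for the arrays returned by imodels.get_clean_dataset
X_demo = np.zeros((5, 3))
y_demo = np.array([0, 1, 0, 1, 1])
names_demo = ["f0", "f1", "f2"]
shape_demo = check_dataset(X_demo, y_demo, names_demo)
print(shape_demo)  # (5, 3)
```

With real data, the call would be check_dataset(X, y, feature_names) right after any of the loading lines above.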
Data comes from various sources - please cite those sources appropriately.
notebooks_fetch_data contains notebooks which download and preprocess the data
data_cleaned contains the cleaned csv file for each dataset
To use any of the clinical decision-rule datasets, you must first accept the research data use agreement here.
There are two versions of each PECARN (TBI, IAI, and CSI) dataset.
- prop: missing values have not been imputed
- pred: missing values have been imputed
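When working with a prop version, a first step is often to see how much is missing per feature. The sketch below assumes missing entries are encoded as NaN, which this README does not state, so check the actual encoding in the csv first. The helper and the feature names are hypothetical.

```python
import numpy as np


def missing_fraction_per_feature(X, feature_names):
    """Map each feature name to the fraction of rows where it is NaN."""
    # Assumption: missing values are NaN; adapt if another sentinel is used.
    X = np.asarray(X, dtype=float)
    frac = np.isnan(X).mean(axis=0)
    return {name: float(f) for name, f in zip(feature_names, frac)}


# Synthetic example with made-up feature names
X_demo = np.array([[1.0, np.nan], [0.0, 2.0], [1.0, np.nan], [0.0, 5.0]])
missing_demo = missing_fraction_per_feature(X_demo, ["feat_a", "feat_b"])
print(missing_demo)  # {'feat_a': 0.0, 'feat_b': 0.5}
```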
csi_pecarn_pred.csv
Note: unlike the rest of the datasets in this repo, which are fully cleaned, csi_pecarn_pred.csv contains a variable ("SITE") that should be removed before fitting models.
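One way to drop such a column is sketched below. drop_features is a hypothetical helper, not an imodels function, and it assumes the column appears in feature_names under the exact name "SITE". If the cleaned file encodes it differently (e.g. as several one-hot columns), the list of names to drop needs adjusting.

```python
import numpy as np


def drop_features(X, feature_names, to_drop):
    """Return X and feature_names without the columns named in to_drop."""
    drop = set(to_drop)
    keep_idx = [i for i, name in enumerate(feature_names) if name not in drop]
    X_kept = np.asarray(X)[:, keep_idx]
    names_kept = [feature_names[i] for i in keep_idx]
    return X_kept, names_kept


# Synthetic example; only the name "SITE" comes from the note above
X_demo = np.array([[1, 7, 0], [0, 3, 1]])
names_demo = ["feat_a", "SITE", "feat_b"]
X_clean, names_clean = drop_features(X_demo, names_demo, ["SITE"])
print(names_clean)  # ['feat_a', 'feat_b']
```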
Dataset | Task | Size |
---|---|---|
iai_pecarn | Predict intra-abdominal injury requiring acute intervention before CT | 12,044 patients, 203 with IAI-I |
tbi_pecarn | Predict traumatic brain injuries before CT | 42,412 patients, 376 with ciTBI |
csi_pecarn | Predict cervical spine injury in children | 3,314 patients, 540 with CSI |
The breast_cancer dataset here is not the extremely common Wisconsin breast-cancer dataset but rather this dataset from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.
Some other cool datasets:
- moleculenet - benchmarks for molecular datasets
- srbench - benchmarking for symbolic regression
- big-bench - language modeling benchmarks