Package to easily import datasets from the UC Irvine Machine Learning Repository into scripts and notebooks.
Current Version: 0.0.7
In a Jupyter notebook, install with the command
!pip3 install -U ucimlrepo
Restart the kernel and import the module ucimlrepo.
from ucimlrepo import fetch_ucirepo, list_available_datasets
# check which datasets can be imported
list_available_datasets()
# import dataset
heart_disease = fetch_ucirepo(id=45)
# alternatively: fetch_ucirepo(name='Heart Disease')
# access data
X = heart_disease.data.features
y = heart_disease.data.targets
# train model e.g. sklearn.linear_model.LinearRegression().fit(X, y)
# access metadata
print(heart_disease.metadata.uci_id)
print(heart_disease.metadata.num_instances)
print(heart_disease.metadata.additional_info.summary)
# access variable info in tabular format
print(heart_disease.variables)
Loads a dataset from the UCI ML Repository, including the dataframes and metadata information.
Provide either a dataset ID or a dataset name as a keyword (named) argument. The function does not accept both.
id
: Dataset ID for UCI ML Repository
name
: Dataset name, or substring of name
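The either-or rule for `id` and `name` can be checked before any request is made. The following is a minimal sketch, not part of ucimlrepo: `build_fetch_kwargs` and `fetch_dataset` are hypothetical helper names, and only `fetch_ucirepo` with its `id` and `name` keyword arguments comes from the package.

```python
def build_fetch_kwargs(dataset_id=None, name=None):
    """Return keyword arguments for fetch_ucirepo, enforcing 'ID or name, not both'.

    Hypothetical helper for illustration; it is not part of ucimlrepo.
    """
    # Reject the call when both arguments are missing or both are given.
    if (dataset_id is None) == (name is None):
        raise ValueError("Provide exactly one of dataset_id or name")
    if dataset_id is not None:
        return {"id": dataset_id}
    return {"name": name}


def fetch_dataset(dataset_id=None, name=None):
    """Validate the arguments, then delegate to ucimlrepo.fetch_ucirepo."""
    kwargs = build_fetch_kwargs(dataset_id, name)
    # Imported lazily so the validation above works without the package installed.
    from ucimlrepo import fetch_ucirepo
    return fetch_ucirepo(**kwargs)
```

For example, `fetch_dataset(dataset_id=45)` and `fetch_dataset(name='Heart Disease')` would both request the Heart Disease dataset used above, while passing both arguments raises `ValueError` locally.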
dataset
  data
  : Contains dataset matrices as pandas dataframes
    ids
    : Dataframe of ID columns
    features
    : Dataframe of feature columns
    targets
    : Dataframe of target columns
    original
    : Dataframe consisting of all IDs, features, and targets
    headers
    : List of all variable names/headers
  metadata
  : Contains metadata information about the dataset
    - See Metadata section below for details
  variables
  : Contains variable details presented in a tabular/dataframe format
    name
    : Variable name
    role
    : Whether the variable is an ID, feature, or target
    type
    : Data type e.g. categorical, integer, continuous
    demographic
    : Indicates whether the variable represents demographic data
    description
    : Short description of variable
    units
    : Variable units for non-categorical data
    missing_values
    : Whether there are missing values in the variable's column
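Because the variables table records a role for each column, it can be used to group column names by role. This is a minimal sketch under stated assumptions: `names_by_role` is a hypothetical helper, it works on plain row dictionaries (for a pandas dataframe these can be obtained with `to_dict('records')`), and the exact spelling of the role values is assumed, so roles are lower-cased before grouping.

```python
from collections import defaultdict


def names_by_role(variable_rows):
    """Group variable names by their role.

    variable_rows: iterable of dicts with at least 'name' and 'role' keys.
    Hypothetical helper; not part of ucimlrepo.
    """
    grouped = defaultdict(list)
    for row in variable_rows:
        # Normalise case because the exact role spelling is an assumption.
        grouped[str(row["role"]).lower()].append(row["name"])
    return dict(grouped)


# Illustrative rows only; these are not real values from any UCI dataset.
example_rows = [
    {"name": "patient_id", "role": "ID"},
    {"name": "age", "role": "Feature"},
    {"name": "chol", "role": "Feature"},
    {"name": "diagnosis", "role": "Target"},
]
roles = names_by_role(example_rows)
print(roles["feature"])
```

With a fetched dataset, the call would look like `names_by_role(heart_disease.variables.to_dict('records'))`.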
Prints a list of datasets that can be imported via fetch_ucirepo
filter
: Optional keyword argument to filter available datasets based on a category
  - Valid filters: aim-ahead
search
: Optional keyword argument to search datasets whose name contains the search query
none
uci_id
: Unique dataset identifier for UCI repository
name
abstract
: Short description of dataset
area
: Subject area e.g. life science, business
task
: Associated machine learning tasks e.g. classification, regression
characteristics
: Dataset types e.g. multivariate, sequential
num_instances
: Number of rows or samples
num_features
: Number of feature columns
feature_types
: Data types of features
target_col
: Name of target column(s)
index_col
: Name of index column(s)
has_missing_values
: Whether the dataset contains missing values
missing_values_symbol
: Indicates what symbol represents the missing entries (if the dataset has missing values)
year_of_dataset_creation
dataset_doi
: DOI registered for dataset that links to UCI repo dataset page
creators
: List of dataset creator names
intro_paper
: Information about dataset's published introductory paper
repository_url
: Link to dataset webpage on the UCI repository
data_url
: Link to raw data file
additional_info
: Descriptive free text about dataset
  summary
  : General summary
  purpose
  : For what purpose was the dataset created?
  funding
  : Who funded the creation of the dataset?
  instances_represent
  : What do the instances in this dataset represent?
  recommended_data_splits
  : Are there recommended data splits?
  sensitive_data
  : Does the dataset contain data that might be considered sensitive in any way?
  preprocessing_description
  : Was there any data preprocessing performed?
  variable_info
  : Additional free text description for variables
  citation
  : Citation Requests/Acknowledgements
external_url
: URL to external dataset page. This field will only exist for linked datasets i.e. not hosted by UCI
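Since `external_url` exists only for linked datasets, code that reads metadata may want to treat it as optional. The sketch below is illustrative, not part of ucimlrepo: `summarize_metadata` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in with made-up values that only assumes the attribute-style access shown in the usage example above.

```python
from types import SimpleNamespace


def summarize_metadata(metadata):
    """Build a one-line summary from a metadata object with attribute access.

    Hypothetical helper; not part of ucimlrepo.
    """
    parts = [
        f"UCI ID {metadata.uci_id}",
        f"{metadata.num_instances} instances",
        f"{metadata.num_features} features",
    ]
    # external_url only exists for linked datasets, so read it defensively.
    external_url = getattr(metadata, "external_url", None)
    if external_url:
        parts.append(f"hosted externally at {external_url}")
    return f"{metadata.name}: " + ", ".join(parts)


# Stand-in object with made-up values, used here instead of a real download.
demo = SimpleNamespace(uci_id=1, name="Demo Dataset", num_instances=100, num_features=4)
print(summarize_metadata(demo))
```

With a fetched dataset, this would be called as `summarize_metadata(heart_disease.metadata)`.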