This is the CleanML Benchmark for Joint Data Cleaning and Machine Learning.
The details of the benchmark methodology and design are described in the paper: CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
To run experiments, download and unzip the datasets. Place them under the project home directory and execute the following command from the project home directory:
python3 --run_experiments [--dataset <name>] [--cpu <num_cpu>] [--log]
--dataset: the experiment dataset. If not specified, the program will run experiments on all datasets.
--cpu: the number of CPUs used for the experiments. Default is 1.
--log: whether to log the experiment process.
The experimental results for each dataset will be saved in the /result directory as a JSON file named <dataset name>_result.json. Each result is a key-value pair. The key is a string in the format "<dataset>/<split seed>/<error type>/<clean method>/<ML model>/<random search seed>". The value is a set of key-value pairs, one for each evaluation metric and its result. Our experimental results are provided in
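The key format above can be unpacked with a few lines of Python. This is a minimal sketch, not part of the benchmark code: the helper names `parse_result_key` and `load_results` are hypothetical, and it assumes that none of the six components contains a "/" itself.

```python
import json

# Field names mirror the key format described above.
KEY_FIELDS = (
    "dataset",
    "split_seed",
    "error_type",
    "clean_method",
    "ml_model",
    "random_search_seed",
)


def parse_result_key(key):
    """Split a result key into its six named components (all kept as strings)."""
    parts = key.split("/")
    if len(parts) != len(KEY_FIELDS):
        raise ValueError(
            f"expected {len(KEY_FIELDS)} fields, got {len(parts)}: {key!r}"
        )
    return dict(zip(KEY_FIELDS, parts))


def load_results(path):
    """Load a <dataset name>_result.json file and pair each parsed key with its metrics."""
    with open(path) as f:
        raw = json.load(f)
    return [(parse_result_key(key), metrics) for key, metrics in raw.items()]
```

For example, `parse_result_key("demo/1/some_error/some_method/some_model/42")` yields a dictionary whose `"dataset"` entry is `"demo"`; the component values in that string are placeholders, not names used by the benchmark.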
To run analysis for populating relations described in the paper, unzip
and execute the following command from the project home directory:
python3 --run_analysis [--alpha <value>]
--alpha: the significance level for the multiple hypothesis test. Default is 0.05.
The relations R1, R2 and R3 will be saved in the /analysis directory. Our analysis results are provided in
To add a new dataset, first create a new folder named after the dataset under /data, and create a raw folder inside it. The raw folder must contain the raw data, named raw.csv. For a dataset with inconsistencies, it must also contain the inconsistency-cleaned version of the data, named inconsistency_clean_raw.csv. For a dataset with mislabels, it must also contain the mislabel-cleaned version of the data, named mislabel_clean_raw.csv. The directory structure looks like:
.
└── data
    └── new_dataset
        └── raw
            ├── raw.csv
            ├── inconsistency_clean_raw.csv (for dataset with inconsistencies)
            └── mislabel_clean_raw.csv (for dataset with mislabels)
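Before running experiments on a new dataset, it can help to confirm that the folder matches the layout above. The following is a small sketch under stated assumptions: `missing_raw_files` is a hypothetical helper, not a function of the benchmark, and the two boolean flags are an illustrative way to say which optional files are expected.

```python
from pathlib import Path


def missing_raw_files(data_root, dataset_name,
                      has_inconsistencies=False, has_mislabels=False):
    """Return the required files that are absent from <data_root>/<dataset_name>/raw."""
    raw_dir = Path(data_root) / dataset_name / "raw"
    required = ["raw.csv"]
    if has_inconsistencies:
        required.append("inconsistency_clean_raw.csv")
    if has_mislabels:
        required.append("mislabel_clean_raw.csv")
    # A file counts as present only if it exists and is a regular file.
    return [name for name in required if not (raw_dir / name).is_file()]
```

An empty list means the layout is complete; for instance, `missing_raw_files("data", "new_dataset", has_mislabels=True)` lists `mislabel_clean_raw.csv` if only `raw.csv` has been added so far.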
Then add a dictionary to /schema/ and append it to the datasets array at the end of the file.
The new dictionary must contain the following keys:
data_dir: the name of the dataset.
error_types: a list of error types that the dataset contains.
label: the label of ML task.
The following keys are optional:
class_imbalance: whether the dataset is class imbalanced.
categorical_variables: a list of categorical attributes.
text_variables: a list of text attributes.
key_columns: a list of key columns used for deduplication.
drop_variables: a list of irrelevant attributes.
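A dataset entry built from the keys above might look like the sketch below. Every value is a placeholder chosen for illustration (column names, error type strings, and the dataset name are not taken from the benchmark), and `check_dataset_entry` is a hypothetical helper that only checks key names.

```python
REQUIRED_KEYS = {"data_dir", "error_types", "label"}
OPTIONAL_KEYS = {
    "class_imbalance",
    "categorical_variables",
    "text_variables",
    "key_columns",
    "drop_variables",
}

# Hypothetical entry: all values are placeholders, not a dataset shipped with CleanML.
new_dataset = {
    "data_dir": "new_dataset",
    "error_types": ["error_type_a", "error_type_b"],  # placeholder error type names
    "label": "target",
    "class_imbalance": False,
    "categorical_variables": ["city"],
    "text_variables": ["description"],
    "key_columns": ["id"],
    "drop_variables": ["url"],
}


def check_dataset_entry(entry):
    """Return (missing required keys, unrecognized keys), both sorted."""
    missing = REQUIRED_KEYS - entry.keys()
    unknown = entry.keys() - REQUIRED_KEYS - OPTIONAL_KEYS
    return sorted(missing), sorted(unknown)


datasets = []  # stands in for the datasets array at the end of the schema file
datasets.append(new_dataset)
```

Calling `check_dataset_entry(new_dataset)` returns two empty lists when all required keys are present and no key is misspelled.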
To add a new error type, add a dictionary to /schema/ and append it to the error_types array at the end of the file.
The new dictionary must contain the following keys:
name: the name of the error type.
cleaning_methods: a dictionary of the form {cleaning method name: cleaning method object}.
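As an illustration of these two keys, the sketch below registers one error type with one cleaning method. `PlaceholderCleaner`, the names `"new_error_type"` and `"placeholder_clean"`, and the lookup helper `find_cleaner` are all hypothetical; the placeholder class returns its inputs unchanged and exists only so the example runs.

```python
class PlaceholderCleaner:
    """Stand-in for a real cleaning class; it leaves the data unchanged."""

    def fit(self, dataset, dirty_train):
        pass

    def clean(self, dirty_train, dirty_test):
        # No cleaning is performed, so no indicators are produced.
        return dirty_train, None, dirty_test, None


# Hypothetical error type entry with a single cleaning method.
new_error_type = {
    "name": "new_error_type",
    "cleaning_methods": {"placeholder_clean": PlaceholderCleaner()},
}

error_types = []  # stands in for the error_types array in the schema file
error_types.append(new_error_type)


def find_cleaner(error_types, error_name, method_name):
    """Look up a cleaning method object by error type name and method name."""
    for error_type in error_types:
        if error_type["name"] == error_name:
            return error_type["cleaning_methods"][method_name]
    raise KeyError(error_name)
```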
To add a new ML model, add a dictionary to /schema/ and append it to the models array at the end of the file.
The new dictionary must contain the following keys:
name: the name of the model.
fn: the function of the model.
fixed_params: parameters not to be tuned.
hyperparams: the hyperparameter to be tuned.
hyperparams_type: the type of the hyperparameter, either "real" or "int".
hyperparams_range: the range of the search. For real-type hyperparameters, give the range on a log scale.
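The sketch below shows one way a model entry with these keys could be written and sampled from. Several details are assumptions rather than facts about the benchmark: that fixed_params is a dictionary, that hyperparams is a single parameter name, that hyperparams_range is a [low, high] pair, and that the log base for real-type ranges is 10. `PlaceholderModel`, `sample_hyperparam`, and `build_model` are hypothetical names.

```python
import random


class PlaceholderModel:
    """Stand-in for a real estimator class; it stores the parameters it receives."""

    def __init__(self, **params):
        self.params = params


# Hypothetical model entry; the parameter names and values are placeholders.
new_model = {
    "name": "placeholder_model",
    "fn": PlaceholderModel,
    "fixed_params": {"max_iter": 100},
    "hyperparams": "C",
    "hyperparams_type": "real",
    "hyperparams_range": [-5, 5],  # assumed to be exponents of 10
}


def sample_hyperparam(model, rng):
    """Draw one candidate value for the tuned hyperparameter."""
    low, high = model["hyperparams_range"]
    if model["hyperparams_type"] == "real":
        # Assumption: real ranges are base-10 exponents, sampled uniformly.
        return 10 ** rng.uniform(low, high)
    return rng.randint(low, high)  # inclusive on both ends


def build_model(model, rng):
    """Instantiate the model with its fixed parameters plus one sampled value."""
    params = dict(model["fixed_params"])
    params[model["hyperparams"]] = sample_hyperparam(model, rng)
    return model["fn"](**params)
```

Passing a seeded generator such as `random.Random(0)` keeps the draw reproducible, which is in the spirit of the random search seed recorded in each result key.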
To add a new cleaning method, add a class to /schema/
The class must contain two methods:
fit(dataset, dirty_train): takes the dataset dictionary and the dirty training set. Computes statistics or trains models on the training set for data cleaning.
clean(dirty_train, dirty_test): takes the dirty training set and the dirty test set. Cleans the errors in both sets and returns (clean_train, indicator_train, clean_test, indicator_test), which are the cleaned datasets and the indicators that mark the locations of the errors.
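A class following this two-method interface could be sketched as below. To keep the example dependency-free, each dataset is represented as a list of row dictionaries with missing cells stored as None; the benchmark itself may pass a different tabular type. `MeanImputer` is a hypothetical class written for illustration, and the per-cell boolean indicator format is likewise an assumption.

```python
class MeanImputer:
    """Illustrative cleaner: fills missing numeric cells with the training-set mean."""

    def fit(self, dataset, dirty_train):
        # Learn one mean per numeric feature column, using the training set only.
        label = dataset["label"]
        self.means = {}
        columns = list(dirty_train[0].keys()) if dirty_train else []
        for col in columns:
            if col == label:
                continue
            values = [row.get(col) for row in dirty_train
                      if isinstance(row.get(col), (int, float))]
            if values:
                self.means[col] = sum(values) / len(values)

    def _clean_rows(self, rows):
        clean, indicator = [], []
        for row in rows:
            new_row = dict(row)  # copy, so the dirty data is left untouched
            flags = {}
            for col, mean in self.means.items():
                flags[col] = row.get(col) is None  # True marks an error location
                if flags[col]:
                    new_row[col] = mean
            clean.append(new_row)
            indicator.append(flags)
        return clean, indicator

    def clean(self, dirty_train, dirty_test):
        clean_train, indicator_train = self._clean_rows(dirty_train)
        clean_test, indicator_test = self._clean_rows(dirty_test)
        return clean_train, indicator_train, clean_test, indicator_test
```

The statistics come from the training set alone and are then applied to both splits, so nothing learned from the test set influences the cleaning.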
We consider "BD" and "CD" scenarios in our paper. To investigate other scenarios, add scenarios to /schema/