This repository contains my use case submission for the course Automated Machine Learning held by Dr. Janek Thomas and Lennart Schneider at the LMU Munich in the winter term 2022/23.
The objective of this use case was to build an automated machine learning tool for imbalanced binary classification.
The requirements and additional information are specified in setup.cfg.
Clone this repository
git clone https://github.com/fstermann/autoibc
Install the package with pip
pip install .
or, if you want to install the package with all optional dependencies needed to produce visualizations:
pip install -e .[viz]
# pip install -e ".[viz]" # Escape for zsh
You might need to install the latest version of swig in order to install pyrfr.
The basic AutoIBC class can be used like any sklearn estimator.
As such, you can pass it into the cross_val_score function to cross validate the system.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from autoibc import AutoIBC
from autoibc.data import Dataset
# Instantiate the tool
auto_ibc = AutoIBC()
# Load the dataset (idx is an OpenML dataset ID, e.g. 976 for JapaneseVowels)
idx = 976
X, y = Dataset.from_openml(idx).to_numpy()
# Cross validate
cv = StratifiedKFold(n_splits=3)
cross_val_score(auto_ibc, X, y, scoring="balanced_accuracy", cv=cv)
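The example above scores with balanced accuracy, which is the mean of the per-class recalls. The following standalone sketch (plain Python, independent of this package; the helper name `balanced_accuracy` is only illustrative) shows why this metric suits imbalanced data better than plain accuracy:

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class (0) on a 90/10 split
y_true = [0] * 9 + [1]
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Always predicting the majority class looks strong under plain accuracy but scores no better than chance under balanced accuracy.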
To control variables such as the number of inner cross validation splits or the number of trials, you can pass these parameters to cross_val_score with the fit_params argument.
cross_val_score(
    auto_ibc,
    ...,
    fit_params=dict(
        n_trials=100,
        cv_splits=5,  # Number of inner cross validation splits
        outer_cv=True,  # Tells the system to save the runs for each outer fold
        run_name="my-custom-run",
        seed=42,
    )
)
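When choosing n_trials and cv_splits, it can help to estimate how many model fits a nested cross validation may trigger. The sketch below is a rough back-of-the-envelope calculation only. It assumes that every trial fits one configuration once per inner split, which this README does not state, and the helper `estimated_model_fits` is hypothetical and not part of the package.

```python
def estimated_model_fits(outer_splits, n_trials, cv_splits):
    """Rough estimate of model fits in a nested cross validation.

    Assumption (not confirmed by the package): each trial evaluates one
    configuration by fitting it once per inner split, and the whole search
    is repeated for every outer fold.
    """
    return outer_splits * n_trials * cv_splits


# 3 outer folds, 100 trials, 5 inner splits
print(estimated_model_fits(outer_splits=3, n_trials=100, cv_splits=5))  # 1500
```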
You can also configure a pipeline on your own.
Simply pass in the name of each step along with a list of its components to the AutoPipeline constructor.
Alternatively, pass in components as a dictionary, where the key is the component and the value is the weight assigned to it. The weight will be used by the optimization process.
It is also possible to instruct the pipeline to try configurations that skip a step. Simply pass in None as one of the components.
from autoibc.components import classification, preprocessing, sampling
from autoibc.data import Dataset
from autoibc.pipeline import AutoPipeline
# Set up the pipeline with 3 steps
pipeline = AutoPipeline(
    steps=dict(
        preprocessing=[
            preprocessing.AutoSimpleImputer(),
        ],
        sampling=[
            None,  # Will consider configurations where no sampling is done
            sampling.AutoSMOTE(),
            sampling.AutoSMOTETomek(),
        ],
        classification={
            classification.AutoRandomForest(): 3,  # Prefer Random Forest
            classification.AutoGradientBoosting(): 1,
        },
    )
)
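How exactly the optimization process uses the component weights is not described here. One plausible reading, shown purely as an assumption, is that the weights are normalized into prior probabilities for choosing a component. The helper `normalize_weights` below is hypothetical and not part of the package.

```python
def normalize_weights(weights):
    """Turn non-negative component weights into probabilities that sum to 1.

    Assumption for illustration only: the package may treat weights
    differently internally.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("At least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


# Weights taken from the classification step in the example above
weights = {"AutoRandomForest": 3, "AutoGradientBoosting": 1}
probabilities = normalize_weights(weights)
print(probabilities)  # {'AutoRandomForest': 0.75, 'AutoGradientBoosting': 0.25}
```

Under this reading, a weight of 3 against 1 would mean the random forest is proposed about three times as often as gradient boosting.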
To create a new component, create a new class that inherits from BaseAutoIBC and implements the configspace property.
Make sure to pass the sklearn model to the model argument of the BaseAutoIBC constructor.
Example of defining a new component for a random forest:
from ConfigSpace import ConfigurationSpace
from sklearn.ensemble import RandomForestClassifier
from autoibc.base import BaseAutoIBC
from autoibc.hp import Boolean, Categorical, Float, Integer
from autoibc.util import make_configspace
class AutoRandomForest(BaseAutoIBC):
    def __init__(self) -> None:
        super().__init__(model=RandomForestClassifier)

    @property
    def configspace(self) -> ConfigurationSpace:
        return make_configspace(
            Boolean("bootstrap", default=True),
            Categorical("criterion", ["gini", "entropy"], default="gini"),
            Float("max_features", (0.0, 1.0), default=0.5),
            Integer("min_samples_leaf", (1, 20), default=1),
            name=self.name,
        )
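To make the meaning of such a search space concrete, the following standalone sketch draws one random configuration from the four hyperparameter ranges used above. It uses only the Python standard library and plain uniform sampling. It does not use ConfigSpace or SMAC and is not how the package searches. `SEARCH_SPACE` and `sample_configuration` are hypothetical names introduced for illustration.

```python
import random

# Same names and ranges as in the AutoRandomForest example above
SEARCH_SPACE = {
    "bootstrap": ("boolean", None),
    "criterion": ("categorical", ["gini", "entropy"]),
    "max_features": ("float", (0.0, 1.0)),
    "min_samples_leaf": ("integer", (1, 20)),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, uniformly within its domain."""
    config = {}
    for name, (kind, domain) in space.items():
        if kind == "boolean":
            config[name] = rng.choice([True, False])
        elif kind == "categorical":
            config[name] = rng.choice(domain)
        elif kind == "float":
            low, high = domain
            config[name] = rng.uniform(low, high)
        elif kind == "integer":
            low, high = domain
            config[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            raise ValueError(f"Unknown hyperparameter kind: {kind}")
    return config


rng = random.Random(42)  # fixed seed for reproducibility
config = sample_configuration(SEARCH_SPACE, rng)
print(config)
```

A Bayesian optimizer would replace the uniform draws with proposals informed by earlier trials, but each proposal is still a dictionary of this shape.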
To compare the performance of the system on the given benchmark datasets against a random forest baseline, run:
python -m benchmark
The benchmark was run on Google Colab. To avoid dependency conflicts with tensorflow, you might need to run
!pip install numpy~=1.23.0
after installing the package.
The following datasets from OpenML are used in the benchmark example:
ID | Dataset | Minority Class Fraction | # Features | # Observations |
---|---|---|---|---|
976 | JapaneseVowels | 0.16 | 15 | 9961 |
980 | optdigits | 0.10 | 65 | 5620 |
1002 | ipums_la_98-small | 0.10 | 56 | 7485 |
1018 | ipums_la_99-small | 0.06 | 57 | 8844 |
1019 | pendigits | 0.10 | 17 | 10992 |
1021 | page-blocks | 0.10 | 11 | 5473 |
1040 | sylva_prior | 0.06 | 109 | 14395 |
1053 | jm1 | 0.19 | 22 | 10885 |
1116 | musk | 0.15 | 170 | 6598 |
41160 | rl | 0.16 | 23 | 31406 |
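The small-class share in the third column of the table can be converted into the more familiar majority-to-minority imbalance ratio. A minimal sketch, assuming the column holds fractions between 0 and 0.5 (the helper `imbalance_ratio` is illustrative only, and the table values are rounded, so the ratios are approximate):

```python
def imbalance_ratio(minority_fraction):
    """Number of majority-class samples per minority-class sample."""
    if not 0.0 < minority_fraction <= 0.5:
        raise ValueError("minority_fraction must be in (0, 0.5]")
    return (1.0 - minority_fraction) / minority_fraction


# Three rows from the table above
for name, fraction in [("optdigits", 0.10), ("sylva_prior", 0.06), ("jm1", 0.19)]:
    print(f"{name}: {imbalance_ratio(fraction):.1f} : 1")
# optdigits: 9.0 : 1
# sylva_prior: 15.7 : 1
# jm1: 4.3 : 1
```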
Visualizations of the benchmark results can be found in the visualization notebook.
The following packages are used in the implementation:
- openml: Python API for OpenML
- scikit-learn: Machine learning library
- imbalanced-learn: Extension of scikit-learn to handle imbalanced datasets
- smac: Bayesian optimization for hyperparameter tuning
- Visualization
  - matplotlib: Plotting library