We love scikit-learn, but we often find ourselves writing custom transformers, metrics and models. The goal of this project is to consolidate these into a single package with solid code quality and testing. This project is a collaboration between multiple companies in the Netherlands. It was initiated by Matthijs Brouns and Vincent D. Warmerdam as a tool to teach people how to contribute to open source.
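As an illustration of what "custom transformer" means here, a transformer in the scikit-learn style implements `fit` and `transform` and can then be dropped into a `Pipeline`. The sketch below is purely illustrative (the `CenteringTransformer` name and behaviour are not part of this package):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CenteringTransformer(BaseEstimator, TransformerMixin):
    """Illustrative custom transformer: subtracts column means learned in fit."""

    def fit(self, X, y=None):
        # learned state gets a trailing underscore, per sklearn convention
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.means_


tf = CenteringTransformer().fit([[1.0, 2.0], [3.0, 4.0]])
print(tf.transform([[1.0, 2.0]]))  # [[-1. -1.]]
```

Because it follows the `fit`/`transform` contract, such a class composes with any other scikit-learn pipeline step.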
Note that we are not formally affiliated with the scikit-learn project in any way. The same holds for LEGO: LEGO® is a trademark of the LEGO Group of companies, which does not sponsor, authorize or endorse this project.
Install scikit-lego via pip with

```bash
pip install scikit-lego
```

or via conda with

```bash
conda install -c conda-forge scikit-lego
```
Alternatively, to edit and contribute you can fork/clone the repository and run:

```bash
pip install -e ".[dev]"
python setup.py develop
```
The documentation can be found here.
We offer custom metrics, models and transformers. You can import them just like you would in scikit-learn.
```python
# the scikit-learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier()),
])

...
```
...
Here's a list of the features this library currently offers:

- `sklego.datasets.load_abalone` loads the abalone dataset
- `sklego.datasets.load_chicken` loads the joyful chickweight dataset
- `sklego.datasets.load_heroes` loads a Heroes of the Storm dataset
- `sklego.datasets.make_simpleseries` makes a simulated timeseries
- `sklego.pandas_utils.add_lags` adds lag values in a pandas dataframe
- `sklego.pandas_utils.log_step` a useful decorator to log your pipeline steps
- `sklego.dummy.RandomRegressor` dummy benchmark that predicts random values
- `sklego.linear_model.DeadZoneRegressor` experimental feature that has a deadzone in the cost function
- `sklego.linear_model.DemographicParityClassifier` logistic classifier constrained on demographic parity
- `sklego.linear_model.EqualOpportunityClassifier` logistic classifier constrained on equal opportunity
- `sklego.linear_model.ProbWeightRegression` linear model that treats coefficients as probabilistic weights
- `sklego.naive_bayes.GaussianMixtureNB` classifies by training a 1D GMM per column per class
- `sklego.naive_bayes.BayesianGaussianMixtureNB` classifies by training a Bayesian 1D GMM per column per class
- `sklego.mixture.BayesianGMMClassifier` classifies by training a Bayesian GMM per class
- `sklego.mixture.BayesianGMMOutlierDetector` detects outliers based on a trained Bayesian GMM
- `sklego.mixture.GMMClassifier` classifies by training a GMM per class
- `sklego.mixture.GMMOutlierDetector` detects outliers based on a trained GMM
- `sklego.meta.ConfusionBalancer` experimental feature that allows you to balance the confusion matrix
- `sklego.meta.DecayEstimator` adds decay to the sample_weight that the model accepts
- `sklego.meta.EstimatorTransformer` adds a model's output as a feature
- `sklego.meta.GroupedEstimator` can split the data into groups and run a model on each
- `sklego.meta.OutlierRemover` experimental method to remove outliers during training
- `sklego.meta.SubjectiveClassifier` experimental feature to add a prior to your classifier
- `sklego.meta.Thresholder` meta model that allows you to gridsearch over the threshold
- `sklego.preprocessing.ColumnCapper` limits extreme values of the model features
- `sklego.preprocessing.ColumnDropper` drops columns from a pandas dataframe
- `sklego.preprocessing.ColumnSelector` selects columns based on column name
- `sklego.preprocessing.InformationFilter` transformer that can de-correlate features
- `sklego.preprocessing.OrthogonalTransformer` makes all features linearly independent
- `sklego.preprocessing.PandasTypeSelector` selects columns based on pandas dtype
- `sklego.preprocessing.PatsyTransformer` applies a patsy formula
- `sklego.preprocessing.RandomAdder` adds randomness during training
- `sklego.preprocessing.RepeatingBasisFunction` repeating feature engineering, useful for timeseries
- `sklego.model_selection.KlusterFoldValidation` experimental feature that does K folds based on clustering
- `sklego.model_selection.TimeGapSplit` timeseries k-fold with a gap between train and test
- `sklego.pipeline.DebugPipeline` adds debug information to make debugging easier
- `sklego.metrics.correlation_score` calculates correlation between model output and a feature
- `sklego.metrics.equal_opportunity_score` calculates the equal opportunity metric
- `sklego.metrics.p_percent_score` proxy for model fairness with regard to sensitive attributes
- `sklego.metrics.subset_score` calculates a score on a subset of your data (meant for fairness tracking)
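To give a feel for one of these, `sklego.pandas_utils.log_step` wraps a pipeline step so that each call is logged. The snippet below is a minimal pure-Python sketch of that idea, not the library's actual implementation (which is more featureful, e.g. it reports dataframe shapes):

```python
import functools
import time


def log_step(func):
    """Illustrative sketch of a step-logging decorator."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        start = time.time()
        result = func(data, *args, **kwargs)
        # log the step name, duration, and output size
        print(f"{func.__name__} took {time.time() - start:.3f}s, "
              f"n_rows={len(result)}")
        return result
    return wrapper


@log_step
def drop_missing(rows):
    # toy step: rows are plain dicts here instead of a DataFrame
    return [r for r in rows if r.get("age") is not None]


clean = drop_missing([{"age": 21}, {"age": None}])  # [{'age': 21}]
```

Decorating each step this way makes it cheap to see where a long pandas pipeline spends its time.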
We want to be rather open about what we accept, but we do require three things before a feature is added to the project:
- any new feature contributes towards a demonstrable real-world use case
- any new feature passes standard unit tests (we use the ones from scikit-learn)
- the feature has been discussed in the issue list beforehand
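The "standard unit tests" mentioned above refer to scikit-learn's own estimator checks. As a hedged sketch of what that involves, the snippet below runs those checks on a built-in scikit-learn estimator (used here only as a stand-in for a new contribution):

```python
from sklearn.utils.estimator_checks import check_estimator
from sklearn.preprocessing import StandardScaler

# check_estimator raises as soon as an estimator violates the
# scikit-learn API conventions; a compliant estimator passes silently.
check_estimator(StandardScaler())
passed = True
print("all scikit-learn estimator checks passed")
```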
We automate all of our testing and use pre-commit hooks to keep the load on Travis light.