MLBox: A Python repository from Alwaysproblem

MLBox is a powerful Automated Machine Learning Python library. It provides the following features:

Fast reading and distributed data preprocessing/cleaning/formatting
Highly robust feature selection and leak detection
Accurate hyper-parameter optimization in high-dimensional space
State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM, ...)
Prediction with model interpretation

For more details, please refer to the official documentation.

Getting started: 30 seconds to MLBox

The main MLBox package contains 3 sub-packages: preprocessing, optimisation and prediction. They are aimed, respectively, at reading and preprocessing data, testing or optimising a wide range of learners, and predicting the target on a test dataset.

Here are a few lines to import MLBox:

from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

Then, all you need to provide is:

the list of paths to your train datasets and test datasets
the name of the target you are trying to predict (classification or regression)

paths = ["<file_1>.csv", "<file_2>.csv", ..., "<file_n>.csv"] #to modify
target_name = "<my_target>" #to modify
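As a concrete illustration, the sketch below writes two tiny CSV files with the standard library and builds `paths` and `target_name` from them. The file names, column names and values are made up for the example. It assumes the convention described in the MLBox documentation that a file containing the target column is treated as training data and a file without it as test data; please check the documentation of your installed version.

```python
import csv
import os
import tempfile

# Hypothetical toy data: column names and values are invented for illustration.
train_rows = [
    {"age": 25, "city": "Paris", "churn": 0},
    {"age": 41, "city": "Lyon", "churn": 1},
]
test_rows = [
    {"age": 33, "city": "Nice"},  # no target column in the test file
]


def write_csv(path, rows):
    """Write a list of dicts to a comma-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def has_target(path, target):
    """Return True if the CSV header contains the target column."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    return target in header


workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, "train.csv"), os.path.join(workdir, "test.csv")]
write_csv(paths[0], train_rows)
write_csv(paths[1], test_rows)

target_name = "churn"

# Show which role each file would play under the assumed convention.
for path in paths:
    role = "train" if has_target(path, target_name) else "test"
    print(os.path.basename(path), "->", role)
```

The resulting `paths` and `target_name` can then be used in the steps that follow.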

Now, let MLBox do the job!

... to read and preprocess your files:

data = Reader(sep=",").train_test_split(paths, target_name)  #reading
data = Drift_thresholder().fit_transform(data)  #deleting non-stable variables

... to evaluate models (here default configuration):

Optimiser().evaluate(None, data)

... or to test and optimize the whole Pipeline [OPTIONAL]:

missing data encoder, aka 'ne'
categorical variables encoder, aka 'ce'
feature selector, aka 'fs'
meta-features stacker, aka 'stck'
final estimator, aka 'est'

NB: please have a look at all the options available to configure the Pipeline (steps, parameters and values).

space = {

        'ne__numerical_strategy' : {"space" : [0, 'mean']},

        'ce__strategy' : {"space" : ["label_encoding", "random_projection", "entity_embedding"]},

        'fs__strategy' : {"space" : ["variance", "rf_feature_importance"]},
        'fs__threshold': {"search" : "choice", "space" : [0.1, 0.2, 0.3]},

        'est__strategy' : {"space" : ["XGBoost"]},
        'est__max_depth' : {"search" : "choice", "space" : [5,6]},
        'est__subsample' : {"search" : "uniform", "space" : [0.6,0.9]}

        }

best = Optimiser().optimise(space, data, max_evals=5)  #searching for the best parameters
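Each key in `space` follows the pattern `<step>__<parameter>`, where the step is one of the aliases listed above, and each value is a dictionary with a "space" entry and an optional "search" entry. The helper below, `check_space`, is hypothetical and not part of MLBox. It is a standard-library sketch for catching typos in keys before launching an optimisation, and it assumes the five aliases exactly as listed; adjust `KNOWN_STEPS` if your version accepts other step names.

```python
# Step aliases copied from the list above; check_space is a hypothetical helper.
KNOWN_STEPS = {"ne", "ce", "fs", "stck", "est"}


def check_space(space):
    """Group search-space entries by pipeline step, raising on malformed keys."""
    grouped = {}
    for key, spec in space.items():
        step, sep, param = key.partition("__")
        if not sep or not param:
            raise ValueError(f"{key!r} is not of the form '<step>__<parameter>'")
        if step not in KNOWN_STEPS:
            raise ValueError(f"unknown step alias {step!r} in {key!r}")
        if "space" not in spec:
            raise ValueError(f"{key!r} has no 'space' entry")
        grouped.setdefault(step, []).append(param)
    return grouped


space = {
    "ne__numerical_strategy": {"space": [0, "mean"]},
    "fs__threshold": {"search": "choice", "space": [0.1, 0.2, 0.3]},
    "est__max_depth": {"search": "choice", "space": [5, 6]},
    "est__subsample": {"search": "uniform", "space": [0.6, 0.9]},
}

# Prints the parameters being tuned for each pipeline step.
print(check_space(space))
```

A misspelled key such as `'es__max_depth'` raises a `ValueError` here instead of surfacing later, during the search.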

... finally to predict on the test set with the best parameters (or None for default configuration):

Predictor().fit_predict(best, data)

That's all! Have a look at the "save" folder, where you will find:

your predictions
feature importances
drift coefficients of your variables (0.5 = very stable, 1. = not stable at all)
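To build intuition for the 0.5 to 1 scale, the sketch below measures how well a single numerical variable separates train rows from test rows, expressed as an AUC-like score: 0.5 when the two samples are indistinguishable and 1.0 when they do not overlap. This is a generic illustration written with the standard library. The function `separability` and the toy values are hypothetical, and this is not claimed to be the exact computation MLBox performs.

```python
def separability(train_values, test_values):
    """AUC of the rule 'larger value means test row', folded into [0.5, 1.0]."""
    wins = 0.0
    for test_value in test_values:
        for train_value in train_values:
            if test_value > train_value:
                wins += 1.0
            elif test_value == train_value:
                wins += 0.5  # ties count as half a win
    auc = wins / (len(train_values) * len(test_values))
    # A variable that is systematically smaller in test is just as unstable
    # as one that is systematically larger, so fold the score around 0.5.
    return max(auc, 1.0 - auc)


# Same distribution in train and test: the variable carries no drift signal.
stable = separability([1, 2, 3, 4], [1, 2, 3, 4])
# Test values entirely above the train values: the variable fully separates them.
shifted = separability([1, 2, 3, 4], [10, 11, 12, 13])
print(stable, shifted)
```

Under this reading, a variable with a score close to 1 behaves very differently between the train and test sets, which is the kind of variable the drift thresholding step is meant to remove.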

How to Contribute

MLBox has been developed and used by many active community members. Your help is very valuable to make it better for everyone.

Check out the call for contributions to see what can be improved, or open an issue if you would like to request something.
Contribute to the tests to make it more reliable.
Contribute to the documents to make it clearer for everyone.
Contribute to the examples to share your experience with other users.
Open an issue if you run into problems during development.

For more details, please refer to CONTRIBUTING.
