/oboe

A model selection system for AutoML (automated machine learning) initialization.

Primary LanguagePython

Oboe

In an orchestra, the oboe plays an initial note which the other instruments use to tune to the right frequency before the performance begins; this package, Oboe, can be used to provide an initialization with which to tune the hyperparameters of machine learning algorithms.

Oboe is a data-driven Python algorithmic system for automated machine learning, and is based on matrix factorization and classical experiment design. For a complete description, refer to our paper.

This system is still under developement and subjects to change.

Installation

Dependencies

oboe requires:

  • Python (>= 3.5)
  • numpy (>= 1.8.2)
  • scipy (>= 0.13.3)
  • pandas (>=0.22.0)
  • scikit-learn (>= 0.18)
  • multiprocessing (>=0.70.5)
  • OpenML (>=0.7.0)
  • mkl (>=1.0.0)
  • re
  • os
  • json
  • util

cvxpy is required unless scipy.optimize is used as the convex solver, which is default.

User Installation

This part is currently under development; an example for code usage is in the example folder. The package will be setup.py and pip installable in the future.

Usage

Online Phase (AutoML initialization)

Given a new dataset, we want to select promising models and hyperparameters. Denote features and labels of the training set as x_train and y_train, and features of the test set as x_test, a short example of training and testing is

from auto_learner import AutoLearner
m = AutoLearner(runtime_limit=20) #set the time limit for model fitting to be 20 seconds
m.fit(x_train, y_train)
m.predict(x_test)

Additional arguments can be applied to customize the AutoLearner instance, including:

  1. Basics
  • p_type (str): Problem type, which is one of {'classification', 'regression'}. By default, 'classification'.
  • verbose (Boolean): Whether or not to generate print statements that showcase the progress. By default, false.
  • n_cores (int): Maximum number of CPU cores to use. The default value 'None' means no limit, i.e., up to all the CPU cores of the machine.
  • runtime_limit (int): Maximum runtime for AutoLearner fitting, in seconds. By default, 512 seconds as the timeout limit.
  • scalarization (str): Scalarization of the covariance matrix for mininum variance selection. One of {'D', 'A', 'E'}. 'D', as default, enjoys best performance and fastest speed in practice.
  • solver (str): The convex solver for classic experiment design. Possible value is one of {'scipy', 'cvxpy'}. By default, 'scipy'.
  • build_ensemble (Boolean): Whether to build an ensemble of promising models.
  • stacking_alg(str): The method used for ensemble construction. One of {'greedy', 'stacking'}. By default, 'greedy'.
  1. Advanced customization
  • algorithms (list): A list of algorithm types to be considered, in strings, e.g. ['KNN', 'lSVM']. By default, all the algorithms in the error matrix. The supported classification algorithms are: 'AB' (Adaboost), 'DT' (decision tree), 'ExtraTrees' (extra trees), 'GBT' (gradient boosting), 'GNB' (Gaussian naive Bayes), 'KNN' (kNN), 'Logit' (logistic regression), 'MLP' (multilayer perceptron), 'Perceptron' (perceptron), 'RF' (random forest), 'kSVM' (kernel SVM), 'lSVM' (linear SVM).
  • hyperparameters (dict): A nested dict of hyperparameters to be considered. By default, all the model hyperparameters in the error matrix.
  • error_matrix (DataFrame): Error matrix to use for imputation, includes indices and headers. The one in defaults folder is used by default.
  • runtime_matrix (DataFrame): Runtime matrix to use for runtime prediction, includes indices and headers. The one in defaults folder is used by default.
  • new_row (np.ndarray): Predicted row of error matrix; corresponds to the new dataset. By default, 'None'.
  • selection_method (str): Method of selecting entries of new row to sample. One of {'min_variance', 'qr'}. 'min_variance' corresponds to the selection approach via classic experiment design; 'qr' selects the pivot columns in the error matrix and thus does not provide the functionality of maximizing performance within given runtime budget. By default, 'min_variance'.

For executable and more detailed examples, please refer to the example folder.

Offline Phase

Error Matrix Generation

Please refer to examples/error_matrix_generation for an error matrix generation example.