SenteraLLC/geoml

Design an inheritance architecture for all data, and tuning and training functions

Closed this issue · 5 comments

Issue #11 provided a solution for loading in all research data, splitting it into a train and test set, and creating the X matrix and vectors. This is all held in a class (currently named rtio - [should be renamed to something more relevant]), so the object/instance of rtio can be passed to another class that undertakes hyperparameter tuning, and yet another that takes care of the training, testing, and plotting (think more about this).

Instead of creating files to hold all the tuning and training functions, assign them to one of three classes (again, think more about this):

  1. feature_data - class that holds the data feature_data.X_train, feature_data.X_test, etc.
  2. tuning - class specifically designed to perform tuning (this should use skopt.gp_minimize as in issue #7); I'm thinking we should make tuning inherit from the feature_data class, so that all functions and data are accessible to tuning.
  3. training - this object must take in the appropriate hyperparameter settings for a model, take the feature_data train/test data, conduct cross-validation, and export the trained model in an encrypted format that can't be "reverse engineered" to get at the training data (a rough sketch follows this list). More on this topic in this video by Alex Gaynor and this sklearn page. Also make training inherit from feature_data.
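
A rough sketch of what item 3 could boil down to, setting the encryption piece aside for now (the function name and arguments are hypothetical; assumes scikit-learn and joblib):

import joblib
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

def train_and_export(X_train, y_train, hyperparams, out_path):
    """Hypothetical sketch: fit a model with tuned hyperparameters, report
    cross-validated error, and export only the fitted estimator."""
    model = Lasso(**hyperparams)  # e.g., hyperparams={'alpha': 0.01}
    cv_mae = -cross_val_score(model, X_train, y_train, cv=5,
                              scoring='neg_mean_absolute_error').mean()
    model.fit(X_train, y_train)
    joblib.dump(model, out_path)  # the file holds coefficients, not the raw training data
    return model, cv_mae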

To do

Create a list with all sip_functions, to be ported over to one of these three classes. Any extra functions that will not be ported should be denoted in the list. Upon list creation, we will create a new issue for each of the functions to be included in the class objects.

Here is the working list

Original issue text

Create several files to hold all of the functions for tuning and training so they can be simply loaded into a script to be accessed

Background

The OOP way is to create a class, but classes should only be used (mostly true) if there are particular data to be held and manipulated that should be stored within the instance of the class. Further, classes should be used when there are multiple instances of that class that should exist at the same time (and they may interact in some way).

See more on classes vs functions here.

For tuning and cross-validation, I don't think it is necessary to have classes because all we're looking to do is get the final results, then maybe save to file/database or do some plotting. I think it makes sense to perform tuning and cross-validation as simple functions.
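
For example, tuning could be a plain function that just returns the winning hyperparameters (a sketch only; the estimator, grid, and scoring are placeholders):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

def tune_lasso(X_train, y_train, alphas=(0.001, 0.01, 0.1, 1.0)):
    # Sketch of tuning as a simple function: run a grid search and return
    # only the best hyperparameters and the corresponding CV score.
    search = GridSearchCV(Lasso(), {'alpha': list(alphas)},
                          scoring='neg_mean_absolute_error', cv=5)
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_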

After results are calculated, it may make sense for them to be wrapped inside a class that manipulates the results in different ways for plotting, saving, perhaps even saving/encrypting the whole class to be loaded in a different environment for model training?

Functions can always be cut and pasted to be made as part of a class, so don't add the class complexity unless we know it is necessary.

Here is the working list of functions

This will be supplementary to many of the new features to work on.

Some info on inheritance with examples

A key thing to consider when deciding between inheritance and composition is that we must be able to pass a tuned, trained model along to predict a response on new data.

Thus, I propose the following architecture:

class FeatureData(object):
    def __init__(self, feature_data=[1,2,3], **kwargs):
        super(FeatureData, self).__init__(**kwargs)
        self.feature_data = feature_data
        print("FeatureData")

class FeatureSelection(FeatureData):
    def __init__(self, feature_selection='Lasso', **kwargs):
        super(FeatureSelection, self).__init__(**kwargs)
        self.feature_selection = feature_selection
        print("FeatureSelection")

class Tuning(FeatureSelection):
    def __init__(self, tuning_params={'alpha': 0.01}, **kwargs):
        super(Tuning, self).__init__(**kwargs)
        self.tuning_params = tuning_params
        print("Tuning")

class Training(Tuning):
    def __init__(self, training_obj_f='mae', **kwargs):
        super(Training, self).__init__(**kwargs)
        self.training_obj_f = training_obj_f
        print("Training")

class Prediction(object):
    def __init__(self, training, predict_X=[[12, 14], [13, 16]], predict_y=[2.3, 2.6], **kwargs):
        self.training = training  # composition: a fully tuned/trained Training instance
        self.predict_X = predict_X
        self.predict_y = predict_y
        print("Prediction")

Notice that Prediction uses composition instead of inheritance because we MUST have a trained model before anything can be predicted! Thus, the Prediction object can only carry out its functions if it has access to a trained model (and of course has something to predict). If we were to use inheritance with Prediction, then it seems as if we would have to re-train, re-tune, re-do feature selection EVERY TIME we want to make a prediction.
I'm sure there are ways around this (maybe by loading in the objects from a pickled file and overwriting any of the class attributes), but this seems counter-intuitive when we could just pass the training information in the first place.
This architecture would allow us to create as many training instances as we would like (ahead of time or in real time), then go ahead and make predictions using the training scenario of our choosing.

Use of the above architecture:

train1 = Training(feature_data=[1,2,3])
train2 = Training(feature_data=[4,5,6])
predict1 = Prediction(train1)
predict2 = Prediction(train2)
print('Predict 1 feature data:', predict1.training.feature_data)
print('Predict 2 feature data:', predict2.training.feature_data)

Returns:

Predict 1 feature data: [1, 2, 3]
Predict 2 feature data: [4, 5, 6]

See issue #17

For each class that is inherited, we ideally want to be able to gather all the information necessary to fully execute the functions of said class, just by creating a new instance of the child class. This may get difficult to manage as the number of parameters increases, maybe in excess of 10-20.

Consider having a dictionary (or a separate "composition" class) that can be passed to the parent class or any child class and contains all the parameters necessary to fully execute data retrieval, feature selection, tuning, and training. This would make the call to the parent/child class(es) much cleaner, e.g.,

param_dict = {
    'FeatureData': {
        'base_dir_data': r'C:\data_dir',
        'random_seed': 999},
    'FeatureSelection': {
        'algorithm': 'Lasso',
        'n_feats': 10}
    }
my_params = TrainParameters(param_dict)
train1 = Training(my_params)  # this will execute data retrieval, selection, tuning, etc., all according to rules in `param_dict`
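
TrainParameters does not exist yet; here is a minimal sketch of what that composition class could look like (purely illustrative; the method and key names are assumptions):

class TrainParameters(object):
    """Hypothetical container that hands out per-class parameter groups
    from a single nested dictionary."""
    def __init__(self, param_dict):
        self.param_dict = dict(param_dict)

    def get(self, class_name):
        # Return the kwargs destined for a given class (e.g., 'FeatureData'),
        # falling back to an empty dict so callers can always **-unpack it.
        return self.param_dict.get(class_name, {})

Training (or any class in the hierarchy) could then pull out just the kwargs it needs, e.g. my_params.get('FeatureSelection').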

Parallel processing

Rule of thumb: divvy up number of PP jobs at the highest level, and only use some cores for the lower level processing. With that said, we still want the option to process in parallel at any level in this process.
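
As a rough sketch of that rule of thumb (the data, feature subsets, model, and grid below are all placeholders), the outer loop over feature subsets gets all the cores while the inner grid search is pinned to one:

from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=8, random_state=0)
feature_subsets = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]  # placeholder column subsets

def tune_one_subset(cols):
    # Lower-level step: keep n_jobs=1 here so the cores are consumed by
    # the outer loop, not by each inner grid search.
    search = GridSearchCV(Lasso(), {'alpha': [0.01, 0.1, 1.0]}, cv=3, n_jobs=1)
    search.fit(X[:, cols], y)
    return search.best_params_

# Highest level: one parallel job per feature subset.
results = Parallel(n_jobs=-1)(delayed(tune_one_subset)(c) for c in feature_subsets)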

Major steps to accomplish:

  1. [DONE - #16] - find feature subset(s); when there are multiple (e.g., hundreds), how should PP be implemented in subsequent steps?
  2. On each feature subset, perform tuning for a particular regression model (e.g., Lasso, PLSR, etc.); have option for either GridSearchCV or skopt.gp_minimize (#7) - these have PP built in (n_jobs) that should allow us to speed up this process if desirable in the context of all processing steps.

I'm sure there will be challenges with implementation, but my plan with the inheritance architecture among FeatureData, FeatureSelection, Tuning, and Prediction is as follows (for now):

  1. For a single feature set, parallelization begins at the tuning step (when there are multiple feature sets, it makes more sense to parallelize those since they are at a higher level). Although feature selection can take substantial time, it will likely be a fraction of the time required for tuning. When it comes time to evaluate multiple feature selection methods, that is the point at which it may be worthwhile to implement parallel processing for FeatureSelection.
  2. As a result of FeatureSelection.fs_find_params(), which is called during init() of Tuning, we are left with df_fs_params, which contains all the information necessary to get a specific number of features - this dataframe can be hundreds of rows long, so we would like the option to process it in parallel. Fortunately, both GridSearchCV and skopt.gp_minimize (#7) have PP built in (n_jobs) that should allow us to speed up this process if desirable in the context of the higher level processing tree.
  3. If desired, perform for model 2, model 3, ..., model n - this can be designed to run in parallel by copying the Tuning object instance before actually performing the tuning functions (but after the feature selection functions/init()), then performing tuning for each model in model_list (a sketch follows this list). A limitation of this for PP is that some models may take substantially longer than others.
  4. Training and plotting are fast, but consider doing both training and plotting for a row of df_fs_params as soon as optimal tuning results are acquired. This has the advantage of not having to hold tuning results in memory (and wait for all rows in df_fs_params to finish) before plotting, saving to file, and moving on.
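
A minimal sketch of the copy approach in item 3 (the Tuning interface is not settled, so tune() below is a hypothetical method name):

import copy

def tune_all_models(tuning_base, model_list):
    # Re-use the feature-selected state of a Tuning instance for several
    # models by deep-copying it before any tuning functions mutate it.
    results = {}
    for model_name in model_list:
        tuning_copy = copy.deepcopy(tuning_base)  # copy after feature selection, before tuning
        results[model_name] = tuning_copy.tune(model_name)  # tune() is hypothetical
    return results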

I think Tuning and Training should probably be in the same class because we will always have to execute tuning immediately prior to training, and training really doesn't take long at all compared to tuning. Also, I don't think there are really any settings for the training process aside from settings/parameters that must already be set by the tuning step.

The practical use of the Tuning/Training(?) object is to access the needed information when it is necessary to do so using as little memory as possible (imagine 1000 instances of the class needed at a time). This may be used directly or indirectly by customers. For example, we may want to know which features are most beneficial so we can advise agronomists to be collecting specific types of data (indirect use). Once a customer has feature data to predict a response, we really just want to access the parameters and the feature weights/coefficients so we can make a prediction without exposing any of the training data (which may be for business reasons or for customer data/ownership reasons). If the customer has additional data they would like to add in and retrain on, then while in their possession, this Tuning object should be proprietary (and maybe encrypted in a way that it only interacts with our API).
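
To illustrate that last point with a linear model: prediction only needs the fitted coefficients and intercept, so those are the only pieces that would ever have to leave our environment (a sketch with made-up numbers):

import numpy as np
from sklearn.linear_model import Lasso

# Fit on (private) training data... (placeholder values)
X_train = np.array([[49.0, 336.255, 0.496],
                    [63.0, 44.834, 0.466],
                    [63.0, 44.834, 0.199]])
y_train = np.array([4.5, 1.8, 2.4])
estimator = Lasso(alpha=0.01).fit(X_train, y_train)

# ...but only the weights and intercept are needed at prediction time.
coef, intercept = estimator.coef_, estimator.intercept_
X_new = np.array([[55.0, 100.0, 0.35]])
y_pred = X_new @ coef + intercept  # equivalent to estimator.predict(X_new)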

Uses (from comment above)

Find features that are most useful

from research_tools import feature_groups
from research_tools import Training
my_train = Training(param_dict=feature_groups.param_dict_test, print_out=False)
my_train.train()
for _, row in my_train.df_test_filtered.iterrows():
    print('Number of features: {0}\nSelected features: {1}\n'.format(row['feat_n'], row['feats_x_select'])) 

Number of features: 1
Selected features: ('rate_ntd_kgha',)

Number of features: 2
Selected features: ('dae', 'rate_ntd_kgha')

Number of features: 3
Selected features: ('dae', 'rate_ntd_kgha', '740')

Number of features: 4
Selected features: ('dae', 'rate_ntd_kgha', '710', '740')

Number of features: 5
Selected features: ('dae', 'rate_ntd_kgha', '710', '740', '760')

Number of features: 6
Selected features: ('dae', 'rate_ntd_kgha', '710', '740', '760', '870')

Number of features: 7
Selected features: ('dae', 'rate_ntd_kgha', '460', '710', '760', '810', '870')

Number of features: 8
Selected features: ('dae', 'rate_ntd_kgha', '460', '710', '720', '740', '760', '810')

Number of features: 9
Selected features: ('dae', 'rate_ntd_kgha', '460', '660', '710', '720', '740', '760', '810')

Number of features: 10
Selected features: ('dae', 'rate_ntd_kgha', '460', '660', '680', '710', '720', '740', '760', '810')

This tells us that the top four features are 'dae', 'rate_ntd_kgha', '710', '740'

Access feature weights/coefficients to make a prediction without having training data

First, I don't think we have to worry about exposing training data as long as only the sk-learn object is saved (and not our Training object).
After fit(), save the sklearn estimator object to file/DB using something like pickle, joblib, or ZODB - this is probably better than saving the entire Training object. Care must be taken to ensure the internal state of the sklearn object remains the same across versions, etc.
Continuing from example above (training has been done), let's load in the model that uses three features and use joblib to save that model:

import joblib

estimator = my_train.df_test_filtered[my_train.df_test_filtered['feat_n'] == 3]['regressor'].values[0]
joblib.dump(estimator, r'C:\Users\Tyler\Downloads\estimator_3.sav')

In a new terminal (to prove that we're not exposing the same instance of the trained model)

import joblib
import numpy as np

estimator2 = joblib.load(r'C:\Users\Tyler\Downloads\estimator_3.sav')
data = np.array([[ 49.        , 336.255     ,   0.49586667],
                 [ 63.        ,  44.834     ,   0.4658    ],
                 [ 63.        ,  44.834     ,   0.19855   ]])
estimator2.predict(data)

array([4.47474109, 1.75005601, 2.39479327])

We can see from above that the three features in data are: 'dae', 'rate_ntd_kgha', '740'