Design an inheritance architecture for all data, and tuning and training functions
Issue #11 provided a solution for loading in all research data, splitting it into a train and test set, and creating the X matrix and vectors. This is all held in a class (currently named `rtio`; it should be renamed to something more relevant), so the instance of `rtio` can be passed to another class that undertakes hyperparameter tuning, and yet another that takes care of the training, testing, and plotting (think more about this).
Instead of creating files to hold all the tuning and training functions, assign them to one of three classes (again, think more about this):
- `feature_data` - class that holds the data: `feature_data.X_train`, `feature_data.X_test`, etc.
- `tuning` - class specifically designed to perform tuning (this should use `skopt.gp_minimize` as in issue #7); I'm thinking we should make `tuning` inherit from the `feature_data` class, so that all functions and data are accessible to `tuning`.
- `training` - this object must take in the appropriate hyperparameter settings for a model, take the `feature_data` train/test data, conduct cross validation, and export the trained model in an encrypted format that can't be "reverse engineered" to get at training data. More on this topic in this video by Alex Gaynor and this sklearn page. Also make `training` inherit from `feature_data`.
To do
Create a list with all sip_functions, to be ported over to one of these three classes. Any extra functions that will not be ported should be denoted in the list. Upon list creation, we will create a new issue for each of the functions to be included in the class objects.
Original issue text
Create several files to hold all of the functions for tuning and training so they can be simply loaded into a script to be accessed
Background
The OOP way is to create a class, but classes should only be used (mostly true) if there are particular data to be held and manipulated that should be stored within the instance of the class. Further, classes should be used when there are multiple instances of that class that should exist at the same time (and they may interact in some way).
See more on classes vs functions here.
For tuning and cross-validation, I don't think it is necessary to have classes because all we're looking to do is get the final results, then maybe save to file/database or do some plotting. I think it makes sense to perform tuning and cross-validation as simple functions.
After calculation of results, then it may make sense for those results to be wrapped inside a class that manipulates the results in different ways for plotting, saving, perhaps even saving/encrypting the whole class to be loaded in a different environment for model training?
Functions can always be cut and pasted to be made as part of a class, so don't add the class complexity unless we know it is necessary.
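To make the "functions first, wrap the results later" idea concrete, here is a minimal sketch. The names `tune_grid`, `TuningResults`, and `toy_error` are hypothetical and only for illustration; the toy objective stands in for a real cross-validated error.

```python
from dataclasses import dataclass, field
from itertools import product


def tune_grid(score_fn, param_grid):
    """Score every parameter combination and return all (params, score) pairs."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, score_fn(**params)))
    return results


@dataclass
class TuningResults:
    """Hypothetical container: holds finished results and offers views of them."""
    results: list = field(default_factory=list)

    def best(self):
        # Lower score is better in this sketch.
        return min(self.results, key=lambda item: item[1])


def toy_error(alpha, n_feats):
    # Toy objective standing in for a cross-validated error.
    return (alpha - 0.1) ** 2 + abs(n_feats - 3)


results = TuningResults(
    tune_grid(toy_error, {"alpha": [0.01, 0.1, 1.0], "n_feats": [2, 3, 4]})
)
best_params, best_score = results.best()
```

The tuning itself stays a plain function; the class only appears once there are results worth holding on to for plotting or saving.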
Here is the working list of functions
This will be supplementary to many of the new features to work on.
Some info on inheritance with examples
A key thing to consider when deciding between inheritance and composition is knowing for sure that we can pass a tuned, trained model to predict a response on new data.
Thus, I propose the following architecture:
class FeatureData(object):
    def __init__(self, feature_data=None, **kwargs):
        super(FeatureData, self).__init__(**kwargs)
        # Default to None so instances do not share one mutable list
        self.feature_data = [1, 2, 3] if feature_data is None else feature_data
        print("FeatureData")

class FeatureSelection(FeatureData):
    def __init__(self, feature_selection='Lasso', **kwargs):
        super(FeatureSelection, self).__init__(**kwargs)
        self.feature_selection = feature_selection
        print("FeatureSelection")

class Tuning(FeatureSelection):
    def __init__(self, tuning_params=None, **kwargs):
        super(Tuning, self).__init__(**kwargs)
        # A dict mapping parameter name to value (the original was a set holding one string)
        self.tuning_params = {'alpha': 0.01} if tuning_params is None else tuning_params
        print("Tuning")

class Training(Tuning):
    def __init__(self, training_obj_f='mae', **kwargs):
        super(Training, self).__init__(**kwargs)
        self.training_obj_f = training_obj_f
        print("Training")

class Prediction(object):
    # The `Training` argument is an instance of the Training class (composition)
    def __init__(self, Training, predict_X=None, predict_y=None, **kwargs):
        self.Training = Training
        self.predict_X = [[12, 14], [13, 16]] if predict_X is None else predict_X
        self.predict_y = [2.3, 2.6] if predict_y is None else predict_y
        print("Prediction")
Notice that `Prediction` uses composition instead of inheritance because we MUST have a trained model before anything can be predicted! Thus, the `Prediction` object can only carry out its functions if it has access to a trained model (and of course has something to predict). If we were to use inheritance with `Prediction`, then it seems as if we would have to re-train, re-tune, re-do feature selection EVERY TIME we want to make a prediction.
I'm sure there are ways around this (maybe by loading in the objects from a pickled file and overwriting any of the class attributes), but this seems counter-intuitive when we could just pass the training information in the first place.
This architecture would allow us to create as many training instances as we would like (ahead of time or in real time), then go ahead and make predictions using the training scenario of our choosing.
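Use of the above architecture:
train1 = Training(feature_data=[1,2,3])
train2 = Training(feature_data=[4,5,6])
predict1 = Prediction(train1)
predict2 = Prediction(train2)
print("Predict 1 feature data: {0}".format(predict1.Training.feature_data))
print("Predict 2 feature data: {0}".format(predict2.Training.feature_data))
Returns (omitting the class names printed by each constructor):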
Predict 1 feature data: [1, 2, 3]
Predict 2 feature data: [4, 5, 6]
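As a rough illustration of why composition avoids repeated work, the sketch below uses simplified stand-ins for `Training` and `Prediction` (not the classes above). The `fit()` method, the `n_fits` counter, and the "predict the mean" model are assumptions made up for this example.

```python
class Training:
    """Stand-in for the Training class; counts how often fit() runs."""
    n_fits = 0  # class-level counter, for illustration only

    def __init__(self, feature_data):
        self.feature_data = feature_data
        self.mean_ = None

    def fit(self):
        Training.n_fits += 1
        self.mean_ = sum(self.feature_data) / len(self.feature_data)
        return self


class Prediction:
    """Holds a reference to an already-fitted Training instance (composition)."""

    def __init__(self, training):
        if training.mean_ is None:
            raise ValueError("Prediction requires a fitted Training instance")
        self.training = training

    def predict(self, n_rows):
        # Toy model: predict the training mean for every row.
        return [self.training.mean_] * n_rows


train1 = Training(feature_data=[1, 2, 3]).fit()
predict_a = Prediction(train1)
predict_b = Prediction(train1)  # reuses train1; nothing is re-fitted
```

Both `Prediction` objects share the same fitted instance, so the fit runs once no matter how many predictions are made.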
See issue #17
For each class that is inherited, we ideally want to be able to gather all information necessary to fully execute the functions of said class, just by creating a new instance of the child class. This may get difficult to manage as the number of parameters increases, maybe in excess of 10-20.
Consider having a dictionary (or a separate "composition" class) that can be passed to the parent class or any child class and contains all the parameters necessary to fully execute data retrieval, feature selection, tuning, and training. This would make the call to the parent/child class(es) much cleaner, e.g.,
param_dict = {
    'FeatureData': {
        'base_dir_data': r'C:\data_dir',
        'random_seed': 999},
    'FeatureSelection': {
        'algorithm': 'Lasso',
        'n_feats': 10}
    }
my_params = TrainParameters(param_dict)
train1 = Training(my_params)  # this will execute data retrieval, selection, tuning, etc., all according to rules in `param_dict`
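One possible shape for such a parameter container is sketched below. `TrainParameters` does not exist yet, so everything here is an assumption: the `for_class()` method, the merge order (parent settings first, child settings override), and the simplified `FeatureData` and `FeatureSelection` classes.

```python
class TrainParameters:
    """Hypothetical parameter container keyed by class name."""

    def __init__(self, param_dict):
        # Copy the nested dicts so later edits to param_dict do not leak in.
        self._params = {name: dict(settings) for name, settings in param_dict.items()}

    def for_class(self, cls):
        """Merge settings for every class in cls's inheritance chain, parent first."""
        merged = {}
        for klass in reversed(cls.__mro__):
            merged.update(self._params.get(klass.__name__, {}))
        return merged


class FeatureData:
    def __init__(self, params):
        settings = params.for_class(type(self))
        self.base_dir_data = settings.get("base_dir_data")
        self.random_seed = settings.get("random_seed")


class FeatureSelection(FeatureData):
    def __init__(self, params):
        super().__init__(params)
        settings = params.for_class(type(self))
        self.algorithm = settings.get("algorithm", "Lasso")
        self.n_feats = settings.get("n_feats", 5)


my_params = TrainParameters({
    "FeatureData": {"base_dir_data": r"C:\data_dir", "random_seed": 999},
    "FeatureSelection": {"algorithm": "Lasso", "n_feats": 10},
})
fs = FeatureSelection(my_params)
```

Each class reads only the keys it needs, so the constructor signature stays at one argument even as the number of settings grows.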
Parallel processing
Rule of thumb: divvy up the number of parallel processing (PP) jobs at the highest level, and only use some cores for the lower-level processing. With that said, we still want the option to process in parallel at any level in this process.
Major steps to accomplish:
- [DONE - #16] - find feature subset(s); when there are multiple (e.g., hundreds), how should PP be implemented in subsequent steps?
- On each feature subset, perform tuning for a particular regression model (e.g., Lasso, PLSR, etc.); have option for either `GridSearchCV` or `skopt.gp_minimize` (#7) - these have PP built in (`n_jobs`) that should allow us to speed up this process if desirable in the context of all processing steps.
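The rule of thumb above could be expressed as a small helper that decides how many workers to use at the top level and how many cores to leave for each lower-level `n_jobs`. The function name `split_cores` and the even-split policy are assumptions for illustration, not an agreed design.

```python
import os


def split_cores(n_outer_tasks, total_cores=None):
    """Return (outer_workers, inner_n_jobs) for a two-level parallel layout."""
    total = total_cores or os.cpu_count() or 1
    # Fill the highest level first, up to the number of available cores.
    outer = max(1, min(n_outer_tasks, total))
    # Whatever is left per outer worker goes to the lower-level n_jobs.
    inner = max(1, total // outer)
    return outer, inner


many_subsets = split_cores(n_outer_tasks=100, total_cores=8)
few_subsets = split_cores(n_outer_tasks=3, total_cores=8)
single_subset = split_cores(n_outer_tasks=1, total_cores=8)
```

With many feature subsets all cores go to the top level; with a single subset all cores are handed down to the tuning step.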
I'm sure there will be challenges with implementation, but my plan with the inheritance architecture among `FeatureData`, `FeatureSelection`, `Tuning`, and `Prediction` is as follows (for now):
- For a single feature set, parallelization begins at the tuning step (when there are multiple feature sets, it makes more sense to parallelize those since they are a higher level). Although feature selection can take substantial time, it will likely be a fraction of the time required for tuning. When it comes time to evaluate multiple feature selection methods, that is the point at which it may be worthwhile to implement parallel processing for `FeatureSelection`.
- As a result of `FeatureSelection.fs_find_params()`, which is called during `__init__()` of `Tuning`, we are left with `df_fs_params`, which contains all the information necessary to get a specific number of features - this dataframe can be hundreds of rows long, so we would like the option to process in parallel. Fortunately, both `GridSearchCV` and `skopt.gp_minimize` (#7) have PP built in (`n_jobs`) that should allow us to speed up this process if desirable in the context of the higher level processing tree.
- If desired, perform for model 2, model 3, ... model n - this can be designed to run in parallel by copying the `Tuning` object instance before actually performing the tuning functions (but after the feature selection functions/`__init__()`). Then for each model in `model_list`, perform tuning - a limitation of this for PP is that some models may take substantially longer than others.
- Training and plotting are fast, but consider doing both training and plotting for a row of `df_fs_params` as soon as optimal tuning results are acquired. This has the advantage of not having to save tuning results in memory (and wait for all rows in `df_fs_params` to finish) before plotting, saving to file, and moving on.
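The last two points (copy the `Tuning` instance per model, then handle each result as soon as it is ready) might look roughly like the sketch below. The `Tuning` class here is a stand-in with a toy score, `tune_models` is a hypothetical helper, and threads are used only to keep the example short; CPU-bound tuning would more likely rely on processes or the built-in `n_jobs`.

```python
import copy
from concurrent.futures import ThreadPoolExecutor, as_completed


class Tuning:
    """Stand-in for the Tuning class; feature selection happens in __init__."""

    def __init__(self, features):
        self.features = list(features)  # pretend this was expensive to compute
        self.model_name = None
        self.best_score = None

    def tune(self, model_name):
        # Placeholder for GridSearchCV / skopt.gp_minimize; returns a toy score.
        self.model_name = model_name
        self.best_score = len(model_name) / (1 + len(self.features))
        return self


def tune_models(base, model_list, max_workers=2):
    """Tune one deep copy of `base` per model and handle results as they finish."""
    finished = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy.deepcopy(base).tune, name) for name in model_list]
        for future in as_completed(futures):
            tuned = future.result()
            # Training, plotting, and saving for this model could happen here,
            # without waiting for slower models to finish.
            finished[tuned.model_name] = tuned.best_score
    return finished


base = Tuning(features=["dae", "rate_ntd_kgha", "740"])
scores = tune_models(base, ["Lasso", "PLSR"])
```

Because each model works on its own copy, the feature selection done in `__init__()` is not repeated and the original instance is left untouched.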
I think `Tuning` and `Training` should probably be in the same class because we will always have to execute tuning immediately prior to training, and training really doesn't take long at all compared to tuning. Also, I don't think there are really any settings for the training process aside from settings/parameters that must already be set by the tuning step.
The practical use of the `Tuning`/`Training`(?) object is to access the needed information when it is necessary to do so using as little memory as possible (imagine 1000 instances of the class needed at a time). This may be used directly or indirectly by customers. For example, we may want to know which features are most beneficial so we can advise agronomists to be collecting specific types of data (indirect use). Once a customer has feature data to predict a response, we really just want to access the parameters and the feature weights/coefficients so we can make a prediction without exposing any of the training data (which may be for business reasons or for customer data/ownership reasons). If the customer has additional data they would like to add in and retrain on, then while in their possession, this `Tuning` object should be proprietary (and maybe encrypted in a way that it only interacts with our API).
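For a linear model (e.g., Lasso), "just the parameters and coefficients" can be a very small payload. The sketch below assumes a linear model and uses made-up coefficients; `export_linear_model` and `predict_from_export` are hypothetical helpers, not part of the project.

```python
import json


def export_linear_model(feature_names, coefficients, intercept):
    """Serialize only what is needed for prediction; no training rows are included."""
    return json.dumps({
        "features": list(feature_names),
        "coef": [float(c) for c in coefficients],
        "intercept": float(intercept),
    })


def predict_from_export(payload, rows):
    """Apply y = intercept + sum(coef_i * x_i) to each row of feature values."""
    model = json.loads(payload)
    return [
        model["intercept"] + sum(c * x for c, x in zip(model["coef"], row))
        for row in rows
    ]


# Made-up coefficients for illustration; not results from the project.
payload = export_linear_model(["dae", "rate_ntd_kgha"], [0.5, 0.25], 1.0)
predictions = predict_from_export(payload, [[2.0, 4.0], [0.0, 0.0]])
```

A payload like this is a few hundred bytes per model, which matters if something like 1000 instances need to be held at once. It does not cover non-linear models or any preprocessing applied before fitting.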
Uses (from comment above)
Find features that are most useful
from research_tools import feature_groups
from research_tools import Training

my_train = Training(param_dict=feature_groups.param_dict_test, print_out=False)
my_train.train()
for _, row in my_train.df_test_filtered.iterrows():
    print('Number of features: {0}\nSelected features: {1}\n'.format(row['feat_n'], row['feats_x_select']))

Number of features: 1
Selected features: ('rate_ntd_kgha',)

Number of features: 2
Selected features: ('dae', 'rate_ntd_kgha')

Number of features: 3
Selected features: ('dae', 'rate_ntd_kgha', '740')

Number of features: 4
Selected features: ('dae', 'rate_ntd_kgha', '710', '740')

Number of features: 5
Selected features: ('dae', 'rate_ntd_kgha', '710', '740', '760')

Number of features: 6
Selected features: ('dae', 'rate_ntd_kgha', '710', '740', '760', '870')

Number of features: 7
Selected features: ('dae', 'rate_ntd_kgha', '460', '710', '760', '810', '870')

Number of features: 8
Selected features: ('dae', 'rate_ntd_kgha', '460', '710', '720', '740', '760', '810')

Number of features: 9
Selected features: ('dae', 'rate_ntd_kgha', '460', '660', '710', '720', '740', '760', '810')

Number of features: 10
Selected features: ('dae', 'rate_ntd_kgha', '460', '660', '680', '710', '720', '740', '760', '810')
This tells us that the top four features are 'dae', 'rate_ntd_kgha', '710', '740'
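As a cross-check, one could also count how often each feature appears across all ten subsets. This is a different ranking criterion from reading off the four-feature row, shown only as a sketch; the subsets are copied from the output above.

```python
from collections import Counter

# Feature subsets copied from the output above (one tuple per value of feat_n).
subsets = [
    ("rate_ntd_kgha",),
    ("dae", "rate_ntd_kgha"),
    ("dae", "rate_ntd_kgha", "740"),
    ("dae", "rate_ntd_kgha", "710", "740"),
    ("dae", "rate_ntd_kgha", "710", "740", "760"),
    ("dae", "rate_ntd_kgha", "710", "740", "760", "870"),
    ("dae", "rate_ntd_kgha", "460", "710", "760", "810", "870"),
    ("dae", "rate_ntd_kgha", "460", "710", "720", "740", "760", "810"),
    ("dae", "rate_ntd_kgha", "460", "660", "710", "720", "740", "760", "810"),
    ("dae", "rate_ntd_kgha", "460", "660", "680", "710", "720", "740", "760", "810"),
]

# Count in how many subsets each feature was selected.
counts = Counter(feature for subset in subsets for feature in subset)
top_four = {name for name, _ in counts.most_common(4)}
```

For these ten subsets the four most frequently selected features are the same four listed above.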
Access feature weights/coefficients to make a prediction without having training data
First, I don't think we have to worry about exposing training data as long as only the `sklearn` object is saved (and not our `Training` object).
After `fit()`, save the `sklearn` `estimator` object to file/DB using something like `pickle`, `joblib`, or `ZODB`; this is probably better than saving the entire `Training` object. Care must be taken to ensure the internal state of the `sklearn` object remains the same across versions, etc.
Continuing from the example above (training has been done), let's load in the model that uses three features and use `joblib` to save that model:
import joblib

# Select the fitted regressor from the row with three features
estimator = my_train.df_test_filtered[my_train.df_test_filtered['feat_n'] == 3]['regressor'].values[0]
joblib.dump(estimator, r'C:\Users\Tyler\Downloads\estimator_3.sav')
In a new terminal (to prove that we're not exposing the same instance of the trained model):
import joblib
import numpy as np

# Load the estimator that was saved above
estimator2 = joblib.load(r'C:\Users\Tyler\Downloads\estimator_3.sav')
# Columns: 'dae', 'rate_ntd_kgha', '740'
data = np.array([[49., 336.255, 0.49586667],
                 [63., 44.834, 0.4658],
                 [63., 44.834, 0.19855]])
estimator2.predict(data)

array([4.47474109, 1.75005601, 2.39479327])
We can see from above that the three features in `data` are: 'dae', 'rate_ntd_kgha', '740'
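Regarding the earlier caution about keeping the saved object consistent across versions, one option is to store the library version next to the estimator and check it on load. The sketch below uses `pickle` and a plain dict as a stand-in for a fitted estimator; `dump_with_version` and `load_checked` are hypothetical helpers, and the version strings are placeholders.

```python
import pickle


def dump_with_version(estimator, library_version):
    """Pickle the estimator together with the library version used to fit it."""
    return pickle.dumps({"library_version": library_version, "estimator": estimator})


def load_checked(blob, current_version):
    """Unpickle and refuse to return the estimator if the versions differ."""
    bundle = pickle.loads(blob)
    if bundle["library_version"] != current_version:
        raise RuntimeError(
            "Estimator was saved with version {0}, but {1} is installed".format(
                bundle["library_version"], current_version))
    return bundle["estimator"]


# A plain dict stands in for a fitted estimator; with scikit-learn the version
# string would come from sklearn.__version__.
blob = dump_with_version({"coef": [0.5, 0.25]}, library_version="1.0")
restored = load_checked(blob, current_version="1.0")
```

Failing loudly on a version mismatch is a conservative choice; a looser policy (for example, warning only) may be more practical. Pickled files should only be loaded from trusted sources.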