Copyright (c) 2021 Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA.
by Sean M.S. Hayes
This library is made to perform feature selection based on a method originally proposed in by Sun et al. [1]. This library specifically relates to the methodology described in [2], named DEKER for decomposed kernel regression, which includes methods for identifying optimal hyperparameter values. This library was also designed for use in the context of network inference, also described in [2], by iteratively reapplying the DEKER method for feature selection accross all features of a dataset.
The library may be used either as a header-only C++ library using the files in (include/deker) or compiled into a simple executable providing a command-line interface to the library with some simple I/O operations for reading in data and extracting output.
The executable is built with the expectation that many iterations of feature selection will be done on a single data set, as in the context of network inference, most likely in a high performance computing (HPC) environment. The expected workflow is:
- Convert data to a binary format that can be quickly read accross all iterations of feature selection.
- Check the number of responses to perform feature selection on to determine how many batch jobs to provision.
- Execute each iteration of feature selection on a given response as a separate batch job.
- Write output (contained in binary files) to a single 'edge-list'.
deker -b [input file location] [location to write output file]
deker -s [control file location]
- returns the number of responses specified by include_response/exclude_response to console
deker -w [location to write output file] -f [index of feature to use as response] [control file location]
- Program writes results to a binary file, use edgelist function to write binary file results to .csv
- Program expects control file to be plain text with each line in the format of [parameter] [value]
- Control file must contain a line "data_file [data file location]". The data file should be a binary file produced by converting a .csv using the -b option, above.
- Optional parameters to include in control file:
- include_response: which features in the dataset allowed to be responses; expected value integer or integers seperated by comma (no spaces), defaults to all features
- exclude_response: which features in the dataset not allowed to be responses; expected value integer or integers seperated by comma (no spaces), defaults to no features
- include_predictor: which features in the dataset allowed to be predictors; expected value integer or integers seperated by comma (no spaces), defaults to all features
- exclude_predictor: which features in the dataset not allowed to be predictors; expected value integer or integers seperated by comma (no spaces), defaults to no features
- h_huber_loss: size of the quadratically smoothed elbow in the loss function; expected value double, defaults to .001
- w_convergence_threshold: threshold of difference in feature weights (sqrt(w)) determining convergence; expected value double, defaults to .0000001
- w_maxiter: maximum number of fixed-point iterations in solving for feature weights; expected value integer, defaults to 100
- feature_drop_threshold: threshold of feature weight (sqrt(w)) to drop feature from calculation; expected value double, defaults to .001
- lambda_init_max: maximum allowed lambda value; expected value double, defaults to .25
- lambda_init_min: minimum allowed lambda value; expected value double, defaults to .005
- lambda_init_stepsize: stepsize (fraction of maximum to decrease) when searching for 'good' lambda range (highest lambda without eliminating all features in [lambda_init_min,lambda_init_max]); expected value double, defaults to .5
- lambda_init_maxiter: maximum iterations when searching for 'good' lambda range; expected value integer, defaults to 100
- lambda_bo_maxiter: maximum iterations of Bayesian optimization for lambda within 'good' lambda range; expected value integer, defaults to 20
deker -e [deker output file location] [location to write/append to edge list file]
The library is primarily built around the 'deker::fit::Solve_deker' class defined in include/deker/fit.h, with additional functions providing support or I/O functions related to this class. Each instance of the class is constructed using the source data on which the feature selection model will be fit.
Solve_deker(const Eigen::Ref<const Eigen::MatrixXd> data_matrix,
const std::vector<unsigned>& predictors_use_index,
const unsigned& which_response,
const Opt_Param& control)
For the inputs, 'data_matrix' is expected to have samples in rows and features in columns, 'predictors_use_index' should indicate which columns are to be used as predictors, 'which_response' should indicate which column is to be used as response, and 'control' is a struct 'Opt_Param' defined in 'deker::fit::opt_param.h'. The class contains several methods related to solving to the heavily nested feature selection & hyperparameter problems.
const Eigen::Ref<const Eigen::VectorXd> v
- An Eigen vector of double values, indicating weights on feature columns. More precisely, v are the square roots of the weights w, such that v^2 = w. When specified as a 'const' value in a given optimization step, weights are fixed and will not be updated.
const std::vector<unsigned>& feature_index
- A vector of unsigned integers, which is of the same length as 'v' with each element indicating which column in Solve_deker's internal store of predictor data ('input_data.predictors') each element of 'v' corresponds to. When specified as a 'const' value in a given optimization step, weights are fixed and will not be updated in the optimization step.
const double& lambda_regularization
const double& sigma_kernel_width
- Hyperparameter values; when specified as a 'const' value in a given optimization step, hyperparameter is fixed and will not be updated.
Output_deker& sol
- Struct containing fields for output of Solve_deker. Values provided in the struct may be used and will be overwritten and updated with the results of the optimization step. Full definition of the struct is in include/deker/fit/output.h.
std::tuple<double,double,double,double> calc_model_fit(const double& sigma_kernel_width,
const Eigen::Ref<const Eigen::VectorXd> v,
const std::vector<unsigned>& feature_index)
- 'calc_model_fit' determines the fit of the model given a value of 'sigma_kernel_width' (characteristic lengthscale of the kernel function), and feature weights ('v' and 'feature_index'). 'calc_model_fit' returns a tuple of four doubles with different quantities regarding model fit: (in order) estimated degrees of freedom, log-likelihood of fit, parameters A and B of the Platt scaling fit to condition classifier output to probabilities.
double opt_sigma(const Eigen::Ref<const Eigen::VectorXd> initial_v,
const std::vector<unsigned>& feature_index)
- 'opt_sigma' determines the optimal 'sigma_kernel_width' (characteristic lengthscale of the kernel function) value for a given set of feature weights ('v' and 'feature_index'). 'opt_sigma' returns a single double value, the optimal 'sigma_kernel_width' value.
Eigen::VectorXd opt_v(const double& lambda_regularization,
const double& sigma_kernel_width,
const Eigen::Ref<const Eigen::VectorXd> v,
const std::vector<unsigned>& feature_index)
- 'opt_v' determines the optimal 'v' (feature weight) values for given hyperparameter values ('lambda_regularization' and 'sigma_kernel_width'). 'opt_v' returns an Eigen vector, the new 'v' weights.
void steff_iter_v(const double& lambda_regularization,
Output_deker& sol)
- 'steff_iter_v' performs one iteration of Steffensen's method for fixed point iteration [3] of the combined hyperparameter ('sigma_kernel_width') and feature weight ('v' & 'feature_index') optimization problem using 'opt_sigma' and 'opt_v' for a given value of the 'lambda_regularization' hyperparameter. 'steff_iter_v' returns values by updating the Output_deker struct 'sol.'
void inner_full_opt(const double& lambda_regularization,
Output_deker& sol)
- 'inner_full_opt' performs repeated fixed-point iteration until convergence of the combined hyperparameter ('sigma_kernel_width') and feature weight ('v' & 'feature_index') optimization problem using 'steff_iter_v' for a given value of the 'lambda_regularization' hyperparameter. 'inner_full_opt' returns values by updating the Output_deker struct 'sol.'
void opt_lambda(Output_deker& sol)
- 'opt_lambda' performs a complete optimization of both hyperparameters ('lambda_regularization' and 'sigma_kernel_width') and feature weights ('v' & 'feature_index') using Bayesian optimization [4] of the optimized model fit by 'inner_full_opt' for given values of 'lambda_regularization'. 'opt_lambda' returns values by updating the Output_deker struct 'sol.'
- Sun, Y.; Yao, J.; Goodison, S.: Feature selection for nonlinear regression and its application to cancer research. In: Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 73-81. SIAM (2015)
- Hayes, S.M.; Sachs, J.R.; Cho, C.R.: From complex data to biological insight: 'DEKER' feature selection and network inference. In Review.
- Johnson, L.W.; Scholz, D.R.: On Steffensen's method. SIAM Journal on Numerical Analysis. 5 (2): 296–302. (1968) doi:10.1137/0705026
- Cully, A., Chatzilygeroudis, K., Allocati, F., Mouret, J.B.: Limbo: A flexible high-performance library for gaussian processes modeling and data-e�cient optimization. Journal of Open Source Software 3(26) (2018)