SubsetSelectionCIO

Cutting-plane algorithm for exact cardinality-constrained regression

SubsetSelectionCIO is a Julia package that computes sparse L2-regularized estimators. Sparsity is enforced through an explicit cardinality constraint. The supported loss function for regression is least squares; for classification, logistic loss and L1 hinge loss are supported. The algorithm formulates the problem as a pure-integer convex optimization problem and solves it with a cutting-plane algorithm. The package is built on top of the SubsetSelection package.
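In the notation of the reference below, the estimator computed is, schematically, the solution of

    min_w  Σ_{i=1}^n ℓ(y_i, x_i^T w) + 1/(2γ) ||w||_2^2   subject to   ||w||_0 ≤ k,

where ℓ is the chosen loss function, γ > 0 is the regularization parameter and k is the target sparsity.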

Quick start

To install the package:

julia> Pkg.clone("git://github.com/jeanpauphilet/SubsetSelectionCIO.jl.git")

To fit a basic model:

julia> using SubsetSelectionCIO, StatsBase

julia> n = 100; p = 10000; k = 10;
julia> indices = sort(sample(1:p, StatsBase.Weights(ones(p)/p), k, replace=false));
julia> w = sample(-1:2:1, k);
julia> X = randn(n,p); Y = X[:,indices]*w;
julia> γ = 1/sqrt(size(X,1));
julia> indices0, w0, Δt, status, Gap, cutCount = oa_formulation(SubsetSelection.OLS(), Y, X, k, γ)
([36, 184, 222, 240, 325, 347, 361, 605, 957, 973], [-0.950513, -0.94923, -0.950688, -0.956536, 0.951954, -0.953707, -0.954927, -0.9571, -0.959357, -0.95312], 0.26711583137512207, :Optimal, 0.0, 17)

The algorithm returns the set of selected features indices0, the value of the estimator on the selected features only w0, the time needed to compute the model Δt, the status of the MIP solver status, the sub-optimality gap Gap, and the number of cuts required by the cutting-plane algorithm cutCount.
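For example, these outputs can be assembled into a dense coefficient vector and used for in-sample predictions (a minimal sketch; w_full and Y_pred are our names, not package objects):

julia> w_full = zeros(p); w_full[indices0] = w0;
julia> Y_pred = X*w_full;
julia> sum(abs2, Y - Y_pred)/n   # in-sample mean squared error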

For classification, we use +1/-1 labels and the convention P(Y = y | X = x) = 1 / (1 + e^{-y x^T w}).
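As an illustration, a classification fit mirrors the regression call of the quick start; the sketch below reuses X, indices, w, k and γ from above (Y_class and prob are our names):

julia> Y_class = sign.(X[:,indices]*w);   # ±1 labels
julia> indices0, w0, Δt, status, Gap, cutCount = oa_formulation(SubsetSelection.LogReg(), Y_class, X, k, γ);
julia> prob = 1 ./ (1 .+ exp.(-(X[:,indices0]*w0)));   # P(Y = +1 | X) under the convention above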

Required and optional parameters

oa_formulation has five required parameters:

  • the loss function to be minimized, chosen among least squares (SubsetSelection.OLS()), logistic loss (SubsetSelection.LogReg()) and L1 hinge loss (SubsetSelection.L1SVM()).
  • the vector of outputs Y of size n, the sample size. In classification settings, Y should be a vector of ±1s.
  • the matrix of covariates X of size n×p, where n and p are the number of samples and features respectively.
  • the sparsity level k; the algorithm will enforce the hard constraint "||w||_0 ≤ k".
  • the value of the ℓ2-regularization parameter γ.

In addition, oa_formulation accepts the following optional parameters (an example call using them is sketched after the list):

  • an initialization for the selected features, indices0.
  • a time limit ΔT_max, in seconds, set to 60 by default.
  • verbose, a boolean. If true, the MIP solver information is displayed. By default, set to false.
  • Gap, the target sub-optimality gap at which the algorithm stops. By default, set to 0.
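For example, assuming these options are passed as keyword arguments (the values below are illustrative, and init is our name for a warm start):

julia> init = sort(sample(1:p, k, replace=false));   # any k candidate features
julia> indices0, w0, Δt, status, Gap, cutCount = oa_formulation(SubsetSelection.OLS(), Y, X, k, γ, indices0=init, ΔT_max=120, verbose=true, Gap=1e-3);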

Best practices

  • Tuning the regularization parameter γ: Setting γ to 1/√n seems an appropriate scaling in most regression instances. For optimal performance, and especially in classification or noisy settings, we recommend performing a grid search and using cross-validation to assess out-of-sample performance. The grid search should start with a very low value of γ, such as
    γ = 1.0 * p / k / n / maximum(sum(X[train,:].^2, 2))

and iteratively increase it by a factor of 2 (a sketch of such a search is given below). Mean squared error and Area Under the Curve (see ROCAnalysis for an implementation) are commonly used performance metrics for regression and classification tasks respectively.
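A sketch of such a grid search, assuming a held-out validation set (tune_gamma, train and val are our names; sum(..., dims=2) is the Julia ≥ 0.7 spelling of the formula above):

    function tune_gamma(Y, X, k, train, val; n_steps=10)
        n_train, p = size(X[train,:])
        γ = 1.0 * p / k / n_train / maximum(sum(X[train,:].^2, dims=2))   # starting value from above
        best_γ, best_mse = γ, Inf
        for _ in 1:n_steps
            indices0, w0 = oa_formulation(SubsetSelection.OLS(), Y[train], X[train,:], k, γ)
            mse = sum(abs2, Y[val] - X[val,indices0]*w0) / length(val)    # validation MSE
            if mse < best_mse
                best_γ, best_mse = γ, mse
            end
            γ *= 2                                                        # increase γ by a factor of 2
        end
        return best_γ, best_mse
    end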

  • The mixed-integer solver greatly benefits from a good warm start, even if the warm start is not feasible, i.e., not k-sparse. Among other methods, one can use the output of SubsetSelection or a Lasso estimator (see GLMNet for an implementation), as sketched below.
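For instance, a warm start from the SubsetSelection heuristic could look as follows (a sketch; the subsetSelection call, the Constraint type and the indices field follow SubsetSelection's own documentation and should be checked against the installed version):

    using SubsetSelection
    heuristic = subsetSelection(SubsetSelection.OLS(), SubsetSelection.Constraint(k), Y, X)   # fast heuristic fit
    indices0, w0, Δt, status, Gap, cutCount = oa_formulation(SubsetSelection.OLS(), Y, X, k, γ, indices0=heuristic.indices)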

Reference

Dimitris Bertsimas, Jean Pauphilet, and Bart Van Parys, "Sparse Classification: a scalable discrete optimization perspective", available on arXiv.