SubsetSelection is a Julia package that computes sparse L2-regularized estimators. Sparsity is enforced through an explicit cardinality constraint or an L0 penalty. Supported loss functions for regression are least squares, L1-SVR, and L2-SVR; for classification, logistic, L1 Hinge, and L2 Hinge loss. The algorithm formulates the problem as a mixed-integer saddle-point problem and solves its Boolean relaxation using a dual sub-gradient approach.
To install the package:
julia> Pkg.add("SubsetSelection")
or, to get the latest version:
julia> Pkg.clone("git://github.com/jeanpauphilet/SubsetSelection.jl.git")
To fit a basic model:
julia> using SubsetSelection, StatsBase
julia> n = 100; p = 10000; k = 10;
julia> indices = sort(sample(1:p, StatsBase.Weights(ones(p)/p), k, replace=false));
julia> w = sample(-1:2:1, k);
julia> X = randn(n,p); Y = X[:,indices]*w;
julia> Sparse_Regressor = subsetSelection(OLS(), Constraint(k), Y, X)
SubsetSelection.SparseEstimator(SubsetSelection.OLS(),SubsetSelection.Constraint(10),10.0,[362,1548,2361,3263,3369,3598,5221,7314,7748,9267],[5.37997,-5.51019,-5.77256,-7.27197,-6.32432,-4.97585,5.94814,4.75648,5.48098,-5.91967],[-0.224588,-1.1446,2.81566,0.582427,-0.923311,4.1153,-2.43833,0.117831,0.0982258,-1.60631 … 0.783925,-1.1055,0.841752,-1.09645,-0.397962,3.48083,-1.33903,1.44676,4.03583,1.05817],0.0,19)
The algorithm returns a SparseEstimator object with the following fields:
- loss: the loss function used.
- sparsity: the model used to enforce sparsity.
- indices: the features selected.
- w: the values of the estimator on the selected features only.
- α: the values of the associated dual variables.
- b: the bias term.
- iter: the number of iterations required by the algorithm.
For instance, you can access the selected features directly in the indices field:
julia> Sparse_Regressor.indices
10-element Array{Int64,1}:
362
1548
2361
3263
3369
3598
5221
7314
7748
9267
or compute predictions:
julia> Y_pred = X[:,Sparse_Regressor.indices]*Sparse_Regressor.w
100-element Array{Float64,1}:
4.62918
8.59952
-16.2796
-5.611
1.62764
-50.4562
37.407
-12.3341
-4.75339
25.122
⋮
-7.98349
11.0327
-8.58172
16.904
-9.04211
-36.5475
17.2558
-22.3915
-57.9727
-6.06553
For classification, we use +1/-1 labels and the convention P(Y = y | X = x) = 1 / (1 + e^{-y x^T w}).
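As an illustration of the classification setting (the data-generating lines below are assumptions made for the sake of the example, not part of the package), a sparse classifier can be fitted with the Hinge loss (L1SVM(), see the list of losses below) as follows:
julia> using SubsetSelection, StatsBase
julia> n = 100; p = 1000; k = 5;
julia> indices = sort(sample(1:p, k, replace=false));
julia> w = sample(-1:2:1, k);
julia> X = randn(n, p);
julia> Y = sign.(X[:, indices]*w);   # ±1 labels
julia> Sparse_Classifier = subsetSelection(L1SVM(), Constraint(k), Y, X)
The returned SparseEstimator has the same fields as in the regression example above.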
subsetSelection has four required parameters:
- the loss function to be minimized, to be chosen among least squares (OLS()), L1-SVR (L1SVR(ɛ)), L2-SVR (L2SVR(ɛ)), Logistic loss (LogReg()), Hinge loss (L1SVM()), and L2-SVM (L2SVM()). For classification, we recommend using the Hinge loss or L2-SVM functions. Indeed, the Fenchel conjugate of the Logistic loss exhibits unbounded gradients, which largely hinders convergence of the algorithm and might require a smaller stepsize and more iterations (see optional parameters).
- the model used to enforce sparsity: either a hard constraint of the form "||w||_0 ≤ k" (Constraint(k)) or a penalty of the form "+ λ ||w||_0" (Penalty(λ)) added to the objective. For tractability reasons, we highly recommend using an explicit constraint instead of a penalty, for it ensures the size of the support remains bounded throughout the algorithm. Both formulations are illustrated after this list.
- the vector of outputs Y of size n, the sample size. In classification settings, Y should be a vector of ±1s.
- the matrix of covariates X of size n×p, where n and p are the number of samples and features respectively.
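As mentioned above, the two sparsity-inducing formulations only differ in the second argument. A minimal sketch on the regression data from the first example (the value λ = 0.01 is an arbitrary illustration, not a recommended default):
julia> subsetSelection(OLS(), Constraint(k), Y, X)     # hard constraint ||w||_0 ≤ k
julia> subsetSelection(OLS(), Penalty(0.01), Y, X)     # penalty + λ ||w||_0 with λ = 0.01 (illustrative value)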
In addition, subsetSelection accepts the following optional parameters (a sketch of a call using them is shown after this list):
- an initialization for the selected features, indInit.
- an initialization for the dual variables, αInit.
- the value of the ℓ2-regularization parameter γ, set to 1/√n by default.
- intercept, a boolean. If true, an intercept/bias term is computed as well. By default, set to false.
- the maximum number of iterations in the sub-gradient algorithm, maxIter.
- the value of the gradient stepsize δ. By default, the stepsize is set to 1e-3, which demonstrates very good empirical performance. However, smaller stepsizes might be needed when dealing with very large datasets or when the Logistic loss is used.
- the number of gradient updates of the dual variable α performed per update of the primal variable s, gradUp.
- anticycling, a boolean. If true, the algorithm stops as soon as the support is unchanged from one iteration to the next. Empirically, the accuracy of the resulting support is strongly sensitive to noise, so use this option with caution. By default, set to false.
- averaging, a boolean. If true, the dual solution is averaged over past iterates. By default, set to true.
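As a sketch, and assuming these options are passed as keyword arguments (the specific values below are illustrative, not recommended settings), a call could look like:
julia> subsetSelection(OLS(), Constraint(k), Y, X,
                       γ = 1/sqrt(n), intercept = true,
                       δ = 1e-3, maxIter = 200,
                       anticycling = false, averaging = true)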
- Tuning the regularization parameter γ: By default, γ is set to 1/√n, which is an appropriate scaling in most regression instances. For optimal performance, especially in classification or noisy settings, we recommend performing a grid search and using cross-validation to assess out-of-sample performance. The grid search should start with a very low value for γ, such as
γ = 1.*p / k / n / maximum(sum(X[train,:].^2,2))
and iteratively increase it by a factor of 2 (a sketch of such a search is given below). Mean squared error and Area Under the Curve (see ROCAnalysis for an implementation) are commonly used performance metrics for regression and classification tasks respectively.
- Instances where the algorithm fails to converge have been reported. If you encounter such cases, try normalizing the data matrix X and relaunching the algorithm (a sketch is given below). If the algorithm still fails to converge, reduce the stepsize δ by a factor of 10 or 100 and increase the number of iterations maxIter by a factor of at least 2.
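For concreteness, here is a rough sketch of the grid search described above, using a simple hold-out validation set (the 70/30 split, the ten grid points, and passing γ as a keyword argument are assumptions made for illustration):
train = 1:70; valid = 71:100                          # illustrative hold-out split
γ = 1.*p / k / n / maximum(sum(X[train,:].^2,2))      # low starting value suggested above
bestγ, bestMSE = γ, Inf
for step in 1:10                                      # illustrative number of grid points
    estimator = subsetSelection(OLS(), Constraint(k), Y[train], X[train,:], γ = γ)
    Y_val = X[valid, estimator.indices] * estimator.w
    MSE = sum(abs2, Y[valid] - Y_val) / length(valid) # out-of-sample mean squared error
    if MSE < bestMSE
        bestγ, bestMSE = γ, MSE                       # keep the best value of γ so far
    end
    γ *= 2                                            # increase γ by a factor of 2
end
Similarly, a minimal sketch of the normalization fallback mentioned in the last point (the column-wise standardization and the values δ = 1e-4 and maxIter = 500 are illustrative assumptions):
X_norm = (X .- mean(X, 1)) ./ std(X, 1)               # standardize each column of X
subsetSelection(OLS(), Constraint(k), Y, X_norm, δ = 1e-4, maxIter = 500)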