SubsetSelection is a Julia package that computes sparse L2-regularized estimators. Sparsity is enforced through an explicit cardinality constraint or an L0 penalty. Supported loss functions for regression are least squares, L1-SVR, and L2-SVR; for classification, logistic, L1 Hinge, and L2 Hinge loss. The algorithm formulates the problem as a mixed-integer saddle-point problem and solves its Boolean relaxation using a dual sub-gradient approach.
To install the package:
julia> Pkg.add("SubsetSelection")
or, to get the latest version:
julia> Pkg.clone("git://github.com/jeanpauphilet/SubsetSelection.jl.git")
To fit a basic model:
julia> using SubsetSelection, StatsBase
julia> n = 100; p = 10000; k = 10;
julia> indices = sort(sample(1:p, StatsBase.Weights(ones(p)/p), k, replace=false));
julia> w = sample(-1:2:1, k);
julia> X = randn(n,p); Y = X[:,indices]*w;
julia> Sparse_Regressor = subsetSelection(OLS(), Constraint(k), Y, X)
SubsetSelection.SparseEstimator(SubsetSelection.OLS(),SubsetSelection.Constraint(10),10.0,[362,1548,2361,3263,3369,3598,5221,7314,7748,9267],[5.37997,-5.51019,-5.77256,-7.27197,-6.32432,-4.97585,5.94814,4.75648,5.48098,-5.91967],[-0.224588,-1.1446,2.81566,0.582427,-0.923311,4.1153,-2.43833,0.117831,0.0982258,-1.60631 … 0.783925,-1.1055,0.841752,-1.09645,-0.397962,3.48083,-1.33903,1.44676,4.03583,1.05817],0.0,19)
The algorithm returns a SparseEstimator object with the following fields:
- loss: the loss function used.
- sparsity: the model used to enforce sparsity.
- indices: the features selected.
- w: the values of the estimator on the selected features only.
- α: the values of the associated dual variables.
- b: the bias term.
- iter: the number of iterations required by the algorithm.
For instance, you can access the selected features directly in the indices field:
julia> Sparse_Regressor.indices
10-element Array{Int64,1}:
362
1548
2361
3263
3369
3598
5221
7314
7748
9267
or compute predictions:
julia> Y_pred = X[:,Sparse_Regressor.indices]*Sparse_Regressor.w
100-element Array{Float64,1}:
4.62918
8.59952
-16.2796
-5.611
1.62764
-50.4562
37.407
-12.3341
-4.75339
25.122
⋮
-7.98349
11.0327
-8.58172
16.904
-9.04211
-36.5475
17.2558
-22.3915
-57.9727
-6.06553
For classification, we use +1/-1 labels and the convention P(Y = y | X = x) = 1 / (1 + e^{-y x^T w}).
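As an illustration of the classification setting (the data-generating lines below are assumptions made for the sake of the example, not part of the package), a sparse classifier can be fitted with the Hinge loss (L1SVM(), see the list of losses below) as follows:
julia> using SubsetSelection, StatsBase
julia> n = 100; p = 1000; k = 5;
julia> indices = sort(sample(1:p, k, replace=false));
julia> w = sample(-1:2:1, k);
julia> X = randn(n, p);
julia> Y = sign.(X[:, indices]*w);   # ±1 labels
julia> Sparse_Classifier = subsetSelection(L1SVM(), Constraint(k), Y, X)
The returned SparseEstimator has the same fields as in the regression example above.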
subsetSelection has four required parameters:
- the loss function to be minimized, to be chosen among least squares (OLS()), L1-SVR (L1SVR(ɛ)), L2-SVR (L2SVR(ɛ)), Logistic loss (LogReg()), Hinge loss (L1SVM()), and L2-SVM (L2SVM()). For classification, we recommend using the Hinge loss or L2-SVM functions. Indeed, the Fenchel conjugate of the Logistic loss exhibits unbounded gradients, which largely hinders convergence of the algorithm and might require a smaller stepsize and more iterations (see optional parameters).
- the model used to enforce sparsity: either a hard constraint of the form "||w||_0 ≤ k" (Constraint(k)) or a penalty of the form "+ λ ||w||_0" (Penalty(λ)) added to the objective. For tractability reasons, we highly recommend using an explicit constraint instead of a penalty, for it ensures the size of the support remains bounded throughout the algorithm. Both formulations are illustrated after this list.
- the vector of outputs Y of size n, the sample size. In classification settings, Y should be a vector of ±1s.
- the matrix of covariates X of size n×p, where n and p are the number of samples and features respectively.
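As mentioned above, the two sparsity-inducing formulations only differ in the second argument. A minimal sketch on the regression data from the first example (the value λ = 0.01 is an arbitrary illustration, not a recommended default):
julia> subsetSelection(OLS(), Constraint(k), Y, X)     # hard constraint ||w||_0 ≤ k
julia> subsetSelection(OLS(), Penalty(0.01), Y, X)     # penalty + λ ||w||_0 with λ = 0.01 (illustrative value)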
In addition, subsetSelection accepts the following optional parameters (a sketch of a call using them is shown after this list):
- an initialization for the selected features, indInit.
- an initialization for the dual variables, αInit.
- the value of the ℓ2-regularization parameter γ, set to 1/√n by default.
- intercept, a boolean. If true, an intercept/bias term is computed as well. By default, set to false.
- the maximum number of iterations in the sub-gradient algorithm, maxIter.
- the value of the gradient stepsize δ. By default, the stepsize is set to 1e-3, which demonstrates very good empirical performance. However, smaller stepsizes might be needed when dealing with very large datasets or when the Logistic loss is used.
- the number of gradient updates of the dual variable α performed per update of the primal variable s, gradUp.
- anticycling, a boolean. If true, the algorithm stops as soon as the support is unchanged from one iteration to the next. Empirically, the accuracy of the resulting support is strongly sensitive to noise, so use this option with caution. By default, set to false.
- averaging, a boolean. If true, the dual solution is averaged over past iterates. By default, set to true.
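As a sketch, and assuming these options are passed as keyword arguments (the specific values below are illustrative, not recommended settings), a call could look like:
julia> subsetSelection(OLS(), Constraint(k), Y, X,
                       γ = 1/sqrt(n), intercept = true,
                       δ = 1e-3, maxIter = 200,
                       anticycling = false, averaging = true)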
- Tuning the regularization parameter γ: By default, γ is set to 1/√n, which is an appropriate scaling in most regression instances. For optimal performance, especially in classification or noisy settings, we recommend performing a grid search and using cross-validation to assess out-of-sample performance. The grid search should start with a very low value for γ, such as
γ = 1.*p / k / n / maximum(sum(X[train,:].^2,2))
and iteratively increase it by a factor of 2 (a sketch of such a search is given below). Mean squared error and Area Under the Curve (see ROCAnalysis for an implementation) are commonly used performance metrics for regression and classification tasks respectively.
- Instances where the algorithm fails to converge have been reported. If you encounter such cases, try normalizing the data matrix X and relaunching the algorithm (a sketch is given below). If the algorithm still fails to converge, reduce the stepsize δ by a factor of 10 or 100 and increase the number of iterations maxIter by a factor of at least 2.
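For concreteness, here is a rough sketch of the grid search described above, using a simple hold-out validation set (the 70/30 split, the ten grid points, and passing γ as a keyword argument are assumptions made for illustration):
train = 1:70; valid = 71:100                          # illustrative hold-out split
γ = 1.*p / k / n / maximum(sum(X[train,:].^2,2))      # low starting value suggested above
bestγ, bestMSE = γ, Inf
for step in 1:10                                      # illustrative number of grid points
    estimator = subsetSelection(OLS(), Constraint(k), Y[train], X[train,:], γ = γ)
    Y_val = X[valid, estimator.indices] * estimator.w
    MSE = sum(abs2, Y[valid] - Y_val) / length(valid) # out-of-sample mean squared error
    if MSE < bestMSE
        bestγ, bestMSE = γ, MSE                       # keep the best value of γ so far
    end
    γ *= 2                                            # increase γ by a factor of 2
end
Similarly, a minimal sketch of the normalization fallback mentioned in the last point (the column-wise standardization and the values δ = 1e-4 and maxIter = 500 are illustrative assumptions):
X_norm = (X .- mean(X, 1)) ./ std(X, 1)               # standardize each column of X
subsetSelection(OLS(), Constraint(k), Y, X_norm, δ = 1e-4, maxIter = 500)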