privefl/bigstatsr

How does dfmax work?

biona001 opened this issue · 2 comments

I think this is more a question than an issue.

I ran a sparse linear regression with dfmax=10000, which gives the "Too many variables" message, yet extracting the optimal beta gives me 23309 non-zero entries. When I then inspect what I presume is the sparsity level for each lambda, it never goes much above 10000.

# fit lasso and check fit
lasso.fit <- big_spLinReg(G$genotypes, y, covar.train=Z, dfmax=10000)
summary(lasso.fit)$message
[[1]]
 [1] "Too many variables" "Too many variables" "Too many variables"
 [4] "Too many variables" "Too many variables" "Too many variables"
 [7] "Too many variables" "Too many variables" "Too many variables"
[10] "Too many variables"
# extract best beta and count non-zero entries
result <- summary(lasso.fit, best.only = TRUE)
lasso_beta <- result$beta[[1]]
sum(lasso_beta != 0)
[1] 23309
# check the tenth CV fold's result and its number of active variables per lambda
r <- lasso.fit[[1]][[10]]
r$nb_active
  [1]     0     1     1     1     1     1     1     1     1     1     1     1
 [13]     1     1     1     1     1     1     1     1     1     1     1     1
 [25]     1     1     1     1     1     1     1     1     1     1     1     1
 [37]     1     1     1     1     1     1     1     1     1     1     1     1
 [49]     1     1     1     1     1     2     2     2     2     2     2     2
 [61]     2     2     2     2     2     2     2     2     2     2     2     3
 [73]     3     3     3     3     3     3     3     3     4     4     8     8
 [85]     8     8     8     8     8    10    10    15    18    20    24    28
 [97]    34    37    40    47    52    57    63    72    78    87    98   107
[109]   122   138   155   174   196   210   228   257   277   316   337   378
[121]   414   462   515   561   619   670   734   813   880   971  1068  1165
[133]  1276  1400  1538  1689  1860  2035  2251  2433  2663  2947  3250  3565
[145]  3913  4294  4736  5161  5665  6210  6815  7414  8059  8805  9656 10517

I don't really understand where 23309 comes from. It also seems a bit unexpected to me that specifying dfmax=10000 still gave me a model with many more non-zero entries.

I guess if you check each model individually (one for each of the CMSA splits), you will get slightly more than 10K non-zero variables in each.
But the final model averages all these models, so it can have many more than 10K non-zero variables if the variables used are not the same in all K models.
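
To illustrate the effect, here is a minimal sketch in base R. It does not use bigstatsr or reproduce its internals. The sizes `p`, `K` and `dfmax`, and the random choice of active variables, are all made up for the example. It only shows that averaging K sparse coefficient vectors gives a vector whose non-zero entries are the union of the K supports:

```r
# Toy illustration only (base R; hypothetical sizes, not bigstatsr internals).
set.seed(1)
p     <- 1000  # number of variables
K     <- 10    # number of models (one per split)
dfmax <- 50    # cap on non-zero coefficients in each model

# Each model selects its own set of `dfmax` variables.
# Result is a p x K matrix with one column of coefficients per model.
betas <- sapply(seq_len(K), function(k) {
  beta <- numeric(p)
  active <- sample(p, dfmax)
  beta[active] <- rnorm(dfmax)
  beta
})

nnz_per_model <- colSums(betas != 0)  # non-zero count in each model
beta_avg      <- rowMeans(betas)      # average of the K coefficient vectors
nnz_avg       <- sum(beta_avg != 0)   # non-zero count after averaging

print(nnz_per_model)
print(nnz_avg)
```

Each individual model respects the cap, but the averaged vector is non-zero wherever at least one model is non-zero, so its count can be anywhere between `dfmax` and `K * dfmax` depending on how much the selected sets overlap.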

I see. You are taking each of the models with slightly more than 10k variables, and literally averaging their beta values. Thanks!