How does dfmax work?
biona001 opened this issue · 2 comments
I think this is more a question than an issue.
I did a sparse linear regression with dfmax=10000, which is throwing the "Too many variables" warning, but extracting the optimal beta gives me 23309 non-zero entries? Then I inspect (presumably?) the sparsity level for each lambda, and it never reaches much more than 10000.
# fit lasso and check fit
lasso.fit <- big_spLinReg(G$genotypes, y, covar.train=Z, dfmax=10000)
summary(lasso.fit)$message
[[1]]
[1] "Too many variables" "Too many variables" "Too many variables"
[4] "Too many variables" "Too many variables" "Too many variables"
[7] "Too many variables" "Too many variables" "Too many variables"
[10] "Too many variables"
# extract best beta and count non-zero entries
result <- summary(lasso.fit, best.only = TRUE)
lasso_beta <- result$beta[[1]]
sum(lasso_beta != 0)
[1] 23309
# check the 10th CV fold's result and its number of active variables per lambda
r <- lasso.fit[[1]][[10]]
r$nb_active
[1] 0 1 1 1 1 1 1 1 1 1 1 1
[13] 1 1 1 1 1 1 1 1 1 1 1 1
[25] 1 1 1 1 1 1 1 1 1 1 1 1
[37] 1 1 1 1 1 1 1 1 1 1 1 1
[49] 1 1 1 1 1 2 2 2 2 2 2 2
[61] 2 2 2 2 2 2 2 2 2 2 2 3
[73] 3 3 3 3 3 3 3 3 4 4 8 8
[85] 8 8 8 8 8 10 10 15 18 20 24 28
[97] 34 37 40 47 52 57 63 72 78 87 98 107
[109] 122 138 155 174 196 210 228 257 277 316 337 378
[121] 414 462 515 561 619 670 734 813 880 971 1068 1165
[133] 1276 1400 1538 1689 1860 2035 2251 2433 2663 2947 3250 3565
[145] 3913 4294 4736 5161 5665 6210 6815 7414 8059 8805 9656 10517
I don't really understand where 23309 came from. It also seems a bit unexpected to me that specifying dfmax=10000 still gave me a model with many more non-zero entries.
I guess if you check each model individually (one per CMSA split), you will get slightly more than 10K non-zero variables.
But the final model averages all K of these models, so it can have many more than 10K non-zero variables if the K models do not all use the same variables.
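To make the averaging effect concrete, here is a toy sketch in base R. It is not bigstatsr's internal code: the sizes `p`, `K`, and the "half shared, half fold-specific" selection pattern are assumptions chosen only for illustration. It shows that when each of K sparse coefficient vectors is capped at dfmax non-zero entries, their average is non-zero on the union of the selected variables, which can be much larger than dfmax.

```r
# Toy illustration only (base R); all sizes below are hypothetical.
set.seed(1)
p     <- 100000  # number of candidate variables
K     <- 10      # number of models (one per fold)
dfmax <- 10000   # cap on non-zero coefficients per model

# Assume half of each model's variables are a shared "core" and the
# other half are picked independently in each fold.
core <- seq_len(dfmax / 2)
betas <- sapply(seq_len(K), function(k) {
  beta  <- numeric(p)
  extra <- sample(setdiff(seq_len(p), core), dfmax / 2)
  beta[c(core, extra)] <- rnorm(dfmax)
  beta
})

# Each individual model respects the cap
print(colSums(betas != 0))

# The averaged model is non-zero wherever any fold's model is non-zero
beta_avg <- rowMeans(betas)
print(sum(beta_avg != 0))
```

The more the per-fold selections differ, the closer the averaged model's non-zero count gets to K times dfmax; if all folds picked exactly the same variables, it would stay at dfmax.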
I see. You are taking each of the models with slightly more than 10k variables, and literally averaging their beta values. Thanks!