gbradburd/conStruct

extract log-likelihoods from unfinished cross-validation


Dear Gideon,

thanks a lot for this wonderful program. I have a request or question here, rather than an issue to report. I am running conStruct with >250 population samples and 10,000 marker SNPs, using 10 cross-validation replicates with K = 1:10 to identify the optimal number of spatial layers. As you can imagine, this takes some time, particularly since both the spatial (sp) and the non-spatial (nsp) models are calculated consecutively before the log-likelihood table is produced.

Thus, I wanted to ask whether it is possible either to output the sp likelihood table before the calculation of the nsp models starts, or to manually calculate the likelihoods from the model.fit objects.

Thanks for your help!

Best, Martin

Hi Martin,

Yes, that'll be slow (although, if you have access to a cluster/multi-core machine, note that you can use the parallel argument in x.validation to parallelize the cross-validation procedure across cores). Unfortunately, there's no easy way to output the standardized log-likelihood (lnL) for one particular value of K or one class of model (e.g., spatial vs. nonspatial).
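For example, a parallelized call might look something along these lines (argument names follow the `x.validation` help page; the data objects `freqs`, `geoDist`, and `coords` stand in for your own, and the node count is just an example):

```r
library(conStruct)
library(parallel)
library(doParallel)

# register a parallel backend for x.validation to use
# (match the node count to your machine/cluster)
cl <- makeCluster(10)
registerDoParallel(cl)

my.xvals <- x.validation(train.prop = 0.9,   # proportion of loci in each training partition
                         n.reps = 10,        # number of cross-validation replicates
                         K = 1:10,           # values of K to compare
                         freqs = freqs,      # your allele frequency matrix
                         geoDist = geoDist,  # pairwise geographic distances
                         coords = coords,    # sampling coordinates
                         prefix = "my_xval", # placeholder output prefix
                         n.iter = 1e3,
                         make.figs = FALSE,
                         save.files = FALSE,
                         parallel = TRUE,
                         n.nodes = 10)

stopCluster(cl)
```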

We describe the full cross-validation procedure in the appendix of the paper (pg 50), but basically, for the _i_th cross-validation replicate, you're parameterizing each model (in your case, K=1:10, for both the spatial and non-spatial models) using the _i_th training data partition, then calculating the lnL of the testing partition given that parameterized model. Then - and this is the sticky wicket - you're standardizing those lnLs by subtracting the greatest lnL of any model for that particular partition, and those standardized lnLs can then be aggregated for any particular model across replicate partitions. So, until you've run the analyses for both the spatial and non-spatial models across all specified values of K, you can't get the standardized lnL for any model for that partition. Does that make sense, and also answer your question?
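To make the standardization step concrete with some made-up numbers, it's just this arithmetic (not the package's internal code):

```r
# raw test-partition lnLs for ONE cross-validation replicate,
# one entry per fitted model (values invented for illustration)
raw.lnL <- c(sp.K1 = -5120, sp.K2 = -5040, sp.K3 = -5045,
             nsp.K1 = -5300, nsp.K2 = -5210, nsp.K3 = -5190)

# standardize against the best (largest) lnL of ANY model for this
# partition; the best model gets 0 and all others are negative
std.lnL <- raw.lnL - max(raw.lnL)
std.lnL
```

These standardized values are what get aggregated across replicates, which is why no model's score can be reported until every model has been run on that partition.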

If you're getting super impatient with the x.validation runs, you could also try the calculate.layer.contribution approach. In datasets with lots of loci, it's not uncommon to see x.validation give strong statistical support for models with large K that don't make a lot of biological sense, or in which particular layers contribute negligibly to overall covariance. In those cases, the layer contributions are often very helpful.
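A minimal sketch of that, along the lines of the model-comparison vignette (the filenames and prefix here are placeholders for whatever your own run produced):

```r
library(conStruct)

# load the saved output of a finished spatial run
load("spK3_conStruct.results.Robj")  # loads `conStruct.results`
load("spK3_data.block.Robj")         # loads `data.block`

# proportion of total covariance explained by each of the K layers;
# layers contributing ~0 suggest that value of K is larger than needed
calculate.layer.contribution(conStruct.results = conStruct.results[[1]],
                             data.block = data.block)
```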

hope that helps, and sorry that it's a bit slow!
-Gideon

Hi Martin,

Just doing some bookkeeping - should I mark this issue as resolved?

Hi Martin,

Haven't heard from you, so I'm going to mark this as resolved, but if you want to reopen an issue, please do!