bsmith89/StrainFacts

How to perform model selection?

Closed this issue · 4 comments

Hi there,

Apologies if I missed it, but what metric should be used to compare models fit with differing numbers of strains? For instance, I was hoping to be able to parse something like the AIC or BIC from the model output somehow.

Perhaps this will depend on what model type I specify? See below for an example of how I'm running the command.

sfacts fit \
    --verbose \
    --num-strains 2 \
    --random-seed 0 \
    ./Snodgrassella_alvi_yqjC_metagenotype.loaded \
    ./Snodgrassella_alvi_yqeY_metagenotype.fit

Thanks,

Gavin

Unfortunately, StrainFacts does not provide a default way to do model selection. We have not explored the possibility of using something like BIC, but I believe the major challenge is calculating the effective number of parameters, given that we use regularizing priors (i.e. rho_hyper, pi_hyper, and gamma_hyper). Fortunately, this regularization also means that you don't generally need to specify the "correct" number of strains to get accurate estimates, since extra strains that might otherwise lead to overfitting are penalized. So, for some datasets, it's effective to simply run the model with extra strains and see how many are active in the final estimate (analogous to LASSO regression).

On the other hand, this does leave the problem of hyper-parameter selection. I have several ideas about how to evaluate the quality of a fit, but all of them are heuristic and they should probably be considered holistically.

  • StrainFacts provides several evaluation functions that might be useful to you. One of these is sfacts.evaluation.metagenotype_error2, which calculates the deviation between the expected number of alternative alleles at each position in each sample and the observed number. In particular, with the discretized=True option, this will give you a sense of whether an estimate is able to recapitulate the data.
  • If you have enough SNP positions, you could consider doing a test/train split: select a portion of positions to hold out, calculate estimates using the training set, and then use the sfacts fit_geno command to refit genotypes for the held-out positions while holding the strain fractions fixed from the training estimate. The metagenotype error on the held-out set should be a more accurate reflection of fit quality.
  • Visualize the fits and see if they make intuitive sense. StrainFacts provides several plotting functions in sfacts.plot that will be useful here, in particular plot_metagenotype, plot_genotype, and plot_community, which visualize each of these matrices. See examples/evaluate_simulation_fit.ipynb for some ideas about how to use these.
  • Finally, you could consider running a simulation (sfacts simulate) with similar depth, strain diversity, and strain heterogeneity to what you believe to be in your real-world data. If you're getting accurate fits to the simulated data (see sfacts evaluate_fit) this would increase your confidence in the estimates for your real data.
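To make the held-out evaluation idea in the second bullet concrete, here is a minimal, generic sketch in numpy. It is not StrainFacts code: the metagenotype is faked as alt-allele counts and depths, and `predicted_frac` stands in for whatever expected allele fractions your refit produces (all names are hypothetical). It just shows the mechanics of splitting SNP positions and computing a discretized error on the held-out set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy metagenotype: alt-allele counts and sequencing depth per (sample, position).
n_samples, n_positions = 4, 100
depth = rng.poisson(20, size=(n_samples, n_positions))
alt = rng.binomial(depth, 0.3)

# Hold out 20% of SNP positions for evaluation.
positions = rng.permutation(n_positions)
n_test = n_positions // 5
test_idx, train_idx = positions[:n_test], positions[n_test:]

# Stand-in for the model's expected alt-allele fraction at each held-out
# position (here a constant; in practice this would come from the refit).
predicted_frac = np.full((n_samples, n_test), 0.3)

# Discretized error: round observed and expected fractions to {0, 1} and
# count disagreements (similar in spirit to the discretized=True option).
observed_frac = alt[:, test_idx] / np.maximum(depth[:, test_idx], 1)
error = np.mean(np.round(observed_frac) != np.round(predicted_frac))
print(f"held-out discretized error: {error:.3f}")
```

In a real workflow the training positions would go to sfacts fit and the held-out positions to sfacts fit_geno, with the error computed on the latter.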

Sorry I don't have a more principled method for you. If your data is not too large, Strain Finder's approach may also be worth checking out (they use AIC). I can also point you at some of the utility scripts that I used for passing metagenotypes to/from the Strain Finder cPickle format when I was benchmarking.

Thanks for the information @bsmith89!

I am aiming to run StrainFacts on many different individual genes, so there aren't many SNP positions.

I didn't appreciate that extra strains would be penalized like that though - so would it be reasonable to set the number of strains to the upper limit of what I think could be possible and then filter out strains afterwards based on a reasonable relative abundance cut-off (e.g., required to be at 10% abundance in at least one sample)? Would it be silly to run StrainFacts with (for example) the number of strains set to 50 if in reality there were only 2? Should the penalization procedure be able to correctly handle that kind of situation?

Thanks!

Gavin

I am aiming to run StrainFacts on many different individual genes, so there aren't many SNP positions.

Cool! Definitely not the main use case that SF was designed for, but I'll be very curious how it turns out.

Given the low number of sites, you might also consider comparing the results to another tool like Strain Finder, since my "fuzzy genotype" approximation might lower your power to detect different strains to some extent.

Or maybe, if you're confident your genes of interest are ubiquitous and single-copy in the species, consider concatenating metagenotypes for all of the genes together and running SF on that as well. More SNPs is often helpful.
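The concatenation idea amounts to stacking the per-gene metagenotypes along the SNP-position axis before fitting. A minimal numpy sketch (StrainFacts itself stores metagenotypes differently; the arrays and names here are purely illustrative):

```python
import numpy as np

# Toy per-gene metagenotypes: alt counts and depths, shape (samples, positions).
gene_a_alt = np.array([[3, 0], [7, 2]])
gene_a_depth = np.array([[10, 5], [9, 8]])
gene_b_alt = np.array([[1], [6]])
gene_b_depth = np.array([[4], [6]])

# Concatenate along the position axis so a single fit sees all SNPs at once.
# This assumes the same samples, in the same order, in every per-gene table,
# and that the genes are single-copy so strain fractions are shared.
alt = np.concatenate([gene_a_alt, gene_b_alt], axis=1)
depth = np.concatenate([gene_a_depth, gene_b_depth], axis=1)
print(alt.shape)  # (2, 3): 2 samples, 3 pooled SNP positions
```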

I didn't appreciate that extra strains would be penalized like that though - so would it be reasonable to set the number of strains to the upper limit of what I think could be possible and then filter out strains afterwards based on a reasonable relative abundance cut-off (e.g., required to be at 10% abundance in at least one sample)?

Yep, this is essentially how I use it. The regularization of strain numbers (instead of model comparison) is especially useful in very large problems where refitting the model would be too time-consuming, and this is what SF was designed for. However, it sounds like your problem might be much smaller (fewer SNPs, just a couple of strains, not sure how many samples...). If you're not getting useful results from SF, I'd definitely suggest you take a look at DESMAN.
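The post-hoc filter you describe is simple to apply to the fitted community matrix. A sketch, assuming the strain fractions come out as a samples-by-strains array of relative abundances (the array here is made up for illustration):

```python
import numpy as np

# Toy community matrix from a fit with extra strains:
# rows = samples, columns = strains, entries = relative abundances.
community = np.array([
    [0.70, 0.25, 0.05],
    [0.15, 0.80, 0.05],
])

# Keep strains that reach >= 10% relative abundance in at least one sample.
keep = (community >= 0.10).any(axis=0)
filtered = community[:, keep]
print(keep)  # strain 3 never reaches 10% and is dropped
```

After filtering you may want to renormalize each sample's fractions so they sum to one again.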

Should the penalization procedure be able to correctly handle that kind of situation?

In practice, I've found SF is not particularly sensitive to specifying too many strains. Performance seems to be largely determined by sequencing depth and the amount of strain sharing across samples (rare strains are harder to deconvolve from mixed samples).

Thanks for the information!