geomorphR/geomorph

gm.prcomp optimization for large numbers of species

Closed this issue · 6 comments

Hello -

I'm experimenting with the phylogenetic PCA / PACA to analyze microbial communities. My datasets often have thousands of species. I have left gm.prcomp running for hours on what is actually a relatively small dataset for me, and have no idea how much longer it may take, so I started looking into ways to speed things up.

Within the gm.prcomp function, the first hurdle appears to be 'anc.BM'. I first thought the 'sapply' call inside could simply be changed to something like mclapply to at least parallelize things, but I tried it and it still takes quite a while, and now uses a lot of memory. It looks to me like there is a lot of redundant computation here: for simple ancestral state prediction, if you start at the tips and move toward the root, you should only have to visit each node once while accumulating the ancestral states, right? Whereas sapply processes each node independently, so the nodes closer to the root re-do all of the more-tipwise work.
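To illustrate the single-pass traversal I have in mind (this is just a toy sketch of the idea with made-up data structures, not geomorph's anc.BM; the weighting follows the spirit of Felsenstein's pruning algorithm):

```python
def single_pass_anc(children, branch_len, tip_vals, root):
    """Post-order sketch: visit each node exactly once, tips first.
    children: node -> list of child nodes
    branch_len: node -> length of the branch above that node
    tip_vals: tip -> trait value
    Returns estimated values for every node (tips included)."""
    val, length = {}, {}

    def visit(node):
        if node in tip_vals:                      # tip: value is observed
            val[node], length[node] = tip_vals[node], branch_len[node]
            return
        for c in children[node]:                  # children before parent
            visit(c)
        # combine child values weighted by inverse (effective) branch length
        w = [1.0 / length[c] for c in children[node]]
        val[node] = sum(wi * val[c]
                        for wi, c in zip(w, children[node])) / sum(w)
        # lengthen this node's branch to reflect estimation uncertainty
        length[node] = branch_len.get(node, 0.0) + 1.0 / sum(w)

    visit(root)
    return val
```

Each node is touched once, so the whole pass is linear in the number of nodes rather than re-walking the subtree below every internal node.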

I can probably sit and code up something myself, but I'm not sure if I'm missing something important! I'm a bit concerned that even if I put the time into speeding up the ancestral-state portion, later parts are still going to be intractable (e.g. the naive covariance matrix inversion in phylo.mat... which might be possible to speed up with tricks like those used in MCMCglmm?).
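One common trick of that sort is to never form the explicit inverse at all: factor the phylogenetic covariance once and reuse the factor. A sketch in Python/NumPy (C and X here are stand-ins for a covariance matrix and a trait matrix, not phylo.mat's actual variables):

```python
import numpy as np

def gls_whiten(C, X):
    """Whiten trait matrix X under covariance C without computing
    C^{-1} explicitly: factor C = L L^T, then apply L^{-1} to X.
    Quadratic forms follow as X^T C^{-1} X = (L^{-1}X)^T (L^{-1}X)."""
    L = np.linalg.cholesky(C)     # more stable than inverting C directly
    return np.linalg.solve(L, X)  # solves L @ Xw = X for Xw = L^{-1} X
```

For repeated solves against the same tree this amortizes to one factorization, and for special covariance structures (as exploited in MCMCglmm) sparse factorizations can be far cheaper still.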

Let me know if you have thought about these issues, have any ideas to optimize things for larger datasets, or if you'd be happy for me to try to contribute! Thank you!

Hi Mike -

Thanks for your reply!

That's a good point about just using ordinate() if I don't need the ancestral states. My initial goal was just to look at sample scores and how they're affected by incorporating the phylogeny of the taxa, so that's a good start. I played with it a bit and was still having some speed issues; I wound up just doing the matrix math and SVD with base R functions, and it ran pretty quickly (though there's definitely a chance I'm missing something important about the math...).
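Roughly, what I did amounts to the following (shown in Python/NumPy rather than my actual R code, and based on my reading of the phylogenetic PCA math along the lines of Revell 2009, not geomorph's implementation):

```python
import numpy as np

def phylo_pca_scores(X, C):
    """Phylogenetic-PCA sketch: X is an n-species x p-trait matrix,
    C the n x n phylogenetic covariance. Returns species scores."""
    n = X.shape[0]
    Ci = np.linalg.inv(C)   # fine for a sketch; factorize at scale
    one = np.ones((n, 1))
    # GLS estimate of the root/mean state
    a = (one.T @ Ci @ X) / (one.T @ Ci @ one)
    Xc = X - a                           # GLS-centered data
    R = Xc.T @ Ci @ Xc / (n - 1)         # evolutionary trait covariance
    evals, V = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1]      # sort axes by decreasing variance
    return Xc @ V[:, order]              # project species onto pPC axes
```

With C set to the identity this collapses to ordinary PCA, which was a handy sanity check.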

I do think it'd be nice to have the ancestral states and to be able to do some of the fancy plotting! But obviously with such a large phylogeny it'd need a few extra steps to filter things down and only plot a few important nodes or something, so that's a bigger project. Still, I did find a function in Rphylopars that is orders of magnitude faster for ancestral state calculations and could be a drop-in replacement: Rphylopars::anc.recon.

In case you're curious, I'm just trying to see how your methods compare and contrast with some 'traditional' microbiome PCoA methods, like Weighted UniFrac vs. Bray-Curtis dissimilarities. The phylogenetic information in a UniFrac PCoA often yields clearer patterns in our data, but I'd like to bridge the gap with more generative (and perhaps more interpretable) models of dimension reduction, like the package 'zinbwave' (which is very nice for my purposes but doesn't really have a way to incorporate phylogenetic information).

Great; thanks for the info! I do understand avoiding the clutter of dependencies...

That's awesome! Glad I could help!