jsilve24/philr

Is the philr tutorial equally applicable to feature tables containing ASVs?


Hi,

Thanks for developing PhILR, a great tool for analyzing amplicon data in a CoDA manner.

I was reading the PhILR tutorial, but the example data set is based on OTU clustering. Is the tutorial equally applicable to a feature table containing ASVs?

Cheers

Hi Justin,

I have 3 questions:

1. Preprocessing of the feature table. In the PhILR paper, different OTU table filtering methods were applied to different datasets. For example, in the tutorial dataset, taxa were filtered out if they were not seen with more than 3 counts in at least 20% of samples or had a coefficient of variation ≤ 3. However, it is argued that these prefiltering steps are not necessary for ASVs, since they are free of sequencing errors. Do you think it is necessary to apply these "hard filtering thresholds" to an ASV table as well? If so, what are your recommendations for prefiltering the feature table? Is "soft-thresholding" (taxon weighting) a better alternative?

2. Phylogenetic tree. Sequence placement into a reference tree is now recommended for building the phylogeny for amplicon data analysis. Is a tree built by sequence placement more suitable for PhILR than a de novo tree?

3. How to identify balances that distinguish the levels of a categorical variable with more than 2 levels? Sparse logistic regression was used to identify balances that distinguished human/nonhuman samples. What if I have a categorical variable with 3 different outcomes? What statistical method do you recommend for this task?

Thanks in advance.

All of these questions are difficult to answer, but I will do my best to be concise.

Re Preprocessing - There are two reasons to do preprocessing/filtering: (1) you think some taxa are spurious and you want to remove them, or (2) some taxa are so low in abundance that you really don't have enough information to analyze them or to say anything interesting about them (i.e., you are focusing your statistical power intelligently). I would say that ASVs are not perfect; nothing is perfect. I still preprocess, but I like to think about it in terms of the second reason: try to focus your attention where you have data. I realize I am falling short of telling you how to do your analysis, but there are really no hard and fast rules here. That said, if you have a tremendous number of zeros, they can strongly influence your modeling results, even with taxon weighting.
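For readers who want to see what a hard filter of the kind quoted in the question looks like in practice, here is a minimal Python sketch. It is not the philr or phyloseq API: the function names `keep_taxon` and `filter_table`, the toy table, and the choice of the sample standard deviation for the coefficient of variation are all assumptions made for illustration. The default thresholds are simply the ones mentioned above (more than 3 counts in at least 20% of samples, coefficient of variation above 3) and should be treated as tunable, not as recommendations.

```python
from statistics import mean, stdev


def keep_taxon(counts, min_count=3, min_prevalence=0.2, min_cv=3.0):
    """Return True if one taxon (a list of per-sample counts) passes both filters."""
    n_samples = len(counts)
    # Prevalence filter: fraction of samples where the taxon exceeds min_count.
    prevalence = sum(1 for c in counts if c > min_count) / n_samples
    if prevalence < min_prevalence:
        return False
    # Variability filter: coefficient of variation = sd / mean (sample sd assumed).
    avg = mean(counts)
    if avg == 0:
        return False
    return stdev(counts) / avg > min_cv


def filter_table(table, **thresholds):
    """Keep only the taxa (dict keys) whose count vectors pass keep_taxon."""
    return {taxon: counts for taxon, counts in table.items()
            if keep_taxon(counts, **thresholds)}


# Hypothetical toy table: taxon name -> counts across 20 samples.
table = {
    "asv_bloom": [5, 5, 5, 5, 1000] + [0] * 15,  # prevalent enough, highly variable
    "asv_even": [40, 60] * 10,                   # prevalent, but low variability
    "asv_rare": [10] + [0] * 19,                 # seen in only 1 of 20 samples
}

filtered = filter_table(table)
print(sorted(filtered))
```

In this toy table only `asv_bloom` survives: `asv_even` fails the variability filter and `asv_rare` fails the prevalence filter. Loosening `min_cv` or `min_prevalence` is one way to explore how sensitive downstream results are to the filtering choice.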

Re Phylogenetic Tree - You pick the tree that you think is meaningful. PhILR doesn't care beyond that. That's your choice.

Re Balances for categorical variables with more than 2 levels - You can use multinomial regression (e.g., multiclass logistic regression). There is an implementation of this in the glmnet package as well.
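To make the idea concrete, the following is a rough, self-contained Python sketch of L1-penalized multinomial (softmax) regression fitted by proximal gradient descent. It is only an illustration of the technique, not glmnet's algorithm or interface (in R, glmnet fits this kind of model with `family = "multinomial"`). The simulated "balances", the group centers, the penalty `lam`, the step size, and every function name here are assumptions chosen for the example.

```python
import math
import random


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def class_scores(W, b, x):
    return [b[k] + sum(wj * xj for wj, xj in zip(W[k], x)) for k in range(len(W))]


def fit_sparse_multinomial(X, y, n_classes, lam=0.05, lr=0.1, n_iter=500):
    """Return per-class weights W (n_classes x n_features) and intercepts b."""
    n, p = len(X), len(X[0])
    W = [[0.0] * p for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(n_iter):
        grad_W = [[0.0] * p for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for xi, yi in zip(X, y):
            probs = softmax(class_scores(W, b, xi))
            for k in range(n_classes):
                err = (probs[k] - (1.0 if k == yi else 0.0)) / n
                grad_b[k] += err
                for j in range(p):
                    grad_W[k][j] += err * xi[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k]  # intercepts are not penalized
            for j in range(p):
                W[k][j] = soft_threshold(W[k][j] - lr * grad_W[k][j], lr * lam)
    return W, b


def predict(W, b, x):
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Simulated data: 3 groups, 4 balances; only balances 0 and 1 differ by group.
rng = random.Random(0)
centers = {0: (2.0, 0.0), 1: (-2.0, 0.0), 2: (0.0, 3.0)}
X, y = [], []
for label, (c0, c1) in centers.items():
    for _ in range(30):
        X.append([rng.gauss(c0, 0.5), rng.gauss(c1, 0.5),
                  rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)])
        y.append(label)

W, b = fit_sparse_multinomial(X, y, n_classes=3)
selected = [j for j in range(4) if any(abs(W[k][j]) > 0 for k in range(3))]
accuracy = sum(predict(W, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
print("balances with a nonzero coefficient:", selected)
print("training accuracy:", accuracy)
```

Balances whose coefficients are nonzero for at least one class are the candidates for distinguishing the groups. In a real analysis the penalty strength would normally be chosen by cross-validation instead of being fixed by hand, and accuracy would be assessed on held-out samples, not on the training data.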

Thanks for sharing your thoughts on these questions.