lamho86/phylolm

Different p-value for same tree with different species sort

Closed this issue ยท 3 comments

Hello!

I have a question about the "phylolm" package. I could not find an explanation to my problem in the package documentation.

I have a tree, a file with a phenotype and amino acids written like 1 or 0 in a matrix file. I would like to check if a specific amino acid is associated with the phenotype.

Should I be using the rTrait and rbinTrait functions like it is shown in the example in the help documents of the R package?

My pipeline would looks like:

data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
tree = read.newick("tree.nwk")
x = rTrait(n=1,phy=tree)
X = cbind(data$aminoacid,x)
y = rbinTrait(n=1,phy=tree, beta=c(-1,0.5), alpha=1 ,X=X)
dat = data.frame(aminoacid = y, predictor = x)
fit = phyloglm(aminoacid~predictor,phy=tree,data=dat,boot=100)

However, my issue with this is that if I use the same phylogenetic tree but with species sorted in a different way I obtain highly different p-values. I tried to sort species in the matrix file the same way as they are present on the tree by this did not help.

If I exclude the rTrain and rbinTrain functions:

data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
tree = read.newick("tree.nwk")
fit = phyloglm(Aminoacid~Phenotype,phy=tree,data=data,boot = 100)

p-values for the sorted and unsorted trees look the very similar, but I am not sure this is a correct approach to use.

Could you please help in resolving this issue?

Hello!

The rTrait and rbinTrait are functions to generate data. In our example, we don't have data, so we used these functions to generate. Since you have data, excluding them is correct.

Thank a lot, but I'm still doesn't understand why p-value are different for same tree with different sort when I'm use only phylolm function.

> all.equal.phylo(target = tree_sorted, current = tree)
[1] TRUE
> all.equal.phylo(target = tree, current = tree_sorted)
[1] TRUE

For example I'am have p-value for unsorted tree 6.044e-09,

> data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
> tree = read.newick("tree.nwk")
> set.seed(123456)
> fit = phyloglm(aminoacid~phenotype,phy=tree,data=data,boot = 100)
> fit_nonsorted=summary(fit)
> print(fit_nonsorted)

Call:
phyloglm(formula = aminoacid ~ phenotype, data = data, phy = tree,
    boot = 100)
       AIC     logLik Pen.logLik
     42.29     -18.15     -17.19

Method: logistic_MPLE
Mean tip height: 176.9107
Parameter estimate(s):
alpha: 0.3078499
      bootstrap mean: 0.1442409 (on log scale, then back transformed)
      so possible downward bias.
      bootstrap 95% CI: (0.01718381,0.3079263)

Coefficients:
            Estimate   StdErr  z.value lowerbootCI upperbootCI   p.value    
(Intercept) -3.67941  0.64461 -5.70793    -5.30723     -2.6647 1.144e-08 ***
phenotype  5.11229  0.87907  5.81554     3.56559      6.8034 6.044e-09 ***
---
Signif. codes:  0 โ€˜***โ€™ 0.001 โ€˜**โ€™ 0.01 โ€˜*โ€™ 0.05 โ€˜.โ€™ 0.1 โ€˜ โ€™ 1

Note: Wald-type p-values for coefficients, conditional on alpha=0.3078499
      Parametric bootstrap results based on 100 fitted replicates

when I'm sort tree by cladesize I'm get p-value 6.003e-09, this is quite close, but a litle bit confused. Are any special requirment for sorting tree and matrix file for "phylolm" package?

> data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
> tree = read.newick("tree.nwk")
> tree_sorted<-SortTree(tree, how = "Cladesize", order = TipLabels(tree))
> target<-tree_sorted$tip.label
> data_sorted=data[match(target, row.names(data)),]
> set.seed(123456)
> fit = phyloglm(aminoacid~phenotype,phy=tree_sorted,data=data_sorted,boot = 100)
> fit_sorted=summary(fit)
> print(fit_sorted)

Call:
phyloglm(formula = aminoacid ~ phenotype, data = data_sorted, phy = tree_sorted,
    boot = 100)
       AIC     logLik Pen.logLik
     42.31     -18.15     -17.20

Method: logistic_MPLE
Mean tip height: 176.9107
Parameter estimate(s):
alpha: 0.2923237
      bootstrap mean: 0.1066876 (on log scale, then back transformed)
      so possible downward bias.
      bootstrap 95% CI: (0.007751204,0.3084363)

Coefficients:
            Estimate   StdErr  z.value lowerbootCI upperbootCI   p.value    
(Intercept) -3.67257  0.64273 -5.71398    -5.31000     -2.4048 1.104e-08 ***
phenotype  5.10492  0.87764  5.81666     3.53852      7.0445 6.003e-09 ***
---
Signif. codes:  0 โ€˜***โ€™ 0.001 โ€˜**โ€™ 0.01 โ€˜*โ€™ 0.05 โ€˜.โ€™ 0.1 โ€˜ โ€™ 1

Note: Wald-type p-values for coefficients, conditional on alpha=0.2923237
      Parametric bootstrap results based on 100 fitted replicates

I'm so sorry bothering you with so probably silly question, but where I doing something wrong?

They have the same result. The difference is less than the machine error (10^{-8}). In my opinion, there is nothing wrong.