Different p-value for same tree with different species sort
Closed this issue ยท 3 comments
Hello!
I have a question about the "phylolm" package. I could not find an explanation to my problem in the package documentation.
I have a tree, a file with a phenotype and amino acids written like 1 or 0 in a matrix file. I would like to check if a specific amino acid is associated with the phenotype.
Should I be using the rTrait and rbinTrait functions like it is shown in the example in the help documents of the R package?
My pipeline would looks like:
data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
tree = read.newick("tree.nwk")
x = rTrait(n=1,phy=tree)
X = cbind(data$aminoacid,x)
y = rbinTrait(n=1,phy=tree, beta=c(-1,0.5), alpha=1 ,X=X)
dat = data.frame(aminoacid = y, predictor = x)
fit = phyloglm(aminoacid~predictor,phy=tree,data=dat,boot=100)
However, my issue with this is that if I use the same phylogenetic tree but with species sorted in a different way I obtain highly different p-values. I tried to sort species in the matrix file the same way as they are present on the tree by this did not help.
If I exclude the rTrain and rbinTrain functions:
data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
tree = read.newick("tree.nwk")
fit = phyloglm(Aminoacid~Phenotype,phy=tree,data=data,boot = 100)
p-values for the sorted and unsorted trees look the very similar, but I am not sure this is a correct approach to use.
Could you please help in resolving this issue?
Hello!
The rTrait and rbinTrait are functions to generate data. In our example, we don't have data, so we used these functions to generate. Since you have data, excluding them is correct.
Thank a lot, but I'm still doesn't understand why p-value are different for same tree with different sort when I'm use only phylolm function.
> all.equal.phylo(target = tree_sorted, current = tree)
[1] TRUE
> all.equal.phylo(target = tree, current = tree_sorted)
[1] TRUE
For example I'am have p-value for unsorted tree 6.044e-09,
> data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
> tree = read.newick("tree.nwk")
> set.seed(123456)
> fit = phyloglm(aminoacid~phenotype,phy=tree,data=data,boot = 100)
> fit_nonsorted=summary(fit)
> print(fit_nonsorted)
Call:
phyloglm(formula = aminoacid ~ phenotype, data = data, phy = tree,
boot = 100)
AIC logLik Pen.logLik
42.29 -18.15 -17.19
Method: logistic_MPLE
Mean tip height: 176.9107
Parameter estimate(s):
alpha: 0.3078499
bootstrap mean: 0.1442409 (on log scale, then back transformed)
so possible downward bias.
bootstrap 95% CI: (0.01718381,0.3079263)
Coefficients:
Estimate StdErr z.value lowerbootCI upperbootCI p.value
(Intercept) -3.67941 0.64461 -5.70793 -5.30723 -2.6647 1.144e-08 ***
phenotype 5.11229 0.87907 5.81554 3.56559 6.8034 6.044e-09 ***
---
Signif. codes: 0 โ***โ 0.001 โ**โ 0.01 โ*โ 0.05 โ.โ 0.1 โ โ 1
Note: Wald-type p-values for coefficients, conditional on alpha=0.3078499
Parametric bootstrap results based on 100 fitted replicates
when I'm sort tree by cladesize I'm get p-value 6.003e-09, this is quite close, but a litle bit confused. Are any special requirment for sorting tree and matrix file for "phylolm" package?
> data = read.table("Aminoacid_and_phenotype_matrix.txt", header = TRUE)
> tree = read.newick("tree.nwk")
> tree_sorted<-SortTree(tree, how = "Cladesize", order = TipLabels(tree))
> target<-tree_sorted$tip.label
> data_sorted=data[match(target, row.names(data)),]
> set.seed(123456)
> fit = phyloglm(aminoacid~phenotype,phy=tree_sorted,data=data_sorted,boot = 100)
> fit_sorted=summary(fit)
> print(fit_sorted)
Call:
phyloglm(formula = aminoacid ~ phenotype, data = data_sorted, phy = tree_sorted,
boot = 100)
AIC logLik Pen.logLik
42.31 -18.15 -17.20
Method: logistic_MPLE
Mean tip height: 176.9107
Parameter estimate(s):
alpha: 0.2923237
bootstrap mean: 0.1066876 (on log scale, then back transformed)
so possible downward bias.
bootstrap 95% CI: (0.007751204,0.3084363)
Coefficients:
Estimate StdErr z.value lowerbootCI upperbootCI p.value
(Intercept) -3.67257 0.64273 -5.71398 -5.31000 -2.4048 1.104e-08 ***
phenotype 5.10492 0.87764 5.81666 3.53852 7.0445 6.003e-09 ***
---
Signif. codes: 0 โ***โ 0.001 โ**โ 0.01 โ*โ 0.05 โ.โ 0.1 โ โ 1
Note: Wald-type p-values for coefficients, conditional on alpha=0.2923237
Parametric bootstrap results based on 100 fitted replicates
I'm so sorry bothering you with so probably silly question, but where I doing something wrong?
They have the same result. The difference is less than the machine error (10^{-8}). In my opinion, there is nothing wrong.