lamho86/phylolm

phyloglm coefficient=0 P value=1

Opened this issue · 2 comments

Hi, Thanks for a great R package.

we have used the IG10 method of phyloglm. We are using it to associate (the presence of) orthologous groups of genes (OGs) within a species to the binary trait "clinical / environmental" by analysing a large number of genomes from clinical and environmental isolates. This revealed a set of OGs that are significantly associated with the isolates being clinical (taking phylogenetic structure into account) (FDR adjusted p-values < 0.05). Several of the OGs were already known virulence genes in the species so the results seem to make sense.

However, when investigating the enriched OGs we found several cases of OGs located in the same operons that were not significantly enriched according to phyloglm. In several cases, these displayed the same presence/absence pattern among the genomes as the enriched OGs in the same operon, except that they were present in all clinical genomes instead of all-but-one. This does not make sense, since the pattern should make them more associated with the clincial trait, not less, hence their p-value should be lower(?). Instead, phyloglm gave them p-value 1.0. When we ran phyloglm again on these OGs we saw that phyloglm produced a warnings (hereafter). Testing around a bit indicates that all cases where the OG is present in all clinical genomes give warnings:

"1. The estimated coefficients in the absence of phylogenetic signal lead to some linear predictors beyond 'btol'. Increase btol? Starting from beta=0 other than intercept.

  1. The estimate of 'alpha' (1696.97428328463) reached the upper bound (1713.9897618914). This may simply reflect a flat likelihood at large alpha values, indicating that the phylogenetic correlation is negligible.

  2. The phyloglm failed to converge."

When increasing btol to 30 or even 100, P values are close to one, and coefficient values are around 16.
Based on my review of previous issues and discussions, it was suggested that when alpha is high, using glm instead of phyloglm would be more appropriate. Following this advice, I utilized the R package logistf (which applies Firth’s modified score procedure in logistic regression analysis) to calculate p and coefficient values on these OGs. The results showed coefficients around 4 and very low p values, which align with the OGs in the same locus (coefficients around 3.1 and P values < 0.03).
Could you please help me to understand why phyloglm gives coefficient=0 and P-value=1?
Would it be okay to use Logistf for these OGs instead?
Thanks in advance

I don't understand this behavior. I suspect it's because of a "perfect separation" issue, but that should be taken care of with the penalty in the penalized likelihood, from Firth's correction. Very large coefficients in magnitude, beyond btol, point to this separation (perfect separation of the presence/absence cases using the predictor).

Did you try the method="logistic_MPLE" instead of the IG10 method? In my experience it behaves better, which is why it was chosen as the default.

Yes, here are some results:

cbind(matr_IG10, matr_MPLE)
coef alpha p.value coef alpha p.value
OG0003724 3.114850 373.9333 0.0003582600 3.434199 352.2505 0.0007278849
OG0003717 0.000000 1712.5930 1.0000000000 3.957085 344.5511 0.0026120137
OG0003718 0.000000 1712.5930 1.0000000000 3.957085 344.5511 0.0026120137
OG0003725 3.115043 372.3286 0.0003636831 3.439199 345.9934 0.0007890508
OG0003719 0.000000 1712.5930 1.0000000000 3.957085 344.5511 0.0026120137
OG0003711 0.000000 1712.5930 1.0000000000 3.957085 344.5511 0.0026120137
OG0003720 0.000000 1712.5930 1.0000000000 3.957085 344.5511 0.0026120137
OG0003712 2.460507 386.2847 0.0003713056 2.662290 345.8159 0.0004460286

Alpha and coefficient values are similar but p values are all 0.0026 and ~10 times higher than genes in the same locus. After FDR p values adjustment, they are not significant (q > 0.05), Which is not expected considering that they are present in all the clinical isolates and part of the same locus (and function).