possibly erroneous odds ratio?
alexlevylab opened this issue · 3 comments
Hi, thanks for this great software, I like it a lot! However I am seeing strange results:
I am running a regression of bacterial genomes with a trait (0 or 1), and presence and absence of protein domains (0 or 1):
total number of genomes: 1458
total number genomes with trait A: 493; 5 of those encode for protein domain X
total number of genomes without trait A: 965; 425 of those encode for protein domain X
I thought this would be a good example as a sanity check because the raw numbers are speak to a severe depletion of the protein domain X in the genomes with trait A.
When I run the lmm version of the program, I get the expected result (an odds ratio smaller than 1):
pyseer --lmm --phenotypes trait.tsv --pres domain_presence_absence.tsv --similarity phylogeny_similarity_lmm.tsv --cpu 15 --max-af 0.97 --filter-pvalue 1 > lmm.assoc
variant af filter-pvalue lrt-pvalue beta beta-std-err variant_h2 notes
PF01111 2.95E-01 3.85E-65 9.24E-01 -8.60E-02 9.00E-01 2.50E-03
#convert beta to odds ratio
predicted odds-ratio = 0.917
However, when I run the regular pyseer command, I get an unexpectedly inflated odds ratio:
pyseer --phenotypes trait.tsv --pres domain_presence_absence.tsv --distances phylogeny_distances.tsv --cpu 15 --lineage --lineage-file lineage_effects.out --max-af 0.97 --filter-pvalue 1 > pyseer.assoc
variant af filter-pvalue lrt-pvalue beta beta-std-err intercept PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 lineage notes
PF011111 2.95E-01 3.85E-65 1.99E-09 9.16E+00 1.70E+00 -4.53E+00 4.13E+00 -1.15E+01 8.90E+00 -7.73E+00 -1.30E+00 -1.21E-01 1.89E+00 4.92E+00 4.17E+00 1.85E+00 MDS5
#result
predicted odds-ratio = 9509.057
This odds ratio speaks to enrichment, and seems to me strangely inflated as well.
Forgive my ignorance, but is perhaps a feature of the stochastic fitting of the sigmoid to the data? Maybe somehow it diverges? Is it a bug? How can I interpret this? I would like to fix this issue, or at least know how to justify/understand why I am getting these values. Perhaps it is totally sensical, and is a hole in my understanding.
Thank you in advance!
My guess would be that this is due to population structure, which you are (correctly) including as a correction in both methods. If it would help with your intiuition, you could try running without this (see --no-distances
with the fixed effect model). You may also want to view the trait on the phylogeny. Typically it's not as simple as noting an obvious depletion, because these could all be on a single branch, so power to map it to an individual locus would be low.
Hi Professor Lees,
I ran the pyseer code again with the --no-distances
flag as you recommended. I am now seeing the expected result based on the raw data (negative beta; odds ratio < 1).
variant af filter-pvalue lrt-pvalue beta beta-std-err intercept
PF01111 2.95E-01 3.85E-65 1.62E-86 -4.34E+00 4.54E-01 -1.01E-01
Upon observing the tree, it is indeed terribly imbalanced (see image; Red and blue = trait presence or absence; grey and navy blue = pfam presence or absence). Although the power to map the trait to a locus should be low, I am still getting a p-value of 1.99E-09
with the regular pyseer run (from my previous post).
I therefore need a way to remove any cases like these where the population structure changes the sign of the beta, yet is passing the p-value filter. I am wondering if I can somehow use the fact that the --no-distances
and the regular run had two different signs for beta (negative vs. positive) as a sign to disqualify the result. Does that make sense to you? I am not an expert and wonder if you think it's reasonable.
Thanks!
Alex
I wouldn't necessarily recommend coming up with your own adhoc filtering criteria, and would instead stick to the best practices advice and the methods in the second half of the GWAS tutorial
Use the linear mixed model (which seemed to work here), check the QQ-plot, and check your findings both against the tree and biologically.