MaayanLab/appyter-catalog

[Independent Enrichment Analysis] Bug in background?

kvittingseerup opened this issue · 2 comments

I've recently started using your Appyter for Independent Enrichment Analysis to analyze the Enrichr catalog with a custom background.

But because I kept getting very large odds ratios and very small p-values, I got suspicious. I therefore tested the first 10 genes of the NRF2 pathway WP2884 from the 2019 Human WikiPathways library, using the first 20 genes of that gene set as the background. The result can be found here. As seen in the notebook, the odds ratio for the NRF2 pathway WP2884 is calculated to be Inf and the p-value is 6.56e-32. That does not seem like it should be the case if the background were taken into account.

Did I input the genes wrongly or something similar?

Inf odds ratios are possible if the input gene set is a subset of a gene set in the gene set library. This follows from the odds ratio formula for a 2x2 contingency table:

a b
c d

odds ratio = ad/bc

In the case of a subset, bc will be 0. In the Enrichr code this is handled by dividing by max(1, bc), which results in a very large value.
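To make the zero-denominator case concrete, here is a minimal Python sketch. It is not the Enrichr code; the function names are made up for illustration, and the cell meanings are an assumption (a = input genes in the library set, b = input genes not in the set, c = set genes not in the input, d = remaining background genes). The numbers assume 10 input genes that all fall inside a 20-gene set, with a 20,000-gene background.

```python
import math


def odds_ratio_plain(a, b, c, d):
    """Plain ad/bc; returns inf (or nan for 0/0) when bc is 0."""
    num, den = a * d, b * c
    if den == 0:
        return math.nan if num == 0 else math.inf
    return num / den


def odds_ratio_clamped(a, b, c, d):
    """ad / max(1, bc): the guard described above, so bc == 0 divides by 1."""
    return (a * d) / max(1, b * c)


# Input is a subset of the library set, so b == 0 and therefore bc == 0.
print(odds_ratio_plain(10, 0, 10, 19980))    # inf
print(odds_ratio_clamped(10, 0, 10, 19980))  # 199800.0
```

With the clamp the result is finite but still very large, because ad is divided by 1 instead of by a meaningful bc.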

Not sure if there are some other issues with the background correction, though.

I think the problem is better illustrated by the p-value. With the dataset I mention above, the Fisher test would (in R code) look something like:

m1 <- matrix(c(0,0,10,10), ncol = 2, byrow = F)
broom::tidy( fisher.test(m1) )
  estimate  p.value conf.low conf.high method                             alternative
         0       1        0       Inf Fisher's Exact Test for Count Data two.sided  

If on the other hand the background was not used you would end up with something like:

m2 <- matrix(c(2e4,0,10,10), ncol = 2, byrow = F)
broom::tidy( fisher.test(m2) )
  estimate  p.value conf.low conf.high method                             alternative
       Inf 6.50e-32    3103.       Inf Fisher's Exact Test for Count Data two.sided  
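For anyone who wants to check the two tables without R, below is a rough standard-library Python sketch of a two-sided Fisher's exact test. It is only an illustration, not what the Appyter runs: the function name is hypothetical, and the cell layout is read off the column-wise R matrices above (m1 is [[0, 10], [0, 10]], m2 is [[20000, 10], [0, 10]]). The two-sided rule used here (sum all tables no more probable than the observed one, with a small relative tolerance) is my understanding of what R's fisher.test does.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = math.comb(total, col1)

    def pmf(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [pmf(x) for x in range(lo, hi + 1)]
    p_obs = pmf(a)
    # Sum every table that is at most as probable as the observed one.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


p_background = fisher_exact_two_sided(0, 10, 0, 10)         # m1
p_no_background = fisher_exact_two_sided(20000, 10, 0, 10)  # m2
print(p_background)              # 1.0
print(f"{p_no_background:.1e}")  # about 6.5e-32, in line with m2
```

The first table has only one possible configuration once the margins are fixed, which is why its p-value is exactly 1.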

I've tested this with some of my real data, and the odds ratios and p-values reported by Independent Enrichment Analysis are very similar to what I get when running a Fisher test with all known genes as background instead of the provided subset.

You can also see this by running the example dataset you provide as both foreground and background (Appyter found here). There I still get very significant results with very high odds ratios, even though there should be no enrichment.