fisherTestForGenomicOverlapsMilosPjanicMod

Introduction

This combined bash/R script generates a Fisher test p-value for the significance of the overlap between two sets of genomic regions (for example, from ChIP-Seq experiments).

To calculate the Fisher p-value you need three BED files:

First BED file to overlap, e.g. file1.bed
Second BED file to overlap, e.g. file2.bed
BED file that serves as the genomic background for the overlaps, e.g. background.bed

Note that the file names do not have to be file1.bed, file2.bed and background.bed; these are just examples.

As a genomic background for the human genome you can use, for example, the combined ENCODE set of open chromatin regions or a similar data set. The genomic background is needed to calculate the four cells of the Fisher test contingency matrix within the background set BG: regions overlapping both A and B, regions overlapping A but not B, regions overlapping B but not A, and regions overlapping neither A nor B.
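The relationship between the four cells can be sketched in bash as follows. This is a minimal illustration, not the code used by fishertest.sh: the function name contingency_cells and the example counts are hypothetical, and the sketch assumes you have already counted the background regions in total (n_bg), those overlapping A (n_a), those overlapping B (n_b), and those overlapping both (n_ab).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of fishertest.sh.
# Derives the four contingency cells from counts of background regions.
#   $1 = n_bg  total number of background regions
#   $2 = n_a   background regions overlapping A
#   $3 = n_b   background regions overlapping B
#   $4 = n_ab  background regions overlapping both A and B
contingency_cells() {
    local n_bg=$1 n_a=$2 n_b=$3 n_ab=$4
    local both=$n_ab
    local a_only=$(( n_a - n_ab ))
    local b_only=$(( n_b - n_ab ))
    local neither=$(( n_bg - n_a - n_b + n_ab ))
    # Order: A and B, A not B, B not A, neither
    echo "$both $a_only $b_only $neither"
}

# Made-up counts, for illustration only
contingency_cells 1000 120 80 30
```

With these made-up counts the function prints 30 90 50 830. In R, four such values could be arranged as a 2x2 matrix and passed to fisher.test.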

We provide a combined ENCODE DHS data set as a BED file, background.bed, which you can use as a background set for the human genome.

Comparison with Bedtools Fisher

When compared to Bedtools Fisher, our test gives less inflated p-values, as seen in the Q-Q plot.

When compared to the theoretical distribution of p-values, our observed p-values follow the theoretical distribution and show deviations from it near the end, as expected.

Dependencies

The script must be run in a bash shell. R must also be installed, with Rscript available at /usr/bin/Rscript.
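A quick way to confirm the Rscript requirement before running the script is sketched below. The function check_rscript is a hypothetical pre-flight helper and is not part of fishertest.sh; it only tests whether an executable exists at the given path, which defaults to /usr/bin/Rscript.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of fishertest.sh.
# Succeeds if an executable exists at the given path
# (default: /usr/bin/Rscript), fails otherwise.
check_rscript() {
    local rscript_path=${1:-/usr/bin/Rscript}
    if [ -x "$rscript_path" ]; then
        echo "found: $rscript_path"
        return 0
    fi
    echo "missing: $rscript_path" >&2
    return 1
}

check_rscript || echo "Install R so that Rscript is available at /usr/bin/Rscript"
```

If R is installed elsewhere, one option is to create a symbolic link at /usr/bin/Rscript that points to the actual Rscript binary.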

Example run

Run ./fishertest.sh with the three file names as consecutive arguments: the two BED files to overlap, followed by the background BED file.

Example output

./fishertest.sh file1.bed file2.bed background.bed