permuting-feature-enrichment-test

This tool attempts to help determine the relationship of one set of genomic features to another, by executing a set of permutation tests, and by producing a set of graphs to help visualize the results. The basic idea of the permutation testing is to randomly change the location of the regions of interest in the genome, counting the number of overlaps with the feature of interest, to determine the distribution of overlap counts you would expect to see between the features and the regions by chance. This tool does permutation tests on three different, but related, measures: the number of regions that contain a feature; the number of features that overlap a region; and the number of base pairs of overlap between regions and features.

Example Usage and Output

To test the relationship of regions A to features B using 10000 permutations run on 4 cores, run the following command:

python overlap_bps_with_feature.py A.bed "Region A" B.bed "Feature B" genome.fa.fai --permutations 1000 --cores 4

The results will be placed in the directory A_to_B and contain a file results.txt that has the results of the permutation testing:

    REGION_HITS     REAL    Region A      Feature B        46
    REGION_HITS     QUANTILE        Region A      Feature B        0.823
    FEATURE_HITS    REAL    Region A      Feature B        110
    FEATURE_HITS    QUANTILE        Region A      Feature B        0.999
    OVERLAP REAL    Region A      Feature B        753149
    OVERLAP QUANTILE        Region A      Feature B        0.933

The lines marked "REAL" have the actual observed measures for the feature and region sets. The lines marked "QUANTILE" show the quantile of the real observed value in the distribution measured by the permutations; this can be used to estimate a p-value for enrichment (or un-enrichment, if it is low).

There will also be a pdf file, A_to_B.pdf, which contains plots of the permutation distributions of each of the three measures described above, as well as a series of plots of the three measures when the regions are shifted fixed amounts to the left and to the right of their current locations. You can see an example plot here.

Prerequisites

The following programs need to be installed and in the path of the user executing the script:

BEDTools https://code.google.com/p/bedtools/ (tested with 2.17.0)
BEDOps https://code.google.com/p/bedops/ (tested with 2.10)
R (Specifically Rscript) (tested with 2.14)

The Python script requires the argparse and multiprocessing libraries The R script requires the ggplot2 and grid libraries

Input Files

The script requires three input files:

Region BED file
Feature BED file

These are BED files with the locations of your regions and features. For proper results, these should be sorted and merged (with bedtools mergeBed) ahead of time.

Genome index file

This is a list of the chromosome names and their lengths in the genome of interest, of the type generated by samtools faidx.

Optionally you can also use a gaps file. This is a BED file with the locations of gaps in the genome reference, or other regions you want to be excluded from the possible locations used in the location permutations.

Options

Usage for the script is:

  usage: overlap_bps_with_feature.py [-h] [--gaps GAPS]
                               [--permutations PERMUTATIONS]
                               [--cores CORES] [--replot] [--shift_only]
                               [--process_iteration_chunk PROCESS_ITERATION_CHUNK]
                               [--iteration_number ITERATION_NUMBER]
                               [--iteration_chunk_size ITERATION_CHUNK_SIZE]
                               [--use_condor] [--verbose]
                               bp_file bp_name feature_file feature_name
                               genome

The options are:

gaps: A BED file containing regions of the genome to exclude from location permutation
permutations: The number of random location permutations to compute
cores: The number of simultaneous processes to launch to run permutations
replot: if this option is used, the script will not do any new testing but will use the data files stored in the results directory to redraw the plots
shift_only: skip the permutation testing and only generate the shift figures
use_condor: execute the permutations by spawning new tasks to be run on a compute grid using the condor scheduling engine
iteration_chunk_size: number of permutations to compute in a single process (only useful if you are playing with distrbuting processes on condor)
verbose: print extra debug/logging information
process_iteration_chunk/iteration_number/iteration_chunk_size: do not use; these parameters are used when distributing tasks to condor

Other Tools

There are a variety of other techniques for computing the significance of enrichment between a set of regions and a set of features. Most of them are more sophisticated than this script. Here are a few to investigate: