command-line interface to compare an observed relationship between 2 sets of intervals to the distribution of that relationship derived from a number of shufflings. This "relationship" can be anything that returns a single number, e.g. total bases of overlap between the 2 sets, or number of overlaps.
see example section for usage
parameters:
------
inputs
------
-a: bed file (the -a file is always shuffled)
-b: bed file (the -b file is *NEVER* shuffled)
--bs: instead of b, specify a single file and output will be split into
groups by the final column (good for finding enrichment in ENCODE
regions or many user-defined regions with a single call)
-----------
constraints
-----------
--domain: (optional) shuffle -a intervals inside this domain.
--include: [same as domain] may be specified multiple times.
--exclude: (optional) do not allow shuffled intervals to land in the
intervals specified in this file (may be specified multiple
times)
--metric: (optional, default=3 metrics of overlap, n_intersections and
jaccard) may be any bash command, e.g.:
'awk "BEGIN{s=0}{s+=$3-$2}END{print s}"'])
or
'perl my-evaluator.pl'
or
'wc -l'
or anything that accepts intervals to stdin and prints a number
to stdout: eg:
'python interval-accepter.py'
the metric is value to compare the between observed and shuffled.
-g: genome version (to get chromosomes), e.g. mm8, dm3, hg19 or a file
----
misc
----
-n: (optional, default=1000) number of shufflings
-t: (optional, default=1) number of threads to use.
--seed: (optional) seed for the random number generator
--png: (optional) a path to save a png of the simulated distribution
vs the observed metric.
call like:
$ python -m shuffler -n 1000 -a query.bed -b subject.bed \
--domain domain.bed -t 8 -g hg19 --seed 42
output is a single integer indicating the number of simulations with a metric < the observed
$ python -m shuffler -a query.bed:500 -b subject.bed:-5000 --domain domain.bed
this will pad each side of query.bed by 500 bases (this only affects the shuffled data not the observed). it will also extend intervals in subject.bed only upstream. if 4 columns are present, it look for strand there if 6 columns are present it will check for strand there. if strand is included, up-stream is based on that, otherwise, upstream is "less than".
$ python -m shuffler -a query.bed -b subject.bed:-5000:100 --domain domain.bed
will extend the intervals in subject.bed by 5kb upstream and 100 downstream. for only downstream, use subject.bed:0:100
domains
in the examples above, the --domain is a bed file with a single set of regions the define the intervals where shuffled query intervals can be placed.
we extend this with --domains.
$ shuffler -a query.bed -b subject.bed:-5000:100 --domains domains.bed
here, the program will pull unique values from the 4th column of domains.bed and create a domain.$value.bed file for each uniqe value. so, e.g. domains.bed may contain 1000 TSS sites (labelled as "TSS" in the 4th column), 2100 exons, 2001 introns, and 407 UTRs. The program will create a file for each of those feature types and report the number of shuffled values that are larger than the observed for each of those types.
domains can also be extended:
$ shuffler -a query.bed -b subject.bed:-5000:100 --domains domains.bed:10000
will extend each domain intervla by 10KB.
example
We have the chromHMM segmentations in a BED file called: h1hesc.ChromHMM.hg18.bed the first few lines look like this:
chr1 0 200 DnaseU
chr1 200 400 CtcfO
chr1 400 5863 Low
chr1 5863 6263 CtcfO
chr1 6263 6463 H4K20
chr1 6463 81263 Quies
chr1 81263 81463 Ctcf
chr1 81463 123063 Quies
chr1 123063 123463 DnaseU
chr1 123463 128863 Low
where the last column is the name given to that particular chromHMM segmentation. We also have another file with about 2000 top-secret intervals: intervals.bed and a domain file that lists the possible locations where those intervals could be: charm.domain.bed. We could use --include to specify that, but it's faster to use --exclude. So we generate the complement with bedtools:
bedtools complement -i charm.domain.bed -g hg18.genome > charm.domain.complement.bed
then our call to shuffler looks like this (annotated):
python -m shuffler \
-a intervals.bed \ # 2000 intervals of interest
--bs h1hesc.ChromHMM.hg18.bed \ # last column indicates segmentation
-g hg18.genome \ # genome file
-n 250 \ # number of shufflings
--seed 42 \ # seed for reproducibility
-t 14 \ # use 14 threads
--png shuffler.hmm.png \ # save image
--exclude data/charm.domain.complement.bed # don't shuffle inside these
and the output has one line for each unique value in the final column of h1hesc.ChromHMM.hg18.bed:
group intersection jaccard n_intersections
Art 0.3 0.3 0.156
Ctcf 0.84 0.84 0.84
CtcfO 0.716 0.716 0.36
DnaseD 0.112 0.112 0.044
DnaseU 0 0 0
Elon 0.984 0.984 1
ElonW 1 1 1
Enh 0 0 0
EnhF 0 0 0
EnhW 0 0 0
EnhWF 0 0 0
FaireW 0.916 0.916 0.98
Gen3' 1 1 1
Gen5' 0 0 0
H4K20 0 0 0
Low 1 1 1
Pol2 1 1 1
PromF 0 0 0
PromP 0.292 0.288 0
Quies 0.884 0.884 0.688
Repr 0.012 0.012 0
ReprD 0 0 0
ReprW 1 1 0.916
Tss 1 1 1
TssF 0 0 0.016
where each column is a "metric". For these, we are most interested in the first numeric column--which shows the proportion of times a shuffled dataset had a higher overlap than the observed. We can see things like Enhancer have a low probability of occurring by chance while TSS has a p-value of 1. We can confer with a biologist to decide what this means. In other cases, we may wish to look at the jaccard index which is the ratio of intersection to union or the number of (unique) intersections.
Since we specified --png, we get this image:
Where the blue line indicates the observed intersection, number of intersections or jaccard, respectively and the green shows the distribution of simulations. Remember we got all this from a single command:
python -m shuffler -a intervals.bed --bs h1hesc.ChromHMM.hg18.bed \
-g hg18.genome -n 250 --seed 42 -t 14 --png shuffler.hmm.png \
--exclude data/charm.domain.complement.bed
which ran 250 shufflings for each category from the 2.8 million chromHMM segementations in about 5 minutes.
todo
see genome-wide plot (fig 2) and bi-plot (fig6) in zhang et al: "Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions" http://genome.cshlp.org/content/17/6/787.full.pdf
clustered-ness of regions within a single file
segway/chrom hmm segmentatnois: http://www.nature.com/nmeth/journal/v9/n5/full/nmeth.1937.html