Guide to scripts and data for "Sequence characteristics distinguish transcribed enhancers from promoters and predict their breadth of activity"
doi:10.1534/genetics.118.301895
avg_curves.py
averages ROC and PR curves across multiple classifiers; used when a larger set was split into subsets
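
For orientation, one common way to average ROC curves is to interpolate each curve onto a shared false-positive-rate grid and take the mean at each grid point. The sketch below illustrates that idea only and is not taken from avg_curves.py; the function name average_roc, the grid size, and the input layout (a list of (fpr, tpr) array pairs) are assumptions. It is written for Python 3 with NumPy.

```python
import numpy as np


def average_roc(curves, n_points=101):
    """Average ROC curves given as (fpr, tpr) pairs (hypothetical helper).

    Each curve is linearly interpolated onto a shared grid of false positive
    rates, then the true positive rates are averaged point by point.
    """
    grid = np.linspace(0.0, 1.0, n_points)
    interpolated = []
    for fpr, tpr in curves:
        fpr = np.asarray(fpr, dtype=float)
        tpr = np.asarray(tpr, dtype=float)
        # np.interp expects increasing x values, so sort by FPR first.
        order = np.argsort(fpr, kind="mergesort")
        interpolated.append(np.interp(grid, fpr[order], tpr[order]))
    mean_tpr = np.mean(interpolated, axis=0)
    return grid, mean_tpr
```

The same approach could be applied to PR curves by interpolating precision over a shared recall grid, although linear interpolation in PR space is known to be optimistic.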
enh-prom_analyses.ipynb
contains R code for relative ROC calculation (Fig. 3), TF motif analyses (Fig. 5), PCA, and k-mer weights (Fig. 4)
kmer_count.py
counts occurrences of every sequence of length k (k-mer) in a set of genomic regions
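
As a rough illustration of k-mer counting, the sketch below slides a window of length k along each sequence and tallies what it sees. It is not the code in kmer_count.py. It assumes the sequences for the regions have already been extracted as strings, skips windows containing non-ACGT characters, and does not merge reverse complements; all three choices, and the name count_kmers, are assumptions. It is written for Python 3.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(sequences, k):
    """Count every k-mer across a collection of DNA strings (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k; sequences shorter than k yield nothing.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: windows with N or other ambiguity codes are skipped.
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts
```

For example, count_kmers(["ACGTAC", "acgn"], 3) counts ACG twice (once per sequence) and ignores the window CGN.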
set_length.py
sets every region in a BED file to the same length while keeping its center point
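
The resizing step can be pictured as follows: find the midpoint of each region, then place a fixed-length window around it. This is a minimal sketch under stated assumptions, not the contents of set_length.py: the function names are hypothetical, the input is assumed to be tab-separated BED with 0-based starts, the midpoint is rounded down, and a start that would fall below 0 is clamped (which shifts the center for that region). Chromosome ends are not checked. It is written for Python 3.

```python
def set_length(start, end, length=600):
    """Return (start, end) of a window of `length` bp centered on the region."""
    center = (start + end) // 2  # integer midpoint, rounded down
    # Assumption: clamp at 0 so coordinates stay valid; this moves the center
    # for regions closer than length/2 to the chromosome start.
    new_start = max(0, center - length // 2)
    return new_start, new_start + length


def resize_bed_line(line, length=600):
    """Resize one tab-separated BED line, leaving any extra columns untouched."""
    fields = line.rstrip("\n").split("\t")
    new_start, new_end = set_length(int(fields[1]), int(fields[2]), length)
    fields[1], fields[2] = str(new_start), str(new_end)
    return "\t".join(fields)
```

With these definitions, a region chr1:1000-1200 has midpoint 1100 and becomes chr1:800-1400 at the default length of 600.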
all_fantom_enhancers.bed
Broad Enhancers = all with #tiss > 45
Context-Specific Enhancers = random subset of those with #tiss = 1
regions were set to 600bp before use
all_fantom_prom.bed
Broad Promoters = random subset of those with mean_act > 372
Context-Specific Promoters = all with mean_act < 9
regions were set to 600bp before use
roadmap_enhancers_600bp.bed
filtered, set to 600bp
prom_enh_rel_ROC.txt
values for Fig. 3 relative ROCs
roadmap_promoters_600bp.bed
filtered, set to 600bp
tf_motif_specificity.csv
FANTOM TSPS scores, IDs
output and scripts from all SVM classifiers
N.B. classifier script requires Python 2.7.8 and Shogun Machine Learning Toolbox v4.0.0
fantom_enhVSprom/
direct classifiers between enhancers and promoters (Fig. 1)
fantom_enhVsprom_cgiMatched/
direct classifiers between enhancers and promoters, stratified by CGI overlap
broadVSspecific/
classifiers between broad and specific regions (Fig. 2)
cgi_analyses/
stratified by CGI status (Fig. 3)
roadmap_enhVSprom/
direct classifiers between enhancers and promoters (Fig. 6)
tomtom output for top 6-mers in direct classifiers between enhancers and promoters (Fig. 3B)
tomtom output for top 6-mers in other enhancer and promoter classifiers
hocomoco/ (Fig. 5)
jaspar/ (Figs. S11 & S12)
overall broad and narrow tf counts in regions (Fig. 5)