EpiSAFARI

Valley patterns, a dip between two summits, are prevalently observed in the genomewide profiles of many functional genomics signals such as ChIP-Seq, WGBS, RepliSeq, etc. The valleys potentially delineate the locations of functionally important elements, e.g. cis-regulatory elements such as enhancers, promoters, etc.

EpiSAFARI is a command line tool for detection, annotation, and characterization of the valley shaped patterns within the functional genomics signal profiles. It takes:

- Signal Profile in bedGraph format,
- Mapped reads in SAM/bowtie/eland/... format.

and outputs:

- The spline smoothed signal profile.
- The valleys, local minima and maxima in the signal profile.
- Gene and transcription factors annotations, mappability values and nucleotide content for the valley.

In addition, EpiSAFARI can compare the valleys detected in 2 samples and identify the "differential valleys". This enables comparing 2 samples generated under different conditions.

The output is formatted as an extended BED file with multiple columns. Please refer below for the specification of output file format.

Download and Installation

You can download EpiSAFARI code by clicking on the green "Clone or Download" button and downloading zip or by checking out via git. After that, navigate to the checkout (or download) directory. If you downloaded the zip file of the source, unzip it using "unzip master.zip".

You need to have g++, gzip, and the GSL libraries installed for building EpiSAFARI. These are installed in most Unix distributions but if they are not installed, type:

sudo yum -y install gsl gsl-devel gcc-c++ gzip

This should install the necessary GSL libraries for building EpiSAFARI correctly.

Now EpiSAFARI can be built using:

make clean
make

The executable is located under directory bin/.

It may be useful to install samtools for processing BAM files.

To get help on which options are available, use:

./bin/EpiSAFARI -help

Parameter Selection and Impact of Parameters on Detected Valleys

EpiSAFARI uses a spline smoothing procedure with several parameters and these parameters can have impact on the sensitivity of the method.

The spline smoothing is based on projecting the epigenetic signal profile on basis splines and removing noise. The basis splines are defined in terms of a spline degree and a set of knots.

We generally observed that overly simple and complex smoothing causes underfitting and overfitting of the data and this decreases sensitivity of valley detection.

Knot Selection

The number of knots, and the placement of knots are important factors while smoothing the signal. We generally observed that increasing knot number above 7 creates overfitting when using windows of length 1000 base pairs.

On the other hand, the using a knot number below 6 seems to decrease sensitivitiy similarly because smoothed signal profile is underfit, i.e., does not represent the original signal well. Thus EpiSAFARI uses 7 knots by default.

The placements of knots along the signal domain is an open problem in spline based smoothing of signals. We compared several knot placement strategies (Uniform, Derivative-based, and Random) and found that uniform knot placement generally works comparable in accuracy as other knot placement strategies.

If it is necessary, the users can change the knot selection parameters while spline smoothing the data (See below)

Spline Degree

The spline degree tunes polynomial complexity of the basis spline curves. We observed that increasing the degree above 5 may cause overfitting and setting it below 3 may cause underfitting. Thus, by default, EpiSAFARI uses degree of 5.

If it is necessary, this parameter while data is being smoothed using "-bspline_encode" option (See below).

Setting the Number of Coefficients with new Spline Degree and Knot Selections

If the spline degree or knot number is changed, the number of coefficients can be computed simply as (number of knots) + (spline degree) - 2. This value should be used as the number of coefficients parameter while spline smoothing the data.

Window length

The window length specifies the length of the window that is used while smoothing the data. Signal in each window is smoothed then concatenated. The large window lengths creates a large signal for smoothing. Therefore the knot numbers must be adjusted with the increasing window length. By fefault, EpiSAFARI's parameters are selections work well with 1000 base pair windows for dense signals. For sparse signals, such as DNA methylation, EpiSAFARI uses 5000 base pair long windows by default.

This parameter can be changed while data is being smoothed.

Hill Score Thresholds

EpiSAFARI reports a hill score between 0 and 1 that is used to measure the topological quality of valleys. Hill score of 1 represents a valley that shows monotonically increasing signal while moving from the valley's dip to the summits.

The reported valleys must be filtered with respect to the reported hill scores. We observed that there is very high enrichment of valleys with hill scores close to 1.0. These valleys represent biologically meaningful valleys. Therefore EpiSAFARI uses hill score threshold of 0.90

If the hill score threshold is decreased, the valley redundancy increases: The fraction of reported valleys with overlaps increase. Depending on the application, this may be a useful and intended behaviour.

Usage Examples

EpiSAFARI run starts with setting up the input files. (Note that we use samtools for converting BAM file to SAM files.). EpiSAFARI can take bedGraph files and mapped reads directly as input. It is necessary to divide the data into chromosomes.

Building input with bedGraph and bigWig files

We show an example from ENCODE project below:

wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneGm12878H3k04me3StdSigV2.bigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
chmod 755 bigWigToBedGraph
./bigWigToBedGraph wgEncodeBroadHistoneGm12878H3k04me3StdSigV2.bigWig wgEncodeBroadHistoneGm12878H3k04me3StdSigV2.bigWig.bgr
mkdir bedGraphs
./bin/EpiSAFARI -separate_bedGraph_2_chromosomes wgEncodeBroadHistoneGm12878H3k04me3StdSigV2.bigWig.bgr bedGraphs

If there are multiple replicates to be pooled, they can be done at once or separately. If done separately, EpiSAFARI pools the bedGraphs automatically and uses the total signal profile in the analyses.

Building input with mapped read files

EpiSAFARI can also process mapped read files, for example in SAM format. We show again an example from the ENCODE Project:

wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneGm12878H3k4me3StdAlnRep1.bam
wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneGm12878H3k4me3StdAlnRep2.bam

rm -f -r processed_reads
mkdir processed_reads
samtools view wgEncodeBroadHistoneGm12878H3k4me3StdAlnRep1.bam | ./bin/EpiSAFARI -preprocess_reads SAM stdin processed_reads
samtools view wgEncodeBroadHistoneGm12878H3k4me3StdAlnRep2.bam | ./bin/EpiSAFARI -preprocess_reads SAM stdin processed_reads

This example pools the 2 replicates of data. If there are more multiple replicates of reads to be pooled, they can be done at once or separately. If done separately, EpiSAFARI pools the reads automatically and uses the total signal profile in the analyses as in the example above.

We strongly recommend removing duplicates from the reads. This decreases valley detection time quite much:

mkdir processed_reads/sorted processed_reads/dedup
./bin/EpiSAFARI -sort_reads processed_reads processed_reads/sorted
./bin/EpiSAFARI -remove_duplicates processed_reads/sorted 2 processed_reads/dedup

Spline Fitting to the Data

After input files are setup, we perform spline fitting to the read coverage signals. The raw data generated by EpiSAFARI is rather large so it is useful to gzip them:

n_spline_coeffs=10
spline_order=5
max_max_err=5
max_avg_err=3
l_win=1000
l_step_win=1000
sparse_data=0
l_post_filter=50
brkpt_type=1
min_POI_distance_in_bps=50

## bedGraph files:
./bin/EpiSAFARI -bspline_encode -signal_dir bedGraphs -n_spline_coeff ${n_spline_coeffs} -bspline_order ${spline_order} -max_max_err ${max_max_err} -max_avg_err ${max_avg_err} -l_win ${l_win} -min_POI_distance ${min_POI_distance_in_bps} -sparse_profile ${sparse_data} -brkpts_type ${brkpt_type} -l_step_win ${l_step_win} -l_post_filt_win ${l_post_filter}

## mapped read files:
./bin/EpiSAFARI -bspline_encode -signal_dir processed_reads/dedup -n_spline_coeff ${n_spline_coeffs} -bspline_order ${spline_order} -max_max_err ${max_max_err} -max_avg_err ${max_avg_err} -l_win ${l_win} -min_POI_distance ${min_POI_distance_in_bps} -sparse_profile ${sparse_data} -brkpts_type ${brkpt_type} -l_step_win ${l_step_win} -l_post_filt_win ${l_post_filter}

n_spline_coeffs controls the number of knots that are used to fit b-spline. It should not be set to a very high value as this may cause overfitting of the data.

Valley Detection

We next perform valley identification. We first download the multi-mappability signal then identify the valleys:

seq_dir=hg19_seq
mmap_dir=hg19_36bp
mkdir hg19_36bp
cd hg19_36bp
wget -c http://archive.gersteinlab.org/proj/MUSIC/multimap_profiles/hg19/hg19_36bp.tar.bz2
tar -xvjf hg19_36bp.tar.bz2
cd ..
mkdir hg19_seq
cd hg19_seq
wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar -xvzf chromFa.tar.gz
../bin/EpiSAFARI -preprocess_FASTA . fa .
cd ..
max_trough_sig=1000
min_summit_sig=5
min_summit2trough_frac=1.2
max_summit2trough_dist=1000
min_summit2trough_dist=0
min_multimapp=1.2
sparse_data=0
pval_type=0

## bedGraph files:
./bin/EpiSAFARI -get_valleys -signal_dir bedGraphs -mmapp_dir ${mmap_dir} -genome_dir ${seq_dir} -max_signal_at_trough ${max_trough_sig} -min_signal_at_summit ${min_summit_sig} -f_min ${min_summit2trough_frac} -l_min ${min_summit2trough_dist} -l_max ${max_summit2trough_dist} -max_mmap ${min_multimapp} -max_qval 0.01 -sparse_profile ${sparse_data} -pval_type ${pval_type}
all_valleys_fp=bedGraphs/significant_valleys.bed.gz

## mapped read files:
./bin/EpiSAFARI -get_valleys -signal_dir processed_reads/dedup -mmapp_dir ${mmap_dir} -genome_dir ${seq_dir} -max_signal_at_trough ${max_trough_sig} -min_signal_at_summit ${min_summit_sig} -f_min ${min_summit2trough_frac} -l_min ${min_summit2trough_dist} -l_max ${max_summit2trough_dist} -max_mmap ${min_multimapp} -max_qval 0.01 -sparse_profile ${sparse_data} -pval_type ${pval_type}
all_valleys_fp=processed_reads/dedup/significant_valleys.bed.gz


# Filter: Remove valleys with FDR higher than log(0.05), hill scores lower than 0.90 and average multi-mappability higher than 1.2.
gzip -cd ${all_valleys_fp} | awk {'if(NR==1){print $0};if($18<-3 && $10>=0.90 && $11>=0.90 && $8<1.2)print $0'} > sign.bed

# Merge valleys with dips closer than 200 base pairs.
./bin/EpiSAFARI -merge_valleys sign.bed 200 merged_sign.bed

Valley Annotation

We finally perform valley annotation. We first download the GENCODE gene annotation gff file:

wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz

l_promoter=1000

## bedGraph files:
./bin/EpiSAFARI -annotate_features merged_sign.bed gencode.v19.annotation.gff3.gz ${l_promoter} annotated_features.bed

Finally, you can also add ENCODE2 transcription factor binding annotations. We have built the GFF file for the uniformly processed peaks of 690 transcription factors from the ENCODE2 cell lines that you can download and use to annotate the features:

wget -c http://harmancilab.org/tools/EpiSAFARI/wgEncodeAwgTfbs.gff.gz
./bin/EpiSAFARI -annotate_features merged_sign.bed wgEncodeAwgTfbs.gff.gz 0 annotated_features.bed

Annotation adds a new annotation column to every entry in the valleys file and automatically updates the header.

Sparse Mode

EpiSAFARI can also process sparse signals. Examples of these include DNA methylation data, which happens only at cytosine nucleotides. For this, run EpiSAFARI with sparse_data=1 to enable sparse signal smoothing. We also turn on post-median filter on.

We demonstrate this option on the DNA methylation data for H1hESC cell line from Roadmap Epigenome Project:

wget -c ftp://ftp.genboree.org/EpigenomeAtlas/Current-Release/experiment-sample/Bisulfite-Seq/H1_Cell_Line/UCSD.H1.Bisulfite-Seq.combined.wig.gz

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes
chmod 755 fetchChromSizes
chmod 755 wigToBigWig
chmod 755 bigWigToBedGraph

gzip -cd UCSD.H1.Bisulfite-Seq.combined.wig.gz > UCSD.H1.Bisulfite-Seq.combined.wig
./fetchChromSizes hg19 > hg19.list
./wigToBigWig UCSD.H1.Bisulfite-Seq.combined.wig hg19.list UCSD.H1.Bisulfite-Seq.combined.wig.bw
./bigWigToBedGraph UCSD.H1.Bisulfite-Seq.combined.wig.bw UCSD.H1.Bisulfite-Seq.combined.wig.bw.bgr

mkdir DNAm_bgrs
./bin/EpiSAFARI -separate_bedGraph_2_chromosomes UCSD.H1.Bisulfite-Seq.combined.wig.bw.bgr DNAm_bgrs

n_spline_coeffs=10
spline_order=5
max_max_err=0.3
max_avg_err=0.3
l_win=5000
sparse_data=1 
l_post_filter=50
brkpt_type=0

./bin/EpiSAFARI -bspline_encode DNAm_bgrs ${n_spline_coeffs} ${spline_order} ${brkpt_type} ${max_max_err} ${max_avg_err} ${l_win} ${sparse_data} ${l_post_filter}
 
# Compute valleys.
seq_dir=hg19_seq
mmap_dir=hg19_36bp
max_trough_sig=1000
min_summit_sig=0.7
min_summit2trough_frac=1.2
max_summit2trough_dist=2000
min_summit2trough_dist=250
min_multimapp=1.2
sparse_data=1
pval_type=0

./bin/EpiSAFARI -get_significant_extrema DNAm_bgrs ${max_trough_sig} ${min_summit_sig} ${min_summit2trough_frac} ${min_summit2trough_dist} ${max_summit2trough_dist} ${mmap_dir} ${min_multimapp} ${seq_dir} 0.1 ${sparse_data} ${pval_type}

# Filter: Remove methyl-valleys with FDR higher than log(0.05), CpG count less than 20, hill score less than 0.99, and GC content less than 0.4.
cat DNAm_bgrs/significant_valleys.bed | awk {'if(NR==1){print $0};if($18<-3 && $16>20 && $10>=0.99 && $11>=0.99 && $8<1.2 && ($13+$14)/($12+$13+$14+$15)>0.4)print $0'} > sign.bed

# Merge the methl-valleys whose minima are within 200 base pairs of each other.
./bin/EpiSAFARI -merge_valleys sign.bed 200 merged_sign.bed

./bin/EpiSAFARI -annotate_features merged_sign.bed wgEncodeAwgTfbs.gff.gz 0 annotated_features.bed

Visualization of the Signal

IGV can be used to visualize the spline coded signals. These are stored as bedGraphs files under data directory.

To visualize the spline smoothed signal profile for chromosome 1, use following:

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
chmod 755 bedGraphToBigWig
gzip -cd bedGraphs/spline_coded_1.bgr.gz > bedGraphs/spline_coded_1.bgr
sed -i 's/chr//g' hg19.list
bedGraphToBigWig bedGraphs/spline_coded_1.bgr hg19.list spline_coded_1.bgr.bw

The bigWig file named spline_coded_1.bgr.bw can be opened in IGV to visualize the spline smoothed signal. An example is shown below:

Output format

EpiSAFARI outputs the identified valleys to a BED file named "significant_valleys.bed".

This is an extended bed file with following columns:

[Chromosome]: Chromosome ID

[Left maxima position]: Position the left maximum of the valley

[Right maxima position]: Position the right maximum of the valley

[Minima position]: Position of the minimum of the valley

[Left maxima signal]: Signal at the left maxima

[Right maxima signal]: Signal at the right maxima

[Minima signal]: Signal at the minima

[Average multi-mappability signal]: Average multi-mappability signal on the valley

[Maximum multi-mappability signal]: MAximum multi-mappability signal on the valley

[Left hill quality]: Fraction of the nucleotides on the left hill

[Right hill quality]: Fraction of the nucleotides on the right hill

[A count]: Count of A nucleotides in the valley

[C count]: Count of C nucleotides in the valley

[G count]: Count of G nucleotides in the valley

[T count]: Count of T nucleotides in the valley

[CpG count]: Count of CpG dinucleotides in the valley
[P-value]: P-value of the valley
[FDR]: False discovery rate at which valley is deemed significant
[Annotation]: Annotated element's name and type of the element (gene, exon, transcript, promoter, TF_peak)

Assigning Valleys to Promoters and Detection of Supervalleys

EpiSAFARI can assign the valleys to the gene promoters to identify a tentative list of supervalleys around gene promoters.

# Download GENCODE Annotations.
wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz

# Parse the promoters as TSS +/- 10kb.
gzip -cd gencode.v19.annotation.gff3.gz | awk 'BEGIN{FS="\t"}{if($3=="gene"){gene_start=$4;if($7=="-"){gene_start=$5;};split($9, arr, ";");for(i=1;i<=length(arr);i++){if (arr[i] ~/gene_name=/){gene_name=arr[i]}};print $1"\t"gene_start-10000"\t"gene_start+10000"\t"gene_name"\t.\t"$7}}' > promoters.bed

# Assign the valleys to the gene promoters.
./bin/EpiSAFARI -assign_valleys_2_regions promoters.bed merged_sign.bed valleys_2_promoters.bed

# Sort the promoters with respect to number of valleys around promoter and get a list of the gene symbols.
sort -n -k7,7 valleys_2_promoters.bed -r | head -n 200 | awk {'print $4'} | sort -u > genes_with_supervalleys.list

Above code assigns the valleys to the promoters. The output file 'valleys_2_promoters.bed' contains the number (and the list) of valleys around the promoters of all the genes.

Detection of Differential Valleys

When there are multiple samples (different cell lines, samples generated under different conditions), an important downstream analysis is comparison of the valleys detected in these samples.

EpiSAFARI can be used to compare the valleys in the two samples. For this, use the 2 sample comparison option. Below, we assume the GM12878 and K562 valleys are detected using above commands. We perform valley comparison as below:

# Compare samples.
sample1_signal_dir=./GM12878/processed_reads/dedup
sample1_valleys_fp=./GM12878/merged_sign.bed 
sample2_signal_dir=./K562/processed_reads/dedup
sample2_valleys_fp=./K562/merged_sign.bed 
./bin/EpiSAFARI -get_2_sample_differential_valleys ${sample1_valleys_fp} ${sample1_signal_dir} ${sample2_valleys_fp} ${sample2_signal_dir} 0 2

# Now we filter out the valleys with respect to p-value.
awk '{if(NR==1)print $0;if(NR>1 && $17<-10 && $18>-2){print $0}}' 2_sample_differential_stats.txt > gm12878_specific_valleys.bed
awk '{if(NR==1)print $0;if(NR>1 && $18<-10 && $17>-2){print $0}}' 2_sample_differential_stats.txt > k562_specific_valleys.bed

File named '2_sample_differential_stats.txt' contains the differential valley statistics in all the valleys pooled from GM12878 and K562 cell lines. This file is a tab-delimited file that contains significance and signals around all the valleys from the two samples. This file can be processed as above to filter valleys or it can be loaded into R and processed as a data frame.

EpiSAFARI can be used to add annotations to the differential valleys ('2_sample_differential_stats.txt' file) using '-annotate_features' option:

./bin/EpiSAFARI -annotate_features 2_sample_differential_stats.txt gencode.v19.annotation.gff3.gz ${l_promoter} annotated_2_sample_differential_stats.txt

harmancilab/EpiSAFARI