/merge_peaks

Pipeline for using IDR to produce a set of peaks given two replicate eCLIP peaks

Primary LanguageShell

merge_peaks

CWL-defined pipeline for using IDR to produce a set of peaks given two replicate eCLIP peaks

Installation:

  • (For YEOLAB: module load eclipidrmergepeaks/0.1.0)
For all others:

Outline of workflow:

  • Normalize CLIP BAM over INPUT for each replicate (overlap_peakfi_with_bam_PE.cwl)
  • Peak compression/merging on input-normalized peaks for each replicate (compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat_outputfull.cwl)
  • Entropy calculation on CLIP and INPUT read probabilities within each peak for each replicate (make_informationcontent_from_peaks.cwl)
  • Reformat *.full files into *.bed files for each replicate (full_to_bed.cwl)
  • Run IDR on peaks ranked by entropy (idr.cwl)
  • Calculates summary statistics at different IDR cutoffs (parse_idr_peaks.cwl)
  • Normalize CLIP BAM over INPUT using new IDR peak positions (overlap_peakfi_with_bam_PE.cwl)
  • Identifies reproducible peaks within IDR regions (get_reproducing_peaks.cwl)

Usage:

(see the examples/merge_peaks_1input.yaml or examples/merge_peaks_2inputs.yaml manifest file for a full example). Below is a description of all fields required to be filled out in the manifest file:

First, use the example template to fill out the names and paths pertaining to your samples. The shebang "#!" line will depend on your experimental setup (either 2 replicates with 2 corresponding inputs, or 2 replicates normalized over 1 input).

This should match what was used to call CLIPper peaks.

species: hg19

BAM file containing the merged-barcode (read 2 only) PCR-deduped CLIP reads mapping to the genome for Replicate 1. Replace "rep1" with a unique ID for each rep1.

    - name: "rep1"
      ip_bam: 
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF994WPX.r2.bam

BAM file containing the merged-barcode (read 2 only) PCR-deduped INPUT reads mapping to the genome for Replicate 1.

      input_bam:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF590UCY.r2.bam

BED file containing the called peak clusters for Replicate 1 Output from either CLIPPER or input-normed peaks. This pipeline will perform input norm internally for you, so it won't really matter which file you use.

      peak_clusters:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF639MYI.bed6

BAM file containing the merged-barcode (read 2 only) PCR-deduped CLIP reads mapping to the genome for Replicate 2. Replace "rep2" with a unique ID for each rep2.

    - name: "rep2"
      ip_bam: 
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF154BQS.r2.bam

BAM file containing the merged-barcode (read 2 only) PCR-deduped INPUT reads mapping to the genome for Replicate 2.

      input_bam:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF590UCY.r2.bam

BED file containing the called peak clusters for Replicate 2 Output from CLIPPER or input-normed peaks. This pipeline will perform input norm internally for you.

      peak_clusters:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF664WCU.bed6

FINAL OUTPUTS

Merged reproducible peaks are reported as:

rep1.vs.rep2.bed

Where rep1 and rep2 are the user-defined names in the manifest.

To run the workflow:

  • Ensure that the yaml file is accessible and that wf_get_reproducible_eclip_peaks.cwl is in your $PATH.
  • Type: ./merge_peaks_2inputs.yaml

Outputs

  • merged_peaks_bed: this is the BED6 file containing reproducible peaks as determined by entropy-ordered peaks between two replicates.
    • chrom
    • start
    • end
    • minimum of the -log10 p-value between two replicates (coolumn 4)
    • geomean of the log2 fold changes (column 5)
    • strand This is probably what will be useful.
  • *.full files: these tabbed outputs have the following columns (in order):
    • chromosome
    • start
    • end
    • name (colon separated region)
    • reads in CLIP
    • reads in INPUT
    • p-value
    • chi value or (F)isher
    • (F)isher or (C)hi square test
    • enriched or depleted
    • negative log10p value (400 if above certain threshold)
    • log2 fold change
    • entropy
  • idr.out: output from IDR
  • idr.out.bed: output from IDR as a bed file
  • *.custombed: contains individual replicate information. The headers are:
    • IDR region (entire IDR identified reproducible region)
    • peak (reproducible peak region)
    • geomean of the l2fc
    • rep1 log2 fold change
    • rep2 log2 fold change
    • rep1 -log10 pvalue
    • rep2 -log10 pvalue

Notes:

  • The current conda version of perl installed using create_environment_merge_peaks.sh will install perl v5.22.0, which is different from the version tested on TSCC (5.10.1). Since 5.18, there have been slight changes resulting in hash keys being accessed in a non-deterministic way. Installing 5.22.0 will result in minor changes from the reference, but will otherwise give similar outputs. Included is a script run_perlbrew_perl5.10.1.sh which will attempt to install perl 5.10.1, which will give you deterministic results.
  • Custombed files are staged for deprecation, we don't usually use this.