/ChIA-PET2

a versatile and flexible pipeline for analysing different variants of ChIA-PET data

Primary LanguageC++GNU General Public License v3.0GPL-3.0

ChIA-PET2

ChIA-PET2 is a versatile and flexible pipeline for analysing different variants of ChIA-PET data from raw sequencing reads to chromatin loops.

ChIA-PET2 was named not only because it is a tool for ChIA-PET data analysis, but also because it supports at least 2 different ChIA-PET protocols (bridge linker protocol or half-linkers protocol) data, 2 modes of read alignments (short or long read alignment) and 2 dimensional contact map output.

ChIA-PET2 integrates all steps required for ChIA-PET data analysis, including linkers trimming, reads mapping, duplicates removing, peaks calling and chromatin loops calling. It supports different kinds of ChIA-PET data generated from different ChIA-PET protocols. It also provides quality controls for different steps of ChIA-PET analysis. In addition, ChIA-PET2 can use phased genotype data to call allele-specific chromatin interactions. We applied ChIA-PET2 to different ChIA-PET datasets, demonstrating its significant improved performance as well as its ability to easily process ChIA-PET raw data.

Install

ChIA-PET2 could be installed in a linux-like system with OpenMP support. The ChIA-PET2 pipeline requires the following dependencies, which are usually already installed in a bioinformatics cluster.

  • BWA v0.7.10+ : for reads alignment
  • MACS2 v2.1.0+ : for peaks calling
  • samtools v1.3+ : for sam file manipulation
  • bedtools v2.24.0+ : for bed/bedpe file manipulation
  • R with the ggplot2 and VGAM packages : for calling MICC and plotting Quality Control figure

To install ChIA-PET2:

tar -zxvf ChIA-PET2.tar.gz
cd ChIA-PET2
chmod +x bin/ChIA-PET2
make

ChIA-PET2 will be installed in the bin directory in user home (~/bin) by default. It's recommended to set the ~/bin directory to the PATH.

Usage

Just type in ' ChIA-PET2 -h ' for detailed usage.

$ ChIA-PET2 -h
usage : ChIA-PET2 -g genomeindex -b bedtoolsgenome -f fq1 -r fq2 -A linkerA -B linkerB -o OUTdir -n prefixname
Use option -h|--help for more information

ChIA-PET2 0.9.3   2017.11.07
----------------------------
OPTIONS

  -s|--start:     start from which step(1:8): 1:Trim Linkers; 2:Map Reads; 3:Build PETs; 4:Call Peaks; 5:Find Interactions
                  6:Plot QC; 7:Estimate statistical confidence; 8:Phase PETs(optional), default=1
  -g|--genome:    genome index for bwa
  -b|--bedtoolsgenome: chromsomes size file for bedtools
  -f|--forward:   one fastq(.gz) file
  -r|--reverse:   the other fastq(.gz) file
  -A|--linkerA:   one linker sequence, default=GTTGGATAAG
  -B|--linkerB:   the other linker sequence, default=GTTGGAATGT
  -o|--output:    output folder, default=output
  -n|--name:      output prefix name, default=out
  -m|--mode:      0,1,2;  0: A/B linkers; 1: bridge liker; 2: Enzyme site, default=0
  -e|--err:       Maximum mismatches allowed in linker sequence, default=0
  -k|--keepempty: 0,1,2; 0:No linker-empty reads; 1:keep 1 linker-empty read; 2:keep 2 linker-empty reads. default=0
  -t|--thread:    threads to run, default=1
  -d|--short:     short reads (0 or 1), default=0 for reads >70bp. If the read length is about 20bp, set d=1
  -M|--macs2 parameters, default="-q 0.05"
  -Q|--mapq:      mapq cutoff, default=30
  -C|--cutoffPET: PET count cutoff before running MICC, default=2
  -S|--slop:      slop length, default=100
  -E|--extend:    extend length on both sides, default=500
  -l|--length:    min length of reads after linker trimming. default=15
  -P|--phased:    optional phased genotype file: 'chr1\tstart\tend\tA\tC'
  [-h|--help]:    help
  [-v|--version]: version

Notes

  • Suppose the bridge linker is:

      5'end ->
      ACGCGATATCTTATCTGACT
      TGCGCTATAGAATAGACTGA
                  <- 5'end
    

We could set the first N, e.g. 15, bases of both ends (from 5' to 3') as the parameters: "-A ACGCGATATCTTATC -B AGTCAGATAAGATAT"

  • The genome size file (bedtoolsgenome) should be tab delimited and structured as follows:

      For example, Human (hg19):
      chr1  249250621
      chr2  243199373
      ...
      chrM  16571
    

Output

Important result files:

  • prefixname.interactions.intra.bedpe: inra-chromosomal loops (11 columns)
  • prefixname.interactions.inter.bedpe: iner-chromosomal loops (11 columns)
  • prefixname.interactions.MICC: significance estimation for each loops (13 columns)
  • prefixname.QCplot.pdf: Quality control figure for different steps of analysis.

prefixname.interactions.MICC file has 11+2 columns. The last 2 columns ( -log10(1-PostProbability) and FDR ) are estimated by MICC based on a Bayesian mixture model. See the toy example below. This means the chromatin loop between peak_1(chr:9118-10409) with peak depth 3330 and peak_3(chr:89064-90360) with peak depth 3814 has 39 supportive pair-end tags(PETs). -log10(1-PostProbability)=4.5 and FDR=0.03 .

chr start end chr start end peak1 peak2 depth1 depth2 #PET PP FDR
chr1 9118 10409 chr1 89064 90360 peak_1 peak_3 3330 3814 39 4.5 0.03

Other intermediated files include:

  • prefixname_1.valid.sam
  • prefixname_2.valid.sam
  • prefixname.bedpe
  • prefixname.rmdup.bedpe
  • prefixname_peaks.slopPeak
  • prefixname.trim.stat
  • prefixname.bedpe.stat

Toolkit

trimLinker: trim linker sequences in the pair-end fastq(.gz) files. There will be also a summary file named output.trim.stat

$ Usage:   trimLinker [-option] [argument]
option:  -h  show help information
     -t  threads
     -e  number of mismatch allowed in linker, default=0
     -k  keep empty: 0, 1, 2
     -l  min length of trimmed reads: default=15
     -o  output directory
     -m  mode: 0 or 1, A/B linkers(0) or bridge linker(1)
     -A  linkerA
     -B  linkerB
     -n  output name prefix
example: trimLinker -t 4 -o outdir -m 0 -n name -k 0 -A AAAAAAAT -B TTTTACGG fq1.fq.gz fq2.fq.gz

buildBedpe: build bedpe file from two sam files. The read names should be in the same order. keepseq_flag indicates whether to keep sequences in the bedpe file (default=0, no need to keep sequences unless we want to do allele-specific analysis).

$ buldBedpe file1.sam file2.sam output  MAPQ_cutoff thread keepseq_flag

removeDup: remove duplicate PETs with N threads. The bedpe file containing the paired-end tags (PETs) is in the following format.

chr start end chr start end name score strand1 strand2
chr1 9128 9228 chr1 89064 89164 readpair1 . + -

buildTagAlign: build tag file from bedpe file for MACS2 input.

$ buildTagAlign in.bedpe out.tag.bed

bedpe2Interaction: Detect chromatin interactions from bedpe file.

$ pairToBed -a in.bedpe -b peaks.bed -type both > tmp.bedpe
$ bedpe2Interaction tmp.bedpe interaction.out.bedpe out.bedpe.stat

QCplots.R: Plot the Quality Control figure.

$ Rscript QCplots.R directory name

MICC2.R: Call MICC package to estimate the statistically significance of chromatin loops. Default CUTOFFPET=2.

$ Rscript MICC2.R interactions.intra.bedpe interactions.inter.bedpe miccOUT CUTOFFPET
Other parameters can be provided:
$ Rscript MICC2.R interactions.intra.bedpe interactions.inter.bedpe miccOUT 2 6 1e-10

bedpe2Matrix: Generate the Hi-C style matrix. The output matrix is in triplet sparse format, which is compatible with HiCPlotter.

$ bedpe2Matrix --binsize 10000 --chrsizes chrom_hg19.sizes --ifile in.rmDup.bedpe --oprefix PREFIX --progress

bedpe2Phased: Generate the allele-specific PETs. Each line of the phased genotype file phased.bed has 5 columns separated with tabs '\t': "chr start end A C", as below.

chr start end alelle1 allele2
chr1 19118 19119 A C
$ grep -v "*" in.bedpe > bothmapped.bedpe
$ pairToBed -a bothmapped.bedpe -b phased.bed -type either > bedpe.flt.bed
$ bedpe2Phased bedpe.flt.bed outputprefix

Generate a coverage track: We can transform the demo.bedpe.tag.sorted file to bigwig format to get a coverage track, which could be visualized by IGV or UCSC genome browser.

$ bedtools genomecov -i demo.bedpe.tag.sorted -bg -g chrom_hg19.sizes > demo.bedpe.tag.sorted.bedgraph
$ bedGraphToBigWig demo.bedpe.tag.sorted.bedgraph chrom_hg19.sizes demo.bedpe.tag.sorted.bigwig

Citation

Please cite the following article if you use ChIA-PET2 in your research:

  • Li, G., Chen, Y., Snyder, M.P. and Zhang, M.Q. (2016) ChIA-PET2: a versatile and flexible pipeline for ChIA-PET data analysis. Nucleic acids research. doi:10.1093/nar/gkw809

If you use MICC to call significant chromatin loops (step 7 in the pipeline), please cite the following:

  • He C, Zhang MQ, Wang X: MICC: an R package for identifying chromatin interactions from ChIA-PET data. Bioinformatics 2015, 31:3832-3834.

Contact

Author: Guipeng Li

Email: guipeng.lee(AT)gmail.com