/ChIA-DropBox

add support to SRRxxx fastq.

Primary LanguagePython

ChIA-DropBox

---- a novel analysis and visualization pipeline for multiplex chromatin interactions

  • Coding by Simon Zhongyuan Tian (IN ADDITION: Interface from ChIA-Drop data to Juicer box 2D heatmap was coding by Byoungkoo Lee; Interface from ChIA-Drop data to ChIA-PIPE loop cluster was coding by Daniel Capurso) *

Recently, we developed ChIA-Drop1, a novel experimental method for detecting multiplex chromatin interactions with single-molecule precision via droplet-based and barcode-linked sequencing. ChIA-DropBox2 a novel toolkit for analyzing and visualizing multiplex chromatin interactions, which includes: a ChIA-DropBox data processing pipeline3 and a visualizing tool ChIA-View4. Here we also supply tools for importing SPRITE5 and GAM6 data into ChIA-DropBox pipeline to see library quality and view data using ChIA-View (in MCP.transformate_SPRITE_to_ChIA-DropBox and MCP.transformate_GAM_to_ChIA-DropBox directories). We also give a interface to transmit ChIA-Drop pairewise contact file to call loop clustering in ChIA-PIPE7 pipeline.

1 Meizhen Zheng, Simon Zhongyuan Tian, Daniel Capurso, Minji Kim, Rahul Maurya, Byoungkoo Lee, Emaly Piecuch et al. "Multiplex chromatin interactions with single-molecule precision." Nature volume 566, pages558–562(2019).

2 Simon Zhongyuan Tian, Daniel Capurso, Minji Kim, Byoungkoo Lee, Meizhen Zheng, Yijun Ruan. "ChIA-DropBox: a novel analysis and visualization pipeline for multiplex chromatin interactions" bioRxiv (2019): 613034.

3 https://github.com/TheJacksonLaboratory/ChIA-DropBox.git

4 https://github.com/TheJacksonLaboratory/ChIA-view.git

5 Quinodoz S A, Ollikainen N, Tabak B, et al. Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus[J]. Cell, 2018, 174(3): 744-757. e24.

6 Beagrie R A, Scialdone A, Schueler M, et al. Complex multi-enhancer contacts captured by genome architecture mapping[J]. Nature, 2017, 543(7646): 519.

7 Capurso D, Wang J, Tian S Z, et al. ChIA-PIPE: A fully automated pipeline for ChIA-PET data analysis and visualization[J]. bioRxiv, 2018: 506683.

This pipeline was developed and executed in Centos 6.5 of HPC (high performance computing).

0 Data source

1 Prepare materials for longranger

1.1 Install longranger pipeline

Download 10X Genomeics longranger pipeline, and then install it according to its Instruction. We have tested longranger v2.1.x, which could be requist from 10X Genomeics.

1.2 Prepare reference genome

Before running ChIA-dropbox pipeline, we need to prepare genome reference according to 10X Genomics guide. Current ChIA-DropBox can be used for these version genomes: dm3, dm6, hg19, hg38, mm9, mm10.

mkdir reference_genome/
cd reference_genome/
ls 
refdata-dm3/
refdata-hg19-2.1.0/
refdata-GRCh38-2.1.0/
...

1.3 Prepare FASTQ files

For users, we suggest 3 data soures. Testing data: (Miseq) are some small size data for pipeline testing. ChIA-Drop Production data: are data releaseb by our Natuure paper. User Data: are data generageted by user-self.

1.3.1 Testing data (Miseq)

RNAP2 ChIA-Drop: SHG0051 & ChIA-Drop: SHG0061 & Pure-DNA ChIA-Drop: SHG0041

are aviable @ https://www.dropbox.com/sh/klb1hwprl66qced/AAAGEE9e5SumVEK31jCN4NTKa?dl=0

1.3.2 GSE109355 Production data (Nextseq/Hiseq)

Please rename the SRRxxx FASTQs to 10X longranger needed name by using: MCP.PRE01.prepare_data_SRR.sh

e.g. RNAP2 ChIA-Drop: SHG0051H @ SRR7722067

MCP.FASTQ/
mv SRR7722067.1.fastq SHG0051H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq.gz
mv SRR7722067.2.fastq SHG0051H_GT18-08809_SI-GA-B5_S20_L004_R2_001.fastq.gz

ChIA-Drop: SHG0061H @ SRR7722059

cd MCP.FASTQ/
mv SRR7722059.1.fastq SHG0061H_GT18-08817_SI-GA-B10_S28_L006_R1_001.fastq.gz
mv SRR7722059.1.fastq SHG0061H_GT18-08817_SI-GA-B10_S28_L006_R2_001.fastq.gz

1.3.3 User Self generate fastqs

If user want to generate fastq files by yourself, please use longranger mkfastq.

MCP.FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz
MCP.FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz

1.4 Loading modules in HPC and Install dependent packages.

Following tools and packages are necessary for ChIA-DropBox data processing pipeline.

module load longranger/2.1.5 (or 2.1.6);
module load python/2.7.13 (packages: os, sys, subprocess, operator, collections,time, pysam, pybedtools, glob)
module load samtools/1.8;
module load java/1.8.0
module load R/3.4.4 (packages: dplyr, ggplot2, gridExtra, scales, sitools)
module load BEDtools/2.26.0
module load ImageMagick/7.0.7-26
module load juicer/1.7.5

# make allies for juicer:
juicer1="/opt/compsci/juicer/1.7.5/CPU/common/"
juicer2="/opt/compsci/juicer/1.7.5/"

2 Run Pipeline

2.1 Name the library

ChIA-dropbox pipeline automaticly processes data accoding to "library name". After you have prepared above praparation, please create a folder in you HPC system. The folder name should stricly following following strategry:

  • If the library was sequenced by Illumina Miseq, the folder named with 3 letters + 4 digits (e.g. SHG0024)

  • If the library was sequenced by Illumina Hiseq, the folder named with 3 letters + 4 digits + "H" (e.g. SHG0024H)

  • If the library was sequenced by Illumina Nextseq, the folder named with 3 letters + 4 digits + "N" (e.g. SHG0024N)

  • If the library was sequenced by Illumina NovaSeq, the folder named with 3 letters + 4 digits + "V" (e.g. SHG0024V)

For example, regarding to the example fastq files above (1.3), we should create a folder named: SHG0055H

2.2 Download ChIA-dropbox pipeline

Download ChIA-dropbox pipeline to you HPC system from github to the library folder created in 2.1 (e.g. copy all files into SHG0055H/)

for example:

git clone https://github.com/TheJacksonLaboratory/ChIA-DropBox.git
ls ChIA-dropbox-v1.0/
mkdir SHG0055H
cd SHG0055H/
cp -rf ../ChIA-dropbox-v1.0/MCP* .

2.3 Setup parameters

2.3.1 Prepare Pipeline

## FASTQ (USING: MCP.PRE02.prepare_MCP.FASTQ.sh)
cd SHG0055H/MCP.FASTQ/

#!! please copy fastq file here : 
cp /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq
cp /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq

## or soft link:
ln -s /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq
ln -s /path-to-fastq/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq

## Reference genome (USING: MCP.PRE03.prepare_MCP.REF.sh)
cd SHG0055H/
ln -s /projects/ruan-lab/Sequence_pool/10X/REFGNM10X/ MCP.REF
## detail see: 1.2 Prepare reference genome 

## Fastq type (input fasteq file type: from GEO SRR type or generate by lingrange mkfastq)
## This script will tansform the name of fastq, which is from SRR compressed.

cd SHG0055H/
sh MCP_RUN00_prepare_data.sh
#This data is download from GEO (SRR...) or NOT? (SRR/NOT):
SRR

ls MCP.FASTQ/*
MCP.FASTQ/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq
MCP.FASTQ/SHG0055H_GT18-08809_SI-GA-B5_S20_L004_R1_001.fastq

2.3.2 Run MCP_RUN01_initiate_parameters.sh

We integerate a script called MCP_RUN01_initiate_parameters.sh, via which we could setup all necessary parameters for this ChIA-dropbox pipeline. We only need to follow the questions and answer each of them acoording to its recommanded answer, see following example please.

$ cd /path-to-data-processing/SHG0055H/

$ sh MCP_RUN01_initiate_parameters.sh

################
Library name: SHG0055H
~~~~~~~~~~
SHG0055H
################
qsub ID: 1055
~~~~~~~~~~
1055
##################
1055
Longranger .mro LIB-ID (must be all digits): 1055
1055
~~~~~~~~~~
1055
##################
folder of fastq: /path-to-data-processing/SHG0055H/FASTQ/
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz
~~~~~~~~~~
/path-to-data-processing/SHG0055H/FASTQ/
SHG0055H_GT18-19704_SI-GA-H11_S22_L001_I1_001.fastq.gz  SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz  SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz
###################
EXAMPLE:  SHG0066_GT18-06593_SI-GA-A1_S5_L001_I1_001.fastq.gz
fastq prefix: SHG0055H_GT18-19704_SI-GA-H11
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R1_001.fastq.gz
/path-to-data-processing/SHG0055H/FASTQ/SHG0055H_GT18-19704_SI-GA-H11_S22_L001_R2_001.fastq.gz
~~~~~~~~~~
SHG0055H_GT18-19704_SI-GA-H11
###################
cpu number required: 16
~~~~~~~~~~
16
###################
memory size required: 150gb
~~~~~~~~~~
150gb
###################
pipeline running time requierd: 25
~~~~~~~~~~
25
###################
fragment extention size (0, 3000, 5000 ...): 3000
~~~~~~~~~~
3000
###################
execute mark: 20190116-142816
~~~~~~~~~~
20190116-142816
###################
cell gender (m=male; f=female) : m
~~~~~~~~~~
m
###################
genome reference (hg19, hg38, mm9, mm10, dm3, dm6, dm3hg19, hg19dm3) : dm3
genome size file: /path-to-data-processing/SHG0055H/MCP.pc//dm3_len_6CHROM.txt
~~~~~~~~~~
/path-to-data-processing/SHG0055H/MCP.pc//dm3_len_6CHROM.txt
###################
/path-to-data-processing/SHG0055H/REF/refdata-dm3
genome reference directory: /path-to-data-processing/SHG0055H/REF/refdata-dm3
~~~~~~~~~~
/path-to-data-processing/SHG0055H/REF/refdata-dm3
###################

After running MCP_RUN01_initiate_parameters.sh, there will occur two new configuration files for this library: MCP.conf and MCP.mro.

2.3.3 Run MCP_RUN02_exec_pipeline_from_to.sh

MCP_RUN02_exec_pipeline_from_to.sh is the script to select pipeline steps and generate qsub file (HPC job file).

$sh MCP_RUN02_exec_pipeline_from_to.sh

CDB0 = STEP0000_MAPPING.run
CDB1 = STEP0001_MKDOO.run
CDB2 = STEP0101_GETBC.run
...
CDB37 = STEP2001_plot.sh
CDB38 = STEP2002_combine_image.run

from
0
TO
38
SHG0055H

B/S (B: big resource is needed; S: small resource is needed)
B

#! /bin/bash
#PBS -l nodes=1:ppn=16,mem=150gb,walltime=25:00:00
#PBS -N  SHG0055H.0.38.dm3
cd $PBS_O_WORKDIR
FROM=0
TO=38

OUTSIDE=`pwd`
source ${OUTSIDE}/MCP.conf
source ${OUTSIDE}/MCP.module
MCPPC.PART.0-38.qsub
qsub  MCPPC.PART.0-38.qsub

This script (MCP_RUN02_exec_pipeline_from_to.sh) will create a qsub file MCPPC.PART.0-38.qsub, which runs the pipeline from very beginning (CDB0 = STEP0000_MAPPING.run) to the end (CDB38 = STEP2002_combine_image.run). We can also manually select steps or step to execute (by select FROM and TO):

$sh MCP_RUN02_exec_pipeline_from_to.sh

CDB0 = STEP0000_MAPPING.run
CDB1 = STEP0001_MKDOO.run
...
CDB37 = STEP2001_plot.sh
CDB38 = STEP2002_combine_image.run

from
12
TO
23
SHG0055H

B/S (Big=this library defined qsub resource needed; Small=default qsub)
S 

#! /bin/bash
#PBS
#PBS -N  SHG0055H.12.23.dm3
cd $PBS_O_WORKDIR
FROM=12
TO=23

OUTSIDE=`pwd`
source ${OUTSIDE}/MCP.conf
source ${OUTSIDE}/MCP.module
MCPPC.PART.12-23.qsub
qsub  MCPPC.PART.12-23.qsub

2.4 Submit pipeline qsub

Using the command generated in 2.3.3, we can submit the job to the HPC:

$ qsub  MCPPC.PART.12-23.qsub

$ qstat -u stian

Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
9093578.helix-master    stian       batch    SHG0055H.12.23.dm   7638     1      1       --   01:00:00 R  00:00:02

$ qsub MCPPC.PART.0-38.qsub

Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
9093579.helix-master    stian       batch    SHG0055H.0.38.dm3    --      1     16     150gb  25:00:00 Q       --

Commonly, for a 10 M reads library, qsub executive time is ~ 5 hours.

  • /path-to-data-processing/SHG0055H/SHG0055H.0-38.20190114-122447.log: recorded all steps log.
  • /path-to-data-processing/SHG0055H/SHG0055H.0.38.dm3.o9090159: reports qsub executive information.

2.5 ChIA-dropbox pipeline output

D12_REPORT/

ChIA-dropbox generates statistices informatin in /path-to-data-processing/SHG0055H/SHG0055H/D12_REPORT/.

$ cat /path-to-data-processing/SHG0055H/SHG0055H/D12_REPORT/SHG0017_sta.csv

SHG0055H	lib
0.010103762232363806	pcr_duplication
8142864	Total_PET_(Paired_end_reads)
2073187	PET_with_BC_(BarCode)
1704288	Uniq_Mappable_reads (R1)
267914	Reads_mapq>=30
97141	Reads_q>=30_&_len>=50bp
92071	Reads_q>=30_&_len>=50bp_&_3'ext500
92071	Reads_q>=30_&_len>=50bp_&_3'ext500_&_MAJOR_CHROM
304151	BC_count_of_PET-w/-BC_(BarCode)
291621	BC_of_Uniq_Mappable_reads_(R1)
43743	BC_of_Reads_q>=30_&_len>=50bp_&_3'ext500_&_MAJOR_CHROM
43743	GEM_TOTAL
41562	GEM_singleFrag
2181	GEM_multiFrags
296	GEM_of_intra_chrom
1885	GEM_of_inter_chrom
574	GEM_of_inter_chrom_divby_chrom_has_multiFrags
5751	GEM_of_inter_chrom_divby_chrom_has_singleFrag
48183	GEM_of_TOTAL_SA_[SingleFrag+IntraFrags]
47313	GEM_with_Frags=1
779	GEM_with_Frags=2
72	GEM_with_Frags=3
14	GEM_with_Frags=4
4	GEM_with_Frags=5
...

D05_TABLE/

ChIA-dropbox prepared multiple formats of output files to meet different requirments.

(1) Per line per GEM
  • .PlinePgemSimp --- Per line per GEM with fragments coodinates.
GEM_ID GEM_coordinate GEM_span Fragment_number List_of_fragment_coordinates(read count/fragment)
SHG0055H-10000046-AAACACCAGTAACGATBX1-FA-1-0 chrX:16274973-16286464 11491 2 chrX:16274973-16275587(0);chrX:16285880-16286464(1)
SHG0055H-10000116-AAACACCGTACAGAGCBX1-FA-1-0 chrX:10798609-18731966 7933357 2 chrX:10798609-10799202(0);chrX:18731338-18731966(1)
(2) Per line per fragment
  • .region(.rds) --- in ChIA-view to see linear alignment and clustering view for multiple fragments. In this file, per line represents per fragment; fragments are of a same complex owning same GEM-ID.
Chrom Start End Frag_num/GEM GEM_ID
chrX 16274973 16275587 2 SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0
  • .region.PEanno(.rds) --- in ChIA-view to see linear alignment and clustering view for multiple fragments, with annoted fragments by promoter or num-promoter.
Chrom Start End Frag_num/GEM GEM_ID PEanno
chrX 16274973 16275587 2 SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0 P
chrX 16285880 16286464 2 SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0 E
  • .gff --- view multiple fragments linearly in IGV as a pysudo gene track (large data may cause IGV slow).
(3) Per line per fragment-pair
  • .hic --- 2D heatmap file for Juicebox

  • .bedpe_PlinePpair --- Convert all fragments of a same GEMcode to pair-end interaction (PET).

Chrom1 Start1 End1 Chrom2 Start2 End2 Frag_num/GEM GEM_ID
chrX 16274973 16275587 chrX 16285880 16286464 2 SHG0055H-1000-10000046-AAACACCAGTAACGATBX1-M02838-FA-1-0
chrX 10798609 10799202 chrX 18731338 18731966 2 SHG0055H-1000-10000116-AAACACCGTACAGAGCBX1-M02838-FA-1-0
(4) Coverage
  • .rcov.bedgraph --- Coverage file generated from quilified reads
  • .fcov.bedgraph --- Coverage file generated from fragments
  • .acov.bedgraph --- Coverage file generated from PET anchors (without remove duplicated anchors).

D20_PLOT/

Here export ChIA-dropbox QC plots:

/path-to-data-processing/SHG0055H/SHG0055H/D20_PLOT/SHG0055H.QCPLOT.AIN1.png

(5) ChIA-DropBox Demo Video

Watch the video