ATACseq, ChIPseq, RNAseq pipeline wrappers and more

This repository has the following combined shell/awk/python/R scripts which can be used for High-throughput sequecning data analysis.

ATACseq.sh: bulk ATACseq pipeline, from fastq to open chromatin regions.
ChIPseq.sh: ChIPseq pipeline, from fastq to peak calling step.
RNAseq.sh: bulk RNAseq pipeline, from fastq to differential expressed genes.
adapt_trim.sh: adapter trimming function, seperated from the above pipelines.
cisVar.sh: pipeline wrapper of cisVar.
GATK_HF.sh: variants calling by GATK, from fastq to vcf.
trans_assemble.sh: de novo transcript assembly, from fastq to GTF.
PLAR.sh: de novo PLAR lncRNA discovery pipeline wrapper.
rRNA_dep.sh: ribosomal RNA depletion from fastq files.

Requirements: Python3, cutadapt, macs2(>=2.1.1), R, DESeq2, featureCounts, bowtie2, bwa,STAR, fastqc, samtools, bedtools, deeptools

ATACseq.sh

This script will QC fastq files and align reads to the reference genome build with Bowtie2, depending on the species selection passed by -g or the index and other required files passed by -i, -b and -c, convert to filtered BAM/BED and bigwig format, then call peaks with MACS2 in BEDPE mode after Tn5 shifting.

Input

Paired-end fastq files with _R1/2 suffix, i.e. test_R1.fastq.gz, test_R2.fastq.gz

Single-end sequencing data is also supported with -s, although it is not recommended.

Options

help message can be shown by ATACseq.sh -h

Usage: ATAC.sh <options> -g <hg38|hg19|mm10> <reads1>|..<reads2> 
    Options:
      -g [str] Genome build selection <hg38|hg19|mm10>
      -i [str] Custom bowtie2 index PATH
      -b [str] Custom blacklist PATH
      -c [str] Genome size abbr supported by MACS2
      -p [str] Prefix of output
      -t [int] Threads (1 default)
      -s Single-end mod (DO NOT recommend, Paired-end default)
      -h Print this help messagee

Example

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/ATACseq.sh
chmod 755 ATACseq.sh
./ATACseq.sh -g hg19 -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

Alternatively, you may use your custom Bowtie2 genome index, open chromatin blacklist:

./ATACseq.sh -i /path/to/Bowtie2/index -b /path/to/blacklist -c dm -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

Output

All results will be store in current (./) directory.

test_trimmed_R1/2.fastq.gz: adapter trimmed fastq files.
test_mkdup.bam: all alignments, with duplicates marked.
test_filtered.bam: useful filtered alignments; duplicates, unpaired, unmapped, low-quality, secondary, chrM reads removed.
test_se.bed: useful filtered alignments in BED format.
test_pe.bed: useful filtered alignments in BEDPE format, the 2nd and 3rd columns indicate the fragment start and end coordinates on genome.
test.bw: bigwig file converted from test_se.bed, can be upload to genome browser for visualization.
test_shift.bed: Tn5 shifted BEDPE format, it will be used for macs2 peak calling.
macs2: output of macs2, see here. Only broad peaks will be called by default. In addition, test_broad_filtered.bed is the peaks file with blacklist filtered.
fastqc: the report(s) of fastqc
logs: running logs

ChIPseq.sh

This script will QC fastq files and align reads to reference genome with BWA, depending on the species selection passed by -g or the index passed by -i, convert to filtered BAM/BED and bigwig format but DOES NOT call peaks.

Input

Paired-end fastq files with _R1/2 suffix, i.e. test_R1.fastq.gz, test_R2.fastq.gz Or single-end fastq file with -s.

Options

help message can be shown by ChIPseq.sh -h

Usage: ChIPseq.sh <options> -g <hg38|hg19|mm10> <reads1>|..<reads2> 
    Options:
      -g [str] Genome build selection <hg38|hg19|mm10>
      -i [str] Custom BWA index PATH
      -p [str] Prefix of output
      -t [int] Threads (1 default)
      -s Single-end mod (Paired-end default)
      -a Use BWA aln algorithm (BWA mem default)
      -h Print this help message

Example

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/ChIPseq.sh
chmod 755 ChIPseq.sh
./ChIPseq.sh -g hg19 -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

Alternatively, you may use your custom BWA genome index :

./ChIPseq.sh -i /path/to/BWA/index -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

Output

All results will be store in current (./) directory.

test_trimmed[_R1/2].fastq.gz: adapter trimmed fastq files.
test_mkdup.bam: all alignments, with duplicates marked.
test_filtered.bam: useful filtered alignments; duplicates, unpaired, unmapped, low-quality, secondary, chrM reads removed.
test_se.bed: useful filtered alignments in BED format.
test_pe.bed: useful filtered alignments in BEDPE format, the 2nd and 3rd columns indicate the fragment start and end coordinates on genome.
test.bw: bigwig file converted from test_se.bed, can be upload to genome browser for visualization.
fastqc: the report(s) of fastqc
logs: running logs

Peak calling

NOTE: this pipeline does NOT call peaks, you might want to run it manually. Input is highly recommended for peak calling, put input fastq files through this pipeline with same parameter(s).

test_pe.bed (and input_pe.bed) can be used for macs2 peak calling in BEDPE mode:

macs2 callpeaks -t test_pe.bed -c input_pe.bed -f BEDPE -g hs -n test

--broad is recommended for histone modifications.

test_se.bed and test_filtered.bam can also be used in BED or BAM mode of macs2.

See more about MACS2 (for TFs peak calling) and SICER or SICERpy (for Histone Mods peak calling).

RNAseq.sh

This script will QC fastq files and align reads to the reference genome and transcriptome with STAR, depending on the species selection passed by -s or the index and GTF passed by -i and -g, featureCounts and DESeq2 will be used for reads counting and differential expressed genes discovery,

Input

Paired-end fastq files with _R1/2.fastq.gz suffix all together in a directory, i.e.

Single-end not supported

ls -1 ./
cond1_rep1_R1.fastq.gz 
cond1_rep1_R2.fastq.gz 
cond1_rep2_R1.fastq.gz 
cond1_rep2_R2.fastq.gz
cond2_rep1_R1.fastq.gz 
cond2_rep1_R2.fastq.gz 
cond2_rep2_R1.fastq.gz 
cond2_rep2_R2.fastq.gz
....

And a text file with meta information. i.e.

sample  condition
cond1_rep1  group1
cond1_rep2  group1
cond2_rep1  group2
cond2_rep2  group2
....

You can use the script to scan fastq files and generate the text file:

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/RNAseq.sh
chmod 755 RNAseq.sh 
./RNAseq.sh -p /path/to/directory/contains/fastq/

Then the meta.txt will be created and opened with VIM. Sample column (1st) should have been filled, edit the text by adding the group information on the 2nd column.

NOTE: Provide the PATH of the DIRECTORY which contains fastq to the scripts, DO NOT give the path of fastq files directly!

Options

help message can be shown by RNAseq.sh -h

Usage: RNAseq.sh <options> -c conditions.txt /PATH/to/directoy/contains/fastq/ 
    Options:
      -m [str] /PATH/to/meta.txt
      -s [str] species <hg|mm> hg=hg38, mm=mm10
      -i [str] Custom STAR index PATH
      -g [str] Custom Reference GTF transcripts PATH
      -t [int] Threads (1 default)
      -p prepare meta.txt
      -n Nextera adapters (Truseq default)
      -h Print this help message

Example

./RNAseq.sh -s hg -m meta.txt -t 24 /path/to/directory/contains/fastq/

Alternatively, you may use your custom genome and transcriptome reference:

./RNAseq.sh -i /path/to/STAR/index -g /path/to/GTF -m meta.txt -t 24 /path/to/directory/contains/fastq/

Output

All results will be store in current (./) directory.

TRIMMED/{prefix}_R1/2_trimmed.gz: adapter trimmed fastq files.
BAM/{prefix}.bam: STAR output, accepted alignments.
BAM/SJ.out.tab: STAR output, splice junctions.
featureCounts.txt: featureCounts output, raw fragments count.
all_genes_exp.txt: size-factor normalized gene expression levels with P values.
logs: running logs and fastqc reports.

NOTE: Sample names in meta.txt have to match the featureCounts output exactly, check your text or use this script to create it. This script cannot automatically run the DEG discovery pair-wisely if you have >2 groups. Either edit deseq.r or analyze it manually in R. A online tool might be useful: iDEP.

adapt_trim.sh

This script is separated from ChIPseq.sh, it trims adapter sequences from fastq files with cutadapt@python3.

Input

Paired-end fastq files or single-end with -s, i.e. test_R1.fastq.gz test_R2.fastq.gz

Options

help message can be shown by adapt_trim.sh -h

Usage: adapt_trim.sh <options> <reads1>|..<reads2
    Options:
    	-p Prefix of output
      -t Threads (1 default)
      -s Single-end mod (Paired-end default)
      -n Nextera adapters (Truseq default)
      -h Print this help message

Example

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/adapt_trim.sh
chmod 755 adapt_trim.sh 
./adapt_trim.sh -p test -t 24 test_R1.fastq.gz test_R2.fastq.gz

NOTE: multi-threads support only works with python3>=3.4, multiprocessing>=0.70, cutadapt>=1.15 and pigz

Output

All results will be store in current (./) directory.

test_R1/2_trimmed.gz: adapter trimmed fastq files.

trans_assemble.sh

This script QC fastq files and aligns reads to hg19/GRCh37(depends on the index and GTF provided) using HISAT2. De novo transcripts assembly will be performed by stingtie.

Input

Paired-end fastq files.

Usage

./trans_assemble.sh <reads1> <reads2> <prefix of output> <starnd: fr|rf|un>

NOTE: Edit the script and mofiy $threads, $index, $gtf.

Output

All results will be store in current (./) directory.

{prefix}.bam: sorted accepted alignments.
{prefix}.gtf: de novo transcripts assembled with reference GTF guiding.
logs: running logs.

cisVar.sh

This script is a wrapper of cisVar.

Input

Output of hornet, sorted and indexed bam file.

Usage

./cisVar.sh hornet.bam <read_depth> <indivdual files>

NOTE: Edit the script and mofiy $vcf PATH to your SNP vcf files.

See more about cisVar.

Output

All results will be store in current (./) directory.

{prefix}.{read_depth}.final.txt: mian regression output.
{prefix}.${DEP}.total.txt.prepost.png: desity plot of the regression output.

PLAR.sh

This script is a re-written wrapper of PLAR.

Requirements stringtie, cuffdiff, plar, CPC2 and HMMER

Input

BAM files of aligned RNAseq data, and coresponding GTFs.

Usage

./PLAR.sh hornet.bam <output dir> <prefix_sample1,prefix_sample2...> <strand:rf|fr|un> <sample1_rep1.gtf> <sample1_rep2.gtf> <sample2_rep1.gtf> ... <sample1_rep1.bam,sample1_rep2.bam,sample2_rep1..>

NOTE: Edit the script and modify $plar_path, $cpc2_path. Additional annotation files required in $plar_path, see PLAR for more details.

Output

All results will be store in current (./) directory.

final_lncRNA.bed: filtered and sorted lncRNAs in BED12 format.

rRNA_dep.sh

This script removes ribosomal RNA reads from fastq files by mapping them to rRNA genes and retrieving unmmaped reads.

Input

Paired-end fastq files.

Usage

./rRNA_dep.sh <reads1> <reads2> <prefix of output>

NOTE: Edit the script and mofiy $threads, $genome, $gtf.

Output

All results will be store in current (./) directory.

{prefix}_dep_R1/2_fq.gz: rRNA removed fastq files.
{prefix}_rRNA.log: mapping log.

GATK_HF.sh

This script is a wrapper of variants calling by GATK.

Input

Paired-end fastq files.

Usage

./GATK_HF.sh fastq1 fastq2

NOTE: Edit the script and mofiy $hg19, $picard, $gatk, $gatk_bundle_hg19, $gatk_ref_hg19 PATH.

See more about GATK Best Practices.

Output

All results will be store in current (./) directory.

GVCF and VCF

Author @zhaoshuoxp
April 26 2019

athulmenon/Pipelines-Wrappers

ATACseq, ChIPseq, RNAseq pipeline wrappers and more

ATACseq.sh

Input

Options

Example

Output

ChIPseq.sh

Input

Options

Example

Output

Peak calling

RNAseq.sh

Input

Options

Example

Output

adapt_trim.sh

Input

Options

Example

Output

trans_assemble.sh

Input

Usage

Output

cisVar.sh

Input

Usage

Output

PLAR.sh

Input

Usage

Output

rRNA_dep.sh

Input

Usage

Output

GATK_HF.sh

Input

Usage

Output