PARpipe

Complete analysis pipeline for PAR-CLIP data

PARpipe is a complete analysis pipeline for PAR-CLIP data providing the following features:
    Pre-processing and alignment of reads
    Definition of interaction sites using PARalyzer v1.5
    Additional site-level metrics
    Annotation of reads, groups, and clusters
    Meta-analysis of binding sites relative to important transcript features

PARpipe is hosted on GitHub and on the Ohler lab website:
https://github.com/ohlerlab/PARpipe
https://ohlerlab.mdc-berlin.de/software/PARpipe_119/

We provide support for both human and mouse datasets, and the pipeline can be modified for other organisms.

The GitHub repository includes PARalyzer v1.5.
We provide a script, "setup.sh", in the GitHub repository that automatically downloads all necessary accessory files for the human hg19 and/or mouse mm10 genomes and puts them in their proper locations. These include bowtie genome indices, .2bit genomes, and annotation (gencode and repeat)...
Alternatively, users can prepare their own files (see "GTF Requirements" below).

PARpipe has been tested on Linux systems. There may be difficulties on Mac OS X.

It has the following software dependencies (versions known to be compatible are given in parentheses). Users must either add each executable to their path or specify its path at the top of the parclip_pipe.sh script.
Bpipe (0.9.8.5)
Bowtie (0.12.7)
SAMtools (0.1.17)
BEDtools (v2.20.1)
cutadapt (1.3) 

-Users may need to modify the adapter sequence in parclip_pipe.sh as appropriate.
-A wrapper script, wrap.pl, is provided to specify the adapter sequence on the fly; feedback on whether it works for you is welcome.
-The necessary R and Perl libraries come packaged with PARpipe. Users may need to modify the PERL5LIB path if the needed libraries are not already installed locally.
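
Before running the pipeline, it can help to confirm that the dependencies are reachable on the path. The following is a minimal sketch in Python, not part of PARpipe; the executable names in TOOLS are assumptions derived from the tool names listed above, and the check does not verify versions.

```python
import shutil

# Assumed executable names; the list above gives tool names, not binary names.
TOOLS = ["bpipe", "bowtie", "samtools", "bedtools", "cutadapt"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be located on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed dependencies were found on PATH.")
```

Any tool reported as missing would need to be added to the path or have its location set at the top of the pipeline script, as described above.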


QUICK START (this example installs PARpipe in the home directory)

cd

git clone https://github.com/ohlerlab/PARpipe.git

# All the scripts and libraries are in the PARpipe/scripts/ directory
cd PARpipe

# To download all files needed to analyze human (hg19) and/or mouse (mm10) data (the only genomes supported at the moment):
# bash setup.sh -s <h|m|b>
# h = human, m = mouse, b = human and mouse
bash setup.sh -s h

# To test PARpipe, go to the PARpipe/test directory
cd test
bpipe run -r ../parclip_pipe.sh test.fastq

# To remove unnecessary intermediates:
bpipe preserve *.csv *.bed *.distribution *.bam *.bam.bai *.clusters.txt *ini *pdf
bpipe cleanup -y


Input file:
.fastq: this should be a demultiplexed PAR-CLIP library

Output Files:
.bam: compressed SAM file with genomic alignments
.bam.bai: index of the sorted BAM file
.clusters.bed: BED file of cluster locations
.clusters.csv: statistics for each cluster found
.clusters.txt: summary information for reads, groups, clusters, and initial processing
.distribution: for each cluster, nucleotide-resolution information on T-to-C conversion signal, background, percent, and read count
.fastq.processing: cutadapt output with filtering information
.gene_cl/gr.csv: gene-level information by cluster/group
.groups.csv: statistics for each group found
.ini: parameters used by PARalyzer
Spatial.pdf files: pdf outputs representing binding behavior by annotation category

Output Statistics in groups/clusters.csv files:
Aligned to: annotation category
ID: IDs specific to each read/group/cluster
ModeLocation: location with the most T-to-C conversions
ModeScore: for above location, score of signal to background
ConversionLocationCount: number of different locations in the sequence with T-to-C conversions
ConversionEventCount: number of T-to-C conversions
NonConversionEventCount: number of Ts that did not convert to Cs
T2Cfraction: number of reads with T-to-C conversions / number of reads
ConversionSpecificity: log(number of reads with T-to-C conversions / number of reads with other conversions)
endG_fraction: fraction of reads that ended in G
RedundantSeqFraction: fraction of distinct reads with more than one copy
RedundantCopyFraction: fraction of all reads with more than one copy
UniqueReads: number of reads that have only one copy
None: number of reads with no conversions
Other_1: number of reads with a non-T-to-C conversion
T2C_1: number of reads with a T-to-C conversion
Link: chromosome and coordinates for easy entry into visualization programs
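
The two ratio statistics above (T2Cfraction and ConversionSpecificity) can be illustrated with a short Python sketch. This is not PARpipe's own code. The function names are hypothetical, the natural logarithm is an assumption (the definition above says only "log"), and returning None when a count is zero is a choice made here for illustration.

```python
import math


def t2c_fraction(t2c_reads, total_reads):
    """Reads with T-to-C conversions divided by all reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return t2c_reads / total_reads


def conversion_specificity(t2c_reads, other_reads):
    """log(reads with T-to-C conversions / reads with other conversions).

    Natural log is an assumption. The ratio is undefined when either
    count is zero, so None is returned in that case.
    """
    if t2c_reads <= 0 or other_reads <= 0:
        return None
    return math.log(t2c_reads / other_reads)
```

For a made-up site with 120 reads, of which 30 carry a T-to-C conversion and 10 carry some other conversion, t2c_fraction(30, 120) gives 0.25 and conversion_specificity(30, 10) gives ln(3), roughly 1.10.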

Output Statistics in .gene_cl/gr.csv files:
Sum: sum of that statistic over all sites for that gene
Med: median of that statistic for all sites for that gene
5'utr/Intron/Exon/3'utr/Start_codon/Stop_codon: number of sites mapping to that annotation category
Junction: number of sites mapping to a junction between categories (coding-intron, coding-3'utr, etc.)
GeneType: as described in the gene_type category for this gene in the .gtf file used
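
The Sum and Med columns can be pictured as a per-gene aggregation over site-level values. The Python sketch below shows that idea only; it is not how PARpipe computes them. The input layout (a list of dictionaries with a "gene" key) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_by_gene(sites, stat):
    """Collect one statistic per gene and report its sum and median.

    `sites` is an iterable of dicts, each with a hypothetical "gene" key
    and a numeric value stored under the key named by `stat`.
    """
    values = defaultdict(list)
    for site in sites:
        values[site["gene"]].append(float(site[stat]))
    return {gene: {"Sum": sum(v), "Med": median(v)} for gene, v in values.items()}
```

With three made-up sites on gene A whose ConversionEventCount values are 2, 4, and 9, the result for A is a Sum of 15.0 and a Med of 4.0.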



GTF Requirements (if the user wants to create their own GTF)
Must be tab-delimited
The ninth column (the attribute field) must contain the following, not separated by tabs:
gene_type "<gene type>"
transcript_type "<transcript type>"
transcript_id "<transcript ID>"
gene_name "<gene name>"
transcript_status "<transcript status>", which must be KNOWN for the record to be used
The third column must contain the feature type: gene, transcript, exon, start_codon, stop_codon, or UTR
The second column must contain the annotation source
The fourth column must be the start (lower-numbered) position
The fifth column must be the end (higher-numbered) position
The seventh column must be the strand (+/-)
Start and end positions must be inclusive
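
A custom GTF can be checked against these requirements before use. The following Python sketch is an illustration rather than a PARpipe script. It assumes the standard nine-column GTF layout, with the attribute field as the last column and attributes written as key "value" pairs; the function names are hypothetical.

```python
import re

ALLOWED_FEATURES = {"gene", "transcript", "exon", "start_codon", "stop_codon", "UTR"}
REQUIRED_KEYS = ("gene_type", "transcript_type", "transcript_id",
                 "gene_name", "transcript_status")
# Matches attribute pairs written as: key "value"
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def check_gtf_line(line):
    """Return a list of problems found in one GTF record (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return ["expected 9 tab-delimited columns, found %d" % len(fields)]
    source, feature, start, end = fields[1], fields[2], fields[3], fields[4]
    strand, attributes = fields[6], fields[8]
    problems = []
    if not source:
        problems.append("missing annotation source")
    if feature not in ALLOWED_FEATURES:
        problems.append("unexpected feature type: " + feature)
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be integers")
    elif int(start) > int(end):
        problems.append("start must not exceed end")
    if strand not in ("+", "-"):
        problems.append("strand must be + or -")
    found = dict(ATTRIBUTE.findall(attributes))
    for key in REQUIRED_KEYS:
        if key not in found:
            problems.append("missing attribute: " + key)
    return problems


def is_known(line):
    """True if the record's transcript_status attribute is KNOWN."""
    found = dict(ATTRIBUTE.findall(line))
    return found.get("transcript_status") == "KNOWN"
```

Running check_gtf_line over every non-comment line of a file and printing any non-empty result is one way to find records that would not meet the requirements; is_known flags the records that would actually be used.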