PARpipe is a complete analysis pipeline for PAR-CLIP data providing the following features: Pre-processing and alignment of reads Definition of interaction sites using PARalyzer v1.5 (download) Additional site-level metrics Annotation of reads, groups, and cluster Meta-analysis of binding sites relative to important transcript features PARpipe is housed at github and the Ohler lab website: https://github.com/ohlerlab/PARpipe https://ohlerlab.mdc-berlin.de/software/PARpipe_119/ We have provided support for both human and mouse datasets and it can be modified for other organisms. The github repository includes PARalyzerv1.5. We provide an script "setup.sh" in the github repository that will automatically download all necessary accesory files for human hg19 and/or mouse mm10 genomes and put these files in their proper location. These include bowtie genome indices, .2bit genomes, and annotation (gencode and repeat)... Alternatively, the user can prepare their own files (See "GTF Requirements" below). PARpipe has been tested on linux systems. There may be difficulties on mac os/x. It has the following software dependencies (version known to be compatible), for which users must either add the executable to their path or specify the path at the top of the parpipe.sh script. Bpipe (0.9.8.5) Bowtie (0.12.7) SAMtools (0.1.17) BEDtools (v2.20.1) cutadapt (1.3) -Users must either add the executable to their path or specify the path at the top of the parpipe.sh script. -User may need to modify the adapter sequence in parclip_pipe.sh appropriately. -We have a wrapper file to specify this on the fly - let me know if it works for you (it is called wrap.pl). -Necessary R and perl libraries that come packaged with PARpipe. One may need to modify the PERL5LIB path appropriately, if the needed libraries are not already installed locally. QUICK START (in this case it will be installed in the home directory) cd git clone https://github.com/ohlerlab/PARpipe.git # All the scripts and libraries are in the PARpipe/scripts/ directory cd PARpipe # If you want to download all necessary files for analysis of human -hg19 and/or mouse -mm10 data (this is all we support at the moment) # bash setup.sh -s <h|m|b> # h = human, m = mouse, b = human and mouse bash setup.sh -s h # To test PARpipe, go to the PARpipe/test directory cd test bpipe run -r ../parclip_pipe.sh test.fastq #to remove unnecessary intermediates: bpipe preserve *.csv *.bed *.distribution *.bam *.bam.bai *.clusters.txt *ini *pdf bpipe cleanup -y Input file: .fastq: this should be a demultiplexed PAR-CLIP library Output Files: .bam: compressed sam file with genomic alignments .bam.bai: sorted bam indices .clusters.bed: bed file of cluster locations .clusters.csv: statistics for each found cluster .clusters.txt: summary information for reads, groups clusters, and initial processing .distribution: for each cluster, gives nucleotide-resolution information on T-to-C conversion signal, background, and percent, and read count .fastq.processing: cutadapt output of filtering information .gene_cl/gr.csv: gene-level information by cluster/group .groups.csv: statistics for each found group .ini: PARalyzer utilized parameters Spatial.pdf files: pdf outputs representing binding behavior by annotation category Output Statistics in groups/clusters.csv files: Aligned to: annotation category ID: IDs specific to each read/group/cluster ModeLocation: location with the most T-to-C conversions ModeScore: for above location, score of signal to background ConversionLocationCount: number of different location in the sequence with T-to-C conversions ConversionEventCount: number T-to-C conversions NonConversionEventCount: number of Ts that did not convert to Cs T2Cfraction: number of reads with T-to-C conversions / number of reads ConversionSpecificity: log(number of reads with T-to-C conversions / number of reads with other conversions) endG_fraction: fraction of reads that ended in G RedundantSeqFraction: fraction of distinct reads with more than one copy RedundantCopyFraction: fraction of all reads with more than one copy UniqueReads: number of reads that have only one copy None: number of reads with no conversions Other_1: number of reads with a non-T-to-C conversion T2C_1: number of reads with a T-to-C conversion Link: chromosome and coordinates for easy entrance in visualization programs Output Statistics in .gene_cl/gr.csv files: Sum: sum of that statistic over all sites for that gene Med: median of that statistic for all sites for that gene 5'utr/Intron/Exon/3'utr/Start_codon/Stop_codon: number of sites mapping to that annotation category Junction: number of sites mapping to a junction between categories (coding-intron, coding-3'utr, etc.) GeneType: as described in the gene_type category for this gene in the .gtf file used GTF Requirements -> if the user wants to create their own GTF Must be tab-delimited The eighth column must contain, as non-tab-delimited: gene_type "<gene type>" transcript_type "<transcript type>" transcript_id "<transcript ID>" gene_name "<gene name>" transcript_status "<transcript status>", which must be KNOWN to be used third column must contain feature type as gene, transcript, exon, start_codon, stop_codon, UTR second column must contain annotation source fourth column must be start (lower number) position fifth column must be end (higher number) position seventh column must be strand (+/-) start and end locations must be inclusive