RIBOSOME PROFILING ANALYSIS FRAMEWORK; de Klerk E., Fokkema I.F.A.C et al (2015). FOOTPRINT ALIGNMENT, TRIPLET PERIODICITY ANALYSIS, PEAK CALLING. ================================================================================ This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. The latest version of the scripts and this README file are available at: http://lumc.github.io/ribosome-profiling-analysis-framework/ ================================================================================ WORKFLOW ================================================================================ This framework has been developed on Linux and tested on various Linux systems. It may run on Windows, but this is not supported. Several provided examples for running the scripts on a large number of files are written in Bash and will therefore not run on Windows without additional software, like Cygwin. This framework requires the installation of PHP-CLI (Command Line Interface), available from http://php.net/. This framework has been developed and tested using PHP 5.4 and up, but may work with earlier versions. OVERVIEW OF THE FRAMEWORK -------------------------------------------------------------------------------- - Alignment of ribosome footprint reads - Analysis of read length distribution of ribosome footprints - Creating wiggle files with genomic coordinates from transcriptome alignment - Merging wiggle files from transcriptome and genome alignments - Retrieving coordinates relative to the annotated open reading frames (ORFs) using Mutalyzer - Analysis of triplet periodicity - Merging biological replicates prior to peak calling (identification of translation initiation sites (TISs)) - Peak calling: identification of primary ORFs, alternative ORFs and upstream ORFs - Merging result files - Analysis of switches in TIS usage - TIS location and motif analysis SAMPLE NAMING -------------------------------------------------------------------------------- Throughout the documentation of this framework, we provide example commands. The examples provided here are from the study of de Klerk E., Fokkema I.F.A.C. et al. (2015), in which two different treatments were performed (harringtonin and cycloheximide) in two different timepoints, with three biological replicates. Different treatments or timepoints are named A to Z, with replicates named in numbers (1 to 9). We recommend using similar, or equal, sample naming. This framework will read the sample names from the beginning of the file names, if separated with an underscore or period. Examples of possible FASTQ file names: A1_pool7213_TATAT_L002_R1_001.fastq (sample name: A1) D3.fastq (sample name: D3) I2_ACACAC_L008_R1_001.fastq (sample name: I2) REFERENCE SEQUENCES -------------------------------------------------------------------------------- This framework is based on the Mus Musculus mm10 genome assembly but it can also be adapted for the Homo Sapiens hg19 and hg38 genome assemblies. Other assemblies are currently not supported due to their absence in the Mutalyzer tool, which is used to translate genomic coordinates into transcriptomic coordinates and vice versa. See https://v2.mutalyzer.nl for more information. The transcriptome reference sequence was downloaded from ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.rna.fna.gz Last modified date: 2013-05-08. If mouse.rna.fna.gz is not available, you can download the separate files (named mouse.1.rna.fna.gz, mouse.2.rna.fna.gz and so on) and combine them using bash with: for file in mouse.?.rna.fna.gz; do cat $file >> mouse.rna.fna.gz; done The genome reference sequence was downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/ Last modified date: 2012-02-09. ALIGNMENT OF RIBOSOME FOOTPRINT READS -------------------------------------------------------------------------------- To prevent loss of reads or misalignment of short reads (30 nt) adjacent or spanning an exon-exon junction, reads are first mapped to the transcriptome reference, and unaligned reads are subsequently mapped to the genome reference. For more information about the alignment method, please see de Klerk E., Fokkema I.F.A.C. et al (2015). 1) The SAM file obtained from the transcriptome alignment reports positions of mapped reads relative to the start of the transcript. To be able to merge the results of the genome and the transcriptome alignment these positions need to be converted to genomic positions. Transcriptome sequences do not always properly align to the genome, for instance due to the presence of insertions, deletions, chimeric alignments, or they align to random chromosomes. Because the coordinates of the reads mapped to the transcriptome reference cannot be correctly converted to genomic coordinates for transcripts that do not properly align, first the transcriptome reference needs to be filtered for these transcripts. For the remaining transcripts, the mm10_transcript_positions_create.php script is used to create a file that indicates the location of the exons of the transcripts on the genome. For more information on this script and the alignment, please see the help file: help/mm10_transcript_positions_create.txt. Usage: ./mm10_transcript_positions_create.php TRANSCRIPTOME_ALIGNMENT_SAM_FILE Produces: - mm10_transcript_positions.txt This file contains the genomic coordinates of all exons for transcripts that align to the genome, excluding random chromosomes, without insertions or deletions or chimeric alignment. These can therefore be used for the transcriptome alignment. - transcriptome_alignment_unsupported_transcripts.txt Contains all unsupported transcripts and the reasons for rejection. 2) The unsupported transcripts are then removed from the transcriptome reference, using the following bash code: cut -f 1 transcriptome_alignment_unsupported_transcripts.txt | grep -v '^#' \ > transcriptome_alignment_unsupported_transcripts_list.txt tr '\n' '+' < mouse.rna.fna | sed 's/+>/\n>/g' \ | grep -vFf transcriptome_alignment_unsupported_transcripts_list.txt \ | tr '+' '\n' > mouse.rna.fna.no_unsupported.fa The resulting file (mouse.rna.fna.no_unsupported.fa) is then used to create a Bowtie index file suitable for alignment using bowtie. 3) Before alignment, the adapter sequences need to be removed from the reads. There are various tools available which do this, such as cutadapt. Make sure that after this step, your files are named *.trunc. 4) The following set of commands runs Bowtie to align the reads to the transcriptome reference, store the unaligned reads, and finally align the unaligned reads to the genome reference. For the alignments on the transcriptome, only the forward mapped reads are selected, and then the SAM files are filtered, requiring a minimum of 25 nucleotides per read. SAM files obtained from the alignment on the genome are also filtered, requiring a minimum of 25 nucleotides per read. SAM files from the genome alignment are converted to BAM files, which are then sorted and converted into mpileup files. The mpileup files are then converted to 5' end wiggle files, one for each strand. For the creation of 5' end wiggle files from the genome alignment, use the SageWiggle.py script included in this framework. In the commands below, replace "mouse.rna.fa.no_unsupported.fa.bowtieindex" with the correct location of the bowtie index file of your transcriptome reference sequence file, filtered for unsupported transcripts (see step 2). Replace "mm10.reference.fa.bowtieindex" with the correct location of the bowtie index file of your genome reference sequence file. for file in *.trunc; do bowtie -k 1 -m 20 -n 1 --best --strata -p 8 -S --chunkmbs 2000 --norc --un "${file}.transcriptome_unaligned" mouse.rna.fa.no_unsupported.fa.bowtieindex "$file" "${file}.transcriptome_aligned.sam" 2>> results.transcriptome_alignment.txt; done; for file in *.transcriptome_unaligned; do bowtie -k 1 -m 2 -n 1 -p 8 -S --best --strata --chunkmbs 2000 mm10.reference.fa.bowtieindex "$file" "${file}.genome_aligned.sam" 2>> results.transcriptome_unaligned.genome_alignment_mm10.txt; done; for file in *.transcriptome_unaligned.genome_aligned.sam; do echo $file; head -n 100 "$file" | grep "^@" > "${file}.M25.sam"; awk '($6 > 25M)' "$file" >> "${file}.M25.sam"; done; for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam; do echo $file; samtools view -bS "$file" > "${file}.bam"; done; for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam; do echo $file; samtools sort "$file" "${file}.sorted"; done; for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools index "$file"; done; for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools mpileup -d 10000000 "$file" > "${file}.mpileup"; done; for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam.mpileup; do echo $file; python sageWiggle.py -i "$file" -o "${file}.F.wig5" "${file}.R.wig5"; done; for file in *.transcriptome_aligned.sam; do echo $file; awk '($2 == 0)' "$file" > "${file}.mappedforward.sam"; done; for file in *.transcriptome_aligned.sam.mappedforward.sam; do echo $file; awk '($6 > 25M)' "$file" > "${file}.M25.sam"; done; for file in *.transcriptome_aligned.sam.mappedforward.sam.M25.sam; do echo $file; cut -f 1,3,4 "$file" > "${file}.3col.sam"; done 5) Multiple scripts have transcript names but need either strand or gene information, or vice versa. To provide this, download a file from the UCSC table browser, listing the transcripts, the strand and the genes. To build this list: - Open the UCSC table browser: http://genome.ucsc.edu/cgi-bin/hgTables - Select genome, assembly, track: "RefSeq Genes". - Table: "refGene", output format: "selected fields from primary and related tables". - Click "get output". - On this next page, select "name", "strand" and "name2" in the field list, and click "get output". - Save in a file and sort. On linux, this can be done with the sort command. - In this README, this file is referred to as mm10_gene_list.txt. ANALYSIS OF READ LENGTH DISTRIBUTION OF RIBOSOME FOOTPRINTS -------------------------------------------------------------------------------- 6) The SAM files generated by the alignment of the ribosome footprint reads can be used to retreive information on the read length distribution. A simple bash script that calculates the read length distribution for reads aligning to the transcriptome (NM and NR separately) and reads aligning to the genome, is described in help/read_length_distribution.txt. The read length distribution can also be calculated per gene. For that, use the get_read_length_per_gene.php script. The script takes gene IDs in an input file and determines the read length distribution for its transcripts per sample, from both transcriptome and genome alignment SAM files. It requires the mm10_gene_list.txt file to see which transcripts belongs to which gene, the mm10_transcript_positions.txt file to check the strand and the positions of the transcripts, and the input SAM files. Usage: ./get_read_length_per_gene.php FILE_WITH_GENES_TO_ANALYZE mm10_gene_list \ mm10_transcript_positions.txt SAM_FILE [SAM_FILE [SAM_FILE [...]]] Example: ./get_read_length_per_gene.php genes_to_analyze.txt mm10_gene_list.txt \ mm10_transcript_positions.txt *.M25.sam Produces: - For each transcript of the given genes, one gene_transcript_read_length_distribution.txt file, with the read lengths in the first column, and the number of reads with this length per sample in the next columns. - One summary file, ALL_ALL_read_length_distribution.txt, with all read lengths summed up. This README will continue describing the steps to follow for triplet periodicity analysis and ORF analysis (peak calling), regardless of whether the read length distribution has been analyzed or not. GROUPING READS FROM SAM FILES OBTAINED FROM TRANSCRIPTOME ALIGNMENT AND BASIC STATISTICS -------------------------------------------------------------------------------- 7) Mapped reads in the SAM files are first grouped together to generate a file that sums the coverage per transcript per position. For this, use the pack_sam_files.php file. Usage: ./pack_sam_files.php SAM_FILE [SAM_FILE [SAM_FILE [...]]] Example: ./pack_sam_files.php *.transcriptome_aligned.*3col.sam Produces: - For each given SAM file, a SAM.packed file. Contains the transcript ID, the position relative to the start of the tran- script, and the total coverage for that position, i.e. the number of reads aligned starting at this position. 8) The packed SAM files can be used to run some statistics, using packed_sam2coverage_per_gene.php. Usage: ./packed_sam2coverage_per_gene.php mm10_gene_list.txt \ PACKED_SAM_FILE [PACKED_SAM_FILE [...]] Example: ./packed_sam2coverage_per_gene.php mm10_gene_list.txt *.packed Produces: - One file which name depends on the given SAM file(s), ending in the suffix .coverage_per_gene.txt, containing (per SAM file) which transcripts can not be matched to a gene, listing the top 5 unknown transcripts based on their coverage, and a table containing coverage per gene per sample. The genes are sorted on total coverage in all samples together. Other statistics can be run directly on the packed SAM files, for instance how many reads are mapped to non-coding reads. CREATING WIGGLE FILES WITH GENOMIC COORDINATES FROM TRANSCRIPTOME ALIGNMENT -------------------------------------------------------------------------------- 9) The packed SAM files are used to generate wiggle files. For this, use the packed_sam2wiggle.php script, which needs the transcript positions file created in 1). Usage: ./packed_sam2wiggle.php mm10_transcript_positions.txt \ PACKED_SAM_FILE [PACKED_SAM_FILE [PACKED_SAM_FILE [...]]] Example: ./packed_sam2wiggle.php mm10_transcript_positions.txt *.packed Produces: - 4 wiggle files per SAM file; + F unfiltered, + F filtered (NR and XR removed), + R unfiltered, + R filtered (NR and XR removed). Unfiltered 5' wiggle files are used for visualization purposes only, whereas filtered 5' wiggle files, containing positions of reads mapping only to coding transcripts, are used for triplet periodicity analysis and peak calling. Positions mapping to non-coding transcripts won't be processed with this analysis framework. MERGING WIGGLE FILES FROM TRANSCRIPTOME AND GENOME ALIGNMENTS -------------------------------------------------------------------------------- 10) The merging of the wiggle files from the alignment on the transcriptome and the alignment on the genome is done using the "wiggelen" package written by Martijn Vermaat and Jeroen Laros. The merge_wigglefiles.sh script is a wrapper around the wiggelen package. For detailed information, see the help file help/merge_wigglefiles.txt. Usage: ./merge_wigglefiles.sh (it should be run in the directory where all the wiggle files reside) Produces: - 4 wiggle files per sample, using the transcriptome and genome alignment wiggle files for each created file: + F unfiltered, + F filtered (NR and XR removed), + R unfiltered, + R filtered (NR and XR removed). Unfiltered 5' wiggle files are used for visualization purposes only, whereas filtered 5' wiggle files, containing positions of reads mapping only to coding transcripts, are used for triplet periodicity analysis and peak calling. Positions mapping to non-coding transcripts won't be processed with this analysis framework. RETRIEVING COORDINATES RELATIVE TO THE ANNOTATED OPEN READING FRAMES (ORFs) USING MUTALYZER -------------------------------------------------------------------------------- 11) To translate the genomic coordinates to coordinates relative to the annotated Translation Initiation Site (TIS), we use the Mutalyzer service. Mutalyzer's position converter has a batch-feature that allows for the upload of large files containing genomic postions. Through email you will be notified when your batch job is done, after which you can download the result file. Wiggle files are converted to Mutalyzer position converter batch files using the wig2batchfile.php script. Please note that since Mutalyzer is a service aimed at genetic variation, all positions are converted into genetic variants. Batch files are created for the filtered 5' wiggle files only. Usage: ./wig2batchfile.php WIGGLE_FILE Example: for file in *.filtered.wig5; do ./wig2batchfile.php "$file"; done Produces: - One Mutalyzer batch input file. 12) The position converter batch files are manually loaded into Mutalyzer: https://v2.mutalyzer.nl/batchPositionConverter Do not run more than a few files at a time. It will considerably slow down the process, and will increase the chance of failures, after which you might need to re-upload your file. Wait until Mutalyzer reports that the file has been uploaded before uploading another file. The test installation is slightly faster but there is no guarantee that it is always online. 13) When done, download the result files from Mutalyzer. If you're not sure anymore which results file belongs to which batchfile, you can match these files by using the match_mutalyzer_output_to_batchfile.sh script. Usage: ./match_mutalyzer_output_to_batchfile.sh \ MUTALYZER_RESULTS_FILE [MUTALYZER_RESULTS_FILE [...]] \ BATCH_FILE [BATCH FILE [...]] Example: ./match_mutalyzer_output_to_batchfile.sh batch-job-* *mutalyzer_batchfile.txt This script will rename the Mutalyzer results files according to the batch input files. ANALYSIS OF TRIPLET PERIODICITY -------------------------------------------------------------------------------- 14) The analyze_triplet_periodicity.php script reports the number of reads relative to the annotated start codon. A standard feature of ribosome profiling data is the presence of a major peak at position -12 relative to the annotated start codon (harringtonin-treated samples), and an higher amount of reads whose 5' ends map to the first nucleotide of a codon (cycloheximide-treated samples). In the output file of the analyze_triplet_periodicity.php script, numbers in positions %1, %2 and %3 represent the sum of reads whose 5' end mapped to the first, second and third nucleotide of the annotated codons, respectively. Usage: ./analyze_triplet_periodicity.php MUTALYZER_RESULTS_FILE WIGGLE_FILE \ mm10_gene_list.txt STRAND (Valid values for STRAND: +, -, F and R) Example: ./analyze_triplet_periodicity.php \ A1.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \ A1.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F \ > A1.merged_wiggle.F.filtered.wig5.periodicity.txt Produces: - Direct output to the screen (which should then be saved to a file if needed) with detailed statistics about the mapping of the positions to the transcriptome by Mutalyzer, followed by the list of positions relative to the TIS, and the total coverage for each position. Of the positions in the coding region, only the first 6 are shown individually. All positions in the coding region, including the positions up to 15 bases upstream of the coding region, are grouped by their location in the annotated codons (first, second and third base), and displayed as positions %1, %2 and %3 respectively, with the summed coverages to assess the overall triplet periodicity. Positions starting with an asterisk (*) are positions in the 3' UTR. The given number is the relative distance to the stop codon: *1 is the first base after the stop codon. The script does the following things: + Read out the coverage information from the wiggle file. + Read the gene list to find out which genes (transcripts, actually) are on which strand. + Read the Mutalyzer result file to see which locations map to which transcripts, determining the position within the transcript. * Filter low coverage positions (< 3). * Mappings on non-coding transcripts are ignored. * Mappings on transcripts on the other strand are ignored. * Transcripts not in the gene list are ignored and reported. * Filter positions without transcript mapping. * Filter positions with only intronic mapping. * Filter positions too far from translation start or stop sites (> 500bp). * The remaining possible mappings are processed. Please note that the "coding region" is defined as the region starting at -15 nucleotides from the ATG to the end of the open reading frame. Please note that this script does not see the difference between the UTR and the intergenic sequence. Both are referred to as "UTR mappings". - 5' UTR or 3' UTR mappings, while mappings to the coding region are also available, are discarded. - If there are 5' UTR and 3' UTR mappings, but none in the coding region, we assume an intergenic situation and if one of the two positions is clearly closer to the coding region than the other (> 100bp closer), then the closest position is picked. If not, the position is discarded, and reported to the user. - After this, all mappings are counted by their coverage and reported. MERGING BIOLOGICAL REPLICATES PRIOR TO PEAK CALLING (IDENTIFICATION OF TRANSLATION INITIATION SITES (TISs)) -------------------------------------------------------------------------------- 15) To increase the statistical power and to lower background signal, merge the biological replicates of the harringtonin-treated samples. Both the wiggle files as well as the Mutalyzer result files, need to be merged. Examples, assuming A and C are harringtonin samples: Merging the wiggle files: for sample in A C; do for strand in F R; do wiggelen merge ${sample}?.merged_wiggle.${strand}.filtered.wig5 \ | sed 's/^\([0-9]\+\) \([0-9]\+\)/\1\t\2/' \ > merged_${sample}.merged_wiggle.${strand}.filtered.wig5; rm ${sample}?.merged_wiggle.${strand}.filtered.wig5.idx; done done Merging the Mutalyzer files: for sample in A C; do for strand in F R; do head -n 1 ${sample}1.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \ > merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt grep -h "^chr" ${sample}?.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \ | sort -g | uniq \ >> merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt done done The order of the variants in the Mutalyzer results file is sorted differently, but this has no effect on the following analysis results. PEAK CALLING: IDENTIFICATION OF PRIMARY ORFs, ALTERNATIVE ORFs and UPSTREAM ORFs -------------------------------------------------------------------------------- 16) The find_ORFs.php script is based on a local dynamic peak calling algorithm, which tries to identify peaks (TISs) relative to the coverage of the surrounding region. NOTE: Run this script on the merged replicates of harringtonin samples, only. Usage: ./find_ORFs.php MUTALYZER_RESULTS WIGGLE_FILE mm10_gene_list.txt STRAND (Valid values for STRAND: +, -, F and R) Example: ./find_ORFs.php \ merged_A.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \ merged_A.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F Produces: - *.ORF_analysis_results_stats.txt Statistics on the ORF finding run; information on the loaded gene list file and wiggle file, mentions the number of positions left out on each filtering step, and mentions the number of genes left with found translation initiation sites. - *.ORF_analysis_results.txt The results file shows, per gene, the number of positions found, the number actually analyzed as candidate peaks, and the number of TISs found. These are then reported with the genomic position, the coverage, and per transcript, the position relative to the annotated TIS on the transcript. - *.ORF_analysis_results_after_cutoff.txt This file has the same format of the results file, but shows only TISs found in the annotated coding region, at least 5000 bases from the annotated TIS. (the value of 5000 can be configured in the script) For more information on this cutoff, please see de Klerk E., Fokkema I.F.A.C. et al (2015). The script does the following things: + Read out the coverage information from the wiggle file + Read the gene list to find out which genes (transcripts, actually) are on which strand + Read the Mutalyzer result file to see which locations map to which transcripts, determining the position within the transcript * Filter low coverage positions (< 3) * Mappings on non-coding transcripts are ignored * Mappings on transcripts on the other strand are ignored * Transcripts not in the gene list are ignored and reported (so they could be fixed == added to the gene info file) * Filter positions without transcript mapping * Filter positions with only intronic mapping * Filter positions too far from translation start or stop sites (> 500bp) * The remaining possible mappings are processed. Please note that the "coding region" is defined as the region starting at -15 nucleotides from the ATG to the end of the open reading frame. Please note that this script does not see the difference between the UTR and the intergenic sequence. Both are referred to as "UTR mappings". - 3' UTR mappings, while mappings to the coding region or in the 5' UTR are also available, are discarded. - After this, all positions are stored, per gene, for the next step. + Peaks are searched for by walking through the positions, starting upstream. Each position, with at least a coverage of 20, is analyzed. (Please note, that 20 has been chosen for analysis of merged biological replicates. Otherwise, 10 was used.) * Its coverage should be higher than that of the positions located 3, 6, 9, 12 and 15 nucleotides upstream of it. * Its coverage should be at least as high as that of the positions located 1 and 2 nucleotides downstream of it. * It should show a triplet periodicity and a clear "harringtonin pattern": - The first nucleotide of the analyzed codon should have a coverage of at least 60% of the total coverage of that codon (configurable setting). - The following 5 codons should also not have a higher maximum coverage than this position's coverage. - If any of the following 5 codons have a maximum coverage higher than 10% of the coverage of the position we're analyzing, that codon must not show a conflicting triplet periodicity pattern (see first two points above). * If all these rules apply, the position is stored as a possible TIS. + Per gene, from all its found possible TISs, we will take the one with the highest coverage as a reference, and discard any other candidate TISs that do not have at least a coverage (on that position) of 10% of the reference (highest candidate). + All remaining candidate TISs will be reported. + Genes that have no candidate peaks left, are ignored. + For all genes remaining, the positions of the candidate TISs are reported relative to all known transcripts of the gene in question. The original number of positions with high coverage (>10) is mentioned along with the number of remaining candidate TISs. Results are displayed as: genomic position, coverage, position on transcript. The last column may be repeated, depending on the number of known transcripts. + Additionally, a separate file reports all candidate peaks found at a position higher than 5Kb from a known TIS, so they can easily be checked more thoroughly if they are false positives. MERGING RESULT FILES -------------------------------------------------------------------------------- 17) To be able to statistically compare the possible switches in TIS usage, the results of the ORF analyses need to be combined into one file, per strand (F and R). The merge_ORF_files.php script merges the files into one summary file, but does not take care of the strands, therefore make sure you don't mix forward and reverse files with each other. Usage: ./merge_ORF_files.php ORF_FILE [ORF_FILE [ORF_FILE [...]]] Example: ./merge_ORF_files.php *.F.*.ORF_analysis_results.txt Produces: - One file which name depends on the given input files, ending in the suffix .merged_ORF_analyses.txt. The format is described below. The script analyzes the file names to isolate the sample IDs. It assumes that the sample IDs are the only differences between the input file names. The sample IDs are also used in the headers of the output. While grouping, it ignores the positions of the peaks on the transcripts. It reports the chromosomal position, the coverage per sample, and the gene name. Every peak that has been called in at least one sample, is displayed. NOTE: If the original wiggle files of the individual biological replicates are present in the same directory, these are all read out while merging the ORF analysis result files, and the resulting file will report the individual coverages of all biological replicates, per TIS peak. If the wiggle files are not found, the coverages are taken from the ORF analysis result files, which means when a coverage of 0 is reported, it simply means that the position was not recognized as a TIS in that sample. The actual measured coverage in that sample may not be 0. Example output: # Chromosome Position A1 A2 A3 C1 C2 C3 Gene chr1 4857920 32 14 27 6 3 10 Tcea1 chr1 4857929 24 6 24 2 4 6 Tcea1 ANALYSIS OF SWITCHES IN TIS USAGE -------------------------------------------------------------------------------- The output file of the merge_ORF_files.php script is used to test switches in TIS usage. For this, the lme4.0 R package is used. The input file for the following R code is a tab delimited file. The first line containins a list of all sample names (in the below example, 6), separated by a tab. All following lines contain as first column the gene symbols and then, per sample, the actual individual coverages. The file is imported into R as a dataframe, containing 6 columns and n rows (n=number of called peaks). Example input for R: A1 A2 A3 C1 C2 C3 Tcea1 32 14 27 6 3 10 Tcea1 24 6 24 2 4 6 Below is the R code used for the analysis. library(lme4.0) load("TIS_analysis.Rdata") data_samples<-data_small[,2:7] Myotubes <- grepl("C", names(data_samples)) coeff_summary_fit_col3_col4 <- list() for(i in 1: length(list_data.multinom)) { print(i) mydata <- list_data.multinom [[i]] loc <- mydata[,1] N <- nrow( mydata) M <- 6 weight <- colSums(mydata[,2:7]) if ((!weight[1] && !weight[2] && !weight[3]) || (!weight[4] && !weight[5] && !weight[6])) { print("No coverage available in at least one condition") next } dat <- as.vector(as.matrix(mydata[,2:7])) loc_all <- rep(loc, M) Myotubes_all <- rep(Myotubes, each=N) weight_all <- rep(weight, each=N) subj_all <- factor(rep(colnames(mydata[,2:7]), each=N)) fit<- lmer( dat/weight_all ~ 0 + Myotubes_all*loc_all - Myotubes_all + (loc_all|subj_all), family=binomial(), weight=weight_all) fit0 <- lmer( dat/weight_all ~ 0+loc_all + (loc_all|subj_all), family=binomial(), weight=weight_all) coeff_summary_fit_col3_col4[[i]] <- coef(summary(fit))[(1+nrow(coef(summary(fit)))/2):nrow(coef(summary(fit))),3:4] } sink("coeff_summary_fit_col3_col4.csv") coeff_summary_fit_col3_col4 sink() TIS LOCATION AND MOTIF ANALYSIS -------------------------------------------------------------------------------- 18) The generate_stats_peaks_per_location.php script is used to (i) categorize TISs based on their location relative to different gene regions, (ii) report codon motifs, (iii) report the sequence of potential uORFs (sequence between each TIS located in the 5'UTR and the annotated TIS. The script uses the ORF analysis results files of the merged replicates. It takes all ORF analysis results files, both the results after the cutoff as the ones before, and sums up the number of peaks found per location: 5' UTR, annotated TIS, coding, 3' UTR, or multiple. "Multiple" means that the TIS location was mapping to multiple transcripts, and in at least two of these, the location was different. An exception here is when one of the locations where the TIS maps to, is "annotated_TIS". Then, it is assumed that the other transcripts with other positions are not the translated transcripts. Note that to determine the locations of the reads, the positions that are given by the find_ORFs.php script have to be increased by 12 bases, since these are the 5' ends of the reads. This means that positions -12 through -10 are counted as annotated TIS. Lower numbers are counted as 5'UTR, while higher numbers (> -10) are counted as coding. Positions starting with an asterisk (*) are counted as 3'UTR. This also means that read start positions at the end of the coding region, less than 12 bp away from the 3'UTR, are counted as coding while in fact in reality they represent a read that lies in the 3'UTR. This can not be detected however, because we don't know the length of the coding region of the transcripts. To be able to report the TIS codon motifs and the sequence between each 5'UTR TIS and the annotated TIS, this script downloads the RefSeq sequences of the transcripts that the read was aligned to, automatically from the NCBI website using the URL format: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_007386.2.gb&rettype=gb To prevent repeated downloads on successive runs of the script, it requires an "NM cache" directory where all downloaded sequences are stored and can be re- used by successive runs. The name of the directory is currently configured in the $_SETT settings array within the script. If the directory is not found or not writable, the script will complain and refuse execution. When that happens, check if the directory can be made writable. Otherwise create an empty directory and set its name in the settings in the script source code. Usage: ./generate_stats_peaks_per_location.php ORF_FILE [ORF_FILE [ORF_FILE [...]]] Example: ./generate_stats_peaks_per_location.php *ORF_analysis_results* Produces: - *.ORF_analysis_results.stats_peaks_per_location.txt (one per sample) Reports the total number of TISs found as well as the total coverage of found TISs per gene region (5'UTR, annotated TIS, coding, 3'UTR, multiple). Both are reported since a high number of identified TISs may not correlate to also having high coverage in this region. The statistics are reported separately for all peaks before the cutoff and all peaks including the ones after the cutoff. - *.ORF_analysis_results.peaks_classification.txt (one per sample) Reports all found TISs before the set cutoff, sorted on category, strand and genomic position. Other fields reported are the "real" genomic position of the TIS, calculated by adding (or substracting, for the negative strand) an offset of 12 nucleotides to the 5' read genomic positions, the gene, the summed coverage of the replicates in that position, the transcript RefSeq ID (only one reported), the position on that transcript including the position with the offset of 12 applied, optional status messages, and the sequence of the codon at the reported TIS. For TISs reported in the category "multiple", the information in the last four columns is missing. In case the RefSeq file has no CDS annotated, instead of the motif sequence the error "could_not_parse_CDS" is given. In case the RefSeq file has no 5'UTR annotated (CDS starts at position 1), or in case the RefSeq file has not enough 5'UTR annotated to fetch the full motif sequence (CDS starts at a position smaller than the distance of the TIS to the annotated TIS), the status is set to "no_5UTR" or "unannotated_5UTR", respectively, and a "slice" of the genomic sequence is downloaded. This slice starts from the corrected genomic position, and 75000 bases downstream. From this file, this script attempts to find the correct mRNA and CDS tags, takes the sequence until the annotated CDS, removing any annotated introns before the CDS. When the mRNA tag can not be found, the status also mentions the code "no_mRNA_definition". The motif is still reported, but the output file described below will not contain the sequence until the annotated TIS. - *.ORF_analysis_results.peaks_classification_5UTR.txt (one per sample) Reports all TISs found in the 5'UTR, sorted on strand and genomic position. The fields reported are very similar to the fields mentioned in the previous file, with the following exceptions: the motif field is missing, and the following two fields are added: the DNA sequence starting at the TIS until the annotated TIS, and the translation of this sequence to protein sequence. If a found TIS can be mapped to different transcripts at different distances from the annotated TIS, all these transcripts are reported on a separate line. The protein sequence can contain a *, which indicates a stop-codon. A ? indicates an incomplete or unrecognizable codon. This should happen only if the sequence is in a different frame compared with the annotated TIS; in this case the DNA sequence given will end with an incomplete codon and the trans- lated sequence will end with a ?. Any messages that might be reported in the status column, are explained above in the description of the previous file. The script is checking the names of input files to automatically ignore its own result files, the statistics files generated by the find_ORFs.php script and merged ORF files generated by the merge_ORF_files.php script. Input file names need to contain the strand information ('.F.' or '.R.') and must end with either 'ORF_analysis_results.txt' or 'ORF_analysis_results_after_cutoff.txt'. Files are grouped by their file name's prefix (until the strand information). As such, per sample the script can find four files; two forward strand and two reverse strand; each a file with peaks before cutoff, and one with peaks after the cutoff. If files are passed for which the file name is not recognized, they are reported and the script will stop processing. 19) The result files of the generate_stats_peaks_per_location.php script contain the motif sequences of every found TIS. The code to analyze the motif sequences used is below. If needed, change the first line to use the sample names you wish to analyze. for sample in A C; do echo "Analyzing sample ${sample}:" > merged_${sample}.motif_analysis_results.txt; for category in 5UTR unannotated_5UTR annotated_TIS coding 3UTR multiple; do if [[ $category == '5UTR' ]]; then TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`; TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`; elif [[ $category == 'unannotated_5UTR' ]]; then TOTAL=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`; TOTAL_COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`; else TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`; TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`; fi echo -e "${category}\tTotal TISs:\t${TOTAL}\tTotal coverage:\t${TOTAL_COVERAGE}"; echo -e "Motif\tTISs\tCoverage\tPercTISs\tPercCoverage"; if [[ $category == '5UTR' ]]; then MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`; elif [[ $category == 'unannotated_5UTR' ]]; then MOTIFS=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`; else MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`; fi for motif in $MOTIFS; do if [[ $category == '5UTR' ]]; then TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep "^${motif}$" | wc -l`; COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`; elif [[ $category == 'unannotated_5UTR' ]]; then TISs=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`; COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`; else TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`; COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`; fi PERC_TISs=`echo "scale=2; $TISs * 100 / $TOTAL" | bc`; PERC_COVERAGE=`echo "scale=2; $COVERAGE * 100 / $TOTAL_COVERAGE" | bc`; echo -e "${motif}\t${TISs}\t${COVERAGE}\t${PERC_TISs}%\t${PERC_COVERAGE}%"; done done >> merged_${sample}.motif_analysis_results.txt; done Produces: - Per sample, one file containing per TIS category (5UTR, unannotated_5UTR, annotated_TIS, coding, 3UTR, multiple): the analyzed motif, the number of TISs with this motif, the total coverage of TISs with this motif, the percentage of TISs with this motif, and the percentage of reads on TISs with this motif.
LUMC/ribosome-profiling-analysis-framework
Ribosome profiling analysis framework as described in de Klerk E., Fokkema I.F.A.C et al (2015). Alignment, triplet periodicity analysis, detecting ORFs.
PHPNOASSERTION