Roman Schwob (roman.schwob@students.unibe.ch) ======================================================== === Bioinformatics: Analysis of RNA sequencing reads === ======================================================== This project is part of the course "RNA sequencing" (467713) of the University of Bern, taking place in Fall Semester 2022/2023. As part of subgroup 1 in the lncRNA group, I analyzed the reads of the holoclonal and compared them to the parental cell lines. --- --- --- --- --- --- --- --- --- --- Datasets --- --- --- --- --- --- --- --- --- --- Holoclonal (3 replicates: 1_1, 1_2, 1_5), reads from TruSeq stranded libraries Files (3 replicates, each with R1 and R2 = reverse and forward, respectively): 1_1_L3_R1_001_ij43KLkHk1vK.fastq.gz 1_1_L3_R2_001_qyjToP2TB6N7.fastq.gz 1_2_L3_R1_001_DnNWKUYhfc9S.fastq.gz 1_2_L3_R2_001_SNLaVsTQ6pwl.fastq.gz 1_5_L3_R1_001_iXvvRzwmFxF3.fastq.gz 1_5_L3_R2_001_iXCMrktKyEh0.fastq.gz Parental (3 replicates: P1, P2, P3), reads from TruSeq stranded libraries Files (3 replicates, each with R1 and R2 = reverse and forward, respectively): P1_L3_R1_001_9L0tZ86sF4p8.fastq.gz P1_L3_R2_001_yd9NfV9WdvvL.fastq.gz P2_L3_R2_001_06FRMIIGwpH6.fastq.gz P2_L3_R2_001_06FRMIIGwpH6.fastq.gz P3_L3_R1_001_fjv6hlbFgCST.fastq.gz P3_L3_R2_001_xo7RBLLYYqeu.fastq.gz Reference genome Human genome, version GRCh38 GENCODE release 21 (https://www.gencodegenes.org/human/release_21.html): - Comprehensive gene annotation with ALL regions in GTF format - Genome sequence (GRCh38) with ALL regions in fasta format Uni Basel, Human Build 38 (https://www.polyasite.unibas.ch/atlas): - Atlas BED file with PolyA sites FANTOM5 collection (https://fantom.gsc.riken.jp/5/datafiles/reprocessed/hg38_latest/extra/CAGE_peaks/): - hg38_fair+new_CAGE_peaks_phase1and2 BED file --- --- --- --- --- --- --- --- --- --- Data analysis steps --- --- --- --- --- --- --- --- --- --- 1) Read quality and statistics Goal: How is the quality of the reads? How many reads do we have for each replicate? Software: FastQC 0.11.9 bash (grep) Scripts: 1_QC_1_FastQC.slurm 1_QC_2_NumReads.slurm Input: fastq files Output: Results of FastQC Text file with number reads for confirmation 2) Read mapping Goal: Mapping reads onto human reference genome Alignment rates for the samples? Software: HISAT2 2.2.1 (alternative = STAR) Samtools 1.10 IGV 2.15.4 Scripts: 2_Map_1_Index_RefGen_hisat2.slurm 2_Map_2_MapReads.slurm 2_Map_3_Index_RefGen_samtools.slurm 2_Map_4_SAM2BAM.slurm Input: fastq files, reverse and forward each replicate Reference genome in fasta format Output: BAM file for every replicate (sorted and indexed) Text file with the alignment rates (in 2_Map_2_MapReads) 3) Transcriptome assembly Goal: How many exons, transcripts and genes are in the meta-assembly? How many transcripts are composed of just a single exon? How many of these are novel, i.e. do not have an associated GENCODE identifier? Software: StringTie 1.3.3b (alternative = Scallop) R 4.2.2 Scripts: 3_Assembly_1_SingleAssembly.slurm 3_Assembly_2_MergeAssemblies.slurm 3_Assembly_3_FilterCount_MetaAssembly.R Input: 6 BAM files (1 of each cell line) Reference genome in gtf format Output: One meta-assembly GTF format file (merged through stringtie --merge from 6 separate GTF files) Txt file with numbers of genes, transcripts, exons that are novel, annotated, single-exon 4) Quantification Goal: What units of expression are we using? Does the entire expression level across all genes add up to the expected amount? Software: Cufflinks 2.2.1 (to generate reference transcriptome) kallisto 0.46.0 (alternative = htseq-count) R 4.2.2 Scripts: 4_Quantification_1_Create_RefTrans.slurm 4_Quantification_2_Index_RefTrans_Kallisto.slurm 4_Quantification_3_ExprLevels.slurm 4_Quantification_4_Validation.R Input: Meta-assembly GTF Reference genome in fasta format fastq files, reverse and forward each replicate Output: Absolute expression tables (abundance.h5) Txt file with total tpm (= 1'000'000) for validation of expression levels Txt file with the number of expressed transcripts for each replicate 5) Differential expression Goal: Do known/expected genes change in expression as expected? Software: sleuth 0.30.1 (alternative = DESeq2) R 4.2.2 Scripts: 5_DiffExpr_2_Gene_Transcript_Map.R 5_DiffExpr_3_DiffExpr_TransLevel.R 5_DiffExpr_4_DiffExpr_GeneLevel.R Input: Absolute expression tables (abundance.h5 from kallisto) (Meta-assembly GTF for gene-transcript map) Output: Transcript and gene level differential expression tables & plots 6) Integrative analysis Goal: How good are the 5’ and 3’ annotations of your transcripts? How many novel “intergenic” genes have you identified? What percent of your novel transcripts are protein coding? Software: R 4.2.2 BEDTools 2.29.2 CPAT 1.2.4 and CPC 2.0 Scripts: 6_IntAn_1_CreateBed.R 6_IntAn_2_Find_TSS_PolyA_intergenic.slurm 6_IntAn_3_ProtCodPot.slurm Input: Merged meta-assembly GTF BED references for polyA and TSS Reference genome in gtf format (for finding "intergenic" genes) Human hexamer frequencies and logitModel for CPAT (provided by sourceforge) Output: Statistics and plots addressing key questions 7) Summary/prioritization (Optional) Goal: How would you prioritize your data to provide it with a ranked list of candidates? Software: R 4.2.2 Script: 7_1_Create_Summary.R 7_2_Analyse_Results.R Input: Merged meta-assembly GTF DiffExpr tables TSS, polyA and intergenic tables Protein coding potential table (Reference genome in gtf format) Output: Ranked list of gene candidates --- --- --- --- --- --- --- --- --- --- Software used --- --- --- --- --- --- --- --- --- --- FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ HISAT2: https://daehwankimlab.github.io/hisat2/manual/ Samtools: https://www.htslib.org/ IGV: https://igv.org/; https://software.broadinstitute.org/software/igv/ StringTie: https://ccb.jhu.edu/software/stringtie/ Cufflinks: https://cole-trapnell-lab.github.io/cufflinks/ kallisto: https://pachterlab.github.io/kallisto/about.html BEDTools: https://bedtools.readthedocs.io/en/latest/ CPAT (1.2.4): https://rna-cpat.sourceforge.net/ sleuth: https://pachterlab.github.io/sleuth/about (rtracklayer: https://rdrr.io/bioc/rtracklayer/ ) (CPAT 3 https://cpat.readthedocs.io/en/latest/ ) (CPC 2.0: https://cpc2.gao-lab.org/ ) Overview over all kinds of bioinformatics software, including most of the above: https://bioinformaticshome.com/ Overview over software on the ibu cluster: https://www.vital-it.ch/