Jack Humphrey 2019
- snakemake
- minimap2
- samtools
- rnaseqc
- bx-python
- multiqc
- bcbiogff
- gffread
- biopython
- cDNA_Cupcake
- SQANTI2
- gtfToGenePred
2pass tools - look into
conda create -c bioconda -c conda-forge -n isoseq-pipeline python=3.7 snakemake samtools=1.9 minimap2
conda install -n isoseq-pipeline psutil biopython
conda install -n isoseq-pipeline -c bioconda isoseq3=3.2 pbccs=4.0
conda install -n isoseq-pipeline -c bioconda bcbiogff gffread lima pbcoretools bamtools pysam ucsc-gtftogenepred openssl=1.0 pbbam
conda activate isoseq-pipeline
pip install multiqc
pip install bx-python
# install cDNA_Cupcake - Liz Tseng's code
git clone git@github.com:Magdoll/cDNA_Cupcake.git
cd cDNA_Cupcake
python setup.py build
python setup.py install --prefix=<where your conda environment is installed>
cd ..
# clone SQANTI2
git clone git@github.com:Magdoll/SQANTI2.git
# if gtfToGenePred doesn't install via conda, download the UCSC binary and add it to PATH
wget -P scripts/ http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod +x scripts/gtfToGenePred
echo "export PATH=$PWD/scripts/:\$PATH" >> ~/.bashrc
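Once the installation steps above are done, it can help to confirm that the command-line tools are visible on PATH inside the activated environment. The following is a minimal sketch; the tool list and the function name `missing_tools` are assumptions based on the dependency list above, not part of the pipeline itself.

```python
import shutil

# Executables assumed to be needed, taken from the dependency list above.
# Adjust this list to match your own setup.
REQUIRED_TOOLS = ["snakemake", "minimap2", "samtools", "gffread", "gtfToGenePred"]


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it after `conda activate isoseq-pipeline` so that the environment's `bin` directory is on PATH.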
conda activate isoseq-pipeline
ml R/3.6.0
mv test/test_config.yaml .
mv test/test_samples.tsv .
snakemake --configfile test_config.yaml -npr
Alignment
- Align reads to reference genome using minimap2
- Index BAM files, run samtools flagstat and idxstats for QC
- Get more QC with RNA-SeQC
- Pool QC metrics together with MultiQC
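As an illustration of the flagstat QC step above, here is a minimal sketch that parses the plain-text output of `samtools flagstat` and reports the fraction of mapped records. It assumes the default text layout, in which each line reads `<passed> + <failed> <label>`. The function names are hypothetical, and MultiQC already does this kind of aggregation for you.

```python
import re

# Each flagstat line looks like: "950 + 0 mapped (95.00% : N/A)"
LINE_RE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Map each flagstat label to a (qc_passed, qc_failed) tuple of counts."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        passed, failed, label = match.groups()
        # Drop any parenthesised suffix such as "(95.00% : N/A)".
        label = label.split(" (")[0]
        # Keep the first occurrence if two lines share a label prefix.
        counts.setdefault(label, (int(passed), int(failed)))
    return counts


def mapped_fraction(counts):
    """Fraction of QC-passed records that are mapped (0.0 if there are none)."""
    total = counts["in total"][0]
    mapped = counts["mapped"][0]
    return mapped / total if total else 0.0


example = """1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
50 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""

if __name__ == "__main__":
    print(mapped_fraction(parse_flagstat(example)))
```

Note that flagstat counts alignment records, so supplementary alignments are included in both the total and the mapped count.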
Transcript assembly/collapse
- Collapse reads with TAMA and Cupcake
- Assemble reads with StringTie2 and Scallop-LR
- Inspect each run with SQANTI2
- Infer transcript function with IsoAnnot (not yet released)
- Create kallisto reference for short reads
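To make the collapse step above concrete, here is a minimal sketch of its core idea: reads that share the same chain of introns are treated as one transcript. This is a simplification for illustration only. TAMA and Cupcake apply their own rules, which this does not attempt to reproduce. The read tuple layout and the function names are hypothetical.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exon intervals."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def collapse_reads(reads):
    """Group read IDs by chromosome, strand and exact intron chain.

    `reads` is an iterable of (read_id, chrom, strand, exons) tuples.
    Mono-exonic reads have no introns, so they are keyed by their exact
    exon coordinates instead.
    """
    groups = defaultdict(list)
    for read_id, chrom, strand, exons in reads:
        chain = intron_chain(exons)
        if chain:
            key = (chrom, strand, "multi", chain)
        else:
            key = (chrom, strand, "mono", tuple(sorted(exons)))
        groups[key].append(read_id)
    return dict(groups)


if __name__ == "__main__":
    reads = [
        ("r1", "chr1", "+", [(100, 200), (300, 400)]),
        ("r2", "chr1", "+", [(150, 200), (300, 450)]),  # same introns as r1
        ("r3", "chr1", "+", [(100, 200), (310, 400)]),  # different splice site
    ]
    for key, ids in collapse_reads(reads).items():
        print(key, ids)
```

Because the match is exact, a splice site shifted by a few bases from read errors produces a separate group, so this sketch will overcount transcripts on noisy reads.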
- How isoseq works:
- Installing isoseq on conda:
- What to do with the output of Isoseq:
  - collapse aligned reads to find unique transcripts
  - merge multiple transcriptomes together
- isoAnnot - database of isoform functions - not available yet
- tappAS - "Your application to understand the functional implications of alternative splicing"
- Cogent - reconstruct coding genome from long reads without a reference genome
- pipeline that uses minimap2 for alignment, custom R scripts for merging transcripts between samples and SQANTI for filtering.
- ULTRA - long read aligner, purports to be more accurate than minimap2 on short exons
Kuo et al. - Illuminating the dark side of the human transcriptome
Analyses a Universal Human Reference RNA Iso-Seq sample, comparing 5 different pipelines. Discusses multiple sources of error/contamination that can be misinterpreted as novel transcripts:
- genomic fragments
- internal priming of pre-mRNA
- 5' degradation leading to novel shortened isoforms
- wobble of alignment between read and reference due to read errors
- chimeric reads caused by polishing of long reads by short reads from homologous but different transcripts
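Of the artefacts listed above, internal priming is one that can be screened for with a simple heuristic: if the genomic sequence just downstream of a read's 3' end is A-rich, the oligo(dT) primer may have annealed there rather than to a true polyA tail. The sketch below illustrates that check. The 20 bp window, the 0.6 threshold and the function names are illustrative assumptions and are not taken from Kuo et al.

```python
def a_fraction(sequence):
    """Fraction of bases in `sequence` that are A (case-insensitive)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("A") / len(sequence)


def possible_internal_priming(downstream_seq, window=20, threshold=0.6):
    """Flag a read whose downstream genomic window is A-rich.

    `downstream_seq` is the genomic sequence immediately after the read's
    3' end, given in transcript orientation (reverse-complemented for
    minus-strand reads). `window` and `threshold` are illustrative defaults.
    """
    return a_fraction(downstream_seq[:window]) >= threshold


if __name__ == "__main__":
    print(possible_internal_priming("AAAAAAGAAAAAAAATAAAA"))
    print(possible_internal_priming("ACGTACGTACGTACGTACGT"))
```

A flagged read is only a candidate. Genuine polyadenylation sites can also sit next to A-rich genomic sequence, so this is better used as a filter to review than as a hard cutoff.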
- bioawk - awk with built-in parsing of biological data formats (SAM, GTF, BED, etc.)
- Lists of non-polyadenylated genes for verifying polyA annealing data