isoseq-pipeline

Pipeline for working with PacBio long-read RNA data using the outputs of isoseq3


Pipeline for working with Isoseq3 output files

Jack Humphrey 2019

Dependencies:

2pass tools (still to be looked into)

Conda recipe

conda create -c bioconda -c conda-forge -n isoseq-pipeline python=3.7 snakemake samtools=1.9 minimap2
conda install -n isoseq-pipeline psutil biopython
conda install -n isoseq-pipeline -c bioconda isoseq3=3.2 pbccs=4.0
conda install -n isoseq-pipeline -c bioconda bcbiogff gffread lima pbcoretools bamtools pysam ucsc-gtftogenepred openssl=1.0 pbbam

conda activate isoseq-pipeline
pip install multiqc
pip install bx-python

# install cDNA_Cupcake (Liz Tseng's code)
git clone git@github.com:Magdoll/cDNA_Cupcake.git
cd cDNA_Cupcake
python setup.py build
python setup.py install --prefix=<where your conda environment is installed>
cd ..

# clone SQANTI2
git clone git@github.com:Magdoll/SQANTI2.git

# if gtfToGenePred doesn't install via conda, download the UCSC binary and add it to PATH
wget -P scripts/ http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod +x scripts/gtfToGenePred
echo "export PATH=\"$PWD/scripts/:\$PATH\"" >> ~/.bashrc
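
After installation it can be useful to confirm that the command-line tools are actually visible on PATH before starting a run. The following is a minimal sketch and not part of the pipeline: the `REQUIRED_TOOLS` list and the `find_missing_tools` helper are hypothetical, and the executable names are assumed from the packages installed above.

```python
import shutil

# Executable names are assumptions based on the packages installed above;
# adjust the list to match what your environment actually provides.
REQUIRED_TOOLS = ["snakemake", "samtools", "minimap2", "lima", "gffread", "gtfToGenePred"]


def find_missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

Run it inside the activated isoseq-pipeline environment; anything reported as missing needs to be installed or added to PATH.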

Running on test data

conda activate isoseq-pipeline
ml R/3.6.0  # "ml" is the Lmod shorthand for "module load"; adjust to your cluster
mv test/test_config.yaml .
mv test/test_samples.tsv .
snakemake --configfile test_config.yaml -npr

Outline of pipeline

Alignment

  1. Align reads to reference genome using minimap2

  2. Index BAM files, run samtools flagstat and idxstats for QC

  3. Get more QC with RNA-SeQC

  4. Pool QC metrics together with MultiQC
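
The alignment steps above can be sketched as a list of shell commands for one sample. This is an illustration under stated assumptions, not the pipeline's actual Snakemake rules: the `alignment_commands` helper, the output file names, and the thread count are hypothetical, and the `splice:hq -uf` preset is the one the minimap2 documentation suggests for PacBio Iso-Seq reads. RNA-SeQC is left out because its invocation depends on the installed version.

```python
from pathlib import Path


def alignment_commands(reference_fasta, reads_fastq, out_dir, threads=4):
    """Build (but do not run) shell commands to align one sample and collect QC."""
    out = Path(out_dir)
    sample = Path(reads_fastq).stem  # hypothetical naming: sample name = file stem
    bam = out / f"{sample}.sorted.bam"
    flagstat = out / f"{sample}.flagstat.txt"
    idxstats = out / f"{sample}.idxstats.txt"
    return [
        # Step 1: spliced alignment, piped straight into a coordinate-sorted BAM.
        f"minimap2 -ax splice:hq -uf -t {threads} {reference_fasta} {reads_fastq}"
        f" | samtools sort -@ {threads} -o {bam} -",
        # Step 2: index the BAM and collect alignment statistics.
        f"samtools index {bam}",
        f"samtools flagstat {bam} > {flagstat}",
        f"samtools idxstats {bam} > {idxstats}",
        # Step 4: MultiQC scans the output directory for QC files it recognises.
        f"multiqc {out} -o {out / 'multiqc'}",
    ]


if __name__ == "__main__":
    for command in alignment_commands("genome.fa", "sample1.fastq", "results"):
        print(command)
```

Returning the commands as strings keeps the sketch easy to inspect; in the pipeline itself each step would normally be its own Snakemake rule.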

Transcript assembly/collapse

  • Collapse reads with TAMA and Cupcake

  • Assemble reads with StringTie2 and Scallop-LR

  • Inspect each run with SQANTI2

  • Infer transcript function with IsoAnnot (not yet released)

  • Create kallisto reference for short reads
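
To illustrate what "collapsing" means, here is a toy sketch that groups aligned reads sharing the same chromosome, strand, and intron chain. It is not the algorithm used by TAMA or Cupcake, which also deal with variable 5' ends, junction wobble, and read errors; the `intron_chain` and `collapse_reads` helpers and the input layout are hypothetical.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns implied by a sorted list of (start, end) exon blocks."""
    return tuple(
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    )


def collapse_reads(reads):
    """Group multi-exon reads that share chromosome, strand and intron chain.

    `reads` maps a read name to (chrom, strand, exons). Single-exon reads
    have no introns, so this toy version keeps each one in its own group.
    """
    groups = defaultdict(list)
    for name, (chrom, strand, exons) in reads.items():
        chain = intron_chain(exons)
        key = (chrom, strand, chain) if chain else (chrom, strand, name)
        groups[key].append(name)
    return sorted(sorted(names) for names in groups.values())


if __name__ == "__main__":
    example = {
        "read1": ("chr1", "+", [(100, 200), (300, 400), (500, 600)]),
        "read2": ("chr1", "+", [(150, 200), (300, 400), (500, 650)]),  # same introns, different ends
        "read3": ("chr1", "+", [(100, 200), (500, 600)]),  # middle exon skipped
    }
    print(collapse_reads(example))
```

In the example, read1 and read2 collapse into one transcript model because their introns are identical, while read3 stays separate because it skips the middle exon.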

Resources

Tools

  • collapse aligned reads to find unique transcripts
  • merge multiple transcriptomes together

Papers

Kuo et al - Illuminating the dark side of the human transcriptome

Analyses a Universal Human Reference RNA Iso-Seq sample, comparing five different pipelines. Discusses multiple sources of error and contamination that can be misinterpreted as novel transcripts:

  • genomic fragments
  • internal priming of pre-mRNA
  • 5' degradation leading to novel shortened isoforms
  • wobble of alignment between read and reference due to read errors
  • chimeric reads caused by polishing of long reads by short reads from homologous but different transcripts
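
As one example of screening for these artefacts, internal priming is commonly flagged by checking whether the genomic sequence immediately downstream of a transcript's 3' end is A-rich, since an oligo(dT) primer can anneal there instead of at a true poly(A) tail. The sketch below is a simplified illustration of that heuristic for plus-strand transcripts only. It is not the method described by Kuo et al.; the function names, the 20-base window, and the 0.6 threshold are all assumptions chosen for illustration.

```python
def downstream_a_fraction(genome_seq, transcript_end, window=20):
    """Fraction of A bases in the `window` bases after a plus-strand 3' end.

    `transcript_end` is the 0-based position immediately after the last
    transcript base, so the window starts at genome_seq[transcript_end].
    """
    downstream = genome_seq[transcript_end:transcript_end + window].upper()
    if not downstream:
        return 0.0
    return downstream.count("A") / len(downstream)


def looks_internally_primed(genome_seq, transcript_end, window=20, threshold=0.6):
    """Flag a 3' end whose downstream genomic window is A-rich (illustrative cutoff)."""
    return downstream_a_fraction(genome_seq, transcript_end, window) >= threshold


if __name__ == "__main__":
    # Toy sequence: 20 mixed bases followed by an A-rich genomic stretch.
    genome = "ACGT" * 5 + "A" * 16 + "CGCG"
    print(looks_internally_primed(genome, transcript_end=20))
    print(looks_internally_primed(genome, transcript_end=0))
```

A minus-strand transcript would need the mirror-image check, looking for a T-rich stretch upstream of the end on the reference strand.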

Fig 1 Kuo et al

Misc