/NanoCount

EM based transcript abundance from nanopore reads mapped to a transcriptome with minimap2

Primary LanguagePythonMIT LicenseMIT

NanoCount

EM based transcript abundance from nanopore reads mapped to a transcriptome with minimap2 Python package adapted from https://github.com/jts/nanopore-rna-analysis by Jared Simpson

NanoCount estimates transcript abundance from ONT direct-RNA Sequencing reads mapped to a transcriptome. It uses an expectation-maximization approach like RSEM, Kallisto, salmon, etc to handle multi-mapping reads. The reads must be mapped to the transcript set using minimap2.

  • Author: Adrien Leger - aleg {at} ebi.ac.uk (based on a prototype written by Jared Simpson)
  • Licence: MIT
  • Python version: 3.5 +

Installation

Ideally, before installation, create a clean python3 virtual environment to deploy the package, using virtualenvwrapper for example (see http://www.simononsoftware.com/virtualenv-tutorial-part-2/).

To install the package

pip3 install git+https://github.com/a-slide/NanoCount.git

To update the package:

pip3 install git+https://github.com/a-slide/NanoCountbash.git --upgrade

Usage

Command line interface

usage: NanoCount [-h][--version] -i ALIGNMENT_FILE -o COUNT_FILE
                 [--min_read_length MIN_READ_LENGTH]
                 [--min_query_fraction_aligned MIN_QUERY_FRACTION_ALIGNED]
                 [--equivalent_threshold EQUIVALENT_THRESHOLD]
                 [--scoring_value SCORING_VALUE]
                 [--convergence_target CONVERGENCE_TARGET] [--verbose]

Calculate transcript abundance for a dRNA-Seq dataset from a BAM/SAM alignment file generated by minimap2

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit
  -i ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        BAM or SAM file containing aligned ONT dRNA-Seq reads including secondary and supplementary alignment
  -o COUNT_FILE, --count_file COUNT_FILE
                        Output count file
  --min_read_length MIN_READ_LENGTH
                        Minimal length of the read to be considered valid
  --min_query_fraction_aligned MIN_QUERY_FRACTION_ALIGNED
                        Minimal fraction of the primary hit query aligned to consider the read valid
  --equivalent_threshold EQUIVALENT_THRESHOLD
                        Fraction of the alignment score or the alignment
                        length of secondary hits compared to the primary hit to be considered valid hits
  --scoring_value SCORING_VALUE
                        Value to use for score thresholding of secondary hits. Either alignment_score or alignment_length
  --convergence_target CONVERGENCE_TARGET
                        Convergence target value of the cummulative difference between abundance values of successive EM round to trigger the end of the EM loop
  --verbose             If True will be chatty

Output file exemple

transcript_name	raw	est_count	tpm
ENSDART00000023156|ENSDARG00000020850	0.009360321697698902	277.0000000000036	277000000.0000036
ENSDART00000093609|ENSDARG00000063908	0.0075017740681919	222.0000000000029	222000000.0000029
ENSDART00000052511|ENSDARG00000036161	0.0063190619403238075	187.00000000000244	187000000.00000244
ENSDART00000093606|ENSDARG00000063905	0.00615010306491408	182.00000000000236	182000000.00000235
ENSDART00000146211|ENSDARG00000020504	0.006076357615676965	179.81765092072843	179817650.92072842
ENSDART00000028885|ENSDARG00000028335	0.005410518932099315	160.11348675761502	160113486.75761503
ENSDART00000058715|ENSDARG00000040141	0.005237725137701552	155.00000000000202	155000000.00000203
ENSDART00000074787|ENSDARG00000052856	0.004385241613334473	129.77245506340708	129772455.06340708
ENSDART00000093613|ENSDARG00000063912	0.0036495117088501134	108.0000000000014	108000000.0000014

Interactive Python environment (jupyter, ipython...)

The same options as for the command line are available, but it allows to use the package as an API and access more internal information

from NanoCount.NanoCount import NanoCount_main
n = NanoCount_main (alignment_file="transcriptome_aligned_reads.bam", verbose=True)
Parse Bam file and filter low quality hits
    Mapped hits:118237
    Wrong strand hits:1139
    Unmapped hits:38325
    Valid best hit:72427
    Valid secondary hit:31298
    Invalid secondary hit:12695
    Best hit with low query fraction aligned:1066
Generate initial read/transcript compatibility index
Start EM abundance estimate
........
Convergence target reached after 8 rounds
Convergence value = 0.004801809595549253

The count results are stored in a Pandas Dataframe that can be conveniently rendered in Jupyter

display(n.count_df)
transcript_name raw est_count tpm
ENSDART00000010144|ENSDARG00000002768 0.0319356 2313 2313000000
ENSDART00000055062|ENSDARG00000037789 0.02688224 1947 1947000000
ENSDART00000055136|ENSDARG00000037840 0.01905367 1380 1380000000
ENSDART00000093609|ENSDARG00000063908 0.01818383 1317 1317000000
ENSDART00000023156|ENSDARG00000020850 0.01186022 859 859000000
ENSDART00000093612|ENSDARG00000063911 0.008146133 590 590000000
ENSDART00000123003|ENSDARG00000067990 0.007874959 570.3596 570359600
ENSDART00000012644|ENSDARG00000017624 0.007097575 514.0561 514056100
ENSDART00000093613|ENSDARG00000063912 0.006862082 497 497000000

Note to power-users and developers

Please be aware this package is experimental . It was tested under Linux Ubuntu 16.04 and in an HPC environment running under Red Hat Enterprise 7.1.

You are welcome to contribute by requesting additional functionalities, reporting bugs or by forking and submitting patches or updates pull requests

Thank you