pauNy
Nx curves and area-under-Nx metrics for python
This repository is based on a blog post by Heng Li discussing a better metric than N50 to assess assembly contiguity.
N50 is often used to quantify the contiguity of assemblies. In general, Nx describes that contigs longer than Nx cover x% of an assembly. An entire Nx curve shows Nx as a function of x, ranging from 0 to 100. N50 is only a single value on this Nx curve, which is not equivalent to a median or average contig length. And it might hide some interesting insight about the total contiguity of an assembly.
Further, area under Nx (auN) is the sum of the area under such an Nx curve. It does not suffer from the same issues that can affect N50 (see linked blog post), and would therefore be a better measure for assembly contiguity. N50 is extremely popular though, but maybe introducing a simple python module called p-auN-y can make a difference?
Installation
Clone the repository and install dependencies
git clone https://github.com/W-L/pauNy.git
cd pauNy/
python3 -m venv pauny-venv
source pauny-venv/bin/activate
pip install numpy pandas plotnine
Quick usage
Required args
Execute the runscript with input -i / --input
, which can be fasta/fastq files or one or more directories with fasta/fastq files in them. Files can be gzipped too.
Optional args
-o / --out
can specify a base name for output files-f / --format
specify output filetype for visualisation, e.g. pdf or png-r / --ref
for a path to a reference assembly. This triggers calculation of NGx and auNG values using the genome size. And will also mark this assembly as reference in output.-g / --genomesize
alternatively to a reference file, a genome size (estimate) can be given for scaling values to NGx and auNG.
usage: pauNy [-h] -i INPUT [INPUT ...] [-o OUT] [-f FORMAT] [-r REF | -g GENOMESIZE]
Nx curves and area under Nx in python
optional arguments:
-h, --help show this help message and exit
-i INPUT [INPUT ...], --input INPUT [INPUT ...]
input fasta/fastq file(s) or director(ies) of files. Can be multiple (space-separated).
-o OUT, --out OUT base name for output files
-f FORMAT, --format FORMAT
output format for plots
-r REF, --ref REF path to reference sequence file
-g GENOMESIZE, --genomesize GENOMESIZE
genome size or estimate
Use as python module
Instead of using the runscript, this module can be imported into other scripts as import pauNy
. The main functionalities are:
import pauNy
# for a single sequence file
asm = pauNy.Assembly(path, [gsize])
nx_values = asm.calculate_Nx()
auN_value = asm.calculate_auN()
# for multiple assemblies
asm_c = pauNy.AssemblyCollection(paths, [reference_path, genome_size])
asm_c.calculate_metrics()
asm_c.nx_values # metrics are available as dictionaries per input file
asm_c.nx_frame # or as pandas frames
asm_c.aun_values
asm_c.aun_frame
asm_c.plot() # generate visualisations (see example)
For documentation check out the docstrings and type hints in the sources.
Example
As an example we can look at assemblies of yeast from different levels of subsampled coverage (5x, 10x, 15, 20x). These assemblies and a reference assembly are in data/
.
You can reproduce the example csv-files and visualisations using:
./pauny.py -i data/ -r data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
Run tests
pip install coverage
coverage run -m unittest discover -s tests
coverage report
TODO
- allow other types of input: array/list of contig lengths
- the plotting breaks with plotnine >0.10.1