#spades_stats

Calculating statistics for SPAdes assemblies based on the coverage and length found in the fasta description line.

By Karin Lagesen | @karinlag

##How to run

###Requirements

  • Python 2.7
  • Biopython

###Command line

usage: spades_stats.py [-h] [-d DIRECTORY] [-p STRING]

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Directory containing directories with scaffolds.fasta
                        in them
  -p STRING, --prefix STRING
                        Prefix string for output files
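
As an illustration of the options above, here is a minimal `argparse` sketch that accepts the same two flags. It is not taken from `spades_stats.py`; the function name `build_parser` is hypothetical, and only the flag names, metavars and help strings come from the usage text.

```python
import argparse


def build_parser():
    """Build a parser that accepts the -d and -p options shown in the usage text.

    Hypothetical sketch: the real script may define its parser differently.
    """
    parser = argparse.ArgumentParser(prog="spades_stats.py")
    parser.add_argument(
        "-d", "--directory", metavar="DIRECTORY",
        help="Directory containing directories with scaffolds.fasta in them")
    parser.add_argument(
        "-p", "--prefix", metavar="STRING",
        help="Prefix string for output files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.directory, args.prefix)
```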

###Input

The input is a directory containing SPAdes assemblies. The code locates all files named "scaffolds.fasta" under that directory and calculates statistics for each of them.
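
The file discovery described above could be sketched roughly as follows. This is an assumption about how such a search might be done with `os.walk`, not the script's actual code; the function name `find_scaffolds` is hypothetical.

```python
import os


def find_scaffolds(directory, filename="scaffolds.fasta"):
    """Return the sorted paths of every file called `filename` below `directory`.

    Hypothetical helper: walks the whole tree, so nested assembly
    directories at any depth are picked up.
    """
    matches = []
    for root, _dirs, files in os.walk(directory):
        if filename in files:
            matches.append(os.path.join(root, filename))
    return sorted(matches)
```

Each returned path includes the directory that was passed in, so a relative input directory gives relative paths.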

###Output

Statistics are calculated for each file, and the results are written to one output file. The output file name is prefixed with the string given to the -p option.

The output file contains one line per scaffolds file, with the following columns (the first line is a header line):

| Column | Description |
| --- | --- |
| Scaffolds_filename | name of input file, incl relative path |
| N50 | N50 value for assembly |
| #contigs>=N50 | the fewest number of contigs whose sum makes up N50 |
| coverage_contigs | average coverage over the #contigs>=N50 |
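
To make the N50 columns concrete, here is a rough sketch of one way they could be computed. It rests on several assumptions that the script may not share: description lines look like `NODE_1_length_1000_cov_12.5`, the contig count is the number of longest contigs needed to reach half the total assembly length, and the coverage is an unweighted mean over those contigs. The names `parse_header` and `n50_stats` are hypothetical.

```python
import re

# Assumed SPAdes-style description line: NODE_<n>_length_<len>_cov_<cov>
HEADER_RE = re.compile(r"length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_header(header):
    """Extract (length, coverage) from one fasta description line."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError("unexpected header format: %r" % header)
    return int(match.group(1)), float(match.group(2))


def n50_stats(contigs):
    """Return (N50, number of contigs, mean coverage of those contigs).

    `contigs` is a list of (length, coverage) tuples. Contigs are taken
    longest first until their summed length reaches half the total.
    """
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    half = sum(length for length, _cov in ordered) / 2.0
    running = 0
    for count, (length, _cov) in enumerate(ordered, 1):
        running += length
        if running >= half:
            top = ordered[:count]
            mean_cov = sum(cov for _len, cov in top) / float(count)
            return length, count, mean_cov
    raise ValueError("no contigs given")
```

With four contigs of lengths 100, 80, 50 and 20, the total is 250, so the two longest contigs (summing to 180) are enough to pass the halfway point of 125, and the N50 is 80.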

Next come five statistics each for length and coverage. The five statistics calculated are (in order):

  • min: minimum value (length or coverage)
  • max: maximum value (length or coverage)
  • avg: average value (length or coverage)
  • median: median value (length or coverage)
  • std: standard deviation (length or coverage)
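
The five statistics listed above can be sketched in plain Python as below. This is only an illustration: the function name `five_stats` is hypothetical, and the population standard deviation (dividing by n) is an assumption, since the README does not say whether the script uses the population or the sample formula.

```python
import math


def five_stats(values):
    """Return (min, max, avg, median, std) for a non-empty list of numbers.

    std is the population standard deviation (assumption, see lead-in).
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(ordered) / float(n)
    mid = n // 2
    if n % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0
    variance = sum((v - mean) ** 2 for v in ordered) / float(n)
    return ordered[0], ordered[-1], mean, median, math.sqrt(variance)
```

The same function would be applied once to the contig lengths and once to the coverages of a scaffolds file.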

##Issues

Please report problems here: https://github.com/karinlag/spades_stats/issues

##License

Please read the LICENSE file included in the repository.