degenotate

Annotate degeneracy of sites in coding regions of a genome

Authors

Gregg Thomas and Timothy Sackton

About

degenotate takes as input either a genome FASTA file and a corresponding annotation file (GFF or GTF) OR file or directory of files that contain coding sequences in FASTA format and outputs a bed-like file that contains the degeneracy score (0-, 2-, 3-, or 4-fold) of every coding site.

If given a corresponding VCF file with specified outgroup samples, degenotate can also count synonymous and non-synonymous polymorphisms and fixed differences for use in MK tests.

The program also offers coding sequence extraction from the input genome and (coming soon) extraction of sequences by degeneracy (e.g. extract only the 4-fold degenerate sites).

Warning: This is an early alpha release at the moment, and there may still be uncaught bugs in the calculations. Please report possible errors, and be cautious about using the results of this software for publications just yet.

Installation

Install

Simply download the program by cloning this repo and run it as python degenotate.py. You may want to add the degenotate folder to your $PATH variable for ease of use.

Dependencies

degenotate is a standalone program for its core function of annotating degeneracy on a site-by-site basis.

The main dependency is Python 3+

The VCF functionality (-v) specifically requires Python version 3.10+ for the itertools.pairwise() function, as well as a couple of external packages.

pysam is used for efficent reading of VCF files for MK test site counts. pysam can be easily installed with conda, but if you don't use the VCF options the program should run fine without it.
NetworkX is used to easily trace the effect of mutations on different codons. NetworkX can also be installed with conda. Again if you don't use the VCF options the program should run without it.

We facilitate the installation of these dependencies by providing a pre-made conda environment, environment.yml. To create this environment, run:

conda env create -f environment.yml

And then:

conda activate degenotate

to activate the environment.

Usage

Annotate degeneracy of coding sites from a genome:

python degenotate.py -a [annotation file] -g [genome fasta file] -o [output directory]

Annotate degeneracy of coding sites from a directory of individual coding sequences in FASTA format:

python degenotate.py -s [directory containing fasta files] -o [output directory]

Annotate degeneracy of coding sites from a genome and output synonymous and non-synonomous polymorphisms and fixed differences for MK tests:

python degenotate.py -a [annotation file] -g [genome fasta file] -v [vcf file] -u [file containin outgroup samples] -o [output directory]

Extract coding sequences from genome:

python degenotate.py -a [annotation file] -g [genome fasta file] -c [output file for CDS sequences]

Output

How degenotate classifies degeneracy

Fold	Description
0	non-degenerate; any mutation will change the amino acid
2	two nucleotides at the position code the same AA, so 1 of the three possible mutations will be synonymous and 2 will be non-synonymous
3	three nucleotides at the position code for the same AA, so 2 of the three possible mutations will be synonymous and 1 will be non-synonymous
4	four nucleotides at the position code for the same AA, so all 3 possible mutations are synonymous

Degeneracy per site (bed file)

Default name: [output directory]/degeneracy-all-sites.bed

This is the main output file for degenotate. It contains one line for every coding site in the input genome, formatted with the following columns:

Scaffold	Start pos	End pos	Transcript ID	Degeneracy code	Reference nucleotide	Reference amino acid	Mutation summary
The assembly scaffold or chromosome	The start position of the site	The end position of the site	The transcript ID	See above	The nucleotide at this site as read from the genome	The amino acid translated from the codon in that this site is in in the current transcript	See below

Mutation summary column

For non-degenerate sites (not 0-fold), the last column of the bed file contains information about how each mutation to non-degenerate nucleotides changes the amino acid. For example, if the final 4 columns of the bed file are:

2       A       E       T:D;C:D

This indicates that this site has 2-fold degeneracy, the nucleotide is A, and the codon that this nucleotide is part of in this transcript translates to E (Glutamic Acid). The final column shows the two nucleotides that change the amino acid at this position and the amino acid they change it to. Formatted as:

[nucleotide 1]:[corresponding amino acid 1];[nucleotide 2]:[corresponding amino acid 2]

For 3-fold sites, this would only have one [nucleotide]:[amino acid] entry and for 0-fold it would have three, each separated by a semi-colon.

Transcript site counts (tab-delimited)

Default name: [output directory]/transcript-counts.tsv

In addition to the information for every coding site, degenotate also outputs summaries by transcript. The columns in this file are:

transcript	gene	transcript length	0-fold	2-fold	3-fold	4-fold
Transcript ID	Gene ID	Length of transcript	Count of 0-fold degenerate sites	Count of 2-fold degenerate sites	Count of 3-fold degenerate sites	Count of 4-fold degenerate sites

MK site counts (tab delimited)

Default name: [output directory]/mk.tsv

When provided with a multi-sample VCF file and outgroup samples, degenotate counts polymorphic and fixed differences for MK tests. The output counts are put in a file with the following columns:

transcript	pN	pS	dN	dS
Transcript ID	Count of polymorphic non-synonymous sites	Count of polymorphic synonymous sites	Count of fixed non-synonymous sites	Count of fixed synonymous sites

Options

Option	Description
`-a`	A GFF or GTF file that contains the coordinates of transcripts in the provided genome file (`-g`). Only one of -`a`/`-g` OR `-s` is REQUIRED.
`-g`	A FASTA file containing a genome. `-a` must also be specified. Only one of `-a`/`-g` OR `-s` is REQUIRED.
`-s`	Either a directory containing individual, in-frame coding sequence files or a single file containing multipl in-frame coding sequences on which to calculate degeneracy. Only one of `-a`/`-g` OR `-s` is REQUIRED.
`-v`	Optional VCF file with in and outgroups to output polymorphic and fixed differences for MK tests.
`-u`	A comma separated list of sample IDs in the VCF file that make up the outgroup (e.g. 'sample1,sample2') or a file with one sample per line.
`-e`	A comma separated list of sample IDs in the VCF file to exclude (e.g. 'sample1,sample2') or a file with one sample per line.
`-o`	Desired output directory. This will be created for you if it doesn't exist. Default: `degenotate-[date]-[time]`
`-d`	degenotate assumes the chromosome IDs in the GFF file exactly match the sequence headers in the FASTA file. If this is not the case, use this to specify a character at which the FASTA headers will be trimmed.
`-c`	If a file is provided, the program will extract CDS sequences from the genome and write them to the file and exit.
`--overwrite`	Set this to overwrite existing files.
`--appendlog`	Set this to keep the old log file even if `--overwrite` is specified. New log information will instead be appended to the previous log file.
`--info`	Print some meta information about the program and exit. No other options required.
`--version`	Simply print the version and exit. Can also be called as `-version`, or `--v`.
`--quiet`	Set this flag to prevent degenotate from reporting detailed information about each step.

peterdfields/degenotate