/proovframe

frame-shift correction for long-read (meta)genomics

Primary LanguagePerlMIT LicenseMIT

proovframe: frame-shift correction for long read (meta)genomics

Gene prediction on long reads, aka PacBio and Nanopore, is often impaired by indels causing frameshift. Proovframe detects and corrects frameshifts in coding sequences from raw long reads or long-read derived assemblies.

implementation.png

Proovframe uses frameshift-aware alignments to reference proteins as guides, and conservatively restores frame-fidelity by 1/2-base deletions or insertions of “N/NN”s, and masking of premature stops (“NNN”).

Good results can already be obtained with distantly related guide proteins- successfully tested with sets with <60% amino acid identity.

It can be used as an additional polishing step on top of classic consensus-polishing approaches for assemblies.

It can be used on raw reads directly, which means it can be used on data lacking sequencing depth for consensus polishing - a common problem for a lot of rare things from environmental metagenomic samples, for example.

Usage

Requires DIAMOND v2.0.3 or newer for mapping.

# install
git clone https://github.com/thackl/proovframe
# map proteins to reads
proovframe/bin/proovframe map -a proteins.faa -o raw-seqs.tsv raw-seqs.fa
# fix frameshifts in reads
proovframe/bin/proovframe fix -o corrected-seqs.fa raw-seqs.fa raw-seqs.tsv

Citing

If you use proovframe and DIAMOND please cite:

  • Hackl T, Trigodet F, Murat Eren A, Biller SJ, Eppley JM, Luo E, et al. “proovframe: frameshift-correction for long-read (meta)genomics”, bioRxiv. 2021. p. 2021.08.23.457338. doi:10.1101/2021.08.23.457338
  • Buchfink B, Reuter K, Drost HG, “Sensitive protein alignments at tree-of-life scale using DIAMOND”, Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x